Unicode in VOTable

Wed Aug 13 11:18:56 PDT 2014

Mark, hello.

On 2014 Aug 13, at 17:05, Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:

> On Wed, 13 Aug 2014, Norman Gray wrote:
> 
>> The underlying problem is that the count in FIELD/@arraysize, and the run-length in the BINARY encoding, have slightly different interpretations; in particular, they have different units.
>> 
>> Or, put another way: the idea of the 'length of the primitive' makes sense for the non-character types in Sect. 2.1, which have only a single fixed-length encoding, but it is really meaningless for character strings.  _Except_, that is, for the ASCII encoding.
> 
> Or put another way: it makes sense for all of the data types defined
> by the VOTable document.  There is no inconsistency in VOTable as
> currently defined, since the fixed byte count per element is
> explicitly mandated for all types in sec 2.1.  It may be meaningless
> for UTF8-encoded character strings, but those are not addressed
> by the current version of VOTable.

Sorry -- I didn't mean to imply that there was an inconsistency.  When I was thinking about this, I realised that the 'problem' here is that there's a distinction between the model and the encoding which is completely ignorable in the current spec, because the encoding step is trivial for each of the current types.  But the distinction isn't ignorable as soon as one involves a variable-length encoding.
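
To make that distinction concrete, here is a minimal Python sketch that compares the character count of a string (the model) with the number of bytes it occupies as UTF-8 (one possible encoding). The function name `model_and_encoded_length` and the sample strings are illustrative only and come from nowhere in the VOTable spec.

```python
def model_and_encoded_length(text: str) -> tuple[int, int]:
    """Return (number of code points, number of bytes when encoded as UTF-8)."""
    return len(text), len(text.encode("utf-8"))


# For pure ASCII the two counts coincide, so the distinction is ignorable.
print(model_and_encoded_length("Apple"))

# Outside ASCII they diverge: the escapes below spell A-ring and o-umlaut.
print(model_and_encoded_length("\u00c5ngstr\u00f6m"))
```

For the ASCII string the two numbers agree, which is why the current spec never has to say which one it means.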

>> 'utf8' would be a bad name, since UTF-8 is a detail of an encoding, and nothing to do with the data model.  Perhaps this should be called "char/utf8" at least.
>> 
>> One problem is that this datatype should presumably be legal in an XML-encoded table, even though it's fairly unlikely to appear there:
>> 
>> <TABLE>
>>  <FIELD ID="aString" datatype="char/utf8" arraysize="10"/>
>>  <DATA><TABLEDATA>
>>   <TR>
>>    <TD>Apple</TD>
>>   </TR>
>>  </TABLEDATA></DATA>
>> </TABLE>
>> 
>> In this context, what could the FIELD/@arraysize mean?  It can't mean bytes, because this is XML, and all notion of bytes has been left behind in the lexer.
> 
> I don't see why it can't mean "bytes that would be occupied by the string
> if it were to be encoded as utf8".  The implication is that client
> applications who care about the arraysize in this context have to
> read the sequence of unicode code points from the XML document,
> encode them as UTF-8, and then work with the resulting byte array.

So a processor which parsed the above XML fragment would take the characters it has, encode that string as a UTF-8 byte array, and work with that?  That seems the wrong direction somehow, but since applications would quite possibly not be processing the encoded contents as a string, just passing them around, it does make sense.
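
A rough Python sketch of that processing step, under assumptions of my own: the helper name `encode_fixed_utf8` is hypothetical, and both the NUL padding of short values and the dropping of a partial multi-byte sequence on truncation are guesses at sensible behaviour, not something the proposal above specifies.

```python
def encode_fixed_utf8(text: str, arraysize: int) -> bytes:
    """Encode text as UTF-8 and fit it into exactly `arraysize` bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) > arraysize:
        # Assumption: truncate at the byte limit, then discard any
        # incomplete multi-byte sequence left dangling at the end.
        truncated = encoded[:arraysize]
        encoded = truncated.decode("utf-8", errors="ignore").encode("utf-8")
    # Assumption: pad short values with NUL bytes up to the declared size.
    return encoded.ljust(arraysize, b"\x00")


# The <TD>Apple</TD> cell from the fragment above, with arraysize="10".
cell = encode_fixed_utf8("Apple", 10)
print(len(cell))
```

Whatever the string holds, the result is always `arraysize` bytes long, which is the property Mark's scheme relies on.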

In that case, though, 'bytes/utf8' might be a more intuitive name for this datatype.

>> The other problem is that this shares the problem at (**) above.  If the char/utf8 is part of a fixed-size array of characters (ie FIELD/@arraysize is not "*"), then the encoded value will occupy a variable number of bytes, but has no length prefix, and so can't be skipped as desired.
> 
> Under the scheme I've suggested above it occupies a fixed number of
> bytes, so this problem doesn't arise.

Indeed.

> If a fixed-length column has a variable encoding length in bytes you
> lose some of the benefits - you need to do some reads to calculate
> the offset of subsequent columns and maybe rows.

Ah, yes -- _that's_ the point I hadn't remembered.  The issue is not about knowing how many bytes to skip in a scan, but about making each 'row' the same number of bytes long so you can calculate offsets.  Got it.
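
As a sketch of why that matters: when every column has a fixed encoded width, the byte position of any cell follows from arithmetic alone, with no reads. The column widths and the function names `column_offsets` and `cell_position` below are made up for illustration.

```python
from itertools import accumulate


def column_offsets(widths: list[int]) -> list[int]:
    """Byte offset of each column within a row of fixed-width columns."""
    return [0, *accumulate(widths[:-1])]


def cell_position(row: int, col: int, widths: list[int]) -> int:
    """Byte offset of a cell from the start of the data, without scanning."""
    row_length = sum(widths)  # every row has the same length in bytes
    return row * row_length + column_offsets(widths)[col]


# Hypothetical table: a 4-byte int, a 10-byte char/utf8 field, an 8-byte double.
widths = [4, 10, 8]
print(column_offsets(widths))
print(cell_position(3, 1, widths))
```

If the middle column instead had a variable encoded length, `row_length` would no longer be a constant and each row would have to be read to find the next.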

> Use of a type in which the run length and arraysize are both utf8 byte
> counts allows columns (and hence possibly rows) with fixed-length
> encodings.  It also keeps the equivalence between the arraysize and the
> run length counts.

It does indeed.

All the best,

Norman

-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK