Unicode in VOTable

Mark Taylor m.b.taylor at bristol.ac.uk
Wed Aug 13 09:05:58 PDT 2014


On Wed, 13 Aug 2014, Norman Gray wrote:

> On 2014 Aug 12, at 13:36, Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:
> 
> > That does present a complication, since the text of the standard
> > everywhere refers to these length declarations as array sizes
> > (element counts) not byte counts, so for instance a 3-element
> > array of 32-bit integers is declared with a size of 3 not 12.
> > Also, client code wanting to write or read VOTables with
> > char-array columns may be using strings with fixed character
> > counts rather than fixed UTF-8 encoding sizes
> > (e.g. non-ASCII CHAR(x) columns in a database),
> > so that overflows/truncations might result when some characters
> > expand to multiple bytes.  This issue might not be a showstopper,
> > but it certainly requires careful thought.
> 
> The underlying problem is that the count in FIELD/@arraysize, and the run-length in the BINARY encoding, have slightly different interpretations; in particular, they have different units.
> 
> Or, put another way: the idea of the 'length of the primitive' makes sense for the non-character types in Sect. 2.1, which have only a single fixed-length encoding, but it is really meaningless for  character strings.  _Except_, that is, for the ASCII encoding.

Or put another way: it makes sense for all of the data types defined
by the VOTable document.  There is no inconsistency in VOTable as
currently defined, since the fixed byte count per element is
explicitly mandated for all types in sec 2.1.  It may be meaningless
for UTF-8-encoded character strings, but those are not addressed
by the current version of VOTable.
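
To make that concrete: with a fixed byte count per element, the
encoded size of a cell is just the declared element count times the
element size.  A minimal Python sketch (the per-type sizes below are
the sec 2.1 ones; the dictionary and helper name are mine):

    # Fixed per-element byte counts for (a subset of) the sec 2.1
    # primitive types, keyed by the VOTable datatype name.
    BYTES_PER_ELEMENT = {
        "boolean": 1, "unsignedByte": 1, "char": 1,
        "short": 2, "unicodeChar": 2,
        "int": 4, "float": 4,
        "long": 8, "double": 8,
    }

    def cell_bytes(datatype, arraysize):
        """Encoded size of a fixed-size BINARY cell: element count
        (the declared arraysize) times the fixed element size."""
        return arraysize * BYTES_PER_ELEMENT[datatype]

    assert cell_bytes("int", 3) == 12    # declared as 3, occupies 12 bytes
    assert cell_bytes("char", 10) == 10  # 1 byte per char: counts coincide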

> > This suggests the addition of a new primitive datatype named (say)
> > "utf8" rather than repurposing the existing "char".  Array sizes
> > declared (by means 1 or 2 above) for fields with datatype="utf8"
> > would then indicate the number of bytes in the field, and the
> > number of characters (unicode code points) is not explicitly coded.
> 
> 'utf8' would be a bad name, since UTF-8 is a detail of an encoding, and nothing to do with the data model.  Perhaps this should be called "char/utf8" at least.
> 
> One problem is that this datatype should presumably be legal in an XML-encoded table, even though it's fairly unlikely to appear there:
> 
> <TABLE>
>   <FIELD ID="aString" datatype="char/utf8" arraysize="10"/>
>   <DATA><TABLEDATA>
>   <TR>
>    <TD>Apple</TD>
>   </TR>
>   </TABLEDATA></DATA>
> </TABLE>
> 
> In this context, what could the FIELD/@arraysize mean?  It can't mean bytes, because this is XML, and all notion of bytes has been left behind in the lexer.

I don't see why it can't mean "bytes that would be occupied by the
string if it were to be encoded as UTF-8".  The implication is that
client applications that care about the arraysize in this context
have to read the sequence of Unicode code points from the XML
document, encode them as UTF-8, and then work with the resulting
byte array.
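
Something like this, in other words (a Python sketch; the overflow
check and the NUL padding at the end are my assumptions about how a
writer might behave, not anything currently mandated):

    def utf8_cell_bytes(td_text, arraysize):
        """Turn the decoded TABLEDATA string into the byte array that
        the declared arraysize (a UTF-8 byte count) refers to."""
        raw = td_text.encode("utf-8")         # code points -> UTF-8 bytes
        if len(raw) > arraysize:
            raise ValueError("value overflows the declared arraysize")
        return raw.ljust(arraysize, b"\x00")  # pad short values out

    # "Apple" is 5 code points and 5 bytes;
    # "smörgås" is 7 code points but 9 bytes.
    assert len(utf8_cell_bytes("Apple", 10)) == 10
    assert len("smörgås".encode("utf-8")) == 9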

> One possibility would be to say that "char/utf8" must not be used other than with a BINARY-encoded TABLE, but that's getting intricate.

No, let's not do that.

> The other problem is that this shares the problem at (**) above.  If the char/utf8 is part of a fixed-size array of characters (ie FIELD/@arraysize is not "*"), then the encoded value will occupy a variable number of bytes, but has no length prefix, and so can't be skipped as desired.

Under the scheme I've suggested above, it occupies a fixed number
of bytes, so this problem doesn't arise.
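
Concretely, a reader can consume or skip such a cell knowing only
the declared arraysize; no length prefix is needed (a sketch, and
the NUL-padding convention mirrors the assumption above):

    import io

    def read_utf8_cell(stream, arraysize):
        """Read a fixed-size cell of exactly arraysize bytes and
        decode it, stripping any trailing padding."""
        raw = stream.read(arraysize)
        return raw.rstrip(b"\x00").decode("utf-8")

    def skip_utf8_cell(stream, arraysize):
        """Skipping needs no decoding at all: the encoded width is
        known in advance from the FIELD declaration."""
        stream.seek(arraysize, io.SEEK_CUR)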

> Two possible resolutions:
> 
>   1. Create a datatype "char/utf8", which is equivalent to type "char" in every sense (including the meaning of FIELD/@arraysize) except that it has a different BINARY-encoding.
> 
>   2. Leave the datatype as "char" but add a new attribute encoding="utf8".  This is ignored when the table content is XML, but indicates the encoding of any BINARY-encoded content. 
> 
> Option 2 might be ruled out on the grounds that a pre-1.4 client might read datatype='char', ignore encoding='utf8', and confuse itself -- I don't know how bad that would be.
> 
> In each case, I think the encoding in question should be UTF-8 _plus_ a run-length prefix in units of bytes, and that this prefix should be present for fixed- and variable-length strings.

If a fixed-length column has a variable encoding length in bytes you
lose some of the benefits: you need to do extra reads to calculate
the offsets of subsequent columns and maybe rows.  If you're using
an external BINARY/BINARY2 random-access file (rather than an
inline stream), that could be a big disadvantage, though admittedly
that form of serialization does not seem to be much used.

Use of a type in which the run length and arraysize are both UTF-8
byte counts allows columns (and hence possibly rows) with
fixed-length encodings.  It also keeps the equivalence between the
arraysize and run-length counts.
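
For what it's worth, the offset arithmetic that fully fixed-length
rows buy you is the usual one; no reads are needed to locate a cell
(a sketch; the helper and its arguments are mine):

    def cell_offset(data_start, col_sizes, irow, icol):
        """Byte offset of cell (irow, icol) in a BINARY stream whose
        columns all have fixed encoded sizes."""
        row_size = sum(col_sizes)
        return data_start + irow * row_size + sum(col_sizes[:icol])

    # Three columns of 4, 8 and 10 bytes: row size 22, so cell
    # (irow=100, icol=2) starts 100*22 + 12 = 2212 bytes in.
    assert cell_offset(0, [4, 8, 10], irow=100, icol=2) == 2212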

Mark

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/

