Unicode in VOTable

Thu Aug 14 01:17:03 PDT 2014

Dear VOTable fiddlers,

On Wed, Aug 13, 2014 at 10:39:36PM +0100, Mark Taylor wrote:
> On Wed, 13 Aug 2014, Norman Gray wrote:
> > >> In this context, what could the FIELD/@arraysize mean?  It
> > >> can't mean bytes, because this is XML, and all notion of bytes
> > >> has been left behind in the lexer.
> > > 
> > > I don't see why it can't mean "bytes that would be occupied by the string
> > > if it were to be encoded as utf8".  The implication is that client
> > > applications who care about the arraysize in this context have to
> > > read the sequence of unicode code points from the XML document,
> > > encode them as UTF-8, and then work with the resulting byte array.

So -- although it sounds horrible, I like it, except for the
"char/utf-8" or "utf-8" or whatever type name.

> One important case when you would have to do that is if you have
> a multi-dimensional character array, e.g.
> 
>    <FIELD datatype="utf8" arraysize="5*6"/>
>    <DATA><TABLEDATA>
>    <TR>
>      <TD>AlphaBeta.Gamma&#x394;...E....Zeta.</TD>
>      <!--12345123451234512     3451234512345-->
>    </TR>
>    </TABLEDATA></DATA>
> 
> You'd need to convert the string to UTF8 and step through the bytes,
> not the characters, in steps of the fastest-varying array dimension.
> With the existing "char" type you'd need to count the characters
> instead (and the cell value in the above example would have an
> extra character in it).

Yes, that is one of the ugly consequences.  I'd be prepared to live
with it.  I wonder how many VOTable implementations got nd-arrays
right anyway, in particular when variable-length arrays come into
play...
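
For concreteness, here is a minimal Python sketch of the byte-wise
stepping Mark describes, applied to his example cell with the entity
already resolved by the XML parser.  The function name
split_fixed_width is made up for illustration and comes from no
VOTable library; the sketch assumes that no character straddles a
boundary (otherwise the strict decode raises UnicodeDecodeError).

```python
def split_fixed_width(cell: str, width: int) -> list[str]:
    """Split a cell into strings of `width` UTF-8 bytes each.

    Hypothetical helper: encode to UTF-8, step through the bytes in
    units of the fastest-varying dimension, decode each piece again.
    """
    data = cell.encode("utf-8")
    return [data[i:i + width].decode("utf-8")
            for i in range(0, len(data), width)]


# Mark's example cell; &#x394; has become U+0394 (2 bytes in UTF-8).
cell = "AlphaBeta.Gamma\u0394...E....Zeta."

# Stepping by bytes, as the proposed utf8 type would require.
by_bytes = split_fixed_width(cell, 5)

# Stepping by characters, as with the existing char type.
by_chars = [cell[i:i + 5] for i in range(0, len(cell), 5)]

print(len(cell), len(cell.encode("utf-8")))  # 29 characters, 30 bytes
print(by_bytes)
print(by_chars)
```

Stepping by bytes yields six strings of five bytes each, the fourth
being the four-character string that starts with the Delta.  Stepping
by characters drifts by one from the fourth element on and leaves the
last element a character short.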

> I'm not necessarily pushing this as a good idea.  But I think it
> could work without too much complication, and my current feeling
> is that other ways to get UTF-8 encoded unicode into VOTable
> are more painful.

Now, if we go this way: why have a new type at all?  I'd maintain that
no existing valid VOTable would break if we just said something
essentially like:

  VOTable considers char values to be byte streams that can be decoded
  from UTF-8 for presentation purposes.  TABLEDATA encoding is
  presentation.  arraysize always refers to the length of the byte
  stream, never to the length of any sequence of Unicode code points
  decodable from the byte stream.

And then we'd have to go on to the ghastly array considerations ("To
decode multidimensional arrays coming from TABLEDATA-serialised
tables, first create a byte stream by encoding as canonical UTF-8 and
then...").

The worst that'd happen is that char[]s containing non-ASCII
characters would be garbled.  But then the authors of such VOTables
didn't have any right to expect any specific result anyway, given that
char so far is ASCII exclusively.
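
The compatibility argument rests on ASCII strings having the same
length in characters and in UTF-8 bytes.  A small Python sketch of
that, with illustrative helper names (utf8_length, length_unchanged)
that are not part of any VOTable library:

```python
def utf8_length(value: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


def length_unchanged(value: str) -> bool:
    """True if counting bytes gives the same result as counting characters."""
    return utf8_length(value) == len(value)


ascii_cell = "Zeta."          # what a valid char cell contains today
non_ascii_cell = "\u0394..."  # Delta followed by three dots

print(length_unchanged(ascii_cell))      # ASCII: one byte per character
print(length_unchanged(non_ascii_cell))  # Delta takes two bytes in UTF-8
```

So for pure-ASCII cells the reading of arraysize does not change; only
cells that already contain non-ASCII characters see a different length.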

Cheers,

          Markus