Unicode in VOTable

Wed Aug 13 14:39:36 PDT 2014

On Wed, 13 Aug 2014, Norman Gray wrote:

> >> One problem is that this datatype should presumably be legal in an XML-encoded table, even though it's fairly unlikely to appear there:
> >> 
> >> <TABLE>
> >>  <FIELD ID="aString" datatype="char/utf8" arraysize="10"/>
> >>  <DATA><TABLEDATA>
> >>  <TR>
> >>   <TD>Apple</TD> 
> >> </TR></TABLE>
> >> 
> >> In this context, what could the FIELD/@arraysize mean?  It can't mean bytes, because this is XML, and all notion of bytes has been left behind in the lexer.
> > 
> > I don't see why it can't mean "bytes that would be occupied by the string
> > if it were to be encoded as utf8".  The implication is that client
> > applications who care about the arraysize in this context have to
> > read the sequence of unicode code points from the XML document,
> > encode them as UTF-8, and then work with the resulting byte array.
> 
> So a processor which parsed the above XML fragment would take the characters it has, encode that string as a byte array, and work with that?  That seems the wrong direction somehow, but since applications possibly wouldn't be processing the thus-encoded contents as a string, but just passing them around, then this makes sense.

It doesn't have to do that explicit translation to bytes.  In
most cases (e.g. presenting the string to the user in some
Unicode-aware way) it would just treat the value as the string
it has read from the XML.  But if for some reason it has to
understand the (fixed or variable) array size/string length,
then it needs to encode the string as UTF-8 and count the bytes.
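
As a minimal sketch of that counting step (Python, purely for
illustration; the helper name utf8_length is made up here and is not
part of any VOTable library), the byte count is what you get by
encoding the already-parsed string and measuring the result:

```python
def utf8_length(text: str) -> int:
    """Number of bytes the string would occupy if encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Pure ASCII: character count and UTF-8 byte count agree.
apple = "Apple"
print(len(apple), utf8_length(apple))

# GREEK CAPITAL LETTER DELTA is one character but two UTF-8 bytes.
delta = "\u0394"
print(len(delta), utf8_length(delta))
```

For ASCII-only content the two counts coincide, so the distinction
only matters once non-ASCII characters appear in the cell.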

One important case when you would have to do that is if you have
a multi-dimensional character array, e.g.

   <FIELD datatype="utf8" arraysize="5x6"/>
   <DATA><TABLEDATA>
   <TR>
     <TD>AlphaBeta.Gamma&#x394;...E....Zeta.</TD>
     <!--12345123451234512      3451234512345-->
   </TR>
   </TABLEDATA></DATA>

You'd need to convert the string to UTF-8 and step through the bytes,
not the characters, in steps of the fastest-varying array dimension
(5 bytes here; the Delta occupies two of them).
With the existing "char" type you'd need to count the characters
instead (and the cell value in the above example would need an
extra character in it, since it has only 29 characters).
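
A rough sketch of that stepping, again in Python and only as an
illustration under the assumptions of this thread (the function name
split_utf8_cell is hypothetical, and the "utf8" datatype is the
proposal being discussed, not an existing VOTable type):

```python
def split_utf8_cell(text: str, width: int) -> list[str]:
    """Split a cell value into elements of `width` UTF-8 bytes each.

    `width` is the fastest-varying array dimension, counted in bytes.
    """
    data = text.encode("utf-8")
    if len(data) % width != 0:
        raise ValueError(
            f"{len(data)} bytes is not a multiple of element width {width}"
        )
    # Step through the bytes, not the characters.  decode() raises
    # UnicodeDecodeError if a multi-byte character straddles a boundary.
    return [
        data[i:i + width].decode("utf-8")
        for i in range(0, len(data), width)
    ]


cell = "AlphaBeta.Gamma\u0394...E....Zeta."
elements = split_utf8_cell(cell, 5)
print(elements)
print(len(cell), len(cell.encode("utf-8")))
```

The cell holds 29 characters but 30 bytes, so it divides evenly into
six 5-byte elements, whereas counting characters in steps of five
would leave the last element one character short.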

I'm not necessarily pushing this as a good idea.  But I think it
could work without too much complication, and my current feeling
is that other ways to get UTF-8 encoded unicode into VOTable
are more painful.

Mark

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/