Unicode in VOTable

Wed Oct 15 16:53:19 CEST 2014

Dear Apps,

I want to follow up on the issue of Unicode and VOTable on this list
with a couple of comments for those who were, and those who were not,
at the recent interop in Banff.

At the interop, I attempted to summarise the problem and possible
solutions, with a presentation you can see here:

   http://wiki.ivoa.net/internal/IVOA/InterOpOct2014Applications/vot-unicode.pdf

This ends with the following three proposals for the way forward
as regards representing character array length:

   P1: Define both arraysize and binary run-length as "number of code points"
   P2: Define arraysize as "number of code points" and
       binary run-length as "number of bytes"
   P3: Define both arraysize and binary run-length as "number of bytes
       the characters would take in UTF-8"

(for more detail, see the PDF).
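
To make the difference between the three definitions concrete, here is
a minimal Python sketch (an illustration only, not taken from the PDF).
The function name lengths_under_proposals is hypothetical, and for P2
it assumes the bytes of the binary run-length are counted in UTF-8,
which the summary above does not state.

```python
def lengths_under_proposals(text):
    """Return (arraysize, binary run-length) for P1, P2 and P3.

    Assumption: wherever "number of bytes" is meant, the bytes are
    those of the UTF-8 encoding of the text.
    """
    n_code_points = len(text)               # a Python str counts code points
    n_utf8_bytes = len(text.encode("utf-8"))
    return {
        "P1": (n_code_points, n_code_points),
        "P2": (n_code_points, n_utf8_bytes),
        "P3": (n_utf8_bytes, n_utf8_bytes),
    }


# "Angstrom" spelled with U+00C5 and U+00F6: 8 code points, 10 UTF-8 bytes.
sample = "\u00c5ngstr\u00f6m"
result = lengths_under_proposals(sample)

# For pure ASCII text all three proposals give the same numbers.
ascii_result = lengths_under_proposals("Vega")
```

The proposals only diverge when a string contains non-ASCII characters;
for 7-bit ASCII content every one of them yields the same lengths.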

An important point I failed to make during the presentation is that
adopting P1 or P2 would require changes to the code that handles
the existing char type in existing parsers, since either would
change the actual binary serialization from what we have in
VOTable 1.3.  Adopting P3, if we are redefining the char datatype
rather than introducing a new one, would require changes to
the language of the standard, but not to the code of most(?)
VOTable parsers (the exceptions would be parsers that currently
really are doing 7-bit ASCII processing, either because they are
implemented in non-Unicode-friendly languages or because they are
explicitly checking for non-ASCII characters).
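
As a rough sketch of why P3 leaves most parser code alone, consider a
hypothetical reader for a fixed-length char field (read_char_field is
an invented name, not code from any real VOTable library).  It reads
arraysize bytes and decodes them; treating everything after the first
NUL byte as padding is a simplifying assumption.  Under P3 the byte
count is unchanged, so only the decoding step matters.

```python
import io


def read_char_field(stream, arraysize, encoding="utf-8"):
    """Read a fixed-length char field of `arraysize` bytes and decode it."""
    raw = stream.read(arraysize)
    if len(raw) != arraysize:
        raise ValueError("truncated char field")
    # Assumption: a NUL byte ends the string; the rest is padding.
    return raw.split(b"\x00", 1)[0].decode(encoding)


# A 12-byte field holding a 10-byte UTF-8 string plus NUL padding.
field = "\u00c5ngstr\u00f6m".encode("utf-8").ljust(12, b"\x00")

# A parser that decodes as UTF-8 needs no change under P3.
decoded = read_char_field(io.BytesIO(field), 12)

# A parser that insists on 7-bit ASCII rejects the same bytes.
try:
    read_char_field(io.BytesIO(field), 12, encoding="ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Pure ASCII content reads identically either way.
ascii_ok = read_char_field(io.BytesIO(b"Vega".ljust(12, b"\x00")), 12, "ascii")
```

The strict-ASCII case corresponds to the exception noted above: such
parsers would need changing, while byte-oriented ones would not.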

In the light of that, I favour P3, at least if we go the way of
repurposing char rather than introducing a new Unicode-friendly
datatype.

Mark

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/