Unicode in VOTable

Rob Seaman seaman at noao.edu
Sat Mar 8 05:21:44 PST 2014


On Mar 7, 2014, at 11:26 PM, Walter Landry <wlandry at caltech.edu> wrote:

> Mark Taylor <M.B.Taylor at bristol.ac.uk> wrote:
> 
>> In practice I think the answer is to use the unicodeChar type for
>> unicode text.  If it's in the body of the XML (e.g. a TD element
>> or PARAM value) then encode it however the VOTable itself is encoded.
>> If it's in BINARY or FITS form, I'd encode it using UTF-16.
>> Since UTF-16 is the same as UCS-2 for code points from 0x0 to 0xFFFF
>> it should be OK for not-too-weird characters.  It may well
>> work outside that range as well, since it's likely that processing
>> software will be using UTF-16 rather than UCS-2 - as far as I can
>> tell that's more or less what Java does (whose char primitive type
>> is 16 bits).  If you try that and run into interesting problems,
>> it may be worth reporting them back here.
> 
> It sounds like you are saying that UTF-16 is the de facto standard
> anyway.  If that is so, then we should update the standard to reflect
> reality.

It sounds more like Unicode continues to evolve, and it may be premature to pick a single horse to bet on.

> In that vein, it would also be nice if ASCII was replaced with UTF-8.
> For some systems, UTF-8 is more natural than UTF-16.  There are also no
> byte order issues.
> 
>> In any case, this is something that should be revisited if there is
>> a future revision of the VOTable standard.

Whatever the "natural" encoding for a VOTable, ASCII is unlikely to vanish from the world and should be preserved as an option for backwards compatibility.
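
To make the two encoding properties from the quoted text concrete, here is a minimal sketch in Python. The sample string is only an illustration and nothing here is taken from the VOTable standard. It checks that pure ASCII text is byte-for-byte identical when encoded as UTF-8, and that the same text has two different UTF-16 serializations depending on byte order.

```python
# Illustrative sample only; any pure-ASCII string would behave the same way.
text = "VOTable"

# ASCII is a subset of UTF-8, so existing ASCII data is already valid UTF-8.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# UTF-16 uses 16-bit code units, so the byte order has to be agreed on
# (or signalled with a byte order mark).
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")
order_matters = utf16_be != utf16_le

print("ASCII == UTF-8:", same_bytes)
print("UTF-16 BE:", utf16_be.hex())
print("UTF-16 LE:", utf16_le.hex())
```

In that sense keeping ASCII as an option and allowing UTF-8 do not conflict for data that is already ASCII.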

More fundamentally, the actual representation of strings, as well as of other data types, should be presumed to have been transformed from whatever the standards specify, whether by compression or by other storage or workflow transformations.  See, for instance:

	http://www.unicode.org/reports/tr6/

or various references on XML compression options.

One could imagine a representation for VOTable loosely similar to FITS tile-compression for tables:

	http://arxiv.org/abs/1201.1340

Which is to say that the "natural state" of a string, however its characters are encoded, is to be variable-length in its representation, both for the string as a whole and for individual characters (or perhaps at the token level).  However a string is encoded or delimited for the task at hand, the number of characters and the number of bytes needed to represent them should be assumed to be unrelated.
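
As a rough illustration of that last point, the following Python sketch counts code points and encoded bytes for a few strings. The sample strings and the helper name `sizes` are my own choices for illustration, not anything defined by VOTable or FITS.

```python
# Hypothetical sample strings, written with escapes so the code points are unambiguous.
samples = {
    "ascii": "NGC 1068",
    "latin": "\u00c5ngstr\u00f6m",
    "greek": "\u03b1 Centauri",
    "astral": "\U0001D6FC",  # a single code point above U+FFFF
}

def sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-be")),  # explicit byte order, so no BOM is added
    )

for name, text in samples.items():
    print(name, sizes(text))
```

The last sample is one code point but takes two 16-bit code units (a surrogate pair) in UTF-16, which is exactly the case where UTF-16 and UCS-2 part ways. Fixed-width assumptions about either characters or bytes fail on it.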

Rob Seaman
NOAO


