Unicode in VOTable

Walter Landry wlandry at caltech.edu
Fri Mar 7 22:26:16 PST 2014


Mark Taylor <M.B.Taylor at bristol.ac.uk> wrote:
> Walter,
> 
> you're right.  I'm not a unicode expert, but as I understand it it
> used to be true (pre-Unicode 2) that you could use UCS-2 to encode
> Unicode in a fixed 16-bits per character, but that doesn't cover
> all the unicode 2 code points so UCS-2 is no longer a fully capable
> Unicode encoding.
> 
> In practice I think the answer is to use the unicodeChar type for
> unicode text.  If it's in the body of the XML (e.g. a TD element
> or PARAM value) then encode it however the VOTable itself is encoded.
> If it's in BINARY or FITS form, I'd encode it using UTF-16.
> Since UTF-16 is the same as UCS-2 for code points from 0x0 to 0xFFFF
> it should be OK for not-too-weird characters.  It may well
> work outside that range as well, since it's likely that processing
> software will be using UTF-16 rather than UCS-2 - as far as I can
> tell that's more or less what Java does (whose char primitive type
> is 16 bits).  If you try that and run into interesting problems,
> it may be worth reporting them back here.

It sounds like you are saying that UTF-16 is the de facto standard
anyway.  If so, we should update the standard to reflect reality.
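
To make the UCS-2 vs. UTF-16 difference concrete, here is a rough
Python sketch (my own illustration, not anything taken from the
VOTable standard; the helper name utf16_units is made up).  It shows
that a code point up to 0xFFFF occupies a single 16-bit unit, exactly
as it would in UCS-2, while a code point beyond that range becomes a
surrogate pair that a strict UCS-2 reader would not understand.

```python
def utf16_units(ch):
    """Return the 16-bit code units of ch when encoded as big-endian UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp = "\u00e9"         # U+00E9, inside the 0x0 to 0xFFFF range
astral = "\U0001D11E"  # U+1D11E (musical G clef), outside that range

# One unit for the first character, two (a surrogate pair) for the second.
print([hex(u) for u in utf16_units(bmp)])
print([hex(u) for u in utf16_units(astral)])
```

So a fixed "16 bits per character" assumption only holds for the
first kind of character.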

In that vein, it would also be nice if ASCII were replaced with UTF-8.
For some systems, UTF-8 is more natural than UTF-16.  UTF-8 also has
no byte order issues.
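
A small Python sketch of the byte order point (again just my own
illustration, with made-up variable names): the same character has
two possible UTF-16 byte sequences depending on endianness, but only
one UTF-8 sequence, and pure ASCII text is already valid UTF-8.

```python
text = "\u00e9"  # U+00E9

# UTF-16 depends on byte order: the two serializations differ.
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

# UTF-8 is a byte stream with a single possible serialization.
utf8 = text.encode("utf-8")

# ASCII text encodes to identical bytes in ASCII and UTF-8.
ascii_bytes = "VOTable".encode("ascii")
utf8_bytes = "VOTable".encode("utf-8")

print(utf16_be, utf16_le, utf8)
print(ascii_bytes == utf8_bytes)
```

That last property is why swapping ASCII for UTF-8 would not change
the bytes of existing ASCII-only data.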

> In any case, this is something that should be revisited if there is
> a future revision of the VOTable standard.
> 
> Mark
> 
> (PS if anybody with more of a clue about unicode than me wants to
> add to or contradict any of the above, please do!)

This is a good read about Unicode.

  http://www.joelonsoftware.com/articles/Unicode.html

Cheers,
Walter Landry

