Unicode in VOTable

Mark Taylor M.B.Taylor at bristol.ac.uk
Fri Mar 7 16:59:50 PST 2014


Walter,

you're right.  I'm not a Unicode expert, but as I understand it,
before Unicode 2 you could use UCS-2 to encode Unicode in a fixed
16 bits per character.  UCS-2 can't represent code points above
0xFFFF, though, so it is no longer a fully capable Unicode encoding.
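
To make the 16-bit limit concrete, here is a minimal Python sketch that
lists the 16-bit code units UTF-16 uses for a character.  The helper
name utf16_units is made up for illustration, and the two sample
characters (U+00E9 and U+1D11E) are arbitrary choices, not anything
taken from the VOTable standard.

```python
def utf16_units(ch: str) -> list[int]:
    """Return the 16-bit code units that encode ch in UTF-16 (big-endian)."""
    data = ch.encode("utf-16-be")
    # Split the byte string into 2-byte units and read each as an integer.
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+00E9 lies inside the 16-bit range: one code unit, identical in UCS-2.
bmp_units = utf16_units("\u00e9")

# U+1D11E lies above 0xFFFF: UTF-16 needs two code units (a surrogate
# pair), and UCS-2 has no way to represent it at all.
astral_units = utf16_units("\U0001D11E")

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])
```

The second character comes out as two units, which is exactly the case
a fixed 2-bytes-per-character scheme cannot cover.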

In practice I think the answer is to use the unicodeChar type for
Unicode text.  If it's in the body of the XML (e.g. a TD element
or PARAM value) then encode it however the VOTable document itself is
encoded.  If it's in BINARY or FITS form, I'd encode it using UTF-16.
Since UTF-16 is the same as UCS-2 for code points from 0x0 to 0xFFFF
(apart from the surrogate range 0xD800-0xDFFF, which is not used for
characters), it should be OK for not-too-weird characters.  It may
well work outside that range as well, since it's likely that
processing software will be using UTF-16 rather than UCS-2; as far as
I can tell that's more or less what Java does (its char primitive
type is 16 bits).  If you try that and run into interesting problems,
it may be worth reporting them back here.
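
As a rough sketch of what that might look like for a BINARY-style
payload, the Python below encodes a string as UTF-16 and reports the
number of 16-bit units it occupies.  The function names
encode_unicodechar and decode_unicodechar are hypothetical, and the
choices of big-endian byte order, no byte-order mark, and counting
array length in 16-bit units rather than characters are my
assumptions, not something the VOTable 1.3 text spells out for
characters above 0xFFFF.

```python
def encode_unicodechar(text: str) -> tuple[int, bytes]:
    """Encode text as UTF-16 and return (number of 16-bit units, bytes).

    Assumptions (not confirmed by the standard): big-endian byte order,
    no byte-order mark, length counted in 16-bit units.
    """
    data = text.encode("utf-16-be")  # the "-be" codec writes no BOM
    return len(data) // 2, data


def decode_unicodechar(data: bytes) -> str:
    """Decode bytes produced by encode_unicodechar back to a string."""
    return data.decode("utf-16-be")


# One ordinary character plus one character above 0xFFFF.
sample = "a\U0001D11E"
count, payload = encode_unicodechar(sample)

print(count)          # 16-bit units, not characters
print(payload.hex())
print(decode_unicodechar(payload) == sample)
```

Note that the two-character sample occupies three 16-bit units, so a
reader that assumes strict UCS-2 would see three "characters"; a
UTF-16-aware reader should round-trip it correctly.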

In any case, this is something that should be revisited if there is
a future revision of the VOTable standard.

Mark

(PS if anybody with more of a clue about Unicode than me wants to
add to or contradict any of the above, please do!)

On Fri, 7 Mar 2014, Walter Landry wrote:

> Hello Everyone,
> 
> I tried sending this to votable at ivoa.net, but that mailing list seems
> unattended and the message never went through.  In any case, in the
> VOTable Format Definition Version 1.3, there are the statements
> 
>    VOTables support two kinds of characters: ASCII 1-byte characters
>    and Unicode (UCS-2) 2-byte characters.  Unicode is a way to
>    represent characters that is an alternative to ASCII. It uses two
>    bytes per character instead of one, it is strongly supported by XML
>    tools, and it can handle a large variety of international
>    alphabets.
> 
> This is not actually true.  Unicode, in general, requires 4 bytes per
> character.  There are encodings, such as UTF-16, which often only
> require 2 bytes, but even UTF-16 sometimes requires more than 2 bytes
> to express a character.
> 
> So, how would I express a generic Unicode character in a VOTable?  Do
> I encode it as UTF-8 and disguise it as ASCII?
> 
> Thanks,
> Walter Landry
> 

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/

