Unicode in VOTable

Tue Aug 19 04:07:58 PDT 2014

Mark, hello.

On 2014 Aug 18, at 22:10, Mark Taylor <M.B.Taylor at bristol.ac.uk> wrote:

> I reckon that would make more sense - allowing UTF-16 alongside
> UTF-8 doesn't seem to buy you much, so usage in a new VOTable version
> that provided a well-defined way to use UTF-8 should be deprecated.

I think that's sensible.

Just for completeness, I'll mention that about the only real rationale for preferring UTF-16 to UTF-8 is if a significant fraction of the characters you're encoding would take three bytes in UTF-8, rather than two in UTF-16.  That's everything from U+0800 to U+FFFF (above that, both encodings need four bytes).  If a lot of the text were in CJK or in other East Asian scripts, then there would be a space benefit to UTF-16.  Also, two bytes of UTF-16 _might_ be faster to decode than a character encoded in two bytes of UTF-8 (e.g. Cyrillic or Greek).
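
To make the byte counts concrete, here is a rough Python sketch. The helper name encoded_sizes is made up for this illustration, and the sample characters are arbitrary picks from Latin, Greek, Cyrillic, the two-byte/three-byte boundary, CJK, and beyond the BMP.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" is used so that no byte-order mark is counted;
    # plain "utf-16" would prepend a 2-byte BOM to every result.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# Arbitrary sample characters, chosen only to show where the sizes change.
samples = ["A", "\u03b1", "\u0416", "\u07ff", "\u0800", "\u4e2d", "\U0001f600"]

for ch in samples:
    u8, u16 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

UTF-16 only comes out ahead on the U+0800 and U+4E2D rows; for ASCII it is twice the size of UTF-8, and for Greek and Cyrillic it is a tie.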

Since both cases are presumably unlikely in VOTables, and no big deal if they do occur, there seems no real case for permitting UTF-16 at all.

> So one possibility
> would be to keep on unicodeChar as a backwardly compatible way of
> writing BMP characters, and just disallow anything that would
> require UTF-16 surrogates.  That may be too messy to be worth
> it though.

That would require careful language in the spec, which people _might_ not read (ahem), and extra application code to do the check, so yes -- way too messy to be worth it, I think.
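
For reference, the kind of check in question would look something like the following Python sketch. The name needs_surrogates is hypothetical, not anything from the VOTable spec or an existing library, and this is only one way an application might do it.

```python
def needs_surrogates(text):
    """Return True if any character in text lies outside the BMP.

    Characters above U+FFFF cannot be written as a single 16-bit
    UTF-16 code unit, so they would need a surrogate pair.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


# Hypothetical usage: refuse a value before writing it as unicodeChar.
value = "M\u2609 = 1.989e30 kg"
if needs_surrogates(value):
    raise ValueError("value contains characters outside the BMP")
```

Each application that reads or writes such a field would need its own equivalent of this, which is the extra burden I mean.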

See you,

Norman

-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK