Unicode in VOTable
Norman Gray
norman at astro.gla.ac.uk
Tue Aug 19 04:07:58 PDT 2014
Mark, hello.
On 2014 Aug 18, at 22:10, Mark Taylor <M.B.Taylor at bristol.ac.uk> wrote:
> I reckon that would make more sense - allowing UTF-16 alongside
> UTF-8 doesn't seem to buy you much, so usage in a new VOTable version
> that provided a well-defined way to use UTF-8 should be deprecated.
I think that's sensible.
Just for completeness, I'll mention that about the only real rationale for preferring UTF-16 to UTF-8 is if a significant fraction of the characters you're encoding would be encoded in three bytes in UTF-8, rather than two in UTF-16. That's everything from U+0800 to U+FFFF (above that, both encodings need four bytes). If a lot of the text were in CJK or other East Asian scripts, then there would be a space benefit to UTF-16. Also, two bytes of UTF-16 _might_ be faster to decode than a character encoded in two bytes of UTF-8 (e.g. Cyrillic or Greek).
Since both cases are presumably unlikely in VOTables, and no big deal if they do occur, there seems no real case for permitting UTF-16 at all.
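As a rough illustration of the size trade-off described above, here is a minimal Python sketch that compares encoded lengths. The function name `encoded_sizes` and the sample strings are made up for this example; the big-endian codec is used only so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


# Hypothetical sample strings, one per range of interest.
samples = {
    "ASCII": "VOTable",
    "Greek": "\u03b1\u03b2\u03b3",        # alpha, beta, gamma: below U+0800
    "CJK": "\u5929\u6587\u5b66",          # U+0800..U+FFFF: three bytes each in UTF-8
    "beyond the BMP": "\U0001D6FC",       # needs four bytes in either encoding
}

for name, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{name}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Only the CJK sample comes out smaller in UTF-16; the ASCII sample doubles in size, and the other two are the same length in both encodings.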
> So one possibility
> would be to keep on unicodeChar as a backwardly compatible way of
> writing BMP characters, and just disallow anything that would
> require UTF-16 surrogates. That may be too messy to be worth
> it though.
That would require careful language in the spec, which people _might_ not read (ahem), and extra application code to do the check, so yes -- way too messy to be worth it, I think.
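For a sense of what the "extra application code" might involve, the check could look something like the following Python sketch. The function name `needs_surrogates` is hypothetical, and this is not taken from any VOTable library; it only tests whether a string contains a character outside the Basic Multilingual Plane, which is what would require a surrogate pair in UTF-16.

```python
def needs_surrogates(text: str) -> bool:
    """Return True if any character lies above U+FFFF (outside the BMP)."""
    return any(ord(ch) > 0xFFFF for ch in text)


# BMP-only text passes; a character above U+FFFF would have to be rejected.
print(needs_surrogates("\u03b1\u03b2\u03b3"))   # Greek letters, all in the BMP
print(needs_surrogates("\U0001D6FC"))           # MATHEMATICAL ITALIC SMALL ALPHA
```

Every writer and validator would need to apply a test of this kind to each unicodeChar value, which is the sort of overhead the paragraph above has in mind.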
See you,
Norman
--
Norman Gray : http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK