Unicode in VOTable

Mark Taylor M.B.Taylor at bristol.ac.uk
Fri Mar 14 15:10:33 PDT 2014


Walter,

On Fri, 14 Mar 2014, Walter Landry wrote:

> Hi Norman,
> 
> Norman Gray <norman at astro.gla.ac.uk> wrote:
> > The only place (I think) where there's any need for discussing a
> > unicode serialisation is within BINARY blobs.  I doubt there's even
> > a need for discussing it within FITS blobs, since their internal
> > encoding is already specified elsewhere.
> 
> I am sorry if I gave the impression otherwise, but for this discussion
> I have always only been interested in BINARY2 blobs.  In particular, I
> want to know how to read and write Unicode characters into BINARY2
> blobs.  Is it OK to put UTF-8 into an "ASCII Character" array, or
> UTF-16 into a "Unicode Character" array?  The current standard says
> no.  Can we all agree that it should say yes?

My opinion: I do not think it's a good idea to put UTF-8 into
datatype="char" arrays as far as the existing version of VOTable goes.
Software following the letter or spirit of the current standard should
treat char arrays as having one character per array element, so a char
with the high bit set should be interpreted as a character from an extended
ASCII-like set rather than as one byte of a multi-byte UTF-8 sequence.
A future version of the standard could revisit this, though backward
compatibility might make such a change problematic.
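
To make the difference concrete, here is a minimal Python sketch of the two
readings of the same bytes. The sample string is hypothetical, and ISO-8859-1
is only my stand-in for "an extended ASCII-like set"; the standard does not
name a specific one.

```python
# Hypothetical cell value containing one non-ASCII character.
text = "caf\u00e9"  # "café"

# What a writer would emit if it put UTF-8 into a char array.
utf8_bytes = text.encode("utf-8")

# Reading 1: one character per array element (ISO-8859-1 assumed here
# as the extended ASCII-like set).
per_element = utf8_bytes.decode("latin-1")

# Reading 2: a reader that treats the whole array as a UTF-8 string.
as_utf8 = utf8_bytes.decode("utf-8")

print(len(utf8_bytes))    # array length in elements
print(repr(per_element))  # one character per byte
print(repr(as_utf8))      # multi-byte sequence collapsed to one character
```

The four-character string occupies five array elements, and the two readers
disagree about both the content and the length of the string.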

Having said that, I wouldn't be too surprised to find that sloppily
coded VOTable readers (possibly including mine, I haven't checked)
in unicode-friendly languages do not actually do that, and instead
treat such arrays as UTF-8 strings because the language's byte-array
handling naturally makes that interpretation.

Since unicodeChar is supposed to contain unicode strings, the same
reasoning doesn't apply to datatype="unicodeChar".  Using UTF-16
in unicodeChar follows the spirit and letter of the standard
in the (overwhelmingly common?) case that none of the characters
require surrogates.  If surrogate pairs are required, there is
a fair chance it will work anyway.  So if you want to put unicode
into a BINARY2 serialized VOTable, I think you should use
unicodeChar arrays with a UTF-16 or maybe UCS-2 encoding.
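
As a rough illustration of the surrogate issue, the Python sketch below counts
the 16-bit code units a string needs as UTF-16, and checks whether it fits in
UCS-2. The helper names and sample strings are hypothetical, and I am assuming
two bytes per unicodeChar element in big-endian order.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-be")) // 2


def fits_ucs2(s: str) -> bool:
    """True if no character in s needs a surrogate pair."""
    return all(ord(c) <= 0xFFFF for c in s)


bmp = "\u03b1\u03b2\u03b3"   # three Greek letters, all in the BMP
astral = "x\U0001D6FCy"      # middle character lies outside the BMP

for s in (bmp, astral):
    encoded = s.encode("utf-16-be")
    # Characters, code units (array elements), UCS-2 safe?
    print(len(s), utf16_units(s), fits_ucs2(s))
    # Decoding as UTF-16 recovers the original string in both cases.
    assert encoded.decode("utf-16-be") == s
```

In the first case the array length equals the character count, so the
one-character-per-element reading holds. In the second the array needs one
more element than there are characters, which is where a reader that assumes
UCS-2 could go wrong.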

Mark

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/

