Unicode in VOTable

Walter Landry wlandry at caltech.edu
Mon Mar 17 10:57:24 PDT 2014


Mark Taylor <M.B.Taylor at bristol.ac.uk> wrote:
> Walter,
> 
> On Fri, 14 Mar 2014, Walter Landry wrote:
> 
>> Hi Norman,
>> 
>> Norman Gray <norman at astro.gla.ac.uk> wrote:
>> > The only place (I think) where there's any need for discussing a
>> > unicode serialisation is within BINARY blobs.  I doubt there's even
>> > a need for discussing it within FITS blobs, since their internal
>> > encoding is already specified elsewhere.
>> 
>> I am sorry if I gave the impression otherwise, but for this discussion
>> I have always only been interested in BINARY2 blobs.  In particular, I
>> want to know how to read and write Unicode characters into BINARY2
>> blobs.  Is it OK to put UTF-8 into an "ASCII Character" array, or
>> UTF-16 into a "Unicode Character" array?  The current standard says
>> no.  Can we all agree that it should say yes?
> 
> My opinion: I do not think it's a good idea to put UTF-8 into
> datatype="char" arrays as far as the existing version of VOTable goes.
> Software following the letter or spirit of the current standard should
> treat char arrays as having one character per array element, so a char
> with the high bit set should be interpreted as a character from an extended
> ASCII-like set rather than as part of a UTF-8 multi-byte sequence.

The current standard says ASCII, not ISO-8859-1, Windows-1250, or JIS
X 0201.  So 8-bit extended ASCII characters in a 'char' array are
already disallowed.  Do you have examples of VOTables in the wild that
use some form of extended ASCII?
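
For concreteness, the ambiguity is easy to demonstrate; a minimal
Python sketch (the byte values are made up for illustration, not taken
from any real VOTable):

    # The same high-bit byte means different things under different
    # 8-bit encodings, and is not even a complete character in UTF-8.
    raw = b'caf\xe9'                 # a 'char' array containing byte 0xE9
    print(raw.decode('iso-8859-1'))  # 'café' if read as Latin-1
    print(raw.decode('cp1250'))      # 'café' if read as Windows-1250 too
    try:
        raw.decode('utf-8')          # 0xE9 opens a 3-byte UTF-8 sequence,
    except UnicodeDecodeError:       # so this truncated one is rejected
        print('not valid UTF-8')
    try:
        raw.decode('ascii')          # and strict ASCII rejects it outright
    except UnicodeDecodeError:
        print('not valid ASCII either')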

> It's possible that revisiting this in a future version of the standard
> might change that, though for reasons of backward compatibility that
> might be problematic.
> 
> Having said that, I wouldn't be too surprised to find that sloppily
> coded VOTable readers (possibly including mine, I haven't checked)
> in unicode-friendly languages might actually not do that, and treat
> such arrays as UTF-8 strings because the language's byte-array
> handling naturally lends itself to such interpretations.

What I would like is a revision to the standard.  It sounds like you
are agreeing with me that UTF-8 is, to some degree, existing usage.
In that case, specifying UTF-8 would be removing ambiguities and
codifying existing practice, not inventing new usage.
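
As an illustration of what codifying UTF-8 would mean on the wire,
here is a minimal Python sketch; the helper name is mine, and it
assumes BINARY2's convention of a 4-byte big-endian element count in
front of a variable-length array:

    import struct

    def pack_char_field(s):
        # Hypothetical helper: serialize a Python string into a
        # variable-length datatype="char" field for a BINARY2 stream,
        # under the proposed reading that char arrays carry UTF-8.
        payload = s.encode('utf-8')
        # The 4-byte element count then counts bytes, not characters.
        return struct.pack('>I', len(payload)) + payload

    # '\u00b5m' (micro sign + m) is two characters but three UTF-8 bytes:
    print(pack_char_field('\u00b5m').hex())  # '00000003c2b56d'

Note that the length prefix then counts bytes rather than characters;
that is exactly the kind of detail a revised standard would need to
spell out.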

> Since unicodeChar is supposed to contain unicode strings, the same
> reasoning doesn't apply to datatype="unicodeChar".  Using UTF-16
> in unicodeChar follows the spirit and letter of the standard
> in the (overwhelmingly common?) case that none of the characters
> require surrogates.  If surrogate pairs are required, there is
> a fair chance it will work anyway.  So if you want to put unicode
> into a BINARY2 serialized VOTable, I think you should use
> unicodeChar arrays with a UTF-16 or maybe UCS-2 encoding.

I can always write UTF-16 characters for my own consumption.  What I
want is to be able to require that other readers understand them as
well, in the same way that I can require readers to understand boolean
or floatComplex.
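
For comparison, what Mark describes for unicodeChar would look like
this on the wire; again a minimal sketch, under the same assumption
about the 4-byte element-count prefix:

    import struct

    def pack_unicodechar_field(s):
        # Hypothetical helper: serialize a Python string into a
        # variable-length datatype="unicodeChar" field as big-endian
        # 16-bit code units (UCS-2 per the letter of the standard,
        # UTF-16 in practice when surrogates appear).
        payload = s.encode('utf-16-be')
        # The element count is in 16-bit units; a character outside
        # the BMP therefore occupies two elements.
        return struct.pack('>I', len(payload) // 2) + payload

    print(pack_unicodechar_field('\u03b1\u03b2').hex())  # 2 chars, 2 units
    print(pack_unicodechar_field('\U0001d11e').hex())    # 1 char, 2 units
                                                         # (surrogate pair)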

What I want are revisions to the standard, resulting in, for example,
a VOTable 1.4.  The first step towards that is to get consensus here
that the revisions are a good idea.  Do you (or anyone else) agree
that these are good revisions, or do you still have some doubts?

Thanks,
Walter Landry
