Unicode in VOTable

Mon Aug 18 14:10:10 PDT 2014

On Mon, 18 Aug 2014, Walter Landry wrote:

> >> This sounds a lot like what I proposed back in March, so I like it
> >> too ;)  It would be good if we could do the same thing for unicodeChar
> >> and UTF-16.
> > 
> > Maybe.  UCS-2, though it's archaic (obsolete?) does retain the
> > assurance that the number of characters can be determined from
> > the arraysize.
> 
> I do not know if you can even create UCS-2 these days without going
> through gymnastics.  For example, Java
> 
>   http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
> 
> only guarantees ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16LE, and
> UTF-16BE.  So what is probably happening is that no one is actually
> writing UCS-2.  They are writing UTF-16 and not noticing the
> difference.

OK, on following up that reference I admit that's what I've been doing.
To defend what remains of my point: if you were writing a VOTable
library in a non-UTF-16-aware environment, that would be easier
if unicodeChar were based on UCS-2 rather than UTF-16.
But (a) probably nobody will ever write that library and (b) probably
nobody will ever write a non-BMP character in a VOTable, so it's
not much of an argument.
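
To illustrate the UCS-2 versus UTF-16 point, here is a minimal Python
sketch (the helper name utf16_code_units is made up for this example,
and big-endian byte order is an assumption).  For BMP text the two
encodings are byte-for-byte the same, so the count of 16-bit units
equals the character count; a non-BMP character needs a surrogate
pair, and the two counts part company.

```python
def utf16_code_units(text: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for this text."""
    # Encode without a byte order mark; each code unit is 2 bytes.
    return len(text.encode("utf-16-be")) // 2


bmp_text = "caf\u00e9"        # every character is in the BMP
non_bmp_text = "\U0001D11E"   # U+1D11E, a code point outside the BMP

# BMP only: one code unit per character, same as UCS-2.
print(len(bmp_text), utf16_code_units(bmp_text))          # 4 4

# Non-BMP: one character, but two code units (a surrogate pair).
print(len(non_bmp_text), utf16_code_units(non_bmp_text))  # 1 2
```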

> > If you can do UTF-8 in char then it could be worth retaining what's
> > currently unicodeChar for that purpose, especially since it's not
> > likely to be used for any other reason when there's a UTF-8
> > alternative.
> 
> Some languages or environments (Java, C#, PowerShell) work more
> naturally in UTF-16.  But they can also handle UTF-8, so if we wanted
> to deprecate unicodeChar, that would also be fine with me.

I reckon that would make more sense - allowing UTF-16 alongside
UTF-8 doesn't seem to buy you much, so use of unicodeChar in a new
VOTable version that provided a well-defined way to use UTF-8 should
be deprecated.
The only benefit of unicodeChar would be as a way of writing
most Unicode code points in a way that works the same in existing
and possible future versions of the standard.  So one possibility
would be to keep unicodeChar as a backward-compatible way of
writing BMP characters, and just disallow anything that would
require UTF-16 surrogates.  That may be too messy to be worth
it though.
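
A rough sketch of what "disallow anything that would require UTF-16
surrogates" might look like on the writing side, in Python.  The
function name encode_bmp_only and the choice of big-endian output are
assumptions for illustration, not anything taken from the standard.

```python
def encode_bmp_only(text: str) -> bytes:
    """Encode text as 16-bit big-endian units, refusing non-BMP input.

    With non-BMP characters rejected, the output is valid both as
    UCS-2 and as UTF-16, and holds exactly one unit per character.
    """
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            raise ValueError(
                f"character at index {index} (U+{ord(ch):04X}) is outside "
                "the BMP and would need a surrogate pair"
            )
    return text.encode("utf-16-be")


encoded = encode_bmp_only("\u03b1 Cen")   # Greek alpha is in the BMP
print(len(encoded) // 2)                  # 5 code units for 5 characters

try:
    encode_bmp_only("\U0001D11E")
except ValueError as err:
    print("rejected:", err)
```

A reader could apply the mirror-image check by rejecting any 16-bit
unit in the range 0xD800 to 0xDFFF.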

My reading of the discussions so far is that redefining char as
UTF-8 in a future version of VOTable, with some rewording of
the standard to make clear what's going on, is the most popular
option and *probably* not going to cause significant backward
compatibility problems.  So I'm not going to spend any more time
voicing my concerns about it, especially since other people seem
to be better informed about Unicode than I am (thanks Walter
for persisting with this).  Writing UTF-8 into char fields of
VOTables declared as per existing versions of the standard has a
reasonable chance of doing what you'd want it to, though if you
try that you ought not to be too surprised if it doesn't work.
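
As a small illustration of how that can go either way, the Python
sketch below decodes the same UTF-8 bytes the way three hypothetical
readers might.  The function read_char_field and the choice of reader
encodings are assumptions for the example; real VOTable libraries may
behave differently.

```python
def read_char_field(raw: bytes, encoding: str) -> str:
    """Decode the raw bytes of a char field as a given reader might."""
    return raw.decode(encoding)


ascii_only = "M31".encode("utf-8")
accented = "caf\u00e9".encode("utf-8")   # 5 bytes for 4 characters

# Pure ASCII content is identical in UTF-8, so every reader agrees.
print(read_char_field(ascii_only, "ascii"))

# A UTF-8-aware reader recovers the original text.
print(read_char_field(accented, "utf-8"))

# A reader assuming ISO-8859-1 silently produces the wrong characters.
print(repr(read_char_field(accented, "latin-1")))

# A strict ASCII reader refuses the bytes outright.
try:
    read_char_field(accented, "ascii")
except UnicodeDecodeError as err:
    print("strict ASCII reader failed:", err.reason)
```

Note also that the byte length (5) and the character count (4) differ
for the accented string, which matters if arraysize is taken to count
bytes.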

Mark

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/