Unicode in VOTable

Mon Aug 18 11:08:21 PDT 2014

Mark Taylor <M.B.Taylor at bristol.ac.uk> wrote:
> On Fri, 15 Aug 2014, Walter Landry wrote:
> 
>> Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:
>> > On Thu, 14 Aug 2014, Markus Demleitner wrote:
>> > 
>> >> Now, if we go this way: Why have a new type at all?  I'd maintain no
>> >> existing valid VOTable would break if we just said something essentially
>> >> like:
>> >> 
>> >>   VOTable considers char as byte streams that can be decoded from utf-8
>> >>   for presentation purposes.   TABLEDATA encoding is presentation.
>> >>   arraysize refers to the length of the bytestream always, never to
>> >>   the length of any unicode code sequence decodeable from the byte
>> >>   stream.
>> > 
>> > Yes, I think that would work.  "TABLEDATA encoding is presentation"
>> > seems like a rather radical statement in terms of the way one
>> > usually thinks about VOTable, but I can't think of any actual
>> > negative consequences.
>> 
>> This sounds a lot like what I proposed back in March, so I like it
>> too ;)  It would be good if we could do the same thing for unicodeChar
>> and UTF-16.
> 
> Maybe.  UCS-2, though it's archaic (obsolete?), does retain the
> assurance that the number of characters can be determined from
> the arraysize.

I do not know if you can even create UCS-2 these days without going
through gymnastics.  For example, Java

  http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

only guarantees support for US-ASCII, ISO-8859-1, UTF-8, UTF-16,
UTF-16LE, and UTF-16BE; UCS-2 is not on that list.  So what is
probably happening is that no one is actually writing UCS-2.  They
are writing UTF-16 and not noticing the difference.
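
A minimal sketch of why the difference tends to go unnoticed, written
in Python for brevity.  The helper names (utf16_code_units, fits_ucs2)
are made up for this illustration and are not part of any VOTable
library.  For text confined to the Basic Multilingual Plane, UTF-16
and UCS-2 produce the same 16-bit units, so the count of units equals
the count of characters; only a character outside the BMP needs a
surrogate pair and breaks that equality.

```python
def utf16_code_units(text: str) -> int:
    """Return the number of 16-bit code units in the UTF-16 encoding of text."""
    # utf-16-be emits no byte-order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-be")) // 2


def fits_ucs2(text: str) -> bool:
    """True if every character lies in the BMP, where UTF-16 and UCS-2 agree."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp = "caf\u00e9"        # 4 characters, all inside the BMP
astral = "\U0001D11E"    # MUSICAL SYMBOL G CLEF, outside the BMP

print(len(bmp), utf16_code_units(bmp), fits_ucs2(bmp))
print(len(astral), utf16_code_units(astral), fits_ucs2(astral))
```

For the BMP string the character count and the code-unit count match,
which is the assurance UCS-2 gives.  For the G clef, one character
occupies two code units, so an arraysize counted in 16-bit units no
longer tells you the number of characters.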

> If you can do UTF-8 in char then it could be worth retaining what's
> currently unicodeChar for that purpose, especially since it's not
> likely to be used for any other reason when there's a UTF-8
> alternative.

Some languages or environments (Java, C#, PowerShell) work more
naturally in UTF-16.  But they can also handle UTF-8, so if we wanted
to deprecate unicodeChar, that would also be fine with me.
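
As a rough illustration of the proposal quoted at the top (arraysize
counts bytes, and decoding to characters happens only for
presentation), here is a small Python sketch.  The function names
char_arraysize and decode_char_field are hypothetical and exist only
for this example; they assume a char field holding UTF-8 bytes and a
unicodeChar field holding UTF-16 code units.

```python
def char_arraysize(text: str) -> int:
    """Bytes needed to store text in a char field, assuming UTF-8 content."""
    return len(text.encode("utf-8"))


def unicodechar_arraysize(text: str) -> int:
    """16-bit units needed in a unicodeChar field, assuming UTF-16 content."""
    return len(text.encode("utf-16-be")) // 2


def decode_char_field(raw: bytes) -> str:
    """Decode the stored byte stream for presentation purposes."""
    return raw.decode("utf-8")


name = "\u00c5ngstr\u00f6m"   # 8 characters, two of them non-ASCII
raw = name.encode("utf-8")

print(len(name), char_arraysize(name), unicodechar_arraysize(name))
print(decode_char_field(raw) == name)
```

Here the 8-character string needs an arraysize of 10 as UTF-8 bytes
but 8 as UTF-16 code units, and the byte stream round-trips to the
original text when decoded.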

Cheers,
Walter Landry