Unicode in VOTable

Sat Mar 8 10:06:10 PST 2014

Rob Seaman <seaman at noao.edu> wrote:
> On Mar 7, 2014, at 11:26 PM, Walter Landry <wlandry at caltech.edu> wrote:
> 
>> Mark Taylor <M.B.Taylor at bristol.ac.uk> wrote:
>> 
>>> In practice I think the answer is to use the unicodeChar type for
>>> unicode text.  If it's in the body of the XML (e.g. a TD element
>>> or PARAM value) then encode it however the VOTable itself is encoded.
>>> If it's in BINARY or FITS form, I'd encode it using UTF-16.
>>> Since UTF-16 is the same as UCS-2 for code points from 0x0 to 0xFFFF
>>> it should be OK for not-too-weird characters.  It may well
>>> work outside that range as well, since it's likely that processing
>>> software will be using UTF-16 rather than UCS-2 - as far as I can
>>> tell that's more or less what Java does (whose char primitive type
>>> is 16 bits).  If you try that and run into interesting problems,
>>> it may be worth reporting them back here.
>> 
>> It sounds like you are saying that UTF-16 is the de facto standard
>> anyway.  If that is so, then we should update the standard to reflect
>> reality.
> 
> It sounds more like unicode continues to evolve and perhaps it might
> be premature to pick a single horse to bet on.

Unicode outgrew the fixed 2-byte representation in 1996, when Unicode
2.0 added surrogate pairs for code points above 0xFFFF.  Upgrading to
that would not be premature.
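
To make the difference concrete, here is a minimal Python 3 sketch.
The helper name utf16_units is mine, purely for illustration.  It
shows that a code point inside the range 0x0 to 0xFFFF takes one
16-bit unit in UTF-16, exactly as in UCS-2, while a code point above
0xFFFF takes a surrogate pair:

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text encoded as big-endian UTF-16."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# U+00E9 (e with acute accent) is below 0x10000: one unit, same as UCS-2.
bmp = utf16_units("\u00e9")

# U+1D11E (musical G clef) is above 0xFFFF: UTF-16 needs two units,
# a high surrogate followed by a low surrogate.
astral = utf16_units("\U0001d11e")

print([hex(u) for u in bmp])
print([hex(u) for u in astral])
```

A reader that assumes strict UCS-2 would see the second string as two
unrelated 16-bit values rather than one character.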

>> In that vein, it would also be nice if ASCII was replaced with UTF-8.
>> For some systems, UTF-8 is more natural than UTF-16.  There are also no
>> byte order issues.
>> 
>>> In any case, this is something that should be revisited if there is
>>> a future revision of the VOTable standard.
> 
> Whatever the "natural" encoding for a VOTable, ASCII is unlikely to
> vanish from the world and should be preserved as an option for
> backwards compatibility.

That is one reason I suggested UTF-8.  Plain ASCII is valid UTF-8.  It
is 100% backwards compatible.
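
A quick way to check this, sketched in Python 3.  The sample column
name is made up for illustration and is not taken from any real table:

```python
# Bytes as an ASCII-only writer might emit them (the name is made up).
legacy = b"RA_J2000"

# The same bytes decode to the same string under either interpretation.
same_text = legacy.decode("ascii") == legacy.decode("utf-8")

# Encoding ASCII-only text as UTF-8 reproduces the original bytes exactly.
round_trip = legacy.decode("utf-8").encode("utf-8") == legacy

# The reverse does not hold: non-ASCII text needs bytes above 0x7F,
# which a strict ASCII reader rejects.
angstrom = "\u00c5".encode("utf-8")   # U+00C5, two bytes in UTF-8
try:
    angstrom.decode("ascii")
    ascii_reader_ok = True
except UnicodeDecodeError:
    ascii_reader_ok = False

print(same_text, round_trip, len(angstrom), ascii_reader_ok)
```

So the compatibility runs in one direction: a UTF-8 reader accepts
every existing ASCII file, while an old ASCII-only reader would still
choke on new files that actually use non-ASCII characters.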

> More fundamentally, the actual representation of strings as well as
> other data types should be presumed to be transformed from whatever
> standards whether due to compression or other logistical storage or
> workflow transformations.  See, for instance:
> 
> 	http://www.unicode.org/reports/tr6/
> 
> or various references on XML compression options.

If VOTable were to support SCSU (Standard Compression Scheme for
Unicode), I would imagine it as another encoding, along with UTF-8 and
UTF-16.  So instead of "char" or "unicodeChar", it could be
"scsuChar".

But that would be an optimization.  At this time, UTF-8 and UTF-16 are
the dominant encodings in use.  Qt, Java, C#, and ICU all use UTF-16
internally, as did narrow builds of Python before 3.3.  On the other
hand, UTF-8 is compatible with C-style strings (no embedded zero
bytes), has no endianness ambiguity, and sorts byte-wise in code point
order.
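
The UTF-8 properties listed above can be checked with a short Python 3
sketch.  The sample strings are arbitrary choices of mine, picked so
that one code point lies above 0xFFFF:

```python
text = "A\u00e9"

utf8 = text.encode("utf-8")
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

# C-style strings: UTF-8 only emits a zero byte for U+0000 itself,
# while UTF-16 pads every ASCII character with one.
has_nul_utf8 = b"\x00" in utf8
has_nul_utf16 = b"\x00" in utf16_be

# Endianness: UTF-16 has two byte layouts, UTF-8 has only one.
layouts_differ = utf16_be != utf16_le

# Sort order: compare U+FF21 (fullwidth A) with U+1D11E (G clef).
words = ["\uff21", "\U0001d11e"]
by_code_point = sorted(words)   # Python 3 compares str by code point
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(has_nul_utf8, has_nul_utf16, layouts_differ)
print(by_utf8 == by_code_point, by_utf16 == by_code_point)
```

The UTF-16 ordering goes wrong here because surrogates occupy 0xD800
to 0xDFFF, so any character above 0xFFFF sorts before characters in
the 0xE000 to 0xFFFF range when compared unit by unit.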

That is why I think the VOTable standard should be changed to specify
that "char" should be interpreted as UTF-8 and "unicodeChar" as
UTF-16.  That would be backwards compatible, since existing ASCII data
is valid UTF-8 and existing UCS-2 text is valid UTF-16, and it sounds
like what people are doing anyway.

Cheers,
Walter Landry