Unicode in VOTable
Walter Landry
wlandry at caltech.edu
Sat Mar 8 10:06:10 PST 2014
Rob Seaman <seaman at noao.edu> wrote:
> On Mar 7, 2014, at 11:26 PM, Walter Landry <wlandry at caltech.edu> wrote:
>
>> Mark Taylor <M.B.Taylor at bristol.ac.uk> wrote:
>>
>>> In practice I think the answer is to use the unicodeChar type for
>>> unicode text. If it's in the body of the XML (e.g. a TD element
>>> or PARAM value) then encode it however the VOTable itself is encoded.
>>> If it's in BINARY or FITS form, I'd encode it using UTF-16.
>>> Since UTF-16 is the same as UCS-2 for code points from 0x0 to 0xFFFF
>>> it should be OK for not-too-weird characters. It may well
>>> work outside that range as well, since it's likely that processing
>>> software will be using UTF-16 rather than UCS-2 - as far as I can
>>> tell that's more or less what Java does (whose char primitive type
>>> is 16 bits). If you try that and run into interesting problems,
>>> it may be worth reporting them back here.
>>
>> It sounds like you are saying that UTF-16 is the de facto standard
>> anyway. If that is so, then we should update the standard to reflect
>> reality.
>
> It sounds more like unicode continues to evolve and perhaps it might
> be premature to pick a single horse to bet on.

The switch from a fixed 2-byte representation (UCS-2) to UTF-16, which
uses 2 or 4 bytes per character, happened with Unicode 2.0 in 1996.
Upgrading to that would not be premature.

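To make the 2-versus-4-byte distinction concrete, here is a minimal
Python sketch. The helper name utf16_code_units is purely illustrative
and the two sample characters are arbitrary choices; nothing here comes
from the VOTable standard.

```python
def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for the string ch."""
    return len(ch.encode("utf-16-be")) // 2

bmp_char = "\u00e9"         # U+00E9, inside the 0x0 to 0xFFFF range
astral_char = "\U0001F600"  # U+1F600, above 0xFFFF

print(utf16_code_units(bmp_char))             # 1 (same bytes as UCS-2)
print(utf16_code_units(astral_char))          # 2 (a surrogate pair)
print(astral_char.encode("utf-16-be").hex())  # d83dde00
```

A reader that treats each 16-bit unit as a complete UCS-2 character
would see the second case as two separate, unassigned characters.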
>> In that vein, it would also be nice if ASCII were replaced with UTF-8.
>> For some systems, UTF-8 is more natural than UTF-16. There are also no
>> byte order issues.
>>
>>> In any case, this is something that should be revisited if there is
>>> a future revision of the VOTable standard.
>
> Whatever the "natural" encoding for a VOTable, ASCII is unlikely to
> vanish from the world and should be preserved as an option for
> backwards compatibility.

That is one reason I suggested UTF-8. Plain ASCII is valid UTF-8. It
is 100% backwards compatible.

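The compatibility claim can be checked directly. This is a small Python
sketch, not anything prescribed by VOTable: it decodes every 7-bit
ASCII byte value both ways and compares the results.

```python
# All 128 ASCII byte values, 0x00 through 0x7F.
ascii_bytes = bytes(range(128))

as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")

# Decoding as UTF-8 gives exactly the same text as decoding as ASCII,
# and re-encoding as UTF-8 gives back the original bytes unchanged.
assert as_ascii == as_utf8
assert as_utf8.encode("utf-8") == ascii_bytes
```

So an existing ASCII "char" column would be read identically by a
reader that assumes UTF-8.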
> More fundamentally, the actual representation of strings as well as
> other data types should be presumed to be transformed from whatever
> standards whether due to compression or other logistical storage or
> workflow transformations. See, for instance:
>
> http://www.unicode.org/reports/tr6/
>
> or various references on XML compression options.

If VOTable were to support SCSU (Standard Compression Scheme for
Unicode), I would imagine it as another encoding, along with UTF-8 and
UTF-16. So instead of "char" or "unicodeChar", it could be
"scsuChar".

But that would be an optimization. At this time, UTF-8 and UTF-16 are
the dominant encodings in use. Qt, Java, C#, and ICU all use UTF-16
internally, as did narrow builds of Python before 3.3. On the other
hand, UTF-8 is compatible with C-style strings, does not have an
endianness ambiguity, and sorts in code point order when compared
byte by byte.

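The three UTF-8 properties above can be illustrated with a short Python
sketch. The sample strings are arbitrary, and the comparison uses
big-endian UTF-16 only as one example serialization.

```python
text = "VOTable"
utf8 = text.encode("utf-8")
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

# C-style strings: UTF-16 embeds zero bytes in ordinary text, UTF-8 does not.
assert b"\x00" in utf16_be and b"\x00" not in utf8

# Endianness: UTF-16 has two byte serializations, UTF-8 has only one.
assert utf16_be != utf16_le

# Sorting: Python compares str values by code point, so sorted(words)
# is the reference order (z, U+00E9, U+FFE5, U+1F600).
words = ["\U0001F600", "\uffe5", "z", "\u00e9"]
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

assert by_utf8_bytes == by_code_point   # byte order matches code point order
assert by_utf16_bytes != by_code_point  # surrogates sort before U+FFE5
```

The UTF-16 ordering differs because surrogate code units (0xD800 to
0xDFFF) compare lower than code points such as U+FFE5, even though they
encode characters above 0xFFFF.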
That is why I think the VOTable standard should be changed to specify
that "char" should be interpreted as UTF-8 and "unicodeChar" as
UTF-16. It would be completely backwards compatible and it sounds
like what people are doing anyway.

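As a rough illustration of what the proposal would mean for a reader of
binary-serialized strings, here is a hedged Python sketch. The function
decode_field and the ENCODINGS table are hypothetical names, not part
of any VOTable library. Big-endian UTF-16 and the stripping of trailing
NUL padding are assumptions made for this example.

```python
# Hypothetical mapping from VOTable datatype to the proposed encoding.
# Assumption: binary data is big-endian, so UTF-16 is read as UTF-16BE.
ENCODINGS = {"char": "utf-8", "unicodeChar": "utf-16-be"}

def decode_field(raw: bytes, datatype: str) -> str:
    """Decode the raw bytes of one string field under the proposed rules."""
    text = raw.decode(ENCODINGS[datatype])
    # Assumption: fixed-length fields are padded with trailing NULs.
    return text.rstrip("\x00")

# Existing ASCII "char" data decodes exactly as before.
print(decode_field(b"M31\x00\x00", "char"))                         # M31
# Non-ASCII text fits in "char" once it is read as UTF-8.
print(decode_field("\u00e9toile".encode("utf-8"), "char"))          # étoile
# Characters above 0xFFFF survive in "unicodeChar" read as UTF-16.
print(decode_field("\U0001F600".encode("utf-16-be"), "unicodeChar") == "\U0001F600")  # True
```

One open design question this sketch glosses over is whether the
declared array size counts bytes, 16-bit units, or characters.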
Cheers,
Walter Landry