Unicode in VOTable

Norman Gray norman at astro.gla.ac.uk
Sat Mar 8 11:51:00 PST 2014


Walter and all, hello.

I'm replying to Walter's email, here, but also to Mark's and Rob's remarks.

Yesterday, I started a brief response to Walter which, as I think often happens near Unicode, inflated substantially with subtleties; most, but not all, of it ends up not disagreeing materially with Mark's and Rob's subsequently arriving remarks.

Therefore you can probably skip the majority of the following (sigh), and go to the bottom ('VOTable REC text'), where I suggest some modified text for the VOTable REC.  Perhaps this is an opportunity to exercise the 'erratum' mechanism which Markus is proposing.





On 2014 Mar 7, at 19:00, Walter Landry <wlandry at caltech.edu> wrote:

> in the
> VOTable Format Definition Version 1.3, there are the statements
> 
>   VOTables support two kinds of characters: ASCII 1-byte characters
>   and Unicode (UCS-2) 2-byte characters.  Unicode is a way to
>   represent characters that is an alternative to ASCII. It uses two
>   bytes per character instead of one, it is strongly supported by XML
>   tools, and it can handle a large variety of international
>   alphabets.
> 
> This is not actually true.  Unicode, in general, requires 4 bytes per
> character.  There are encodings, such as UTF-16, which often only
> require 2 bytes, but even UTF-16 sometimes requires more than 2 bytes
> to express a character.

I'm somewhat embarrassed I didn't notice this during the discussion of VOTable 1.3.

Herewith a brief tutorial about the parts of Unicode relevant to this issue, for anyone watching who's so far innocent of them.  There are various subtleties (human ingenuity being limitless in its capacity for generating exceptions and general weirdness), but the main points are pretty simple.

There are two things, in Unicode, which are often confused and conflated: (i) Unicode itself, which is defined in terms of (mathematical) integers; and (ii) the encodings, which specify how those integers are serialised to bytes.

Unicode is a mapping of integers ('codepoints') to characters.  Characters here are idealised single units: the letter 'a' is a character, independent of the font, size or shape of the glyph it's rendered with.  Ligatures -- for example the 'fl' shape that a font will sometimes substitute for separate 'f' and 'l' characters -- are glyphs rather than characters in this sense.

There are infinitely many integers, so there are in principle infinitely many character mappings definable in Unicode (although at present the codepoint range is limited to 0x0 to 0x10ffff).

A Unicode string is a sequence of these integers, each representing a character (I'll refer to codepoints and characters interchangeably from now on).  Thus the character-length of a Unicode string is perfectly well-defined.

The codepoints 0x0--0x1f are control characters (most of which are in any case forbidden in XML 1.0 documents); the codepoints 0x20--0x7f are identical to the corresponding ASCII characters.

How this sequence of integers is represented in memory, or in a file, is nothing to do with Unicode.  Thus, it makes no sense to say that 'Unicode requires n bits, or m bytes' per character.
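
As a concrete illustration of that distinction (Python 3 here, whose str type is a sequence of codepoints; purely a sketch, nothing VOTable-specific):

    # One string, three codepoints, three different byte counts.
    s = 'a\u00e9\U0001d70b'              # 'a', e-acute, mathematical italic pi
    print(len(s))                        # 3 -- the character-length is well-defined
    print(len(s.encode('utf-8')))        # 7 bytes (1 + 2 + 4)
    print(len(s.encode('utf-16-be')))    # 8 bytes (2 + 2 + a 4-byte surrogate pair)
    print(len(s.encode('utf-32-be')))    # 12 bytes (4 + 4 + 4)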

KEY POINT----> The Unicode consortium has defined a number of ways of serialising a sequence of integers (ie, a 'Unicode string') to bytes.  These are 'encodings'.

The three main encodings are UTF-8, UTF-16, and UTF-32.

UTF-32 is a sequence of 4-byte integers, directly representing the codepoints.

UTF-16 encodes each codepoint below 0x10000 directly as a 2-byte integer (except for the range 0xd800--0xdfff, which is reserved), together with a system of escapes ('surrogate pairs') which allows characters at and above 0x10000 to be represented as a pair of 2-byte integers (ie, UTF-16 _is_ variable-length: an n-character string will not necessarily turn into a 2n-byte encoded value).
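
The surrogate-pair arithmetic is simple enough to sketch in a few lines of Python (illustrative only, not anything normative):

    def to_surrogate_pair(cp):
        """Split a codepoint at or above 0x10000 into a UTF-16 surrogate pair."""
        assert 0x10000 <= cp <= 0x10ffff
        v = cp - 0x10000                 # 20 bits remain
        high = 0xd800 + (v >> 10)        # top 10 bits -> high surrogate
        low = 0xdc00 + (v & 0x3ff)       # bottom 10 bits -> low surrogate
        return high, low

    # U+1D70B (mathematical italic small pi) becomes the pair (0xd835, 0xdf0b)
    print(tuple(hex(x) for x in to_surrogate_pair(0x1d70b)))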

UTF-8 is a variable-length encoding, which encodes a codepoint in 1--4 bytes.  It's designed so that codepoints in 0x0--0x7f are encoded identically (ie, in 1 byte) in UTF-8 and ASCII (ie, an ASCII file is a valid UTF-8 file).
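
Both design points -- ASCII compatibility, and the distinctive bit patterns of multi-byte sequences -- are easy to see from Python (again, purely an illustration):

    # An ASCII bytestring and its UTF-8 encoding are byte-for-byte identical...
    print('VOTABLE'.encode('ascii') == 'VOTABLE'.encode('utf-8'))   # True
    # ...while non-ASCII characters become multi-byte sequences in which every
    # byte has its top bit set (lead byte 110xxxxx/1110xxxx/11110xxx,
    # continuation bytes 10xxxxxx).
    print(['{:08b}'.format(b) for b in '\u00e9'.encode('utf-8')])
    # ['11000011', '10101001']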

UTF-16 and UTF-32 have various byte-order variants.  UTF-8 has no endianness issues.

There are some other encodings, but the only notable one is UCS-2.

UCS-2 is deprecated.  It encodes the sequence of codepoint integers directly as 2-byte integers.  This is _not_ the same as UTF-16, because UCS-2 doesn't have the 'surrogate pair' escapes, so it cannot represent characters above 0xffff.  It's a fixed-length encoding, and was the encoding used internally by early versions of Java and by 'narrow' builds of Python (both have since moved on, Java to UTF-16 and Python 3.3+ to a flexible internal representation).

A UCS-2 file is readable by a UTF-16 decoder, because the UCS-2 file will not include any surrogate values (the codepoints 0xd800--0xdfff are not assigned to characters by the Unicode standard, so should never appear in a UCS-2 bytestring), but a UCS-2 decoder would get (possibly terminally) confused if it read a UTF-16 bytestring which did use surrogate pairs.  UCS-2 is nominally big-endian.
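
Python has no UCS-2 codec as such, but one can mimic it (illustratively) by writing each codepoint as a 2-byte big-endian integer; for text confined to codepoints below 0x10000 the result coincides with UTF-16, and above that there is simply no representation:

    s = 'ab\u00e7'                        # all codepoints below 0x10000
    naive_ucs2 = b''.join(ord(c).to_bytes(2, 'big') for c in s)
    print(naive_ucs2 == s.encode('utf-16-be'))   # True: UCS-2 and UTF-16 coincide
    # ...whereas a codepoint above 0xffff doesn't fit in two bytes at all:
    # ord('\U0001d70b').to_bytes(2, 'big') raises OverflowError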

The only characters above 0xffff which are at all likely to appear in a VOTable are some special-case mathematical characters <http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols>.  So this isn't likely to be a massive problem in practice.

The Unicode consortium continues to assign more mappings, but does so in a backward-compatible way.  They don't expect to run out of space in the range below 0x10ffff.  (That cap exists largely because UTF-16's surrogate mechanism cannot reach beyond it; UTF-8 and UTF-32 could in principle represent larger integers, though they are currently defined not to.)

VOTable and Unicode
----------------------------

As Mark points out, if a string appears in the body of a VOTable, then it is necessarily encoded in the same encoding as the whole XML file.

XML files can be in any encoding (they're not even restricted to IANA-registered encodings), with the only restriction being that an XML processor must support both UTF-8 and UTF-16.  Thus if a VOTable is encoded in 'PC8-Turkish' (to pick an example at random), that's legitimate (though a particular XML parser is allowed to barf on this).  If the VOTable spec wants to forbid this, and so break away from XML, then it should make a lot more noise about it.

If the string appears in BINARY or FITS form, then there really needs to be some way to indicate (possibly within the standard, as a single possibility) what encoding it's in.

If the VOTable spec wants to mandate a single encoding for this case, then UTF-8 is probably the best one.  It's directly compatible with ASCII for the great majority of cases, it doesn't have any endianness problems, it's easy to spot when a UTF-8 string contains non-ASCII characters (some byte has its top bit set), and easy to resynchronise after an error (continuation bytes have the distinctive pattern 10xxxxxx, so the start of the next character is easy to find).  UTF-8 can support all of the currently anticipated Unicode characters, and I believe it's deemed unlikely that Unicode will have to be extended beyond codepoint 0x10ffff.
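
A hedged sketch of those two properties (nothing to do with any existing VOTable library; just the bit-twiddling spelled out):

    def contains_non_ascii(buf: bytes) -> bool:
        """True if a UTF-8 bytestring contains any non-ASCII character."""
        return any(b & 0x80 for b in buf)        # top bit set => non-ASCII

    def next_char_start(buf: bytes, i: int) -> int:
        """Resynchronise: step past UTF-8 continuation bytes (10xxxxxx) to
        find the start of the next character at or after position i."""
        while i < len(buf) and (buf[i] & 0xc0) == 0x80:
            i += 1
        return i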

VOTable REC text
----------------------

The text currently in the VOTable REC, which Walter quoted, doesn't make a lot of sense as it stands.

An alternative might be to list both 'char' and 'unicodeChar' in the table as 'n/a' under 'bytes', and change the quoted paragraph to something like:

> VOTables support two kinds of characters: ASCII characters and Unicode codepoints. Unicode is a way to represent characters that is an alternative to ASCII (though ASCII is a subset of the Unicode character repertoire).  XML files (and therefore any strings within such files) are defined in terms of a sequence of Unicode codepoints, and the Unicode definition can handle a large variety of international alphabets. 
> 
> The ASCII characters 0x20 to 0x7f (inclusive) correspond exactly to the same-numbered Unicode codepoints.  Thus in the VOTable data model, the Unicode codepoints in this range may be regarded as being of type 'char' rather than their supertype 'unicodeChar'.

(this might need some refinement to handle the XML subtleties surrounding characters 0x1--0x1f).

The definition of the BINARY{,2} encoding might be adjusted to suggest that when a field labelled as type 'char' is serialised, it's a fixed-length field, with each character encoded by its ASCII byte; and when a field of type 'unicodeChar' is serialised, it's a variable-length field, with the Unicode string encoded in UTF-8.
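
To make that suggestion concrete, here is one possible reading of it, sketched in Python.  This is only my illustration, not anything in the current REC: the 'char' case is a fixed-width ASCII field padded with NULs, and the 'unicodeChar' case is a UTF-8 bytestring preceded by a 4-byte count in the style of BINARY's existing variable-length arrays (whether that count should be of bytes or of codepoints is exactly the sort of detail the REC text would need to pin down; I've assumed bytes here):

    import struct

    def serialise_char_field(s: str, width: int) -> bytes:
        """Hypothetical fixed-length 'char' field: ASCII bytes, NUL-padded."""
        data = s.encode('ascii')
        if len(data) > width:
            raise ValueError('value too long for field')
        return data.ljust(width, b'\x00')

    def serialise_unicodechar_field(s: str) -> bytes:
        """Hypothetical variable-length 'unicodeChar' field: a 4-byte
        big-endian byte count followed by the UTF-8 bytes."""
        data = s.encode('utf-8')
        return struct.pack('>I', len(data)) + data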

All the best,

Norman


-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK


