Unicode in VOTable

Mon Mar 10 00:38:43 PDT 2014

Norman Gray <norman at astro.gla.ac.uk> wrote:
> VOTable REC text
> ----------------------
> 
> The text currently in the VOTable REC, which Walter quoted, doesn't
> make a lot of sense as it stands.
> 
> An alternative might be to list both 'char' and 'unicodeChar' in the
> table as 'n/a' under 'bytes', and change the quoted paragraph to
> something like:
> 
>> VOTables support two kinds of characters: ASCII characters and
>> Unicode codepoints. Unicode is a way to represent characters that
>> is an alternative to ASCII (though ASCII is a subset of the Unicode
>> character repertoire).  XML files (and therefore any strings within
>> such files) are defined in terms of a sequence of Unicode
>> codepoints, and the Unicode definition can handle a large variety
>> of international alphabets.
>> 
>> The ASCII characters 0x20 to 0x7f (inclusive) correspond exactly to
>> the same-numbered Unicode codepoints.  Thus in the VOTable data
>> model, the Unicode codepoints in this range may be regarded as
>> being of type 'char' rather than their supertype 'unicodeChar'.

I am confused by what you are specifying.  It seems to allow
unicodeChars to be UCS-2, UTF-16, or UTF-32.  It would be helpful if
it named UTF-8 or UTF-16 explicitly, much as the current document
names UCS-2.  I would prefer a wording like

  VOTables support two kinds of characters.  A "char" is a single byte
  of a UTF-8 string (ASCII strings are valid UTF-8 strings).  A
  "unicodeChar" is two bytes of a UTF-16BE string.

This makes it clear that you can just use ASCII.  People who actually
care about Unicode will probably know whether they want UTF-8 or
UTF-16.
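
To illustrate what that wording would mean in practice, here is a
rough Python sketch.  It is only my reading of the proposed text, not
anything taken from the REC, and the function names are made up for
the example.  It counts how many "char" and "unicodeChar" elements a
string would occupy.

```python
# Sketch only: element counts under the proposed wording, assuming
#   char        = one byte of the UTF-8 encoding
#   unicodeChar = one 2-byte code unit of the UTF-16BE encoding
# Both function names are hypothetical, chosen for this illustration.


def char_arraysize(text: str) -> int:
    """Number of 'char' elements: one per byte of UTF-8."""
    return len(text.encode("utf-8"))


def unicodechar_arraysize(text: str) -> int:
    """Number of 'unicodeChar' elements: one per UTF-16BE code unit."""
    return len(text.encode("utf-16-be")) // 2


samples = [
    "M31",                    # pure ASCII
    "\u00c5ngstr\u00f6m",     # two non-ASCII Latin letters
    "\U0001D6FC",             # outside the BMP: needs a surrogate pair
]

for sample in samples:
    print(
        ascii(sample),
        "codepoints:", len(sample),
        "char:", char_arraysize(sample),
        "unicodeChar:", unicodechar_arraysize(sample),
    )
```

For a pure ASCII string all three counts agree.  Once non-ASCII
characters appear, neither count has to equal the number of
codepoints, which is why naming the encoding matters.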

Cheers,
Walter Landry