Unicode in VOTable

Norman Gray norman at astro.gla.ac.uk
Wed Mar 12 07:46:59 PDT 2014


Walter, hello.

On 2014 Mar 10, at 08:38, Walter Landry <wlandry at caltech.edu> wrote:

> Norman Gray <norman at astro.gla.ac.uk> wrote:
>>> VOTables support two kinds of characters: ASCII characters and
>>> Unicode codepoints. Unicode is a way to represent characters that
>>> is an alternative to ASCII (though ASCII is a subset of the Unicode
>>> character repertoire).  XML files (and therefore any strings within
>>> such files) are defined in terms of a sequence of Unicode
>>> codepoints, and the Unicode definition can handle a large variety
>>> of international alphabets.
>>> 
>>> The ASCII characters 0x00 to 0x7f (inclusive) correspond exactly to
>>> the same-numbered Unicode codepoints.  Thus in the VOTable data
>>> model, the Unicode codepoints in this range may be regarded as
>>> being of type 'char' rather than their supertype 'unicodeChar'.
> 
> I am confused by what you are specifying.  It seems to allow
> unicodeChar's to be UCS-2, UTF-16, or UTF-32.  

No, these are orthogonal concepts (unless I'm misunderstanding the intention of this section).  The relevant text is part of Section 2, Data Model: that is, it's the conceptual model after parsing.  'UTF-8', 'UTF-32', and so on, are properties of the serialisation as bytes, in the XML file (or the BINARY or FITS blob).  Once the bytestring has been deserialised, then the encoding is irrelevant.
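
A minimal Python sketch of the distinction, in case it helps. The sample string, the list of encodings and the helper name `roundtrip` are my own choices for illustration; none of them comes from the VOTable document.

```python
# Illustrative only: the encodings listed and the sample text are assumptions.
ENCODINGS = ("utf-8", "utf-16", "utf-32")


def roundtrip(text, encoding):
    """Serialise text to bytes in the given encoding, then deserialise it.

    Returns (serialised_bytes, deserialised_text).
    """
    data = text.encode(encoding)           # codepoints -> bytes (serialisation)
    return data, data.decode(encoding)     # bytes -> codepoints (deserialisation)


if __name__ == "__main__":
    sample = "Å σ 日本"
    for enc in ENCODINGS:
        data, decoded = roundtrip(sample, enc)
        # The byte strings differ from one encoding to the next,
        # but the sequence of codepoints recovered is always the same.
        print(enc, len(data), "bytes;", [hex(ord(c)) for c in decoded])
```

The bytes differ in each case, but what you get back after decoding is the same sequence of codepoints, and that is the only thing the data model needs to talk about.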

In fact _all_ of the 'bytes' column in the associated table is moot, because this is talking about the post-deserialisation model.

This section doesn't have to say anything about unicode strings within the VOTable, because the serialisation is necessarily the same as the serialisation of the whole XML file, which is indicated elsewhere.  That is, if you put a UTF-32 string inside an XML file (declared as being) encoded as UTF-32, then the file and string are valid right now.
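
As a rough sketch of this point, using Python's standard xml.etree.ElementTree: the same one-element document is written out in two encodings and parsed back. The helper names `serialise` and `cell_text` and the bare `<TD>` element are hypothetical, chosen only for illustration, and I've stuck to UTF-8 and UTF-16 because those are the two encodings every XML processor is required to read; whether a given parser accepts anything else is up to the parser.

```python
import xml.etree.ElementTree as ET


def serialise(value, encoding):
    """Build a tiny one-element document, declared and encoded consistently.

    Hypothetical helper: `value` must not contain XML special characters,
    since no escaping is done here.
    """
    doc = f"<?xml version='1.0' encoding='{encoding}'?><TD>{value}</TD>"
    # The string inside the element is encoded along with the rest of the file.
    return doc.encode(encoding)


def cell_text(xml_bytes):
    """Parse a serialised document and return the text of its root element."""
    return ET.fromstring(xml_bytes).text


if __name__ == "__main__":
    value = "σ 日本"
    as_utf8 = serialise(value, "utf-8")
    as_utf16 = serialise(value, "utf-16")
    print(as_utf8 == as_utf16)                         # the files differ as bytes
    print(cell_text(as_utf8) == cell_text(as_utf16))   # the parsed strings agree
```

There is no separate encoding for the string: it is simply whatever the enclosing file's encoding is, and it has disappeared by the time the application sees the parsed content.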

Indeed you can, for example, already include strings encoded in ISO-2022-JP (Japanese) into an XML file which starts with <?xml version='1.0' encoding='ISO-2022-JP'?>.  This is a valid XML file, and a valid VOTable (although your XML parser is allowed to barf on the encoding).

The only place (I think) where there's any need for discussing a unicode serialisation is within BINARY blobs.  I doubt there's even a need for discussing it within FITS blobs, since their internal encoding is already specified elsewhere.

All the best,

Norman


-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK
