Unicode in VOTable

Mon Aug 18 12:00:30 PDT 2014

Mark (and Walter), hello.

On 2014 Aug 18, at 17:55, Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:

> Maybe.  UCS-2, though it's archaic (obsolete?) does retain the
> assurance that the number of characters can be determined from
> the arraysize.

Going further than Walter, I think one can regard UCS-2 as obsolete enough that it would be reasonable to forbid it in VOTables.  It was removed in v2.0 of the Unicode standard (we're now at v7.0; the text at <http://www.unicode.org/faq//utf_bom.html#utf16-11> is germane but slightly confusing).

Wouldn't it be possible to just use the arraysize for UTF-16 in the same way as you're proposing for UTF-8?  If the arraysize is the number of bytes the string would encode to in UTF-16 (as opposed to the number of words, as you suggest on the issues wiki page), then an application can use that in exactly the same way as it uses the attribute when parsing/skipping/indexing-into UTF-8, and the only difference would be that it would have to know to decode the bytes as UTF-16 rather than UTF-8.
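
To make that concrete, here is a rough Python sketch of what I mean.  It assumes fixed-width fields whose arraysize counts encoded bytes, NUL padding, and big-endian UTF-16 without a BOM; the helper names pack_field and read_fields are invented for illustration and are not taken from the VOTable specification.

```python
def pack_field(text, encoding, arraysize):
    """Encode text and NUL-pad it to exactly arraysize bytes (hypothetical helper)."""
    raw = text.encode(encoding)
    if len(raw) > arraysize:
        raise ValueError("encoded string does not fit in arraysize bytes")
    return raw + b"\x00" * (arraysize - len(raw))


def read_fields(buffer, arraysize, encoding):
    """Step through the buffer arraysize bytes at a time, whatever the encoding."""
    return [
        buffer[i:i + arraysize].decode(encoding).rstrip("\x00")
        for i in range(0, len(buffer), arraysize)
    ]


# "naive" with a diaeresis, "alpha Cen" with a Greek alpha, and a non-BMP
# mathematical alpha (which needs a surrogate pair in UTF-16).
rows = ["na\u00efve", "\u03b1 Cen", "\U0001d6fc"]
ARRAYSIZE = 12  # bytes; assumed even so that UTF-16 padding stays aligned

for encoding in ("utf-8", "utf-16-be"):
    buffer = b"".join(pack_field(r, encoding, ARRAYSIZE) for r in rows)
    # The reader skips and indexes identically; only the codec name differs.
    print(encoding, len(buffer), read_fields(buffer, ARRAYSIZE, encoding))
```

The byte arithmetic is the same in both passes of the loop; the character count per field is not recoverable from the arraysize in either case, which is the price of moving away from UCS-2.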

The application could potentially discover that from an 'encoding' attribute (which could default to 'us-ascii' or 'utf-8'), or by deciding that 'char' means 'string-encoded-as-utf-8-bytes' and 'unicodeChar' means 'string-encoded-as-utf-16-bytes'.
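
A minimal sketch of that lookup, in Python.  The 'encoding' attribute does not exist in VOTable today, and the mapping table, the function names and the choice of big-endian UTF-16 are all assumptions of mine for the sake of the example.

```python
# Hypothetical defaults: the datatype alone implies the byte encoding.
DEFAULT_CODEC = {
    "char": "utf-8",
    "unicodeChar": "utf-16-be",  # assumed byte order, no BOM
}


def codec_for(datatype, encoding=None):
    """Prefer an explicit (hypothetical) encoding attribute, else use the datatype."""
    if encoding is not None:
        return encoding
    return DEFAULT_CODEC[datatype]


def decode_cell(raw, datatype, encoding=None):
    """Decode one cell's bytes using whichever codec the field declares."""
    return raw.decode(codec_for(datatype, encoding))


print(decode_cell(b"M31", "char"))
print(decode_cell("\u03b1 Cen".encode("utf-16-be"), "unicodeChar"))
print(decode_cell(b"M31", "char", encoding="us-ascii"))
```

Either route gives the reader one codec name per field, and nothing else in the parsing needs to change.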

Again, a naive/old VOTable reader reading a UTF-16-encoded BINARY 'char' field which is @arraysize bytes long would probably display a rather weird string, but it wouldn't break.
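
For what it's worth, this is the behaviour I have in mind, sketched in Python.  I am assuming the old reader treats each byte as one character (modelled here with Latin-1, which never fails to decode); a reader that insisted on strict ASCII could still complain about bytes above 0x7F.  The function names are made up.

```python
ARRAYSIZE = 10  # bytes per field


def pack_utf16(text, arraysize):
    """Encode as big-endian UTF-16 and NUL-pad to arraysize bytes (hypothetical)."""
    return text.encode("utf-16-be").ljust(arraysize, b"\x00")


def naive_read(buffer, arraysize):
    """An old-style reader: one byte per character, arraysize bytes per field."""
    return [
        buffer[i:i + arraysize].decode("latin-1")
        for i in range(0, len(buffer), arraysize)
    ]


buffer = b"".join(pack_utf16(name, ARRAYSIZE) for name in ["Vega", "Deneb"])
naive = naive_read(buffer, ARRAYSIZE)

for field in naive:
    # Each field is an odd-looking string with interleaved NULs,
    # but the field boundaries are still in the right place.
    print(repr(field))
```

The strings come out littered with NUL characters, but every field still starts at the right offset, so the rest of the row is read correctly.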

All the best,

Norman

-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK