Unicode in VOTable

Mon Aug 18 14:16:31 PDT 2014

On Mon, 18 Aug 2014, Norman Gray wrote:

> Wouldn't it be possible to just use the arraysize for UTF-16 in the same way as you're proposing for UTF-8?  If the arraysize is the number of bytes the string would encode to in UTF-16 (as opposed to the number of words, as you suggest on the issues wiki page), then applications can use that in exactly the same way as it uses the attribute when parsing/skipping/indexing-into UTF-8, and the only difference would be that it would have to know to decode it as UTF-16 rather than UTF-8.

That's quite possible, but as per my other message, I don't think
there's much point expending effort or adding complication to
provide a UTF-16 alternative within VOTable if a UTF-8 encoding
is available.
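
To make the byte-count reading of arraysize concrete, here is a rough
sketch in Python. It is only an illustration: the helper names
(encoded_lengths, skip_fixed_field) are made up for this message and
are not part of VOTable or any library, and it assumes big-endian
UTF-16 without a byte order mark.

```python
def encoded_lengths(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    utf8 = text.encode("utf-8")
    # Explicit byte order so no BOM is added (an assumption of this sketch).
    utf16 = text.encode("utf-16-be")
    return len(utf8), len(utf16)


def skip_fixed_field(offset, arraysize):
    """Hypothetical helper: step over a fixed-length field.

    If arraysize is a byte count, skipping works the same way
    whichever encoding the bytes are in; only decoding differs.
    """
    return offset + arraysize


if __name__ == "__main__":
    # ASCII, Latin-1 range, BMP, and a supplementary-plane character
    for s in ["abc", "\u00c5", "\u20ac", "\U0001d6fc"]:
        n8, n16 = encoded_lengths(s)
        print(repr(s), "utf-8 bytes:", n8, "utf-16 bytes:", n16)
```

Either way the reader can skip or index using the byte count alone;
the byte counts themselves differ between the two encodings for most
non-trivial strings.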

> It could potentially discover that from an 'encoding' attribute (which could default to 'us-ascii' or 'utf-8'), or by deciding that 'char' means 'string-encoded-as-utf-8-bytes' and 'unicodeChar' means 'string-encoded-as-utf-16'-bytes.

Again, possible, but I am not in favour either of providing more
character encoding options or of splitting the information about how
characters are encoded between multiple attributes.

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/