Unicode in VOTable
Markus Demleitner
msdemlei at ari.uni-heidelberg.de
Thu Oct 16 14:46:11 CEST 2014
Hi,
On Thu, Oct 16, 2014 at 01:04:27PM +0100, Mark Taylor wrote:
> On Wed, 15 Oct 2014, Dave Morris wrote:
> > On 2014-10-15 15:53, Mark Taylor wrote:
> > >
> > > P3: Define both arraysize and binary run-length as "number of bytes"
> >
> > P3 does not work.
>
> I feel that "does not work" is overstating it.
>
> Your discussion below demonstrates that for the case of writing
> streamed VOTable output from a database with fixed-length unicode
> columns whose content may include non-ASCII characters, P3 does not
> allow you to write VOTable fields with fixed-arraysize columns.
Let me be a bit provocative here: char doesn't work (perfectly) for
you if what you want to stuff in there isn't actually chars but
codepoints.  That's hardly surprising -- I've been moaning about our
practice of stuffing geometries into char(*) fields off and on, too
(albeit for slightly different reasons).  But allowances have to be
made, and we haven't introduced a geometry type in VOTable (yet),
either.
Given that char is *not* codepoints (at least not in C, where on
modern machines it universally means "octet"), I'd argue all the
confusion comes from TABLEDATA, where you deal with codepoints rather
than chars when parsing.  *That* practice does indeed become
problematic as we enter the realm of codepoints that cannot be
represented in a single char (a.k.a. non-ASCII).
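To make the distinction concrete, here's a quick Python sketch
(purely illustrative, assuming UTF-8 as the encoding, which is what
P3 would count bytes in):

    # Codepoint count vs. char (octet) count for a non-ASCII value:
    s = "Mößingen"             # 8 codepoints
    b = s.encode("utf-8")      # 10 octets: ö and ß take two each
    print(len(s), len(b))      # -> 8 10

An arraysize counted in chars (octets) and one counted in codepoints
simply disagree as soon as the high bit shows up.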
> > P3 gives us all the disadvantages of a 'string' data type (every char column
> > is variable length) without the advantage of explicitly defining a 'string'
> > data type (arrays of strings).
>
> I admit that BINARY serialization of arrays of strings
> (multi-dimensional arrays of characters) becomes problematic with P3,
> since in this case you must supply the length (in P3, UTF-8 byte count)
> of the longest string - unlike for 1-d char arrays, variable-length
> is not an option. That may be a killer argument.
Not if you take "char" for what it is.  P3 changes exactly nothing
except the *interpretation* of the byte stream.  So, I'd argue it's
string arrays in TABLEDATA that become a problem, because what a
parser would need to do is get the codepoints from the XML parser,
encode them to UTF-8, and then perform element segmentation.
Not pretty, but not a disaster, either.  And to me it sounds better
than double encoding (dumping UTF-8 into the XML serialiser, which
then encodes the stuff anew).
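For concreteness, here's roughly the segmentation step I have in mind,
as a Python sketch (split_char_array is a made-up helper; I'm assuming
P3-style semantics, i.e. arraysize counts UTF-8 octets and writers pad
each element to the element size):

    def split_char_array(text, elem_size):
        # The XML parser hands us codepoints; under P3, arraysize
        # counts octets, so re-encode before segmenting.
        octets = text.encode("utf-8")
        # Octet-wise slicing assumes the writer padded each element
        # to elem_size bytes without splitting a multi-byte sequence.
        return [octets[i:i + elem_size].decode("utf-8").rstrip(" ")
                for i in range(0, len(octets), elem_size)]

    # E.g., two 10-octet elements ("Mößingen" is 10 octets,
    # "Tübingen" is 9 plus one blank of padding):
    # split_char_array("MößingenTübingen ", 10)
    # -> ['Mößingen', 'Tübingen']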
Finally -- whether there's actually a case for the string type Dave
wants, serialising sequences of codepoints, I don't know.  But saying
what to do with the bytes in VOTable char that have their highest bit
set would definitely be a Very Good Thing Indeed.
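To illustrate why (Python again, purely for exposition):

    # The same octets with the high bit set, read two ways:
    raw = b"M\xc3\xb6"
    print(raw.decode("utf-8"))    # -> Mö  (two codepoints)
    print(raw.decode("latin-1"))  # -> MÃ¶ (three codepoints)

Right now a parser seeing such bytes in a char field can only guess.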
Cheers,
Markus