Unicode in VOTable

Wed Aug 27 02:24:09 PDT 2014

Hi Dave,

I share your uneasiness, but I really think anything but saying "char
arrays are sequences of bytes that, for presentation, are to be
interpreted as UTF-8" is much worse.

On Tue, Aug 26, 2014 at 12:55:27AM +0100, Dave Morris wrote:
> >I don't see why it can't mean "bytes that would be occupied by the
> >string if it were to be encoded as utf8".

> FIELD/@arraysize is in the header and applies to the whole table, not
> to a specific row.

Sure -- but then it's "number of chars (=bytes) in the array" --
doesn't seem unreasonable to me.

> It can't mean "bytes that would be occupied by the string ..."
> because we don't have _a_ single string, we have a different string
> in each row.

> If the data came from a database, then we might know the maximum
> number of characters for that column
> 
>     CREATE TABLE aTable (
>         aString CHAR(5)
>         );

That's part of the crux of the matter: We don't know what CHAR is *in
the database*.  Is it ASCII?  Some East Asian encoding?  Do you get
back (essentially) Unicode codepoints?  That's not even constant for
a given platform (Postgres, say), as that's configurable.
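
As a rough illustration of how much this matters, here is a minimal
Python 3 sketch.  The sample strings, the encodings picked, and the
helper name byte_lengths are assumptions made up for this example;
nothing here is taken from an actual database setup.

```python
def byte_lengths(value, encodings):
    """Return {encoding: number of bytes} for one string value."""
    return {enc: len(value.encode(enc)) for enc in encodings}


# Two made-up CHAR(5) values, written with escapes to avoid any
# dependence on the encoding of this source text.
western = "Gr\u00f6\u00dfe"                  # "Größe", 5 codepoints
japanese = "\u6771\u4eac\u30bf\u30ef\u30fc"  # "東京タワー", 5 codepoints

print(len(western), byte_lengths(western, ["latin-1", "utf-8", "utf-16-le"]))
print(len(japanese), byte_lengths(japanese, ["shift_jis", "euc_jp", "utf-8"]))
```

With these samples, the same five "characters" need anything from 5
to 15 bytes, depending on the encoding in play.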

> Based on that we could calculate the maximum number of bytes needed
> to encode a value for that column
> 
>     max byte count = max size of encoded character * number of
> characters
> 
>     4 * 5 = 20

Hm -- and what happens if RFC 3629's limitation to codepoints up to
U+10FFFF (i.e., sequences of at most 4 bytes) is dropped?
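
For reference, the arithmetic behind that bound can be sketched in
Python 3 as follows.  The helper names are hypothetical; the 6-byte
figure is the maximum sequence length of the older UTF-8 definition
(RFC 2279), which RFC 3629 superseded, and is used here only to show
how the bound would move.

```python
def utf8_len(codepoint):
    """Bytes needed to encode one codepoint as UTF-8 (RFC 3629)."""
    return len(chr(codepoint).encode("utf-8"))


def worst_case_bytes(n_chars, max_bytes_per_char=4):
    """Upper bound on the encoded size of n_chars codepoints."""
    return n_chars * max_bytes_per_char


# One sample codepoint from each UTF-8 length class.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF):
    print("U+%04X needs %d byte(s)" % (cp, utf8_len(cp)))

print(worst_case_bytes(5))      # 20, the bound for CHAR(5) under RFC 3629
print(worst_case_bytes(5, 6))   # 30, if 6-byte sequences were allowed
```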

> If we want to have fixed length columns we would have to pad each value
> to the same length.
> 
> For CHAR(5) encoded as "char/utf8" that would be

...and of course I *really* dislike the type name char/utf8, not
least because these names would turn up in the TABLEDATA serialisation,
too, and there that's probably an outright lie (although you *could*,
of course, encode the UTF-8 bytestream in the document encoding --
*shudder*).

>     The BINARY serialization of a fixed size "char/utf8"
>     field consists of an array of bytes with enough space
>     for the most complex encoded character sequence for
>     that field.

That would again require scanning all data that's going into the
VOTable before serialisation -- and we just massaged NULL value
generation in VOTable 1.3 to avoid having to do that.
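
To illustrate why a scan would be needed, here is a minimal Python 3
sketch of a writer for such a fixed-size field.  The function names
and the choice of NUL bytes for padding are assumptions of this
example, not something the proposal spells out.

```python
def fixed_arraysize(values):
    """Smallest byte width that holds every UTF-8 encoded value."""
    return max((len(v.encode("utf-8")) for v in values), default=0)


def pad_rows(values):
    """Encode all values and pad them to one common width.

    The width is only known after looking at every value, so the
    whole column has to be materialised before the first row can be
    written.
    """
    values = list(values)            # forces the full pass over the data
    size = fixed_arraysize(values)
    return size, [v.encode("utf-8").ljust(size, b"\x00") for v in values]


size, rows = pad_rows(["a", "\u00e9\u00e9", "abc"])
print(size, rows)
```

A streaming writer that emits rows as they come from the database
cannot do this without buffering.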

I think to arrive at a good solution we'll have to map out our
situation:

I believe much of our trouble is down to the question: What's a char?
I don't believe anyone disputes that a VOTable char is and will be a
sequence of octets (8-bit bytes).  The interpretation of bytes
32...127 (and perhaps 10, 13, and a few others) is given by the ASCII
encoding; all others currently have no defined semantics.

So, now people say: "We want to be able to represent *codepoints*".
That's an entirely different concept, and many people have struggled
with going from a bytes+ASCII (or latin1, say) interpretation to
codepoints before -- the evolution of Python 2 unicode strings vs.
byte strings is fairly instructive, in particular as regards the
confusion resulting from the move.

In VOTable, additional horror results from the fact that one of our
serialisations, namely TABLEDATA, has been dealing with codepoints
instead of chars all along, by virtue of using XML that already
presents streams of codepoints rather than streams of bytes to
clients in all APIs I'm aware of (which discounts PHP:-).  The other
serialisations don't do that.
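
This is easy to check with a stock XML parser.  The following Python 3
sketch uses the standard library's ElementTree; the one-cell document
and the helper name td_text are made up for this example and are not a
complete VOTable.

```python
import xml.etree.ElementTree as ET

TEXT = "Gr\u00f6\u00dfe"  # "Größe": 5 codepoints, 7 bytes in UTF-8


def td_text(document_encoding):
    """Write a one-cell document in the given encoding, parse it back."""
    doc = '<?xml version="1.0" encoding="%s"?><TD>%s</TD>' % (
        document_encoding, TEXT)
    return ET.fromstring(doc.encode(document_encoding)).text


# Whatever the document encoding, the client sees the same codepoints.
for enc in ("utf-8", "iso-8859-1"):
    parsed = td_text(enc)
    print(enc, len(parsed), parsed == TEXT)

# A byte-oriented serialisation of the same value as UTF-8 is longer.
print(len(TEXT.encode("utf-8")))
```

The parser hides the document encoding completely, so a count taken
from the TABLEDATA side is a codepoint count.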

So: Confusion will reign one way or another -- once you leave
ASCII, the count can't be right for both TABLEDATA and BINARY* (and
other non-XML serialisations) at the same time (unless we went for
the *shudder* option above).

And that's why I propose to declare that chars remain chars and don't
represent codepoints.  Hence, the arraysize is the number of chars
(=bytes).  This means that *if* you have codepoints in your database,
it's your job to make sure things fit (and you should consider
storing utf-8 there in the first place).  And yes, it also means that
arraysize may be too large in TABLEDATA ("TABLEDATA is
presentation"), or -- worse -- arraysize derived from TABLEDATA
codepoint sequences will be too small.
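
What "making sure things fit" could look like on the writer's side, as
a minimal Python 3 sketch.  The helper names and the truncation policy
are assumptions of this example; nothing in VOTable prescribes them.

```python
def fits(value, arraysize):
    """True if the UTF-8 encoding of value fits into arraysize chars."""
    return len(value.encode("utf-8")) <= arraysize


def truncate_utf8(value, arraysize):
    """Cut value to at most arraysize bytes without splitting a codepoint."""
    raw = value.encode("utf-8")[:arraysize]
    # An incomplete multi-byte sequence at the end is dropped.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


value = "Gr\u00f6\u00dfe"   # "Größe": 5 codepoints, 7 bytes in UTF-8

print(fits(value, len(value)))    # arraysize taken from the codepoint count
print(fits(value, 7))             # arraysize taken from the byte count
print(truncate_utf8(value, 5))
```

An arraysize of 5, derived from the codepoint count, is too small for
this value; cutting at 5 bytes leaves the 4 bytes of "Grö".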

But I'd still much prefer that to the current situation, where people
dump stuff into char[*] that can't legally be represented there.  I'd
also prefer it to a situation in which a new "char/utf-8" type (which
I'd much rather call "codepoint" if it came to it) would slowly be
phased in while people continued doing it wrong for a long, long time.
After all, few are using unicodeChar today.

If the "TABLEDATA is presentation" approach really seems unpalatable
to everyone, instead of a codepoint type I'd still prefer finally
giving VOTable an actual "string" atomic type -- modelling strings
as arrays is a pain anyway.  That, then, could have serialisation
rules of its own, with extra conventions for TABLEDATA such that we
can finally have variable-length arrays of strings (without having to
pad).

But again, my vote is still strongly for defining semantics for
non-ASCII chars (as UTF-8).

Cheers,

        Markus