Unicode in VOTable
Russ Allbery
eagle at eyrie.org
Wed Jun 11 18:35:47 CEST 2025
Mark Taylor via apps <apps at ivoa.net> writes:
> 1. Redefine datatype="char" to mean UTF-8 in the BINARY/BINARY2 encoding,
> and document-encoded unicode in TABLEDATA.
> The arraysize attribute and the BINARY/BINARY2 byte count are both
> equal to the number of bytes in the UTF-8 encoded value (not the
> number of characters/codepoints in the string).
> This won't break anything which is already correct, since you're only
> supposed to put 7-bit ASCII (whose UTF-8 representation is identical)
> into char fields.
> The downside is that a FIELD with datatype="char" arraysize="8"
> can't store an 8-character string if those characters are emojis.
> Personally, I think that's OK, if you want to declare fixed-length
> char fields, you will now have to think in UTF-8 terms not
> code-point terms.
I was trying to decide if this would cause a problem for TAP table upload
given that database schemas generally specify limits in terms of
characters, not bytes, for CHAR and VARCHAR data types, but I think I'm
convinced that this isn't a concern in that direction. If one takes the
approach of blindly translating the arraysize parameter of the type to the
length of the field, the result would be a database column that is "too
large" for the VOTable data type for non-ASCII strings, but I don't think
that causes problems as long as the TAP service knows the original type
and can reflect it on query results.
This does feel like it's going to increase the existing schism between
underlying database types and VOTable types because there will be no clear
translation of arraysize for char fields between VOTable semantics and
common database semantics. I'm not sure there's any way to avoid that, but
it feels awkward.
For example, suppose that one has a column in the database that is defined
as CHAR(8) with a Unicode character set. What should the corresponding
arraysize in the TAP_SCHEMA entry be for this column? 8 seems obviously
wrong and will truncate valid data. 32 (four UTF-8 bytes for each of eight
code points) is safe but seems weird.
("Don't use fixed-width char fields for anything other than
single-character ASCII flags; this is a false optimization for modern
databases" is probably the correct answer in most cases, but we all know
database schemas are hard to change.)
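To make the arithmetic concrete, here is a minimal sketch of the worst-case calculation. It assumes UTF-8 as defined by RFC 3629 (at most four bytes per code point) and a database that counts CHAR(n) in code points; the helper names are hypothetical and not part of any TAP or VOTable library.

```python
# Assumption: RFC 3629 UTF-8, where one code point encodes to at most 4 bytes.
MAX_UTF8_BYTES_PER_CODE_POINT = 4


def worst_case_arraysize(char_limit: int) -> int:
    """Hypothetical helper: smallest byte-based arraysize that can hold
    every possible value of a CHAR(char_limit) column."""
    return char_limit * MAX_UTF8_BYTES_PER_CODE_POINT


def utf8_length(value: str) -> int:
    """Number of bytes the value occupies once encoded as UTF-8."""
    return len(value.encode("utf-8"))


ascii_value = "abcdefgh"          # 8 code points, 1 byte each
emoji_value = "\U0001F600" * 8    # 8 code points, 4 bytes each
print(utf8_length(ascii_value), utf8_length(emoji_value), worst_case_arraysize(8))
```

The same eight-character column needs anywhere from 8 to 32 bytes depending on content, which is the mismatch described above.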
Also, a probably obvious point, but worth stating explicitly: suppose that
the VOTable schema for a column is datatype="char" arraysize="8" but the
database column value is four Unicode characters whose UTF-8 representation
totals 12 bytes. I think the TAP server needs to truncate after the last
character that fits within the size when converted to UTF-8, and then pad.
It definitely should not take the naive approach of converting to UTF-8
and then truncating at 8 bytes, since that can split a multi-byte sequence
and produce corrupt UTF-8 that should be rejected by any UTF-8 decoder.
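A rough sketch of that truncate-then-pad behavior follows. The function name is hypothetical, padding with NUL bytes is an assumption about how the service fills fixed-width fields, and the cut is made at code point boundaries only (it does not try to keep combining sequences or multi-code-point emoji together).

```python
def fit_utf8(value: str, arraysize: int, pad: bytes = b"\x00") -> bytes:
    """Hypothetical helper: encode value as UTF-8, truncate at a code point
    boundary so it fits in arraysize bytes, then pad to exactly arraysize."""
    encoded = value.encode("utf-8")
    if len(encoded) > arraysize:
        # A raw byte cut may end in the middle of a multi-byte sequence.
        # Decoding with errors="ignore" drops that incomplete tail, leaving
        # only whole code points, which are then re-encoded.
        whole = encoded[:arraysize].decode("utf-8", errors="ignore")
        encoded = whole.encode("utf-8")
    # Assumption: short values are padded with NUL bytes.
    return encoded.ljust(arraysize, pad)


# Four 3-byte characters (12 bytes) into an 8-byte field: two fit, two pad bytes.
field = fit_utf8("\u20ac" * 4, 8)
print(field)
```

The naive alternative, `value.encode("utf-8")[:8]`, would leave two bytes of the third character at the end of the field, and a strict decoder would reject it.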
> 2. Deprecate datatype="unicodeChar". Anybody who wants to write
> non-ASCII text should use UTF-8 in datatype="char" instead.
While in general I am in favor of using Unicode everywhere, do we lose
anything by no longer having a way of marking fields as containing simple
one-byte-per-character results that don't require any special processing?
I suppose the alternative is to introduce yet another datatype, though,
which seems even more unappealing.
> 3. Just to remove mention of the obsolete UCS-2 from the standard,
> change the text to say that BINARY/BINARY2 unicodeChar is to be
> interpreted as UTF-16, but that behaviour is undefined where it
> contains characters outside of the UCS-2 subset of UTF-16.
> Then the BINARY/BINARY2 byte count for unicodeChar arrays is
> 2*arraysize.
> That's somewhat nasty, but I claim OK since (a) unicodeChar
> only used to be allowed for UCS-2 so it won't break any existing
> code/data[*], and (b) unicodeChar will now be deprecated so nobody
> should write new code/data that encounters this.
So this basically tells implementors that, to fully support Unicode, they
should ignore both UTF-16 and unicodeChar and just use char with UTF-8,
which can handle the entire character set. This part seems reasonable to me.
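For illustration, a writer could check whether a value stays inside the UCS-2 subset before choosing unicodeChar, and fall back to char with UTF-8 otherwise. This is only a sketch: the function names are hypothetical, and the big-endian byte order is an assumption about the BINARY serialization.

```python
def is_ucs2_safe(value: str) -> bool:
    """True if every code point is in the Basic Multilingual Plane, so each
    one is a single 16-bit unit in UTF-16 (no surrogate pairs needed)."""
    return all(ord(ch) <= 0xFFFF for ch in value)


def utf16_byte_count(value: str) -> int:
    """Bytes needed for the value as UTF-16 without a byte order mark.
    Assumption: big-endian serialization."""
    return len(value.encode("utf-16-be"))


for text in ("abc", "\u20ac", "\U0001F600"):
    # For UCS-2-safe text the byte count is exactly 2 * number of characters;
    # outside that subset it is larger, and behaviour would be undefined.
    print(repr(text), is_ucs2_safe(text), utf16_byte_count(text))
```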
--
Russ Allbery (eagle at eyrie.org) <https://www.eyrie.org/~eagle/>