Unicode in VOTable
Markus Demleitner
msdemlei at ari.uni-heidelberg.de
Thu Jun 12 11:52:18 CEST 2025
Dear Colleagues,
On Wed, Jun 11, 2025 at 09:35:47AM -0700, Russ Allbery wrote:
> Mark Taylor via apps <apps at ivoa.net> writes:
> > The downside is that a FIELD with datatype="char" arraysize="8"
> > can't store an 8-character string if those characters are emojis.
> > Personally, I think that's OK, if you want to declare fixed-length
> > char fields, you will now have to think in UTF-8 terms not
> > code-point terms.
>
> This does feel like it's going to increase the existing schism between
> underlying database types and VOTable types because there will be no clear
> translation of arraysize for char fields between VOTable semantics and
> common database semantics. I'm not sure there's any way to avoid that, but
> it feels awkward.
I think there is a fundamental mismatch between the expectation of
constant-length records (which admittedly underlies most of BINARY2
and in particular the fixed-length char arrays) and variable-length
encodings, be they UTF-8 or UTF-16. So, I'm fairly sure there is no
way to avoid deepening the schism.
My conclusion from these considerations is that the best we can do is
to give some guidance to avoid non-ASCII if you want reliable
lengths. Yes, that's very awkward, but after thinking about the
alternatives for many years I'm convinced it's the least awkward.
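To make the mismatch concrete, here is a minimal Python sketch (the
string is just an illustrative example): code-point length and UTF-8
byte length diverge as soon as non-ASCII characters appear, which is
exactly why a fixed arraysize cannot reliably mean both.

```python
# Code-point count vs. UTF-8 byte count diverge for non-ASCII text.
s = "naïve"
print(len(s))                  # 5 code points
print(len(s.encode("utf-8")))  # 6 bytes: "ï" encodes to two bytes
```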
> For example, suppose that one has a column in the database that is defined
> as CHAR(8) with a Unicode character set. What should the corresponding
> arraysize in the TAP_SCHEMA entry be for this column? 8 seems obviously
> wrong and will truncate valid data. 48 is safe but seems weird.
[...]
> ("Don't use fixed-width char fields for anything other than
> single-character ASCII flags; this is a false optimization for modern
> databases" is probably the correct answer in most cases, but we all know
I'm wondering what the right place for that sentence is. But people
need to read it.
> It definitely should not take the naive approach of converting to UTF-8
> and then truncating at 8 bytes, since that will result in corrupt UTF-8
> that should be rejected by any UTF-8 decoder.
Given the way UTF-8 works, I think the simplest way to say that in
normative language is: the last element in a char array must not have
its highest bit set. Bonus points if we find language that also
covers the case of a single char, which also must not have its
highest bit set.
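A minimal Python sketch of the two ideas in play (function names are
illustrative, not proposed spec language): a truncation that never
cuts through a multi-byte UTF-8 sequence, and a check for the rule
proposed above, which is stricter still because it requires the final
byte to be plain ASCII.

```python
def truncate_utf8(data: bytes, max_len: int) -> bytes:
    """Truncate UTF-8 bytes to at most max_len bytes without
    splitting a multi-byte sequence (illustrative sketch only)."""
    if len(data) <= max_len:
        return data
    cut = max_len
    # 0b10xxxxxx marks a UTF-8 continuation byte; back up until
    # the cut point sits at the start of a character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

def satisfies_proposed_rule(arr: bytes) -> bool:
    # The rule sketched above: the last element of a char array
    # must not have its highest bit set, i.e. it must be ASCII.
    return len(arr) == 0 or arr[-1] < 0x80
```

Note that `truncate_utf8` only guarantees decodable output, while the
proposed rule additionally forbids ending on any multi-byte character
at all, which is what makes it easy to state normatively.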
> > 2. Deprecate datatype="unicodeChar". Anybody who wants to write
> > non-ASCII text should use UTF-8 in datatype="char" instead.
>
> While in general I am in favor of using Unicode everywhere, do we lose
> anything by no longer having a way of marking fields as containing simple
> one-byte-per-character results that don't require any special processing?
Yes, but I think we win more by dumping unicodeChar in the long run
and allowing UTF-8 in char.
If there is a clear use case for whatever we're losing, we could
consider an xtype "ascii" that might fill in for at least several of
these cases.
Thanks,
Markus