Unicode in VOTable

Thu Jun 12 14:30:24 CEST 2025

On Wed, 11 Jun 2025, Russ Allbery wrote:

> For example, suppose that one has a column in the database that is defined
> as CHAR(8) with a Unicode character set. What should the corresponding
> arraysize in the TAP_SCHEMA entry be for this column? 8 seems obviously
> wrong and will truncate valid data. 48 is safe but seems weird.

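32, no?  UTF-8 needs at most four bytes per character, so eight
characters fit in 8 x 4 = 32 bytes.  The Wikipedia UTF-8 page describes
it as "a variable-width encoding of one to four one-byte (8-bit) code
units".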
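
To make the arithmetic concrete, here is a rough Python sketch. It
assumes arraysize would count UTF-8 bytes rather than characters, which
is the reading under discussion here and not something any spec text
quoted in this thread settles. The helper name and sample characters are
made up for illustration.

```python
# Sketch: worst-case UTF-8 byte count for a fixed-width character column.
# Assumption: arraysize counts bytes, not characters.

MAX_UTF8_BYTES_PER_CHAR = 4  # UTF-8 uses one to four bytes per code point


def max_utf8_bytes(n_chars: int) -> int:
    """Upper bound on the UTF-8 encoded size of n_chars characters."""
    return n_chars * MAX_UTF8_BYTES_PER_CHAR


# One sample character for each possible encoded width:
# U+0041 (A), U+00E9 (e acute), U+20AC (euro sign), U+1D6FC (math alpha)
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001d6fc"]
widths = [len(s.encode("utf-8")) for s in samples]

# The highest Unicode code point still fits in four bytes.
top_width = len(chr(0x10FFFF).encode("utf-8"))

char8_arraysize = max_utf8_bytes(8)
```

With these definitions, widths covers every length from one to four
bytes and char8_arraysize comes out as 32 for the CHAR(8) case.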

> While in general I am in favor of using Unicode everywhere, do we lose
> anything by no longer having a way of marking fields as containing simple
> one-byte-per-character results that don't require any special processing?

It's a fair question, but IMO we don't lose enough to make it a worry.
In most string-processing contexts these days the default encoding
is UTF-8 anyway, and it's the one-byte-per-character strings that require
special measures (e.g. in Java, a sloppy VOTable parser will probably
decode char arrays as UTF-8 strings already unless you try hard to
stop it doing that).
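
The same effect is easy to see in Python (used here only as an
illustration; the remark above is about Java, and the byte strings below
are invented sample data, not taken from any real VOTable).  Decoding
without naming an encoding gives UTF-8, pure-ASCII content comes out the
same either way, and it is the one-byte-per-character interpretation
that has to be requested explicitly:

```python
# Sketch: default decoding is UTF-8; single-byte decoding must be asked for.

ascii_bytes = b"NGC 1234"                # pure ASCII sample cell
utf8_bytes = "\u00c5".encode("utf-8")    # U+00C5 (A with ring) -> b'\xc3\x85'

# bytes.decode() with no argument uses UTF-8.
default_decoded = utf8_bytes.decode()

# ASCII-only data is unchanged whether treated as ASCII or as UTF-8.
ascii_same = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Forcing a one-byte-per-character reading (Latin-1 here) yields two
# characters instead of one, i.e. mojibake for UTF-8 input.
latin1_decoded = utf8_bytes.decode("latin-1")
```

So a reader that makes no special effort already handles UTF-8 content,
and existing ASCII-only data is unaffected.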

Also, if people want to use single bytes, there's still the
unsignedByte datatype.

Mark

--
Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk          https://www.star.bristol.ac.uk/mbt/