Unicode in VOTable
Mark Taylor
m.b.taylor at bristol.ac.uk
Thu Jun 12 14:30:24 CEST 2025
On Wed, 11 Jun 2025, Russ Allbery wrote:
> For example, suppose that one has a column in the database that is defined
> as CHAR(8) with a Unicode character set. What should the corresponding
> arraysize in the TAP_SCHEMA entry be for this column? 8 seems obviously
> wrong and will truncate valid data. 48 is safe but seems weird.
32, no? That's 8 characters at up to 4 bytes each: the Wikipedia UTF-8
page says "a variable-width encoding of one to four one-byte (8-bit)
code units".
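
The arithmetic can be checked with a short Python sketch. The helper name max_utf8_arraysize is hypothetical and just multiplies the character count by the 4-byte UTF-8 maximum; the sample characters are chosen to need 1, 2, 3 and 4 bytes respectively.

```python
# UTF-8 encodes each Unicode code point in 1 to 4 bytes.
MAX_UTF8_BYTES_PER_CHAR = 4


def max_utf8_arraysize(n_chars: int) -> int:
    """Worst-case byte length of n_chars characters in UTF-8 (hypothetical helper)."""
    return n_chars * MAX_UTF8_BYTES_PER_CHAR


# One sample character per encoded length:
# U+0041 'A', U+00E9 e-acute, U+20AC euro sign, U+1F600 emoji.
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]
byte_lengths = [len(s.encode("utf-8")) for s in samples]

print(byte_lengths)
print(max_utf8_arraysize(8))
```

For a CHAR(8) column this gives a worst case of 32 bytes, assuming "character" means a single code point.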
> While in general I am in favor of using Unicode everywhere, do we lose
> anything by no longer having a way of marking fields as containing simple
> one-byte-per-character results that don't require any special processing?
It's a fair question, but IMO we don't lose enough to make it a worry.
In most string-processing contexts these days the default processing
is UTF-8 anyway and it's the one-byte-per-character strings that require
special measures (e.g. in Java, if you write a sloppy VOTable parser
it will probably decode char arrays as UTF-8 strings already unless
you try hard to stop it doing that).
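
To illustrate the difference between the two decodings, here is a minimal sketch in Python (not Java, and not taken from any actual VOTable parser). It decodes the same bytes once as UTF-8 and once as a one-byte-per-character encoding; Latin-1 is used as the one-byte example purely as an assumption for illustration.

```python
# The bytes for "cafe" with an e-acute, encoded as UTF-8:
# the final character occupies two bytes (0xC3 0xA9).
raw = b"caf\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # the two-byte sequence becomes one character
as_latin1 = raw.decode("latin-1")  # every byte becomes its own character

print(len(raw), len(as_utf8), len(as_latin1))

# Pure ASCII content decodes identically either way.
ascii_raw = b"NGC 1234"
ascii_same = ascii_raw.decode("utf-8") == ascii_raw.decode("latin-1")
print(ascii_same)
```

The decoders only disagree when bytes above 0x7F are present, so ASCII-only data is unaffected by treating char arrays as UTF-8.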
Also, if people want to use single bytes, there's still the
unsignedByte datatype.
Mark
--
Mark Taylor Astronomical Programmer Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk https://www.star.bristol.ac.uk/mbt/