Unicode in VOTable

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Wed Jun 11 14:13:25 CEST 2025


Dear Apps,

I've recently given a talk on the state of Unicode in VOTable:
<https://wiki.ivoa.net/internal/IVOA/InterOpJune2025Apps/unicode-notes.pdf>.

In consequence, there are now two bugs against VOTable.  The first
one, <https://github.com/ivoa-std/VOTable/issues/69>, I thought was
simple ("just write UTF-16 wherever we had UCS-2 before"), and I've
even created a PR for it:
<https://github.com/ivoa-std/VOTable/pull/68>.

As usual, it's not that simple.  The problem is that in BINARY2, we
need to know how many bytes there are to a unicodeChar[n].  With
UCS-2, it was always 2n.  With UTF-16, it's no longer as simple.
*If* we say n is the number of codepoints ("characters"), then it
could be any even number between 2n and 4n inclusive.
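
To make the arithmetic concrete, here is a minimal Python sketch (not
taken from the PR; the helper name utf16_byte_length is made up for
illustration).  It relies on Python's len() counting codepoints and
shows that strings of n = 3 codepoints can take 6, 8, or 12 bytes:

```python
def utf16_byte_length(text: str) -> int:
    """Return the number of bytes in the UTF-16 encoding of text (no BOM)."""
    # "utf-16-be" encodes without a byte order mark, so the result
    # is exactly 2 bytes per BMP codepoint and 4 per surrogate pair.
    return len(text.encode("utf-16-be"))


samples = [
    "abc",                              # ASCII only
    "a\u00e9\u4e2d",                    # BMP codepoints only
    "a\U0001F600c",                     # one codepoint outside the BMP
    "\U0001F600\U0001F601\U0001F602",   # all outside the BMP
]

for s in samples:
    n = len(s)  # Python counts codepoints, not UTF-16 code units
    print(n, utf16_byte_length(s))
# prints:
# 3 6
# 3 6
# 3 8
# 3 12
```

So a BINARY2 reader that only knows n = 3 cannot tell how many bytes
to consume without inspecting the data itself.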

See the PR for ways out.  Opinions are solicited.

Me, I'd prefer to say "n is half the number of bytes of the
UTF-16 representation of a string"; that, in particular because I'd
like to do a similar thing with char.

However, if nobody speaks up I think we need to do the least invasive
thing and just outlaw surrogate pairs (in effect: restrict ourselves
to the UCS-2 subset of UTF-16).
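
As a rough sketch of what "outlawing surrogate pairs" could mean for a
writer, the following Python check (the function name is hypothetical
and nothing like it is specified anywhere) accepts a string only if
every codepoint is in the Basic Multilingual Plane and is not itself a
surrogate, so that each codepoint occupies exactly two bytes:

```python
def is_ucs2_representable(text: str) -> bool:
    """True if every codepoint fits into one 16-bit UTF-16 code unit."""
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Outside the BMP: would need a surrogate pair in UTF-16.
            return False
        if 0xD800 <= cp <= 0xDFFF:
            # A lone surrogate is not a valid character either.
            return False
    return True


def encoded_length_if_allowed(text: str) -> int:
    """Byte length under the restriction; raises for forbidden strings."""
    if not is_ucs2_representable(text):
        raise ValueError("string needs surrogate pairs")
    return 2 * len(text)  # the old "always 2n" rule holds again
```

With such a restriction, the 2n rule for unicodeChar[n] stays intact,
at the price of rejecting everything outside the BMP (emoji, for
instance).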

And then there is <https://github.com/ivoa-std/VOTable/issues/55>.
For this, I'm suggesting allowing UTF-8 in chars.  That sounds
straightforward, but there's again the problem that, if it is to work at
all, the n in char[n] needs to mean "the number of bytes in the UTF-8
representation of a string" rather than "the number of
'characters' [in any sensible interpretation of that word]" that I
will readily admit will feel a lot more logical to just about
everyone.  Also, perhaps worse: a single char can only be ASCII.
See the bug report for details.  Again, opinions are most welcome.
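
For illustration only, a small Python sketch of the byte-counting
reading of char[n] (the helper names are invented here and are not
from the bug report).  It shows why n as a byte count differs from
the number of characters, and why a single char could only hold
ASCII:

```python
def utf8_byte_length(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


def fits_char_array(text: str, n: int) -> bool:
    """True if text fits into char[n] when n counts UTF-8 bytes."""
    return utf8_byte_length(text) <= n


def fits_single_char(text: str) -> bool:
    """True if text is one character that fits into a lone char (1 byte)."""
    return len(text) == 1 and utf8_byte_length(text) == 1


print(utf8_byte_length("Heidelberg"))   # 10: ASCII, one byte per character
print(utf8_byte_length("M\u00fcnchen")) # 8: seven characters, u-umlaut takes two bytes
print(fits_single_char("a"))            # True
print(fits_single_char("\u00e9"))       # False: e-acute needs two bytes in UTF-8
```

Under this reading, a seven-character name with one umlaut needs
char[8], not char[7].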

And, well, we could of course also move away from all that and
finally define a proper string/text type for VOTable.

Thoughts?  Opinions?  I'm truly grateful for any sort of feedback.

       -- Markus


