Unicode in VOTable
Mark Taylor
m.b.taylor at bristol.ac.uk
Wed Jun 11 15:09:37 CEST 2025
Markus et al.,
There are several possible places for this discussion (github issues/PRs
68, 69, 55) but for widest visibility and since the answers relate
to each other I'll post initially here as a followup to Markus's message.
For the record, this was also discussed in Banff in 2014
(https://wiki.ivoa.net/internal/IVOA/InterOpOct2014Applications/vot-unicode.pdf),
but discussion on the Apps list on that occasion fizzled out following
arguments about how to record the string length/arraysize.
I agree with Markus on most points. IMO we should simultaneously
(VOTable 1.6) do the following:
1. Redefine datatype="char" to mean UTF-8 in the BINARY/BINARY2 encoding,
and document-encoded unicode in TABLEDATA.
The arraysize attribute and the BINARY/BINARY2 byte count are both
equal to the number of bytes in the UTF-8 encoded value (not the
number of characters/codepoints in the string).
This won't break anything which is already correct, since you're only
supposed to put 7-bit ASCII (whose UTF-8 representation is identical)
into char fields.
The downside is that a FIELD with datatype="char" arraysize="8"
can't store an 8-character string if those characters are emojis.
Personally, I think that's OK: if you want to declare fixed-length
char fields, you will now have to think in terms of UTF-8 bytes
rather than code points.
2. Deprecate datatype="unicodeChar". Anybody who wants to write
non-ASCII text should use UTF-8 in datatype="char" instead.
3. Just to remove mention of the obsolete UCS-2 from the standard,
change the text to say that BINARY/BINARY2 unicodeChar is to be
interpreted as UTF-16, but that behaviour is undefined where it
contains characters outside of the UCS-2 subset of UTF-16.
Then the BINARY/BINARY2 byte count for unicodeChar arrays is
2*arraysize.
That's somewhat nasty, but I claim OK since (a) unicodeChar
only used to be allowed for UCS-2 so it won't break any existing
code/data[*], and (b) unicodeChar will now be deprecated so nobody
should write new code/data that encounters this.
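
To make the length arithmetic in points 1 and 3 concrete, here is a
small Python sketch. It is only an illustration of the proposal, not
code from any VOTable library; the function names (utf8_len,
utf16_units, fits_ucs2) are made up for this example.

```python
def utf8_len(s: str) -> int:
    """Bytes in the UTF-8 encoding: the proposed meaning of arraysize for char."""
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    """16-bit code units in the UTF-16 encoding (no BOM); the byte count is twice this."""
    return len(s.encode("utf-16-be")) // 2


def fits_ucs2(s: str) -> bool:
    """True if every code point lies in the BMP, i.e. needs no surrogate pair."""
    return all(ord(c) <= 0xFFFF for c in s)


samples = [
    "Bristol",                 # pure ASCII
    "\u00c5ngstr\u00f6m",      # two non-ASCII Latin-1 letters
    "\U0001F600" * 8,          # eight emoji, all outside the BMP
]

for text in samples:
    print(
        f"codepoints={len(text):2d}  "
        f"utf8_bytes={utf8_len(text):2d}  "
        f"utf16_units={utf16_units(text):2d}  "
        f"ucs2_only={fits_ucs2(text)}"
    )
```

For ASCII all three counts agree, which is why redefining char as UTF-8
leaves existing correct data alone. The eight-emoji string needs 32
bytes in UTF-8, so it does not fit a char field with arraysize="8",
and it needs 16 UTF-16 code units, so it falls into the "undefined"
case for unicodeChar under point 3.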
Mark
On Wed, 11 Jun 2025, Markus Demleitner via apps wrote:
> Dear Apps,
>
> I've recently given a talk on the state of unicode in VOTable:
> <https://wiki.ivoa.net/internal/IVOA/InterOpJune2025Apps/unicode-notes.pdf>.
>
> In consequence, there are now two bugs against VOTable. The first
> one, <https://github.com/ivoa-std/VOTable/issues/69>, I thought was
> simple ("just write UTF-16 wherever we had UCS-2 before"), and I've
> even created a PR for it:
> <https://github.com/ivoa-std/VOTable/pull/68>.
>
> As usual, it's not that simple. The problem is that in BINARY2, we
> need to know how many bytes there are to a unicodeChar[n]. With
> UCS-2, it was always 2n. With UTF-16, it's no longer as simple.
> *If* we say n is the number of codepoints ("characters"), then it
> could be any even number between 2n and 4n inclusive.
>
> See the PR for ways out. Opinions are solicited.
>
> Me, I'd prefer to say "n is half the number of bytes of the
> UTF-16 representation of a string"; that, in particular because I'd
> like to do a similar thing with char.
>
> However, if nobody speaks up I think we need to do the least invasive
> thing and just outlaw surrogate pairs (in effect: restrict ourselves
> to the UCS-2 subset of UTF-16).
>
> And then there is <https://github.com/ivoa-std/VOTable/issues/55>.
> For this, I'm suggesting allowing UTF-8 in chars. That sounds
> straightforward, but there's again the problem that, if it is to work at
> all, the n in char[n] needs to mean "the number of bytes in the UTF-8
> representation of a string" rather than the "the number of
> 'characters' [in any sensible interpretation of that word]" that I
> will readily admit will feel a lot more logical to just about
> everyone. Also, perhaps worse: a single char can only be ASCII.
> See the bug report for details. Again, opinions are most welcome.
>
> And, well, we could of course also move away from all that and
> finally define a proper string/text type for VOTable.
>
> Thoughts? Opinions? I'm truly grateful for any sort of feedback.
>
> -- Markus
>
>
--
Mark Taylor Astronomical Programmer Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk https://www.star.bristol.ac.uk/mbt/