Unicode in VOTable

Mark Taylor m.b.taylor at bristol.ac.uk
Wed Jun 11 15:09:37 CEST 2025


Markus et al.,

There are several possible places for this discussion (github issues/PRs
68, 69, 55) but for widest visibility and since the answers relate 
to each other I'll post initially here as a followup to Markus's message.

For the record, this was discussed also in Banff 2014
(https://wiki.ivoa.net/internal/IVOA/InterOpOct2014Applications/vot-unicode.pdf)
but discussion on the Apps list on that occasion fizzled out following
arguments about how to record the string length/arraysize.

I agree on most points with Markus.  IMO we should simultaneously 
(VOTable 1.6) do the following:

 1. Redefine datatype="char" to mean UTF-8 in the BINARY/BINARY2 encoding,
    and document-encoded unicode in TABLEDATA.
    The arraysize attribute and the BINARY/BINARY2 byte count are both 
    equal to the number of bytes in the UTF-8 encoded value (not the
    number of characters/codepoints in the string).
    This won't break anything which is already correct, since you're only
    supposed to put 7-bit ASCII (whose UTF-8 representation is identical)
    into char fields.
    The downside is that a FIELD with datatype="char" arraysize="8"
    can't store an 8-character string if those characters are emojis.
    Personally, I think that's OK: if you want to declare fixed-length
    char fields, you will now have to think in terms of UTF-8 bytes,
    not code points.

 2. Deprecate datatype="unicodeChar".  Anybody who wants to write
    non-ASCII text should use UTF-8 in datatype="char" instead.

 3. Just to remove mention of the obsolete UCS-2 from the standard, 
    change the text to say that BINARY/BINARY2 unicodeChar is to be 
    interpreted as UTF-16, but that behaviour is undefined where it
    contains characters outside of the UCS-2 subset of UTF-16.
    Then the BINARY/BINARY2 byte count for unicodeChar arrays is
    2*arraysize.
    That's somewhat nasty, but I claim OK since (a) unicodeChar 
    only used to be allowed for UCS-2 so it won't break any existing 
    code/data[*], and (b) unicodeChar will now be deprecated so nobody
    should write new code/data that encounters this.
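
To make the byte counting in points 1 and 3 concrete, here is a minimal
Python sketch. It is not taken from any VOTable library; the helper names
(utf8_byte_count, fits_char_field, is_ucs2_only, utf16_byte_count) are
hypothetical and only illustrate the proposed arraysize semantics.

```python
def utf8_byte_count(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text (proposed meaning of
    arraysize for datatype="char")."""
    return len(text.encode("utf-8"))


def fits_char_field(text: str, arraysize: int) -> bool:
    """True if text fits a fixed-length char field under the proposal,
    i.e. its UTF-8 encoding is at most arraysize bytes long."""
    return utf8_byte_count(text) <= arraysize


def is_ucs2_only(text: str) -> bool:
    """True if every character lies in the Basic Multilingual Plane, so
    that its UTF-16 encoding needs no surrogate pairs."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def utf16_byte_count(text: str) -> int:
    """Number of bytes in the UTF-16 encoding of text (big-endian, no BOM)."""
    return len(text.encode("utf-16-be"))


if __name__ == "__main__":
    ascii_name = "NGC 1234"   # 8 characters, pure ASCII
    accented = "Ångström"     # 8 characters, two of them non-ASCII
    emojis = "\U0001F600" * 8  # 8 characters outside the BMP

    for value in (ascii_name, accented, emojis):
        print(
            repr(value),
            "chars:", len(value),
            "utf8 bytes:", utf8_byte_count(value),
            "fits char[8]:", fits_char_field(value, 8),
            "ucs2 only:", is_ucs2_only(value),
            "utf16 bytes:", utf16_byte_count(value),
        )
```

Only the ASCII value fits arraysize="8" here: the accented string needs
10 UTF-8 bytes and the emoji string needs 32. For unicodeChar, the 2*arraysize
byte count holds for the first two strings (16 bytes for 8 characters) but
not for the emoji string, which lies outside the UCS-2 subset.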

Mark

On Wed, 11 Jun 2025, Markus Demleitner via apps wrote:

> Dear Apps,
> 
> I've recently given a talk on the state of unicode in VOTable:
> <https://wiki.ivoa.net/internal/IVOA/InterOpJune2025Apps/unicode-notes.pdf>.
> 
> In consequence, there are now two bugs against VOTable.  The first
> one, <https://github.com/ivoa-std/VOTable/issues/69>, I thought was
> simple ("just write UTF-16 wherever we had UCS-2 before"), and I've
> even created a PR for it:
> <https://github.com/ivoa-std/VOTable/pull/68>.
> 
> As usual, it's not that simple.  The problem is that in BINARY2, we
> need to know how many bytes there are to a unicodeChar[n].  With
> UCS-2, it was always 2n.  With UTF-16, it's no longer as simple.
> *If* we say n is the number of codepoints ("characters"), then it
> could be any even number between 2n and 4n inclusive.
> 
> See the PR for ways out.  Opinions are solicited.
> 
> Me, I'd prefer to say "n is half the number of bytes of the
> UTF-16 representation of a string"; that, in particular because I'd
> like to do a similar thing with char.
> 
> However, if nobody speaks up I think we need to do the least invasive
> thing and just outlaw surrogate pairs (in effect: restrict ourselves
> to the UCS-2 subset of UTF-16).
> 
> And then there is <https://github.com/ivoa-std/VOTable/issues/55>.
> For this, I'm suggesting allowing UTF-8 in chars.  That sounds
> straightforward, but there's again the problem that, if it is to work at
> all, the n in char[n] needs to mean "the number of bytes in the UTF-8
> representation of a string" rather than "the number of
> 'characters' [in any sensible interpretation of that word]" that I
> will readily admit will feel a lot more logical to just about
> everyone.  Also, perhaps worse: a single char can only be ASCII.
> See the bug report for details.  Again, opinions are most welcome.
> 
> And, well, we could of course also move away from all that and
> finally define a proper string/text type for VOTable.
> 
> Thoughts?  Opinions?  I'm truly grateful for any sort of feedback.
> 
>        -- Markus
> 
> 

--
Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk          https://www.star.bristol.ac.uk/mbt/

