Moving forward with modern Unicode / UTF-8

Tue Jul 15 08:38:03 CEST 2025

I was sadly unable to attend the June meeting.

I've been reading Markus' slides/notes from the meeting about support of Unicode.  Unfortunately I haven't been able to find Etherpad-like notes to go along with it, so I don't know what was said in the room at the time.  Have I been looking in the wrong place?

However, I'm going to guess that we all agree that the present situation is unsustainable and that we have to have a serious solution for transporting Unicode strings in the future, which at this point can't mean anything other than UTF-8.

Was there anything like a consensus in the room to move forward with something concrete, though?

Without having seen that, my fractional-currency contribution is:
* I agree with Markus' suggestion that the longer-term solution may be to add to VOTable a rigorously correct way of marking a string-valued column as containing Unicode data, with a UTF-8 representation both in TABLEDATA and in BINARY2 (where fixed-length-in-octets would not be allowed).  It seems likely that this needs to be a new primitive type, so that `arraysize` has a new and rigorous definition for such strings.  However, I would like to leave room for client implementers to discuss this and suggest an alternative VOTable solution (e.g., it's at least conceivable that something COOSYS/TIMESYS-like could be used).
* I'm inclined to agree with what I think was Markus' position, which is that it's not realistic to retcon `unicodeChar` to do this job, because it's so strongly tied to two-octet characters.
* I do expect that there will be concerns expressed about backward compatibility if we do add something to VOTable.
* In the meantime, I would suggest that we write a "best practices" [Endorsed?] Note for how best to work with UTF-8 represented as `char`.  E.g., "do not use fixed-length strings, as their meaning will be ambiguous"; "avoid using `char` without an `arraysize` specifier at all, since one-octet UTF-8 strings are not a safe concept".
* The development of the Note should also include looking closely at PyVO and astroquery, and at the community's TAP libraries in particular, to see whether there's some polishing to be done to make UTF-8-in-`char` a safer (if not yet entirely rigorous) thing to use.
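
To make the fixed-length and two-octet concerns above concrete, here is a minimal Python sketch.  It is illustrative only: the sample string, the field width of 8, and the helper `read_fixed_width` are assumptions made for the example, not anything taken from VOTable, PyVO, or astroquery.

```python
# Sketch only: sample string and field width are illustrative assumptions.
text = "\u00c5ngstr\u00f6m"           # "Ångström": 8 characters
encoded = text.encode("utf-8")

n_chars = len(text)                   # number of Unicode code points
n_octets = len(encoded)               # number of UTF-8 octets


def read_fixed_width(data: bytes, width: int) -> str:
    """Hypothetical reader: decode exactly the first `width` octets."""
    return data[:width].decode("utf-8")


# A writer counts characters (8); a reader interprets that count as octets.
try:
    truncated = read_fixed_width(encoded, n_chars)
except UnicodeDecodeError:
    truncated = None                  # the cut fell inside the octets of "ö"

# A code point outside the Basic Multilingual Plane does not fit in a
# single two-octet unit: UTF-16 needs a surrogate pair (four octets).
telescope = "\U0001F52D"
n_utf16_octets = len(telescope.encode("utf-16-be"))

print(n_chars, n_octets, truncated, n_utf16_octets)
```

The same number, 8, is a correct length in characters but an unsafe length in octets, which is the ambiguity a fixed `arraysize` on a UTF-8-in-`char` column would carry.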

Best wishes,
Gregory