Moving forward with modern Unicode / UTF-8

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Thu Jul 17 10:42:14 CEST 2025


Dear Gregory,

On Tue, Jul 15, 2025 at 06:38:03AM +0000, Dubois-Felsmann, Gregory P. via apps wrote:
> I've been reading Markus' slides/notes from the meeting about
> support of Unicode.  Unfortunately I haven't been able to find
> Etherpad-like notes to go along with it, so I don't know what was
> said in the room at the time.  Have I been looking in the wrong
> place?

I don't remember whether we had a note taker.  If all else fails, I
suppose there's still the recording (and possibly transcripts).

But here's my take on things:

> Was there anything like a consensus in the room to move forward
> with something concrete, though?

Consensus is probably too strong a word.

I think most people who did have an opinion agreed that we should
replace UCS-2 with UTF-16 for unicodeChar.

Of course, even that step raises what was probably the fundamental
question: "Does this ruin the meaning of arraysize?".  And it is true
that it is somewhat ugly that a unicodeChar[6] array can only keep
three Emojis.
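
To make that arithmetic concrete, here is a minimal Python sketch.  It
is only an illustration, not anything taken from the VOTable spec; the
helper name utf16_code_units is made up for this example, and it
assumes arraysize would count 16-bit code units.

```python
def utf16_code_units(text: str) -> int:
    """Return the number of 16-bit code units text occupies in UTF-16."""
    # Encode without a byte order mark, then count two octets per unit.
    return len(text.encode("utf-16-be")) // 2


# Three emoji, all outside the Basic Multilingual Plane, so each one
# needs a surrogate pair (two code units) in UTF-16.
emoji = "\U0001F600\U0001F680\U0001F30D"

print(len(emoji))                  # characters (code points)
print(utf16_code_units(emoji))     # code units a unicodeChar array would need
print(utf16_code_units("abcdef"))  # plain ASCII: one code unit per character
```

Under these assumptions, the three emoji fill all six slots of a
unicodeChar[6], whereas six ASCII characters fit in the same space.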

I'd have said "aw, don't have fixed-size strings in the first place",
but then these are too nice to let go because they make for
fixed-size records, which, if nothing else, make memory-mapped FITS
arrays efficient; and VOTable should certainly remain a superset of
FITS bintables.

This problem is of course even more severe when we somehow imply
utf-8 in char arrays, and concerns that arraysize would become
something like "storage size" rather than "number of elements" when
we go that way were too strong for me to happily go to work.
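
A small Python sketch of what "storage size" would mean in practice,
assuming arraysize counts octets of UTF-8 and that short values are
NUL-padded.  Both are assumptions made for illustration, and fit_utf8
is a hypothetical helper, not part of any VOTable library.

```python
def fit_utf8(text: str, arraysize: int) -> bytes:
    """Encode text as UTF-8 into exactly arraysize octets.

    Over-long values are cut on a character boundary so that no
    multi-byte sequence is split; short values are padded with NUL.
    """
    encoded = text.encode("utf-8")
    if len(encoded) > arraysize:
        # Cutting at an arbitrary octet may split a multi-byte sequence;
        # decoding with errors="ignore" drops the incomplete tail.
        truncated = encoded[:arraysize].decode("utf-8", errors="ignore")
        encoded = truncated.encode("utf-8")
    return encoded.ljust(arraysize, b"\x00")


name = "Köln"
print(len(name))                  # 4 characters
print(len(name.encode("utf-8")))  # 5 octets, since "ö" takes two
print(fit_utf8(name, 4))          # only "Köl" fits into char[4]
print(fit_utf8(name, 2))          # "K" plus one padding octet
```

The point of the sketch is that a char[4] can no longer be relied on
to hold a four-character string once non-ASCII characters are allowed.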

> * I agree with Markus' suggestion that the longer-term solution may
> be to add to VOTable a rigorously correct way of marking a
> string-valued column as containing Unicode data, with a UTF-8
> representation both in TABLEDATA and in BINARY2 (where
> fixed-length-in-octets would not be allowed).  It seems likely that

I note in passing that a true string type would further reduce
round-trippability between VOTable and FITS binary tables.  I don't
think that's necessarily a counter-argument, but I thought I'd
mention it.

> this needs to be a new primitive type, so that `arraysize` has a
> new and rigorous definition for such strings, but I would like to

I'd say it would be the number of strings.  arraysize="1" would mean
"a single variable-length string".

> * I do expect that there will be concerns expressed about backward
> compatibility if we do add something to VOTable.

That, I think, is a minor concern in this case.  I think if we said
"if you encounter chars with a high 8th bit and you need to display
the string, decode it as utf-8 first if you can", that will break
essentially nothing.
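
As a minimal Python sketch of such a display rule: the function name
display_string and the fallback (showing undecodable octets as U+FFFD)
are assumptions made for this example; the rule above only says to try
UTF-8 first.

```python
def display_string(raw: bytes) -> str:
    """Turn the octets of a char array into a string for display."""
    try:
        # Pure ASCII is valid UTF-8, so existing data decodes unchanged.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: keep the ASCII part and show every other
        # octet as the replacement character U+FFFD.
        return raw.decode("ascii", errors="replace")


print(display_string(b"M 31"))         # plain ASCII, unchanged
print(display_string(b"K\xc3\xb6ln"))  # valid UTF-8
print(display_string(b"K\xf6ln"))      # not UTF-8 (Latin-1 "ö"), falls back
```

Clients that only ever see ASCII are unaffected, which is why this
should break essentially nothing.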

> * In the mean time, I would suggest that we write a "best
> practices" [Endorsed?] Note for how best to work with UTF-8
> represented as `char`.  E.g., "do not use fixed-length strings, as
> their meaning will be ambiguous"; "avoid using `char` without an
> `arraysize` specifier at all, since one-octet UTF-8 strings are not
> a safe concept".

Hm, no, I think that needs to be an update in VOTable itself because
right now, characters >=128 are forbidden for char by the spec.

But wherever we specify it, the main thing to consider is: "Are we
setting a bad precedent for arraysize?".

I'd argue "no", because I'd phrase our plan as "new rules for
*displaying* character strings", but I grant you there are so many
snags (like a single char with a high 8th bit being effectively
outlawed) that this is somewhere in uncomfortable proximity to the
land of weaseling.  Quite likely it's already across the border.

As to concrete next steps: I'd say two PRs against VOTable (one for
UTF-16 in unicodeChar, the other for UTF-8 in char) would be great,
and then we can see how much pushback we get against the possible
weakening of arraysize.

I *could* see myself volunteering for that if there's really nobody
else wanting to do that.  But I'd need a few Newtons of gentle
nudging.

Thanks,

            Markus


