Moving forward with modern Unicode / UTF-8
Mark Taylor
m.b.taylor at bristol.ac.uk
Thu Jul 17 11:15:49 CEST 2025
On Thu, 17 Jul 2025, Markus Demleitner via apps wrote:
> This problem is of course even more severe when we somehow imply
> utf-8 in char arrays, and concerns that arraysize would become
> something like "storage size" rather than "number of elements" when
> we go that way were too strong for me to happily go to work.
I don't think that problem is all that bad. If we just redefine the
char datatype to mean an octet of UTF-8 storage rather than a
character as such (this is completely backward compatible with
current usage), then arraysize makes sense without special casing.
That does mean you can't define a column containing a fixed number
of Unicode characters (unless you happen to know that only ASCII
is permitted, which may well be the case, e.g. for ISO 8601
datestamps), but I don't see that as much of an inconvenience.
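
To illustrate the difference between counting octets and counting
characters under that reading, here is a minimal Python sketch. The
helper name fits_char_array is hypothetical and not part of any
VOTable library; it simply assumes that arraysize is interpreted as
a number of UTF-8 octets.

```python
# Sketch only: assumes arraysize counts UTF-8 octets, not characters.
# fits_char_array is a hypothetical helper, not an existing API.

def fits_char_array(value: str, arraysize: int) -> bool:
    """Return True if the UTF-8 encoding of value fits in arraysize octets."""
    return len(value.encode("utf-8")) <= arraysize


# ASCII-only content: one octet per character, so both counts agree.
ascii_stamp = "2025-07-17T11:15:49"
ascii_chars = len(ascii_stamp)
ascii_octets = len(ascii_stamp.encode("utf-8"))

# Non-ASCII content: the octet count exceeds the character count.
# Escapes are used so the example does not depend on file encoding.
name = "\u00c5ngstr\u00f6m"
name_chars = len(name)
name_octets = len(name.encode("utf-8"))

print(ascii_chars, ascii_octets)
print(name_chars, name_octets)
```

For ASCII-only values such as ISO 8601 datestamps the two counts
coincide, so a fixed arraysize still corresponds to a fixed number
of characters; for other text, arraysize bounds storage only.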
> As to concrete next steps: I'd say two PRs (one UTF-16 in
> unicodeChar, the other UTF-8 in char) against VOTable would be great,
> and then we can see how much pushback we have against the possible
> weakening of arraysize.
>
> I *could* see myself volunteering for that if there's really nobody
> else wanting to do that. But I'd need a few Newtons of gentle
> nudging.
I'd be willing to have a go at such PRs, implementing the proposals
(more or less matching what Markus says above) that I made on the
apps list last month:
http://mail.ivoa.net/pipermail/apps/2025-June/001765.html
There was some discussion following that post, but nothing that
convinced me I was on the wrong track (it's possible that others
disagree).
I won't get to that right away, so there are at least a couple of
weeks for people to object here that PRs along those lines wouldn't
be the right thing to do (or for somebody else to out-volunteer me).
Mark
--
Mark Taylor Astronomical Programmer Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk https://www.star.bristol.ac.uk/mbt/