Unicode in VOTable

Thu Jun 12 20:46:45 CEST 2025

Contributing that “unsliced UTF-8 truncation to octets” function to Astropy’s VO tools would be most useful.

Yes please.

There are quite a few other shortcomings in the astropy votable module, and I would love to see that gap being patched over and improved.
At this point I can promise to help with code review and to make sure the PRs are in before the releases, but I don't see that I personally will have enough time to contribute big PRs myself.

I will also advocate for using astropy resources to fund some of this work, but even if that receives a positive reception, it will take some time to come to fruition and to contract out tasks.

Cheers,
 Brigitta
________________________________
From: apps <apps-bounces at ivoa.net> on behalf of Dubois-Felsmann, Gregory P. via apps <apps at ivoa.net>
Sent: 12 June 2025 09:59
To: Russ Allbery <eagle at eyrie.org>; Mark Taylor via apps <apps at ivoa.net>
Subject: Re: Unicode in VOTable

Contributing that “unsliced UTF-8 truncation to octets” function to Astropy’s VO tools would be most useful.
-Gregory

________________________________
From: apps <apps-bounces at ivoa.net> on behalf of Russ Allbery via apps <apps at ivoa.net>
Sent: Thursday, June 12, 2025 8:15:04 AM
To: Mark Taylor via apps <apps at ivoa.net>
Subject: Re: Unicode in VOTable

Mark Taylor via apps <apps at ivoa.net> writes:
> On Wed, 11 Jun 2025, Russ Allbery wrote:

>> For example, suppose that one has a column in the database that is
>> defined as CHAR(8) with a Unicode character set. What should the
>> corresponding arraysize in the TAP_SCHEMA entry be for this column? 8
>> seems obviously wrong and will truncate valid data. 48 is safe but
>> seems weird.

> 32, no?  The wikipedia UTF-8 page says "a variable-width encoding of
> one to four one-byte (8-bit) code units".

Oh, yes, sorry. The last time I wrote a UTF-8 decoder was in the RFC 2279
days when the encoding was specified through six bytes, but RFC 3629 made
it clear that no character is longer than four bytes. I keep forgetting
that.
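
As a rough illustration of the sizing arithmetic above (a sketch only; the
helper name worst_case_arraysize is hypothetical and not part of any VO
library), the worst-case octet count for a CHAR(n) column under the RFC 3629
limit of four octets per character could be computed like this:

```python
# RFC 3629 limits a UTF-8 encoded character to at most four octets.
MAX_UTF8_OCTETS_PER_CHAR = 4


def worst_case_arraysize(n_chars: int) -> int:
    """Octets needed to hold any string of n_chars Unicode characters.

    Hypothetical helper for illustration, not an existing API.
    """
    return n_chars * MAX_UTF8_OCTETS_PER_CHAR


# Encoded lengths for one-, two-, three-, and four-octet characters.
samples = ["a", "\u00f3", "\u20ac", "\U0001F600"]
octet_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(octet_lengths)
print(worst_case_arraysize(8))
```

For the CHAR(8) column in the example, this gives 32 rather than 48.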

> It's a fair question, but IMO we don't lose enough to make it a worry.
> In most string-processing contexts these days the default processing is
> UTF-8 anyway and it's the one-byte-per-character strings that require
> special measures (e.g. in java if you write a sloppy VOTable parser it
> will probably decode char arrays as UTF-8 strings already unless you try
> hard to stop it doing that).

> Also, if people want to use single bytes, there's still the
> unsignedByte datatype.

This sounds good to me. I also agree with Markus's point that the right
thing to do is just warn people about truncation issues with fixed-width
strings and steer them away from either the fixed width or use of
non-ASCII characters, depending on the situation.

The truncation edge cases feel nasty to me, not in the sense that it's
hard to specify what should happen (we can pick something reasonable), but
in the sense that it feels likely that someone who doesn't read the
specification carefully is going to do the wrong thing since there are a
lot of "attractive nuisances," so to speak. But I don't have any better
idea and this looks like a good way forward.

The other edge case that occurred to me last night that may be worth
mentioning explicitly is this one:

1. A VOTable field is defined as arraysize="8*".

2. Some client, looking at that VOTable, checks to see if the data is
   exactly eight characters long. If so, it decides that the data may have
   been truncated; if it is less than eight characters long, it concludes
   that it was not truncated. It may take different actions depending on
   that belief.

3. The underlying database column contains "abcdefgó", which is eight
   *characters* long but whose UTF-8 representation is nine *octets*.

4. The TAP server, respecting the schema definition of arraysize="8*" and
   the VOTable definition of the length as meaning octets, truncates the
   data to "abcdefg" because the last character will not fit.

5. The client sees that the contents of the field is seven characters,
   which is less than eight, and therefore assumes no truncation happened.
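
The five steps above can be reproduced in a few lines of Python. This is only
a sketch under stated assumptions: the function looks_truncated is a
hypothetical stand-in for the client heuristic in step 2, and the server-side
truncation is assumed to drop a character that does not fit entirely.

```python
# The database value from step 3; the final character is written as an
# escape (U+00F3, precomposed o with acute) so the example is unambiguous.
value = "abcdefg\u00f3"
limit = 8  # from arraysize="8*", interpreted as octets

encoded = value.encode("utf-8")
n_chars = len(value)     # counted in characters
n_octets = len(encoded)  # counted in octets

# Step 4: the server cuts at the octet limit and drops the incomplete
# trailing sequence, so the whole last character disappears.
sent = encoded[:limit].decode("utf-8", errors="ignore")


def looks_truncated(text: str, size: int) -> bool:
    """Hypothetical client heuristic from step 2: full-length means truncated."""
    return len(text) == size


# Step 5: the client sees seven characters and concludes nothing was lost.
client_thinks_truncated = looks_truncated(sent, limit)

print(n_chars, n_octets, repr(sent), client_thinks_truncated)
```

The client's conclusion is wrong even though every party followed the
definitions it was given.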

It may also be worth noting that I'm not sure how common the operation
"truncate so that the UTF-8 representation fits into N octets" is in
various programming languages. I don't know of a primitive that does that
in Python off the top of my head and would therefore probably write a
custom routine that converts from str to bytes, checks if len > N, and if
so, removes the last character of the str and repeats until the result has
len <= N. (There is an obvious optimization for the first truncation that
may or may not be worth bothering with.) I think people are used to
reaching for the standard libraries to do this sort of thing, so unless
I'm just unaware of standard libraries that support this, it might be
worth stressing that truncation may require some custom code like that.
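
A minimal sketch of such a routine, for illustration only (both function names
are hypothetical, and neither is part of Astropy or the standard library). The
first follows the loop described above, including the first-truncation
shortcut; the second slices the encoded bytes and lets the decoder discard an
incomplete trailing sequence.

```python
def truncate_utf8_loop(text: str, max_octets: int) -> str:
    """Drop trailing characters until the UTF-8 encoding fits in max_octets."""
    # Shortcut: every character needs at least one octet, so no more than
    # max_octets characters can ever fit.
    text = text[:max_octets]
    while len(text.encode("utf-8")) > max_octets:
        text = text[:-1]
    return text


def truncate_utf8_slice(text: str, max_octets: int) -> str:
    """Same result by slicing the octets and ignoring a partial last character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_octets:
        return text
    # A cut through the middle of a multi-octet character leaves an
    # incomplete sequence at the end; errors="ignore" discards it.
    return encoded[:max_octets].decode("utf-8", errors="ignore")


example = "abcdefg\u00f3"  # eight characters, nine octets
print(repr(truncate_utf8_loop(example, 8)))
print(repr(truncate_utf8_slice(example, 8)))
```

Both variants truncate at code point boundaries only. Neither takes combining
sequences into account, so a base letter can be kept while a following
combining accent is dropped.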

--
Russ Allbery (eagle at eyrie.org)             <https://www.eyrie.org/~eagle/>