<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
</head>
<body>
<div>
<div dir="ltr">Contributing that “unsliced UTF-8 truncation to octets” function to Astropy’s VO tools would be most useful. </div>
<div dir="ltr">-Gregory </div>
</div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> apps &lt;apps-bounces@ivoa.net&gt; on behalf of Russ Allbery via apps &lt;apps@ivoa.net&gt;<br>
<b>Sent:</b> Thursday, June 12, 2025 8:15:04 AM<br>
<b>To:</b> Mark Taylor via apps &lt;apps@ivoa.net&gt;<br>
<b>Subject:</b> Re: Unicode in VOTable</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">Mark Taylor via apps &lt;apps@ivoa.net&gt; writes:<br>
> On Wed, 11 Jun 2025, Russ Allbery wrote:<br>
<br>
>> For example, suppose that one has a column in the database that is<br>
>> defined as CHAR(8) with a Unicode character set. What should the<br>
>> corresponding arraysize in the TAP_SCHEMA entry be for this column? 8<br>
>> seems obviously wrong and will truncate valid data. 48 is safe but<br>
>> seems weird.<br>
<br>
> 32, no? The wikipedia UTF-8 page says "a variable-width encoding of<br>
> one to four one-byte (8-bit) code units".<br>
<br>
Oh, yes, sorry. The last time I wrote a UTF-8 decoder was in the RFC 2279<br>
days when the encoding was specified through six bytes, but RFC 3629 made<br>
it clear that no character is longer than four bytes. I keep forgetting<br>
that.<br>
<br>
> It's a fair question, but IMO we don't lose enough to make it a worry.<br>
> In most string-processing contexts these days the default processing is<br>
> UTF-8 anyway and it's the one-byte-per-character strings that require<br>
> special measures (e.g. in java if you write a sloppy VOTable parser it<br>
> will probably decode char arrays as UTF-8 strings already unless you try<br>
> hard to stop it doing that).<br>
<br>
> Also, if people want to use single bytes, there's still the<br>
> unsignedByte datatype.<br>
<br>
This sounds good to me. I also agree with Markus's point that the right<br>
thing to do is just warn people about truncation issues with fixed-width<br>
strings and steer them away from either the fixed width or use of<br>
non-ASCII characters, depending on the situation.<br>
<br>
The truncation edge cases feel nasty to me, not in the sense that it's<br>
hard to specify what should happen (we can pick something reasonable), but<br>
in the sense that it feels likely that someone who doesn't read the<br>
specification carefully is going to do the wrong thing since there are a<br>
lot of "attractive nuisances," so to speak. But I don't have any better<br>
idea and this looks like a good way forward.<br>
<br>
The other edge case that occurred to me last night that may be worth<br>
mentioning explicitly is this one:<br>
<br>
1. A VOTable field is defined as arraysize="8*".<br>
<br>
2. Some client, looking at that VOTable, checks to see if the data is<br>
exactly eight characters long. If so, it decides that the data may have<br>
been truncated; if it is less than eight characters long, it concludes<br>
that it was not truncated. It may take different actions depending on<br>
that belief.<br>
<br>
3. The underlying database column contains "abcdefgó", which is eight<br>
*characters* long but whose UTF-8 representation is nine *octets*.<br>
<br>
4. The TAP server, respecting the schema definition of arraysize="8*" and<br>
the VOTable definition of the length as meaning octets, truncates the<br>
data to "abcdefg" because the last character will not fit.<br>
<br>
5. The client sees that the contents of the field is seven characters,<br>
which is less than eight, and therefore assumes no truncation happened.<br>
<br>
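A minimal sketch of steps 1 through 5 in Python, purely for illustration:<br>
the helper naive_client_check is hypothetical, and the server-side<br>
truncation shown is only one way a TAP server might behave.<br>
<br>
<pre>
```python
# Illustration only: naive_client_check is a hypothetical helper,
# not part of any VOTable or TAP library.

value = "abcdefg\u00f3"   # "abcdefgó": eight characters
limit = 8                 # arraysize="8*", with the length counted in octets

encoded = value.encode("utf-8")
print(len(value), len(encoded))   # 8 9

# Assumed server behaviour: keep at most `limit` octets and drop the
# incomplete trailing character left behind by the cut.
sent = encoded[:limit].decode("utf-8", errors="ignore")
print(sent, len(sent))            # abcdefg 7


def naive_client_check(text: str, limit: int) -> bool:
    """Guess that truncation happened only when the field looks full."""
    return len(text) == limit


# Seven characters were received, so the client wrongly concludes
# that nothing was truncated.
print(naive_client_check(sent, limit))   # False
```
</pre>
<br>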
It may also be worth noting that I'm not sure how common the operation<br>
"truncate so that the UTF-8 representation fits into N octets" is in<br>
various programming languages. I don't know of a primitive that does that<br>
in Python off the top of my head and would therefore probably write a<br>
custom routine that converts from str to bytes, checks if len > N, and if<br>
so, removes the last character of the str and repeats until the result has<br>
len <= N. (There is an obvious optimization for the first truncation that<br>
may or may not be worth bothering with.) I think people are used to<br>
reaching for the standard libraries to do this sort of thing, so unless<br>
I'm just unaware of standard libraries that support this, it might be<br>
worth stressing that truncation may require some custom code like that.<br>
<br>
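A rough sketch of such a custom routine in Python, under the assumption<br>
that max_octets is non-negative. The function names are hypothetical. The<br>
first function follows the loop described above, with one possible reading<br>
of the first-truncation optimization; the second is an alternative that<br>
cuts the encoded bytes and lets the decoder discard a partial trailing<br>
character.<br>
<br>
<pre>
```python
def truncate_utf8_loop(text: str, max_octets: int) -> str:
    """Drop trailing characters until the UTF-8 encoding fits in max_octets."""
    # Every character needs at least one octet, so anything beyond
    # max_octets characters can be discarded in a single step.
    text = text[:max_octets]
    while len(text.encode("utf-8")) > max_octets:
        text = text[:-1]
    return text


def truncate_utf8_slice(text: str, max_octets: int) -> str:
    """Cut the encoded bytes, then drop any incomplete trailing character."""
    encoded = text.encode("utf-8")[:max_octets]
    # errors="ignore" discards the partial sequence left at the end of the cut.
    return encoded.decode("utf-8", errors="ignore")


print(truncate_utf8_loop("abcdefg\u00f3", 8))    # abcdefg
print(truncate_utf8_slice("abcdefg\u00f3", 8))   # abcdefg
```
</pre>
Both versions work on code points, so a base letter followed by a<br>
combining accent may still be split at the cut.<br>
<br>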
-- <br>
Russ Allbery (eagle@eyrie.org) &lt;<a href="https://www.eyrie.org/~eagle/">https://www.eyrie.org/~eagle/</a>&gt;<br>
</div>
</span></font></div>
</body>
</html>