<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

</head>

<body>

<div>

<div dir="ltr">Contributing that “unsliced UTF-8 truncation to octets” function to Astropy’s VO tools would be most useful. </div>

<div dir="ltr">-Gregory </div>

</div>


<hr style="display:inline-block;width:98%" tabindex="-1">

<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> apps &lt;apps-bounces@ivoa.net&gt; on behalf of Russ Allbery via apps &lt;apps@ivoa.net&gt;<br>

<b>Sent:</b> Thursday, June 12, 2025 8:15:04 AM<br>

<b>To:</b> Mark Taylor via apps &lt;apps@ivoa.net&gt;<br>

<b>Subject:</b> Re: Unicode in VOTable</font>

<div> </div>

</div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">

<div class="PlainText">Mark Taylor via apps &lt;apps@ivoa.net&gt; writes:<br>

> On Wed, 11 Jun 2025, Russ Allbery wrote:<br>

<br>

>> For example, suppose that one has a column in the database that is<br>

>> defined as CHAR(8) with a Unicode character set. What should the<br>

>> corresponding arraysize in the TAP_SCHEMA entry be for this column? 8<br>

>> seems obviously wrong and will truncate valid data. 48 is safe but<br>

>> seems weird.<br>

<br>

> 32, no?  The wikipedia UTF-8 page says "a variable-width encoding of<br>

> one to four one-byte (8-bit) code units".<br>

<br>

Oh, yes, sorry. The last time I wrote a UTF-8 decoder was in the RFC 2279<br>

days when the encoding was specified through six bytes, but RFC 3629 made<br>

it clear that no character is longer than four bytes. I keep forgetting<br>

that.<br>

<br>
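As a rough illustration of that arithmetic (a sketch only; the helper name<br>
worst_case_arraysize is hypothetical and not part of any VOTable or TAP<br>
library), the worst case for a CHAR(n) column of UTF-8 text is four octets<br>
per character:<br>

```python
# RFC 3629 caps a UTF-8 encoded character at four octets.
MAX_UTF8_OCTETS_PER_CHAR = 4


def worst_case_arraysize(n_chars: int) -> int:
    """Hypothetical helper: octets needed to hold any n_chars characters as UTF-8."""
    return n_chars * MAX_UTF8_OCTETS_PER_CHAR


# The highest Unicode code point, U+10FFFF, really does need all four octets.
widest = len("\U0010FFFF".encode("utf-8"))

print(worst_case_arraysize(8))  # 32
print(widest)                   # 4
```

<br>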

> It's a fair question, but IMO we don't lose enough to make it a worry.<br>

> In most string-processing contexts these days the default processing is<br>

> UTF-8 anyway and it's the one-byte-per-character strings that require<br>

> special measures (e.g. in java if you write a sloppy VOTable parser it<br>

> will probably decode char arrays as UTF-8 strings already unless you try<br>

> hard to stop it doing that).<br>

<br>

> Also, if people want to use single bytes, there's still the<br>

> unsignedByte datatype.<br>

<br>

This sounds good to me. I also agree with Markus's point that the right<br>

thing to do is just warn people about truncation issues with fixed-width<br>

strings and steer them away from either the fixed width or use of<br>

non-ASCII characters, depending on the situation.<br>

<br>

The truncation edge cases feel nasty to me, not in the sense that it's<br>

hard to specify what should happen (we can pick something reasonable), but<br>

in the sense that it feels likely that someone who doesn't read the<br>

specification carefully is going to do the wrong thing since there are a<br>

lot of "attractive nuisances," so to speak. But I don't have any better<br>

idea and this looks like a good way forward.<br>

<br>

The other edge case that occurred to me last night that may be worth<br>

mentioning explicitly is this one:<br>

<br>

1. A VOTable field is defined as arraysize="8*".<br>

<br>

2. Some client, looking at that VOTable, checks to see if the data is<br>

   exactly eight characters long. If so, it decides that the data may have<br>

   been truncated; if it is less than eight characters long, it concludes<br>

   that it was not truncated. It may take different actions depending on<br>

   that belief.<br>

<br>

3. The underlying database column contains "abcdefgó", which is eight<br>

   *characters* long but whose UTF-8 representation is nine *octets*.<br>

<br>

4. The TAP server, respecting the schema definition of arraysize="8*" and<br>

   the VOTable definition of the length as meaning octets, truncates the<br>

   data to "abcdefg" because the last character will not fit.<br>

<br>

5. The client sees that the field contains seven characters,<br>

   which is less than eight, and therefore assumes no truncation happened.<br>

<br>
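The mismatch in steps 3 to 5 can be reproduced in a few lines of Python.<br>
This is only a sketch: looks_truncated is a hypothetical stand-in for<br>
whatever heuristic such a client might use, and the delivered value is<br>
copied from step 4 rather than produced by a real TAP server.<br>

```python
arraysize = 8
stored = "abcdefgó"     # value in the database column (step 3)
delivered = "abcdefg"   # value after the server truncates to 8 octets (step 4)

chars = len(stored)                    # length in characters
octets = len(stored.encode("utf-8"))   # length of the UTF-8 representation


def looks_truncated(value: str, limit: int) -> bool:
    """Hypothetical client heuristic from step 2: a full field may be truncated."""
    return len(value) >= limit


print(chars, octets)                          # 8 9
print(looks_truncated(delivered, arraysize))  # False, although data was lost
```

<br>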

It may also be worth noting that I'm not sure how common the operation<br>

"truncate so that the UTF-8 representation fits into N octets" is in<br>

various programming languages. I don't know of a primitive that does that<br>

in Python off the top of my head and would therefore probably write a<br>

custom routine that converts from str to bytes, checks if len > N, and if<br>

so, removes the last character of the str and repeats until the result has<br>

len <= N. (There is an obvious optimization for the first truncation that<br>

may or may not be worth bothering with.) I think people are used to<br>

reaching for the standard libraries to do this sort of thing, so unless<br>

I'm just unaware of standard libraries that support this, it might be<br>

worth stressing that truncation may require some custom code like that.<br>

<br>
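A minimal sketch of the custom routine described above, assuming plain<br>
Python str input. The name truncate_to_octets is hypothetical. The initial<br>
slice is one reading of the "obvious optimization": every character takes<br>
at least one octet, so at most N characters can fit. Note that this works<br>
on code points, so it can still split a base letter from a following<br>
combining mark.<br>

```python
def truncate_to_octets(text: str, max_octets: int) -> str:
    """Return the longest prefix of text whose UTF-8 form fits in max_octets."""
    if 0 > max_octets:
        raise ValueError("max_octets must be non-negative")
    # Each character needs at least one octet, so cut to max_octets
    # characters before doing any encoding.
    result = text[:max_octets]
    # Drop trailing characters until the encoded form fits.
    while len(result.encode("utf-8")) > max_octets:
        result = result[:-1]
    return result


print(truncate_to_octets("abcdefgó", 8))  # abcdefg
print(truncate_to_octets("abcdefgó", 9))  # abcdefgó
```

An alternative that avoids the loop is to slice the encoded bytes and decode<br>
them with errors="ignore", which discards an incomplete trailing sequence.<br>
<br>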

-- <br>

Russ Allbery (eagle@eyrie.org)             <<a href="https://www.eyrie.org/~eagle/">https://www.eyrie.org/~eagle/</a>><br>

</div>

</span></font></div>

</body>

</html>