Unicode in VOTable

F.-X. Pineau francois-xavier.pineau at astro.unistra.fr
Thu Jun 12 11:16:17 CEST 2025


Markus, Mark, Russ et al.,

I support the ideas of deprecating "unicodeChar" and allowing UTF-8 in
"char", with "arraysize" being the length of the string in bytes.
(But also, what about adding a note on the possible meaning of "width"
in such cases? See below.)

It seems that in (recent versions of?) some databases, 'char(n)' means
'a string of n bytes'; see e.g.:
* Microsoft SQL Server:
https://learn.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver16
* JavaDB: https://docs.oracle.com/javadb/10.10.1.2/ref/rrefsqlj13733.html

PostgreSQL warns against the use of fixed-length strings; see the "Tip"
in https://www.postgresql.org/docs/current/datatype-character.html
A PSQL type CHAR(8) with a character set other than ASCII
(https://www.postgresql.org/docs/current/multibyte.html)
could be transformed into arraysize="8*",
and a fixed arraysize="x" could be transformed into VARCHAR((x+3)/4).
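
A minimal Rust sketch of these two conversions (the helper names and
the ascii_only flag are mine, nothing standard):

/// CHAR(n) with a non-ASCII character set -> variable-length byte arraysize.
fn char_n_to_arraysize(n: u32, ascii_only: bool) -> String {
    if ascii_only { n.to_string() } else { format!("{}*", n) }
}

/// Fixed arraysize="x" (bytes) -> VARCHAR((x+3)/4), following the
/// conversion suggested above (a UTF-8 character takes at most 4 bytes).
fn arraysize_to_varchar_len(x: u32) -> u32 {
    (x + 3) / 4
}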

For ASCII or a fixed number of characters, we may enforce the use of
the "width" attribute, since "width" is "the number of characters to be
used for input or output of the quantity".
Am I right in saying that datatype="char" + arraysize="x" + width="x"
could mean a fixed-length ASCII string?
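
If such a convention were adopted, a reader could detect it along these
lines (hypothetical Rust sketch, my own function name):

/// True when a FIELD plausibly declares a fixed-length ASCII string
/// under the convention above: char data whose byte length (arraysize)
/// equals its display width.
fn is_fixed_ascii(datatype: &str, arraysize: &str, width: Option<usize>) -> bool {
    datatype == "char"
        && width.map_or(false, |w| arraysize == w.to_string())
}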

We may find corner cases, but, FWIW, the Rust VOTable library 
(https://github.com/cds-astro/cds-votable-rust) and vot-cli 
(https://github.com/cds-astro/cds-votable-rust/tree/main/crates/cli) 
seem to already support UTF-8. E.g. the (non-standard) VOTable:

<?xml version="1.0" encoding="UTF-8"?> <VOTABLE version="1.5" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns="http://www.ivoa.net/xml/VOTable/v1.3" > <RESOURCE> <TABLE> <FIELD 
name="a" datatype="char" arraysize="3*"/> <FIELD name="b" 
datatype="unicodeChar" arraysize="*"/> <FIELD name="c" datatype="char" 
arraysize="8"/> <FIELD name="d" datatype="unicodeChar" arraysize="3"/> 
<DATA><TABLEDATA> 
<TR><TD>é😀è</TD><TD>éàè</TD><TD>é😀è</TD><TD>éàè</TD></TR> 
</TABLEDATA></DATA> </TABLE> </RESOURCE> </VOTABLE>

can be converted back and forth between TABLEDATA / BINARY / BINARY2
(+ JSON / TOML / YAML) using the following commands:

vot convert -i ${in} -o ${in}.bin.xml -f xml-bin
vot convert -i ${in} -o ${in}.bin2.xml -f xml-bin2
vot convert -i ${in} -o ${in}.json -f json --pretty
vot convert -i ${in} -o ${in}.toml -f toml --pretty
vot convert -i ${in} -o ${in}.yaml -f yaml
vot convert -i ${in}.bin.xml -f xml-td
vot convert -i ${in}.bin2.xml -f xml-td
vot convert -i ${in}.json -f xml-td
vot convert -i ${in}.toml -f xml-td
vot convert -i ${in}.yaml -f xml-td

fx

On 11/06/2025 at 18:35, Russ Allbery via apps wrote:
> Mark Taylor via apps <apps at ivoa.net> writes:
>
>>   1. Redefine datatype="char" to mean UTF-8 in the BINARY/BINARY2 encoding,
>>      and document-encoded unicode in TABLEDATA.
>>      The arraysize attribute and the BINARY/BINARY2 byte count are both
>>      equal to the number of bytes in the UTF-8 encoded value (not the
>>      number of characters/codepoints in the string).
>>      This won't break anything which is already correct, since you're only
>>      supposed to put 7-bit ASCII (whose UTF-8 representation is identical)
>>      into char fields.
>>      The downside is that a FIELD with datatype="char" arraysize="8"
>>      can't store an 8-character string if those characters are emojis.
>      Personally, I think that's OK: if you want to declare fixed-length
>      char fields, you will now have to think in UTF-8 terms, not
>      code-point terms.
> I was trying to decide if this would cause a problem for TAP table upload
> given that database schemas generally specify limits in terms of
> characters, not bytes, for CHAR and VARCHAR data types, but I think I'm
> convinced that this isn't a concern in that direction. If one takes the
> approach of blindly translating the arraysize parameter of the type to the
> length of the field, the result would be a database column that is "too
> large" for the VOTable data type for non-ASCII strings, but I don't think
> that causes problems as long as the TAP service knows the original type
> and can reflect it on query results.
>
> This does feel like it's going to increase the existing schism between
> underlying database types and VOTable types because there will be no clear
> translation of arraysize for char fields between VOTable semantics and
> common database semantics. I'm not sure there's any way to avoid that, but
> it feels awkward.
>
> For example, suppose that one has a column in the database that is defined
> as CHAR(8) with a Unicode character set. What should the corresponding
> arraysize in the TAP_SCHEMA entry be for this column? 8 seems obviously
> wrong and will truncate valid data. 48 is safe but seems weird.
>
> ("Don't use fixed-width char fields for anything other than
> single-character ASCII flags; this is a false optimization for modern
> databases" is probably the correct answer in most cases, but we all know
> database schemas are hard to change.)
>
> Also, a probably obvious point, but worth stating explicitly: Suppose that
> the VOTable schema for a column is datatype="char" arraysize="8" but the
> database column value is two Unicode characters whose UTF-8 representation
> totals 12 bytes. The TAP server I think needs to truncate at the last
> character that fits into the size when converted to UTF-8, and then pad.
> It definitely should not take the naive approach of converting to UTF-8
> and then truncating at 8 bytes, since that will result in corrupt UTF-8
> that should be rejected by any UTF-8 decoder.
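
A minimal Rust sketch of this truncate-then-pad rule (the function name
is mine; padding with ASCII spaces is an assumption):

fn truncate_and_pad_utf8(s: &str, n: usize) -> Vec<u8> {
    let mut end = n.min(s.len());
    // move back until `end` falls on a UTF-8 character boundary
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    let mut out = s.as_bytes()[..end].to_vec();
    out.resize(n, b' '); // pad with spaces up to the declared byte size
    out
}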
>
>>   2. Deprecate datatype="unicodeChar".  Anybody who wants to write
>>      non-ASCII text should use UTF-8 in datatype="char" instead.
> While in general I am in favor of using Unicode everywhere, do we lose
> anything by no longer having a way of marking fields as containing simple
> one-byte-per-character results that don't require any special processing?
>
> I suppose the alternative is to introduce yet another datatype, though,
> which seems even more unappealing.
>
>>   3. Just to remove mention of the obsolete UCS-2 from the standard,
>>      change the text to say that BINARY/BINARY2 unicodeChar is to be
>>      interpreted as UTF-16, but that behaviour is undefined where it
>>      contains characters outside of the UCS-2 subset of UTF-16.
>>      Then the BINARY/BINARY2 byte count for unicodeChar arrays is
>>      2*arraysize.
>>      That's somewhat nasty, but I claim OK since (a) unicodeChar
>>      only used to be allowed for UCS-2 so it won't break any existing
>>      code/data[*], and (b) unicodeChar will now be deprecated so nobody
>>      should write new code/data that encounters this.
> So basically telling implementors that to fully support Unicode you should
> ignore both UTF-16 and unicodeChar and just use char with UTF-8, which can
> handle the entire character set. This part seems reasonable to me.
>
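
A small Rust illustration of the UCS-2 limitation mentioned in point 3
(purely illustrative):

fn main() {
    // A character outside the Basic Multilingual Plane needs a UTF-16
    // surrogate pair (two 16-bit units), so it cannot be represented
    // by a single UCS-2 unicodeChar cell.
    let units: Vec<u16> = "😀".encode_utf16().collect();
    assert_eq!(units.len(), 2);
}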