Questions about UTF-8 in VOTable

Fri Jun 19 15:02:34 CEST 2026

VOTable 1.6 proposes to change the definition of the *char* datatype from
ascii to utf-8. I really think *it is not a good idea*, and a new datatype
able to handle utf-8 strings should be preferred if the exchange of tables
containing non-ascii data is required. This is why I introduced
unicodeChar in the first version of VOTable: to open the possibility of
exchanging textual data not limited to ascii-only characters. Unicode was
in active development at that time (2002), and choosing Unicode for the
expansion of textual data seemed the obvious way, as opposed to choosing
a *charset* which extends the alphabet only to a very limited set.

Currently virtually 100% of the non-numeric data existing in astronomical
tables consists of sequences of *restricted ascii characters* as defined in
FITS (bytes with decimal values between 32 and 126, therefore excluding
control characters). Considering the importance of such non-numerical data,
it seems fundamental that a <FIELD> made of *restricted ascii* characters
continues to exist in VOTable.
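
To make the *restricted ascii* definition concrete, here is a minimal
Python sketch of the test implied by the byte range quoted above (32 to
126). The function name is only illustrative and is not part of any
VOTable or FITS library.

```python
def is_restricted_ascii(text: str) -> bool:
    """Return True if every character has a code between 32 and 126."""
    return all(32 <= ord(ch) <= 126 for ch in text)


print(is_restricted_ascii("NGC 4258"))      # True
print(is_restricted_ascii("H\u03b1 line"))  # False: Greek alpha is not ascii
print(is_restricted_ascii("a\tb"))          # False: TAB is a control character
```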

Notice that Unicode and its UTF-8 serialisation are much more complex than
just an extension of the basic alphabet used in English to "characters"
existing in other languages. What a language like Java defines as a
"*Character*" is in fact a 16-bit UTF-16 code unit, and even a full *Unicode
code point* is not necessarily what we could call a "character", a
"letter", a "symbol" or a "glyph".
Unicode code points may be invisible (have a zero width), may represent a
part of a symbol (e.g. an accent), or have a double width. For instance the
UTF-8 string &#x2648;&#xFE0E;, which represents the Aries constellation, is
made of 6 bytes containing 2 Unicode code points: the first is &#x2648;,
which has a width of 2, and the second is &#xFE0E;, which has a width of 0
and whose only role is to prevent the Aries symbol from being rendered as an
emoji (♈).
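
The byte and code point counts of the Aries example can be checked with
a few lines of Python. This is only a sketch using the standard
unicodedata module; the helper name describe is mine.

```python
import unicodedata

# Aries sign followed by VARIATION SELECTOR-15 (text presentation).
aries = "\u2648\ufe0e"


def describe(text):
    """List (code point, Unicode name, general category, UTF-8 byte count)."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch),
         unicodedata.category(ch), len(ch.encode("utf-8")))
        for ch in text
    ]


print(len(aries))                  # 2 code points
print(len(aries.encode("utf-8")))  # 6 bytes
for row in describe(aries):
    print(row)
```

The second code point has category Mn (non-spacing mark): it occupies
three bytes but adds nothing visible of its own.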

There are many other traps in Unicode and its UTF-8 serialisation, such as
several ways of writing a single symbol: Ω may be the 2-byte Greek letter
(&#x3A9;) or the 3-byte Ohm unit (&#x2126;); similarly, a letter with an
accent (e.g. Ô) may be coded as one 2-byte code point (&#xD4;) or as
two code points in 3 bytes (O&#x302;), etc.; see e.g.
https://utf8everywhere.org/ . As a consequence, even the comparison of
2 UTF-8 strings for equality is *not* an easy operation.
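
The equality problem can be illustrated with Python's standard
unicodedata module. The sketch below compares the two spellings of Ω and
of Ô, first naively and then after NFC normalisation; choosing NFC
(rather than NFD or the compatibility forms) is my assumption, not
something VOTable prescribes.

```python
import unicodedata

omega_greek = "\u03a9"     # GREEK CAPITAL LETTER OMEGA, 2 bytes in UTF-8
ohm_sign = "\u2126"        # OHM SIGN, 3 bytes in UTF-8
o_precomposed = "\u00d4"   # O WITH CIRCUMFLEX as one code point, 2 bytes
o_combining = "O\u0302"    # O + COMBINING CIRCUMFLEX ACCENT, 3 bytes


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(omega_greek == ohm_sign)                 # False
print(nfc_equal(omega_greek, ohm_sign))        # True
print(o_precomposed == o_combining)            # False
print(nfc_equal(o_precomposed, o_combining))   # True
```

A reader or writer comparing column values byte by byte would see four
different strings here, although a human sees only two symbols.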

Rather than a drastic change in the definition of the *char* datatype, I
believe it would be much better to introduce a *String* datatype in
VOTable, which would be defined as *a UTF-8 sequence of Unicode code
points, excluding the null code point (U+0000)*. Such a datatype would be
more flexible, without having to define what a "character" is or requiring
an *arraysize* attribute; it would moreover become possible to define
arrays of strings, which is currently problematic.

In the TABLEDATA serialization, the representation of a *String* is
straightforward; there is however a possible problem with the &-symbols:
while the &#-symbols are easily interpretable (numerical values like &#x26;
or &#x3C;), what about alphabetic symbols like &amp; or &lt;? If these
alphabetic symbols related to ascii characters can (and should) be
enumerated as they were in VOTable 1.5, what about the ever-growing list of
Unicode symbols like ⥫ (&llhard;) or 𝕏 (&Xopf;)? Should these be
explicitly excluded or accepted?
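
The difference between the kinds of references shows up with a generic
XML parser. In the sketch below (Python standard library only; the
helper parse_td and the sample cells are mine), numeric references and
the predefined XML names are decoded, while an HTML-style name such as
&Omega; is rejected as an undefined entity unless the document declares
it.

```python
import html
import xml.etree.ElementTree as ET


def parse_td(cell_xml: str):
    """Return the text of a <TD> cell, or None if the XML parser rejects it."""
    try:
        return ET.fromstring(cell_xml).text
    except ET.ParseError:
        return None


print(parse_td("<TD>&#x26;</TD>"))     # '&'  numeric reference
print(parse_td("<TD>&amp;</TD>"))      # '&'  predefined XML entity
print(parse_td("<TD>&#x3A9;</TD>"))    # 'Ω'  numeric reference
print(parse_td("<TD>&Omega;</TD>"))    # None: name not defined in plain XML
print(html.unescape("&Omega;"))        # 'Ω'  decoded by the HTML name table
```

Accepting the alphabetic names would therefore require VOTable readers
to carry their own name table, in the way html.unescape does here.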

The BINARY serialization would not be a problem, since the String would
just be a stream of bytes ending with a *null*; there would be no need to
specify a length preceding the stream of bytes, removing the requirement
of a maximal length (number of bytes, of code points, of glyphs or
whatever size).
The FITS serialization would be a problem, since this type does not (yet)
exist in FITS; there were several discussions about adding UTF-8 to FITS,
and an obvious possibility would be to save the string contents in the
heap, while the binary table row would contain just a pointer to the
location of the string in the heap.

Finally, shouldn't the introduction of UTF-8 in VOTable also specify whether
UTF-8 would be acceptable in attribute values? Could the *name* or *value*
attribute of a <FIELD>, <INFO> or <PARAM> contain "characters" outside the
restricted-ascii set?

Sorry for being a bit long, but I think the radical change of transforming
ascii into UTF-8 deserves careful thought about the multiple implications
involved.

Cheers,
François Ochsenbein