Questions about UTF-8 in VOTable
Francois Ochsenbein
francois.ochsenbein at gmail.com
Fri Jun 19 15:02:34 CEST 2026
VOTable 1.6 proposes to change the definition of the *char* datatype from
ascii to utf-8. I really think *it is not a good idea*, and a new datatype
able to handle utf-8 strings should be preferred if the exchange of tables
containing non-ascii data is required. This is why I introduced the
unicodeChar in the first version of VOTable: open a possibility of
exchanging textual data not limited to ascii-only characters. Unicode was
in active development at that time (2002), and choosing Unicode for the
expansion of textual data seemed the obvious way, as opposed to a choice of
a *charset* which enlarges the alphabet to a very limited set.
Currently virtually 100% of non-numeric data existing in astronomical
tables consist of sequences of *restricted ascii characters* as defined in
FITS (bytes with decimal values between 32 and 126, therefore excluding
control characters). Considering the importance of such non-numerical data,
it seems fundamental that a <FIELD> made of *restricted ascii* characters
continues to exist in VOTable.
Notice that Unicode and its UTF-8 serialisation is much more complex than
just an extension of the basic alphabet used in English to "characters"
existing in other languages. What a language like Java defines as a
"*Character*" is in fact a *Unicode code point*, which is not necessarily
what we could call a "character", a "letter", a "symbol" or a "glyph".
Unicode code points may be invisible (have a zero width), may represent a
part of a symbol (e.g. an accent), or have a double width. For instance the
UTF-8 string ♈ followed by a text variation selector (U+2648 U+FE0E),
which represents the Aries constellation, is made of 6 bytes containing
2 Unicode code points: the first is ♈ (U+2648), which has a width of 2,
and the second is the variation selector U+FE0E, which has a width of 0
and serves only to prevent the Aries symbol from being rendered as an
emoji.
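
The byte and code point counts quoted above can be checked with a short
Python sketch using only the standard library. The helper name `describe`
is purely illustrative and not part of any VOTable tool.

```python
import unicodedata

# U+2648 ARIES followed by U+FE0E (variation selector requesting text style)
aries_text = "\u2648\ufe0e"

def describe(s):
    """Return (number of code points, number of UTF-8 bytes, code point names)."""
    names = [unicodedata.name(c, "<unnamed>") for c in s]
    return len(s), len(s.encode("utf-8")), names

n_points, n_bytes, names = describe(aries_text)
print(n_points, n_bytes, names)
```

Neither of the two numbers tells how many columns the string occupies on a
terminal, which is a third, separate notion of "length".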
There are many other traps in Unicode and its UTF-8 serialisation, such as
several ways of writing a single symbol: Ω can be coded as the 2-byte Greek
letter (U+03A9) or as the 3-byte Ohm unit (U+2126); similarly, letters with
an accent (e.g. Ô) may be coded with a 2-byte code point (U+00D4), or with
two code points in 3 bytes (O followed by U+0302), etc.; see e.g.
https://utf8everywhere.org/. As a
consequence, even the comparison of 2 UTF-8 strings for equality is *not*
an easy operation.
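
To illustrate the equality problem, here is a minimal Python sketch using
the standard `unicodedata` module. It assumes that canonical normalisation
(NFC) is the wanted notion of "same text", which is only one possible
choice; the function name `same_text` is hypothetical.

```python
import unicodedata

def same_text(a, b):
    """Compare two strings after canonical normalisation (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

greek_omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA, 2 bytes in UTF-8
ohm_sign = "\u2126"     # OHM SIGN, 3 bytes in UTF-8
o_circ_one = "\u00d4"   # precomposed O with circumflex, 2 bytes in UTF-8
o_circ_two = "O\u0302"  # O + COMBINING CIRCUMFLEX ACCENT, 3 bytes in UTF-8

# Plain comparison sees different code points; the normalised one does not.
print(greek_omega == ohm_sign, same_text(greek_omega, ohm_sign))
print(o_circ_one == o_circ_two, same_text(o_circ_one, o_circ_two))
```

A byte-wise or code-point-wise comparison therefore reports these pairs as
different, and any software comparing or indexing such strings has to agree
on a normalisation form first.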
Rather than a drastic change in the definition of the *char* data type, I
believe it would be much better to introduce a *String* datatype in
VOTable, which would be defined as *a UTF-8 sequence of Unicode code
points, excluding the U+0000 (null) code point*. Such a datatype would be
more flexible, without having to define what is a "character" or requiring
an *arraysize* attribute; it would moreover become possible to define
arrays of strings, which is currently problematic.
In the TABLEDATA serialization, the representation of a *String* is
straightforward; there is however a possible problem with the &-symbols:
while the &#-symbols are easily interpretable (numerical values like &#38;
or &#60;), what about alphabetic symbols like &amp; or &lt; ? If these
alphabetic symbols related to ascii characters can (and should) be
enumerated as it was in VOTable 1.5, what about the ever-growing list of
alphabetic symbols for Unicode characters like ⥫ or 𝕏 ? Should these be
explicitly excluded or accepted?
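
As an illustration of how a generic XML parser treats these cases today,
here is a small Python sketch based on the standard `xml.etree.ElementTree`
module. The helper `td_text` is hypothetical, and `&Xopf;` is used only as
an example of an HTML-style name that is not declared in the document.

```python
import xml.etree.ElementTree as ET

def td_text(xml_fragment):
    """Parse a single <TD> element; return its text, or None if rejected."""
    try:
        return ET.fromstring(xml_fragment).text
    except ET.ParseError:
        return None

print(td_text("<TD>&#38; &#x2648;</TD>"))  # numeric references: accepted
print(td_text("<TD>&amp; &lt;</TD>"))      # predefined XML names: accepted
print(td_text("<TD>&Xopf;</TD>"))          # undeclared name: parser error
```

Without a declaration of the extra names, a parser of this kind rejects
the whole document, so the question of which names are allowed cannot be
left open.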
The BINARY serialization would not be a problem, since the String would
just be a stream of bytes ending with a *null*; there would be no need to
specify a length preceding the stream of bytes, removing the requirement
of a maximal length (number of bytes, of code points, of glyphs or
whatever size).
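
A minimal Python sketch of such a null-terminated encoding follows. It is
only an assumption of what the proposed *String* could look like in a byte
stream, not an existing VOTable feature, and the function names are
hypothetical. It relies on the fact that the byte 0 appears in UTF-8 only
as the encoding of the null code point itself.

```python
def encode_string(s):
    """Encode one string as UTF-8 bytes followed by a null terminator."""
    if "\x00" in s:
        raise ValueError("the null code point is excluded from String")
    return s.encode("utf-8") + b"\x00"

def decode_strings(buf):
    """Split a byte stream made of null-terminated UTF-8 strings."""
    out = []
    start = 0
    while start < len(buf):
        end = buf.index(b"\x00", start)  # ValueError if terminator missing
        out.append(buf[start:end].decode("utf-8"))
        start = end + 1
    return out

# Three strings of different lengths, one of them empty, in a single stream
stream = b"".join(encode_string(s) for s in ["M 31", "\u03a9 Cen", ""])
print(decode_strings(stream))
```

Since each string carries its own terminator, consecutive strings of
different lengths can follow each other without any declared size.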
The FITS serialization would be a problem, since this type does not (yet)
exist in FITS; there were several discussions about adding UTF-8 in FITS,
and an obvious possibility would be to save the string contents in the
heap, while the binary table row would contain just a pointer to the
location of the string in the heap.
Finally, shouldn't the introduction of UTF-8 in VOTable also specify whether
UTF-8 would be acceptable in attribute values? Could the *name* or *value*
attribute of a <FIELD>, <INFO>, <PARAM> contain "characters" outside the
restricted-ascii set?
Sorry for being a bit long, but I think the radical change from ascii to
UTF-8 deserves careful thought about the multiple implications involved.
Cheers,
François Ochsenbein