Questions about UTF-8 in VOTable

Mon Jun 29 14:02:51 CEST 2026

François,

On Fri, 19 Jun 2026, Francois Ochsenbein via apps wrote:

> VOTable 1.6 proposes to change the definition of the *char * datatype from
> ascii to utf-8. I really think *it is not a good idea*, and a new datatype
> able to handle utf-8 strings should be preferred if the exchange of tables
> containing non-ascii data is required. This is why I introduced the
> unicodeChar in the first version of VOTable: open a possibility of
> exchanging textual data not limited to ascii-only characters. Unicode was
> in active development at that time (2002), and choosing Unicode for the
> expansion of textual data seemed the obvious way, as opposed to a choice of
> a *charset* which enlarges the alphabet to a very limited set.
>
> Currently virtually 100% of non-numeric data existing in astronomical
> tables consist in a sequence of *restricted ascii characters* as defined in
> FITS (bytes with decimal values between 32 and 126, excluding therefore
> control characters). Considering the importance of such non-numerical data,
> it seems fundamental that a <FIELD> made of *restricted ascii* characters
> continues to exist in VOTable.

The benefit of repurposing the char datatype to carry Unicode
instead of ASCII is to minimise the change required to software,
and to reduce the impact in the VO where different components
may be updated to VOTable 1.6 on significantly different timescales.

By doing it this way, (a) all VOTable 1.6 software will be able to
read pre-VOTable 1.6 content without any special arrangements in
the code, and (b) most pre-VOTable 1.6 software will be able to read
most VOTable 1.6 content without even noticing the difference. 
Admittedly (b) is not 100% true, since ASCII-expecting code
encountering Unicode/UTF-8 bytes may behave strangely, but
(i) in many cases this will result in only slightly garbled output,
or even in output as intended in the case that Unicode rather
than ASCII machinery is in fact used to decode the byte stream; and
(ii) given that most textual content is likely to continue to fall
within the ASCII range, such issues will probably only affect
a small minority of the text encountered.
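
To illustrate (i) and (ii), here is a minimal Python sketch.  The two
reader functions are hypothetical stand-ins, not code from any actual
VOTable library; the legacy one assumes a one-byte-per-character
(Latin-1 style) decoder, which is only one of the ways a
pre-VOTable 1.6 reader might behave.

```python
def decode_legacy(raw: bytes) -> str:
    """Hypothetical pre-1.6 reader: treats every byte as one character."""
    return raw.decode("latin-1")


def decode_utf8(raw: bytes) -> str:
    """Hypothetical 1.6-aware reader: treats the bytes as UTF-8."""
    return raw.decode("utf-8")


# An ASCII-only cell and a cell with one non-ASCII character,
# both serialized as UTF-8 bytes.
ascii_cell = "NGC 1234".encode("utf-8")
accented_cell = "G\u00f6rlitz".encode("utf-8")

# ASCII content is byte-for-byte the same in both interpretations.
print(decode_legacy(ascii_cell) == decode_utf8(ascii_cell))

# Non-ASCII content comes out slightly garbled in the legacy reader,
# and as intended in the UTF-8 reader.
print(decode_legacy(accented_cell))
print(decode_utf8(accented_cell))
```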

If we introduce a new datatype for Unicode, then software writing
character data to VOTable text will need to decide for each
column whether to write a char column, which cannot carry
non-ASCII content but can (probably) be read by all VOTable readers,
or a unicodeString(?) column, which can only be read by
VOTable 1.6-aware readers.

Making that decision at write time, i.e. knowing whether string
content is ASCII-only, is typically not easy.  In a programming
environment where strings are natively Unicode rather than ASCII,
output code can't assume ASCII unless it has additional information
about the character data values (for instance that they originated
from FITS, or comprise ISO-8601 timestamps), so the only safe
decision is to write them as Unicode.  Alternatively it could make
an additional pass through the data to check for ASCIIness, but that
adds expense and inhibits streaming.  If it writes a new
unicodeString type then pre-VOTable 1.6 readers have no chance
to make sense of it.  Note also that in the case of BINARY/BINARY2
encoding, a pre-VOTable 1.6 reader would not only fail to understand
the content of unicodeString columns, but would be unable to read
any of the data stream for a table containing a unicodeString field,
because it wouldn't know how to count bytes to skip over
the Unicode parts.
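
The extra pass can be sketched like this in Python.  choose_datatype
is a hypothetical helper, and "unicodeString" is only the name floated
above, not an existing VOTable datatype.  Since the FIELD declaration
precedes the data, every value in the column has to be inspected
before the first row can be written.

```python
from typing import Iterable


def choose_datatype(values: Iterable[str]) -> str:
    """Pick a datatype for a string column (hypothetical writer logic).

    Returns "char" only if every value is pure ASCII; otherwise the
    hypothetical "unicodeString".  Note this consumes the whole column.
    """
    return "char" if all(v.isascii() for v in values) else "unicodeString"


# One curly apostrophe (U+2019) in the last value is enough to
# change the answer for the entire column.
column = ["M31", "NGC 253", "Sgr A*", "Barnard\u2019s Star"]
print(choose_datatype(column[:3]))
print(choose_datatype(column))
```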

Code that reads ASCII or Unicode strings won't, in most cases,
treat the content any differently: it's a string.
Admittedly there may be exceptions; for instance software might want
to refuse to write a non-ASCII column to a FITS BINTABLE A-format
column.  But typically for such situations (at least it's what
I'd do) it would be reasonable to make a best-efforts attempt
and just transform non-ASCII characters to a '?' or similar,
in which case knowing that it's ASCII doesn't buy you much.
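
A minimal Python sketch of that best-efforts approach follows.  The
function name to_fits_ascii is made up for illustration, and the
sketch only deals with non-ASCII characters; it does not filter the
control characters that FITS also excludes.

```python
def to_fits_ascii(text: str) -> bytes:
    """Encode text as ASCII, replacing each non-ASCII code point with '?'."""
    return text.encode("ascii", errors="replace")


# ASCII text passes through unchanged; each non-ASCII code point
# becomes a single '?'.
print(to_fits_ascii("NGC 1234"))
print(to_fits_ascii("G\u00f6rlitz"))
```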

Basically: use of Unicode is normal for text these days, ASCII is
the special case.  The effect of having separate arrangements to
process these formats would be (I claim) more to increase
complication than to allow for simplified processing in some cases.

We have discussed this approach at some length over the last year
or so.  Following an initial email discussion on the apps list
http://mail.ivoa.net/pipermail/apps/2025-June/thread.html,
which included a posting from you voicing concern about it,
I drafted a Pull Request on github setting out my approach in
concrete terms: https://github.com/ivoa-std/VOTable/pull/71
This received quite a bit of scrutiny from potential users and
implementers, so I made various changes and presented it at Görlitz
https://wiki.ivoa.net/internal/IVOA/InterOpNov2025Apps/votable.pdf
I then invited further comments and objections on the mailing list
http://mail.ivoa.net/pipermail/apps/2025-November/001793.html
(none forthcoming) prior to merging it in November 2025.
It is now part of the current Working Draft
https://www.ivoa.net/Documents/VOTable/20260413/
and we have at least three prototype/production implementations.

Your suggestion of a new, non-fixed-size unicodeString type is not
absurd, but for the reasons above I don't personally support it,
and it's quite late in the process to back out of the currently
drafted changes.  However if there is broad support for it instead
of repurposing datatype="char", we can consider that.

I have made more detailed responses to some of your other points below.

> Notice that Unicode and its UTF-8 serialisation is much more complex than
> just an extension of the basic alphabet used in English to "characters"
> existing in other languages. What a language like Java defines as a "
> *Character*" is in fact a *Unicode code point,* which is not necessarily
> what we could call a "character", a "letter", a "symbol" or a "glyph".
> Unicode code points may be invisible (have a zero width), may represent a
> part of a symbol (e.g. an accent), or have a double width. For instance the
> UTF-8 string &#x2648;&#xFE0E; which represents the Aries constellation, is
> made of 6 bytes containing 2 Unicode code points: the first is &#x2648;
> which has a width of 2, and the second is &#xFE0E; which has a width of 0
> and has just a role of preventing from rendering the Aries symbol as an
> emoji (♈).
>
> There are many other traps in Unicode and its UTF-8 serialisation, such as
> several ways of writing a unique symbol like Ω as a 2-byte greek letter
> (&#x3A9;) or as the 3-byte Ohm unit (&#x2126;); similarly letters with an
> accent (e.g. Ô) may be coded with a 2-byte code point (&#xD4;), or with
> two code points in 3 bytes (O&#x302;) etc. etc. see e.g.
> https://utf8everywhere.org/ . As a
> consequence, even the comparison of 2 UTF-8 strings for equality is *not*
> an easy operation.

That is all true and well-understood.  But the large majority of
modern programming environments (e.g. Python, JavaScript, Java;
it's an option in Rust) deal with it transparently, since their
native string type is defined as Unicode and not as ASCII.
Knowing that text is ASCII does not therefore convey much benefit
in most programming contexts.
Nearly all software is written these days in an environment in which
strings are assumed Unicode, but that doesn't mean that programmers
spend their time worrying about the fact that Aries is represented
by two code points or that there is no unique way to encode something
that looks like an Omega or accented characters.  Comparison of two
UTF-8 sequences for equality *is* an easy operation, though it will
not necessarily yield true for two strings whose pixel rendering is
identical.
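
As a short Python illustration of that distinction: equality of the
encoded sequences is a plain comparison, whereas treating the
look-alike forms as equal needs an explicit normalization step.
Using NFC here is my choice for the sketch; nothing in the VOTable
draft is assumed to require normalization.

```python
import unicodedata

greek_omega = "\u03a9"        # GREEK CAPITAL LETTER OMEGA, 2 bytes in UTF-8
ohm_sign = "\u2126"           # OHM SIGN, 3 bytes in UTF-8
o_circ_composed = "\u00d4"    # O with circumflex as one code point
o_circ_decomposed = "O\u0302" # O followed by COMBINING CIRCUMFLEX ACCENT


def same_bytes(a: str, b: str) -> bool:
    """Easy, well-defined equality of the UTF-8 sequences."""
    return a.encode("utf-8") == b.encode("utf-8")


def same_after_nfc(a: str, b: str) -> bool:
    """Equality after canonical (NFC) normalization, if an application wants it."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_bytes(greek_omega, ohm_sign))
print(same_after_nfc(greek_omega, ohm_sign))
print(same_bytes(o_circ_composed, o_circ_decomposed))
print(same_after_nfc(o_circ_composed, o_circ_decomposed))
```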

> Rather than a drastic change in the definition of the *char* data type, I
> believe it would be much better to introduce a *String* datatype in
> VOTable, which would be defined as *a UTF-8 sequence of Unicode code
> points, excluding the U+0000 (null) code point*.  Such a datatype would be
> more flexible, without having to define what is a "character" or requiring
> an *arraysize* attribute; it would moreover become possible to define
> arrays of strings, which is currently problematic.

The current proposal does not need to define what a "character" is.

> In the TABLEDATA serialization, the representation of a *String* is
> straightforward — there is however a possible problem with the &-symbols :
> while the &#-symbols are easily interpretable (numerical values like &#x26;
> or &#x3C;), what about alphabetic symbols like &amp; or &lt; ? If these
> alphabetic symbols related to ascii characters can (and should) be
> enumerated as it was in VOTable 1.5, what about the ever-growing list of
> Unicode symbols like &llhard; (⥫) or &Xopf; (𝕏) ? Should these be
> explicitly excluded or accepted?

None of these "&-symbols" present VOTable-specific issues.
The Unicode content of elements and attributes in a well-formed
XML document is well-defined, and character entity references
(numeric or one of the five &lt; &gt; &amp; &apos; &quot;) are
generally handled and decoded into a stream of code points by an
XML processing layer before application software sees them.
Named entities for symbols like ⥫ (defined by HTML5) are not legal in XML
unless specifically defined in an associated DTD (and thus processed
by the XML parser). This is completely standard XML processing,
and no VOTable-specific discussion is required.
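
A small Python sketch using the standard library XML parser shows the
behaviour described.  The bare <TD> fragments are toy inputs rather
than complete VOTable documents, and the helper names are made up for
illustration.

```python
import xml.etree.ElementTree as ET


def parse_cell(xml_text: str) -> str:
    """Return the text of a single element as the application sees it."""
    return ET.fromstring(xml_text).text


def is_well_formed(xml_text: str) -> bool:
    """True if the XML parser accepts the fragment."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Numeric references and the five predefined entities are decoded by
# the parser; the application just receives code points.
cell = parse_cell("<TD>&#x2648;&#xFE0E; &amp; &lt;</TD>")
print([hex(ord(c)) for c in cell])

# An HTML5-only entity name is rejected by the XML parser unless a
# DTD defines it.
print(is_well_formed("<TD>&Xopf;</TD>"))
```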

> The BINARY serialization would not be a problem, since the String would
> just be a stream of bytes ending with a *null*; there would be no need to
>  specify a length preceding the stream of bytes, removing the requirement
> of a maximal
> length (number of bytes, or of code points, of glyphs or whatever size)

This would then be unlike any of the existing datatypes in VOTable,
all of which are fixed size, so probably quite a bit of redrafting
would be necessary.  It would mean that you can't skip over
data in a BINARY stream without reading all the bytes.  1-d string
arrays do become easier to encode, though 2+-d string arrays would require
some special arrangements.  It also means that any VOTable reader
that doesn't know about the new datatype has no chance to read any
of a BINARY stream containing such data.  Is one of these the reason
why strings were not defined this way in the original version
of VOTable?
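
The skipping point can be sketched in Python as follows.  This
assumes, as I understand the existing BINARY encoding, that a
variable-length char array is preceded by a 4-byte big-endian item
count; the null-terminated layout is the hypothetical one from the
proposal, and both function names are invented for the sketch.

```python
import struct


def skip_counted(buf: bytes, pos: int) -> int:
    """Skip a counted char array: read the 4-byte count, then jump over it."""
    (count,) = struct.unpack_from(">i", buf, pos)
    return pos + 4 + count


def skip_null_terminated(buf: bytes, pos: int) -> int:
    """Skip a hypothetical null-terminated string: every byte must be scanned."""
    return buf.index(b"\x00", pos) + 1


# One row in each layout: the string "M31" followed by a double.
counted_row = struct.pack(">i", 3) + b"M31" + struct.pack(">d", 1.5)
terminated_row = b"M31\x00" + struct.pack(">d", 1.5)

# Either way the reader ends up at the double, but only the counted
# form lets it get there without examining the string bytes.
print(struct.unpack_from(">d", counted_row, skip_counted(counted_row, 0))[0])
print(struct.unpack_from(">d", terminated_row, skip_null_terminated(terminated_row, 0))[0])
```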

> The FITS serialization would be a problem, since this type does not (yet)
> exist in FITS; there were several discussions about adding UTF-8 in FITS,
> and an obvious possibility would be to save the string contents in the
> heap, while the binary table row would contain just a  pointer to the
> location of the string in the heap.
>
> Finally shouldn't the introduction of UTF-8 in VOTable also specify whether
> UTF-8 would be acceptable as attribute values ? Could the *name* or value
> attribute of a <FIELD>, <INFO>, <PARAM> contain "characters" outside the
> restricted-ascii set ?

The value type of a PARAM is already defined by its datatype attribute
in just the same way as for a FIELD.  INFO is defined in terms of a
PARAM with datatype="char" arraysize="*" (VOTable 1.5 sec 4.8),
so if char is changed to permit Unicode then INFO values will
automatically allow Unicode content as well.  As for FIELD/INFO/PARAM
names, these attributes are defined by the XSD schema as xs:token,
which means (with a few restrictions about control characters)
they can have any Unicode values, which has been the case since the
initial version of VOTable.  Again, this is normal XML business and
I don't think the VOTable standard would be enhanced by including
an XML primer.

> Sorry for being a bit long, but I think the radical change of transforming
> ascii into UTF-8 is worth thinking about the multiple implications involved.

I agree, we have not got to where we are without thinking about it.

Mark

--
Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
m.b.taylor at bristol.ac.uk          https://www.star.bristol.ac.uk/mbt/