Questions about UTF-8 in VOTable

Thu Jul 2 20:51:43 CEST 2026

Thank you Mark for taking the time to answer my concerns, and thanks
for the pointers to the previous discussions. I'm surprised, though,
that removing the possibility of specifying that a non-numerical column
is made of ascii-only characters has not raised more comments or
concerns. A few more comments are embedded below:

==> On 2026-06-29 at 13:02+0100,
      Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:

>François,
>
>On Fri, 19 Jun 2026, Francois Ochsenbein via apps wrote:
>
>> VOTable 1.6 proposes to change the definition of the *char*
>> datatype from ascii to utf-8. I really think *it is not a good
>> idea*, and a new datatype able to handle utf-8 strings should be
>> preferred if the exchange of tables containing non-ascii data is
>> required. This is why I introduced the unicodeChar in the first
>> version of VOTable: open a possibility of exchanging textual data
>> not limited to ascii-only characters. Unicode was in active
>> development at that time (2002), and choosing Unicode for the
>> expansion of textual data seemed the obvious way, as opposed to a
>> choice of a *charset* which enlarges the alphabet to a very limited
>> set.
>>
>> Currently virtually 100% of the non-numeric data in astronomical
>> tables consists of sequences of *restricted ascii characters* as
>> defined in FITS (bytes with decimal values between 32 and 126,
>> therefore excluding control characters). Considering the importance
>> of such non-numerical data, it seems fundamental that a <FIELD>
>> made of *restricted ascii* characters continues to exist in
>> VOTable.
>
>The benefit of repurposing the char datatype to carry Unicode
>instead of ASCII is to minimise the change required to software,
>and to reduce the impact in the VO where different components
>may be updated to VOTable 1.6 on significantly different timescales.
>
>By doing it this way, (a) all VOTable 1.6 software will be able to
>read pre-VOTable 1.6 content without any special arrangements in
>the code, and (b) most pre-VOTable 1.6 software will be able to read
>most VOTable 1.6 content without even noticing the difference.
>Admittedly (b) is not 100% true, since ASCII-expecting code
>encountering Unicode/UTF-8 bytes may behave strangely, but
>(i) in many cases this will result in only slightly garbled output,
>or even in output as intended in the case that Unicode rather
>than ASCII machinery is in fact used to decode the byte stream; and
>(ii) given that most textual content is likely to continue to fall
>within the ASCII range, such issues will probably only affect
>a small minority of the text encountered.

Well, if you see the contents as just text it's OK, but if the
contents are something to process, it is important to know that they
are ascii-only, rather than having to test each byte before
interpreting the field…
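
To make the cost concrete, here is a minimal Python sketch of the per-value test a consumer would have to run when the column metadata no longer guarantees restricted ascii. The function name is hypothetical (it is not part of any VOTable library); the 32 to 126 range is the FITS one quoted above.

```python
def is_restricted_ascii(raw: bytes) -> bool:
    """Return True if every byte is a FITS-style printable ASCII byte (32..126)."""
    return all(32 <= b <= 126 for b in raw)


# A plain identifier passes; a value containing UTF-8 multi-byte sequences does not.
print(is_restricted_ascii(b"NGC 1275"))               # True
print(is_restricted_ascii("α Cen".encode("utf-8")))   # False
```

The check is cheap for one value, but it has to be repeated for every cell of every character column.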

>
>If we introduce a new datatype for Unicode, then software writing
>character data to VOTable text will need to decide for each
>column whether to write a char column which is unable to carry
>non-ASCII content but can (probably) be read by all VOTable readers,
>or a unicodeString(?) column which can only be read by V1.6-aware
>VOTable readers.
>
>Making that decision at write time, i.e. knowing whether string
>content is ASCII-only, is typically not easy, since in a
>programming environment where strings are natively Unicode
>not ASCII, unless output code has additional information about
>character data values (for instance that it originated from FITS,
>or comprises ISO-8601 timestamps) it can't assume ASCII and can only
>safely decide to write it as Unicode.  Alternatively it could make
>an additional pass through the data to check for ASCIIness but that
>adds expense and inhibits streaming.  If it writes a new
>unicodeString type then pre-VOTable1.6 readers have no chance
>to make sense of it.  Note also that in the case of BINARY/BINARY2
>encoding, a pre-VOTable1.6 reader would not only fail to understand
>the content of unicodeString columns, but would be unable to read
>any of the data stream for a table containing a unicodeString field,
>because it wouldn't know how to count bytes to skip over
>the Unicode parts.
>
>For code that reads ASCII or Unicode strings, in most cases it
>won't in any case treat the content differently - it's a string.
>Admittedly there may be exceptions; for instance software might want
>to refuse to write a non-ASCII column to a FITS BINTABLE A-format
>column.  But typically for such situations (at least it's what
>I'd do) it would be reasonable to make a best-efforts attempt
>and just transform non-ASCII characters to a '?' or similar,
>in which case knowing that it's ASCII doesn't buy you much.
>

Well, whether the contents are ascii or Unicode is in principle known
by the data producer, i.e. the original VOTable writer; it should not
be a decision taken at run-time. So my fundamental question is: how can
this knowledge of "pure ascii" contents be propagated to the data
consumer if the definition of the "character" datatype is modified? The
proposal of using the "width" attribute can't work with unspecified
length, and looks weird since the definition of a "width" would then
differ between character and other datatypes (pull/71).
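
For reference, the best-efforts fallback mentioned in the quoted text (turning non-ASCII characters into '?') could look like the following Python sketch. The function name is hypothetical; only the standard `str.encode` and `str.isascii` calls are used.

```python
def to_fits_ascii(text: str) -> bytes:
    """Best-effort ASCII encoding: each non-ASCII code point becomes b'?'."""
    return text.encode("ascii", errors="replace")


name = "\u00c5ngstr\u00f6m"   # "Ångström" written with precomposed letters
print(name.isascii())         # False
print(to_fits_ascii(name))    # b'?ngstr?m'
```

The conversion is lossy, and the consumer only discovers that at run-time, value by value.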

>Basically: use of Unicode is normal for text these days, ASCII is
>the special case.  The effect of having separate arrangements to
>process these formats would be (I claim) more to increase
>complication than to allow for simplified processing in some cases.
>
>We have discussed this approach at some length over the last year
>or so.  Following an initial email discussion on the apps list
>http://mail.ivoa.net/pipermail/apps/2025-June/thread.html,
>which included a posting from you voicing concern about it,
>I drafted a Pull Request on github setting out my approach in
>concrete terms: https://github.com/ivoa-std/VOTable/pull/71
>This received quite a bit of scrutiny from potential users and
>implementers, so I made various changes and presented it at Gorlitz
>https://wiki.ivoa.net/internal/IVOA/InterOpNov2025Apps/votable.pdf
>I then invited further comments and objections on the mailing list
>http://mail.ivoa.net/pipermail/apps/2025-November/001793.html
>(none forthcoming) prior to merging it in November 2025.
>It is now part of the current Working Draft
>https://www.ivoa.net/Documents/VOTable/20260413/
>and we have at least three prototype/production implementations.
>
>Your suggestion of a new, non-fixed-size unicodeString type is not
>absurd, but for the reasons above I don't personally support it,
>and it's quite late in the process to back out of the currently
>drafted changes.  However if there is broad support for it instead
>of repurposing datatype="char", we can consider that.

You are probably right, the introduction of a new datatype would be
better done in a "major" release (VOTable-2.0?).

>
>I have made more detailed responses to some of your other points below.
>
>> Notice that Unicode and its UTF-8 serialisation is much more complex
>> than just an extension of the basic alphabet used in English to
>> "characters" existing in other languages. What a language like Java
>> defines as a "*Character*" is in fact a *Unicode code point*, which
>> is not necessarily what we could call a "character", a "letter", a
>> "symbol" or a "glyph". Unicode code points may be invisible (have a
>> zero width), may represent a part of a symbol (e.g. an accent), or
>> have a double width. For instance the UTF-8 string &#x2648;&#xFE0E;
>> which represents the Aries constellation, is made of 6 bytes
>> containing 2 Unicode code points: the first is &#x2648; which has a
>> width of 2, and the second is &#xFE0E; which has a width of 0 and
>> serves only to prevent the Aries symbol from being rendered as an
>> emoji (♈).
>>
>> There are many other traps in Unicode and its UTF-8 serialisation,
>> such as several ways of writing a single symbol like Ω, as a 2-byte
>> Greek letter (&#x3A9;) or as the 3-byte Ohm unit (&#x2126;);
>> similarly letters with an accent (e.g. Ô) may be coded with a 2-byte
>> code point (&#xD4;), or with two code points in 3 bytes (O&#x302;),
>> etc.; see e.g. https://utf8everywhere.org/. As a consequence, even
>> the comparison of 2 UTF-8 strings for equality is *not* an easy
>> operation.
>
>That is all true and well-understood.  But the large majority of
>modern programming environments (e.g. Python, JavaScript, Java;
>it's an option in Rust) deal with it transparently, since their
>native string type is defined as Unicode and not as ASCII.
>Knowing that text is ASCII does not therefore convey much benefit
>in most programming contexts.
>Nearly all software is written these days in an environment in which
>strings are assumed Unicode, but that doesn't mean that programmers
>spend their time worrying about the fact that Aries is represented
>by two code points or that there is no unique way to encode something
>that looks like an Omega or accented characters.  Comparison of two
>UTF-8 sequences for equality *is* an easy operation, though it will
>not necessarily yield true for two strings whose pixel rendering is
>identical.
>
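Both statements about comparison can be checked with a short Python sketch that uses only the standard `unicodedata` module (variable and function names are hypothetical). The code-point comparison is trivial to perform, but it reports the two Ω and the two Ô as different unless the strings are normalised first.

```python
import unicodedata

omega_greek = "\u03a9"        # GREEK CAPITAL LETTER OMEGA, 2 bytes in UTF-8
omega_ohm = "\u2126"          # OHM SIGN, 3 bytes in UTF-8
o_circ_single = "\u00d4"      # precomposed O with circumflex
o_circ_combined = "O\u0302"   # O followed by COMBINING CIRCUMFLEX ACCENT


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(omega_greek == omega_ohm)                        # False
print(same_after_nfc(omega_greek, omega_ohm))          # True
print(o_circ_single == o_circ_combined)                # False
print(same_after_nfc(o_circ_single, o_circ_combined))  # True
```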

Sorry to disagree with the "transparency" of Unicode in the various
languages: for instance length("άβγ♈︎♉︎😊𝕏") gives 12 in Javascript
or Java, while Python3 or awk return 10, which is the correct number of
Unicode code-points in this string(*). Similarly extracting a substring
out of a string gives different results depending on your programming
language when the string contains code-point(s) beyond the BMP.
(*) The Unicode contents of this string are
\u{3B1}\u{301}\u{3B2}\u{3B3}\u{2648}\u{FE0E}\u{2649}\u{FE0E}\u{1F60A}\u{1D54F}
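
The three possible "lengths" of this string can be reproduced from Python alone, as a small sketch: the number of code points, the number of UTF-16 code units (which is what JavaScript and Java report), and the number of UTF-8 bytes. The string is built from the escapes listed above so that no invisible character is lost in transit.

```python
s = "\u03b1\u0301\u03b2\u03b3\u2648\ufe0e\u2649\ufe0e\U0001f60a\U0001d54f"

code_points = len(s)                            # Python counts code points
utf16_units = len(s.encode("utf-16-le")) // 2   # two bytes per UTF-16 code unit
utf8_bytes = len(s.encode("utf-8"))             # size once serialised as UTF-8

print(code_points, utf16_units, utf8_bytes)     # 10 12 28
```

The difference between 10 and 12 comes from the last two code points, which lie beyond the BMP and each need a surrogate pair in UTF-16.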

>> Rather than a drastic change in the definition of the *char* data
>> type, I believe it would be much better to introduce a *String*
>> datatype in VOTable, which would be defined as *a UTF-8 sequence of
>> Unicode code points, excluding the U+0000 (null) code point*.  Such
>> a datatype would be more flexible, without having to define what is
>> a "character" or requiring an *arraysize* attribute; it would
>> moreover become possible to define arrays of strings, which is
>> currently problematic.
>
>The current proposal does not need to define what a "character" is.
>
>> In the TABLEDATA serialization, the representation of a *String* is
>> straightforward; there is however a possible problem with the
>> &-symbols: while the &#-symbols are easily interpretable (numerical
>> values like &#x26; or &#x3C;), what about alphabetic symbols like
>> &amp; or &lt;? If these alphabetic symbols related to ascii
>> characters can (and should) be enumerated as it was in VOTable 1.5,
>> what about the ever-growing list of Unicode symbols like &llhard;
>> (⥫) or &Xopf; (𝕏)? Should these be explicitly excluded or
>> accepted?
>
>None of these "&-symbols" present VOTable-specific issues.
>The Unicode content of elements and attributes in a well-formed
>XML document is well-defined, and character entity references
>(numeric or one of the five &lt; &gt; &amp; &apos; &quot;) are
>generally handled and decoded into a stream of code points by an
>XML processing layer before application software sees them.
>Entities like &llhard; (defined by HTML5) are not legal in XML
>unless specifically defined in an associated DTD (and thus processed
>by the XML parser). This is completely standard XML processing,
>and no VOTable-specific discussion is required.
>

Thank you for the clarification!
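
This behaviour can be observed with Python's standard `xml.etree.ElementTree` parser, used here only as one example of an XML processing layer (the `<TD>` snippets are made up for the illustration). Numeric references and the predefined entities are decoded before the application sees them, while an HTML5-only name is rejected.

```python
import xml.etree.ElementTree as ET

# Numeric references and predefined entities are decoded by the parser.
td = ET.fromstring("<TD>&#x26; &lt; &amp; &#x1D54F;</TD>")
print(td.text)   # & < & 𝕏

# &Xopf; is an HTML5 name, undefined in XML unless a DTD declares it.
rejected = False
try:
    ET.fromstring("<TD>&Xopf;</TD>")
except ET.ParseError as err:
    rejected = True
    print("rejected:", err)
```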

>> The BINARY serialization would not be a problem, since the String
>> would just be a stream of bytes ending with a *null*; there would be
>> no need to specify a length preceding the stream of bytes, removing
>> the requirement of a maximal length (number of bytes, of code
>> points, of glyphs or whatever size).
>
>This would then be unlike any of the existing datatypes in VOTable,
>all of which are fixed size, so probably quite a bit of redrafting
>would be necessary.  It would mean that you can't skip over
>data in a BINARY stream without reading all the bytes.  1-d string
>arrays do become easier to encode, though 2+-d string arrays would
>require some special arrangements.

In fact 1-d string arrays would also require a clarification on how to
write them in the <TABLEDATA> serialization, unless quoted strings are
standard in XML?

>It also means that any VOTable
>reader that doesn't know about the new datatype has no chance to read
>any of a BINARY stream containing such data.  Is one of these reasons
>why strings were not defined this way in the original version
>of VOTable?

The reason was complete compatibility with FITS, even though the
unicodeChar was an attempt to expand the character content beyond the
ascii set. But defining a String datatype from the beginning would have
been wiser (sorry)…
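
The point about skipping data in a BINARY stream can be illustrated with a small Python sketch. It assumes the layout used for variable-length arrays in the current BINARY serialization (a 4-byte big-endian count followed by the items), with one byte per counted item; the function names and the sample buffers are hypothetical.

```python
import struct


def skip_counted(buf: bytes, pos: int) -> int:
    """Skip a value stored as a 4-byte big-endian count plus that many bytes."""
    (count,) = struct.unpack_from(">i", buf, pos)
    return pos + 4 + count   # no need to look at the payload


def skip_null_terminated(buf: bytes, pos: int) -> int:
    """Skip a value ending with a 0 byte; the payload has to be scanned."""
    return buf.index(0, pos) + 1


counted = struct.pack(">i", 5) + b"Aries" + b"\x01"
terminated = b"Aries\x00\x01"

print(skip_counted(counted, 0))             # 9
print(skip_null_terminated(terminated, 0))  # 6
```

In both cases the returned offset points at the next field; the difference is that the null-terminated form requires reading every byte of the string to find it.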

>
>> The FITS serialization would be a problem, since this type does not
>> (yet) exist in FITS; there were several discussions about adding
>> UTF-8 in FITS, and an obvious possibility would be to save the
>> string contents in the heap, while the binary table row would
>> contain just a pointer to the location of the string in the heap.
>>
>> Finally shouldn't the introduction of UTF-8 in VOTable also specify
>> whether UTF-8 would be acceptable as attribute values? Could the
>> *name* or *value* attribute of a <FIELD>, <INFO>, <PARAM> contain
>> "characters" outside the restricted-ascii set?
>
>The value type of a PARAM is already defined by its datatype attribute
>in just the same way as for a FIELD.  INFO is defined in terms of a
>PARAM with datatype="char" arraysize="*" (VOTable 1.5 sec 4.8),
>so if char is changed to permit Unicode then INFO values will
>automatically allow Unicode content as well.  As for FIELD/INFO/PARAM
>names, these attributes are defined by the XSD schema as xs:token,
>which means (with a few restrictions about control characters)
>they can have any Unicode values, which has been the case since the
>initial version of VOTable.  Again, this is normal XML business and
>I don't think the VOTable standard would be enhanced by including
>an XML primer.

You are right, thank you for the clarification 😊

>
>> Sorry for being a bit long, but I think the radical change of
>> transforming ascii into UTF-8 is worth thinking about the multiple
>> implications involved.
>
>I agree, we have not got to where we are without thinking about it.
>
>Mark
>

Cheers, François