From francois.ochsenbein at gmail.com Thu Jul 2 20:51:43 2026
From: francois.ochsenbein at gmail.com (Francois Ochsenbein)
Date: Thu, 2 Jul 2026 20:51:43 +0200
Subject: Questions about UTF-8 in VOTable
Message-ID:

Thank you, Mark, for taking the time to answer my concerns, and thanks
for the pointers to the previous discussions; but I'm surprised that
removing the possibility of specifying that a non-numerical column is
made of ascii-only characters does not raise more comments or concerns.

A few more comments are embedded below:

==> On 2026-06-29 at 13:02+0100, Mark Taylor wrote:
>François,
>
>On Fri, 19 Jun 2026, Francois Ochsenbein via apps wrote:
>
>> VOTable 1.6 proposes to change the definition of the *char*
>> datatype from ascii to utf-8. I really think *it is not a good
>> idea*, and a new datatype able to handle utf-8 strings should be
>> preferred if the exchange of tables containing non-ascii data is
>> required. This is why I introduced the unicodeChar in the first
>> version of VOTable: open a possibility of exchanging textual data
>> not limited to ascii-only characters. Unicode was in active
>> development at that time (2002), and choosing Unicode for the
>> expansion of textual data seemed the obvious way, as opposed to a
>> choice of a *charset* which enlarges the alphabet to a very limited
>> set.
>>
>> Currently virtually 100% of non-numeric data existing in astronomical
>> tables consist in a sequence of *restricted ascii characters* as
>> defined in FITS (bytes with decimal values between 32 and 126,
>> excluding therefore control characters). Considering the importance
>> of such non-numerical data, it seems fundamental that a datatype
>> made of *restricted ascii* characters continues to exist in
>> VOTable.
>
>The benefit of repurposing the char datatype to carry Unicode
>instead of ASCII is to minimise the change required to software,
>and to reduce the impact in the VO where different components
>may be updated to VOTable 1.6 on significantly different timescales.
>
>By doing it this way, (a) all VOTable 1.6 software will be able to
>read pre-VOTable 1.6 content without any special arrangements in
>the code, and (b) most pre-VOTable 1.6 software will be able to read
>most VOTable 1.6 content without even noticing the difference.
>Admittedly (b) is not 100% true, since ASCII-expecting code
>encountering Unicode/UTF-8 bytes may behave strangely, but
>(i) in many cases this will result in only slightly garbled output,
>or even in output as intended in the case that Unicode rather
>than ASCII machinery is in fact used to decode the byte stream; and
>(ii) given that most textual content is likely to continue to fall
>within the ASCII range, such issues will probably only affect
>a small minority of the text encountered.

Well, if you treat the contents as just text, that is fine; but if the
contents are something to process, it is important to have this
information, rather than having to test each byte before interpreting
the field.

>
>If we introduce a new datatype for Unicode, then software writing
>character data to VOTable text will need to decide for each
>column whether to write a char column which is unable to carry
>non-ASCII content but can (probably) be read by all VOTable readers,
>or a unicodeString(?) column which can only be read by V1.6-aware
>VOTable readers.
>
>Making that decision at write time, i.e.
>knowing whether string
>content is ASCII-only, is typically not easy, since in a
>programming environment where strings are natively Unicode
>not ASCII, unless output code has additional information about
>character data values (for instance that it originated from FITS,
>or comprises ISO-8601 timestamps) it can't assume ASCII and can only
>safely decide to write it as Unicode. Alternatively it could make
>an additional pass through the data to check for ASCIIness but that
>adds expense and inhibits streaming. If it writes a new
>unicodeString type then pre-VOTable1.6 readers have no chance
>to make sense of it. Note also that in the case of BINARY/BINARY2
>encoding, a pre-VOTable1.6 reader would not only fail to understand
>the content of unicodeString columns, but would be unable to read
>any of the data stream for a table containing a unicodeString field,
>because it wouldn't know how to count bytes to skip over
>the Unicode parts.
>
>For code that reads ASCII or Unicode strings, in most cases it
>won't in any case treat the content differently - it's a string.
>Admittedly there may be exceptions; for instance software might want
>to refuse to write a non-ASCII column to a FITS BINTABLE A-format
>column. But typically for such situations (at least it's what
>I'd do) it would be reasonable to make a best-efforts attempt
>and just transform non-ASCII characters to a '?' or similar,
>in which case knowing that it's ASCII doesn't buy you much.
>

Well, whether the contents are ascii or Unicode is in principle known
by the data producer, i.e. the original VOTable writer; it should not
be a decision taken at run-time. So my fundamental question is: how
can this knowledge of "pure ascii" contents be propagated to the data
consumer if the definition of the "character" datatype is modified?
The proposal of using the "width" attribute can't work with unspecified
length, and looks odd since the definition of a "width" would differ
between character and other datatypes (pull/71).

>Basically: use of Unicode is normal for text these days, ASCII is
>the special case. The effect of having separate arrangements to
>process these formats would be (I claim) more to increase
>complication than to allow for simplified processing in some cases.
>
>We have discussed this approach at some length over the last year
>or so. Following an initial email discussion on the apps list
>http://mail.ivoa.net/pipermail/apps/2025-June/thread.html,
>which included a posting from you voicing concern about it,
>I drafted a Pull Request on github setting out my approach in
>concrete terms: https://github.com/ivoa-std/VOTable/pull/71
>This received quite a bit of scrutiny from potential users and
>implementers, so I made various changes and presented it at Gorlitz
>https://wiki.ivoa.net/internal/IVOA/InterOpNov2025Apps/votable.pdf
>I then invited further comments and objections on the mailing list
>http://mail.ivoa.net/pipermail/apps/2025-November/001793.html
>(none forthcoming) prior to merging it in November 2025.
>It is now part of the current Working Draft
>https://www.ivoa.net/Documents/VOTable/20260413/
>and we have at least three prototype/production implementations.
>
>Your suggestion of a new, non-fixed-size unicodeString type is not
>absurd, but for the reasons above I don't personally support it,
>and it's quite late in the process to back out of the currently
>drafted changes. However if there is broad support for it instead
>of repurposing datatype="char", we can consider that.

You are probably right: a new datatype would be better introduced in a
"major" release (VOTable-2.0?).

>
>I have made more detailed responses to some of your other points below.
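
To illustrate the write-time decision discussed above, here is a
minimal Python sketch of the extra pass that checks a column for the
"restricted ascii" range (decimal values 32 to 126). The function names
are hypothetical, chosen only for this illustration; they are not taken
from any VOTable library.

```python
def is_restricted_ascii(text):
    """True if every character lies in the range 32..126 (no control characters)."""
    return all(32 <= ord(ch) <= 126 for ch in text)


def column_is_restricted_ascii(values):
    """Scan a whole column once; this is the additional pass over the data."""
    return all(is_restricted_ascii(value) for value in values)


if __name__ == "__main__":
    # Plain identifiers and ISO-8601 timestamps stay within the range.
    print(column_is_restricted_ascii(["NGC 1234", "2026-07-02T20:51:43"]))
    # A single non-ascii character is enough to disqualify the column.
    print(column_is_restricted_ascii(["NGC 1234", "G\u00f6rlitz"]))
```

A writer that already knows the origin of its data (for instance FITS)
would not need this scan; that is the knowledge I would like to be able
to pass on to the data consumer.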
>
>> Notice that Unicode and its UTF-8 serialisation is much more complex
>> than just an extension of the basic alphabet used in English to
>> "characters" existing in other languages. What a language like Java
>> defines as a "*Character*" is in fact a *Unicode code point,* which
>> is not necessarily what we could call a "character", a "letter", a
>> "symbol" or a "glyph". Unicode code points may be invisible (have a
>> zero width), may represent a part of a symbol (e.g. an accent), or
>> have a double width. For instance the UTF-8 string &#x2648;&#xFE0E;
>> which represents the Aries constellation, is made of 6 bytes
>> containing 2 Unicode code points: the first is &#x2648; which has a
>> width of 2, and the second is &#xFE0E; which has a width of 0 and
>> has just a role of preventing from rendering the Aries symbol as an
>> emoji (♈️).
>>
>> There are many other traps in Unicode and its UTF-8 serialisation,
>> such as several ways of writing a unique symbol like Ω as a 2-byte
>> greek letter (&#x3A9;) or as the 3-byte Ohm unit (&#x2126;);
>> similarly letters with an accent (e.g. Ô) may be coded with a 2-byte
>> code point (&#xD4;), or with two code points in 3 bytes (O&#x302;)
>> etc. etc. see e.g. https://utf8everywhere.org/
>> . As a consequence, even the
>> comparison of 2 UTF-8 strings for equality is *not* an easy
>> operation.
>
>That is all true and well-understood. But the large majority of
>modern programming environments (e.g. Python, JavaScript, Java;
>it's an option in Rust) deal with it transparently, since their
>native string type is defined as Unicode and not as ASCII.
>Knowing that text is ASCII does not therefore convey much benefit
>in most programming contexts.
>Nearly all software is written these days in an environment in which
>strings are assumed Unicode, but that doesn't mean that programmers
>spend their time worrying about the fact that Aries is represented
>by two code points or that there is no unique way to encode something
>that looks like an Omega or accented characters.
>Comparison of two
>UTF-8 sequences for equality *is* an easy operation, though it will
>not necessarily yield true for two strings whose pixel rendering is
>identical.
>

Sorry to disagree with the "transparency" of Unicode in the various
languages: for instance length("άβγ♈︎♉︎😊𝕏") gives 12 in JavaScript or
Java, while Python 3 or awk return 10, which is the correct number of
Unicode code points in this string (*). Similarly, extracting a
substring gives different results depending on the programming language
when the string contains code points beyond the BMP.

(*) the Unicode contents of this string are
\u{3B1}\u{301}\u{3B2}\u{3B3}\u{2648}\u{FE0E}\u{2649}\u{FE0E}\u{1F60A}\u{1D54F}

>> Rather than a drastic change in the definition of the *char* data
>> type, I believe it would be much better to introduce a *String*
>> datatype in VOTable, which would be defined as *a UTF-8 sequence of
>> Unicode code points, excluding the *&#0;* (null) code point*. Such
>> a datatype would be more flexible, without having to define what is
>> a "character" or requiring an *arraysize* attribute; it would
>> moreover become possible to define arrays of strings, which is
>> currently problematic.
>
>The current proposal does not need to define what a "character" is.
>
>> In the TABLEDATA serialization, the representation of a *String* is
>> straightforward - there is however a possible problem with the
>> &-symbols : while the &#-symbols are easily interpretable (numerical
>> values like &#x26; or &#x3C;), what about alphabetic symbols like
>> &amp; or &lt; ? If these alphabetic symbols related to ascii
>> characters can (and should) be enumerated as it was in VOTable 1.5,
>> what about the ever-growing list of Unicode symbols like &llhard;
>> (⥫) or &Xopf; (𝕏) ? Should these be explicitly excluded or
>> accepted?
>
>None of these "&-symbols" present VOTable-specific issues.
>The Unicode content of elements and attributes in a well-formed
>XML document is well-defined, and character entity references
>(numeric or one of the five &lt; &gt; &amp; &apos; &quot;) are
>generally handled and decoded into a stream of code points by an
>XML processing layer before application software sees them.
>Entities like &llhard; (defined by HTML5) are not legal in XML
>unless specifically defined in an associated DTD (and thus processed
>by the XML parser). This is completely standard XML processing,
>and no VOTable-specific discussion is required.
>

Thank you for the clarification!

>> The BINARY serialization would not be a problem, since the String
>> would just be a stream of bytes ending with a *null*; there would be
>> no need to specify a length preceding the stream of bytes, removing
>> the requirement of a maximal
>> length (number of bytes, or of code points, of glyphs or whatever
>> size)
>
>This would then be unlike any of the existing datatypes in VOTable,
>all of which are fixed size, so probably quite a bit of redrafting
>would be necessary. It would mean that you can't skip over
>data in a BINARY stream without reading all the bytes. 1-d string
>arrays do become easier to encode, though 2+-d string arrays would
>require some special arrangements.

In fact 1-d strings would also require a clarification on how to write
them in the serialization, unless quoted strings are a standard in XML?

>It also means that any VOTable
>reader that doesn't know about the new datatype has no chance to read
>any of a BINARY stream containing such data. Is one of these reasons
>why strings were not defined this way in the original version
>of VOTable?

The reason was complete compatibility with FITS, even though the
unicodeChar was an attempt to expand the character content beyond the
ascii set. But defining a String datatype from the beginning would have
been wiser (sorry).
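
To make the length discrepancy mentioned earlier in this message
concrete, here is a small Python 3 sketch, offered as an illustration
only. It counts the footnoted string in three ways: code points (what
Python's len() reports), UTF-16 code units (what length reports in
JavaScript and Java) and UTF-8 bytes. The helper name length_measures is
hypothetical. The last lines show the Omega/Ohm case: the two strings
differ as code point sequences, and only compare equal after Unicode
normalization.

```python
import unicodedata

# The ten code points listed in the footnote (*) above.
SAMPLE = "\u03b1\u0301\u03b2\u03b3\u2648\ufe0e\u2649\ufe0e\U0001f60a\U0001d54f"


def length_measures(text):
    """Return (code points, UTF-16 code units, UTF-8 bytes) for text."""
    code_points = len(text)
    utf16_units = len(text.encode("utf-16-le")) // 2
    utf8_bytes = len(text.encode("utf-8"))
    return code_points, utf16_units, utf8_bytes


print(length_measures(SAMPLE))  # (10, 12, 28)

# Greek capital omega versus the Ohm sign: same glyph, different code points.
greek_omega, ohm_sign = "\u03a9", "\u2126"
print(greek_omega == ohm_sign)  # False
print(unicodedata.normalize("NFC", ohm_sign) == greek_omega)  # True
```

The two code points beyond the BMP each take two UTF-16 code units,
which accounts for the difference between 10 and 12.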
>
>> The FITS serialization would be a problem, since this type does not
>> (yet) exist in FITS; there were several discussions about adding
>> UTF-8 in FITS, and an obvious possibility would be to save the
>> string contents in the heap, while the binary table row would
>> contain just a pointer to the location of the string in the heap.
>>
>> Finally shouldn't the introduction of UTF-8 in VOTable also specify
>> whether UTF-8 would be acceptable as attribute values ? Could the
>> *name* or value attribute of a <FIELD>, <PARAM>, <INFO> contain
>> "characters" outside the restricted-ascii set ?
>
>The value type of a PARAM is already defined by its datatype attribute
>in just the same way as for a FIELD. INFO is defined in terms of a
>PARAM with datatype="char" arraysize="*" (VOTable 1.5 sec 4.8),
>so if char is changed to permit Unicode then INFO values will
>automatically allow Unicode content as well. As for FIELD/INFO/PARAM
>names, these attributes are defined by the XSD schema as xs:token,
>which means (with a few restrictions about control characters)
>they can have any Unicode values, which has been the case since the
>initial version of VOTable. Again, this is normal XML business and
>I don't think the VOTable standard would be enhanced by including
>an XML primer.

You are right, thank you for the clarification.

>
>> Sorry for being a bit long, but I think the radical change of
>> transforming ascii into UTF-8 is worth thinking about the multiple
>> implications involved.
>
>I agree, we have not got to where we are without thinking about it.
>
>Mark
>

Cheers, François