Unicode in VOTable

Tue Aug 12 05:36:57 PDT 2014

Walter and other VOTable enthusiasts,

having recently tracked down a unicode-related bug (I suspect
not the last) in my code, I have thought a bit more about this.

There is a problem with just declaring datatype char to be UTF-8
rather than ASCII, related to the fact that UTF-8 uses a variable
number of bytes per character.  In VOTable the length of an array
of data items is specified in one of two ways:

  1. By the "arraysize" attribute on a FIELD or PARAM element
     (fixed-length fields)

  2. By a 32-bit integer embedded in a BINARY/BINARY2 stream
     immediately before the serialized value
     (variable-length fields)

Strings in VOTable are just arrays with datatype char or unicodeChar,
so this applies to them.  If we want to use a representation of
VOTable primitive values for which the byte count is not fixed
(such as UTF-8 for char) we have to consider whether those length
declarations refer to the number of array elements (unicode code points)
or bytes.
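
To make the distinction concrete, here is a small Python sketch
(purely illustrative; the sample strings and the helper name
utf8_sizes are my own, not anything from the standard) comparing
the code point count of a string with its UTF-8 byte count:

```python
def utf8_sizes(text):
    """Return (code point count, UTF-8 byte count) for a Python str."""
    return len(text), len(text.encode("utf-8"))


# Hypothetical sample values: pure ASCII, Latin-1 range, Greek, CJK.
samples = ["NGC 1275", "\u00c5", "\u03b1 Cen", "\u65e5\u672c"]

for s in samples:
    n_chars, n_bytes = utf8_sizes(s)
    print(f"{s!r}: {n_chars} code points, {n_bytes} UTF-8 bytes")
```

The two counts agree only for pure ASCII content, which is why
the meaning of the length declaration matters.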

For reasons of efficiency, I think the answer has to be bytes for
both 1 and 2 above (to enable random access by calculating cell
offsets without having to read all the preceding data).
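
A rough sketch of the random-access argument, assuming a made-up
row layout (one 32-bit big-endian integer followed by a text cell
declared as 6 bytes, padded with NUL bytes); the constants and
function names are hypothetical:

```python
import struct

# Hypothetical fixed-width row: 4-byte int + 6-byte UTF-8 text cell.
INT_SIZE = 4
TEXT_BYTES = 6
ROW_SIZE = INT_SIZE + TEXT_BYTES


def pack_row(number, text):
    """Serialize one row; the text cell is NUL-padded to TEXT_BYTES."""
    raw = text.encode("utf-8")
    if len(raw) > TEXT_BYTES:
        raise ValueError("text does not fit in the declared byte size")
    return struct.pack(">i", number) + raw.ljust(TEXT_BYTES, b"\x00")


def read_text_cell(buffer, row_index):
    """Jump straight to a row's text cell using only arithmetic."""
    start = row_index * ROW_SIZE + INT_SIZE
    raw = buffer[start:start + TEXT_BYTES]
    return raw.rstrip(b"\x00").decode("utf-8")


values = ["abc", "\u03b1\u03b2\u03b3", "\u65e5\u672c"]
table = b"".join(pack_row(i, v) for i, v in enumerate(values))
print(read_text_cell(table, 2))
```

If the declared size counted code points instead, the byte width
of each cell would depend on its content, and the offset of row N
could only be found by decoding rows 0 to N-1.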

That does present a complication, since the text of the standard
everywhere refers to these length declarations as array sizes
(element counts) not byte counts, so for instance a 3-element
array of 32-bit integers is declared with a size of 3 not 12.
Also, client code wanting to write or read VOTables with
char-array columns may be using strings with fixed character
counts rather than fixed UTF-8 encoding sizes
(e.g. non-ASCII CHAR(x) columns in a database),
so that overflows/truncations might result when some characters
expand to multiple bytes.  This issue might not be a showstopper,
but it certainly requires careful thought.
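
As an illustration of the truncation hazard, here is a sketch of
one way a writer might fit a string into a fixed byte budget
without splitting a multi-byte sequence.  The function name
truncate_utf8 is hypothetical, and this is only one possible
policy, not something the standard prescribes:

```python
def truncate_utf8(text, max_bytes):
    """Encode text as UTF-8, dropping whole trailing characters
    until the result fits in max_bytes."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return raw
    # Cutting at max_bytes may leave an incomplete sequence at the
    # end; decoding with errors="ignore" discards that fragment.
    clipped = raw[:max_bytes].decode("utf-8", errors="ignore")
    return clipped.encode("utf-8")


# A 3-character value (as in a CHAR(3) column) needs 6 bytes here,
# so a 3-byte field can hold only its first character.
value = "\u03b1\u03b2\u03b3"
print(len(value), len(value.encode("utf-8")), truncate_utf8(value, 3))
```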

This suggests the addition of a new primitive datatype named (say)
"utf8" rather than repurposing the existing "char".  Array sizes
declared (by means 1 or 2 above) for fields with datatype="utf8"
would then indicate the number of bytes in the field, and the
number of characters (Unicode code points) would not be explicitly
encoded.
Obviously, defining a new datatype is more disruptive in terms
of what VOTables look like, but it would be less prone to
unexpected problems with existing code than redefining an existing one.
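
A minimal sketch of how a variable-length cell of such a
hypothetical "utf8" datatype might be serialized, assuming the
32-bit big-endian prefix counts bytes rather than characters
(function names are my own invention):

```python
import struct


def encode_var_utf8(text):
    """Serialize text as a 4-byte byte count followed by UTF-8 data."""
    raw = text.encode("utf-8")
    return struct.pack(">i", len(raw)) + raw


def decode_var_utf8(buffer, offset=0):
    """Read one cell at offset; return (text, offset of next cell)."""
    (n_bytes,) = struct.unpack_from(">i", buffer, offset)
    start = offset + 4
    end = start + n_bytes
    return buffer[start:end].decode("utf-8"), end


stream = encode_var_utf8("\u03b1 Cen") + encode_var_utf8("Sirius")
first, pos = decode_var_utf8(stream)
second, pos = decode_var_utf8(stream, pos)
print(first, second)
```

A reader can skip a cell by adding the prefix to the current
offset without decoding the text; the character count is only
known after decoding.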

Mark

On Mon, 7 Apr 2014, Walter Landry wrote:

> Hi Mark,
> 
> My apologies for taking a while to get back to this.
> 
> Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:
> >> > It's possible that revisiting this in a future version of the standard
> >> > might change that, though for reasons of backward compatibility that
> >> > might be problematic.
> >> > 
> >> > Having said that, I wouldn't be too surprised to find that sloppily
> >> > coded VOTable readers (possibly including mine, I haven't checked)
> >> > in unicode-friendly languages might actually not do that, and treat
> >> > such arrays as UTF-8 strings because the language byte array
> >> > handling naturally makes such interpretations.
> >> 
> >> What I would like is a revision to the standard.  It sounds like you
> >> are agreeing with me that UTF-8 is, to some degree, existing usage.
> >> In that case, specifying UTF-8 would be removing ambiguities and
> >> codifying existing practice, not inventing new usage.
> > 
> > "to some degree" maybe, but I suspect not very much, and to the extent
> > that it is, it's certainly in contravention of what the standard says.
> > So I'm not very comfortable with the idea of adjusting the definition
> > in this way.
> 
> UTF-8 is 100% backwards compatible with the existing standard.  I do
> not understand why you are uncomfortable extending the standard in
> this way.
> 
> >> > Since unicodeChar is supposed to contain unicode strings, the same
> >> > reasoning doesn't apply to datatype="unicodeChar".  Using UTF-16
> >> > in unicodeChar follows the spirit and letter of the standard
> >> > in the (overwhelmingly common?) case that none of the characters
> >> > require surrogates.  If surrogate pairs are required, there is
> >> > a fair chance it will work anyway.  So if you want to put unicode
> >> > into a BINARY2 serialized VOTable, I think you should use
> >> > unicodeChar arrays with a UTF-16 or maybe UCS-2 encoding.
> >> 
> >> I can always write UTF-16 characters for my own consumption.  What I
> >> want is to be able to demand other readers to understand it as well,
> >> in the same way that I can demand other readers to understand boolean
> >> or floatComplex.
> >> 
> >> What I want is revisions to the standard to make, for example, VOTable
> >> 1.4.  The first step towards that is to get consensus here that the
> >> revision is a good idea.  Do you (or anyone else) agree these are good
> >> revisions, or do you still have some doubts?
> > 
> > As above: my feeling is that an adjustment from UCS-2 to UTF-16 for
> > the unicodeChar type would be a good change, but I have doubts about
> > redefining the char type.  Other people may have different opinions.
> > But if you want to write something now which there's a good chance
> > will work with existing readers and will look pretty similar in
> > future versions of the standard (if any) I'd advise use of unicodeChar
> > and UTF-16.
> 
> I am not looking for something that _might_ work.  I am proposing a
> 100% backwards compatible extension to the standard so it will
> _definitely_ work.
> 
> > I have added a new page to the IVOA wiki with an entry on this topic:
> > 
> >    http://wiki.ivoa.net/twiki/bin/view/IVOA/VOTableIssues13
> > 
> > If you or others have opinions, feel free to add them there, and if
> > the VOTable standard is revised at some point in the future, those
> > notes will be taken into account.  Note however that there is not
> > currently an activity leading towards a new revision of VOTable
> > in the IVOA.
> 
> Thanks.  I applied for credentials to make comments on that page.
> 
> Cheers,
> Walter Landry
> 

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/