Unicode in VOTable

Norman Gray norman at astro.gla.ac.uk
Wed Aug 13 04:55:56 PDT 2014


Mark, hello.

On 2014 Aug 12, at 13:36, Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:

> That does present a complication, since the text of the standard
> everywhere refers to these length declarations as array sizes
> (element counts) not byte counts, so for instance a 3-element
> array of 32-bit integers is declared with a size of 3 not 12.
> Also, client code wanting to write or read VOTables with
> char-array columns may be using strings with fixed character
> counts rather than fixed UTF-8 encoding sizes
> (e.g. non-ASCII CHAR(x) columns in a database),
> so that overflows/truncations might result when some characters
> expand to multiple bytes.  This issue might not be a showstopper,
> but it certainly requires careful thought.

The underlying problem is that the count in FIELD/@arraysize, and the run-length in the BINARY encoding, have slightly different interpretations; in particular, they have different units.

Or, put another way: the idea of the 'length of the primitive' makes sense for the non-character types in Sect. 2.1, which have only a single fixed-length encoding, but it is really meaningless for character strings.  _Except_, that is, for the ASCII encoding.

Looking back at the on-list discussions in this thread, in March and April, and at some of the discussions in Madrid, I recall that the main problem here is not really whether or not there are characters outside 0x20 -- 0x7f in an array, but that the encoding of character strings in BINARY blocks is such that a variable-length UTF-8 or UTF-16 encoding creates problems for applications that want to skip the encoded string without having to parse the encoded bytes inside.

(Aside: it's not actually hard to scan such a string -- you don't have to have a unicode decoder to hand.  The number of unicode characters in a UTF-8-encoded string is the same as the number of bytes in the encoding minus the number of 10bbbbbb bytes.)
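By way of illustration only, a minimal Python sketch of that counting trick (the test string is arbitrary):

    def count_codepoints(utf8_bytes):
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx,
        # ie (b & 0xC0) == 0x80; every other byte starts a code point.
        return sum(1 for b in utf8_bytes if (b & 0xC0) != 0x80)

    s = "Ångström"                        # 8 code points, 10 UTF-8 bytes
    assert count_codepoints(s.encode("utf-8")) == len(s)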

One possible resolution here is to regard the run-length marker for variable-length string encoding as having units of bytes rather than characters (ie, to regard the 'primitive' here as the byte).  That means that for character arrays, the FIELD/@arraysize attribute would be in units of characters, and the run-length would be in units of bytes, potentially a different number.

Unfortunately that _doesn't_ work, on further reflection, because a fixed-length character array (ie, one whose FIELD/@arraysize is not "*") has no run-length marker in its encoding, making it hard to skip the string without at least minimal scanning.  I think that this, by itself, undermines any hope of repurposing 'char' as an array of unicode characters in a way which is continuous with the current specification (it would be possible, I think, but fiddly and probably not worth it). (**)
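To make that concrete: if the count stayed in units of characters, a reader wanting merely to skip such a field would have to do something like the following (a Python sketch; 'stream' is just an assumed file-like object, and the input is assumed to be well-formed UTF-8):

    def skip_fixed_utf8_field(stream, arraysize):
        # With no byte count available, the only way past the field is
        # to walk it character by character, using each lead byte to
        # work out how many continuation bytes follow -- a scan, not a seek.
        for _ in range(arraysize):
            lead = stream.read(1)[0]
            if   lead < 0x80: extra = 0   # 0xxxxxxx: one byte
            elif lead < 0xE0: extra = 1   # 110xxxxx: two bytes
            elif lead < 0xF0: extra = 2   # 1110xxxx: three bytes
            else:             extra = 3   # 11110xxx: four bytes
            stream.read(extra)

That's cheap enough, but it's more work than the byte-offset arithmetic that fixed-length fields currently permit.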

An alternative would be to redefine the unicodeChar type, but I can see why that's ruled out.

So...

> This suggests the addition of a new primitive datatype named (say)
> "utf8" rather than repurposing the existing "char".  Array sizes
> declared (by means 1 or 2 above) for fields with datatype="utf8"
> would then indicate the number of bytes in the field, and the
> number of characters (unicode code points) is not explicitly coded.

'utf8' would be a bad name, since UTF-8 is a detail of an encoding, and has nothing to do with the data model.  Perhaps this should be called "char/utf8" at least.

One problem is that this datatype should presumably be legal in an XML-encoded table, even though it's fairly unlikely to appear there:

<TABLE>
  <FIELD ID="aString" datatype="char/utf8" arraysize="10"/>
  <DATA><TABLEDATA>
    <TR>
      <TD>Apple</TD>
    </TR>
  </TABLEDATA></DATA>
</TABLE>

In this context, what could the FIELD/@arraysize mean?  It can't mean bytes, because this is XML, and all notion of bytes has been left behind in the lexer.

One possibility would be to say that "char/utf8" must not be used other than with a BINARY-encoded TABLE, but that's getting intricate.

The other problem is that this shares the difficulty noted at (**) above.  If a char/utf8 value is a fixed-size array of characters (ie FIELD/@arraysize is not "*"), then the encoded value occupies a variable number of bytes but has no length prefix, and so can't be skipped as desired.

Two possible resolutions:

  1. Create a datatype "char/utf8", which is equivalent to type "char" in every sense (including the meaning of FIELD/@arraysize) except that it has a different BINARY-encoding.

  2. Leave the datatype as "char" but add a new attribute encoding="utf8".  This is ignored when the table content is XML, but indicates the encoding of any BINARY-encoded content. 

Option 2 might be ruled out on the grounds that a pre-1.4 client might read datatype='char', ignore encoding='utf8', and confuse itself -- I don't know how bad that would be.

In each case, I think the encoding in question should be UTF-8 _plus_ a run-length prefix in units of bytes, and that this prefix should be present for fixed- and variable-length strings.
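For concreteness, a rough Python sketch of that encoding (the four-byte big-endian count here is only an assumption, chosen by analogy with the existing variable-length array counts in BINARY):

    import struct

    def write_utf8_field(stream, s):
        # Write a byte-count prefix followed by the UTF-8 bytes; the
        # prefix is present whether or not the FIELD declares a fixed
        # arraysize.
        b = s.encode("utf-8")
        stream.write(struct.pack(">i", len(b)))
        stream.write(b)

    def skip_utf8_field(stream):
        # A reader not interested in the value can then step over it
        # without inspecting the encoded bytes at all.
        (nbytes,) = struct.unpack(">i", stream.read(4))
        stream.seek(nbytes, 1)   # or stream.read(nbytes) if the stream isn't seekable

With that, the skipping problem above goes away, at the cost of a few extra bytes per fixed-length string.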

See you,

Norman


-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK


