Unicode in VOTable

Mark Taylor m.b.taylor at bristol.ac.uk
Mon Mar 31 07:23:24 PDT 2014


Walter,

Apologies for leaving this for a couple of weeks; I was very busy
with other things.

On Mon, 17 Mar 2014, Walter Landry wrote:

> Mark Taylor <M.B.Taylor at bristol.ac.uk> wrote:
> > Walter,
> > 
> > On Fri, 14 Mar 2014, Walter Landry wrote:
> > 
> >> Hi Norman,
> >> 
> >> Norman Gray <norman at astro.gla.ac.uk> wrote:
> >> > The only place (I think) where there's any need for discussing a
> >> > unicode serialisation is within BINARY blobs.  I doubt there's even
> >> > a need for discussing it within FITS blobs, since their internal
> >> > encoding is already specified elsewhere.
> >> 
> >> I am sorry if I gave the impression otherwise, but for this discussion
> >> I have always only been interested in BINARY2 blobs.  In particular, I
> >> want to know how to read and write Unicode characters into BINARY2
> >> blobs.  Is it OK to put UTF-8 into an "ASCII Character" array, or
> >> UTF-16 into a "Unicode Character" array?  The current standard says
> >> no.  Can we all agree that it should say yes?
> > 
> > My opinion: I do not think it's a good idea to put UTF-8 into
> > datatype="char" arrays as far as the existing version of VOTable goes.
> > Software following the letter or spirit of the current standard should
> > treat char arrays as having one character per array element, so a char
> > with the high bit set should be interpreted as a character from an extended
> > ASCII-like set rather than as a byte of a UTF-8 multi-byte sequence.
> 
> The current standard says ASCII, not ISO-8859-1, Windows-1250, or JIS
> X 0201.  So 8-bit extended ASCII characters in a 'char' array are
> already disallowed.  Do you have examples of VOTables in the wild that
> use some form of extended ASCII?
> 
> > It's possible that revisiting this in a future version of the standard
> > might change that, though for reasons of backward compatibility that
> > might be problematic.
> > 
> > Having said that, I wouldn't be too surprised to find that sloppily
> > coded VOTable readers (possibly including mine, I haven't checked)
> > in unicode-friendly languages might actually not do that, and treat
> > such arrays as UTF-8 strings because the language byte array
> > handling naturally makes such interpretations.
> 
> What I would like is a revision to the standard.  It sounds like you
> are agreeing with me that UTF-8 is, to some degree, existing usage.
> In that case, specifying UTF-8 would be removing ambiguities and
> codifying existing practice, not inventing new usage.

"To some degree" maybe, but I suspect not very much, and to the extent
that it is, it certainly contravenes what the standard says.
So I'm not very comfortable with the idea of adjusting the definition
of the char type in this way.
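
To make the ambiguity concrete, here is a minimal Python sketch of the
two readings of the same bytes in a datatype="char" array.  The sample
value, the choice of Latin-1 as the "extended ASCII-like set", and the
helper name is_strict_ascii are illustrative assumptions only; nothing
here is taken from the standard or from any particular VOTable reader.

```python
# Bytes as they might sit in a datatype="char" array if a writer
# had put UTF-8 into it (illustrative value, not from a real table).
raw = "é".encode("utf-8")            # two bytes: 0xC3 0xA9

# Reading 1: one character per array element.  Latin-1 is used here
# only as an example of an extended ASCII-like set.
per_byte = raw.decode("latin-1")     # two characters

# Reading 2: the whole array treated as a UTF-8 string.
as_utf8 = raw.decode("utf-8")        # one character


def is_strict_ascii(data: bytes) -> bool:
    """True if no byte has the high bit set (hypothetical helper)."""
    return all(b < 0x80 for b in data)


if __name__ == "__main__":
    print(len(per_byte), len(as_utf8), is_strict_ascii(raw))
```

The two readings disagree on both the content and the length of the
string, which is why a reader and a writer have to share the same
interpretation before non-ASCII bytes in char arrays can be exchanged
safely.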

> > Since unicodeChar is supposed to contain unicode strings, the same
> > reasoning doesn't apply to datatype="unicodeChar".  Using UTF-16
> > in unicodeChar follows the spirit and letter of the standard
> > in the (overwhelmingly common?) case that none of the characters
> > require surrogates.  If surrogate pairs are required, there is
> > a fair chance it will work anyway.  So if you want to put unicode
> > into a BINARY2 serialized VOTable, I think you should use
> > unicodeChar arrays with a UTF-16 or maybe UCS-2 encoding.
> 
> I can always write UTF-16 characters for my own consumption.  What I
> want is to be able to demand other readers to understand it as well,
> in the same way that I can demand other readers to understand boolean
> or floatComplex.
> 
> What I want is revisions to the standard to make, for example, VOTable
> 1.4.  The first step towards that is to get consensus here that the
> revision is a good idea.  Do you (or anyone else) agree these are good
> revisions, or do you still have some doubts?

As above: my feeling is that an adjustment from UCS-2 to UTF-16 for
the unicodeChar type would be a good change, but I have doubts about
redefining the char type.  Other people may have different opinions.
But if you want to write something now that has a good chance of
working with existing readers, and of looking pretty similar in
future versions of the standard (if any), I'd advise use of unicodeChar
and UTF-16.
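
As a rough illustration of that advice, the Python sketch below encodes
one variable-length unicodeChar cell as big-endian UTF-16 and reports
whether surrogate pairs are involved.  The function names are
hypothetical, and the layout (a 4-byte big-endian element count followed
by 2-byte elements, with no byte order mark) is my assumption about how
a variable-length array is serialized; the BINARY2 per-row null flags
are not handled.  Treat it as a sketch, not a reference implementation.

```python
import struct


def encode_unicodechar_cell(text: str) -> bytes:
    """Encode text as an assumed variable-length unicodeChar cell."""
    payload = text.encode("utf-16-be")   # big-endian, no BOM
    n_units = len(payload) // 2          # count of 16-bit code units
    return struct.pack(">i", n_units) + payload


def decode_unicodechar_cell(cell: bytes) -> str:
    """Inverse of encode_unicodechar_cell, under the same assumptions."""
    (n_units,) = struct.unpack(">i", cell[:4])
    return cell[4:4 + 2 * n_units].decode("utf-16-be")


def needs_surrogates(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    for s in ("\u00c5ngstr\u00f6m", "\U0001F600"):
        cell = encode_unicodechar_cell(s)
        print(needs_surrogates(s), cell.hex(),
              decode_unicodechar_cell(cell) == s)
```

When needs_surrogates() is false, the UTF-16 bytes are identical to
UCS-2, so a reader that follows the existing definition sees exactly
what it expects.  When it is true, the element count is the number of
16-bit code units rather than the number of characters, which is the
case that may or may not work with a given reader.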

I have added a new page to the IVOA wiki with an entry on this topic:

   http://wiki.ivoa.net/twiki/bin/view/IVOA/VOTableIssues13

If you or others have opinions, feel free to add them there, and if
the VOTable standard is revised at some point in the future, those
notes will be taken into account.  Note however that there is not
currently an activity leading towards a new revision of VOTable
in the IVOA.

Mark

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/

