Unicode in VOTable

Mon Apr 7 15:45:00 PDT 2014

Hi Mark,

My apologies for taking a while to get back to this.

Mark Taylor <m.b.taylor at bristol.ac.uk> wrote:
>> > It's possible that revisiting this in a future version of the standard
>> > might change that, though for reasons of backward compatibility that
>> > might be problematic.
>> > 
>> > Having said that, I wouldn't be too surprised to find that sloppily
>> > coded VOTable readers (possibly including mine, I haven't checked)
>> > in unicode-friendly languages might actually not do that, and treat
>> > such arrays as UTF-8 strings because the language byte array
>> > handling naturally makes such interpretations.
>> 
>> What I would like is a revision to the standard.  It sounds like you
>> are agreeing with me that UTF-8 is, to some degree, existing usage.
>> In that case, specifying UTF-8 would be removing ambiguities and
>> codifying existing practice, not inventing new usage.
> 
> "to some degree" maybe, but I suspect not very much, and to the extent
> that it is, it's certainly in contravention of what the standard says.
> So I'm not very comfortable with the idea of adjusting the definition
> in this way.

UTF-8 is 100% backwards compatible with the existing standard: any
char array that is valid ASCII today is also valid UTF-8 and decodes
to the same text.  I do not understand why you are uncomfortable
extending the standard in this way.

>> > Since unicodeChar is supposed to contain unicode strings, the same
>> > reasoning doesn't apply to datatype="unicodeChar".  Using UTF-16
>> > in unicodeChar follows the spirit and letter of the standard
>> > in the (overwhelmingly common?) case that none of the characters
>> > require surrogates.  If surrogate pairs are required, there is
>> > a fair chance it will work anyway.  So if you want to put unicode
>> > into a BINARY2 serialized VOTable, I think you should use
>> > unicodeChar arrays with a UTF-16 or maybe UCS-2 encoding.
>> 
>> I can always write UTF-16 characters for my own consumption.  What I
>> want is to be able to demand other readers to understand it as well,
>> in the same way that I can demand other readers to understand boolean
>> or floatComplex.
>> 
>> What I want is revisions to the standard to make, for example, VOTable
>> 1.4.  The first step towards that is to get consensus here that the
>> revision is a good idea.  Do you (or anyone else) agree these are good
>> revisions, or do you still have some doubts?
> 
> As above: my feeling is that an adjustment from UCS-2 to UTF-16 for
> the unicodeChar type would be a good change, but I have doubts about
> redefining the char type.  Other people may have different opinions.
> But if you want to write something now which there's a good chance
> will work with existing readers and will look pretty similar in
> future versions of the standard (if any) I'd advise use of unicodeChar
> and UTF-16.

I am not looking for something that _might_ work.  I am proposing a
100% backwards compatible extension to the standard so it will
_definitely_ work.
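
The UCS-2 to UTF-16 adjustment discussed above is compatible in the
same sense.  The following is a rough sketch only: it assumes
big-endian 16-bit code units, and the helper names utf16_units and
needs_surrogates are hypothetical.  Characters in the Basic
Multilingual Plane occupy one 16-bit unit with the same value under
both encodings; only characters above U+FFFF need a surrogate pair.

```python
def utf16_units(text: str) -> list[int]:
    """Return the 16-bit code units of text encoded as big-endian UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def needs_surrogates(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


# A BMP character: one unit, equal to its code point (same as UCS-2).
assert utf16_units("\u00c5") == [0x00C5]
assert not needs_surrogates("\u00c5")

# A character above U+FFFF: two units, a high and a low surrogate.
assert utf16_units("\U0001F600") == [0xD83D, 0xDE00]
assert needs_surrogates("\U0001F600")
```

A reader that treats each unit as a UCS-2 character handles the first
case unchanged and sees two unassigned code points in the second.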

> I have added a new page to the IVOA wiki with an entry on this topic:
> 
>    http://wiki.ivoa.net/twiki/bin/view/IVOA/VOTableIssues13
> 
> If you or others have opinions, feel free to add them there, and if
> the VOTable standard is revised at some point in the future, those
> notes will be taken into account.  Note however that there is not
> currently an activity leading towards a new revision of VOTable
> in the IVOA.

Thanks.  I applied for credentials to make comments on that page.

Cheers,
Walter Landry