RegTAP Post-RFC: Non-ASCII

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Wed May 21 11:57:24 PDT 2014


Dear Registry,

Since discussion on RegTAP was a bit short today (lecture notes are at
http://wiki.ivoa.net/internal/IVOA/InterOpMay2014Registry/regtaprfc.pdf),
I'd like to put the questions that seemed to require somewhat
more exchange onto the list.  If you're in Madrid, you're more than
welcome to chat with me in person, and I'd summarise on-list;
otherwise, reply here.  I'm sending separate mails for the issues so
we can keep the threads manageable.

So, the first question is from "4. Open Questions":  The issue here
is that several columns (most notably resource.res_description,
resource.res_title, resource.creator_seq, and res_role.role_name,
possibly resource.source_value, res_role.street_address and the
remaining descriptions) may contain non-ASCII characters that we
should not butcher.

Now, TAP's type system is unconcerned with this -- adql:varchar(*)
makes no guarantees as regards character sets and such.  The best we
can hope for is preservation of whatever bytes the ingestor puts in.
This is relevant when querying against non-ASCII strings, but there
I see little room for standardisation.
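
To illustrate why byte preservation matters on the query side, here is
a minimal Python 3 sketch.  It assumes (hypothetically) that the
ingestor stored UTF-8 and that matching happens on raw bytes; the
variable names are made up, and actual database back ends may well
behave differently.

```python
# Sketch only: byte-wise comparison of a stored value with query literals.
# Assumption: the ingestor wrote UTF-8 and the comparison is on raw bytes.
stored = "Müller".encode("utf-8")

query_utf8 = "Müller".encode("utf-8")          # client sends UTF-8
query_latin1 = "Müller".encode("iso-8859-1")   # client sends Latin-1

print(stored == query_utf8)    # True: same bytes, the match succeeds
print(stored == query_latin1)  # False: same name, different bytes
```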

On the delivery side, on the other hand, this may be different: can
we, with some prose, increase the likelihood that Müller and
Wambsganß survive a round trip through the relational registry?

The right thing to do would be to (1) require the returned VOTable fields
to be of VOTable type unicodeChar.  That's not a complete no-brainer,
as

(a) unicodeChar may have limited client support
(b) this may be an implementation concern on the server side
(implementors: is it?)
(c) VOTable unicodeChar wants UCS-2, which is basically long dead and
not terribly well supported by libraries these days (utf-16 would be
close enough, I guess) [1].
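
As a rough check on "close enough" in (c), the following Python 3
sketch compares the two for the names used above.  The helper names
utf16_units and fits_ucs2 are invented for this illustration, and
"fits UCS-2" is taken here to mean simply "all characters lie in the
Basic Multilingual Plane".

```python
# Sketch only: where UTF-16 and UCS-2 agree, and where they do not.

def utf16_units(text):
    """Number of 16-bit code units in the UTF-16 encoding of text."""
    return len(text.encode("utf-16-be")) // 2

def fits_ucs2(text):
    """True if every character lies in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)

for name in ["Müller", "Wambsganß"]:
    # One code unit per character: UTF-16 and UCS-2 coincide here.
    print(name, len(name), utf16_units(name), fits_ucs2(name))

g_clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
# One character, but two UTF-16 code units (a surrogate pair).
print(len(g_clef), utf16_units(g_clef), fits_ucs2(g_clef))
```

So for typical European names the distinction does not bite; it only
shows up for characters beyond the Basic Multilingual Plane.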

An alternative would be to say (2) strings in general should be utf-8
encoded where not otherwise specified and should (obviously) follow the
encoding of the containing document in TABLEDATA, where the type would
then be VOTable char.  Not great either, as it

(a) probably goes against the intention of the VOTable authors
(b) is a massive gamble as to what clients actually do with it
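
The gamble in (b) can be sketched in Python 3 as follows.  This is an
illustration under the assumption that the server puts UTF-8 bytes into
a char field and that clients decode them on their own terms; the three
"clients" are hypothetical.

```python
# Sketch only: UTF-8 bytes in a "char" field, as in option (2).
name = "Müller"
wire = name.encode("utf-8")             # what the server would deliver

as_utf8 = wire.decode("utf-8")          # client that guesses right
as_latin1 = wire.decode("iso-8859-1")   # client that assumes Latin-1

try:
    wire.decode("ascii")                # client that insists on ASCII
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(as_utf8 == name)   # True
print(as_latin1)         # MÃ¼ller
print(ascii_ok)          # False
```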

Finally, we could do (3) nothing or (your index here) something else.

There's also valuable material on this matter in a recent discussion
over in apps: 

http://www.ivoa.net/pipermail/apps/2014-March/000938.html

Disclosure: I'm leaning towards (1), as it's "the right thing".  So,
if this bothers you, please speak up now.

Cheers,

          Markus

[1] Illustration:

$ python
Python 2.7.3 (default, Mar 14 2014, 11:57:14) 
[GCC 4.7.2] on linux2
>>> "ab".encode("ucs-2")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: ucs-2
>>> "ab".encode("utf-16")
'\xff\xfea\x00b\x00'



