Unicode in VOTable
msdemlei at ari.uni-heidelberg.de
Mon Aug 25 02:02:08 PDT 2014
On Fri, Aug 22, 2014 at 12:39:02PM +0100, Dave Morris wrote:
> On 2014-08-14 17:35, Mark Taylor wrote:
> >On Thu, 14 Aug 2014, Markus Demleitner wrote:
> >> VOTable considers char as byte streams that can be decoded from
> >> for presentation purposes. TABLEDATA encoding is presentation.
> >> arraysize refers to the length of the bytestream always, never to
> >> the length of any unicode code sequence decodeable from the byte
> >> stream.
> >Yes, I think that would work. "TABLEDATA encoding is presentation"
> >seems like a rather radical statement in terms of the way one
> >usually thinks about VOTable, but I can't think of any actual
> >negative consequences.
> If I have a SQL database with a column defined as CHAR(3),
> CREATE TABLE my_table (
> xyz CHAR(3)
> );
> How would I describe that as a FIELD ?
> <FIELD name='xyz' datatype='char' arraysize='3'>
> <FIELD name='xyz' datatype='char' arraysize='12'>
> <FIELD name='xyz' datatype='char' encoding='utf-8' arraysize='3'>
> <FIELD name='xyz' datatype='char' encoding='utf-8' arraysize='12'>
First, I'd hope there's no "encoding" attribute to FIELD, so let's
discount the cases with that attribute.
Other than that: I'd say use arraysize="*" here if your database
actually stores codepoints; it's probably more space-efficient than
any fixed size. Of course, you no longer have fixed-size records
then. If you want those, I'd say store UTF-8 in your database.
arraysize="*" admittedly doesn't help you if your database can do
arrays and you suddenly have an array of strings (sequences of
codepoints rather than sequences of bytes); if that's really
something you must do, I guess you won't get around computing a
worst-case length based on what you have in your database (a code
point takes at most 4 bytes in UTF-8 as restricted by RFC 3629, and
non-BMP characters always need all 4).
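
To make the worst-case computation concrete, here is a minimal Python sketch. It assumes the byte stream is UTF-8 and that the database column holds a fixed number of code points, as CHAR(3) does in the example above. The helper names (worst_case_arraysize, utf8_length) are made up for illustration and are not part of any VOTable library.

```python
# Upper bound on bytes per code point in UTF-8 (RFC 3629).
MAX_UTF8_BYTES_PER_CODEPOINT = 4


def worst_case_arraysize(n_codepoints, n_elements=1):
    """Bytes needed to hold n_elements strings of n_codepoints code
    points each, if every code point took the UTF-8 maximum.

    Hypothetical helper: it only computes a byte count and says
    nothing about how a VOTable writer would lay out the array.
    """
    return n_codepoints * MAX_UTF8_BYTES_PER_CODEPOINT * n_elements


def utf8_length(value):
    """Length of the UTF-8 byte stream for a Python str."""
    return len(value.encode("utf-8"))


# Four strings of three code points each: ASCII, Latin-1 range,
# CJK (still BMP), and three non-BMP mathematical script letters.
samples = ["abc", "äöü", "日本語", "\U0001D4B3\U0001D4B4\U0001D4B5"]
lengths = [utf8_length(s) for s in samples]

for s, n in zip(samples, lengths):
    print(f"{len(s)} code points -> {n} bytes")

print("worst case for CHAR(3):", worst_case_arraysize(3))
```

The byte lengths for the same three-code-point column range from 3 to 12, which is why arraysize="*" tends to be more compact than a fixed size padded to the worst case.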