Unicode in VOTable

Mon Aug 25 02:02:08 PDT 2014

Hi Dave,

On Fri, Aug 22, 2014 at 12:39:02PM +0100, Dave Morris wrote:
> On 2014-08-14 17:35, Mark Taylor wrote:
> >On Thu, 14 Aug 2014, Markus Demleitner wrote:
> >>  VOTable considers char as byte streams that can be decoded from
> >>utf-8
> >>  for presentation purposes.   TABLEDATA encoding is presentation.
> >>  arraysize refers to the length of the bytestream always, never to
> >>  the length of any unicode code sequence decodeable from the byte
> >>  stream.
> >
> >Yes, I think that would work.  "TABLEDATA encoding is presentation"
> >seems like a rather radical statement in terms of the way one
> >usually thinks about VOTable, but I can't think of any actual
> >negative consequences.
> >
> 
> If I have a SQL database with a column defined as CHAR(3),
> 
>     CREATE TABLE my_table (
>         xyz CHAR(3)
>         );
> 
> How would I describe that as a FIELD ?
> 
>     <FIELD name='xyz' datatype='char' arraysize='3'>
> 
>     <FIELD name='xyz' datatype='char' arraysize='12'>
> 
>     <FIELD name='xyz' datatype='char' encoding='utf-8' arraysize='3'>
> 
>     <FIELD name='xyz' datatype='char' encoding='utf-8' arraysize='12'>

First, I'd hope there's no "encoding" attribute to FIELD, so let's
discount the cases with that attribute.

Other than that: I'd say use arraysize="*" here if your database
actually stores codepoints; it's probably more space-efficient than
any fixed size.  Of course, you don't have fixed-size records any
more then.  If you want those, I'd say store UTF-8 in your database.
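
To illustrate why a column of three codepoints has no single natural
byte length, here is a minimal sketch in Python.  The helper name
utf8_length and the sample strings are made up for this example; they
are not part of VOTable or of any database API.

```python
def utf8_length(value: str) -> int:
    """Return the number of bytes the string occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


# Four strings that would each fit into a CHAR(3) column holding codepoints.
samples = [
    "abc",                               # ASCII: 1 byte per codepoint
    "\u00e4\u00f6\u00fc",                # Latin-1 range: 2 bytes per codepoint
    "\u65e5\u672c\u8a9e",                # CJK, inside the BMP: 3 bytes each
    "\U0001D11E\U0001D11E\U0001D11E",    # outside the BMP: 4 bytes each
]

# Map each sample to (number of codepoints, number of UTF-8 bytes).
lengths = {s: (len(s), utf8_length(s)) for s in samples}

for text, (codepoints, nbytes) in lengths.items():
    print(f"{codepoints} codepoints -> {nbytes} bytes")
```

All four values are three codepoints long, yet their byte streams
differ in length, which is why a variable-length arraysize is the
simplest fit when the database counts codepoints.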

arraysize="*" admittedly doesn't help you if your database can do
arrays and you suddenly have an array of strings (sequences of
codepoints rather than sequences of bytes); if that's really
something you must do, I guess you won't get around computing a
worst-case length based on what you have in your database (as far as
I know, UTF-8 tops out at 4 bytes per codepoint, which is what
non-BMP characters need).
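
A rough sketch of that worst-case computation, in Python, assuming the
ceiling of 4 bytes per codepoint that UTF-8 has under RFC 3629.  The
function names and the choice of NUL bytes as padding are assumptions
made for this illustration, not something taken from the VOTable
standard.

```python
# UTF-8 as restricted by RFC 3629 never needs more than 4 bytes per codepoint.
MAX_UTF8_BYTES_PER_CODEPOINT = 4


def worst_case_arraysize(n_codepoints: int) -> int:
    """Byte length guaranteed to hold any string of n_codepoints codepoints."""
    return n_codepoints * MAX_UTF8_BYTES_PER_CODEPOINT


def to_fixed_width(value: str, n_codepoints: int, pad: bytes = b"\x00") -> bytes:
    """Encode value as UTF-8 and pad it to the worst-case byte length.

    The padding byte is an assumption of this sketch.
    """
    if len(value) > n_codepoints:
        raise ValueError(f"{value!r} has more than {n_codepoints} codepoints")
    return value.encode("utf-8").ljust(worst_case_arraysize(n_codepoints), pad)


# A CHAR(3) column of codepoints would need arraysize="12" under this scheme.
size = worst_case_arraysize(3)
cell = to_fixed_width("abc", 3)
```

With this, every cell has the same byte length, so an array of such
strings can be laid out with fixed strides, at the price of padding
for anything that is mostly ASCII.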

Cheers,

        Markus