Unicode in VOTable

Dave Morris dave.morris at metagrid.co.uk
Wed Oct 15 23:55:07 CEST 2014


On 2014-10-15 15:53, Mark Taylor wrote:
> 
>    P3: Define both arraysize and binary run-length as "number of bytes
>        the characters would take in UTF-8"
> 

P3 does not work.

----

If a database column contains Unicode codepoints then the number of 
bytes in each row depends on the number of multi-byte characters in each 
value.

There is no way to calculate a single value for "number of bytes the 
characters would take in UTF-8" to cover all the values in a column 
because every row will be different.
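
As a rough illustration, here is a small Python sketch. The sample values and the helper name utf8_length are made up for this example and do not come from any real table. Each value holds exactly three codepoints, yet the UTF-8 byte counts differ:

```python
# Hypothetical sample values, each exactly three codepoints long.
values = [
    "abc",                    # ASCII only
    "a\u00e9c",               # one 2-byte character (e with acute accent)
    "\u03b1\u03b2\u03b3",     # three 2-byte characters (Greek)
    "\u65e5\u672c\u8a9e",     # three 3-byte characters (CJK)
]

def utf8_length(value: str) -> int:
    """Return the number of bytes the value occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))

# len() on a Python 3 str counts codepoints, not bytes.
codepoint_counts = [len(v) for v in values]
byte_counts = [utf8_length(v) for v in values]

for v, n_chars, n_bytes in zip(values, codepoint_counts, byte_counts):
    print(f"{v!r}: {n_chars} codepoints, {n_bytes} bytes")
```

No single byte count describes all four rows, which is the problem with P3.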

All of the char columns in the database would have to be described as 
arraysize='*', losing any possibility of doing pointer arithmetic in 
the binary stream.
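
To make the cost concrete, here is a minimal Python sketch of the two cases. It assumes that each variable-length value is written with a 4-byte big-endian length prefix followed by its UTF-8 bytes, and that a row holds a single char column. The function names are hypothetical and this is not a real VOTable parser:

```python
import struct

def fixed_offset(row_index: int, row_width: int) -> int:
    """Fixed-width rows: the start of any row is a single multiplication."""
    return row_index * row_width

def encode_rows(values) -> bytes:
    """Write each value as a 4-byte length prefix plus its UTF-8 bytes (assumed layout)."""
    out = b""
    for value in values:
        data = value.encode("utf-8")
        out += struct.pack(">I", len(data)) + data
    return out

def variable_offset(stream: bytes, row_index: int) -> int:
    """Variable-length rows: every earlier row must be read to find the next one."""
    offset = 0
    for _ in range(row_index):
        (length,) = struct.unpack_from(">I", stream, offset)
        offset += 4 + length
    return offset

stream = encode_rows(["abc", "\u03b1\u03b2\u03b3", "\u65e5\u672c\u8a9e"])
print(fixed_offset(2, 3))          # direct jump for 3-byte fixed-width rows
print(variable_offset(stream, 2))  # found only by walking rows 0 and 1
```

With fixed-width rows the seek is constant time; with arraysize='*' it grows with the row index.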

P3 gives us all the disadvantages of a 'string' data type (every char 
column is variable length) without the advantage of explicitly defining 
a 'string' data type (arrays of strings).

----

This is already happening.

PostgreSQL databases are commonly created with UTF8 encoding.

That means all of the character data columns contain Unicode codepoints.

A column definition of CHAR(3) means it contains three codepoints, but 
the number of bytes in each row will depend on how many multi-byte 
characters are contained in each value.

If we try to link array size and byte count, then if data from the 
CHAR(3) column is output as a VOTable, the FIELD header will have to be 
defined as arraysize='*', NOT arraysize='3'.

If a VOTable from that service is imported into another database, then 
the arraysize='*' FIELD header will mean that the imported column will 
have to be created as TEXT or VARCHAR with no length limit.
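
The import side can be sketched in Python as follows. The function name sql_type_for_arraysize and the choice of target types are assumptions made for this illustration. It only handles the one-dimensional forms "n", "n*" and "*", and a real importer may well choose differently:

```python
def sql_type_for_arraysize(arraysize: str) -> str:
    """Pick a SQL column type for a char FIELD from its arraysize (illustrative mapping)."""
    if arraysize == "*":
        # No length information at all: the column has to be unbounded.
        return "TEXT"
    if arraysize.endswith("*"):
        # "n*" means variable length with an upper bound of n.
        return f"VARCHAR({int(arraysize[:-1])})"
    # A plain "n" means a fixed length of n.
    return f"CHAR({int(arraysize)})"

for size in ("3", "10*", "*"):
    print(size, "->", sql_type_for_arraysize(size))
```

Once a CHAR(3) column has been exported as arraysize='*', the original length is gone and the importer can only fall back to the unbounded case.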

Over time, as data is uploaded and downloaded between services, all of 
the char columns in the VO will gradually tend towards arraysize='*'.

At which point, arraysize becomes redundant, and using pointer 
arithmetic within a binary stream will no longer be an issue.

----

It would be better to define a 'string' data type now and make use of 
the advantages it brings.

Dave

--------
Dave Morris
Software Developer
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------


