Unicode in VOTable
Dave Morris
dave.morris at metagrid.co.uk
Wed Oct 15 23:55:07 CEST 2014
On 2014-10-15 15:53, Mark Taylor wrote:
>
> P3: Define both arraysize and binary run-length as "number of bytes
> the characters would take in UTF-8"
>
P3 does not work.
----
If a database column contains Unicode codepoints then the number of
bytes in each row depends on the number of multi-byte characters in each
value.
There is no way to calculate a single value for "number of bytes the
characters would take in UTF-8" to cover all the values in a column
because every row will be different.
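A minimal Python sketch of the problem, assuming a handful of made-up
sample values (not taken from any real service). Each value is exactly
three codepoints long, yet the UTF-8 byte counts all differ:

```python
# Hypothetical sample values, each exactly three codepoints long.
values = [
    "abc",                     # ASCII: 1 byte per codepoint
    "\u03b1\u03b2\u03b3",      # Greek alpha, beta, gamma: 2 bytes each
    "\u65e5\u672c\u8a9e",      # three CJK characters: 3 bytes each
    "a\u20ac\U0001d11e",       # 'a', euro sign, musical G clef: 1 + 3 + 4 bytes
]

def utf8_len(s):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

byte_lengths = [utf8_len(v) for v in values]

for v, n in zip(values, byte_lengths):
    print(f"{len(v)} codepoints, {n} bytes in UTF-8")
```

The codepoint count is 3 for every row, but the byte counts are 3, 6, 9
and 8, so no single byte-based arraysize describes the column.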
All of the char columns in the database would have to be described as
arraysize='*', losing any possibility of doing pointer arithmetic in
the binary stream.
P3 gives us all the disadvantages of a 'string' data type (every char
column is variable length) without the advantage of explicitly defining
a 'string' data type (arrays of strings).
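To illustrate what is lost, here is a rough Python sketch. It assumes a
simplified single-column layout in which each variable-length value is
preceded by a 4-byte big-endian length prefix. That layout and the helper
names fixed_offset and variable_offset are assumptions made for
illustration, not a full description of the VOTable BINARY encoding.

```python
import struct

def fixed_offset(row, width):
    """Fixed-width column: the offset of a row is a single multiplication."""
    return row * width

def variable_offset(stream, row):
    """Variable-width column: every earlier length prefix must be read."""
    offset = 0
    for _ in range(row):
        (n,) = struct.unpack_from(">i", stream, offset)
        offset += 4 + n  # skip the prefix and the value it describes
    return offset

# Hypothetical three-row column; the middle value is three Greek letters,
# which take 6 bytes in UTF-8.
values = [b"abc", b"\xce\xb1\xce\xb2\xce\xb3", b"xyz"]
stream = b"".join(struct.pack(">i", len(v)) + v for v in values)

print(fixed_offset(2, 3))          # direct jump, valid only for fixed widths
print(variable_offset(stream, 2))  # found by walking rows 0 and 1
```

With a fixed width the reader can jump straight to any row. With
arraysize='*' it has to scan from the start of the stream.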
----
This is already happening.
The default encoding for a PostgreSQL database is UTF8.
That means all of the character data columns can contain Unicode codepoints.
A column definition of CHAR(3) means it contains three codepoints, but
the number of bytes in each row will depend on how many multi-byte
characters are contained in each value.
If we try to link array size and byte count, then if data from the
CHAR(3) column is output as a VOTable, the FIELD header will have to be
defined as arraysize='*', NOT arraysize='3'.
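A small Python sketch of that consequence. The helper p3_arraysize is
hypothetical and only illustrates the P3 rule as quoted above: a fixed
arraysize is possible only when every value happens to have the same
UTF-8 byte length.

```python
def p3_arraysize(values):
    """Return the arraysize a byte-based (P3) definition would allow.

    Hypothetical helper: a fixed size is returned only if every value
    has the same UTF-8 byte length, otherwise '*'.
    """
    lengths = {len(v.encode("utf-8")) for v in values}
    if len(lengths) == 1:
        return str(lengths.pop())
    return "*"

# Both columns are CHAR(3) in the database: three codepoints per value.
ascii_only = ["abc", "xyz"]
mixed = ["abc", "\u03b1\u03b2\u03b3"]  # second value is three Greek letters

print(p3_arraysize(ascii_only))
print(p3_arraysize(mixed))
```

The ASCII-only column could keep a fixed size, but a single multi-byte
value forces '*'. A service that cannot inspect every row in advance
would presumably have to declare '*' for any column that might hold
non-ASCII data.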
If a VOTable from that service is imported into another database, then
the arraysize='*' FIELD header will mean that the imported column will
have to be created as TEXT or VARCHAR with no length limit.
Over time, as data is uploaded and downloaded between services, all of
the char columns in the VO will gradually tend towards arraysize='*'.
At which point, arraysize becomes redundant, and using pointer
arithmetic within a binary stream will no longer be an issue.
----
It would be better to define a 'string' data type now and make use of
the advantages it brings.
Dave
--------
Dave Morris
Software Developer
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------