Unicode in VOTable
Dave Morris
dave.morris at metagrid.co.uk
Thu Oct 16 03:14:20 CEST 2014
On 2014-10-15 15:53, Mark Taylor wrote:
>
> This ends with the following three proposals for the way forward
> as regards representing character array length:
>
> P1: Define both arraysize and binary run-length as "number of code
> points"
> P2: Define arraysize as "number of code points" and
> binary run-length as "number of bytes"
> P3: Define both arraysize and binary run-length as "number of bytes
> the characters would take in UTF-8"
>
Are the problems caused because we are trying to fit two different
concepts into a single attribute?
If so, may I suggest a fourth option.
P4: Define 'arraysize' as the number of characters (code points).
Define a new optional attribute 'bytecount' which contains
"number of bytes the characters would take in UTF-8"
If the data source is able to calculate the byte count then it may add
the bytecount attribute to the FIELD, enabling binary encoding parsers
to use pointer arithmetic to skip fixed size rows.
If the data source is not able to calculate the byte count efficiently
then it may either set the value to '*' or omit the attribute entirely.
So, based on the examples on the wiki page
http://wiki.ivoa.net/twiki/bin/view/IVOA/VOTableUnicode20141016
If the table only contained the first string, with only ASCII
characters, then the VOTable header would be
<FIELD type='char' arraysize='4' bytecount='4'/>
If the table contained the second string, with a single multi-byte
character in it, then the VOTable header would be
<FIELD type='char' arraysize='4' bytecount='5'/>
If the data source is unable to calculate the UTF-8 byte count
efficiently, then the VOTable header would be
<FIELD type='char' arraysize='4'/>
or
<FIELD type='char' arraysize='4' bytecount='*'/>
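As a rough illustration of how a data source might produce the headers above, here is a minimal Python sketch. The helper name field_header and the sample strings are hypothetical (they are not the actual strings from the wiki page), and taking the maximum over all values in the column is my assumption about how a per-column 'bytecount' would be derived.

```python
def field_header(values, include_bytecount=True):
    """Build a FIELD header for a char column (hypothetical helper).

    arraysize is counted in code points, bytecount in UTF-8 bytes.
    Assumption: both are the maximum over all values in the column.
    """
    # In Python 3, len() on a str counts Unicode code points.
    arraysize = max(len(v) for v in values)
    attrs = f"type='char' arraysize='{arraysize}'"
    if include_bytecount:
        # Encoding is the only way to learn the UTF-8 byte length.
        bytecount = max(len(v.encode("utf-8")) for v in values)
        attrs += f" bytecount='{bytecount}'"
    return f"<FIELD {attrs}/>"


# ASCII only: code points and bytes agree.
print(field_header(["abcd"]))
# One two-byte character (U+00E9): four code points, five bytes.
print(field_header(["abc\u00e9"]))
# A source that does not want to scan the data omits the attribute.
print(field_header(["abc\u00e9"], include_bytecount=False))
```

A simple script can call this with include_bytecount=False and never touch the encoding at all, which is the point of making the attribute optional.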
This allows novice programmers and scripters to generate simple VOTables
without having to worry about scanning the data, encoding strings and
counting the bytes in an encoding they probably weren't aware of in the
first place (99% of the time their programming environment/language just
takes care of it).
It also allows technically advanced binary encoding library writers to
generate VOTable headers with the right metadata to enable binary
encoding parsers to use pointer arithmetic to skip fixed length rows
where possible.
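To make the "pointer arithmetic" case concrete, here is a minimal Python sketch of a reader that jumps straight to a cell when every column has a known byte width. The function names, the NUL padding of short values, and the use of None for an unknown ('*' or missing) bytecount are all assumptions for illustration, not part of any existing VOTable specification or library.

```python
def pad(value, bytecount):
    """Encode value as UTF-8 and pad with NUL bytes to bytecount (assumed layout)."""
    raw = value.encode("utf-8")
    if len(raw) > bytecount:
        raise ValueError("value does not fit in bytecount")
    return raw + b"\x00" * (bytecount - len(raw))


def read_cell(buf, row, col, widths):
    """Read one cell by offset arithmetic; widths holds each column's bytecount."""
    if any(w is None for w in widths):
        # bytecount is '*' or absent for some column: no fixed row size.
        raise ValueError("variable-width column: rows must be scanned in order")
    row_size = sum(widths)
    start = row * row_size + sum(widths[:col])
    cell = buf[start:start + widths[col]]
    return cell.rstrip(b"\x00").decode("utf-8")


# Two rows in a single char column declared with bytecount='5'.
widths = [5]
buf = pad("abcd", 5) + pad("abc\u00e9", 5)
print(read_cell(buf, 1, 0, widths))
```

When any width is unknown the reader has to fall back to scanning row by row, which is the same situation parsers are in today.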
Obviously we would need to work out what the side effects would be for
existing parsers and VOTables, but assuming it is possible, would this
meet everyone's requirements?
Dave
--------
Dave Morris
Software Developer
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------