Unicode in VOTable

Thu Oct 16 03:14:20 CEST 2014

On 2014-10-15 15:53, Mark Taylor wrote:
> 
> This ends with the following three proposals for the way forward
> as regards representing character array length:
> 
>    P1: Define both arraysize and binary run-length as
>        "number of code points"
>    P2: Define arraysize as "number of code points" and
>        binary run-length as "number of bytes"
>    P3: Define both arraysize and binary run-length as "number of bytes
>        the characters would take in UTF-8"
> 

Are the problems caused by trying to fit two different concepts into a 
single attribute?

If so, may I suggest a fourth option.

     P4: Define 'arraysize' as the number of characters (code points).
         Define a new optional attribute 'bytecount' which contains the
         "number of bytes the characters would take in UTF-8".

If the data source is able to calculate the byte count then it may add 
the bytecount attribute to the FIELD, enabling binary encoding parsers 
to use pointer arithmetic to skip fixed size rows.

If the data source is not able to calculate the byte count efficiently 
then it may either set the value to '*' or omit the attribute entirely.

So, based on the examples on the wiki page
http://wiki.ivoa.net/twiki/bin/view/IVOA/VOTableUnicode20141016

If the table only contained the first string, with only ASCII 
characters, then the VOTable header would be

     <FIELD type='char' arraysize='4' bytecount='4'/>

If the table contained the second string, with a single multi-byte 
character in it, then the VOTable header would be

     <FIELD type='char' arraysize='4' bytecount='5'/>

If the data source is unable to calculate the UTF-8 byte count 
efficiently, then the VOTable header would be

     <FIELD type='char' arraysize='4'/>
or
     <FIELD type='char' arraysize='4' bytecount='*'/>
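
To make the two counts concrete, here is a minimal Python sketch of how a 
writer might derive them. The function names and the sample strings are my 
own, chosen for illustration (they are not the strings from the wiki page), 
and taking the maximum over all values in the column is an assumption about 
how a multi-row column would be handled.

```python
def code_points(text: str) -> int:
    # A Python str is a sequence of code points, so len() counts them.
    return len(text)


def utf8_bytes(text: str) -> int:
    # Encode to UTF-8 and count the resulting bytes.
    return len(text.encode("utf-8"))


def field_header(values, bytecount_known=True):
    """Build a FIELD element for a char column (hypothetical helper).

    Assumption: arraysize and bytecount are each the maximum over all
    values in the column. If the byte count is not available, the
    bytecount attribute is simply omitted.
    """
    attrs = f"type='char' arraysize='{max(code_points(v) for v in values)}'"
    if bytecount_known:
        attrs += f" bytecount='{max(utf8_bytes(v) for v in values)}'"
    return f"<FIELD {attrs}/>"


# Illustrative strings only: four ASCII characters, and four characters
# where the last one (U+00E9) takes two bytes in UTF-8.
ascii_only = "abcd"
one_multibyte = "abc\u00e9"

print(field_header([ascii_only]))
print(field_header([one_multibyte]))
print(field_header([one_multibyte], bytecount_known=False))
```

The point is that the code point count comes for free in most languages, 
while the byte count needs an explicit encoding step, which is why I am 
suggesting it be optional.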

This allows novice programmers and scripters to generate simple VOTables 
without having to worry about scanning the data, encoding strings and 
counting the bytes in an encoding they probably weren't aware of in the 
first place (99% of the time their programming environment/language just 
takes care of it).

It also allows technically advanced binary encoding library writers to 
generate VOTable headers with the right metadata to enable binary 
encoding parsers to use pointer arithmetic to skip fixed-length rows 
where possible.
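
As a rough illustration of the skipping, here is a Python sketch. The layout 
is purely an assumption on my part: a table with a single char column, each 
cell stored in exactly 'bytecount' bytes of UTF-8 and padded with NUL bytes. 
Nothing in the proposal so far defines padding, and the helper names are 
hypothetical.

```python
def pack_rows(values, bytecount):
    # Hypothetical writer: store each value in exactly `bytecount` bytes,
    # padding short values with NUL bytes (an assumed convention).
    out = bytearray()
    for value in values:
        raw = value.encode("utf-8")
        if len(raw) > bytecount:
            raise ValueError("value does not fit in bytecount bytes")
        out += raw.ljust(bytecount, b"\x00")
    return bytes(out)


def read_row(buffer, row_index, bytecount):
    # Because every row has the same byte length, the start of row N is
    # a simple multiplication; no earlier row needs to be decoded.
    start = row_index * bytecount
    cell = buffer[start:start + bytecount]
    return cell.rstrip(b"\x00").decode("utf-8")


data = pack_rows(["abcd", "abc\u00e9", "ab"], bytecount=5)
print(len(data))
print(read_row(data, 2, bytecount=5))
```

Without a known byte count the reader would have to decode every preceding 
row to find where the one it wants begins.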

Obviously we would need to work out what the side effects would be for 
existing parsers and VOTables, but assuming it is possible, would this 
meet everyone's requirements?

Dave

--------
Dave Morris
Software Developer
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------