Unicode in VOTable

Dave Morris dave.morris at metagrid.co.uk
Mon Aug 25 16:55:27 PDT 2014


On 2014-08-13 17:05, Mark Taylor wrote:
>> 
>> <TABLE>
>>   <FIELD ID="aString" datatype="char/utf8" arraysize="10"/>
>>   <DATA><TABLEDATA>
>>   <TR>
>>    <TD>Apple</TD>
>> </TR></TABLE>
>> 
>> In this context, what could the FIELD/@arraysize mean?
>> It can't mean bytes, because this is XML, and all notion of bytes has
>> been left behind in the lexer.
> 
> I don't see why it can't mean "bytes that would be occupied by the
> string if it were to be encoded as utf8".

FIELD/@arraysize is in the header and applies to the whole table, not to 
a specific row.

It can't mean "bytes that would be occupied by the string ..." because 
we don't have _a_ single string, we have a different string in each row.

   <TABLE>
        <FIELD ID="aString" datatype="char/utf8" bytecount="??"/>
        <DATA>
            <TABLEDATA>
                <TR>
                    <TD>Apple</TD>
                </TR>
                <TR>
                    <TD>Ant</TD>
                </TR>
                <TR>
                    <TD>Aardvark</TD>
                </TR>

* Yes, I know this example would probably use arraysize='*'.
Illustrating different Unicode strings clearly in an email is not easy.
Please imagine three strings with the same number of characters but
different numbers of non-ASCII characters, resulting in different UTF-8
encoded byte counts.
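
As a rough illustration of that point, here is a minimal Python sketch. The three strings are made up for this sketch (they are not from the thread), and "character" is assumed to mean a Unicode code point, which is what Python's len() counts.

```python
# Three hypothetical strings, each exactly five code points long.
# Escapes are used so the code points are unambiguous in plain text.
samples = [
    "Apple",                            # ASCII only: 1 byte per character
    "\u00c4pfel",                       # starts with U+00C4 (2 bytes in UTF-8)
    "\u65e5\u672c\u8a9e\u306e\u672c",   # five CJK/kana characters (3 bytes each)
]

def utf8_byte_count(text: str) -> int:
    """Return the number of bytes needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))

for s in samples:
    print(f"{len(s)} characters, {utf8_byte_count(s)} bytes")
```

All three have the same character count, so a single per-column byte count cannot describe every row exactly.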

If the data came from a database, then we might know the maximum number 
of characters for that column:

     CREATE TABLE aTable (
         aString CHAR(5)
         );

Based on that we could calculate the maximum number of bytes needed to 
encode a value for that column:

     max byte count = max size of encoded character * number of characters

     4 * 5 = 20

Giving us

    <TABLE>
         <FIELD ID="aString" datatype="char/utf8" maxbytecount="20"/>
         <DATA>
             <TABLEDATA>
                 <TR>
                     <TD>Apple</TD>
                 </TR>
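
The worst-case arithmetic can be checked with a short Python sketch. The function name max_byte_count is hypothetical, and the sketch assumes that CHAR(5) counts Unicode code points; some databases count bytes or UTF-16 units instead, which would change the bound.

```python
# UTF-8 encodes any Unicode code point in at most 4 bytes.
MAX_UTF8_BYTES_PER_CHAR = 4

def max_byte_count(char_count: int) -> int:
    """Upper bound on the UTF-8 size of a string of char_count code points."""
    return MAX_UTF8_BYTES_PER_CHAR * char_count

# Worst case for a 5-character column: five code points outside the
# Basic Multilingual Plane, each of which needs 4 bytes in UTF-8.
worst_case = "\U0001F34E" * 5  # U+1F34E RED APPLE, chosen for illustration

print(max_byte_count(5))
print(len(worst_case.encode("utf-8")))
print(len("Apple".encode("utf-8")))
```

The plain ASCII value uses only a quarter of the reserved space, which is the cost of sizing the field for the worst case.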

> 
> Use of a type in which the run length and arraysize are both utf8 byte
> counts allows columns (and hence possibly rows) with fixed-length
> encodings.

If we want to have fixed length columns we would need to pad each value to 
the same length.

For CHAR(5) encoded as "char/utf8" that would be

     max byte count = max size of encoded character * number of characters

     4 * 5 = 20

This ensures we leave enough space in the byte stream to encode a string 
containing the most complicated encoded characters.

    <TABLE>
         <FIELD ID="aString" datatype="char/utf8" maxbytecount="20"/>
         <DATA>
             <TABLEDATA>
                 <TR>
                     <TD>Apple</TD>
                 </TR>

But if we are going to do that, it would be better to go back to the 
original meaning of FIELD/@arraysize, i.e. the element count:

    <TABLE>
         <FIELD ID="aString" datatype="char/utf8" arraysize="5"/>
         <DATA>
             <TABLEDATA>
                 <TR>
                     <TD>Apple</TD>
                 </TR>

and describe the byte count and padding in the definition of the 
datatype serialization:

     The BINARY serialization of a fixed size "char/utf8"
     field consists of an array of bytes with enough space
     for the most complex encoded character sequence for
     that field.

     The size of the byte array is calculated by multiplying
     the size of the most complex encoded character by the
     number of characters in the field.

         byte count = size of most complex encoded character * number of characters

     The resulting byte array contains the UTF-8 encoded value,
     followed by padding with zero bytes up to the required byte
     count for the field.
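
A minimal Python sketch of that serialization might look like the following. The names encode_fixed and decode_fixed are hypothetical, arraysize is taken to be a code point count, and the "most complex" character is assumed to be 4 bytes, the UTF-8 maximum. This is one possible reading of the proposed text, not an agreed VOTable definition.

```python
MAX_UTF8_BYTES_PER_CHAR = 4  # assumed size of the "most complex" character

def encode_fixed(text: str, arraysize: int) -> bytes:
    """Encode text as UTF-8 and zero-pad it to the fixed field size."""
    if len(text) > arraysize:
        raise ValueError("value has more characters than arraysize allows")
    byte_count = MAX_UTF8_BYTES_PER_CHAR * arraysize
    # ljust appends zero bytes until the array reaches byte_count.
    return text.encode("utf-8").ljust(byte_count, b"\x00")

def decode_fixed(data: bytes) -> str:
    """Strip the trailing zero padding and decode the UTF-8 value."""
    # Assumes values never contain U+0000: its UTF-8 encoding is a single
    # zero byte, so a trailing NUL could not be told apart from padding.
    return data.rstrip(b"\x00").decode("utf-8")

field = encode_fixed("Apple", 5)
print(len(field))
print(decode_fixed(field))
```

Every value for the column then occupies the same number of bytes, so a reader can skip to any row without decoding the preceding ones.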


--------
Dave Morris
Software Developer
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------


