Unicode in VOTable
Dave Morris
dave.morris at metagrid.co.uk
Mon Aug 25 16:55:27 PDT 2014
On 2014-08-13 17:05, Mark Taylor wrote:
>>
>> <TABLE>
>> <FIELD ID="aString" datatype="char/utf8" arraysize="10"/>
>> <DATA><TABLEDATA>
>> <TR>
>> <TD>Apple</TD>
>> </TR></TABLE>
>>
>> In this context, what could the FIELD/@arraysize mean?
>> It can't mean bytes, because this is XML, and all notion of bytes has
>> been left behind in the lexer.
>
> I don't see why it can't mean "bytes that would be occupied by the
> string if it were to be encoded as utf8".
FIELD/@arraysize is in the header and applies to the whole table, not to
a specific row.
It can't mean "bytes that would be occupied by the string ..." because
we don't have _a_ single string, we have a different string in each row.
<TABLE>
<FIELD ID="aString" datatype="char/utf8" bytecount="??"/>
<DATA>
<TABLEDATA>
<TR>
<TD>Apple</TD>
</TR>
<TR>
<TD>Ant</TD>
</TR>
<TR>
<TD>Aardvark</TD>
</TR>
* yes I know this example would probably be arraysize='*'
Trying to illustrate different unicode strings clearly in an email is
not easy. I ask you to imagine they are three strings with the same
number of characters but with different numbers of special characters,
resulting in different UTF-8 encoded byte counts.
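As a minimal sketch in Python (an illustration only; the sample strings and the helper name utf8_len are made up for this purpose and are not part of VOTable), strings with the same number of characters can have quite different UTF-8 byte counts:

```python
# Four strings, each exactly 5 characters (code points) long.
# Escapes are used so the example survives transport through email.
samples = [
    "Apple",                           # ASCII only: 1 byte per character
    "\u00c4pfel",                      # one 2-byte character, four 1-byte
    "\u65e5\u672c\u8a9e\u306e\u672c",  # five 3-byte characters
    "\U0001F34E" * 5,                  # five 4-byte characters
]


def utf8_len(text):
    """Return the number of bytes needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))


for text in samples:
    print(len(text), "characters,", utf8_len(text), "bytes")
```

The character count is the same in every row, but the byte count ranges from 5 to 20, so a single per-column byte count cannot describe every row exactly.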
If the data came from a database, then we might know the maximum number
of characters for that column:
CREATE TABLE aTable (
aString CHAR(5)
);
Based on that we could calculate the maximum number of bytes needed to
encode a value for that column:
max byte count = max size of encoded character * number of characters
4 * 5 = 20
Giving us:
<TABLE>
<FIELD ID="aString" datatype="char/utf8" maxbytecount="20"/>
<DATA>
<TABLEDATA>
<TR>
<TD>Apple</TD>
</TR>
>
> Use of a type in which the run length and arraysize are both utf8 byte
> counts allows columns (and hence possibly rows) with fixed-length
> encodings.
If we want to have fixed length columns we would need to pad each value to
the same length.
For CHAR(5) encoded as "char/utf8" that would be:
max byte count = max size of encoded character * number of characters
4 * 5 = 20
This ensures we leave enough space in the byte stream to encode a string
containing the most complicated encoded characters.
<TABLE>
<FIELD ID="aString" datatype="char/utf8" maxbytecount="20"/>
<DATA>
<TABLEDATA>
<TR>
<TD>Apple</TD>
</TR>
But if we are going to do that, it would be better to go back to the
original meaning of FIELD/@arraysize = element count.
<TABLE>
<FIELD ID="aString" datatype="char/utf8" arraysize="5"/>
<DATA>
<TABLEDATA>
<TR>
<TD>Apple</TD>
</TR>
and describe the byte count and padding in the definition of the
datatype serialization:
The BINARY serialization of a fixed size "char/utf8"
field consists of an array of bytes with enough space
for the most complex encoded character sequence for
that field.
The size of the byte array is calculated by multiplying
the size of the most complex encoded character by the
number of characters in the field.
byte count = size of most complex encoded character * number of characters
The resulting byte array contains the UTF-8 encoded value,
followed by padding with zero bytes up to the required byte
count for the field.
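The serialization described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: "char/utf8" is the datatype proposed in this thread rather than an existing VOTable type, the function names encode_fixed and decode_fixed are hypothetical, and the largest encoded character is taken to be 4 bytes, as in the CHAR(5) example.

```python
# Assumption: the largest UTF-8 encoded character occupies 4 bytes.
MAX_UTF8_BYTES_PER_CHAR = 4


def encode_fixed(value, char_count):
    """Encode value as UTF-8 and zero-pad it to the fixed field width."""
    if len(value) > char_count:
        raise ValueError("value has more characters than the field allows")
    # byte count = size of largest encoded character * number of characters
    width = MAX_UTF8_BYTES_PER_CHAR * char_count
    return value.encode("utf-8").ljust(width, b"\x00")


def decode_fixed(field_bytes):
    """Strip the trailing zero padding and decode the UTF-8 value."""
    return field_bytes.rstrip(b"\x00").decode("utf-8")


cell = encode_fixed("Apple", 5)
print(len(cell))           # field width in bytes for a 5-character field
print(decode_fixed(cell))  # the original value, padding removed
```

One consequence of zero padding is that a value ending in U+0000 cannot be distinguished from padding, so this sketch assumes values never end with a NUL character.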
--------
Dave Morris
Software Developer
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------