Unicode in VOTable
Mark Taylor
m.b.taylor at bristol.ac.uk
Thu Oct 16 14:04:27 CEST 2014
Dave,
On Wed, 15 Oct 2014, Dave Morris wrote:
> On 2014-10-15 15:53, Mark Taylor wrote:
> >
> > P3: Define both arraysize and binary run-length as "number of bytes
> > the characters would take in UTF-8"
> >
>
> P3 does not work.
I feel that "does not work" is overstating it.
Your discussion below demonstrates that for the case of writing
streamed VOTable output from a database with fixed-length unicode
columns whose content may include non-ASCII characters, P3 does not
allow you to write VOTable fields with fixed-arraysize columns.
Other cases exist, for instance translating FITS to VOTable,
streaming database output to VOTable in the case that you know
a column, however it is represented in the database, in fact
contains only ASCII characters (Walter cited this case),
or writing to a VOTable in a non-streamed fashion where you have
the chance (and think it's worthwhile) to go through the output
and check the UTF-8 length of each character cell. In all those
cases you could determine and output a fixed arraysize value in P3.
I'm not arguing here that P3 must be the right answer,
just clarifying what are and are not its deficiencies.
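To make the non-streamed case concrete, here is a minimal Python sketch of checking the UTF-8 length of each cell before choosing an arraysize under P3. The function names (utf8_len, p3_arraysize) and the sample values are hypothetical and not part of any VOTable library. The sketch reports a fixed arraysize only when every cell has the same UTF-8 byte count, and leaves aside whether shorter cells could instead be padded to a common length.

```python
def utf8_len(value: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


def p3_arraysize(values) -> str:
    """Return a fixed arraysize if all cells share one UTF-8 byte length,
    otherwise '*' (variable length). Hypothetical helper for illustration."""
    lengths = {utf8_len(v) for v in values}
    return str(lengths.pop()) if len(lengths) == 1 else "*"


# Made-up columns: each value is three code points, as in a CHAR(3) column.
ascii_column = ["abc", "xyz", "M31"]   # ASCII only: 3 bytes per value
mixed_column = ["abc", "αβγ", "日本語"]  # 3, 6 and 9 bytes respectively

print(p3_arraysize(ascii_column))
print(p3_arraysize(mixed_column))
```

The ASCII-only column yields a fixed value, while the column with non-ASCII content falls back to '*', which is the situation Dave describes for streamed database output.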
However:
> P3 gives us all the disadvantages of a 'string' data type (every char column
> is variable length) without the advantage of explicitly defining a 'string'
> data type (arrays of strings).
I admit that BINARY serialization of arrays of strings
(multi-dimensional arrays of characters) becomes problematic with P3,
since in this case you must supply the length (in P3, UTF-8 byte count)
of the longest string - unlike for 1-d char arrays, variable-length
is not an option. That may be a killer argument.
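The difficulty can be sketched in Python as follows. This is only an illustration under stated assumptions: pack_string_array is a hypothetical helper, NUL padding of short strings is assumed, and the "WxN" arraysize form (string width first, then number of strings) is used as I understand the multi-dimensional convention. The point is that the width must be known before any string is written, which requires a pass over all the data.

```python
def pack_string_array(strings):
    """Pack strings into fixed-width UTF-8 cells, NUL-padded.

    Returns (arraysize, data). The width is the UTF-8 byte count of the
    longest string, so every string must be inspected before writing.
    Hypothetical helper; padding and layout are assumptions.
    """
    encoded = [s.encode("utf-8") for s in strings]
    width = max((len(b) for b in encoded), default=0)
    cells = [b.ljust(width, b"\x00") for b in encoded]
    return f"{width}x{len(strings)}", b"".join(cells)


# Two strings of two code points each, but 2 and 6 bytes in UTF-8.
arraysize, data = pack_string_array(["ab", "日本"])
print(arraysize, len(data))
```

Both strings have the same number of code points, yet the cell width is set by the one with multi-byte characters.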
Mark
> If a database column contains unicode codepoints then the number of bytes in
> each row depends on the number of multi-byte characters in each value.
>
> There is no way to calculate a single value for "number of bytes the
> characters would take in UTF-8" to cover all the values in a column because
> every row will be different.
>
> All of the char columns in the database would have to be described as
> arraysize='*', losing any possibility of doing pointer arithmetic in the
> binary stream.
>
> P3 gives us all the disadvantages of a 'string' data type (every char column
> is variable length) without the advantage of explicitly defining a 'string'
> data type (arrays of strings).
>
> ----
>
> This is already happening.
>
> The default encoding for a PostgreSQL database is UTF8.
>
> That means all of the character data columns contain unicode codepoints.
>
> A column definition of CHAR(3) means it contains three codepoints, but the
> number of bytes in each row will depend on how many multi-byte characters are
> contained in each value.
>
> If we try to link array size and byte count, then if data from the CHAR(3)
> column is output as a VOTable, the FIELD header will have to be defined as
> arraysize='*', NOT arraysize='3'.
>
> If a VOTable from that service is imported into another database, then the
> arraysize='*' FIELD header will mean that the imported column will have to be
> created as TEXT or VARCHAR with no length limit.
>
> Over time, as data is uploaded and downloaded between services, all of the
> char columns in the VO will gradually tend towards arraysize='*'.
>
> At which point, arraysize becomes redundant, and using pointer arithmetic
> within a binary stream will no longer be an issue.
>
> ----
>
> It would be better to define a 'string' data type now and make use of the
> advantages it brings.
>
> Dave
>
> --------
> Dave Morris
> Software Developer
> Wide Field Astronomy Unit
> Institute for Astronomy
> University of Edinburgh
> --------
>
>
>
--
Mark Taylor Astronomical Programmer Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776 http://www.star.bris.ac.uk/~mbt/