Unicode in VOTable

Thu Oct 16 14:04:27 CEST 2014

Dave,

On Wed, 15 Oct 2014, Dave Morris wrote:

> On 2014-10-15 15:53, Mark Taylor wrote:
> > 
> >    P3: Define both arraysize and binary run-length as "number of bytes
> >        the characters would take in UTF-8"
> > 
> 
> P3 does not work.

I feel that "does not work" is overstating it.

Your discussion below demonstrates that when writing streamed
VOTable output from a database with fixed-length unicode columns
whose content may include non-ASCII characters, P3 does not allow
you to write those columns as VOTable fields with a fixed arraysize.
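
As a small illustration of that case, here is a sketch in Python.
The sample values are made up for the example and do not come from
any real table. Each value is exactly three codepoints long, as it
would be in a CHAR(3) column, yet the UTF-8 byte counts differ:

```python
# Hypothetical sample values, each exactly three codepoints long:
# plain ASCII, accented Latin letters, and CJK characters.
values = ["abc", "\u00e7\u00e9\u00f1", "\u65e5\u672c\u8a9e"]


def utf8_length(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


codepoint_counts = [len(v) for v in values]
byte_counts = [utf8_length(v) for v in values]

for value, n_chars, n_bytes in zip(values, codepoint_counts, byte_counts):
    print(repr(value), n_chars, "codepoints,", n_bytes, "UTF-8 bytes")
```

A streaming writer sees these rows one at a time, so it cannot know
in advance which byte count to declare.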

Other cases exist: translating FITS to VOTable; streaming database
output to VOTable when you know that a column, however it is
represented in the database, in fact contains only ASCII characters
(Walter cited this case); or writing a VOTable in a non-streamed
fashion, where you have the chance (and think it's worthwhile) to go
through the output and check the UTF-8 length of each character cell.
In all those cases you could determine and output a fixed arraysize
value in P3.
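
The non-streamed check could look roughly like the following Python
sketch. The function name is hypothetical, and it assumes that
shorter cells may be padded up to the declared length, so that the
largest UTF-8 byte count in the column can serve as the fixed
arraysize:

```python
def p3_fixed_arraysize(cells):
    """Return the largest UTF-8 byte length among the cells.

    Assumption: shorter cells can be padded up to this length, so the
    maximum can be declared as a fixed arraysize under P3.
    """
    return max((len(cell.encode("utf-8")) for cell in cells), default=0)


# An ASCII-only column: the byte count equals the character count.
ascii_column = ["M31", "NGC 1", "LMC"]

# A column with non-ASCII content: only a full pass over the data
# reveals the byte count that has to be declared.
mixed_column = ["abc", "\u00e9t\u00e9"]

print(p3_fixed_arraysize(ascii_column))
print(p3_fixed_arraysize(mixed_column))
```

This needs every cell to be available before the FIELD header is
written, which is exactly what streamed output does not offer.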

I'm not arguing here that P3 must be the right answer,
just clarifying what are and are not its deficiencies.

However:

> P3 gives us all the disadvantages of a 'string' data type (every char column
> is variable length) without the advantage of explicitly defining a 'string'
> data type (arrays of strings).

I admit that BINARY serialization of arrays of strings
(multi-dimensional arrays of characters) becomes problematic with P3,
since in this case you must supply the length (in P3, UTF-8 byte count)
of the longest string - unlike for 1-d char arrays, variable-length
is not an option.  That may be a killer argument.

Mark

> If a database column contains unicode codepoints then the number of bytes in
> each row depends on the number of multi-byte characters in each value.
> 
> There is no way to calculate a single value for "number of bytes the
> characters would take in UTF-8" to cover all the values in a column because
> every row will be different.
> 
> All of the char columns in the database would have to be described as
> arraysize='*', losing any possibility of doing pointer arithmetic in the
> binary stream.
> 
> P3 gives us all the disadvantages of a 'string' data type (every char column
> is variable length) without the advantage of explicitly defining a 'string'
> data type (arrays of strings).
> 
> ----
> 
> This is already happening.
> 
> The default encoding for a PostgreSQL database is UTF8.
> 
> That means all of the character data columns contain unicode codepoints.
> 
> A column definition of CHAR(3) means it contains three codepoints, but the
> number of bytes in each row will depend on how many multi-byte characters are
> contained in each value.
> 
> If we try to link array size and byte count, then if data from the CHAR(3)
> column is output as a VOTable, the FIELD header will have to be defined as
> arraysize='*', NOT arraysize='3'.
> 
> If a VOTable from that service is imported into another database, then the
> arraysize='*' FIELD header will mean that the imported column will have to be
> created as TEXT or VARCHAR with no length limit.
> 
> Over time, as data is uploaded and downloaded between services, all of the
> char columns in the VO will gradually tend towards arraysize='*'.
> 
> At which point, arraysize becomes redundant, and using pointer arithmetic
> within a binary stream will no longer be an issue.
> 
> ----
> 
> It would be better to define a 'string' data type now and make use of the
> advantages it brings.
> 
> Dave
> 
> --------
> Dave Morris
> Software Developer
> Wide Field Astronomy Unit
> Institute for Astronomy
> University of Edinburgh
> --------
> 
> 
> 

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/