VOTable multi-dimensional arrays too restrictive

Thu Feb 18 09:26:53 CET 2021

Hi Apps,

On Wed, Feb 17, 2021 at 09:52:17AM -0800, Patrick Dowler wrote:
> The multi-dimensional array (arraysize="10x*") mechanism is too restrictive
> for really common user cases: list of words.
> 
> see: https://github.com/ivoa-std/VOTable/issues/25
> 
> That discussion shows a few use cases for char arrays that we would like to
> explore and would like to avoid making up a bespoke DataLink- or even
> DAL-specific mechanism:
> - multiple terms in DataLink semantics

For the record: I'd like to avoid this particular use case for a
number of reasons, but that's a datalink question.

In general, I've wanted the array[*][n] (n of variable-length
things) many times, and array[*][*] is almost certainly simple once
you can do array[*][n].  Whether we need to go beyond that
(array[*][*][*]) I'm less sure about, so I wouldn't rule out
solutions that don't allow that.

What I'm against is a per-column separator ("word-list-|",
"word-list-x").  A generic serialiser would have to see all values
that go into such a list before it can choose the separator, and
that kind of look-ahead has already been really painful for null
values; that's why we introduced BINARY2.

A constant separator would be ok, but the "cross-fingers" approach
that Mark describes in
https://github.com/ivoa-std/VOTable/issues/25#issuecomment-651039726
("let's use a new-line and hope nobody will ever need it in one of
their values") is IMHO sure to break at some point.  Let's use the
conventional way of dealing with this kind of thing and devise an
escaping mechanism -- I'd go for normal backslash-escaping, where
we could allow \t, \n, \r, \\ and say "use unsignedByte arrays for
anything else".
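
To make the escaping idea concrete, here is a minimal Python sketch.
It assumes a newline as the constant separator (taken from the linked
comment, not a settled choice) and exactly the four escapes named
above; the function names are made up for illustration.

```python
# Hypothetical helpers: newline-separated word list with backslash escapes.
SEP = "\n"  # assumed constant separator
ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def join_words(words):
    """Escape each word, then join the words with the separator."""
    return SEP.join("".join(ESCAPES.get(ch, ch) for ch in w) for w in words)


def split_words(text):
    """Inverse of join_words; a small finite state machine."""
    words, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            code = next(chars, None)
            if code not in UNESCAPES:
                raise ValueError(f"unsupported escape: \\{code}")
            current.append(UNESCAPES[code])
        elif ch == SEP:
            words.append("".join(current))
            current = []
        else:
            current.append(ch)
    words.append("".join(current))
    return words
```

One limitation of this sketch: an empty list and a list holding one
empty string both serialise to the empty string, so that case would
need a separate convention.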

There's a problem with using line feeds as separator, though, because
some XML processors take liberties with whitespace whether or not
they're allowed to do that, so using whitespace is probably a rather
fragile choice.  Worse: Depending on whether the VOTable was
transported as text/xml (where MIME wants you to have CR LF line
endings) or as application/x-votable+xml (where MIME says to leave
line endings as they are) you might see different array contents.
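
As a small illustration of how little control one has over line
endings in XML, the following sketch parses the same cell with CR LF
and with a bare LF using Python's standard library.  XML 1.0 asks
processors to normalise CR LF to LF before handing content to the
application, so both calls should return the same string and a
literal carriage return in a value does not survive.  The TD content
is made up for the example.

```python
import xml.etree.ElementTree as ET


def cell_text(xml_bytes):
    """Return the character content of a single-element document."""
    return ET.fromstring(xml_bytes).text


# The same two-word cell, once with CR LF and once with a bare LF.
with_crlf = cell_text(b"<TD>first\r\nsecond</TD>")
with_lf = cell_text(b"<TD>first\nsecond</TD>")
same_after_parsing = with_crlf == with_lf
```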

See below for what I'd do instead.

On the other hand, if we start defining syntax, we could of course go
the postgres way and declare a type json.  Let me say immediately
that my gut feeling is that that's a bad idea for two reasons:

(a) this would provide very weak column metadata indeed, so it will
make the clients' lives a lot harder, and

(b) it will make the table designer's conundrum of what to represent
at the cell level and what at the table level a lot harder.  Cf. our
experiments to represent spectra in single table rows using
array-valued cells.

Having said that, one can't deny that json-valued cells would solve
this case (and a lot of others on top) with rather moderate
implementation cost on the side of the VOTable parser writers, in
particular if we defined the json to be utf-8-encoded.
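
For a sense of the implementation cost, a sketch of json-valued cells
in Python follows; to_cell and from_cell are hypothetical names, and
the utf-8 encoding is the assumption from the paragraph above.  Note
that strict JSON has no NaN or Inf, so float arrays containing them
would need an extra convention; with allow_nan=False the serialiser
raises ValueError instead of emitting non-standard tokens.

```python
import json


def to_cell(value):
    """Serialise a nested list to utf-8 encoded, strict JSON."""
    text = json.dumps(value, ensure_ascii=False, allow_nan=False)
    return text.encode("utf-8")


def from_cell(raw):
    """Parse a utf-8 encoded JSON cell back into Python lists."""
    return json.loads(raw.decode("utf-8"))
```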

The technically most stable thing that, at the same time, requires
parsers of low complexity is marking up opening and closing
explicitly (rather than just using a separator, that is).  CSV is a
semi-member of that family, and at least for the strings we could
just agree on a specific dialect (separator character, quote
escaping) and be done with it.
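
A sketch of what "agreeing on a dialect" could look like, using
Python's csv module.  The particular choices here (comma separator,
double quote as quote character, quotes escaped by doubling) are
assumptions for illustration, not a proposal, and the function names
are hypothetical.

```python
import csv
import io

# Assumed dialect; any fixed, documented choice would do.
DIALECT = dict(delimiter=",", quotechar='"', doublequote=True,
               quoting=csv.QUOTE_MINIMAL, lineterminator="\n")


def words_to_csv(words):
    """Serialise a flat list of strings as one CSV record."""
    buf = io.StringIO()
    csv.writer(buf, **DIALECT).writerow(words)
    return buf.getvalue()[:-1]  # drop the record terminator


def csv_to_words(text):
    """Parse one CSV record back into a list of strings."""
    return next(csv.reader([text], **DIALECT))
```

This only covers flat lists; nesting would need something on top.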

For our "list of strings" use case, a solution I could easily learn
to like would use ASCII shift-in and shift-out (SI is \x0f and SO is
\x0e in ASCII).  For python

  [["ab", "c"], ["de", "fgh"]]

you'd be writing SI SI ab SO SI c SO SO SI de SO SI fgh SO SO
-- so you could even represent arbitrarily deep nesting.  This, by
the way, could work in TABLEDATA, too; just write &#x0e;, etc.,
although XML 1.0 does not admit these control characters even as
character references, so that would take XML 1.1.  Since such
control characters have no business in char material anyway, we might
not even need escaping.  However, as we enter hard-core ASCII, we
could define ESC (\x1b) as an escape character if we wanted and then
even have SI and SO in our strings if, in some future, we want to
make char binary-proof.
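
A minimal Python sketch of this scheme, without the optional ESC.  It
assumes that the outermost list is implicit (as in the example
above), that the nesting depth is known from column metadata, and
that SI opens and SO closes; the function names are hypothetical.

```python
SI = "\x0f"  # ASCII Shift In, used as the opening marker
SO = "\x0e"  # ASCII Shift Out, used as the closing marker


def encode_strings(cell):
    """Bracket every string and every inner list with SI ... SO."""
    def enc(item):
        if isinstance(item, str):
            if SI in item or SO in item:
                raise ValueError("SI/SO in a string would need escaping")
            return SI + item + SO
        return SI + "".join(enc(x) for x in item) + SO
    return "".join(enc(x) for x in cell)


def decode_strings(text, depth):
    """Parse back; depth=1 is a list of strings, depth=2 a list of lists."""
    pos = 0

    def items(level):
        nonlocal pos
        found = []
        while pos < len(text) and text[pos] != SO:
            if text[pos] != SI:
                raise ValueError(f"expected SI at offset {pos}")
            pos += 1
            if level == 1:  # innermost level: read one string
                end = text.find(SO, pos)
                if end < 0:
                    raise ValueError("unterminated string")
                found.append(text[pos:end])
                pos = end
            else:
                found.append(items(level - 1))
            if pos >= len(text):
                raise ValueError("missing SO")
            pos += 1  # consume the closing SO
        return found

    result = items(depth)
    if pos != len(text):
        raise ValueError("unbalanced SO")
    return result
```

The recursion plays the role of the stack; nothing here needs to know
lengths in advance.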

This doesn't immediately help for the multipolygon use case.
However, we might designate three floating point values that are
exactly representable in IEEE floats and doubles, and that perhaps
aren't too frequent in actual data, as shift in, shift out, and
escape.  Using, for instance, 0.25 as SI, 0.5 as SO, and 0.75 as
escape,

  [[0.1, NaN, 0.25], [1.0, +Inf]]

would become

  0.25 0.1 NaN 0.75 0.25 0.5 0.25 1.0 +Inf 0.5

Disgusted?  Perhaps, but I'd say only until you see the
alternatives worked out...
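
For those who want to play with it, a Python sketch of the float
variant follows.  It assumes the marker values from the example
(0.25 opens, 0.5 closes, 0.75 escapes), an implicit outermost list,
and that any data value equal to a marker is preceded by the escape;
the function names are hypothetical.

```python
OPEN, CLOSE, ESC = 0.25, 0.5, 0.75  # assumed marker values


def encode_floats(cell):
    """Flatten nested lists of floats into one marker-delimited list."""
    flat = []

    def emit(item):
        if isinstance(item, list):
            flat.append(OPEN)
            for x in item:
                emit(x)
            flat.append(CLOSE)
        else:
            if item in (OPEN, CLOSE, ESC):
                flat.append(ESC)  # data that looks like a marker
            flat.append(item)

    for item in cell:
        emit(item)
    return flat


def decode_floats(flat):
    """Rebuild the nesting with an explicit stack."""
    stack = [[]]
    values = iter(flat)
    for v in values:
        if v == ESC:
            try:
                stack[-1].append(next(values))
            except StopIteration:
                raise ValueError("dangling escape") from None
        elif v == OPEN:
            stack.append([])
        elif v == CLOSE:
            if len(stack) == 1:
                raise ValueError("close without matching open")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(v)
    if len(stack) != 1:
        raise ValueError("unclosed list")
    return stack[0]
```

NaN and the infinities pass through untouched since they never
compare equal to a marker.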

For integers we could use maxint, maxint-1, and maxint-2 as SI, SO,
and ESC; in most data, they ought to be rare enough to keep
escaping at an acceptable level.  For bits, that would become
extremely hard, but then I've not seen many bits in the wild, and we
should perhaps think of just deprecating the type.

So... if we started from scratch, I'd probably argue a lot for my
SI/SO/ESC proposal (or something else giving open/close/escape
without going all XML with different elements or even attributes).

Since there's a world around us, I might find it in me to like
embedded CSV or perhaps even not to abhor embedded JSON.  A
simple separator I'm not so wild about: It doesn't save much in terms
of space against SI/SO/ESC, but it immediately precludes deeper
nesting.  And we'd still need escaping, I claim.

What I'm fairly much against are schemes that would involve
"<start-record> <length> <data>" -- if only for the theoretical
reason that you need a full Turing machine to parse this kind of
thing (as opposed to a pushdown automaton for SI/SO/ESC and a finite
state machine for the separator).

Sorry for the long mail,

          Markus