VOTable multi-dimensional arrays too restrictive

Thu Feb 18 13:45:25 CET 2021

Hi Markus, all,
Not that I propose using it, but what are the reasons *against* going full XML and having something like <TD><A> ... </A></TD>, where the content of the <A></A> could be a bunch of <V></V>-s and each <V> could contain another <A></A>? That would keep it fully XML and might dissuade people from using it too much.

FWIW I would favour JSON for representing (multi-dimensional) arrays. It is how we represent array columns in SQL Server, for example, which allows querying them, as I suppose Postgres and other database vendors do too. I think it is preferable to some custom syntax that would need to be translated back and forth to a more usable form when going in and out of a database, for example.
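
As a rough illustration only (the values and the helper names to_cell and from_cell are made up here, and nothing below is taken from an actual SQL Server or VOTable setup), a nested array cell could round-trip through JSON like this in Python:

```python
import json

# Hypothetical cell values: a list of words and a ragged 2-D numeric array.
words = ["calibration", "preview", "this"]
polygons = [[0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
            [2.0, 2.0, 3.0, 2.0, 3.0, 3.0, 2.0, 3.0]]

def to_cell(value):
    """Serialise a (nested) list to compact JSON text for a table cell."""
    # ensure_ascii=False keeps non-ASCII characters as-is, so the text
    # could be stored UTF-8 encoded.
    return json.dumps(value, separators=(",", ":"), ensure_ascii=False)

def from_cell(text):
    """Parse the JSON text of a cell back into nested Python lists."""
    return json.loads(text)

word_cell = to_cell(words)
polygon_cell = to_cell(polygons)
```

Note that standard JSON has no literals for NaN or the infinities, so floating-point columns would need a separate convention for those.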

If some special syntax is chosen, would it be acceptable for all array values inside VOTable cells, or only for multi-dimensional ones?

Cheers
Gerard

> -----Original Message-----
> From: apps-bounces at ivoa.net <apps-bounces at ivoa.net> On Behalf Of
> Markus Demleitner
> Sent: Thursday, February 18, 2021 3:27
> To: apps at ivoa.net
> Subject: Re: VOTable multi-dimensional arrays too restrictive
> 
> Hi Apps,
> 
> On Wed, Feb 17, 2021 at 09:52:17AM -0800, Patrick Dowler wrote:
> > The multi-dimensional array (arraysize="10x*") mechanism is too
> > restrictive for really common use cases: lists of words.
> >
> > see:
> >
> > https://github.com/ivoa-std/VOTable/issues/25
> >
> > That discussion shows a few use cases for char arrays that we would
> > like to explore and would like to avoid making up a bespoke DataLink-
> > or even DAL-specific mechanism:
> > - multiple terms in DataLink semantics
> 
> For the record: I'd like to avoid this particular use case for a number of
> reasons, but that's a datalink question.
> 
> In general, I've wanted the array[*][n] (n of variable-length things) many
> times, and array[*][*] is almost certainly simple when you can do
> array[*][n].  Whether we need to go beyond that (array[*][*][*]) I'm less
> sure about, so I wouldn't rule out solutions that don't allow that.
> 
> What I'm against is a per-column separator ("word-list-|", "word-list-x").  A
> generic serialiser will have to know all values that go into such a list up front
> before choosing the separator, and that kind of thing has already been really
> painful for the null value thing; that's why we introduced BINARY2.
> 
> A constant separator would be ok, but the "cross-fingers" approach that
> Mark describes in
> https://github.com/ivoa-std/VOTable/issues/25#issuecomment-651039726
> ("let's use a new-line and hope nobody will ever need it in one of their
> values") is IMHO sure to break at some point.  Let's use the conventional way
> of dealing with this kind of thing and devise an escaping mechanism -- I'd go
> for normal backslash-escaping, where we could allow \t, \n, \r, \\ and say
> "use unsignedBytes arrays for anything else".
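
To make the escaping idea concrete, here is a minimal Python sketch. It assumes a newline as the constant separator (only because that is the separator under discussion; any fixed character covered by the escapes would do) and allows just the four escapes \t, \n, \r and \\. The names join_words and split_words are made up for this illustration.

```python
SEPARATOR = "\n"  # assumption: one fixed separator for every column

# The four escapes suggested in the mail; anything else is an error.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}

def join_words(words):
    """Escape each word, then join the words with the fixed separator."""
    return SEPARATOR.join(
        "".join(_ESCAPES.get(ch, ch) for ch in word) for word in words)

def split_words(cell):
    """Inverse of join_words: split on unescaped separators, undo escapes."""
    words, current = [], []
    chars = iter(cell)
    for ch in chars:
        if ch == "\\":
            code = next(chars, None)  # character following the backslash
            if code not in _UNESCAPES:
                raise ValueError(f"invalid escape sequence: \\{code}")
            current.append(_UNESCAPES[code])
        elif ch == SEPARATOR:
            words.append("".join(current))
            current = []
        else:
            current.append(ch)
    words.append("".join(current))
    return words

cell = join_words(["two\nlines", "back\\slash", "tab\there"])
```

One wrinkle a real specification would have to settle: with a bare separator, an empty cell cannot distinguish an empty list from a list holding one empty string.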
> 
> There's a problem with using line feeds as separator, though, because some
> XML processors take liberties with whitespace whether or not they're
> allowed to do that, so using whitespace is probably a rather fragile choice.
> Worse: Depending on whether the VOTable was transported as text/xml
> (where MIME wants you to have CR LF line endings) or as
> application/x-votable+xml (where MIME says to leave line endings as they
> are) you might see different array contents.
> 
> See below for what I'd do instead.
> 
> On the other hand, if we start defining syntax, we could of course go the
> postgres way and declare a type json.  Let me say immediately that my gut
> feeling is that that's a bad idea for two reasons:
> 
> (a) this would provide very weak column metadata indeed, so it will make
> the clients' lives a lot harder, and
> 
> (b) it will make the table designer's conundrum of what to represent at the
> cell level and what at the table level a lot harder.  Cf. our experiments to
> represent spectra in single table rows using array-valued cells.
> 
> Having said that, one can't deny that json-valued cells would solve this case
> (and a lot of others on top) with rather moderate implementation cost on
> the side of the VOTable parser writers, in particular if we defined the json to
> be utf-8-encoded.
> 
> 
> The technically most stable thing that, at the same time, requires parsers of
> low complexity is marking up opening and closing explicitly (rather than just
> using a separator, that is).  CSV is a semi-member of that family, and at least
> for the strings we could just agree on a specific dialect (separator character,
> quote escaping) and be done with it.
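
For the embedded-CSV option, a minimal sketch with Python's standard csv module might look as follows. The dialect shown (comma separator, double-quote quoting, quotes doubled inside fields) is only an assumption for illustration, not something that has been agreed on, and the function names are made up.

```python
import csv
import io

# Assumed dialect: comma separator, '"' as quote character, quotes doubled
# inside quoted fields, and no line terminator because a cell is one record.
DIALECT = dict(delimiter=",", quotechar='"', doublequote=True,
               quoting=csv.QUOTE_MINIMAL, lineterminator="")

def words_to_cell(words):
    """Write a list of strings as a single CSV record."""
    buffer = io.StringIO()
    csv.writer(buffer, **DIALECT).writerow(words)
    return buffer.getvalue()

def cell_to_words(cell):
    """Read a single CSV record back into a list of strings."""
    rows = list(csv.reader(io.StringIO(cell, newline=""), **DIALECT))
    return rows[0] if rows else []

cell = words_to_cell(["plain", "with,comma", 'with "quotes"'])
```

This covers one level of nesting only; a list of lists would need a second convention on top.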
> 
> For our "list of strings" use case, a solution I could easily learn to like would
> use ASCII shift-in and shift-out, \x0f and \x0e.  For the Python value
> 
>   [["ab", "c"], ["de", "fgh"]]
> 
> you'd be writing SI SI ab SO SI c SO SO SI de SO SI fgh SO SO
> -- so you could even represent arbitrarily deep nesting.  This, by the way,
> works in TABLEDATA, too; just write &#x0e;, etc.  Since we and utf-8 don't
> allow control characters in char material, we might not even need escaping.
> However, as we enter hard-core ASCII, we could define ESC (\x1b) as an
> escape character if we wanted and then even have SI and SO in our strings if,
> in some future, we want to make char binary-proof.
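
A minimal Python sketch of the open/close scheme, under these assumptions: SI opens and SO closes, using the ASCII code points 0x0F (SI) and 0x0E (SO); the outermost list is not bracketed, as in the example from the mail; no ESC handling is included; and the decoder is told the nesting depth, which in practice would come from the column metadata. All function names are made up.

```python
SI = "\x0f"  # ASCII Shift In, used as the "open" marker
SO = "\x0e"  # ASCII Shift Out, used as the "close" marker

def encode(value):
    """Bracket a string, or a (nested) list of strings, with SI ... SO."""
    if isinstance(value, str):
        if SI in value or SO in value:
            raise ValueError("SI/SO inside a string would need escaping")
        return SI + value + SO
    return SI + "".join(encode(item) for item in value) + SO

def encode_cell(items):
    """Encode the top-level list; its own brackets are left implicit."""
    return "".join(encode(item) for item in items)

def decode_cell(cell, depth):
    """Decode with an explicit stack; depth counts list levels (2 = list of lists)."""
    stack = [[]]   # stack[0] is the implicit top-level list
    chars = None   # character buffer while inside a string, else None
    for ch in cell:
        if ch == SI:
            if chars is not None:
                raise ValueError("SI inside a string")
            if len(stack) < depth:
                stack.append([])   # open a nested list
            else:
                chars = []         # open a string
        elif ch == SO:
            if chars is not None:
                stack[-1].append("".join(chars))
                chars = None
            elif len(stack) > 1:
                finished = stack.pop()
                stack[-1].append(finished)
            else:
                raise ValueError("SO without matching SI")
        elif chars is None:
            raise ValueError("character data outside a string")
        else:
            chars.append(ch)
    if chars is not None or len(stack) != 1:
        raise ValueError("unbalanced SI")
    return stack[0]

cell = encode_cell([["ab", "c"], ["de", "fgh"]])
```

The decoder needs nothing beyond a stack and one buffer, and empty strings and empty lists stay distinguishable because the depth is known.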
> 
> This doesn't immediately help for the multipolygon use case.
> However, we might designate three floating point values that are exactly
> representable in IEEE float and doubles and perhaps aren't so frequent as
> shift in, shift out, and escape.  Using, for instance,
> 0.25 as SI, 0.5 as SO, and 0.75 as escape,
> 
>   [[0.1, NaN, 0.25], [1.0, +Inf]]
> 
> would become
> 
>   0.25 0.1 NaN 0.75 0.25 0.5 0.25 1.0 +Inf 0.5
> 
> Disgusted?  Perhaps, but I'd say only until you see the alternatives worked
> out...
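
The same idea for floating-point arrays, again only as a Python sketch with made-up function names. It assumes 0.25, 0.5 and 0.75 as the open, close and escape markers named in the mail, brackets every nested list, and leaves the outermost list implicit in the same way as the string example.

```python
import math

OPEN, CLOSE, ESC = 0.25, 0.5, 0.75  # marker values named in the mail

def encode(items):
    """Flatten nested lists of floats; the outermost list gets no brackets."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.append(OPEN)
            flat.extend(encode(item))
            flat.append(CLOSE)
        else:
            # A data value that collides with a marker is prefixed with ESC.
            # NaN never compares equal to anything, so it passes through.
            if item in (OPEN, CLOSE, ESC):
                flat.append(ESC)
            flat.append(item)
    return flat

def decode(flat):
    """Rebuild the nested lists using an explicit stack."""
    stack = [[]]
    escaped = False
    for value in flat:
        if escaped:
            stack[-1].append(value)  # literal value following ESC
            escaped = False
        elif value == ESC:
            escaped = True
        elif value == OPEN:
            stack.append([])
        elif value == CLOSE:
            if len(stack) == 1:
                raise ValueError("close marker without matching open marker")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(value)
    if escaped or len(stack) != 1:
        raise ValueError("truncated or unbalanced sequence")
    return stack[0]

example = [[0.1, math.nan, 0.25], [1.0, math.inf]]
flat = encode(example)
```

Because the three markers are exactly representable in binary floating point, the equality tests in the decoder are safe; a marker such as 0.1 would not be.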
> 
> For integers we could use maxint, maxint-1, and maxint-2 as SI, SO, and ESC;
> in most data, they ought to be reasonably rare to keep escaping at an
> acceptable level.  For bits, that would become extremely hard, but then I've
> not seen many bits in the wild, and we should perhaps think of just
> deprecating the type.
> 
> So... if we started from scratch, I'd probably argue a lot for my SI/SO/ESC
> proposal (or something else giving open/close/escape without going all XML
> with different elements or even attributes).
> 
> Since there's a world around us, I might find it in me to like embedded CSV or
> perhaps even not be appalled by embedded JSON.  A simple separator I'm
> not so wild about: It doesn't save much in terms of space against SI/SO/ESC,
> but it immediately precludes deeper nesting.  And we'd still need escaping, I
> claim.
> 
> What I'm fairly much against are schemes that would involve "<start-record>
> <length> <data>" -- if only for the theoretical reason that you need a full
> Turing machine to parse this kind of thing (as opposed to a pushdown
> automaton for SI/SO/ESC and a finite state machine for the separator).
> 
> Sorry for the long mail,
> 
>           Markus