Array element access in ADQL

Mon Jul 10 11:17:17 CEST 2017

Hi DAL,

On Tue, Jul 04, 2017 at 12:43:07PM +0100, Mark Taylor wrote:
>    Experimental support for array functions has been added to Gaia Archive
>    v1.3.0. Array types are supported as valid output formats. In addition,
>    some user defined functions have been defined for direct manipulation,
>    most notably (see Archive help):

While I'm usually wary of introducing new syntax, I believe allowing
the usual colref[index] syntax is where we should be going.  I
believe (but since I've not actually implemented it, it's likely I'm
overlooking something), I think we could (using the published
ADQL-2.0 grammar) just, in value_expression_primary, insert a

  | <array_element_reference>

(while it doesn't matter in theory, for several parser generators it
is preferable to insert it in front of <column_reference>) and then
write

<array_element_reference> ::=
  <column_reference> '[' <unsigned_integer> ']'

We could then, with experience, relax the constraints on the array
index.  I suspect most initial use cases would be covered by just an
integer literal, though (has anyone collected some use cases?).  I'd
not say no to <numeric_value_expression> even from the start, though.

My feeling would be that Gaia could simply go ahead and use that even
before ADQL-2.2 (or whatever will have this) is finalised -- the odds
that whatever the IVOA finally passes will be significantly different
from this are IMHO minute.

I'm afraid this will include FORTRAN array indexing ("OPTION BASE 1",
i.e., the first element has index 1).  Much as I (and obviously
Walter) despise this, doing this differently from all known engines
will not fix anything but just add confusion.

I'm much less enthusiastic about introducing array literals into ADQL
('{1,2,3}') -- what would these be used for?

I'd not declare array support using a language identifier
("ADQL-2.0-with-arrays").  Rather, I think we should introduce a
ivo://ivoa.net/std/TAPRegExt#features-arrays languageFeature ID.
While additional "features" could be defined (1D vs. nD?  not sure),
I think a simple

<languageFeature
  type="ivo://ivoa.net/std/TAPRegExt#features-arrays"/>

would then tell clients that you can use arrays.

With this, I don't think we'd need UDFs for element access, not even
for prototyping.  But if you still think you want something like:

>    GET_DOUBLE_ARRAY_ELEMENT(array,indexes): Returns the selected element
>    from the array of double precision values, where:
> 
>       - array [double]:
>            Input array
>       - indexes [string]:
>            String with the selected indexes with the format '[i][j]..'
> 
>    The syntax is functional, but should be considered a work in progress
>    until DR2.

...I'd say let's just have GET_ARRAY_ELEMENT(arr, i, j, k...).  SQL
has had polymorphic functions for a long, long time, and it's really
a shortcoming of TAPRegExt that it doesn't discuss that.

In TAPRegExt 1.0, people would probably declare several functions

GET_ARRAY_ELEMENT(DOUBLE PRECISION[] arr, INTEGER i)
GET_ARRAY_ELEMENT(DOUBLE PRECISION[] arr, INTEGER i, INTEGER j)
GET_ARRAY_ELEMENT(INTEGER[] arr, INTEGER i)

and so on.  I agree this is ugly and tedious.  I'd be ok with
changing TAPRegExt such that a single

GET_ARRAY_ELEMENT(ANY[] arr, INTEGER index, ...)

would be ok.  Opinions?

>    Extra functionality is needed Some suggestions to discuss include
>    the following:
> 
>       - get_length(array_column, any type), null for null input cell
>       - create_double_array(list_of_columns), autocasting

I'd say determining the length of an array is an important thing, but
I'm still not sure how useful it is to create arrays within ADQL,
either as literals (as above) or from expressions (as apparently
proposed here).

To get an idea of what a minimal array type should have in terms of
behaviour, I've checked sect. 22.1 (about p. 437) of the ECMA-262
standard (which is, I believe, the authority for Javascript and
appears to be pretty rigorously specified, so we might want to steal
from them/reference them when describing the semantics we want).
Creation, housekeeping, and methods I found obviously inappropriate
aside, here's what I've found as methods:

* length (sounds like a good idea to me, too; that way, people can
  build "index from the back" accesses, as in python's seq[-1], which
  in turn I don't think we want in ADQL in this style)
*`concat (doesn't seem to me in a query context)
* slice (I think that we should have; it also meshes nicely with
  aggregate functions as proposed below)
* every (true when some function returns true for all array items;
  given aggregate functions, I don't think we want to do things in
  this way in ADQL).  Similar considerations apply to filter (but see
  below), forEach, map, some.
* filter (select entries using a boolean function.  I could see uses
  for that ("AVG(arr, element>1)"), but pulling this off in a SQL
  query is a minor nightmare as far as I can see)
* find/findIndex (return an element or index where some condition is
  true; again, could be useful but nightmarish to do in a query)
* includes (true when an element is within an array.  I can
  definitely see where I may want to have something like this in
  contexts like RegTAP, but for general ADQL it's probably
  not a great gain)
* join (concatenate into a string; I'd rather generalize
  ivo_string_agg for that purpose; of course, that's an aggregate
  function)
* lastIndexOf (more or less findIndex from behind, so about the same
  considerations).
* reduce (that's as python's reduce -- I like it, but certainly far
  too complicated in SQL)
* sort (I think the arrays we're talking about have no business being
  sorted -- or do they? Anyway, we shouldn't have anything with side
  effects in ADQL)

What I'd *really* like to see is that we say that aggregate functions
should work on arrays as if they were 1-column relations.

I've not looked into how complex that is to get to work in postgres
(let alone other DB engines); at least natively, it doesn't let you
do that:

  postgres> select AVG('{1, 2}'::real[]);
  ERROR:  function avg(real[]) does not exist
  ZEILE 1: select AVG('{1, 2}'::real[]);
                  ^
Hm.  Clearly needs more thought.

        -- Markus