ADQL evolution: OFFSET?

Tue Jan 12 11:21:12 CET 2016

Dear DAL folks,

First off, thanks for the lively discussion.  Let me try to summarise
the issues based on the various replies.

On the question of whether OFFSET should be coupled to ORDER BY:
The standards effort is about the same; if we linked OFFSET to ORDER
BY, the ruleset from 

http://mail.ivoa.net/pipermail/dal/2016-January/007237.html

would change to

<offset_spec> ::= "OFFSET" <unsigned_integer>

as given there (plus possibly the ROW[S], which I'm not a
particularly big fan of) and 

<order_by_clause> ::= ORDER BY <sort_specification_list>
  [<offset_spec>]

In actual implementation, I expect it's even a bit simpler, as I
suspect few actual implementations will have left alone the
query_specification production, whereas the order_by_clause probably
is as in the spec for most of them.

Given that:

On Mon, Jan 11, 2016 at 03:30:14PM -0500, Tom McGlynn (NASA/GSFC Code 660.1) wrote:
> - So why do we want to have the OFFSET explicitly tied to an ORDER BY?

I'd say the main reason is that there's important RDBMSes for which
free-floating OFFSETs might be a problem (of course, implementations
could always insert an artificial ORDER BY, but that's certainly not
going to help implementability, at least as long as there are tables
without primary keys out there).

In terms of whether OFFSET makes sense without ORDER BY: I think
that's really a moot point since ADQL has allowed TOP without ORDER BY
since the beginning, which is exactly as questionable.  So, I'd claim
philosophy should, if at all, only be a secondary concern here.

> - Are there costs to requiring order by?
> 
> Order by's can dramatically slow down queries, so it's not entirely free.

Yes -- and since I suspect most RDMBSes would use on-disk structure
as a "default" sequence absent other constraints, you may in practice
get by without.  But that's leading up for my main remark, namely
that...

> Since some static databases actually do return rows in a consistent order
> (for a given query), this could be a performance hit.  And of course if the
> database is not static, then even with an order by in the query one is not
> guananteed to get consistent results since new rows may have been added or
> deleted.

...OFFSET is a classical case of an 80% solution: it's probably good
enough for doing some things in 80% of the situations and miserably
fails for the remaining 20% (which of course would require at least
80% of the work for a comprehensive solution).

For both pagination and harvesting (which are really fairly similar
use cases), the problems of sorting and temporal stability are there; I
happen to believe that in most cases we actually deal with, they are
acceptable.  In particular, any solution not suffering from such
stability issues introduces some sort of transaction (e.g., through
resumption tokens, persistent connections, or whatever) and therefore
server-side state.  That certainly cannot be done in ADQL in the
first place, and it's certainly going to a lot more painful in
implementation.

So, the question I think boils down to: Do we want such an 80%
solution?

> - The SQL 2008 standard construct is
> 
>    OFFSET start ROW[S]  (at least according to POSTGRESQL which supports
> many variants here)

Does anyone have a library (of the books-within-walls sort) at hand
that has the SQL 2008 standard and could look up the grammar there?
I agree with Tom that we should have a long hard look at what they
have to say:

> While ADQL is based upon an earlier version of the SQL standard, it seems
> gratuitously silly to not follow the agreed standard when we're adding
> something new.  It sets a tone that we don't care about other standards
> development.

Then,

On Mon, Jan 11, 2016 at 04:46:15PM +0100, François-Xavier Pineau wrote:
> We are not DBMS experts but we are slightly concerned by the possible
> consequences of the usage of
> OFFSET to harvest large tables, e.g a catalogue like 2MASS.

Yes, it is true that OFFSET lets users write slow queries -- but
almost all parts of ADQL have that property (I'd almost call it a
feature in the vein of the old saying "it doesn't keep you from doing
dumb things because that would also keep you from doing smart
things").

Harvesting, I think, will typically be done on tables like
ivoa.obscore.  Whether it's even necessary with TAP, where hard
MAXRECs of less than 1e6 are rare remains to be seen.  My feeling is
that it would be nice to be prepared.  The non-harvestability of SSAP
and SIAP, anyway, turned out to be a pain.

> >(2) It enables pagination UIs bolted on top of an ADQL service
> Except for large results (let's say > 100 000 rows for a web browser), the
> pagination can be made on the client side.

Well -- in principle.  But if implementors want responsive web
interfaces, that's not easy already for 1000 rows given transfer
times, latencies, and the speed of Javascript DOMs.  So, while I
agree there are real problems with the whole notion of pagination
when there's no sortable primary key, I think the pagination problem
will not go away by pointing to the client side.

> >(3) The OFFSET 0 hack that helps in several database systems to set a
> >     confused query planner right.
> May OFFSET 0 confuse the query planner?
> If not, I would vote for systematically add it on the server side (the user
> do not have to know about it, it is internal cuisine).

Well, "confuse" is a difficult word when it comes to machines.
But no, adding OFFSET 0 systematically certainly is not a good idea
because in general, it is of course highly beneficial if the query
planner performs a global optimisation of the query.  It's just in
special situations that that global optimization gets things
dramatically wrong.  Unfortunately, one of the situations tends to be
the common query pattern

  SELECT <tricky analysis>
    JOIN (SELECT TOP 5000 from bigtable) as q
  <tricky analysis>

i.e., you're trying your smart query on a small subset of a large
table first.  Query planners regularly perform badly in the presence
of query limits in the first place, and this kind of thing can
quickly get ugly (query times of hours instead of seconds).

Annotation for the planner is an important thing (I've told my users
to write SELECT ALL for a planner barrier so far, but I don't think
many have noticed).  OFFSET 0 may not even reach the 80% level in
that field, but it's helped me and others in a couple of pinches.

So -- it seems nobody wanted to keep OFFSET out of ADQL 2.1 really
hard.  Now, Dave, as the editor of ADQL 2.1, what's your stance:
Should OFFSET go in?  And if so, attached to query_specification or
to order_by_clause?  Should I put something into the text on volute?
Should we solicit more opinions?

Cheers,

           Markus