ADQL evolution: OFFSET?

Mon Jan 11 23:29:31 CET 2016

Hi Tom

> -----Original Message-----
> From: dal-bounces at ivoa.net [mailto:dal-bounces at ivoa.net] On Behalf Of Tom
> McGlynn (NASA/GSFC Code 660.1)
> Sent: Monday, January 11, 2016 3:30 PM
> To: dal at ivoa.net
> Subject: Re: ADQL evolution: OFFSET?
> 
> In this discussion it seems to be implicit than using an ORDER BY expression
> ensures that the rows will be returned in some defined order.  While that is a
> necessary condition it is not sufficient.  The value of the expression must be
> unique in each row to guarantee a fixed ordering.

Yes, that is true. A unique (set of) column(s) would need to be included.

> Given this below are a few cogitations on the idea of whether we should couple
> OFFSET and ORDER BY.
> 
> - So why do we want to have the OFFSET explicitly tied to an ORDER BY?
> 
> If we want to guarantee that we can do pagination correctly, this won't do it.
> If we want to enable users to do pagaination correctly they will still need to
> carefully pick the expression they use in the order by clause.  So it's not really
> much easier.
No, but it is still tied to an ORDER BY.

> 
> - Are there costs to requiring order by?
> 
> Order by's can dramatically slow down queries, so it's not entirely free.  Since
> some static databases actually do return rows in a consistent order (for a given
> query), this could be a performance hit.

Often depends on the content of the cache. If someone just ran a query that put some rows in cache, the DB may decide to start returning those while retrieving other rows from disk. But it is true I suppose that 'order by' a non-unique column can cause different orders.
I am curious to the use case for 'OFFSET number' without ORDER BY.

> And of course if the database is not static, then even with an order by in the
> query one is not guananteed to get consistent results since new rows may have
> been added or deleted.
> 
I suppose the only way in which we could be sure we'd get what we want, i.e. consistent, non-repeating pagination, is if we could open a cursor and fetch results from it. I am assuming a cursor to be transactionally stable.
I don't think this solution would work.

> - What about supporting OFFSET where the underlying database requires the
> OFFSET to be in an ORDER BY clause: Can we handle lone OFFSETs?
> 
> I'm not familiar with the exact syntax required, but perhaps something like:
>     ORDER BY 'x' OFFSET 100
> could be used if there is no existing ORDER BY clause in the query.  Any constant
> expression would do except that I believe some databases use simple integer
> constants to refer to a column number in the result.  But this would need to be
> tested.
> 
I just now tied this on CasJobs' SQLServer and the error message says: 'A constant expression was encountered in the ORDER BY list'
What was accepted was ' ORDER BY 0*<numerical column> OFFSET 10 ROWS'
> 
> - So my feeling is that we shouldn't go out of our way to couple OFFSET and
> ORDER BY.
> 
> - The SQL 2008 standard construct is
> 
>     OFFSET start ROW[S]  (at least according to POSTGRESQL which supports
> many variants here)
> 
Same for MS SQLServer.

> While ADQL is based upon an earlier version of the SQL standard, it seems
> gratuitously silly to not follow the agreed standard when we're adding
> something new.  It sets a tone that we don't care about other standards
> development.
> 
> - In the vein of the previous comment, we might also consider adding support
> for the standard
>     FETCH FIRST count ROWS ONLY
> as a standard if verbose synonym for TOP.    [I don't like the idea of
> adding support for LIMIT since it is another non-standard keyword, though since
> the HEASARC's TAP is running in Postgres it's easy enough for us to support.]
> 
For what it's worth, I agree.

But I would now like to see the reason for introducing this clause at all without ORDER BY.

Gerard

>      Tom McGlynn
> 
> François-Xavier Pineau wrote:
> > Dear Markus, dear all,
> >
> > Here after our (Laurent Michel and I) feelings about this.
> >
> > On 01/11/2016 11:31 AM, Markus Demleitner wrote:
> >> Dear DAL list,
> >>
> >> One of the users of DaCHS' ADQL engine requested the addition of
> >> OFFSET, and indeed I think there's at least three good reasons to do
> >> so:
> >>
> >> (1) It aids harvesting, which in particular with obscore is IMHO a
> >>      very nice thing to have
> > We are not DBMS experts but we are slightly concerned by the possible
> > consequences of the usage of OFFSET to harvest large tables, e.g a
> > catalogue like 2MASS.
> >
> > First point (not specific to large tables) underlined by Gerard: to
> > obtain consistent results, one have to use OFFSET together with ORDER
> > BY (at least in PostgresSQL), see PSQL doc here:
> > http://www.postgresql.org/docs/9.5/static/queries-limit.html.
> >
> > Then (specific to large tables), like it is also mentioned in the PSQL
> > doc:
> > "The rows skipped by an OFFSET clause still have to be computed inside
> > the server; therefore a large OFFSET might be inefficient."
> >
> > So using LIMIT, ORDER BY and OFFSET may not be the best way to harvest
> > a large table (huge server load), except (maybe) if ORDER BY is made
> > on an indexed column, without filtering rows (no 'WHERE'), and if the
> > table is clustered according to the indexed column.
> >
> >> (2) It enables pagination UIs bolted on top of an ADQL service
> > Except for large results (let's say > 100 000 rows for a web browser),
> > the pagination can be made on the client side.
> > And above 100 000 rows, you will not look at all values but probably
> > want to order then according to a given column or put additional or
> > more restrictive constraints.
> >> (3) The OFFSET 0 hack that helps in several database systems to set a
> >>      confused query planner right.
> > May OFFSET 0 confuse the query planner?
> > If not, I would vote for systematically add it on the server side (the
> > user do not have to know about it, it is internal cuisine).
> >
> > Having said this does not means that we are against its implementation
> > in ADQL.
> >
> > Cheers,
> >
> > François-Xavier Pineau (CDS) and Laurent Michel (OAS)
> >
> >>
> >> Given that, and given I've missed OFFSET even back in 2010 or so
> >> (when I wrote the example
> >> http://dc.g-vo.org/tap/examples#Crossmatchforaguidestar), I'm a bit
> >> surprised OFFSET hasn't been part of my TAP implementation notes.
> >>
> >> I don't remember why I dropped it -- perhaps because, equally
> >> surprisingly, OFFSET is not part of SQL92.  Nevertheless, I'd like to
> >> propose the addition of (at first optional, mandatory starting 3.0)
> >> support for OFFSET in ADQL 2.1.
> >>
> >> This would entail giving a new rule
> >>
> >>    <offset_spec> ::= "OFFSET" <unsigned_integer>
> >>
> >> and changing the query_specification production to
> >>
> >>    <query_specification> ::=
> >>      "SELECT" [ <set_quantifier> ] [ <set_limit> ]
> >>        <select_list> <table_expression> [ <offset_spec> ]
> >>
> >> -- which would correspond to the postgres syntax of offset.
> >>
> >> Since it's not part of SQL 92, a bit of prose would be required in
> >> the body of the spec, which I'd be happy to contribute if people
> >> don't protest.
> >>
> >> But: How does everyone else feel about this?  In particular: Is this
> >> an issue for non-postgres databases (and I'm mainly thinking of MySQL
> >> and Oracle here; SQLlite has the feature and even syntax as proposed
> >> here)?
> >>
> >> Cheers,
> >>
> >>            Markus
> >
> >