ADQL evolution: OFFSET?

Mon Jan 11 23:30:04 CET 2016

Hi Tom

> -----Original Message-----
> From: dal-bounces at ivoa.net [mailto:dal-bounces at ivoa.net] On Behalf Of 
> Tom McGlynn (NASA/GSFC Code 660.1)
> Sent: Monday, January 11, 2016 3:30 PM
> To: dal at ivoa.net
> Subject: Re: ADQL evolution: OFFSET?
> 
> In this discussion it seems to be implicit than using an ORDER BY 
> expression ensures that the rows will be returned in some defined 
> order.  While that is a necessary condition it is not sufficient.  The 
> value of the expression must be unique in each row to guarantee a fixed ordering.

Yes, that is true. A unique (set of) column(s) would need to be included.

> Given this below are a few cogitations on the idea of whether we 
> should couple OFFSET and ORDER BY.
> 
> - So why do we want to have the OFFSET explicitly tied to an ORDER BY?
> 
> If we want to guarantee that we can do pagination correctly, this won't do it.
> If we want to enable users to do pagaination correctly they will still 
> need to carefully pick the expression they use in the order by clause.  
> So it's not really much easier.
No, but it is still tied to an ORDER BY.

> 
> - Are there costs to requiring order by?
> 
> Order by's can dramatically slow down queries, so it's not entirely 
> free.  Since some static databases actually do return rows in a 
> consistent order (for a given query), this could be a performance hit.

Often depends on the content of the cache. If someone just ran a query that put some rows in cache, the DB may decide to start returning those while retrieving other rows from disk. But it is true I suppose that 'order by' a non-unique column can cause different orders.
I am curious to the use case for 'OFFSET number' without ORDER BY.

> And of course if the database is not static, then even with an order 
> by in the query one is not guananteed to get consistent results since 
> new rows may have been added or deleted.
> 
I suppose the only way in which we could be sure we'd get what we want, i.e. consistent, non-repeating pagination, is if we could open a cursor and fetch results from it. I am assuming a cursor to be transactionally stable.
I don't think this solution would work.

> - What about supporting OFFSET where the underlying database requires 
> the OFFSET to be in an ORDER BY clause: Can we handle lone OFFSETs?
> 
> I'm not familiar with the exact syntax required, but perhaps something like:
>     ORDER BY 'x' OFFSET 100
> could be used if there is no existing ORDER BY clause in the query.  
> Any constant expression would do except that I believe some databases 
> use simple integer constants to refer to a column number in the 
> result.  But this would need to be tested.
> 
I just now tied this on CasJobs' SQLServer and the error message says: 'A constant expression was encountered in the ORDER BY list'
What was accepted was ' ORDER BY 0*<numerical column> OFFSET 10 ROWS'
> 
> - So my feeling is that we shouldn't go out of our way to couple 
> OFFSET and ORDER BY.
> 
> - The SQL 2008 standard construct is
> 
>     OFFSET start ROW[S]  (at least according to POSTGRESQL which 
> supports many variants here)
> 
Same for MS SQLServer.

> While ADQL is based upon an earlier version of the SQL standard, it 
> seems gratuitously silly to not follow the agreed standard when we're 
> adding something new.  It sets a tone that we don't care about other 
> standards development.
> 
> - In the vein of the previous comment, we might also consider adding 
> support for the standard
>     FETCH FIRST count ROWS ONLY
> as a standard if verbose synonym for TOP.    [I don't like the idea of
> adding support for LIMIT since it is another non-standard keyword, 
> though since the HEASARC's TAP is running in Postgres it's easy enough 
> for us to support.]
> 
For what it's worth, I agree.

But I would now like to see the reason for introducing this clause at all without ORDER BY.

Gerard

>      Tom McGlynn
> 
> François-Xavier Pineau wrote:
> > Dear Markus, dear all,
> >
> > Here after our (Laurent Michel and I) feelings about this.
> >
> > On 01/11/2016 11:31 AM, Markus Demleitner wrote:
> >> Dear DAL list,
> >>
> >> One of the users of DaCHS' ADQL engine requested the addition of 
> >> OFFSET, and indeed I think there's at least three good reasons to 
> >> do
> >> so:
> >>
> >> (1) It aids harvesting, which in particular with obscore is IMHO a
> >>      very nice thing to have
> > We are not DBMS experts but we are slightly concerned by the 
> > possible consequences of the usage of OFFSET to harvest large 
> > tables, e.g a catalogue like 2MASS.
> >
> > First point (not specific to large tables) underlined by Gerard: to 
> > obtain consistent results, one have to use OFFSET together with 
> > ORDER BY (at least in PostgresSQL), see PSQL doc here:
> > http://www.postgresql.org/docs/9.5/static/queries-limit.html.
> >
> > Then (specific to large tables), like it is also mentioned in the 
> > PSQL
> > doc:
> > "The rows skipped by an OFFSET clause still have to be computed 
> > inside the server; therefore a large OFFSET might be inefficient."
> >
> > So using LIMIT, ORDER BY and OFFSET may not be the best way to 
> > harvest a large table (huge server load), except (maybe) if ORDER BY 
> > is made on an indexed column, without filtering rows (no 'WHERE'), 
> > and if the table is clustered according to the indexed column.
> >
> >> (2) It enables pagination UIs bolted on top of an ADQL service
> > Except for large results (let's say > 100 000 rows for a web 
> > browser), the pagination can be made on the client side.
> > And above 100 000 rows, you will not look at all values but probably 
> > want to order then according to a given column or put additional or 
> > more restrictive constraints.
> >> (3) The OFFSET 0 hack that helps in several database systems to set a
> >>      confused query planner right.
> > May OFFSET 0 confuse the query planner?
> > If not, I would vote for systematically add it on the server side 
> > (the user do not have to know about it, it is internal cuisine).
> >
> > Having said this does not means that we are against its 
> > implementation in ADQL.
> >
> > Cheers,
> >
> > François-Xavier Pineau (CDS) and Laurent Michel (OAS)
> >
> >>
> >> Given that, and given I've missed OFFSET even back in 2010 or so 
> >> (when I wrote the example 
> >> http://dc.g-vo.org/tap/examples#Crossmatchforaguidestar), I'm a bit 
> >> surprised OFFSET hasn't been part of my TAP implementation notes.
> >>
> >> I don't remember why I dropped it -- perhaps because, equally 
> >> surprisingly, OFFSET is not part of SQL92.  Nevertheless, I'd like 
> >> to propose the addition of (at first optional, mandatory starting 
> >> 3.0) support for OFFSET in ADQL 2.1.
> >>
> >> This would entail giving a new rule
> >>
> >>    <offset_spec> ::= "OFFSET" <unsigned_integer>
> >>
> >> and changing the query_specification production to
> >>
> >>    <query_specification> ::=
> >>      "SELECT" [ <set_quantifier> ] [ <set_limit> ]
> >>        <select_list> <table_expression> [ <offset_spec> ]
> >>
> >> -- which would correspond to the postgres syntax of offset.
> >>
> >> Since it's not part of SQL 92, a bit of prose would be required in 
> >> the body of the spec, which I'd be happy to contribute if people 
> >> don't protest.
> >>
> >> But: How does everyone else feel about this?  In particular: Is 
> >> this an issue for non-postgres databases (and I'm mainly thinking 
> >> of MySQL and Oracle here; SQLlite has the feature and even syntax 
> >> as proposed here)?
> >>
> >> Cheers,
> >>
> >>            Markus
> >
> >