ADQL evolution: OFFSET?

Mon Jan 11 21:30:14 CET 2016

In this discussion it seems to be implicit than using an ORDER BY 
expression ensures that the rows will be returned in some defined 
order.  While that is a necessary condition it is not sufficient.  The 
value of the expression must be unique in each row to guarantee a fixed 
ordering.
Given this below are a few cogitations on the idea of whether we should 
couple OFFSET and ORDER BY.

- So why do we want to have the OFFSET explicitly tied to an ORDER BY?

If we want to guarantee that we can do pagination correctly, this won't 
do it.
If we want to enable users to do pagaination correctly they will still 
need to carefully pick the expression they
use in the order by clause.  So it's not really much easier.

- Are there costs to requiring order by?

Order by's can dramatically slow down queries, so it's not entirely 
free.  Since some static databases actually do return rows in a 
consistent order (for a given query), this could be a performance hit.  
And of course if the database is not static, then even with an order by 
in the query one is not guananteed to get consistent results since new 
rows may have been added or deleted.

- What about supporting OFFSET where the underlying database requires 
the OFFSET to be in an ORDER BY clause: Can we handle lone OFFSETs?

I'm not familiar with the exact syntax required, but perhaps something like:
    ORDER BY 'x' OFFSET 100
could be used if there is no existing ORDER BY clause in the query.  Any 
constant expression would do except that I believe some databases use 
simple integer constants to refer to a column number in the result.  But 
this would need to be tested.

- So my feeling is that we shouldn't go out of our way to couple OFFSET 
and ORDER BY.

- The SQL 2008 standard construct is

    OFFSET start ROW[S]  (at least according to POSTGRESQL which 
supports many variants here)

While ADQL is based upon an earlier version of the SQL standard, it 
seems gratuitously silly to not follow the agreed standard when we're 
adding something new.  It sets a tone that we don't care about other 
standards development.

- In the vein of the previous comment, we might also consider adding 
support for the standard
    FETCH FIRST count ROWS ONLY
as a standard if verbose synonym for TOP.    [I don't like the idea of 
adding support for LIMIT since it is another non-standard
keyword, though since the HEASARC's TAP is running in Postgres it's easy 
enough for us to support.]

     Tom McGlynn

François-Xavier Pineau wrote:
> Dear Markus, dear all,
>
> Here after our (Laurent Michel and I) feelings about this.
>
> On 01/11/2016 11:31 AM, Markus Demleitner wrote:
>> Dear DAL list,
>>
>> One of the users of DaCHS' ADQL engine requested the addition of
>> OFFSET, and indeed I think there's at least three good reasons to do
>> so:
>>
>> (1) It aids harvesting, which in particular with obscore is IMHO a
>>      very nice thing to have
> We are not DBMS experts but we are slightly concerned by the possible 
> consequences of the usage of
> OFFSET to harvest large tables, e.g a catalogue like 2MASS.
>
> First point (not specific to large tables) underlined by Gerard: to 
> obtain consistent results, one have to use OFFSET together with
> ORDER BY (at least in PostgresSQL), see PSQL doc here: 
> http://www.postgresql.org/docs/9.5/static/queries-limit.html.
>
> Then (specific to large tables), like it is also mentioned in the PSQL 
> doc:
> "The rows skipped by an OFFSET clause still have to be computed inside 
> the server; therefore a large OFFSET might be inefficient."
>
> So using LIMIT, ORDER BY and OFFSET may not be the best way to harvest 
> a large table (huge server load),
> except (maybe) if ORDER BY is made on an indexed column, without 
> filtering rows (no 'WHERE'), and
> if the table is clustered according to the indexed column.
>
>> (2) It enables pagination UIs bolted on top of an ADQL service
> Except for large results (let's say > 100 000 rows for a web browser), 
> the pagination can be made on the client side.
> And above 100 000 rows, you will not look at all values but probably 
> want to order then according to a given column
> or put additional or more restrictive constraints.
>> (3) The OFFSET 0 hack that helps in several database systems to set a
>>      confused query planner right.
> May OFFSET 0 confuse the query planner?
> If not, I would vote for systematically add it on the server side (the 
> user do not have to know about it, it is internal cuisine).
>
> Having said this does not means that we are against its implementation 
> in ADQL.
>
> Cheers,
>
> François-Xavier Pineau (CDS) and Laurent Michel (OAS)
>
>>
>> Given that, and given I've missed OFFSET even back in 2010 or so
>> (when I wrote the example
>> http://dc.g-vo.org/tap/examples#Crossmatchforaguidestar), I'm a bit
>> surprised OFFSET hasn't been part of my TAP implementation notes.
>>
>> I don't remember why I dropped it -- perhaps because, equally
>> surprisingly, OFFSET is not part of SQL92.  Nevertheless, I'd like to
>> propose the addition of (at first optional, mandatory starting 3.0)
>> support for OFFSET in ADQL 2.1.
>>
>> This would entail giving a new rule
>>
>>    <offset_spec> ::= "OFFSET" <unsigned_integer>
>>
>> and changing the query_specification production to
>>
>>    <query_specification> ::=
>>      "SELECT" [ <set_quantifier> ] [ <set_limit> ]
>>        <select_list> <table_expression> [ <offset_spec> ]
>>
>> -- which would correspond to the postgres syntax of offset.
>>
>> Since it's not part of SQL 92, a bit of prose would be required in
>> the body of the spec, which I'd be happy to contribute if people
>> don't protest.
>>
>> But: How does everyone else feel about this?  In particular: Is this
>> an issue for non-postgres databases (and I'm mainly thinking of MySQL
>> and Oracle here; SQLlite has the feature and even syntax as proposed
>> here)?
>>
>> Cheers,
>>
>>            Markus
>
>