ADQL 2.1: Preferred crossmatch syntax

Wed Nov 8 13:38:41 CET 2017

Hi Pat,

The issue I see here is that when you put the DISTANCE condition into the ON part of JOIN you always join with a specific catalog. This is probably fine if you only want to support cross-matches with clear distance cuts. (But again, conside that ON criteria can contain boolean combinations of other criteria other than DISTANCE which increases the change of writing queries that are very hard to execute.) As soon as you want to take astrometric error into account, the whole problem becomes more complicated and clear cut distances won't be enough to specify the cross-match.

After all, at minimum, what you have to be able to describe in a cross-match query are:
- the list of tables and the columns that form the basis of the matching (coordinates, astrometric error, etc.)
- the cross-match algorithm to use
- the conditions to accept a match

If you specify the DISTANCE in the ON part, then you basically define a graph of catalogs but there's too many ways to do this. For instance, you can prescribe the distance cut for all pairs of catalogs, just the necessary number of pairs, etc. You can also come up with two disjoint cross-match problems in the same query and then join the two on some other column (ID, for instance). These are all complications that make query evalution (or just validation) very hard.

On the other hand, if you introduce some xmatch syntax that limits these possibilites but support basically all scientificly relevant scenarios then you can win a lot because it will be much easier to validate and evaluate queries. This is why our solution was to use something that's not restricted at all to style of standard SQL but is an extension that much better suits our needs.

Other than the minimum list above, I can imagine these things for a future implementation:
- go beyond distance-based cross-matching (this was a must for us because we implemented the Bayesian algorithm)
- allow outer joins, which we knows is a combinatorically exploding problem so some restrictions should be made
- region of interest to limit cross-match to a given area of the sky
- specify footprints of the catalogs so that drop-out detection could be made
- support extended object matching (ie. find galaxies in x-ray clusters)

I understand that these are all beyond the scope of the current adql standard but with a smart design a simpler syntax would be easier to extend and adapt to support future features. Figuring out a road-map toward the more complex features doesn't cost much, whereas fitting new things into a bad syntax is quite hard. SQL is actually a bad example because it became quite convoluted due to the various ad-hoc syntax extensions over the year. C# had a much better approach, it had a road-map and design of future syntax elements from the beginning and they're adding these new language features as the implementation progresses.

-Laszlo

-----Original Message-----
From: Patrick Dowler [mailto:pdowler.cadc at gmail.com] 
Sent: Wednesday, November 8, 2017 12:56 AM
To: Dobos, Laszlo <dobos at complex.elte.hu>
Cc: <dal at ivoa.net> <dal at ivoa.net>; Tamas Budavari <budavari at jhu.edu>
Subject: Re: ADQL 2.1: Preferred crossmatch syntax

Laszlo,

I'm not a big fan of using DISTANCE for this, but I am in favour of putting it in the JOIN rather than the WHERE clause. Can you explain which exactly is causing the join order to be prescribed? I certainly don't see an obvious way to translate this into efficient SQL that uses an index, but it looks to me like the use of DISTANCE is making that hard, not the location in the JOIN clause.

Pat

On 7 November 2017 at 05:06, Francois-Xavier Pineau <francois-xavier.pineau at astro.unistra.fr> wrote:
> Laszlo, DAL,
>
> Before a more complete answer, just one question: with your notation, 
> is it possible to ask for all sources of table1 matching or not the 
> result of a cross-match between table 2 and table t3?
>
> I.e. it is possible to ask for
>
> t1 LEFT JOIN (t2 INNER JOIN t3) or, equivalently t2 INNER JOIN t3 
> RIGHT OUTER JOIN t3
>
> e.g. using a sub-crossmatch in a IN without having to create a 
> (temporary) table?
>
>
> fx
>
>
>
> Le 04/11/2017 à 19:30, Dobos, Laszlo a écrit :
>>
>> Hi guys,
>>
>> I'm not on the DAL list but Tamas Budavari forwarded the mail to me 
>> so let me chip in. In the new skyquery, we put the cross-match 
>> definition into the FROM clause but not as a list of joined tables 
>> but rather something like a table-valued function. Putting it into 
>> the FROM clause is much better than in the WHERE clause, not because 
>> it's much easier to parse but because if it's in the WHERE clause, 
>> it's quite hard to sort the other conditions out if there's complex 
>> boolean algebra allowed in the clause. If xmatching has some special, 
>> custom syntax, then the possibility of writing queries that cannot be 
>> executed can be minimized. Also, putting it into the xmatch clause 
>> makes query rewriting much easier because you simply replace the 
>> xmatch part with the output of the xmatch operation and leave all 
>> other joins in place to execute the final query.
>>
>> Here's an example:
>>
>> SELECT s.objid, s.ra, s.dec, g.objid, g.ra, g.dec, x.ra, x.dec INTO 
>> twowayxmatch FROM XMATCH
>>       (MUST EXIST IN SDSSDR7:PhotoObjAll AS s WITH(POINT(s.ra, 
>> s.dec), ERROR(0.1, 0.1, 0.1)),
>>        MUST EXIST IN GALEXGR6:PhotoObjAll AS g WITH(POINT(g.ra, 
>> g.dec), ERROR(0.2, 0.2, 0.2)),
>>        LIMIT BAYESFACTOR TO 1e3) AS x WHERE s.ra BETWEEN 0 AND 5 AND 
>> s.dec BETWEEN 0 AND 5
>>      AND g.ra BETWEEN 0 AND 5 AND g.dec BETWEEN 0 AND 5
>>
>> The MUST EXIST will soon be extended to MAY EXIST and MUST NOT EXIST 
>> (drop-outs, but for those, you need the footprints). This syntax is 
>> quite extensible, because you can add more catalog-specific options 
>> to the
>> WITH()
>> part. For instance, we could support extended source matching by 
>> replacing POINT with ELLIPSE, etc. LIMIT BAYESFACTOR could be 
>> replaced with LIMIT SEPARATION etc. to use a different algorithm.
>>
>> We also support region-based filtering with the REGION clause. This 
>> is intentionally not on defined on a per-table basis to avoid query 
>> plan complexities.
>>
>> SELECT s.objid, s.ra, s.dec, g.objid, g.ra, g.dec, x.ra, x.dec INTO 
>> twowayxmatch FROM XMATCH
>>       (MUST EXIST IN SDSSDR7:PhotoObjAll AS s WITH(POINT(s.ra, 
>> s.dec), ERROR(0.1, 0.1, 0.1)),
>>        MUST EXIST IN GALEXGR6:PhotoObjAll AS g WITH(POINT(g.ra, 
>> g.dec), ERROR(0.2, 0.2, 0.2)),
>>        LIMIT BAYESFACTOR TO 1e3) AS x REGION 'CIRCLE J2000 0 0 5'
>>
>> You can try it here: http://test.voservices.net/skyquery . It doesn't 
>> support adql yet but I'm working on that.
>>
>> -Laszlo
>>
>> -----Original Message-----
>> From: dal-bounces at ivoa.net [mailto:dal-bounces at ivoa.net] On Behalf Of 
>> Mark Taylor
>> Sent: Friday, November 3, 2017 5:27 PM
>> To: Francois-Xavier PINEAU <francois-xavier.pineau at astro.unistra.fr>
>> Cc: dal at ivoa.net
>> Subject: Re: ADQL 2.1: Preferred crossmatch syntax
>>
>>
>> On Fri, 3 Nov 2017, Francois-Xavier PINEAU wrote:
>>
>>> Dear Mark and DAL,
>>>
>>> Although I agree with the approach, there is one point that bothers me.
>>>
>>> I know that not everyone agrees on the matter, but I personally 
>>> prefer to declare the cross-match condition in the JOIN (like in 
>>> your InteropOct2015DAL
>>> presentation) rather than in the WHERE.
>>>
>>> In fact, in relational algebra, I prefer to see the cross-match as a 
>>> theta-join, not as a selection on the result of a cross-product.
>>
>> OK, I don't really have an opinion on that, I never really know about 
>> what syntax is best for joins, or whether it makes a difference.
>> [And I'm afraid I wouldn't know a theta-join if one trod on my foot]
>>
>> Is it likely to make a difference in performance terms which of these 
>> two ways it's done?  I.e. do we need to specify here what's the 
>> preferred way to use a constraint like DISTANCE()<threshold, or is 
>> this something we can harmlessly leave to the taste of the query 
>> author?
>>
>> Mark
>>
>> --
>> Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
>> m.b.taylor at bris.ac.uk +44-117-9288776  
>> http://www.star.bris.ac.uk/~mbt/
>
>

--
Patrick Dowler
Canadian Astronomy Data Centre
Victoria, BC, Canada