ADQL 2.1: Preferred crossmatch syntax

Dobos, László dobos at complex.elte.hu
Mon Nov 6 19:30:27 CET 2017


If the condition is defined this way (either in the ON part of the JOIN or in the WHERE clause) you lose the commutativity and associativity of the cross-match operation and prescibe an order in which the cross-matching should be performed. This also prevents a bunch of optimization options. For example, cross-matching must be started with the smallest catalog to avoid carrying too much data around between the steps that won't end up in the resultset.

-Laszlo

-----Original Message-----
From: dal-bounces at ivoa.net [mailto:dal-bounces at ivoa.net] On Behalf Of Markus Demleitner
Sent: Monday, November 6, 2017 9:53 AM
To: dal at ivoa.net
Subject: Re: ADQL 2.1: Preferred crossmatch syntax

Hi DAL,

On Fri, Nov 03, 2017 at 09:26:59PM +0000, Mark Taylor wrote:
> On Fri, 3 Nov 2017, Francois-Xavier PINEAU wrote:
> > Although I agree with the approach, there is one point that bothers me.
> > 
> > I know that not everyone agrees on the matter, but I personally 
> > prefer to declare the cross-match condition in the JOIN (like in 
> > your InteropOct2015DAL presentation) rather than in the WHERE.
> >
> > In fact, in relational algebra, I prefer to see the cross-match as a 
> > theta-join, not as a selection on the result of a cross-product.

+1 to FX' proposal; it's a good thing to be explicit on what's a join
condition and what's a constraint on the resulting relation.

Compare:

SELECT
  some_catalog AS t
  JOIN gaia AS g
  ON (
    DISTANCE(t.pos, g.pos)<1/3600.
    AND ABS(t.vmag-g.phot_mag_G_mean)<0.1)
WHERE
  parallax>10

(for some fictional mapping of the data that has ra and dec in
points) with

SELECT
  some_catalog AS t,
  gaia AS g
WHERE
  DISTANCE(t.pos, g.pos)<1/3600.
  AND ABS(t.vmag-g.phot_mag_G_mean)<0.1)
  AND parallax>10

-- in the second case, the actual physics ("nearby stars") is harder to discern, as is what the actual match criteria are.

Now scale that to four tables and you'll see one good reason for FX'
preference.

> Is it likely to make a difference in performance terms which of these 
> two ways it's done?  I.e. do we need to specify here what's the 
> preferred way to use a constraint like DISTANCE()<threshold, or is 
> this something we can harmlessly leave to the taste of the query 
> author?

In principle, the query planner should even it all out.

In practice, it's hard to predict.  But when we say "join conditions should sit next to the join clause (and not in some WHERE clause possibly lexically far remote)", that can help implementors when tuning their query planners' parameters.

            -- Markus



More information about the dal mailing list