ADQL XMATCH

Patrick Dowler pdowler.cadc at gmail.com
Mon Apr 11 19:57:16 CEST 2016


TL;DR - I think that we should redefine all the geometry functions
without coord sys now and (since overloading seems to be OK) we can
keep the old deprecated ones if we have to. Then the 2-arg DISTANCE
function with point args is my preferred solution. I don't see this
strictly as syntactic sugar to be used instead of a crafty CONTAINS
(equiv as a predicate) because the user can also add DISTANCE(...) to
the select list.

Long version:

While I agree that something like DISTANCE is preferable to XMATCH
because it correctly conveys exactly what is going on, I don't like
the 4-arg version because it foils implementations that have spherical
geometry indexing, or at least makes them really messy with new
failure modes:

We have several catalogues in our TAP service with the coordinates in
a column described with xtype="adql:POINT" (let's ignore the details of
the adql prefix for now). If the query on those tables uses that
column, the relevant indexing comes into play. It is true that the
tables also have separate RA and DEC columns and in principle I could
detect DISTANCE(RA, DEC, uploaded.c1, uploaded.c2) and replace RA, DEC
with the POS column, but what do I do if:

- query refers to the wrong columns in the table (e.g. DISTANCE(foo,
bar, uploaded.c1, uploaded.c2))
- query just gets them in the wrong order (e.g. DISTANCE(DEC, RA,
uploaded.c1, uploaded.c2))

I would be inclined to have the job fail rather than run it. It makes
me wonder why we would make the user put the two coordinates together
(and possibly make mistakes) when I have already put them together
correctly for them.
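
To make the two failure cases concrete, here is a minimal sketch of the
check a service would need for the 4-arg form. Everything in it is an
assumption for illustration: the POINT_COLUMNS mapping, the function name,
and the choice to raise an error are hypothetical and not part of any TAP
implementation.

```python
# Hypothetical sketch; names and structure are assumptions, not a real TAP API.

# Maps a (longitude, latitude) column pair to the indexed point column the
# service built from it. Column names follow the example in the text.
POINT_COLUMNS = {("RA", "DEC"): "POS"}


def rewrite_four_arg_distance(c1, c2, c3, c4):
    """Turn DISTANCE(c1, c2, c3, c4) into a 2-arg call on the indexed column.

    Raises ValueError when the first pair is not a known (lon, lat) pair,
    which covers both the wrong-columns and the swapped-order cases.
    """
    key = (c1.upper(), c2.upper())
    if key in POINT_COLUMNS:
        return f"DISTANCE({POINT_COLUMNS[key]}, POINT({c3}, {c4}))"
    if (key[1], key[0]) in POINT_COLUMNS:
        raise ValueError(f"coordinates in wrong order: {c1}, {c2}")
    raise ValueError(f"not a known coordinate pair: {c1}, {c2}")
```

Only the exact (RA, DEC) pair can be mapped back to the indexed column;
every other input ends in a new failure mode that the 2-arg form never
has to handle.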

On the other hand, if a service has a table with just the RA and DEC
columns they can still advertise in the TAP_SCHEMA that they have a
POS column, and then when they see DISTANCE(POS, ...) they can easily
replace POS with whatever references to RA and DEC are correct and
optimal. It is always easier to expand a single symbol into the
internal implementation than to go the other way. Sure, upload tables
may have a column with point(s) or separate columns with coordinates,
so with DISTANCE(<point>, <point>) one would typically write

DISTANCE(POS, POINT(uploaded.c1, uploaded.c2))
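
Going the easy direction, from a single symbol to the internal
implementation, could look roughly like the sketch below. The
POINT_EXPANSIONS mapping and the function name are hypothetical, and the
regex substitution is a deliberate simplification: a real service would
rewrite the parsed query, since plain text replacement also touches string
literals and table-qualified names.

```python
import re

# Hypothetical mapping from an advertised point column to the expression the
# backend actually evaluates; both names are assumptions for illustration.
POINT_EXPANSIONS = {"POS": "POINT(RA, DEC)"}


def expand_point_columns(query):
    """Replace each advertised point column with its internal expression."""
    def substitute(match):
        # Look up the matched column name case-insensitively.
        return POINT_EXPANSIONS[match.group(0).upper()]

    # Word boundaries keep POS from matching inside POINT or POSITION.
    pattern = r"\b(?:" + "|".join(map(re.escape, POINT_EXPANSIONS)) + r")\b"
    return re.sub(pattern, substitute, query, flags=re.IGNORECASE)
```

There is nothing to validate here: the service chose the expansion itself,
so the user cannot hand it the wrong columns or the wrong order.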

A 2-arg DISTANCE function and services declaring a POS column (maybe
instead of RA and DEC) are adding value and making it easier for the
user. A 4-arg DISTANCE function makes adding value impossible and
introduces ways to make essentially incorrect queries (admittedly,
there are plenty of ways to do that already :-).

So, I am a fan of the 2-arg DISTANCE but not of the { } syntax, which
strikes me as non-SQL. In PG, for example, you can write geometry in
internal syntax like that but (i) it has to be a string and (ii) you
almost always have to provide a cast to get the value you want. Worst
case is that users have to write DISTANCE(POINT(c1, c2), POINT(c3,
c4)) if the service/implementation doesn't provide the added value
necessary.
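
As an illustration of the 2-arg shape only, here is a small sketch of a
distance function that takes two point values and returns their
great-circle separation in degrees. The Point type and the use of the
haversine formula are assumptions made for the example; this is not how any
particular service computes DISTANCE.

```python
import math
from typing import NamedTuple


class Point(NamedTuple):
    """Hypothetical point value: longitude and latitude in degrees."""
    lon: float
    lat: float


def distance(p1, p2):
    """Great-circle separation in degrees between two points (haversine)."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p1.lon, p1.lat, p2.lon, p2.lat))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))
```

Because each argument is already a point, the caller cannot mix a
longitude from one position with a latitude from another.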

PS - Yes, I mean exactly that POINT function with 2 args. We already
realised a long time ago that including the coordinate system in the
functions was a huge mistake and since then we have been working to
remove that (e.g. SIA-2.0 and DALI-1.1 do not include it and DALI
defines point exactly like above, and the next TAP revision will be
consistent with that). I personally think that we should just redefine
all the geometry functions without coord sys now and (since
overloading seems to be OK) we can keep the old deprecated ones if we
have to.



-- 
Patrick Dowler
Canadian Astronomy Data Centre
Victoria, BC, Canada

