ADQL XMATCH

Laszlo Dobos dobos at complex.elte.hu
Wed Feb 10 16:26:34 CET 2016


Hi everyone,

It turned out during implementing Tamas' Bayesian algorithm that the best
syntax depends on a lot of factors. For example, many shortcuts can be made
if there's enough meta-data to figure out the existing spatial indices, etc.
One of my goals was to seamlessly integrate the cross-match syntax into SQL
so if necessary, even an INNER JOIN can be added to the query. Below is the
final syntax that we will go on-line with. The idea is that the XMATCH part
is basically a table-valued function (though it isn't implemented as a TFV
in practice) which is somewhat special in the sence that it returns the
match table but at the same the rest of the query can reference the columns
of the catalog tables too. The queries below are taken from my unit tests
and are not necessarily meaningful.

SELECT x.matchID,
       a.objID, a.ra, a.dec, a.g, a.r, a.i, 
       b.cntr, b.ra, b.dec, b.w1mpro, b.w2mpro, b.w3mpro
INTO [$into]
FROM XMATCH(
    MUST EXIST IN TEST:SDSSDR7PhotoObjAll AS a WITH(POINT(ra, dec, cx, cy,
cz), ERROR(0.1, 0.1, 0.1), ZONEID(zoneID)),
    MUST EXIST IN TEST:WISEPhotoObj AS b WITH(POINT(ra, dec, cx, cy, cz),
ERROR(0.1, 0.1, 0.1), ZONEID(zoneID)),
    LIMIT BAYESFACTOR TO 1e3
) AS x
WHERE a.ra BETWEEN 0 AND 1 AND
      b.ra BETWEEN 0 AND 1

The WITH part is taken from standard SQL for table hints. It is currently
used to specify the columns to be used by the algorithm but once we fill in
all the meta-data writing out the WITH part will be optional. Another
advantage of having detailed meta-data is that we can find the best
available representation of the coordinates (Cartesian vs. spherical),
figure out if there's and HTM index etc. The LIMIT BAYESFACTOR TO 1000 part
determines the algorithm used. We will probably add support for simple
distance cuts, extended objects (e.g. match HII regions with galaxies) and
covariance in errors.

Further filtering of the matches can be done in the WHERE clause, this is
where additional matching criteria can be added, for example based on the
SED. Table x contains th eBayes factor so a multi-level Bayesian hierarchy
can also be evaluated, though it's never going to be super simple syntax. We
also support a region syntax to limit the match to a certain area. This will
also be used to specify the survey footprints necessary for drop-out
detection:

SELECT x.matchID,
       a.objID, a.ra, a.dec, a.g, a.r, a.i, 
       b.cntr, b.ra, b.dec, b.w1mpro, b.w2mpro, b.w3mpro
INTO [$into]
FROM XMATCH(
    MUST EXIST IN TEST:SDSSDR7PhotoObjAll AS a WITH(POINT(ra, dec, cx, cy,
cz), ERROR(0.1, 0.1, 0.1), HTMID(htmid), ZONEID(zoneID)),
    MUST EXIST IN TEST:WISEPhotoObj AS b WITH(POINT(ra, dec, cx, cy, cz),
ERROR(0.1, 0.1, 0.1), HTMID(htmid), ZONEID(zoneID)),
    LIMIT BAYESFACTOR TO 1e3
) AS x
REGION 'CIRCLE J2000 0 0 10'

Note, that the region is intentionally constant, otherwise it would be
insanely complicated to come up with an execution plan. The idea is to allow
a URL in place of the region string and fetch the footprints from a web
service, or take it entirely from metadata. Also, I've encountered cases
when people wanted to get a flag that tells if the matched object is closer
to the footprint boundary than the search radius as it may indicate
incomplete matching.

Here's an example of a xmatch combined with an inner join to find spectro
objects:

SELECT x.MatchID, x.ra, x.dec, rr.objID
INTO [$into]
FROM XMATCH(
    MUST EXIST IN TEST:SDSSDR7PhotoObjAll AS a WITH(POINT(ra, dec, cx, cy,
cz), ERROR(0.1), HTMID(htmid)),
    MUST EXIST IN TEST:WISEPhotoObj AS b WITH(POINT(ra, dec, cx, cy, cz),
ERROR(0.1), HTMID(htmid)),
    MUST EXIST IN TEST:GalexPhotoObjAll AS c WITH(POINT(ra, dec, cx, cy,
cz), ERROR(0.2), HTMID(htmid)),
    LIMIT BAYESFACTOR TO 1e3) AS x
    INNER JOIN TEST:SDSSDR7PhotoObjAll_NoHtm rr ON rr.objid = a.objid
REGION 'CIRCLE J2000 0 0 10'

I agree that writing these queries is a bit overkill for the general public
but there's also plans to build a drag-and-drop UI on top of the web
services, similar in style to the openskyquery web page.

Is anyone coming to the BIDS conference this March? We could have dinner
together to discuss the syntax issues.

-Laszlo

-----Original Message-----
From: Juan Gonzalez [mailto:juan.gonzalez at sciops.esa.int] 
Sent: Wednesday, February 10, 2016 2:50 PM
To: dal at ivoa.net
Cc: szalay at jhu.edu; budavari at jhu.edu; dobos at complex.elte.hu; Jesus Salgado
<jesus.salgado at sciops.esa.int>; Bruno Merin <bmerin at sciops.esa.int>; William
O'Mullane <womullan at sciops.esa.int>
Subject: Re: ADQL XMATCH

Dear DAL WG,

In the context of the Gaia archive we are also evaluating this kind of
extensions for integrating crossmatch in ADQL. Our view is that we shall
concentrate the syntax extensions on what is common to all possible
crossmatch computations, and leave the peculiarities of what is implemented
on each different infrastructure to a set of different functions. It is very
likely that we will have different crossmatch functionalities on each
service, and ADQL syntax shall not constraint these of their future
evolution. What we are seeing in this interesting discussion thread seems
like the best prove of this.

Our current view goes in this direction (with create table statements,
supported by services like ours with persistent storage):

CREATE TABLE g10_gal_tycho2_1s AS

   CROSSMATCH g10_gal as g WITH tycho2 as t
     ESTIMATOR spherical_distance (g, t) as distance
   ON distance < 1/3600

This is, to include:

-A set of ESTIMATOR clauses that define the likelihood estimators, etc. 
that provide accuracy estimations based in existing functions in that
service DB.
-An ON clause that sets the cut criteria based in the estimators/measures
computed by the previous clause.

Depending on the computational implementation the service does, some of this
may be mandatory; for instance, if computation is based on a first
positional cut, as in our service, one distance measure shall be mandatory.
Implemented functions can be declared in the Registry record for the
service, so that apps are aware of them.

We have been recently discussing this with Tamas, that has referred us to
the work that you Alex and Laszlo have done as part of SkyQuery. Your
service has tackled crossmatching syntax in the most comprehensive way, and
I strongly believe in Alex statement that we shall think about how to adapt
a powerful syntax to more simple use cases, and not the other way around.

This syntax may be extended as follows, incorporating some additional
estimators as well as other ideas from SkyQuery:

CREATE TABLE g10_gal_tycho2_1s AS

   CROSSMATCH tycho2 as t WITH g10_gal as g

     ESTIMATOR spherical_distance(g, t) as distance,
              ABS(ABS(g.g_mag)-ABS(t.g_mag)) as mag_diff,
          bayesian_likelihood(g, t) as likelihood,
          position_estimate(g, t) as pos_est

   ON distance < 1/3600 AND likelihood > 0.5


Best Regards,
Juan


On 02/10/2016 10:14 AM, Mark Taylor wrote:
> On Tue, 9 Feb 2016, Tom McGlynn (NASA/GSFC Code 660.1) wrote:
>
>> If I understand it, the xmatch function proposed here will return a 1 
>> or 0 based purely upon two positions and a radius.  Presumably the 
>> function returns
>> 1 if the two positions are within the specified radius of each other 
>> and 0 otherwise, but maybe something else has been discussed.  No 
>> other information is used in the xmatch.
> yes.
>
>> In that context I would prefer to use four real variables so that 
>> coordinate system is irrelevant. If there are functions that can 
>> create point objects from coordinates or get the coordinates from 
>> point objects, then the two approaches are equivalent in the 
>> functionality they provide to users, but using reals is simpler to 
>> implement since we can simply assume that whatever the coordinate system
is, all four of the values are using the same one.
> agree.
>
>> However I think this overall approach is flawed.  If we want to 
>> create a logical function then we should do that.  In our 
>> implementations of existing geometry functions at the HEASARC we've 
>> found that Postgres query  optimizer is confused by this idiom where we
use a
>>       function() = integer
>> substitution for a logical value.  If you really want logical values, 
>> then I think we should just implement xmatch that way.  Of course 
>> given that we've already implemented other functions this way that's 
>> probably a boat that's already sailed.
> A logical function certainly makes more sense here; 1=XMATCH(..) is 
> clunky and unintuitive.  However, as I understand it, there is no 
> logical type defined in ADQL, so it's not possible to define a new 
> function like that without significant changes to the ADQL syntax.
>
>> However if the xmatch function is doing what I indicated above, I 
>> think the whole function is superfluous.
>>
>> Rather than
>>      (xmatch(ra1,dec1,ra2,dec2,rad) = 1) or if we use a logical value
>>     (xmatch(ra1,dec1,ra2,dec2,rad))
>>
>> it seems far more natural to use
>>      (distance(ra1,dec1,ra2,dec2) < rad)
>>
>> This is clear and easily implemented.  In our experience it can be 
>> translated into functions that can take advantage of spatial indices.
>  From a user point of view, I think that would be absolutely fine; in 
> fact as you say better than a dedicated XMATCH function because it's 
> more transparent and more flexible.  I was under the impression that 
> constraints written like that were difficult for TAP implementors to 
> use in a way that led to an efficient crossmatch, and that the 
> 1=CONTAINS(POINT,CIRCLE) business was the recommended way to specify a 
> performant crossmatch in ADQL.
> However, I don't know anywhere that's written down in a standard, and 
> maybe I'm just wrong about it.  I'm not at all knowledgeable about the 
> DB end of this, so I'm largely in the dark about what makes sense here 
> from an implementation point of view.
>
> My interest is that I want to be able to write example ADQL queries 
> and provide documentation to my ADQL-using users that tell them how to 
> perform a spatial crossmatch on the sky, without too much ugly syntax.
>
> Mark
>
> --
> Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
> m.b.taylor at bris.ac.uk +44-117-9288776  
> http://www.star.bris.ac.uk/~mbt/


-- 
Juan Gonzalez			    juan.gonzalez at sciops.esa.int
ESAC Science Archive Team
European Space Agency (ESA) - SERCO

European Space Astronomy Centre (ESAC)
28691 Villanueva de la Cañada		   Tel: +34 91 813 14 82
P.O. Box 78, Madrid, SPAIN		   Fax: +34 91 813 13 22
----------------------------------------------------------------


This message and any attachments are intended for the use of the addressee
or addressees only.
The unauthorised disclosure, use, dissemination or copying (either in whole
or in part) of its content is not permitted.
If you received this message in error, please notify the sender and delete
it from your system.
Emails can be altered and their integrity cannot be guaranteed by the
sender.

Please consider the environment before printing this email.




More information about the dal mailing list