ADQL 2.1: Preferred crossmatch syntax

Mark Taylor m.b.taylor at bristol.ac.uk
Thu Nov 2 13:26:43 CET 2017


Dear DAL,

as discussed briefly in Chile, and also at previous interops etc
(see e.g. http://mail.ivoa.net/pipermail/dal/2016-February/007331.html
and its context) I would like to see a preferred/recommended syntax
for positional sky crossmatching in ADQL.

I have drafted a section for possible inclusion in the ADQL 2.1 draft.
Here it is:

   \subsection{Preferred Crossmatch Syntax}

   An especially common operation that astronomers require when working
   with source catalogues is the positional sky crossmatch.
   In its simplest form this is a join between two tables with the
   requirement that the distance along a great circle between the
   sky positions of the two associated rows is less than or equal to
   a given threshold.

   The geometrical operations provided by ADQL offer a number of
   semantically equivalent ways to specify such a condition in
   the WHERE clause,
   for instance using various combinations of POINT, CIRCLE and DISTANCE.
   While a correct implementation will generate the same result for
   any of these specifications, the performance characteristics may
   differ dramatically depending on implementation.
   Given this, it is difficult for (human or machine) ADQL authors
   to know how to phrase an crossmatch with the expectation that it
   will be executed efficiently, and difficult for services to know
   which forms of query to optimise.
   The result can be the unnecessarily slow operation of the common
   sky crossmatch operation.

   The purpose of this section is to propose a preferred form of the WHERE
   clause used for sky crossmatches.  Clients posing crossmatch-like
   queries are advised to phrase them like this rather than semantically
   equivalent alternatives, and services are encouraged to ensure that
   this form of join constraint is evaluated efficiently;
   this might involve identifying such ADQL input clauses and rewriting
   them appropriately for efficient processing on the database backend.
   Alternative semantically equivalent forms however MAY still be
   used by clients, and MUST still be handled correctly by services.

   The preferred form of WHERE clause to constrain a sky position
   crossmatch is
   \begin{verbatim}
      WHERE DISTANCE(lon1, lat1, lon2, lat2) < r_max_deg
   \end{verbatim}
   or equivalently
   \begin{verbatim}
      WHERE DISTANCE(POINT(lon1, lat1), POINT(lon2, lat2)) < r_max_deg
   \end{verbatim}

   In some cases the performance characteristics will depend on which
   way round the coordinates are specified.
   The preferred form is that the coordinate pair more likely to be handled
   efficiently (with an index, or from a small table)
   should be in the {\em first\/} position ({\tt lon1}, {\tt lat1})
   and the pair less likely to be handled efficiently
   (without an index, or from a large or maybe uploaded table)
   should be in the {\em second\/} position ({\TT lon2}, {\tt lat2}).


I think this (sub)section should probably go near the end of
section 4.2, but you could argue for other places in sec 4,
or maybe even section 2.

Questions on the details:

   - Is DISTANCE (in 2 flavours) the best syntax to use here?
     I think it's what we agreed on in Cape Town, partly on the
     grounds that its meaning is transparent and it doesn't require
     any new ADQL syntax.  But I don't care that much what the
     syntax is, as long as it's agreed.  The other possibility
     that I know has been used (is currently recommended in TOPCAT,
     but never to my knowledge explicitly endorsed in any standard
     or Note) is
     1=CONTAINS(POINT(lon1, lat1), CIRCLE(lon2, lat2, r_max_deg)).
     But that looks much less user-friendly to me.

   - Is this promotion of a preferred crossmatch form a good idea?
     I think it is, for the reasons sketched above, but if anybody
     is unconvinced I can try to argue the case more strongly.

   - Does the wording about which point goes in first/second position
     make sense?  Is it the right way round?  I know that different
     services prefer different orders (e.g. TAPVizieR vs. DaCHS).
     Again, I don't care which it is, but if it's the sort of thing
     that TAP implementations can benefit from (and I have the
     impression that it is) there should be an agreement.

   - Any other wording changes that people want, fine.  I'm not a
     database expert.

Thanks for considering,

Mark

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/


More information about the dal mailing list