ADQL XMATCH

Wed Feb 10 18:21:54 CET 2016

Two thoughts, one minor and one more significant.

I think the issue of whether we use Points or coordinates is a
relatively minor one.  Personally I'd think that in a sane world one
would implement both.  Since one can be transformed to the other with
essentially a single line of code, giving users both forms, so they can
pick whichever matches what they want to do, makes sense to me.  The
time we've spent on this debate is probably greater than the time it
would take to do this for a typical implementation.  E.g., in pseudocode
we can implement the Point version on top of the coordinate version as
     function xmatch(Point a, Point b, double radius) {
          return xmatch(a.getCoordinate(0), a.getCoordinate(1),
                        b.getCoordinate(0), b.getCoordinate(1), radius);
     }
or if we want to make the Point one more fundamental
     function xmatch(double ra1, double dec1, double ra2, double dec2,
                     double radius) {
          return xmatch(new Point(ra1, dec1), new Point(ra2, dec2), radius);
     }
I'm not sure why a rule against overloading functions was promulgated.
Given that this changes not just the types but the number of arguments,
supporting these overloads should be easy, but if absolutely necessary
one could use slightly different names, e.g.,
    distance(Point,Point) and distanceC(double,double,double,double).
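
To make the "single line of code" claim concrete, here is a minimal
runnable sketch in Python.  Everything in it is an assumption made for
illustration and not part of any ADQL draft: the Point class, the names
distance_c/distance/xmatch, coordinates and radius in degrees, a boolean
return value, and the inclusive (<=) comparison.  Since Python has no
signature-based overloading, xmatch dispatches on the number of
arguments, which is exactly what distinguishes the two forms.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Hypothetical point type; ra and dec are assumed to be in degrees."""
    ra: float
    dec: float


def distance_c(ra1, dec1, ra2, dec2):
    """Coordinate form: angular separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def distance(a, b):
    """Point form: a one-line wrapper over the coordinate form."""
    return distance_c(a.ra, a.dec, b.ra, b.dec)


def xmatch(*args):
    """Accept (Point, Point, radius) or (ra1, dec1, ra2, dec2, radius)."""
    if len(args) == 3:
        a, b, radius = args
        return distance(a, b) <= radius
    if len(args) == 5:
        *coords, radius = args
        return distance_c(*coords) <= radius
    raise TypeError("xmatch takes 3 or 5 arguments")
```

With this, xmatch(Point(10.0, 20.0), Point(10.0, 20.001), 0.002) and
xmatch(10.0, 20.0, 10.0, 20.001, 0.002) give the same answer, and the
differently named pair distance/distance_c shows the fallback for the
case where overloading really is ruled out.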

More significantly: in deciding what functions to provide, it seems like
we should primarily be designing ADQL to support our astronomical use
cases.  Regardless of what we choose, it will be easy to build queries
that fail to use indices optimally.  This is true regardless of the
Point/coordinate distinction or of whether we use xmatch or something
else.  E.g., in my tests of q3c and pgsphere, apparently trivial changes
in the query could determine whether the indexes got used efficiently.
I suspect that the same will be true in other Postgres libraries and in
non-Postgres databases.

Walter gave a talk in Sydney describing elements of the geometry that 
IRSA tables would be able to support and I think we need to build upon 
that so that we have a recommended syntax for use in joining tables that 
we all endeavor to support.  I believe it is premature to base the ADQL 
standard upon our preconceptions about what is easy or hard for the 
query optimizer to support.  The query optimizer is not our customer.  
E.g., as I've mentioned, at the HEASARC we found that the xxx()=1 syntax
itself defeated the optimizer and we had to work to address that.  So
regardless of what we decide to put in ADQL, we should suggest a 
specific idiom that we will do our best to optimize.  But we must 
recognize that users may employ whatever functions we define in ways 
that are likely to be non-optimized.  And often that will be fine since 
the tables will be small enough or some other constraint will catch the 
eye of the optimizer.

Walter and Theresa's notes make it clear that we're already looking
inside the query and adapting it to our specific implementations.  I
suspect that at some level all of us are doing that and will continue to
do so.
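
As a rough illustration of the kind of rewriting discussed in the quoted
messages below, here is a small Python sketch that recognizes the
pattern "distance(ra1,dec1,ra2,dec2) < something" and translates it into
the CONTAINS(POINT, CIRCLE) = 1 idiom.  It is only a sketch under
assumptions: the four-argument distance() is the hypothetical function
from this thread, the 'ICRS' frame is assumed, the function name
rewrite_distance is made up, and a regular expression stands in for what
a real service would do in its ADQL parser.  It handles only simple
arguments with no nested parentheses.

```python
import re

# One argument: any run of characters without commas or parentheses.
_ARG = r"\s*([^,()]+?)\s*"

# Matches: distance(a, b, c, d) < r, where r is a number or an identifier.
_DISTANCE_LT = re.compile(
    r"\bdistance\(" + _ARG + "," + _ARG + "," + _ARG + "," + _ARG
    + r"\)\s*<\s*([\w.]+)",
    re.IGNORECASE,
)


def rewrite_distance(adql):
    """Rewrite distance(...) < r into CONTAINS(POINT, CIRCLE) = 1."""
    def repl(match):
        ra1, dec1, ra2, dec2, radius = match.groups()
        return (f"CONTAINS(POINT('ICRS', {ra1}, {dec1}), "
                f"CIRCLE('ICRS', {ra2}, {dec2}, {radius})) = 1")
    # Queries without the pattern are returned unchanged.
    return _DISTANCE_LT.sub(repl, adql)


query = "SELECT * FROM a JOIN b ON distance(a.ra, a.dec, b.ra, b.dec) < 0.001"
print(rewrite_distance(query))
```

One caveat with this translation: a strict "<" and a CONTAINS test may
treat points lying exactly on the circle differently, so a real
implementation would need to decide whether that matters.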

     Tom

Theresa Dower wrote:
> While the issue of distance() being overloaded in ADQL remains, I wanted to note that for our use of SQL Server at STScI, we do enough query rewriting already that a translation from [something like] distance(....) would be basically the same work as we already have to do with contains().
>
> I echo Alex's concern about calling something simple 'xmatch' when it isn't, or effectively putting the burden of a better crossmatch implementation on service providers. Something like distance(...) would be more honest, though I have no suggestion for quite what to call it without overloading that function.
>
> --Theresa
>
> -----Original Message-----
> From: dal-bounces at ivoa.net [mailto:dal-bounces at ivoa.net] On Behalf Of Walter Landry
> Sent: Wednesday, February 10, 2016 9:46 AM
> To: dal at ivoa.net
> Subject: Re: ADQL XMATCH
>
> Grégory Mantelet <gmantele at ari.uni-heidelberg.de> wrote:
>>      However, this kind of expression is performed by a sequential scan in
>>      the database. As far as I know, there is no way to index or optimize
>>      such constraint in a database (but I may be wrong so correct me if
>>      needed). On the contrary, "contains(point, circle)" can use an index
>>      (using PgSphere+Postgres for instance). So, I agree, it is ugly, but
>>      it is more efficient.
>>
>>      Then, maybe it is also possible to use some trick like detecting
>>      "distance(ra1,dec1,ra2,dec2) < something" inside the ADQL query and
>>      translate it into the equivalent of "contains(point,circle)" in
>>      SQL... but it is really an ugly trick and may not be so trivial to
>>      implement.
> In my parser (I can not speak for others), implementing this is just recognizing this pattern to be semantically the same as CONTAINS.  It would be about the same amount of work as changing the parser to recognize XMATCH.  So not very much at all.  I think this is the easiest, most intuitive way forward, particularly with point literals.
>
> Cheers,
> Walter Landry