Two very loose ends of the ADQL 2.1 PR
Dobos, László
dobos at complex.elte.hu
Tue May 29 18:28:52 CEST 2018
Hi everyone,
I've been working on SkyQuery for years at JHU and it became clear quite early that cross-matching is not an operator like join because it can be defined between more than two tables. So a different wording, maybe "spatial join", would be better, with "cross-match" reserved for later use.
-Laszlo
-----Original Message-----
From: dal-bounces at ivoa.net [mailto:dal-bounces at ivoa.net] On Behalf Of Markus Nullmeier
Sent: Saturday, May 26, 2018 4:17 AM
To: dal at ivoa.net
Cc: Dave Morris <dmr at roe.ac.uk>
Subject: Two very loose ends of the ADQL 2.1 PR
Hello list,
Problem A:
I have been made aware of Section 4.2.7: "Preferred crossmatch syntax" of the ADQL 2.1 PR. As one of the maintainers of pgSphere, on top of which many a data centre runs various other software to implement ADQL, I claim to have some, if indirect, insight into real-world deployment of ADQL.
While I do not have an opinion on ADQL syntax, I find the following sentence to be highly problematic:
"Clients posing crossmatch-like queries are advised to phrase them
this way rather than semantically equivalent alternatives, and
services are encouraged to ensure that this form of join is executed
efficiently;"
For, in the real world, quite a few existing and very important services will virtually certainly, for ages to come, refrain from the effort to upgrade the ADQL implementations they are using with the necessary updates to rewrite queries accordingly -- however small and seemingly simple these changes appear to be.
But the net result of that sentence will be that some users or even client implementers are going to pick up that "good advice", giving them a spectacularly bad VO experience on many real services, where the underlying database software (whatever it is) will use sequential scans instead of index scans, with the latter of course being orders of magnitude faster.
Besides, there is a rather odd mismatch between the quite strong choice of "advised" for users / clients and the much weaker word "encouraged" for services.
My recommendation is therefore to remove the words
"Clients posing crossmatch-like queries are advised to phrase them
this way rather than semantically equivalent alternatives"
for good. Also, if pushing some kind of crossmatch syntax (by the way, what about cone search?) is to have any hope of success, then a sentence such as "This syntax MUST be handled at least as efficiently as semantically equivalent queries" would be in order.
But I do believe enacting such a "quick fix" would be premature, because it is the antithesis of meaningful interoperability:
The proposed preferred syntax of Section 4.2.7 did not exist before ADQL 2.1. Thus, the most efficient crossmatch syntax of older services is necessarily something else. But by following the old rule "be liberal in what you accept, be conservative in what you send", new ADQL services should really make sure that _any_ efficient crossmatch syntax that has had a significant following in the past is executed in the most efficient way.
However, I personally lack the data about extant efficient ADQL crossmatch queries. Lacking this data, it may be wise to postpone Section 4.2.7 to ADQL 3.0.
Problem B:
To the best of my knowledge, the TAP 1.1 PR does not mention ADQL boxes. I wonder what still having boxes in ADQL means in this context.
But be that as it may, this is not the chief problem of ADQL's box.
I know that all the problems with box were inherited a long time ago, but any "dot-one" release should still really fix the following misfeatures. First, let me quote the relevant parts of the PR text (Section 4.2.9) below to give everybody reading this the full context:
The BOX function expresses a box on the sky. A BOX is a special
case of POLYGON, defined purely for convenience, and
it corresponds semantically to the equivalent term, Box,
defined in the STC specification.
It is specified by a center position and size (in both axes)
defining a cross centered on the center position and with arms
extending, parallel to the coordinate axes at the center position,
for half the respective sizes on either side. The box’s sides are
line segments or great circles intersecting the arms of the cross
in its end points at right angles with the arms.
A small nitpick for warming up: the phrase
"[...] it corresponds semantically to the equivalent term, Box,
defined in the STC specification"
is a bit weird, because the ADQL box is a specialised syntax to specify a spherical polygon (with great circle segments as edges), whereas STC's box probably allows for other kinds of edges than great circle segments. This also applies to other ADQL geometries(!). Maybe one could, more correctly, state that ADQL geometries are a subset of the possible geometries envisaged by STC, rather than being "equivalent".
Now, the real problem is that "box" is perfectly ill-defined. The text speaks of
"arms extending, parallel to the coordinate axes at the center
position",
but nowhere does it define what these "arms" should be. There are at least two equally plausible interpretations that easily come to mind:
a) The "arms" are great circles. A very good argument for that is
that the text requires them to be parallel to the coordinate
axes only at the so-called centre position of the box.
b) The "arms" are circles with constant RA, or constant DEC,
respectively. A very good argument for that is that the text
speaks of the actual edges of an ADQL box as
"line segments or great circles",
presumably another category than "arms".
[By the way, the "line segment" expression, nowhere else
to be found in the ADQL 2.1 PR, is obviously a very old
copy-and-paste leftover from the then current STC
document, where it alludes to curves on the unit sphere
that are not great circles.]
(Note that a) and b) are different only for the "arms parallel to the coordinate axes" tangential to circles of constant DEC.)
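To make the difference between a) and b) concrete, here is a minimal Python sketch that computes the eastern end point of the "horizontal" arm under each reading. It is only an illustration, not taken from the PR or from any implementation: the function names are made up, and for b) it assumes that the half size is measured as arc length along the circle of constant DEC, which the text does not say either.

```python
import math

def arm_end_great_circle(ra0, dec0, half_size):
    """East end of the horizontal arm under reading a); angles in degrees."""
    a0, d0, s = map(math.radians, (ra0, dec0, half_size))
    # Travel an angular distance s along the great circle that leaves the
    # centre due east (bearing 90 degrees): standard destination formula.
    sin_d1 = math.sin(d0) * math.cos(s)
    d1 = math.asin(sin_d1)
    da = math.atan2(math.sin(s) * math.cos(d0),
                    math.cos(s) - math.sin(d0) * sin_d1)
    return math.degrees(a0 + da) % 360.0, math.degrees(d1)

def arm_end_constant_dec(ra0, dec0, half_size):
    """East end under reading b); angles in degrees.

    Assumption: half_size is arc length along the parallel, so the RA
    offset is half_size / cos(dec0). Not valid at the poles themselves.
    """
    return (ra0 + half_size / math.cos(math.radians(dec0))) % 360.0, dec0

# Compare the two readings on the equator and at DEC = 60 degrees.
for dec0 in (0.0, 60.0):
    print(dec0,
          arm_end_great_circle(10.0, dec0, 5.0),
          arm_end_constant_dec(10.0, dec0, 5.0))
```

On the equator both readings give the same end point; away from it the great-circle arm drifts towards the equator while the constant-DEC arm stays at the centre's declination, so the resulting boxes differ.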
I guess other interpretations might have their merits, too. Anyway, the question now is what may be done with this bane to interoperability.
First, everybody reading the above carefully should agree that ADQL 2.1 must NOT pass with "box" being in this dire state.
Second, it follows from the above that, because of its woefully incomplete specification, there has _never_ been a compliant implementation of ADQL's box, by anybody.
[Now, I actually have spoken to people who did claim to have had
to-the-spec implementations at some point in time, but there was
not sufficient time to discuss which of the above (or even
another) interpretation they implemented, and if they thus had
implemented the same interpretation.
Also, they interestingly had no intention at all to put these
implementations forward in any way, they rather had a good laugh
when a conclusion along the lines 'happy to have that time
wasted' came up.]
In the real world, nowadays actually many ADQL implementations just offer coordinate box semantics for "box" [they are a far cry from the ADQL text, especially because they are _not_ spherical polygons].
Coordinate boxes are, from what I understand, requested by a sizeable fraction of users. But, for what it's worth, they are not universally appreciated by ADQL implementers (see also the TAP 1.1 issue above).
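For readers unsure what "coordinate box semantics" amounts to, a minimal Python sketch follows. It is an assumption about the typical behaviour, not the code of any particular service: the function name is hypothetical, and width and height are taken as plain coordinate differences in RA and DEC, in degrees, with no cos(DEC) scaling.

```python
def in_coordinate_box(ra, dec, ra0, dec0, width, height):
    """True if (ra, dec) lies in the coordinate box centred on (ra0, dec0).

    All values in degrees. Assumption: width and height are plain
    coordinate ranges in RA and DEC, not angular sizes on the sky.
    """
    # Declination: a simple interval test.
    if abs(dec - dec0) > height / 2.0:
        return False
    # Right ascension: wrap the difference into [-180, 180) so that a box
    # may straddle RA = 0.
    d_ra = (ra - ra0 + 180.0) % 360.0 - 180.0
    return abs(d_ra) <= width / 2.0
```

The edges of such a box are circles of constant RA and constant DEC, so except on the equator its "horizontal" edges are not great circle segments, which is why it is not a spherical polygon.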
At least one widely used ADQL implementation _does_ create a four-sided polygon for "box", but it implements very simple calculations that are totally incompatible with _any_ interpretation of the ADQL text, 2.0 or 2.1 PR.
Probably the original motivation for the failed attempt of ADQL's box was the idea to have "something similar to a coordinate box that works around the poles". I wonder if there is any meaningful use case for that -- one can always use a spherical circle to probe the neighbourhood of a coordinate. But even if the case for a use case could be made, those who are proposing such a thing should come up with a sound definition of "box", or whatever else such a convenience polygon construction function would be called.
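As an illustration of why a spherical circle already covers the "works around the poles" case, here is a small Python sketch of a circle containment test. The function names are hypothetical and the haversine formula is simply one common way to compute the angular separation; no existing implementation is implied.

```python
import math

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    a1, d1, a2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2.0) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((a2 - a1) / 2.0) ** 2)
    # Clamp against rounding before taking the arcsine.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def in_circle(ra, dec, ra0, dec0, radius):
    """True if (ra, dec) is within radius degrees of (ra0, dec0)."""
    return angular_separation(ra, dec, ra0, dec0) <= radius
```

Two points at DEC = 89.5 degrees on opposite sides of the pole (RA = 0 and RA = 180) are one degree apart, and the test treats them as neighbours without any special handling of the pole.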
Furthermore, there really should be a freely licensed and sufficiently documented reference implementation for that before such a thing would be standardised, because the numerical calculations for either of the above interpretations a) or b) are quite involved and proportionally error-prone.
The most correct solution for the "box" problem can thus only be
1. to remove it from ADQL 2.0 via an erratum, because of its complete
ill-specification (see above).
2. to remove it from extant ADQL implementations, 2.0 or otherwise,
with error messages that clearly explain the problem at hand.
This is actually a very good service to users, who in many cases
_today_ are getting results from their queries that are very much
different from what they expect. "Be arbitrary in what you send"
is not to be recommended for interoperability.
3. to discuss the necessity to specify coordinate boxes, imperatively
with a different name, such as "cbox". For a start, remember that
these kinds of geometries are optional ADQL features.
4. if 3. is answered in the positive, to discuss putting coordinate
boxes into ADQL 2.1 and TAP 1.1 to accommodate users who appear
to have a need for them.
Best regards,
Markus Nullmeier