Two very loose ends of the ADQL 2.1 PR
Markus Nullmeier
mnullmei at ari.uni-heidelberg.de
Sat May 26 04:17:29 CEST 2018
Hello list,
Problem A:
I have been made aware of Section 4.2.7: "Preferred crossmatch syntax"
of the ADQL 2.1 PR. As one of the maintainers of pgSphere, on top of
which many a data centre runs various other software to implement
ADQL, I claim to have some, if indirect, insight into real-world
deployment of ADQL.
While I do not have an opinion on ADQL syntax, I find the following
sentence to be highly problematic:
"Clients posing crossmatch-like queries are advised to phrase them
this way rather than semantically equivalent alternatives, and
services are encouraged to ensure that this form of join is executed
efficiently;"
For, in the real world, quite a few existing and very important services
will virtually certainly, for ages to come, refrain from the effort to
upgrade the ADQL implementations they are using with the necessary
updates to rewrite queries accordingly -- however small and seemingly
simple these changes appear to be.
But the net result of that sentence will be that some users or even
client implementers are going to pick up that "good advice", giving
them a spectacularly bad VO experience on many real services, where
the underlying database software (whatever it is) will use sequential
scans instead of index scans, with the latter of course being orders
of magnitude faster.
Besides, there is a rather odd mismatch between the quite strong choice
of "advised" for users / clients and the much weaker word "encouraged"
for services.
My recommendation is therefore to remove the words
"Clients posing crossmatch-like queries are advised to phrase them
this way rather than semantically equivalent alternatives"
for good. Also, if pushing some kind of crossmatch syntax (by the way,
what about cone search?) is to have any hope of success, then a
sentence such as "This syntax MUST be handled at least as efficiently
as semantically equivalent queries" would be in order.
But I do believe enacting such a "quick fix" would be premature, because
it is the antithesis of meaningful interoperability:
The proposed preferred syntax of Section 4.2.7 did not exist before
ADQL 2.1. Thus, the most efficient crossmatch syntax of older services
is necessarily something else. But by following the old rule "be liberal
in what you accept, be conservative in what you send", new ADQL services
should really make sure that _any_ efficient crossmatch syntax that has
had a significant following in the past is executed in the most
efficient way.
However, I personally lack the data about extant efficient ADQL
crossmatch queries. Lacking this data, it may be wise to postpone
Section 4.2.7 to ADQL 3.0.
Problem B:
To the best of my knowledge, the TAP 1.1 PR does not mention ADQL
boxes. I wonder what still having boxes in ADQL means in this context.
But be that as it may, this is not the chief problem of ADQL's box.
I know that all the problems with box are inherited from a long time
ago, but any "dot-one" release should still really fix the following
misfeatures. First, let me quote the relevant parts
of the PR text (Section 4.2.9) below to give everybody reading this full
context:
The BOX function expresses a box on the sky. A BOX is a special
case of POLYGON, defined purely for convenience, and
it corresponds semantically to the equivalent term, Box,
defined in the STC specification.
It is specified by a center position and size (in both axes)
defining a cross centered on the center position and with arms
extending, parallel to the coordinate axes at the center position,
for half the respective sizes on either side. The box’s sides are
line segments or great circles intersecting the arms of the cross
in its end points at right angles with the arms.
A small nitpick for warming up: the phrase
"[...] it corresponds semantically to the equivalent term, Box,
defined in the STC specification"
is a bit weird, because the ADQL box is a specialised syntax to specify
a spherical polygon (with great circle segments as edges), whereas
STC's box probably allows for other kinds of edges than great circle
segments. This also applies to other ADQL geometries(!). Maybe one could
state, more correctly, that ADQL geometries are a subset of the
possible geometries envisaged by STC, rather than being "equivalent".
Now, the real problem is that "box" is perfectly ill-defined. The text
speaks of
"arms extending, parallel to the coordinate axes at the center
position",
but nowhere does it define what these "arms" should be. There are at
least two equally plausible interpretations that easily come to mind:
a) The "arms" are great circles. A very good argument for that is
that the text requires them to be parallel to the coordinate
axes only at the so-called centre position of the box.
b) The "arms" are circles with constant RA, or constant DEC,
respectively. A very good argument for that is that the text
speaks of the actual edges of an ADQL box as
"line segments or great circles",
presumably another category than "arms".
[By the way, the "line segment" expression, nowhere else
to be found in the ADQL 2.1 PR, is obviously a very old
copy-and-paste leftover from the then current STC
document, where it alludes to curves on the unit sphere
that are not great circles.]
(Note that a) and b) are different only for the "arms parallel to
the coordinate axes" tangential to circles of constant DEC.)
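To make the difference between a) and b) concrete, here is a minimal
sketch that computes the end point of the eastward arm under each
reading. It is not taken from the PR or from any existing
implementation. The function names are hypothetical, and for b) it
assumes that the half size is measured as arc length along the circle
of constant DEC, which is itself just one possible reading.

```python
import math

def east_arm_end_great_circle(ra0, dec0, half_size):
    """Interpretation a): follow the great circle that heads due east
    at the centre. All angles in degrees."""
    ra0_r, dec0_r, d = map(math.radians, (ra0, dec0, half_size))
    # Standard great-circle destination formula for a bearing of 90 degrees.
    dec1 = math.asin(math.sin(dec0_r) * math.cos(d))
    dra = math.atan2(math.sin(d) * math.cos(dec0_r),
                     math.cos(d) - math.sin(dec0_r) * math.sin(dec1))
    return (math.degrees(ra0_r + dra) % 360.0, math.degrees(dec1))

def east_arm_end_constant_dec(ra0, dec0, half_size):
    """Interpretation b): follow the circle of constant DEC.
    Assumption: half_size is arc length along that small circle,
    so this breaks down close to the poles."""
    dra = half_size / math.cos(math.radians(dec0))
    return ((ra0 + dra) % 360.0, dec0)

if __name__ == "__main__":
    for dec0 in (0.0, 60.0):
        print(dec0,
              east_arm_end_great_circle(10.0, dec0, 5.0),
              east_arm_end_constant_dec(10.0, dec0, 5.0))
```

On the equator both readings give the same end point. Away from it,
the great-circle arm drifts towards the equator while the constant-DEC
arm does not, so the resulting "boxes" already differ in their corners.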
I guess other interpretations might have their merits, too. Anyway, the
question now is what may be done with this bane to interoperability.
First, everybody reading the above carefully should agree that ADQL 2.1
must NOT pass with "box" being in this dire state.
Second, it follows from the above that, because of its woefully incomplete
specification, there has _never_ been a compliant implementation of
ADQL's box, by anybody.
[Now, I actually have spoken to people who did claim to have had
to-the-spec implementations at some point in time, but there was
not sufficient time to discuss which of the above (or even
another) interpretation they implemented, and if they thus had
implemented the same interpretation.
Also, they interestingly had no intention at all to put these
implementations forward in any way, they rather had a good laugh
when a conclusion along the lines 'happy to have that time
wasted' came up.]
In the real world, many ADQL implementations nowadays just offer
coordinate box semantics for "box" [they are a far cry from
the ADQL text, especially because they are _not_ spherical polygons].
Coordinate boxes are, from what I understand, requested by a sizeable
fraction of users. But, for what it's worth, they are not universally
appreciated by ADQL implementers (see also the TAP 1.1 issue
above).
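For readers unsure what "coordinate box semantics" amounts to, the
following is a minimal sketch of a membership test. It is an
illustration only, not the code of any particular ADQL implementation.
The function name is hypothetical, and it assumes that both sizes are
plain coordinate ranges in RA and DEC, in degrees.

```python
def in_coordinate_box(ra, dec, ra0, dec0, width, height):
    """True if (ra, dec) lies within width/2 in RA and height/2 in DEC
    of the centre (ra0, dec0). All values in degrees.
    Assumption: width is a range in the RA coordinate, not an arc length."""
    # Fold the RA difference into [0, 180] so the test works across RA = 0.
    dra = abs((ra - ra0 + 180.0) % 360.0 - 180.0)
    return dra <= width / 2.0 and abs(dec - dec0) <= height / 2.0

if __name__ == "__main__":
    # A point just on the other side of RA = 0 is still inside.
    print(in_coordinate_box(359.5, 10.0, 0.5, 10.0, 4.0, 2.0))
```

The edges of constant DEC in such a box are small circles, which is
why it is not a spherical polygon, and the test becomes meaningless
once dec0 plus height/2 exceeds 90 degrees.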
At least one widely used ADQL implementation _does_ create a four-sided
polygon for "box", but it implements very simple calculations that are
totally incompatible with _any_ interpretation of the ADQL text, 2.0 or
2.1 PR.
Probably the original motivation for the failed attempt of ADQL's box
was the idea to have "something similar to a coordinate box that works
around the poles". I wonder if there is any meaningful use case for
that -- one can always use a spherical circle to probe the neighbourhood
of a coordinate. But even if a case for such a function could be made,
those who are proposing such a thing should come up with a sound
definition of "box", or whatever else such a convenience polygon
construction function would be called.
Furthermore, there really should be a freely licensed and sufficiently
documented reference implementation for that before such a thing would
be standardised, because the numerical calculations for either of the
above interpretations a) or b) are quite involved and proportionally
error-prone.
The most correct solution for the "box" problem can thus only be
1. to remove it from ADQL 2.0 via an erratum, because of its complete
ill-specification (see above).
2. to remove it from extant ADQL implementations, 2.0 or otherwise,
with error messages that clearly explain the problem at hand.
This is actually a very good service to users, who in many cases
_today_ are getting results from their queries that are very much
different from what they expect. "Be arbitrary in what you send"
is not to be recommended for interoperability.
3. to discuss the necessity to specify coordinate boxes, imperatively
with a different name, such as "cbox". For a start, remember that
these kinds of geometries are optional ADQL features.
4. if 3. is answered in the positive, to discuss putting coordinate
boxes into ADQL 2.1 and TAP 1.1 to accommodate users who appear
to have a need for them.
Best regards,
Markus Nullmeier