Two very loose ends of the ADQL 2.1 PR

Sat May 26 11:21:14 CEST 2018

Markus,

On Sat, 26 May 2018, Markus Nullmeier wrote:

> Problem A:
>
> I have been made aware of Section 4.2.7: "Preferred crossmatch syntax"
> of the ADQL 2.1 PR. As one of the maintainers of pgSphere, which is
> actually used by many a data centre to run various other software on
> top of it to implement ADQL, I claim to have some, if indirect, insight
> on real-world deployment of ADQL.
>
> While I do not have an opinion on ADQL syntax, I find the following
> sentence to be highly problematic:
>   "Clients posing crossmatch-like queries are advised to phrase them
>    this way rather than semantically equivalent alternatives, and
>    services are encouraged to ensure that this form of join is executed
>    efficiently;"
> For, in the real world, quite a few existing and very important services
> will virtually certainly, for ages to come, refrain from the effort to
> upgrade the ADQL implementations they are using with the necessary
> updates to rewrite queries accordingly -- however small and seemingly
> simple these changes appear to be.

This is a new item in ADQL 2.1, so it does not affect those
implementing or using ADQL 2.0 services.  The legacy services that
you mention will be serving ADQL 2.0, and hence will not be under
any expectation of implementing this suggested syntax efficiently.
Services which, on the other hand, decide to upgrade to ADQL 2.1
ought to take account of this section of the ADQL specification
along with all the rest of it.

If that point isn't clear, it could be made explicit in the text
by adding a comment like

   "While ADQL 2.0 services are also encouraged to implement this syntax
    efficiently, clients should be aware there is no general expectation
    that such queries will execute efficiently on such legacy services."

or maybe just writing instead

  "Clients posing crossmatch-like queries in ADQL 2.1 are advised ..."

> But the net result of that sentence will be that some users or even
> client implementers are going to pick up that "good advice", giving
> them a spectacularly bad VO experience on many real services, where
> the underlying database software (whatever it is) will use sequential
> scans instead of index scans, with the latter of course being orders
> of magnitude faster.

The idea is that both users/clients and service providers will
pay attention to the advice, so that crossmatches can run efficiently
without too much implementation effort or guesswork on either side.
Since this advice only applies to a new version of the standard
(not legacy services), that doesn't seem too far-fetched.

> Besides, there is a rather odd mismatch between the quite strong choice
> of "advised" for users / clients and the much weaker word "encouraged"
> for services.

I may have put too much nuance into the language here.  The intention
is not to signal stronger or weaker force.  Users are free to take or
ignore the "advice" (since doing it wrong will mostly affect themselves),
whereas services have a kind of moral obligation to do this, in order
to provide good service to their users.  If non-native English
speakers would like to offer a clearer or less ambiguous form of
words, I don't object.

> My recommendation is therefore to remove the words
>   "Clients posing crossmatch-like queries are advised to phrase them
>    this way rather than semantically equivalent alternatives"
> for good. Also, if the net effect of pushing some kind of crossmatch
> syntax (by the way, what about cone search?) is to have any hope, then
> a sentence such as "This syntax MUST be handled as efficient or better
> as semantically equivalent queries" would be in order.

I am reluctant to put in a MUST relating to performance details.
It's probably untestable and unenforceable.

> But I do believe enacting such a "quick fix" would be premature, because
> it is the antitheses of meaningful interoperability:
>
> The proposed preferred syntax of Section 4.2.7 did not exist before
> ADQL 2.1. Thus, the most efficient crossmatch syntax of older services
> is necessarily something else. But by following the old rule "be liberal
> in what you accept, be conservative in what you send", new ADQL services
> should really make sure that _any_ efficient crossmatch syntax that had
> had a significant following in past would be executed in the most
> efficient way.

Of course it's good if services can implement all crossmatch
syntax variants efficiently.  But at present there is a large and
ill-defined set of these, which makes it a heavy requirement to put
on services.  Many of them don't or can't meet it, so many
clients have a poor experience.

> However, I personally lack the data about extant efficient ADQL
> crossmatch queries. Lacking this data, it may be wise to postpone
> Section 4.2.7 to ADQL 3.0.

The trouble is that nobody else has this data either.  There is no
information or suggestion anywhere in the existing specifications,
or other IVOA documents, about what such "efficient crossmatch
syntax" might be, and in my experience, different implementors
have different and contradictory ideas about the right or obvious
way to phrase such queries.  The effect is that, as things currently
stand (ADQL 2.0), users are completely in the dark about how to write
a crossmatch query that will use an index scan rather than a
sequential scan, and things that work well on one service will
work very badly on others.  Users can only gain this knowledge
concerning a particular service by either trial and error or
talking to some expert on the service in question.
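
To make the ambiguity concrete, here is a minimal Python sketch that
builds three ADQL 2.0 strings for the same positional crossmatch.
The table names, the column names (id, ra_deg, dec_deg) and the helper
crossmatch_variants are hypothetical, chosen only for illustration.
The phrasings are equivalent up to how the boundary of the match radius
is treated.  The sketch makes no claim about which of them is the form
given in Section 4.2.7, nor about which one any particular service
executes with an index scan.

```python
def crossmatch_variants(table_a, table_b, radius_deg):
    """Return three ADQL 2.0 phrasings of one positional crossmatch.

    Table and column names are hypothetical; radius_deg is in degrees.
    """
    point_a = "POINT('ICRS', a.ra_deg, a.dec_deg)"
    point_b = "POINT('ICRS', b.ra_deg, b.dec_deg)"
    circle_b = f"CIRCLE('ICRS', b.ra_deg, b.dec_deg, {radius_deg})"
    select = "SELECT a.id AS id_a, b.id AS id_b"
    return [
        # Explicit join with a distance threshold.
        f"{select} FROM {table_a} AS a JOIN {table_b} AS b "
        f"ON DISTANCE({point_a}, {point_b}) < {radius_deg}",
        # Explicit join with a point-in-circle test.
        f"{select} FROM {table_a} AS a JOIN {table_b} AS b "
        f"ON 1 = CONTAINS({point_a}, {circle_b})",
        # Implicit (comma) join with the same test in the WHERE clause.
        f"{select} FROM {table_a} AS a, {table_b} AS b "
        f"WHERE 1 = CONTAINS({point_a}, {circle_b})",
    ]


if __name__ == "__main__":
    for query in crossmatch_variants("cat_a", "cat_b", 0.001):
        print(query)
```

A user faced with these three strings has, under ADQL 2.0, no way of
telling from the specification which of them a given service will run
quickly.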

This new section is an effort to address that problem.
I still think it's a good idea, and that it will result in a better
TAP user experience at a low cost for service implementors.
But other opinions are welcome!

Mark

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/