TAP, ADQL and field names inconsistencies between providers.

Mon Dec 7 17:28:14 CET 2020

Hello John,

It is indeed the right place to speak about ADQL and TAP.

In your email, you mentioned several things that I'll try to sum up here 
just to be sure we are speaking about the same (so, tell me if I 
forgot/mis-understood any):

     a) qualified field name VS just the field name

     b) with or without double quotes ?

     c) why tables/columns names are different from one service to 
another? (example of Gaia-eDR3)

The short story:

According to me, ADQL and TAP standardize the ground bases, but few 
details are left to implementations on purpose. As you noticed, few 
aspects are slightly different from one service to another. These 
differences can be due to a misinterpretation of the standard, to a bug 
or to a will from the service provider (either to add a feature, to be 
more permissive/restrictive, or due to unfortunate technical constraints).

Now, the long version:

   a) qualified field name VS just the field name
&
   b) with or without double quotes ?
----------------------------------------------------------------

If you look at the grammar of ADQL REC-2.0 (and even the PR-2.1), 
qualifying the column name by its table name (or the table name by its 
schema name) is not required. Similarly, the identifiers (i.e. 
schema/table/column names and aliases) may be escaped (i.e. within 
double quotes), but again, it is generally not a MUST.

However, here are some exceptions:

     * A column or table name MUST be qualified if ambiguous with 
another such identifier.
       (rule based on common sense ; but I can not see any mention of 
this rule in the ADQL document....this should probably be added in there)

     * Escaping an identifier is REQUIRED in the following cases 
(section 2.1.2 of ADQL-2.0):

         - the identifier is a reserved keyword (e.g. DISTANCE, REGION, 
RAND, SELECT, ...),
         - the identifier does not follow the correct syntax (i.e. MUST 
begin with a letter {aA-zZ}, subsequent characters MAY be letters, 
underscores or digits {0-9}),
         - to preserve case sensitivity.

So, all the following are allowed in ADQL (as in most SQL dialects):

"schema"."table"."column"
   "schema"."table".column
   "schema".table."column"
   "schema".table.column
   schema."table"."column"
   schema."table".column
   schema.table."column"
   "table"."column"
   "table".column
   table."column"
   table.column
   column

...but depending on your specific query and on the identifiers you are 
going to use, some of the above variants won't be possible (cf exceptions).

Then, it is theoretically up to the user to decide if he prefers to 
qualify and escape its identifiers all the time or in some cases. 
Personally, I prefer to have the shortest form of identifiers as 
possible, as you apparently do. But some users prefer to have the 
longest version for personal or educational reasons (e.g. to be as much 
explicit as possible).

But obviously, because all ADQL engines are not implemented the same, 
there are different behaviors on some points. As said above the reasons 
for these differences are various and as I did not implement all of them 
I can not speak for all. I can only think to three reasons: 
mis-interpretation of the standard, a bug, a choice of the service provider.

Maybe the best thing to do would be to ask them to check their 
implementation. I can at least speak for the ARI's, Simbad's and 
VizieR's TAP services as they rely on my ADQL+TAP library. On purpose, I 
chose to no force users to escape and/or qualify identifiers if not needed.

Then, to take the specific example of VizieR, table names are the unique 
catalogue identifiers in the VizieR database (for consistency with the 
major and original part of the VizieR database/services). Because of 
this choice, table names do not follow the correct ADQL syntax for 
identifiers, and that's why, table names must be escaped. Besides, from 
its beginning VizieR has its own way to deal with catalogues hence some 
different column names (e.g. RA_ICRS, ...).

c) why tables/columns names are different from one service to another? 
(example of Gaia-eDR3)
------------------------------------------------------------------------------------------------------------------------------------------

I can think about at least three reasons:

     * Database structure constraints (e.g. Vizier organizes tables by 
VizieR catalogue identifiers, some DBMS like MySQL do not support schema 
which forces the implementation to have a table name including the 
schema name, etc....).

     * Technical reasons (e.g. name not allowed by the DBMS, a 
schema/table already exists with this name, ....).

     * No agreement between data-centers to name the schema, tables and 
columns the same way. It is true that in some cases it would easier the 
users' life. In the case of Gaia, there is a such effort only between 
partner data-centers (ESAC, CDS, ARI, ...) but sometimes (e.g. CDS with 
VizieR) there are technical/historical reasons that can not be 
punctually changed for just one exception (or the whole 
architecture/system would have to be entirely changed).

------------------

Sorry for not giving you any concrete solution. I think there is really 
a place for improvement in ADQL and TAP, but I personally do not know 
how to fix this particular issue.

As editor of the ADQL standard (since few time), I do not think we 
should force one behavior more than another. After all, this flexibility 
also exists in SQL which is the backend language of all the currently 
existing TAP implementations I know. But we could maybe add a 
recommendation about escaping and qualifying identifiers. Let's say 
something like:

     "ADQL implementations SHOULD not force identifiers to be escaped 
and/or qualified unless necessary.".

@other ADQL authors: suggestions/comments?

Finally, I personally think it is already a very good thing that the 
ADQL syntax is not significantly different from one TAP service to 
another. I am the developer of one ADQL+TAP implementation library and I 
have to deal with all the different DBMS syntaxes (e.g. PostgreSQL, 
MySQL, SQLServer). I can ensure you that it is even more confusing to 
switch from one DBMS to another. Besides their syntax difference for 
some statements (e.g. JOIN, SELECT TOP), there is also sometimes a 
difference on how to write identifiers (e.g. backquotes instead of 
double quotes, lower/upper case, ....). I do not say that ADQL is 
perfect but at least, implementations are mostly consistent with each other.

Best regards,

Grégory Mantelet
--------------------------------------------------
Software Developer at CDS
gregory.mantelet at astro.unistra.fr

On 05/12/2020 00:59, cpmjg at tutanota.com wrote:
> I believe this is the working group for suggesting the standardisation of such things?
>
> The release of GAIA EDR3 has highlighted this with its separate TAP providers with not only more than one field naming scheme but at least three different pedantic syntaxes for the naming of said!
>
> TAP VizieR has different field header names, possibly the nature of VizieR itself causes this, but it is a shame they don't keep the same field heading names as the provider of the table.
>
> Then it is also served by ARI and AIPS TAP services, and it is all ADQL, EXCEPT
>
> one uses
>
> SELECT "gaiaedr3source".ra
> FROM "gaiaedr3source"
>
> and all WHERE and similar statements need the same syntax as select,
>
> another uses
>
> SELECT "gaiaedr3source"."ra"
> FROM "gaiaedr3source"
>
> again, WHERE etc need the same syntax as SELECT
>
> and yet another uses
>
> SELECT ra
> FROM gaiaedr3source
>
> and things like ra < 9 or ORDER BY ra work.
>
> There really needs to be some standardisation on ADQL more than there is, this point seem to get missed.  Sometimes, if there are no examples provided (and if there are examples it is always the thoroughly uninformation "SELECT TOP 10" without specific fields noted in order to figure out the syntax)
>
> Now I'm not a programmer or a technician nor a professional astronomer but the fact that once I manage to get my brain around cobbling some ADQL together that may actually work I can spend ages trying to debug it only to realise after a while it's the field name syntax!  As most error messages unhelpfully say "expecting this that and the other command following )" or something, it is not always clear.
>
> In other words, there is a lack of standard here in the ADQL standard and it needs to be addressed.
>
> I of course prefer
>
> SELECT ra
> FROM source
> WHERE ra <= 2 AND ra IS NOT NULL
>
> (I'm not sure there's a standard on null either, and it is hard to look up the proper syntax on some sites)
>
> I will likely get told that things like JOINs make that poor code, well it doesn't, the few sites that use this shorter version (eg IRSA IPAC SQL search) where you may want to join you can simply go
>
> SELECT source1.ra, source2.ra, mag, z
> FROM source1, source2
>
> as needed, but still only use unique to each table fieldnames without having to put the source in for those.
>
> And whilst I'm having a gripe some of the very most common fieldnames need to be standardised in these tables across all tables, not just different providers of the same tables.
>
> RA_ICRS and RA_J2000 for instance, instead of the amazing number of variations on a theme that something simple like these two can be between different datasets, it'd make porting scripts a heck of a lot easier.  In which case ra should probably be deprecated, or made to default to mean J2000.
>
> Of course there wouldn't be many of these, just the verymost common things like RA and Dec in degrees.
>
> Today, having given up on TAP VizieR as not doubt it was being choked on GAIA EDR3 queries, I decided to try both ARI and AIPS tap servers.  You can look things up and cut and paste the field names for each, and even double click on field names to autopopulate the input dialogue on one of them, but you shouldn't have to.  Typing is often quicker then clicking through long lists.
>
> But I copied and pasted the ADQL from VizieR to the other two.  Well, VizieR used different fieldnames, so in the ARI I renamed them, and that didn't need the quotes nor the proceeding table name for fields, just the fieldname, so I deleted those all one by one, and then I copied that and pasted it into AIPS as that had the same field names, but no, that needed them all to be preceded by the table name again, with a . between the two!  And then it had to be put in quotes!  Both table name and field name, just just table name as in VizieR.
>
> It really looks sloppy and unprofessional for a standard protocol and script for said protocol to have about five different ways to SELECT etc fields!
>
> "table"."field"
>
> "table".field
>
> table.field
>
> field (with the table implicit from the FROM entry but added to the field name when more than one table used and there are common fieldnames in those tables, which I must say I find easiest to follow and understand and most of all read when checking the script!).
>
> and one I can't remember, possibly "field".
>
> Yours
>
> John
>
> John Greaves
>
> Not an IAU member, not an IVOA member, not a scientist, the do make me pay tax though if that helps.
>