TAP, ADQL and field names inconsistencies between providers.

cpmjg at tutanota.com cpmjg at tutanota.com
Tue Dec 8 14:17:55 CET 2020


Hello

Thank you very much for your reply.  I expected much of the common theme of your response, that although the IAU and VO teams etc are after a 'standard' it is no more likely to work than the standard for FITS images which also has so many allowable variations so as to suit everyone (admittedly sometimes such flexibility is needed by a specific survey system, but not always), in that the word 'standard' is a hopeful one.

Of course as a lay person I am not appreciative of this because the vast majority of my troubles learning how to use SQL in astronomy for data, which incidentally also means my only reason for actually learning SQL at all as I have no other reason to know it in my life, are syntax ones.  I will occasionally make what I have always called a 'logic error' in that I make a syntax clean request that is either actually meaningless or misses everything I want (usually just being stupid or word blindness like having an OR instead of an AND and no matter how many times I'm checking by re-reading I keep missing that one!) or are perfectly logical but in practice have a computational time longer than the current age of the universe.  However, most of my errors are always syntax errors and most of those are simply due to inconsistency between platforms.

Therefore I had become a little annoyed with the 'standard' and 'protocols' that were deemed to make things easier.

Granted, I do get along with ADQL a fair bit better than I did with the system that first meant I had to self teach myself SQL, namely CASJOBS.  I've also used the IRSA IPAC systems once or twice, and for UKIDSS data the system set up for WFCAM data.  The latter two aren't too bad.  CASJOBS was/is proprietary MS as far as I can tell and doesn't quite behave right on some things, I seem to remember JOINs can be non-standard in their implementation rather than syntax, so they can be annoying as when they don't work it can only become clear on timeout, which can take a long time to occur.  You need to bleed these datasets nowadays, increasingly they don't play the databse game and include tons of engineering and other non-primary final souce catalogue data all in one table despite the fact that you can tell from quite a lot of professional papers that many flags and raw data kept altogether in one table are not used in data selection (especially amongst the Bayesian brigade).

Anyway, I found ADQL a bit more streamlined, but have been increasingly getting annoyed that TAP and ADQL are not a standard.  I follow what you are saying and I also follow the reasons you give, thank you very much for explaining in such a comprehensive and followable manner.

I'll just have to learn to put up with it!

I do often use the "forms" to start off nowadays, the helpful ones like in TAP Vizier where you can put in constraints via tickboxes and some minor conditionals enterable in a little dialogue box at the end of a field and/or (like in IRSA IPAC) you can modify a full autogenerated query's script in a dialogue box.  Although nulls can be a real nuisance and also dealt with non-standard (eg -999.99 or -99999.9 or zero or just empty), but that's computers for you, when they "invented" them they forgot to include a 'number' for null so it can still be hard work distinguishing it from zero as many things will add zero on transfer instead of 'blank' because blank is either empty, which can give "missing value" errors or a space, which isn't a numeric character.  And of course zero doesn't mean null, zero is a value, especially in astronomy which can have so many valid negative numbers in tandem with positive ones (absolute magnitude, [Fe/H]).

Then I modify the result in the dialogue box or usually just append.

And of course there are often schema browsers and sometimes 'cookbooks' with examples.

It just annoys me that most of the trouble I have is with field name selecting and conditionals on them where errors in the conditional are hidden by simple naming syntax differences causing errors!

It is a pity that even though as you say just the field name can be the default except in cases of ambiguity and/or reserved words (the latter not a new language issue) no one accepts this or includes it as a subset.  I do find this idea of making people learn proper SQL and become a competent programmer instead of being able to concentrate on learning the science and mathematics of astronomy!  You'll have to be a programmer before long to do Astronomy.  I've published in my time, on rare occasions formally (as oppose to amateur or pro-am venues), but am finding it increasingly difficult to pre-process the data into a usable processable form, at least with any confidence of having obtained a complete sample.

This leads into what is of course a somewhat subjective viewpoint of mine, the 'script kiddies' are taking over Astronomy.  It's python this, it's python that, etc, etc.  Now many will disagree but the universally transportable format for data is comma delimited, it will let you work with commercial applications meant for business that do not know of astronomical standards but still can have great utility, they will import and export, as will most computer languages, and some of the code I have found available over the year for certain things is actually still 16 bit DOS (which I have to emulate).  I have been through several tutorials and I cannot get them to do what they say when it comes to importing .csv files, before I even get to exporting to a new file, and for BIG data I need to bleed out what I need from a far too big file to a smaller but big file with only relevant data.  Even before I have to worry about my personal difficulties with object oriented modular programming (and don't get me started on the C like obtuseness of most programming language commands as if gearing it delibaretly towards the mathematically minded.  I could once upon a time do the same thing in COBOL easily and I only learnt that for ten minutes three decades ago).  I currently have to use old BASIC to transform one table to another sometimes just to get the data right for the actual analysis program to take in!

So TAP and ADQL is incredibly useful for people like me, sometimes through TopCat which has fine VO integration for data.  But it is harder work than it needs to be for a 'standard', hence the gripe.

By the way, the VizieR thing still confuses me, as VizieR seems to always self generate a J2000 coordinate for data that is not included in the source (this is especially true for GAIA data which only carries the 201X epochal position for coordinates in most servers but CDS VizieR adds a calculated J2000 into many astrometric datasets using proper motion, as well as one on the fly by default if you forget to turn that off).  This leads in VizieR XMatch to all sorts of RA names being linkable if you have an RAJ2000 and DEJ2000 in every dataset.  However I assume this isn't true for older datasets in the archive.  Which means sometimes you can JOIIN an RA and Dec against two databases with totally different RA and Dec named fields, presumably as they are indexed fields, but I have had cases where it hasn't worked because you can't get it to recognise a field name for a coordinate, or rather that field has been missed in the indexing of datafiles.  Possibly just me getting confused.  When things work and when they don't often appears to be arbitrary, however, despite being a standard.

Anyway, this response has turned into more of an expression of frustration than information, but thanks to your email I now understand the situation :-

IVOA and TAP and ADQL usage is run by a "Guiding" committee as opposed to a "Steering" committee.

Yours

John

John Greaves

PS and just FYI I bled the GAIA EDR3 ra, dec, plx, pmra, pmdec, radveldr2 and Gmag for all sources where these were non-null (not negative for parallax) having it running simultaneously in TAP VizieR, ARI TAP and AIPS TAP all via their browser interfaces.  VizieR went to pending and seemed to hang, possibly too many users, AIPS either wasn't working or timed out or just slow but ARI surprised me by completing the full search and returning just over 7 million entries from the full 1.8 billion dataset in about an hour or so which was a pleasant surprise!  I'd expected it to be an overnight run.

Dec 7, 2020, 4:28 PM by gregory.mantelet at astro.unistra.fr:

> Hello John,
>
> It is indeed the right place to speak about ADQL and TAP.
>
> In your email, you mentioned several things that I'll try to sum up here just to be sure we are speaking about the same (so, tell me if I forgot/mis-understood any):
>
>     a) qualified field name VS just the field name
>
>     b) with or without double quotes ?
>
>     c) why tables/columns names are different from one service to another? (example of Gaia-eDR3)
>
>
> The short story:
>
> According to me, ADQL and TAP standardize the ground bases, but few details are left to implementations on purpose. As you noticed, few aspects are slightly different from one service to another. These differences can be due to a misinterpretation of the standard, to a bug or to a will from the service provider (either to add a feature, to be more permissive/restrictive, or due to unfortunate technical constraints).
>
>
> Now, the long version:
>
>   a) qualified field name VS just the field name
> &
>   b) with or without double quotes ?
> ----------------------------------------------------------------
>
> If you look at the grammar of ADQL REC-2.0 (and even the PR-2.1), qualifying the column name by its table name (or the table name by its schema name) is not required. Similarly, the identifiers (i.e. schema/table/column names and aliases) may be escaped (i.e. within double quotes), but again, it is generally not a MUST.
>
> However, here are some exceptions:
>
>     * A column or table name MUST be qualified if ambiguous with another such identifier.
>       (rule based on common sense ; but I can not see any mention of this rule in the ADQL document....this should probably be added in there)
>
>     * Escaping an identifier is REQUIRED in the following cases (section 2.1.2 of ADQL-2.0):
>
>         - the identifier is a reserved keyword (e.g. DISTANCE, REGION, RAND, SELECT, ...),
>         - the identifier does not follow the correct syntax (i.e. MUST begin with a letter {aA-zZ}, subsequent characters MAY be letters, underscores or digits {0-9}),
>         - to preserve case sensitivity.
>
> So, all the following are allowed in ADQL (as in most SQL dialects):
>
> "schema"."table"."column"
>   "schema"."table".column
>   "schema".table."column"
>   "schema".table.column
>   schema."table"."column"
>   schema."table".column
>   schema.table."column"
>   "table"."column"
>   "table".column
>   table."column"
>   table.column
>   column
>
> ...but depending on your specific query and on the identifiers you are going to use, some of the above variants won't be possible (cf exceptions).
>
> Then, it is theoretically up to the user to decide if he prefers to qualify and escape its identifiers all the time or in some cases. Personally, I prefer to have the shortest form of identifiers as possible, as you apparently do. But some users prefer to have the longest version for personal or educational reasons (e.g. to be as much explicit as possible).
>
> But obviously, because all ADQL engines are not implemented the same, there are different behaviors on some points. As said above the reasons for these differences are various and as I did not implement all of them I can not speak for all. I can only think to three reasons: mis-interpretation of the standard, a bug, a choice of the service provider.
>
> Maybe the best thing to do would be to ask them to check their implementation. I can at least speak for the ARI's, Simbad's and VizieR's TAP services as they rely on my ADQL+TAP library. On purpose, I chose to no force users to escape and/or qualify identifiers if not needed.
>
> Then, to take the specific example of VizieR, table names are the unique catalogue identifiers in the VizieR database (for consistency with the major and original part of the VizieR database/services). Because of this choice, table names do not follow the correct ADQL syntax for identifiers, and that's why, table names must be escaped. Besides, from its beginning VizieR has its own way to deal with catalogues hence some different column names (e.g. RA_ICRS, ...).
>
>
> c) why tables/columns names are different from one service to another? (example of Gaia-eDR3)
> ------------------------------------------------------------------------------------------------------------------------------------------
>
> I can think about at least three reasons:
>
>     * Database structure constraints (e.g. Vizier organizes tables by VizieR catalogue identifiers, some DBMS like MySQL do not support schema which forces the implementation to have a table name including the schema name, etc....).
>
>     * Technical reasons (e.g. name not allowed by the DBMS, a schema/table already exists with this name, ....).
>
>     * No agreement between data-centers to name the schema, tables and columns the same way. It is true that in some cases it would easier the users' life. In the case of Gaia, there is a such effort only between partner data-centers (ESAC, CDS, ARI, ...) but sometimes (e.g. CDS with VizieR) there are technical/historical reasons that can not be punctually changed for just one exception (or the whole architecture/system would have to be entirely changed).
>
>
> ------------------
>
> Sorry for not giving you any concrete solution. I think there is really a place for improvement in ADQL and TAP, but I personally do not know how to fix this particular issue.
>
> As editor of the ADQL standard (since few time), I do not think we should force one behavior more than another. After all, this flexibility also exists in SQL which is the backend language of all the currently existing TAP implementations I know. But we could maybe add a recommendation about escaping and qualifying identifiers. Let's say something like:
>
>     "ADQL implementations SHOULD not force identifiers to be escaped and/or qualified unless necessary.".
>
> @other ADQL authors: suggestions/comments?
>
>
> Finally, I personally think it is already a very good thing that the ADQL syntax is not significantly different from one TAP service to another. I am the developer of one ADQL+TAP implementation library and I have to deal with all the different DBMS syntaxes (e.g. PostgreSQL, MySQL, SQLServer). I can ensure you that it is even more confusing to switch from one DBMS to another. Besides their syntax difference for some statements (e.g. JOIN, SELECT TOP), there is also sometimes a difference on how to write identifiers (e.g. backquotes instead of double quotes, lower/upper case, ....). I do not say that ADQL is perfect but at least, implementations are mostly consistent with each other.
>
> Best regards,
>
> Grégory Mantelet
> --------------------------------------------------
> Software Developer at CDS
> gregory.mantelet at astro.unistra.fr
>
>
>
> On 05/12/2020 00:59, cpmjg at tutanota.com wrote:
>
>> I believe this is the working group for suggesting the standardisation of such things?
>>
>> The release of GAIA EDR3 has highlighted this with its separate TAP providers with not only more than one field naming scheme but at least three different pedantic syntaxes for the naming of said!
>>
>> TAP VizieR has different field header names, possibly the nature of VizieR itself causes this, but it is a shame they don't keep the same field heading names as the provider of the table.
>>
>> Then it is also served by ARI and AIPS TAP services, and it is all ADQL, EXCEPT
>>
>> one uses
>>
>> SELECT "gaiaedr3source".ra
>> FROM "gaiaedr3source"
>>
>> and all WHERE and similar statements need the same syntax as select,
>>
>> another uses
>>
>> SELECT "gaiaedr3source"."ra"
>> FROM "gaiaedr3source"
>>
>> again, WHERE etc need the same syntax as SELECT
>>
>> and yet another uses
>>
>> SELECT ra
>> FROM gaiaedr3source
>>
>> and things like ra < 9 or ORDER BY ra work.
>>
>> There really needs to be some standardisation on ADQL more than there is, this point seem to get missed.  Sometimes, if there are no examples provided (and if there are examples it is always the thoroughly uninformation "SELECT TOP 10" without specific fields noted in order to figure out the syntax)
>>
>> Now I'm not a programmer or a technician nor a professional astronomer but the fact that once I manage to get my brain around cobbling some ADQL together that may actually work I can spend ages trying to debug it only to realise after a while it's the field name syntax!  As most error messages unhelpfully say "expecting this that and the other command following )" or something, it is not always clear.
>>
>> In other words, there is a lack of standard here in the ADQL standard and it needs to be addressed.
>>
>> I of course prefer
>>
>> SELECT ra
>> FROM source
>> WHERE ra <= 2 AND ra IS NOT NULL
>>
>> (I'm not sure there's a standard on null either, and it is hard to look up the proper syntax on some sites)
>>
>> I will likely get told that things like JOINs make that poor code, well it doesn't, the few sites that use this shorter version (eg IRSA IPAC SQL search) where you may want to join you can simply go
>>
>> SELECT source1.ra, source2.ra, mag, z
>> FROM source1, source2
>>
>> as needed, but still only use unique to each table fieldnames without having to put the source in for those.
>>
>> And whilst I'm having a gripe some of the very most common fieldnames need to be standardised in these tables across all tables, not just different providers of the same tables.
>>
>> RA_ICRS and RA_J2000 for instance, instead of the amazing number of variations on a theme that something simple like these two can be between different datasets, it'd make porting scripts a heck of a lot easier.  In which case ra should probably be deprecated, or made to default to mean J2000.
>>
>> Of course there wouldn't be many of these, just the verymost common things like RA and Dec in degrees.
>>
>> Today, having given up on TAP VizieR as not doubt it was being choked on GAIA EDR3 queries, I decided to try both ARI and AIPS tap servers.  You can look things up and cut and paste the field names for each, and even double click on field names to autopopulate the input dialogue on one of them, but you shouldn't have to.  Typing is often quicker then clicking through long lists.
>>
>> But I copied and pasted the ADQL from VizieR to the other two.  Well, VizieR used different fieldnames, so in the ARI I renamed them, and that didn't need the quotes nor the proceeding table name for fields, just the fieldname, so I deleted those all one by one, and then I copied that and pasted it into AIPS as that had the same field names, but no, that needed them all to be preceded by the table name again, with a . between the two!  And then it had to be put in quotes!  Both table name and field name, just just table name as in VizieR.
>>
>> It really looks sloppy and unprofessional for a standard protocol and script for said protocol to have about five different ways to SELECT etc fields!
>>
>> "table"."field"
>>
>> "table".field
>>
>> table.field
>>
>> field (with the table implicit from the FROM entry but added to the field name when more than one table used and there are common fieldnames in those tables, which I must say I find easiest to follow and understand and most of all read when checking the script!).
>>
>> and one I can't remember, possibly "field".
>>
>> Yours
>>
>> John
>>
>> John Greaves
>>
>> Not an IAU member, not an IVOA member, not a scientist, the do make me pay tax though if that helps.
>>



More information about the dal mailing list