RegTAP bug with delimited column identifiers
Markus Demleitner
msdemlei at ari.uni-heidelberg.de
Thu Aug 5 14:15:40 CEST 2021
Dear Registry,
ADQL delimited identifiers are a pain.
My old mantra just became underscored right now while I was adding a
get_tables method to pyVO's registry interface. That pulls table
metadata from rr.res_table and rr.table_column. And breaks hard on
VizieR.
You see, RegTAP in sect. 8.6 says:
The following columns MUST be lowercased during ingestion: ivoid,
table_name, table_type, table_utype.
In the presence of SQL delimited identifiers, blindly lowercasing
table_name is wrong and actively breaks things. For instance, VizieR
has a table J/A+A/437/789/table2; once that's lowercased, it requires
a lot of VizieR knowledge to recover that form.
But to write a query you need that exact form because to get away
with that alphabet soup as a table name, you have to use delimited
identifiers in ADQL queries, and delimited identifiers prevent the
(about as unfortunate) case folding SQL has almost everywhere else.
Hence, rr.res_table is close to useless with the largest TAP service
on the planet. The only reason nobody noticed is that most everyone
is turning to GloTS for this kind of thing. But as you know I've
been trying to wean people off the non-standard GloTS for ages, at
one of these days it'll have to happen.
I can't quite say any more why I would have required case
normalisation in table_name; I probably figured "it's machine
readable, and SQLs folds case, so let's be consistent with almost
everything else machine-readable in RegTAP".
I suppose if I wrote this again I'd say "lowercase regular
identifiers, leave SQL delimited identifiers alone". However, that
still wouldn't fix things right now because VizieR doesn't include
quotes in their VODataService table names just yet.
But I'd like to have a fix about now in order to make my get_tables()
method useful when the pyvo PR goes in. And the baroque
normalisation rules of SQL are a bit too much for an Erratum anyway
IMHO.
Also, we are now a lot more explicit that the content of table/name
must be exactly as it should be used in TAP, and so there's really no
reason for any more normalisation. Hence... How would people feel if
I wrote a RegTAP erratum to the effect:
In section 8.6, from the sentence: “The following columns MUST be
lowercased during ingestion: ivoid, table_name, table_type,
table_utype.” remove “table_name”. That is, with this erratum,
table_name is ingested literally.
I think that won't break anything operationally; if anyone had relied
on these table names, they'd have noticed it's broken with many
VizieR tables, I suppose. And less case folding is always a gain.
Of course, I'll provide lots of rationale and impact assessment, but
before going into the trouble, I'd be grateful for encouraging and/or
dissenting voices.
Thanks,
Markus
PS: I'm taking the liberty of trying this out on dc.g-vo.org and then
reg.g-vo.org starting in an hour or so; that's how confident I am
that this is a bona fide improvement with very little risk of
breaking things.
More information about the registry
mailing list