table_name syntax

Thu Apr 30 14:35:34 CEST 2015

Dear Colleagues,

Excuse me for joining the fray again.  But let me see if we can fix
what's already consensus.  I believe we all agree that services have to
declare when they are using delimited identifiers:

The consequence IMHO is that we need to amend REC-TAP 1.0 and
REC-VODataService 1.1.  In REC-TAP 1.0 (PDF page numbers), I believe
the last paragraph of p. 21 should read:

  All names in TAP_SCHEMA MUST follow the syntax rules of the query
  languages supported by the the langues used to query TAP_SCHEMA in
  that they can be used as-is in queries without further
  qualifications.  In particular, were delimited identifiers in ADQL
  or similar languages must be used, the names in TAP_SCHEMA MUST
  come with double quote characters.  Services MUST not provide
  delimited identifiers where regular identifiers are used in the
  database's DDL.

  The descriptions below reference the respective symbols for the
  ADQL 2.0 language; for other query languanges, analogous rules will
  have to be used.

Then, in 2.6.1 (with some redaction), additional text:

  When querying through ADQL, schema_name MUST contain an expansion of
  the ADQL <schema_name> nonterminal.

In 2.6.2, additional text:

  schema_name MUST reference a row in TAP_SCHEMA.schemas.
  When querying through ADQL, table_name MUST contain an expansion of
  the ADQL <table_name> nonterminal suitable for referencing the table
  in a top-level select clause.

In 2.6.3, additional text:

  table_name MUST reference a row in TAP_SCHEMA.tables.
  When querying through ADQL, column_name MUST contain an expansion
  of the ADQL <column_name> nonterminal suitable for referencing the
  column in a plain SELECT <column_name> FROM <table_name> query
  (i.e., including double quotes in the case of delimited
  identifiers).

We'll have to find similar language for VODataService (to have
consistency with the VOSI tables endpoint).  The good news is that
this won't require a schema change unless trolls are around, as
column/name and friends are already of type ts:token.  [The thing
with the trolls: If you were perverse enough to have identifiers like
'"    "' -- multiple spaces -- we'd be in trouble with XSD.  Can we
agree to look the other way?]

Agreed so far?  Where do I carry this stuff now that, I understand,
the TAP Implementation Notes shouldn't be used any more?

Then, on 

On Wed, Apr 29, 2015 at 11:48:20PM +0100, Mark Taylor wrote:
> On Wed, 29 Apr 2015, Dave Morris wrote:
> > You are right, it is not impossible to parse the concatenated string.
> > 
> > It will however, become that bit more tricky once we start to have nested
> > double quotes and full stops inside a variable number of double quote
> > delimited identifiers which are themselves separated by full stops.
> 
> To be concrete about this, I attach a java class that I believe does
> this parsing comprehensively for all legal table_name values,
> including as many double quotes and full stops as you like.
> The whole thing is a few tens of lines of code.

Also, I'd argue that software that's clever enough to care about such
things will presumably have some sort of ADQL grammar built in and
can just parse whatever is in the various columns.  That's a formal,
no-hacks process guaranteed to work (unless the service is broken,
but even then it's better to have breakage in one place than in two
places).

> > The server must have the original unquoted names in order to be
> > able to create the concatenated string, the request was to make
> > these available to the client as two additional columns in the
> > TAP_SCHEMA table.

I'd dispute you can reliably recreate a valid table reference unless
you knew for sure on which parts you need delimiting, at least if you
allow actual implementations to have additional reserved names.  Pat's
remark on the presence of catalog names only exarcebates that, as
does possible query rewriting hacks service operators might do.  If I
were to write an ADQL query bulder, I'd try *very* hard to avoid
having to rewrite table (or column) references, as the intricacies of
their interpretation within backend implementations can be fairly
tricky.

> > It also means that searching for "table name starts with" becomes much much
> > easier.
> 
> I will grant you that; is there much call for that kind of query?

With my funny hat labeled "Registry Guy" on, I'd say that's resource
discovery an hence a Registry problem.  And there I'd say people
should rather use description than table names for discovery as a
rule.  If that sucks in the current registry dataset/protocols, I'd
say that's where we need to fix it.

Other things that came up:

Mark (Tue, 28 Apr 2015 14:51:52 +0100) got me:

[Overconfident me:]
>> _ct is ok -- that works as a regular identifier.  But:
>
> I hesitate to trade BNF with you, but I see in sec 2.1.2 of ADQL 2.0:

...but you're totally right.  I just learned that SQL predates the
grand success of C and hence has a different idea of the lingua franca
identifier.  Thanks for hitting me with the cluestick, _ct is clearly
*not* a regular identifier.

Then, there was Mark responding to Pat (Tue, 28 Apr 2015 17:36:06
+0100):

>> PPS- tap-lint does a test where it quotes a column name it finds in
>> a table in the tap_schema. I noted 2-4 query failures per day in
>> our logs caused by this test. It broke because in postgresql it
>> subverts the way case-insensitive table and column names are
>> implemented (PG just converts everything to lower unless you quote
>> them!). My "fix" was simply to strip the quotes in that case
>> because I Know they can only break it. While the test is probably
>> technically correct, it is really just asking for trouble :-)
>
> hmm - my emerging understanding from this discussion is that
> that taplint test is probably technically *incorrrect*,
> and needs changing.  Can we have a ruling from Markus
> (as our resident delimited-identifier-gotcha expert) on that?

I've not actually looked at what the test really does, but the way
Pat describes it, TOPCAT probably shouldn't be doing it.  If you go
back to the artwork of delimited identifier prose I sent around
yesterday, you'll see that no normalization takes place in delimited
identifiers.  So, "FooBar" will indeed be distinct from "Foobar".

Regular identifiers, however, are normalised.  Postgres in spirit
does the right thing by just having them all lower-cased, so, even if
one puts FooBar (undelimited) into TAP_SCHEMA, you'd have to use
"foobar" to match it with a delimited identifier.

That, however, will fail with standard SQL1992, which instead
stipulates that "every letter that is a lower-case letter [has to be]
replaced by the equivalent upper-case letter or letters" on comparing
regular and delimited identifiers.  So, while for aestethic reasons I
fully support Postgres' decision to lower-case, if we followed
SQL1992 here, ADQL translators to Postgres would have a whole lot of
fun.  So, I propose leaving the entire question of casiness in
comparisons between regular and delimited identifier open (which
right now it kinda is; you could say by default to sql1992 it's not,
in which case we're all doing it wrong).

Hence I think STILTS would be prudent not to fuzz with quotes.
Services that mess up delimited identifiers in their TAP_SCHEMAS will
then get validation errors, which serves them right for using them in
the first place.  With the above... shall we say, clarifications,
that's actually correct (if necessary, I'll ask my lawyer friends to
come up with a reason why they're just clarifications).

All this is another reason for why I'm still arguing for discouraging
delimited identifiers in either TAP+1 or ADQL+1.  To keep *that*
discussion separate from the one here, I'll have a separate mail on
this.

Cheers (and sorry for the length of this mail),

       Markus