ADQL grammar validation

Wed Apr 19 00:14:55 CEST 2017

Grégory Mantelet <gmantele at ari.uni-heidelberg.de> wrote:
> About the current ADQL grammar (the attached file "adql-v0.bnf"), I have listed the following errors/comments:
> 
> * Do not define a rule with just ONE terminal element (e.g. `<ampersand>`,
>    `<asterisk>`) ; a rule defines a syntax not a token/terminal

When I implemented my parser, I found it kind of useful to have
names for all of these elements.  But it is a minor thing.

> * Lack of required and optional space characters (or comment) between
>    tokens and literals. Without this specification,
>    "SELECTmycolumnFROMmytable" would be entirely possible though
>    we perfectly know it is not. On the other hand, writing "2  *  mycolumn"
>    or "2*mycolumn" is completely correct.

Agreed.  I had to add a lot of manual space elements to prevent these
kind of problems [1].  I ended up guessing a lot.

> * Missing a syntax to represent literal values (e.g. "SELECT", letters and
>   digits, operators) ; because of that, there is a possible ambiguity between
>   BNF syntax symbols and grammar literals

I ended up specifying that an identifier is whatever the spec says as
long as it is not in the list of reserved words.

> * l.257 - `<default_function_prefix>` : empty definition ; the comment says
>   the default prefix should be "udf_" but this rule set nothing ; should it be
>   an empty string (i.e. "") or "udf_"?

Yeah, it is a useless rule.  It is more of a suggestion that you
should prefix UDF's with 'udf_'.  It definitely should not be in the
grammar.

> * l.592 - `<space>` : empty definition ; should it be " " or should it also
>   include "\t", "\n" and "\r"? (here comes the literal ambiguity mentioned
>   above)

FYI, I set it to equal any whitespace.

> * Ambiguity with the `AS` keyword ; because it is optional, depending of
>   the parser the following element may always be interpreted as an
>   identifier for the alias, even though it is `FROM` or `(*)` (after a
>   `COUNT` for example)

This is an ickyness that is carried over from SQL 92.  I am hesitant
to get rid of it just because it is ugly.  I feel that will trip up
users.  I think that requiring identifiers to not be a reserved word
cleans up this ambiguity.

>     My preference goes for PEG and alternatively ABNF. Here is why.

My implementation uses a PEG parser, so you will not hear me object ;)

> Nevertheless, a small drawback to know about PEG is that left
> recursion is not allowed. Small re-writing of the ADQL grammar
> (e.g. numerical and boolean operations) would be needed. But as far
> as I know, few parsers are allowing left recursion so most of the
> existing ADQL parsers may have already rewritten such special
> syntaxes, but please, tell me if I am wrong here.  (for those not
> familiar with language parsing, just compare side by side
> "adql_min.bnf" and "adql_min.peg", l.62)

I had to do the same thing.

All in all, this is a big project.  I worry that if we start anew, we
will end up with a grammar that is gratuitously different from SQL 92.

Cheers,
Walter Landry

[1] https://github.com/Caltech-IPAC/libadql

    The grammar is mostly defined in the ADQL_parser directory

    https://github.com/Caltech-IPAC/libadql/tree/master/src/Query/ADQL_parser/ADQL_parser