ADQL grammar validation
Walter Landry
wlandry at caltech.edu
Wed Apr 19 00:14:55 CEST 2017
Grégory Mantelet <gmantele at ari.uni-heidelberg.de> wrote:
> About the current ADQL grammar (the attached file "adql-v0.bnf"), I have listed the following errors/comments:
>
> * Do not define a rule with just ONE terminal element (e.g. `<ampersand>`,
> `<asterisk>`) ; a rule defines a syntax not a token/terminal
When I implemented my parser, I found it kind of useful to have
names for all of these elements. But it is a minor thing.
> * Lack of required and optional space characters (or comment) between
> tokens and literals. Without this specification,
> "SELECTmycolumnFROMmytable" would be entirely possible though
> we perfectly know it is not. On the other hand, writing "2 * mycolumn"
> or "2*mycolumn" is completely correct.
Agreed. I had to add a lot of manual space elements to prevent these
kind of problems [1]. I ended up guessing a lot.
> * Missing a syntax to represent literal values (e.g. "SELECT", letters and
> digits, operators) ; because of that, there is a possible ambiguity between
> BNF syntax symbols and grammar literals
I ended up specifying that an identifier is whatever the spec says as
long as it is not in the list of reserved words.
> * l.257 - `<default_function_prefix>` : empty definition ; the comment says
> the default prefix should be "udf_" but this rule set nothing ; should it be
> an empty string (i.e. "") or "udf_"?
Yeah, it is a useless rule. It is more of a suggestion that you
should prefix UDF's with 'udf_'. It definitely should not be in the
grammar.
> * l.592 - `<space>` : empty definition ; should it be " " or should it also
> include "\t", "\n" and "\r"? (here comes the literal ambiguity mentioned
> above)
FYI, I set it to equal any whitespace.
> * Ambiguity with the `AS` keyword ; because it is optional, depending of
> the parser the following element may always be interpreted as an
> identifier for the alias, even though it is `FROM` or `(*)` (after a
> `COUNT` for example)
This is an ickyness that is carried over from SQL 92. I am hesitant
to get rid of it just because it is ugly. I feel that will trip up
users. I think that requiring identifiers to not be a reserved word
cleans up this ambiguity.
> My preference goes for PEG and alternatively ABNF. Here is why.
My implementation uses a PEG parser, so you will not hear me object ;)
> Nevertheless, a small drawback to know about PEG is that left
> recursion is not allowed. Small re-writing of the ADQL grammar
> (e.g. numerical and boolean operations) would be needed. But as far
> as I know, few parsers are allowing left recursion so most of the
> existing ADQL parsers may have already rewritten such special
> syntaxes, but please, tell me if I am wrong here. (for those not
> familiar with language parsing, just compare side by side
> "adql_min.bnf" and "adql_min.peg", l.62)
I had to do the same thing.
All in all, this is a big project. I worry that if we start anew, we
will end up with a grammar that is gratuitously different from SQL 92.
Cheers,
Walter Landry
[1] https://github.com/Caltech-IPAC/libadql
The grammar is mostly defined in the ADQL_parser directory
https://github.com/Caltech-IPAC/libadql/tree/master/src/Query/ADQL_parser/ADQL_parser
More information about the dal
mailing list