<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Hi Dave,<br>
<br>
As you, following the discussions at the ASTERICS meeting in
Strasbourg, I have already worked quite intensively of this topic
for a week. I was hoping to have some time this week to report all
my explorations and ideas in a more clean way, but I will try to
sumarize my thoughts now anyway. After a brief reading of your
observations on <a
href="https://github.com/ivoa/lyonetia/wiki/BNF-grammar-and-validation">lyonetia</a>,
I agree with most of your ideas: my conclusion was also to start
from scratch a new ADQL grammar based on what all current ADQL
parsers in the VO are already doing and what is defined in the
ADQL document. Below I give some of my conclusions and also a
suggestion of grammar notation to use (for those who do not like
suspens, I jump quickly to this suggestion: PEG).<br>
<br>
<i>FYI, since this email is quite long, here is a preview of its
content:</i><i><br>
<br>
</i><i> A. Few comments about the current ADQL grammar</i><i><br>
</i><i> B. "Toy grammar</i><i>"<br>
</i><i> C. Brief comparison of these notations</i><i><br>
</i><i> D. Testing PEG with the "toy grammar"</i><i><br>
</i><br>
<br>
----------------------------------------------------------------------------<br>
A. Few comments about the current ADQL grammar <br>
----------------------------------------------------------------------------<br>
<br>
About the current ADQL grammar (the attached file "adql-v0.bnf"),
I have listed the following errors/comments:<br>
<br>
* Do not define a rule with just ONE terminal element (e.g.
`<ampersand>`,<br>
`<asterisk>`) ; a rule defines a syntax not a
token/terminal<br>
<br>
* Lack of required and optional space characters (or comment)
between<br>
tokens and literals. Without this specification,<br>
"SELECTmycolumnFROMmytable" would be entirely possible though<br>
we perfectly know it is not. On the other hand, writing "2 *
mycolumn"<br>
or "2*mycolumn" is completely correct.<br>
<br>
* Missing a syntax to represent literal values (e.g. "SELECT",
letters and<br>
digits, operators) ; because of that, there is a possible
ambiguity between<br>
BNF syntax symbols and grammar literals<br>
<br>
* l.165 - `<boolean_primary>` : two lines inverted ; a rule
can not start with<br>
"|" ; line 166 and 167 must be inverted<br>
<br>
* l.257 - `<default_function_prefix>` : empty definition ;
the comment says<br>
the default prefix should be "udf_" but this rule set nothing ;
should it be<br>
an empty string (i.e. "") or "udf_"?<br>
<br>
* l.420 - `<newline>` : empty definition ; the literal
should probably be "\n"<br>
<br>
* l.436 - `<nondoublequote_character>` - empty definition :
should it be a<br>
regular expression such as `[^"]`?<br>
<br>
* l.439 - `<nonquote_character>` : empty definition ; should
it be a regular<br>
expression such as `[^']`?<br>
<br>
* l.642 - `<unqualified_schema name>` : a space character in
the rule name ;<br>
should it be allowed?<br>
<br>
* l.592 - `<space>` : empty definition ; should it be " " or
should it also<br>
include "\t", "\n" and "\r"? (here comes the literal ambiguity
mentioned<br>
above)<br>
<br>
* Ambiguity with the `AS` keyword ; because it is optional,
depending of<br>
the parser the following element may always be interpreted as an<br>
identifier for the alias, even though it is `FROM` or `(*)`
(after a<br>
`COUNT` for example)<br>
<br>
<br>
---------------------------<br>
B. "Toy grammar" <br>
---------------------------<br>
<br>
During my explorations on the different language notations we
could use, I have defined a very reduced subset of ADQL/SQL from
scratch.<br>
- just SELECT, FROM (without JOINs), WHERE (limited to
simple comparisons only) and ORDER BY are defined<br>
- mathematical expressions are just limited to numbers
(integer and double) with no parenthesis and no functions<br>
- no geometrical operations<br>
- an ADQL query MUST end with a semi-colon <i>(see below
why)</i><br>
<br>
With this subset, it is not intended to match what we want
exactly in the ADQL grammar (few syntax elements could be written
in a nicer way) ; it is just a grammar for testing the same
ADQL/SQL syntax with different notations in order to see the
advantages and drawbacks of each of them.<br>
<br>
I have attached this same grammar in the following notations:<br>
* BNF (as originally defined): "adql_min.bnf"<br>
* EBNF: "adql_min.ebnf"<br>
* ABNF: "adql_min.abnf"<br>
* "IVOA-BNF" (the BNF derived from SQL-92 and used in the
ADQL 1.0 document): "adql_min.ivoa_bnf"<br>
* PEG: "adql_min.peg".<br>
<br>
I wrote them so that they can be compared side by side, with
for instance a visual diff. tool or by putting them manually side
by side on your screen. That way, you can see their difference.<br>
<br>
<br>
--------------------------------------------------------<br>
C. Brief comparison of these notations <br>
--------------------------------------------------------<br>
<br>
My preference goes for PEG and alternatively ABNF. Here is
why.<br>
<br>
<i>(warning: maybe boring technical comments incoming... :-) )</i><br>
<br>
- According to me, the <a
href="https://web.archive.org/web/20060925132043/https://www.lrz-muenchen.de/%7Ebernhard/Algol-BNF.html">original
BNF notation</a> suffers too much of ambiguity: no delimitation
for literals (e.g. how to know that | is part of the BNF syntax or
the '|' character, how to specify space characters) and the lack
of grouping and repetition syntaxes (which can be done differently
; see "adql_min.bnf" for examples). Both points are fixed in EBNF
and ABNF and makes the grammar reading more confortable and
compact.<br>
<br>
- The original SQL92 BNF syntax is actually a mix of EBNF,
ABNF and the addition of the syntax "..." (which seems to be used
for repetition...but which one: 0..n or 1..n? there is no
clarification about that). It seems this syntax is defined nowhere
(and trust me, I have searched a lot for this). I think it is
definitely not a good idea to keep this very custom notation.<br>
<br>
<i>Note: By the way, here is where the original SQL-92 seems to
come from: </i><i><a href="https://github.com/ronsavage/SQL">https://github.com/ronsavage/SQL</a></i><i>
; there is also a lex-yacc + perl script to parse and validate </i><i>this
special BNF (I did not succeed to make it work on our ADQL BNF,
but I admit that I have not insisted too much, so feel free to
play with it if you want and if you are more familiar with perl
and yacc than me...which should not be difficult).</i><br>
<br>
- On the contrary to the original BNF and this "SQL-92 BNF" we
are using, there is a standard/normative specification for <a
href="https://www.cl.cam.ac.uk/%7Emgk25/iso-14977.pdf">EBNF</a>
and <a href="http://tools.ietf.org/html/rfc5234">ABNF</a>. I have
a slight preference for ABNF because:<br>
1. the token separator is a space (instead of a comma in
EBNF). <br>
2. it is possible to use character ranges like '%x41-5A'
(i.e. 'a-z' in hexadecimal).<br>
Then, I have to admit that prefixing a group with * in ABNF
instead of suffixing it with * in EBNF for a repetition is a bit
disturbing, but at least, with ABNF, you can specify a range limit
(e.g. '2*5(...)' to repeat the group at least 2 and at most 5
times) ; then, it depends of whether or not we need that.<br>
<br>
- Finally, as I said above, my strong preference goes to PEG:
Parsing Expression Grammar (reference paper: <a
href="http://bford.info/pub/lang/peg">http://bford.info/pub/lang/peg</a>).
It is a quite new notation which is not so far from ABNF. The
evaluation of such expression is performed differently from a BNF
or any other similar notation (lex+yacc, javacc, ...). It has been
designed to be a non-ambiguous notation. For more details, I
invite you to look at the very short abstract and then the paper
if you want to investigate more into details.<br>
<br>
To come back quickly on the required and optional spaces issue, I
also completely agree with you, Dave: it was actually the first
issue I had when I tried to validate our ADQL BNF. In PEG, I
solved this issue by declaring two rules:<br>
- '_' for optional spaces, like between a numeric
expression and a mathematical operator (e.g. '2*3')<br>
- '__' for required spaces, like between 'SELECT' and its
items<br>
I think these rule names (pretty common in PEG) does not spoil the
reading of the grammar while imposing or not the presence of a
space characters (or comment).<br>
<br>
On the contrary to the various BNF described above, PEG allows the
usage of regular expression (but I would recommend to use them
with moderation...they should not replace grammar rules).<br>
<br>
As with common parser-parser grammars, it is possible to use
LOOKAHEAD-like expressions (i.e. "try this set of rules and only
if it fails/succeeds goes on"). This is not possible with BNFs.
This makes look like a hack, but unfortunately, sometimes, when
the grammar becomes complicated and full of possibilities, there
is no other choice....even in my ADQL/SQL subset (for instance,
see the rule "regular_identifier", l.54 in "adql_min.peg").<br>
<br>
Anyway, the main reason why I prefer PEG to the BNF notations,
is because it is much more easy to find a parser in a lot of
different languages (C/C++/C#, Java, Python, Javascript, ... ;
example of parsers: <a href="http://bford.info/packrat/">packrat</a>,
<a href="http://www.romanredz.se/Mouse/index.htm">mouse</a>, <a
href="https://waxeye.org/">waxeye</a>, ...).<br>
<br>
<br>
---------------------------------------------------------<br>
D. Testing PEG with the "toy grammar"<br>
---------------------------------------------------------<br>
<br>
The first parser I discovered was PEGjs (Javascript): <a
href="https://pegjs.org/online">https://pegjs.org/online</a>.
Nice thing about it: you can edit the grammar online and directly
evaluate your expression. I invite you to try it with the attached
grammar "adql_min.pegjs" and the attached test queries
"test-queries". Copy-paste the content of the first one in the
grammar part (left) and the second one in the input part (top
right).<br>
<br>
<i>Note: for the purpose of this exploration, I require to end
ADQL queries with a semi-colon so that I can chain and test
several queries in a row.<br>
<br>
</i>The grammar "adql_min.peg" and "adql_min.pegjs" declare
exactly the same syntactic rules. The only differences between
them are just for PEGjs to know how to build the parser. Here are
these differences:<br>
<br>
1. replacement of the affectation symbol "<-" by "="...this
is required by PEGjs<br>
<i>-> that can be easily done by a small script (e.g.
the unix command 'sed')</i><br>
<br>
2. addition of group labels and javascript snippets. These are
just to build the final parser so that it does what we want. In
this case, I wanted to list SELECT, FROM, WHERE and ORDER BY
content <i>(which you can see in the bottom right part of the
PEGjs online validator)</i>. Even without these labels and
snippets, PEGjs is still able to check your input queries but the
output would be just quite useless.<br>
<i>-> </i><i>this step of the validation could be
done in another way automatically but a bit more time and
development would be required...not sure I will have time before
the Interop. in Shanghai.</i><br>
<br>
Nevertheless, a small drawback to know about PEG is that left
recursion is not allowed. Small re-writing of the ADQL grammar
(e.g. numerical and boolean operations) would be needed. But as
far as I know, few parsers are allowing left recursion so most of
the existing ADQL parsers may have already rewritten such special
syntaxes, but please, tell me if I am wrong here.<br>
<i>(for those not familiar with language parsing, just compare
side by side "adql_min.bnf" and "adql_min.peg", l.62)<br>
</i><br>
<br>
------------------------------------------------------<br>
<br>
Finally, I would say that with PEG and especially with PEGjs, we
could probably design quickly a parser for our ADQL grammar and
from that we could also create an ADQL parser. Ideally, if that
concept is confirmed and nice error messages are possible (it is
still the tricky part of such parsers), I even think to eventually
use a PEG parser for my ADQL Library so that sparing me the time
to maintain my JavaCC grammar (written from scratch from the ADQL
BNF) while having the guarantee that the parser is following the
official ADQL grammar. But of course, with PEGjs, it would more
particularly possible to now have easily a Javascript ADQL parser
in VO web interfaces for TAP.<br>
<br>
Sorry for this very loooong email (congratulations if you reached
the end ;-)) and I hope all these inputs will be however helpful.<br>
<br>
Cheers,<br>
Grégory<br>
<br>
<br>
On 04/18/2017 04:51 PM, Dave Morris wrote:<br>
</div>
<blockquote
cite="mid:cd95222720b6246823f65395c61a5fa8@metagrid.co.uk"
type="cite">Hi DAL,
<br>
<br>
Progress on developing a method for validating the BNF grammar for
ADQL has been stalled for a while.
<br>
<br>
Following on from discussions at the ASTERICS meeting in
Strasbourg, I have been reading up on language grammars and BNF.
Based on what I learned I've put together a proposal for how to go
about creating a machine readable language grammar for ADQL.
<br>
<br>
Short form is :- rather than trying to patch fix the existing BNF
grammar we create a new grammar from scratch, adding each of the
language rules and validating them one at a time.
<br>
<br>
More detail is available on the project wiki.
<br>
<a class="moz-txt-link-freetext" href="https://github.com/ivoa/lyonetia/wiki/BNF-grammar-and-validation">https://github.com/ivoa/lyonetia/wiki/BNF-grammar-and-validation</a>
<br>
<br>
We have found a tool that can validate a BNF (or EBNF) grammar
against a set of examples:
<br>
<a class="moz-txt-link-freetext" href="http://bnfparser2.sourceforge.net/">http://bnfparser2.sourceforge.net/</a>
<br>
<br>
And we have set up a prototype system in a Docker container for
developing and validating a grammar for ADQL:
<br>
<a class="moz-txt-link-freetext" href="https://github.com/ivoa/lyonetia/tree/master/src/docker/bnfparser">https://github.com/ivoa/lyonetia/tree/master/src/docker/bnfparser</a>
<br>
<br>
--------
<br>
<br>
I'd be interested to hear what you think.
<br>
<br>
Is this a realistic approach to creating a machine readable
grammar for ADQL ?
<br>
<br>
Do you have an alternative method you would like to suggest ?
<br>
<br>
At this stage we would welcome any and all suggestions on how best
to create a machine readable grammar and how to implement the
validation process.
<br>
<br>
Thanks,
<br>
Dave
<br>
<br>
--------
<br>
Dave Morris
<br>
Research Software Engineer
<br>
Wide Field Astronomy Unit
<br>
Institute for Astronomy
<br>
University of Edinburgh
<br>
--------
<br>
</blockquote>
<br>
</body>
</html>