<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">Hi Dave,<br>

      <br>

          As you, following the discussions at the ASTERICS meeting in

      Strasbourg, I have already worked quite intensively of this topic

      for a week. I was hoping to have some time this week to report all

      my explorations and ideas in a more clean way, but I will try to

      sumarize my thoughts now anyway. After a brief reading of your

      observations on <a

        href="https://github.com/ivoa/lyonetia/wiki/BNF-grammar-and-validation">lyonetia</a>,

      I agree with most of your ideas: my conclusion was also to start

      from scratch a new ADQL grammar based on what all current ADQL

      parsers in the VO are already doing and what is defined in the

      ADQL document. Below I give some of my conclusions and also a

      suggestion of grammar notation to use (for those who do not like

      suspens, I jump quickly to this suggestion: PEG).<br>

      <br>

      <i>FYI, since this email is quite long, here is a preview of its

        content:</i><i><br>

        <br>

      </i><i>    A. Few comments about the current ADQL grammar</i><i><br>

      </i><i>    B. "Toy grammar</i><i>"<br>

      </i><i>    C. Brief comparison of these notations</i><i><br>

      </i><i>    D. Testing PEG with the "toy grammar"</i><i><br>

      </i><br>

      <br>

----------------------------------------------------------------------------<br>

       A. Few comments about the current ADQL grammar <br>

----------------------------------------------------------------------------<br>

      <br>

      About the current ADQL grammar (the attached file "adql-v0.bnf"),

      I have listed the following errors/comments:<br>

      <br>

      * Do not define a rule with just ONE terminal element (e.g.

      `&lt;ampersand&gt;`,<br>

         `&lt;asterisk&gt;`) ; a rule defines a syntax not a

      token/terminal<br>

      <br>

      * Lack of required and optional space characters (or comment)

      between<br>

         tokens and literals. Without this specification,<br>

         "SELECTmycolumnFROMmytable" would be entirely possible though<br>

         we perfectly know it is not. On the other hand, writing "2  * 

      mycolumn"<br>

         or "2*mycolumn" is completely correct.<br>

      <br>

      * Missing a syntax to represent literal values (e.g. "SELECT",

      letters and<br>

        digits, operators) ; because of that, there is a possible

      ambiguity between<br>

        BNF syntax symbols and grammar literals<br>

      <br>

      * l.165 - `&lt;boolean_primary&gt;` : two lines inverted ; a rule

      can not start with<br>

        "|" ; line 166 and 167 must be inverted<br>

      <br>

      * l.257 - `&lt;default_function_prefix&gt;` : empty definition ;

      the comment says<br>

        the default prefix should be "udf_" but this rule set nothing ;

      should it be<br>

        an empty string (i.e. "") or "udf_"?<br>

      <br>

      * l.420 - `&lt;newline&gt;` : empty definition ; the literal

      should probably be "\n"<br>

      <br>

      * l.436 - `&lt;nondoublequote_character&gt;` - empty definition :

      should it be a<br>

        regular expression such as `[^"]`?<br>

      <br>

      * l.439 - `&lt;nonquote_character&gt;` : empty definition ; should

      it be a regular<br>

        expression such as `[^']`?<br>

      <br>

      * l.642 - `&lt;unqualified_schema name&gt;` : a space character in

      the rule name ;<br>

        should it be allowed?<br>

      <br>

      * l.592 - `&lt;space&gt;` : empty definition ; should it be " " or

      should it also<br>

        include "\t", "\n" and "\r"? (here comes the literal ambiguity

      mentioned<br>

        above)<br>

      <br>

      * Ambiguity with the `AS` keyword ; because it is optional,

      depending of<br>

        the parser the following element may always be interpreted as an<br>

        identifier for the alias, even though it is `FROM` or `(*)`

      (after a<br>

        `COUNT` for example)<br>

      <br>

      <br>

      ---------------------------<br>

       B. "Toy grammar" <br>

      ---------------------------<br>

      <br>

          During my explorations on the different language notations we

      could use, I have defined a very reduced subset of ADQL/SQL from

      scratch.<br>

              - just SELECT, FROM (without JOINs), WHERE (limited to

      simple comparisons only) and ORDER BY are defined<br>

              - mathematical expressions are just limited to numbers

      (integer and double) with no parenthesis and no functions<br>

              - no geometrical operations<br>

              - an ADQL query MUST end with a semi-colon <i>(see below

        why)</i><br>

      <br>

           With this subset, it is not intended to match what we want

      exactly in the ADQL grammar (few syntax elements could be written

      in a nicer way) ; it is just a grammar for testing the same

      ADQL/SQL syntax with different notations in order to see the

      advantages and drawbacks of each of them.<br>

      <br>

          I have attached this same grammar in the following notations:<br>

              * BNF (as originally defined): "adql_min.bnf"<br>

              * EBNF: "adql_min.ebnf"<br>

              * ABNF: "adql_min.abnf"<br>

              * "IVOA-BNF" (the BNF derived from SQL-92 and used in the

      ADQL 1.0 document): "adql_min.ivoa_bnf"<br>

              * PEG: "adql_min.peg".<br>

      <br>

          I wrote them so that they can be compared side by side, with

      for instance a visual diff. tool or by putting them manually side

      by side on your screen. That way, you can see their difference.<br>

      <br>

      <br>

      --------------------------------------------------------<br>

       C. Brief comparison of these notations <br>

      --------------------------------------------------------<br>

      <br>

          My preference goes for PEG and alternatively ABNF. Here is

      why.<br>

      <br>

      <i>(warning: maybe boring technical comments incoming... :-) )</i><br>

      <br>

          - According to me, the <a

href="https://web.archive.org/web/20060925132043/https://www.lrz-muenchen.de/%7Ebernhard/Algol-BNF.html">original

        BNF notation</a> suffers too much of ambiguity: no delimitation

      for literals (e.g. how to know that | is part of the BNF syntax or

      the '|' character, how to specify space characters) and the lack

      of grouping and repetition syntaxes (which can be done differently

      ; see "adql_min.bnf" for examples). Both points are fixed in EBNF

      and ABNF and makes the grammar reading more confortable and

      compact.<br>

      <br>

          - The original SQL92 BNF syntax is actually a mix of EBNF,

      ABNF and the addition of the syntax "..." (which seems to be used

      for repetition...but which one: 0..n or 1..n? there is no

      clarification about that). It seems this syntax is defined nowhere

      (and trust me, I have searched a lot for this). I think it is

      definitely not a good idea to keep this very custom notation.<br>

      <br>

      <i>Note: By the way, here is where the original SQL-92 seems to

        come from: </i><i><a href="https://github.com/ronsavage/SQL">https://github.com/ronsavage/SQL</a></i><i>

        ; there is also a lex-yacc + perl script to parse and validate </i><i>this

        special BNF (I did not succeed to make it work on our ADQL BNF,

        but I admit that I have not insisted too much, so feel free to

        play with it if you want and if you are more familiar with perl

        and yacc than me...which should not be difficult).</i><br>

      <br>

          - On the contrary to the original BNF and this "SQL-92 BNF" we

      are using, there is a standard/normative specification for <a

        href="https://www.cl.cam.ac.uk/%7Emgk25/iso-14977.pdf">EBNF</a>

      and <a href="http://tools.ietf.org/html/rfc5234">ABNF</a>. I have

      a slight preference for ABNF because:<br>

              1. the token separator is a space (instead of a comma in

      EBNF). <br>

              2. it is possible to use character ranges like '%x41-5A'

      (i.e. 'a-z' in hexadecimal).<br>

      Then, I have to admit that prefixing a group with * in ABNF

      instead of suffixing it with * in EBNF for a repetition is a bit

      disturbing, but at least, with ABNF, you can specify a range limit

      (e.g. '2*5(...)' to repeat the group at least 2 and at most 5

      times) ; then, it depends of whether or not we need that.<br>

      <br>

          - Finally, as I said above, my strong preference goes to PEG:

      Parsing Expression Grammar (reference paper: <a

        href="http://bford.info/pub/lang/peg">http://bford.info/pub/lang/peg</a>).

      It is a quite new notation which is not so far from ABNF. The

      evaluation of such expression is performed differently from a BNF

      or any other similar notation (lex+yacc, javacc, ...). It has been

      designed to be a non-ambiguous notation. For more details, I

      invite you to look at the very short abstract and then the paper

      if you want to investigate more into details.<br>

      <br>

      To come back quickly on the required and optional spaces issue, I

      also completely agree with you, Dave: it was actually the first

      issue I had when I tried to validate our ADQL BNF. In PEG, I

      solved this issue by declaring two rules:<br>

              - '_' for optional spaces, like between a numeric

      expression and a mathematical operator (e.g. '2*3')<br>

              - '__' for required spaces, like between 'SELECT' and its

      items<br>

      I think these rule names (pretty common in PEG) does not spoil the

      reading of the grammar while imposing or not the presence of a

      space characters (or comment).<br>

      <br>

      On the contrary to the various BNF described above, PEG allows the

      usage of regular expression (but I would recommend to use them

      with moderation...they should not replace grammar rules).<br>

      <br>

      As with common parser-parser grammars, it is possible to use

      LOOKAHEAD-like expressions (i.e. "try this set of rules and only

      if it fails/succeeds goes on"). This is not possible with BNFs.

      This makes look like a hack, but unfortunately, sometimes, when

      the grammar becomes complicated and full of possibilities, there

      is no other choice....even in my ADQL/SQL subset (for instance,

      see the rule "regular_identifier", l.54 in "adql_min.peg").<br>

      <br>

          Anyway, the main reason why I prefer PEG to the BNF notations,

      is because it is much more easy to find a parser in a lot of

      different languages (C/C++/C#, Java, Python, Javascript, ... ;

      example of parsers: <a href="http://bford.info/packrat/">packrat</a>,

      <a href="http://www.romanredz.se/Mouse/index.htm">mouse</a>, <a

        href="https://waxeye.org/">waxeye</a>, ...).<br>

      <br>

      <br>

      ---------------------------------------------------------<br>

       D. Testing PEG with the "toy grammar"<br>

      ---------------------------------------------------------<br>

      <br>

          The first parser I discovered was PEGjs (Javascript): <a

        href="https://pegjs.org/online">https://pegjs.org/online</a>.

      Nice thing about it: you can edit the grammar online and directly

      evaluate your expression. I invite you to try it with the attached

      grammar "adql_min.pegjs" and the attached test queries

      "test-queries". Copy-paste the content of the first one in the

      grammar part (left) and the second one in the input part (top

      right).<br>

      <br>

      <i>Note: for the purpose of this exploration, I require to end

        ADQL queries with a semi-colon so that I can chain and test

        several queries in a row.<br>

        <br>

      </i>The grammar "adql_min.peg" and "adql_min.pegjs" declare

      exactly the same syntactic rules. The only differences between

      them are just for PEGjs to know how to build the parser. Here are

      these differences:<br>

      <br>

          1. replacement of the affectation symbol "&lt;-" by "="...this

      is required by PEGjs<br>

              <i>-&gt; that can be easily done by a small script (e.g.

        the unix command 'sed')</i><br>

      <br>

          2. addition of group labels and javascript snippets. These are

      just to build the final parser so that it does what we want. In

      this case, I wanted to list SELECT, FROM, WHERE and ORDER BY

      content <i>(which you can see in the bottom right part of the

        PEGjs online validator)</i>. Even without these labels and

      snippets, PEGjs is still able to check your input queries but the

      output would be just quite useless.<br>

              <i>-&gt; </i><i>this step of the validation could be

        done in another way automatically but a bit more time and

        development would be required...not sure I will have time before

        the Interop. in Shanghai.</i><br>

      <br>

      Nevertheless, a small drawback to know about PEG is that left

      recursion is not allowed. Small re-writing of the ADQL grammar

      (e.g. numerical and boolean operations) would be needed. But as

      far as I know, few parsers are allowing left recursion so most of

      the existing ADQL parsers may have already rewritten such special

      syntaxes, but please, tell me if I am wrong here.<br>

      <i>(for those not familiar with language parsing, just compare

        side by side "adql_min.bnf" and "adql_min.peg", l.62)<br>

      </i><br>

      <br>

      ------------------------------------------------------<br>

      <br>

      Finally, I would say that with PEG and especially with PEGjs, we

      could probably design quickly a parser for our ADQL grammar and

      from that we could also create an ADQL parser. Ideally, if that

      concept is confirmed and nice error messages are possible (it is

      still the tricky part of such parsers), I even think to eventually

      use a PEG parser for my ADQL Library so that sparing me the time

      to maintain my JavaCC grammar (written from scratch from the ADQL

      BNF) while having the guarantee that the parser is following the

      official ADQL grammar. But of course, with PEGjs, it would more

      particularly possible to now have easily a Javascript ADQL parser

      in VO web interfaces for TAP.<br>

      <br>

      Sorry for this very loooong email (congratulations if you reached

      the end ;-)) and I hope all these inputs will be however helpful.<br>

      <br>

      Cheers,<br>

      Grégory<br>

      <br>

      <br>

      On 04/18/2017 04:51 PM, Dave Morris wrote:<br>

    </div>

    <blockquote

      cite="mid:cd95222720b6246823f65395c61a5fa8@metagrid.co.uk"

      type="cite">Hi DAL,

      <br>

      <br>

      Progress on developing a method for validating the BNF grammar for

      ADQL has been stalled for a while.

      <br>

      <br>

      Following on from discussions at the ASTERICS meeting in

      Strasbourg, I have been reading up on language grammars and BNF.

      Based on what I learned I've put together a proposal for how to go

      about creating a machine readable language grammar for ADQL.

      <br>

      <br>

      Short form is :- rather than trying to patch fix the existing BNF

      grammar we create a new grammar from scratch, adding each of the

      language rules and validating them one at a time.

      <br>

      <br>

      More detail is available on the project wiki.

      <br>

      <a class="moz-txt-link-freetext" href="https://github.com/ivoa/lyonetia/wiki/BNF-grammar-and-validation">https://github.com/ivoa/lyonetia/wiki/BNF-grammar-and-validation</a>

      <br>

      <br>

      We have found a tool that can validate a BNF (or EBNF) grammar

      against a set of examples:

      <br>

      <a class="moz-txt-link-freetext" href="http://bnfparser2.sourceforge.net/">http://bnfparser2.sourceforge.net/</a>

      <br>

      <br>

      And we have set up a prototype system in a Docker container for

      developing and validating a grammar for ADQL:

      <br>

      <a class="moz-txt-link-freetext" href="https://github.com/ivoa/lyonetia/tree/master/src/docker/bnfparser">https://github.com/ivoa/lyonetia/tree/master/src/docker/bnfparser</a>

      <br>

      <br>

      --------

      <br>

      <br>

      I'd be interested to hear what you think.

      <br>

      <br>

      Is this a realistic approach to creating a machine readable

      grammar for ADQL ?

      <br>

      <br>

      Do you have an alternative method you would like to suggest ?

      <br>

      <br>

      At this stage we would welcome any and all suggestions on how best

      to create a machine readable grammar and how to implement the

      validation process.

      <br>

      <br>

      Thanks,

      <br>

      Dave

      <br>

      <br>

      --------

      <br>

      Dave Morris

      <br>

      Research Software Engineer

      <br>

      Wide Field Astronomy Unit

      <br>

      Institute for Astronomy

      <br>

      University of Edinburgh

      <br>

      --------

      <br>

    </blockquote>

    <br>

  </body>

</html>