[nvo-techwg] Comments on PQL
Thomas McGlynn
Thomas.A.McGlynn at nasa.gov
Thu Jul 9 09:04:53 PDT 2009
John inspired me to finish my comments on the PQL draft, which follow.
Tom
General comments:
I've tried to go over the PQL document in detail. I think the overall
framework is fine. At the HEASARC we can implement this relatively
quickly. [I think we did everything that's mandatory in our previous
version.]
There are a number of issues in the document and protocols
that I find worrying, and I've tried to list them below. In many cases
I've made suggestions for simplifying the document where I think
either the topic doesn't need to be discussed at all, or the syntax
discussed is superfluous.
Regards,
Tom
Some of the key issues for me:
- Both the proposed syntax for the WHERE parameter and
its current description need significant improvement. Currently
it contradicts the general description of parameters. I've suggested
an alternative, but almost anything would be better than the
current version. We should not let the WHERE parameter drive
the definition of parameters generally.
- The multicone discussion is unclear. The coordinate systems
supported are not discussed (nor are they for the standard POS, for that matter),
and how the positional columns are found is confusing. Users
should be able to explicitly specify the columns in the uploaded
tables where the positions are to be found.
- I believe there is a running confusion in the document regarding
the need for escaping characters. While many characters in PQL-defined
strings will need to be escaped when PQL parameters are encoded
in an HTTP request, that is not part of PQL and need not be discussed
here. The only character that might need to be escaped within
PQL itself is ',' (and possibly '/' if we allow range searches
in strings). I'd prefer backslash quoting for these, if we decide
we need it, since otherwise there are multiple levels of URL encoding
required in sending and receiving messages. Almost every use
of 'HTTP' or 'URL' in the document is inappropriate.
- I strongly disagree with the behavior suggested in section 2.8.
- The parameter qualifier syntax seems to serve no real purpose and
doesn't appear to be needed. An additional parameter could be used to specify
the coordinate system; I'd find this much cleaner.
[I saw no other uses of qualifiers.]
- The document should be careful to restrict itself to PQL. TAP should
be referenced as little as possible except possibly in an appendix.
Specific issues by section.
1.
I don't think the discussion of data models is particularly helpful in
the context of this introduction. It leaves me more confused than when I started.
I'd delete the second paragraph. The same applies to the fourth paragraph,
and I'd get rid of the phrase "based upon the generic dataset model" in the
first sentence of the fifth paragraph.
(para 6) I don't believe the restriction to version 1.0 of TAP is
appropriate. Nor do the versions of PQL and TAP need to be synchronized.
(Neither do those of ADQL and TAP.)
A statement that the current version of the protocol is being defined
to work within version 1.0 of TAP is fine, but we should make no statement
limiting its use. E.g., why should we imply that TAP 2.0 will require
a new version of PQL? Maybe it will, but we can't see the future.
2.
The word 'parameter' is never defined (as far as I can see). It should be
defined either in the introduction to this section or in a new section 2.1.
E.g. [just a quick stab at this]:
2.1 A 'parameter' is a key/value pair which is used to constrain
a query request. The parameter is specified as a string 'key=value'. A PQL
query consists of a set of parameters sent to a service using some
protocol (e.g., TAP).
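For example, this is roughly how such a set of parameters ends up being carried
over HTTP (just a sketch; the endpoint is made up, and the TAP-level parameters
REQUEST/LANG are TAP's business, not PQL's):

    import urllib.parse

    # Minimal sketch; endpoint and TAP-level parameters are assumptions.
    params = {
        "REQUEST": "doQuery",
        "LANG": "PQL",
        "POS": "180.0,2.5",        # each PQL parameter is just key=value
        "SIZE": "0.5",
        "WHERE": "vmag=4.5/5.5",
    }
    url = "http://example.org/tap/sync?" + urllib.parse.urlencode(params)
    print(url)   # urlencode does the HTTP-level escaping; that is not PQL's job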
2.1
The word 'constant' is incorrect. All types of parameters have
constant values in a given query. I think a better word is 'scalar'.
I don't think URL encoding is necessarily the appropriate escaping mechanism.
This tends to cause problems since people can never understand how many times to
escape or unescape. E.g., a '\' escaping mechanism would more cleanly separate
the escaping that will be required within PQL from that required to generate
PQL queries over HTTP. I think the only reserved character is likely ','
itself.
The list of reserved characters should be specified in a table and referenced
here. The table should indicate how each special character is used.
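To illustrate why I'd keep the two levels apart (this is only a sketch of my
backslash suggestion, not anything in the draft):

    import urllib.parse

    # Within PQL, a literal ',' in a value would be escaped with a backslash.
    pql_value = r"NGC 4151\, field 2"      # one scalar value containing a comma
    # URL encoding only enters when the query is carried over HTTP:
    http_fragment = "name=" + urllib.parse.quote(pql_value)
    print(http_fragment)                   # name=NGC%204151%5C%2C%20field%202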
2.2
I'd suggest that 2.2 follow 2.3. Then we don't have to have a forward
reference to range-valued parameters.
Why the constraint on embedded spaces? That's a significant constraint
and I don't see what it gains us. There is a question of whether the
elements of the list are automatically trimmed:
is x=a,b the same as x=a, b
but I'd like to be able to support a syntax like:
name=2c 273,2c 279
If this restriction is kept, can spaces be specified if properly escaped?
2.3
I'd get rid of the constraint on range searches on string values. I don't
know of any context in which I cannot define the <> relationship on strings.
If ranges are only used in the WHERE clause, then it may be that they
should not be treated as a special type of parameter. Rather, they are
a scalar value that is interpreted specially in that context.
This all leads to the idea that we can specify a formal grammar something
like:
    parameter:        key=value
    key:              a string
    value:            scalarvalue
                    | rangevalue
                    | listvalue
                    | redirectionvalue
    scalarvalue:      // an empty string
                    | a string composed of valid characters
    rangevalue:       scalarvalue/scalarvalue
    listvalue:        scalarvalue
                    | rangevalue
                    | listvalue,scalarvalue
                    | listvalue,rangevalue
    redirectionvalue: @redirectionstring
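A few lines of code show this grammar is easy to handle (just a sketch, assuming
the backslash escaping I argue for above; the function names are mine):

    import re

    def split_unescaped(text, sep):
        # Split on sep, but not on a backslash-escaped sep; then unescape.
        parts = re.split(r'(?<!\\)' + re.escape(sep), text)
        return [p.replace('\\' + sep, sep) for p in parts]

    def parse_value(value):
        # Classify a PQL value as redirection, list, range or scalar.
        if value.startswith('@'):
            return ('redirection', value[1:])
        items = split_unescaped(value, ',')
        if len(items) > 1:
            return ('list', [parse_value(item) for item in items])
        if re.search(r'(?<!\\)/', value):
            lo, hi = split_unescaped(value, '/')
            return ('range', (lo, hi))
        return ('scalar', value)

    print(parse_value('4.5/5.5'))        # ('range', ('4.5', '5.5'))
    print(parse_value('2c 273,2c 279'))  # list of two scalars
    print(parse_value('@mypositions'))   # ('redirection', 'mypositions')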
2.4
I think that a single qualifier is all that a parameter should
have. I'd like to understand what use case is being supported by
the idea of supporting multiple qualifiers within a list. I think
this adds potentially immense complexity for very little result.
[More precisely, I'm concerned with different elements in a list
having different qualifiers. Having multiple qualifiers that apply
to the entire list would be fine. Personally I suspect that
most of the use cases for qualifiers should be handled by additional
parameters and the whole thing could be scrapped.]
2.5
This should be combined with 2.2 (especially if that is moved
down after 2.3). I don't think the discussion of ordered versus unordered
lists is helpful. E.g., if we have some list
key=a,b,c,d
where the TAP service will need to process the values in some order
other than what the user specifies, then the discussion of that
belongs there. [I saw nothing later on that mentioned anything
about ordered versus unordered lists. Scrap it.]
2.6
I think this belongs immediately after the list parameters.
The text should be explicit as to whether indirect parameters can
have qualifiers. [Unless we just get rid of qualifiers!!!]
2.7
The phrase 'null string' is ambiguous, at least to those of us who
program in Java, and perhaps even more cryptic for a Fortran programmer,
for whom a 0-length string is impossible. I'd suggest wording like:
"If a parameter is specified with no content in the value, e.g.,
'POS=', then the parameter shall be treated as having been set, but the
value of the parameter may not be used." [Unless we have some case where
one can legally do this we should scrap it. E.g., I assume this should
be illegal with POS, and if it's legal for something else we should make
that the example.]
2.8
This is wrong. E.g., a user might want to find
all of the X-ray observations with long exposures from the HEASARC. So they
send a query with exposure=10000/ to all of our tables. We don't want
to send them back everything from all of our tables; rather, we use the fact that
only our observation tables have exposures to point them to the data they
want. This is a common scenario and we want to support it.
I think that at the very least we need a flag which controls the
behavior, and I would suggest that the default behavior be that a table
which cannot meet all of the required constraints is not queried.
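Concretely, such a request might look like the following (the flag name is
purely something I'm making up here to illustrate the idea):

    params = {
        "LANG": "PQL",
        "WHERE": "exposure=10000/",
        # Hypothetical flag: when false (the suggested default), a table
        # without an exposure column is simply not queried at all.
        "MATCHMISSING": "false",
    }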
The last sentence of this paragraph is gobbledygook to me. It again assumes
something about data models that is not relevant.
2.9
Parameter values may be case sensitive, but are not always. E.g.,
the e/E in the exponential notation and the case of strings in LIKE
comparisons.
Also, we may wish to allow some parameter values where the values are
in a controlled vocabulary to be case insensitive.
So I think the wording should be:
"but parameter values may be", not "must be".
2.10
Move the last sentence first. However, this should be clarified
to 'Clients should use a given parameter keyword only once.' We should
distinguish more clearly between the idea of a parameter as the key/value pair and
the parameter as the key only. We mean the latter here, but in other places
the value is implicitly part of the parameter.
Stop after 'the response from the service is undefined.' Don't try
to define alternatives.
2.11
The sentence 'Positive, negative...' makes no sense to me. I'd
suggest: 'The legal range for certain numeric parameters may be
restricted. Such restrictions are noted in the discussion of those
parameters.' [But I don't see any later on, so we should just
scrap this.]
There should be some statement discussing the valid character set
for string values. Presumably this can refer to the XML standards.
Many of our databases may only support ASCII characters.
3.
I don't think the introductory paragraph is helpful.
3.1.1
Personally I think that using a separate parameter for
the input coordinate system is a better idea than
qualifying the POS value.
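I.e., rather than the qualifier form (if I've read the draft's ';' qualifier
syntax correctly), something like this, where coordinateSystem is just my
placeholder name:

    POS=86.75,23.4;GALACTIC          (qualifier on the value)

versus

    POS=86.75,23.4
    coordinateSystem=GALACTIC        (separate parameter)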
There should be at least one coordinate system that all services
are required to support (ICRS, J2000?).
I'd stop "If SIZE is omitted ..." with "SHALL supply a default value."
The rest is quality of implemention, not specification. (not also
SHALL versus should).
3.1.2
I hate the region syntax!
I don't see why you want the spaces URL encoded -- I think this is a confusion
of the HTTP appearance and the value within PQL. In PQL we allow
the embedded spaces. When we send the PQL parameter over HTTP, the
embedded spaces are escaped.
3.1.3
There is no specification on how a user can find out which
are the standard parameters [there is for all parameters].
This should be possible through the
metadata and mentioned here.
3.1.4
I think this should probably be independent of TAP. Might wish
to mention database.schema.table versus table options.
3.1.5
This is a mess.
We need to face this now and clean it up. If parameters are separated
into lists by commas, then we cannot use commas in the way discussed
here. They are separating things we want joined.
E.g., suppose we want vmag between 4.5 and 5.5, or less than 3, or greater
than 9, and kmag between 4.5 and 5.5 (adapted from the text).
One viable syntax is:
WHERE=vmag=4.5/5.5|/3|9/,kmag=4.5/5.5
This would be fully consistent with the rest of the text.
Note that we use the '=' within the value of the WHERE parameter, but
that's perfectly legal and not a problem. I've used | as a delimiter,
which is nice because it is used in some contexts to mean OR, which is
what it means here. We could use ':'s as well. Personally I think
this is clearer than the original.
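To show the proposed syntax is trivial to decompose on the service side
(just a sketch; the column names are only illustrative):

    def parse_where(where_value):
        # ',' joins constraints (AND); '|' separates alternatives (OR);
        # '/' marks a range; '=' inside the value is harmless.
        constraints = {}
        for clause in where_value.split(','):
            column, expr = clause.split('=', 1)
            ranges = []
            for alt in expr.split('|'):
                lo, _, hi = alt.partition('/')
                ranges.append((lo or None, hi or None))
            constraints[column] = ranges
        return constraints

    print(parse_where("vmag=4.5/5.5|/3|9/,kmag=4.5/5.5"))
    # {'vmag': [('4.5', '5.5'), (None, '3'), ('9', None)],
    #  'kmag': [('4.5', '5.5')]}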
If you want to continue to use commas, then the general discussion of
parameters
in section 2 needs to be extensively revised.
I think this section should be a set of subsections describing
the kinds of constraints that can be made, building up
in complexity. If we do this then we could probably get rid
of the idea that ranges are a special kind of parameter; rather,
we analyze the '/' within a scalar field specially in this
context (just as we do the '*' within a match to a string).
A possible set:
3.1.5.1 Equality queries
3.1.5.2 Range queries
3.1.5.3 Wildcard string queries
3.1.5.4 Queries of date parameters
3.1.5.5 Compound queries of a single parameter
3.1.5.6 Queries of multiple parameters
[We don't have to number the sections if we don't like going so
deep, but we build up from simple to complex queries in a set of
paragraphs and examples.]
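For instance, the progression might run something like this (column names and
the date format are just placeholders, and the compound example uses my '|'
suggestion from above):

    WHERE=instrument=ACIS                      (equality)
    WHERE=exposure=10000/50000                 (range)
    WHERE=name=3c*                             (wildcard string)
    WHERE=start_time=2009-01-01/2009-06-30     (dates)
    WHERE=vmag=4.5/5.5|/3                      (compound constraint, one column)
    WHERE=vmag=4.5/5.5,kmag=4.5/5.5            (multiple columns)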
As noted elsewhere, the discussion of URL encoding is inappropriate here.
The user need do no special URL encoding; it is needed only when the
PQL query is sent over HTTP.
3.2
This should be in an appendix. It is not part of the specification
and gets in the way. I will not comment on this section.
4.1
This is needlessly confusing. It has some relevance to the old
cone search but discusses only a small bit of that.
A section that described how to represent an old style cone search
within TAP would be helpful. (RA,DEC->POS, SR->SIZE, VERB ->SELECT)
As written this section serves no significant role and should be deleted.
If a more coherent discussion were written it would need to note the
differences in error handling between TAP and the cone search standard.
Personally, I'd be all in favor of PQL supporting the Cone search parameters
with an explicit transformation of values. Then anyone using cone
search would get the benefits of all the new TAP services.
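The explicit transformation I have in mind is tiny (a sketch only; how VERB
should map onto SELECT is exactly the kind of thing the document would need
to spell out):

    def cone_to_pql(ra, dec, sr):
        # Old-style cone search parameters -> PQL parameters.
        # RA,DEC -> POS ; SR -> SIZE ; VERB -> SELECT is left open here.
        return {"LANG": "PQL", "POS": f"{ra},{dec}", "SIZE": str(sr)}

    print(cone_to_pql(180.0, 2.5, 0.05))
    # {'LANG': 'PQL', 'POS': '180.0,2.5', 'SIZE': '0.05'}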
4.2
This appears to be the only use of the '@' syntax in the parameter values.
Is special handling of this in section 2 warranted?
The queries may or may not be executed simultaneously. A better
wording might be "allowing a user to request data for an arbitrary
number of positions in a single request."
AFAIK, only data from the table in the FROM clause may be represented in the
output, so this is not really a cross-correlation (i.e., I cannot
see data from the uploaded table). If this is not the
case, then section 3.1.3 needs revision.
"In the most general case ... " paragraph is wordy and too prescriptive.
Just "Any table containing position information may be queried. If a
large table is being queried, a region constraint may be useful if the
uploaded positions are contained within an easily defined fraction of
the sky.
Other constraints may also be applied to the queried table.".
It's a mistake to require the position columns to be defined implicitly
(i.e., by the utype, ucd, name hierarchy).
Users should be allowed to explicitly specify the columns
used for positions. E.g., we might have:
pos=@table,lii,bii;GALACTIC
I'd personally prefer
pos=@table
coordinateFields=lii,bii
coordinateSystem=GALACTIC
as separate parameters (the coordinateSystem would be used
for non-table uploads too).
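As a concrete (and entirely illustrative) sketch of the second form -- the
table name and the coordinateFields/coordinateSystem parameter names are mine,
not the draft's:

    params = {
        "LANG": "PQL",
        "FROM": "heasarc.rosmaster",      # hypothetical table being searched
        "POS": "@mypositions",            # '@' points at the uploaded table
        "coordinateFields": "lii,bii",    # which upload columns hold positions
        "coordinateSystem": "GALACTIC",   # would apply to non-table POS too
        "SIZE": "0.1",
    }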
The requirement that the ConeID field must be of type char and have
arraysize '*' is bad. This is causing a lot of grief in current services.
It is unclear to me if non-ICRS coordinates are allowed here. They seem
to be for POS. If so, what is the 'precise match for column names' when
the coordinates are not ICRS? This 'precise match' is not viable.
Does it require lower case? Is 'Dec' OK but not "DEC"? Is "declination"?
4.3
Is there anything PQL specific here? My sense is that it is TAP that
makes the requirement that the system implement the metadata tables.
PQL is a way of querying tables that are available. It's TAP or
whatever transport protocol we are using that defines <which> tables
are available.
We should simply note that if and where metadata tables
are available, they are queryable through PQL. We might want
to point out the metadata query that would give us the
standard fields for a given table (if there is such). That's
the only piece of this that is PQL specific.
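E.g., the one PQL-specific thing worth showing would be a query of the metadata
tables along these lines (sketch only; whether TAP_SCHEMA.columns is the right
table to point at is TAP's business, and the table name in the constraint is
made up):

    params = {
        "LANG": "PQL",
        "FROM": "TAP_SCHEMA.columns",
        "WHERE": "table_name=heasarc.rosmaster",   # hypothetical table
        "SELECT": "column_name,ucd,utype",
    }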
I note that we have words like "A TAP service must..." which make
it clear that this is out of scope for the PQL document.
4.4
I'd get rid of this in PQL 1.0 -- but that's just me...
4.5
This is the only place in the document where we should describe
URL encoding... but it is not part of PQL. It's two protocols down.
The only character that requires escaping in PQL (as far
as I can see) is the comma (,). Everything else is HTTP-protocol
specific.
I'd make this an appendix... no, on second thought I'd toss it
out. PQL doesn't have anything to do with HTTP. It's layered on
TAP (and perhaps eventually other things). TAP is where we
worry about HTTP.