[TAP] case sensitivity of query

Tue Mar 3 10:43:30 PST 2009

I'll stay out of the Sync/Async debate, but I'd like to clarify some 
points regarding case sensitivity.

There are at least 4 classes of strings where one might worry about case 
sensitivity.

A. SQL keywords (SELECT, WHERE, BETWEEN, ...).  It is my belief that 
these are treated in a case insensitive fashion in all databases. 
There may be a related issue when a nominally available word is reserved 
in a particular database, e.g., dec is a reserved word in MySQL, but 
that probably deserves it's own discussion.

B. Function names (sin(x), someUserDefinedFunction, region, ...).  I 
believe that ADQL defines a standard subset of these and that they are 
case insensitive in all of the databases that I know of.  If we ever 
allow user defined functions, then perhaps the issues in C will also 
arise here.

C. Names defined in the schema, i.e., table and column names.  These are 
mostly what have been talked about previously. The default configuration 
of at least MySQL and Sybase seems to be to have these be case sensitive 
although this is not in the SQL standard.  There are some tempting 
precedents for use of names distinguished only by case, e.g., v and V 
are standard names for the apparent and absolute visual magnitudes.

The real issue here is that for a case-sensitive database to handle a 
case insensitive query in general, the SQL must be completely parsed and 
all references to columns and tables analyzed and compared so that the 
case of the input can be adjusted to whatever the native case is.  Thus 
instead of being able to defer the parsing of the SQL largely to the 
database (perhaps with some filtering of the REGION constraint), the 
server code must completely analyze it.  No thin layer on top of 
existing frameworks here.  This is independent of whether there are 
multiple fields distinguished only by case.  If the database happens to 
contain such fields, then any query which references them is ambiguous 
and probably would have to be rejected by a fully conformant implementation.

Solution 1, basically says that case sensitive databases may reject 
queries where the 'wrong' case is used for inputs.  The idea is that 
users will build the query based upon metadata supplied by the database 
which should supply values in the proper case.  So long as users don't 
change the names they should be OK.

Solution 2 is the fully conformant implementation discussed above. 
Speaking only for the HEASARC, we're not likely to implement anything 
like that on the soon.  However we may move to a Postgres database 
sometime next year, so the point may become moot for us.

D. Gerard brings up the case sensitivity of data within the database. 
Here I believe the SQL standard is that the '=' operator is case 
sensitive.  It's also true in practice in the Postgres, MySQL and Sybase 
implementations I've tried.  I believe that the 'LIKE' operator is 
supposed to be case-insensitive.

While I agree with Gerard that it is often the case that the query we 
want to provide to the user is case insensitive, I don't think that 
making the '=' operator work that way is correct.  E.g., joins based on 
exact match are much faster than those based upon a case-insensitive 
match.  Maybe keys are build case-insensitively when '=' is not case 
sensitive.

In practice I don't think this will matter too much.  Once users can get 
  to the databases, they'll learn how to filter queries as they wish. 
Perhaps there should be some metadata which indicates the 
case-sensitivity of the '=' operator.

So to summarize my understanding:

Keywords are case insensitive in the standard and in practice.

Functions are case insensitive in the standard and in practice for 
standard functions -- and we don't yet support user defined ones.

Schema defined names are supposed to be case insensitive according to 
the standard but are not in practice on many databases. Accommodating 
case insensitivity on such databases can be difficult.

String comparisons are supposed to be case sensitive according to  the 
standard and are on many databases, but Gerard suggests that they are 
not on others.  In practice I don't think this will be to big of a deal

	Tom

Gerard wrote:
> Hi Pat
> 
>> ...
>> I see two possible options that would make things work:
>>
>> 1. The service exposes the correct case-sensitive names and 
>> TAP clients do not modify these names in any way. This is 
>> less extreme than what is stated in
>> 2.3.4 but in the same spirit (which is explained in that 
>> subsection). We could modify the text to say that the query 
>> is not case sensitive except in this one respect. 
>>
> What should happen if clients *do* use table/column names with different
> case.
> Should the TAP service report an error? Also for case-insensitive databases?
> 
>> 2. The service has to accept a completely non-case query and 
>> if the backend db is case-sensitive the service has to map 
>> the table and column names from the metadata as necessary to 
>> make it work. This would get harder if the service really 
>> used case to distinguish things (like columns ra,dec,RA,DEC) 
>> but could still be done as long as the column names in the 
>> exposed metadata are unique. 
>> I don't know if anyone has/would make a database schema that way...
>>
> That would be very bad design indeed.
> 
>> Note: It should be noted that the current metadata (TBD) plan 
>> is that the service exposes the fully-qualified table name 
>> (e.g. with as many of the catalog and schema specified as are 
>> needed for the query to be correct) and the client is 
>> expected to use the fully-qualified table names "as-is" and 
>> not to parse them to get at these constructs. This greatly 
>> simplifies the metadata (esp. the XML schema, I understand) 
>> and works fine. Thus, clients are already supposed to use use 
>> table names "as-is" and #1 above just extends it to include 
>> case and apply to column names.
>>
>> Note: For TAP implementors where the DB is not case 
>> sensitive, neither of these effect you, but #1 marginally 
>> effects the client. 
>>
> That depends on what the reaction should be in case clients do not use the
> same
> case as in the metadata. See above.
> 
> 
> Btw, I think ni my comments I mentioned two types of case sensitivity.
> One of table/column names etc, another of string literals.
> 
> At least my SQLServer databases are case-insensitive regarding string
> literals.
> i.e. 
>   select * from coords where coordsystem='icrs'
> will return same as 
>   select * from coords where coordsystem='ICRS'
> 
> Though a postgres installation is not.
> (Both are case-insensitive regarding table and column names.)
> There are arguments why case-insensitivity for string literals is indeed a
> good thing.
> 
> 
>> Please comment asap.
> If the choice is between 1 and 2, I vote for 2.
> 
> Gerard
> 
>