ADQL-2.1 internal draft

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Thu Jun 11 13:22:41 CEST 2015


Hi,

On Thu, Jun 11, 2015 at 10:46:13AM +0200, Marco Molinaro wrote:
> >> mandatory...and also to have only one of them to help with tables
> >> indexing.
> >> Probably this is something to discuss.
> >
> > If anything we should be normalizing to upper case.  There are some
> > letters that do not round trip properly through lower case.
> 
> <cut>
> 
> it can be true also the other way around (scharfes S, e.g., even if
> unicode has an uppercase letter for it), but again I don't think the
> intent in adding these functions was to support all encodings and
> character sets, it pointed to correctly manage comparison for things
> like UCDs and Utypes.
> That's why ASCII was considered to be enough.

The problem is that database backends vary wildly in which Unicode
code points they can represent and properly manipulate, and there's
not much a frontend can (efficiently) do about it.  All we can
sensibly do is mandate that the TAP query is decoded as UTF-8 (which
is a matter for TAP 1.1) and that ASCII must work within the database
with all functions and operators we define.
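
To make the ASCII-only guarantee concrete, here is a minimal Python
sketch of what I'd consider the portable baseline: fold A-Z and leave
every other code point alone.  The name ascii_lower is made up for
illustration; it is not in the draft and not what any particular
backend does.

```python
def ascii_lower(s: str) -> str:
    """Lower-case only the ASCII letters A-Z; pass everything else through."""
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c
        for c in s
    )


# ASCII input (UCDs, utypes) behaves the same on any backend.
print(ascii_lower("POS.EQ.RA;Meta.Main"))

# Non-ASCII code points are left untouched rather than guessed at.
print(ascii_lower("STRAẞE Ä"))

# Full Unicode case mapping, by contrast, does not round-trip:
# Python maps the scharfes S to "SS" on the way up.
print("ß".upper().lower() == "ß")
```

Anything beyond that baseline depends on the backend's own Unicode
tables, which is exactly what a frontend cannot control.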

Hence I think the current text for LOWER must be defused, and the one
for ILIKE made more precise, or we'll have lots of non-conforming
implementations (for trivial reasons that for most DBs won't even
matter).
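
As a sketch of what "more precise" could mean for ILIKE, the Python
below does case-insensitive matching that is only guaranteed for
ASCII.  It assumes the usual LIKE wildcards (% for any run of
characters, _ for exactly one) and ignores escape characters; the
function name ilike_ascii is hypothetical and none of this is the
draft's wording.

```python
import re


def ilike_ascii(value: str, pattern: str) -> bool:
    """Case-insensitive LIKE where case folding is restricted to ASCII."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    # re.ASCII limits IGNORECASE to ASCII letters; DOTALL lets the
    # wildcards also match newlines.
    flags = re.IGNORECASE | re.ASCII | re.DOTALL
    return re.fullmatch("".join(parts), value, flags) is not None


print(ilike_ascii("POS.EQ.RA;meta.main", "pos.eq.%"))
print(ilike_ascii("Ü", "ü"))
```

With semantics like these, a backend that folds more than ASCII would
still be conforming as long as the spec only constrains the ASCII
case.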

The text in the implementation note encouraged people to adhere to
Unicode conventions.  That, I believe, is all we can do.

> > This still does not specify whether it is UTF-8, UTF-16, or UCS-32.  I
> > think we should just choose one, with my vote being UTF-8 since ASCII
> > is unchanged.

I don't think ADQL needs to specify this.  Like Unicode itself,
database backends in effect work on code points, not on byte streams,
so at that level it makes no sense to talk about encodings.

It is the implementor's responsibility to make sure her database
client's sequences of codepoints ("strings") actually make it into
the database as the same codepoints.  That, however, has nothing to
do with the encoding of the query string or the behaviour of LOWER
or ILIKE.  Or am I missing something?
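
A small Python illustration of the code point versus encoding
distinction, in case it helps (the helper name code_points is made up
for the example):

```python
def code_points(s: str) -> list[int]:
    """Return the Unicode code points that make up a string."""
    return [ord(c) for c in s]


s = "Straße"

# Three different byte streams on the wire ...
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")
utf32 = s.encode("utf-32-le")
print(len(utf8), len(utf16), len(utf32))

# ... that all decode to the same sequence of code points, which is
# what string functions such as LOWER operate on.
decoded = [
    utf8.decode("utf-8"),
    utf16.decode("utf-16-le"),
    utf32.decode("utf-32-le"),
]
print(all(code_points(d) == code_points(s) for d in decoded))
```

Once the query has been decoded, the encoding it arrived in is no
longer visible to anything that works on the string.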

Cheers,

           Markus


