ADQL and ORDER BY

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Thu Feb 25 10:41:14 CET 2021


Hi Pat,

On Wed, Feb 24, 2021 at 01:33:49PM -0800, Patrick Dowler wrote:
> I have just discovered (to my horror) that some of our databases sort in
> different order than others depending on the LC_COLLATE value...
> specifically this is noticeable with underscore which comes before letters
> with C but after letters with en_CA or en_US... ouch.
> 
> Does ADQL specify this? Which locale (?) is assumed to be correct for ADQL?

It certainly doesn't so far, and specifying this properly, in a way
that doesn't make implementors despair, in particular over
pre-existing databases, is I think somewhere between hard and
impossible.

For the somewhat related problem of UPPER and LOWER, we're weaseling
a bit in the current spec --

  Since case folding is a nontrivial operation in a multi-encoding
  world, ADQL requires standard behaviour for the ASCII characters,
  and recommends following algorithm R2 described in Section 3.13,
  "Default Case Algorithms" of \citet{std:UNICODE} for characters
  outside the ASCII set.

--, so perhaps we should do something similar for string ordering.  I
had hoped that at least within printable ASCII, collations would
agree, which is why so far I've been happy to look the other way.

Your example with the underscore is alarming in that sense -- I've
not been aware of this.  I'd much prefer if we could say something
like "Within ASCII, collation must be as in the C locale, outside of
ASCII you're on your own" (or something like that; I'd have to look
up whether "as in the C locale" actually is well-defined).  But I'm
not sure about the implications this has after what you've written.
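
To illustrate what "as in the C locale" might mean within ASCII, here
is a minimal Python sketch.  It assumes that C-locale collation
amounts to comparing byte values, which for ASCII is the same as
comparing code points.  The column names and the helper
c_locale_sorted are made up for this example and are not part of ADQL
or of any database.

```python
def c_locale_sorted(strings):
    """Sort ASCII strings by byte value, mimicking strcmp-style ordering.

    Hypothetical helper for illustration only; it raises
    UnicodeEncodeError for non-ASCII input, where we'd be "on our own".
    """
    return sorted(strings, key=lambda s: s.encode("ascii"))


# Made-up column names that differ only in case and underscore.
names = ["ra_err", "raErr", "raerr", "RA_ERR"]
print(c_locale_sorted(names))
```

Under this assumption the result is RA_ERR, raErr, ra_err, raerr: the
underscore (0x5F) falls after the uppercase letters and before the
lowercase ones.  Locales such as en_US typically apply language-aware
rules instead and so will generally order the same strings
differently.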

Note that even our language on UPPER and LOWER already
limits which LC_CTYPE (I think) ADQL-serving databases can run in.  I
know of at least one locale that's excluded by our requirements:
Turkish, which has upper(i)=İ and lower(I)=ı.  I can't say I like
that exclusion much, but then I don't know a non-bigoted solution to
the problem.
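
The ASCII-only part of that requirement can be sketched in a few
lines of Python.  This is an illustration under the assumption that
"standard behaviour for the ASCII characters" means mapping a-z to
A-Z (and back) while leaving everything else untouched; ascii_upper
and ascii_lower are hypothetical names, not ADQL functions.

```python
def ascii_upper(s):
    """Upper-case only a-z; leave all other characters alone."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def ascii_lower(s):
    """Lower-case only A-Z; leave all other characters alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


print(ascii_upper("i"))       # plain ASCII I, not the dotted capital
print(ascii_lower("I"))       # plain ASCII i, not the dotless one
print(ascii_upper("\u0131"))  # dotless i is outside ASCII, left as is
```

Under Turkish rules the first two calls would be expected to give İ
and ı instead, which is exactly the conflict described above.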

       -- Markus

