TABLESAMPLE?
Markus Demleitner
msdemlei at ari.uni-heidelberg.de
Wed Jul 17 14:48:39 CEST 2019
Hi all,
Since there's no ADQL-2.2-Next yet, I'd like to propose a feature
that I could see in it here: TABLESAMPLE.
The purpose is that the server takes a more-or-less random sample of
a table; you will specify a percentage to take in parentheses. For
instance, on http://dc.g-vo.org/tap you can run something like
select avg(phot_g_mean_mag) from gaia.dr2light tablesample(0.2)
and still finish within the sync timeout (and be within some 1e-4 of
the true value, I guess). Tablesample can be applied to each table
separately (though there's probably not many scenarios where you'd
want multiple of them). For instance, you can do something like
SELECT
*
FROM gaia.dr2light as d tablesample(0.01)
JOIN ppmxl.main AS n
ON distance(d.ra, d.dec, n.raj2000, n.dej2000)< 2./3600.
which gives you a reasonably all-sky sample of ~1e5 pairs of gaia and
ppmxl objects (try it and do a sky plot), which might give you an
idea of positional scatter, photometric matches, or whatever. And
that still within the sync timeout on my box, which is a few seconds.
Much better than TOP, anyway.
This is fashioned after the corresponding feature of postgres.
Postgres offers different sampling methods; essentially, row-wise and
block-wise. Only block-wise gives you a lot in terms of run-time,
and I'd avoid giving any guarantees (beyond "best effort") here
anyway, because it'll be hard to give them interoperably. And that's
why I'd say
tablesample(percentage)
would be a great feature that probably can be reasonably implemented
on most databases where the difference between a naive and a fast
implementation matters in the first place.
I've even given it an id one can use in TAPRegExt until, perhaps,
it'll get into ADQL; capabilities fragment:
<languageFeatures type="ivo://org.gavo.dc/std/exts#extra-adql-keywords">
<feature>
<form>TABLESAMPLE</form>
<description>
Written after a table reference,
TABLESAMPLE(10) will make the database only use 10% of the
rows; these are `somewhat random' in that the system will
use random blocks. This should be good enough when just
testing queries (and much better than using TOP n).
</description>
</feature>
</languageFeatures>
Opinions?
-- Markus
More information about the dal
mailing list