TABLESAMPLE?

Wed Jul 17 16:10:48 CEST 2019

Hi Markus
I did a quick check. TABLESAMPLE also exists in (our CasJobs) MS SQL Server (databases).
Nicer and likely faster than "select top N ... order by newid()".
I think main use is your second example of creating representative samples for scatter plots for example, where top without random ordering is no use indeed.

I tried on a table with 750million rows.
Averages at .1 percent sampling can be 1% off for that table.
For that table also without sampling the query returns within the sync timeout.

Note, MS SQL supports TABLESAMPLE (N [PERCENT | ROWS]), PERCENT being default. 
ROWS is not exact and TABLESAMPLE(100 ROWS) returned 0 rows. 
Also, MS SQL allows specification of the random seed for reproducibility:
TABLESAMPLE (N [PERCENT | ROWS]) [REPEATABLE (repeat_seed)]

Anyway, as it seems to have simple support from both postgres and sql server, +1 from me.

Cheers
Gerard

> -----Original Message-----
> From: dal-bounces at ivoa.net <dal-bounces at ivoa.net> On Behalf Of Markus
> Demleitner
> Sent: Wednesday, July 17, 2019 8:49 AM
> To: dal at ivoa.net
> Subject: TABLESAMPLE?
> 
> 
> Hi all,
> 
> Since there's no ADQL-2.2-Next yet, I'd like to propose a feature that I could
> see in it here: TABLESAMPLE.
> 
> The purpose is that the server takes a more-or-less random sample of a
> table; you will specify a percentage to take in parentheses.  For instance, on
> http://dc.g-vo.org/tap you can run something like
> 
> select avg(phot_g_mean_mag) from gaia.dr2light tablesample(0.2)
> 
> and still finish within the sync timeout (and be within some 1e-4 of the true
> value, I guess).  Tablesample can be applied to each table separately (though
> there's probably not many scenarios where you'd want multiple of them).
> For instance, you can do something like
> 
> SELECT
>    *
>    FROM gaia.dr2light as d tablesample(0.01)
>    JOIN ppmxl.main AS n
>    ON distance(d.ra, d.dec, n.raj2000, n.dej2000)< 2./3600.
> 
> which gives you a reasonably all-sky sample of ~1e5 pairs of gaia and ppmxl
> objects (try it and do a sky plot), which might give you an idea of positional
> scatter, photometric matches, or whatever.  And that still within the sync
> timeout on my box, which is a few seconds.
> 
> Much better than TOP, anyway.
> 
> 
> 
> This is fashioned after the corresponding feature of postgres.
> Postgres offers different sampling methods; essentially, row-wise and block-
> wise.  Only block-wise gives you a lot in terms of run-time, and I'd avoid
> giving any guarantees (beyond "best effort") here anyway, because it'll be
> hard to give them interoperably.  And that's why I'd say
> 
> tablesample(percentage)
> 
> would be a great feature that probably can be reasonably implemented on
> most databases where the difference between a naive and a fast
> implementation matters in the first place.
> 
> I've even given it an id one can use in TAPRegExt until, perhaps, it'll get into
> ADQL; capabilities fragment:
> 
>       <languageFeatures type="ivo://org.gavo.dc/std/exts#extra-adql-
> keywords">
>         <feature>
>           <form>TABLESAMPLE</form>
>           <description>
>             Written after a table reference,
>             TABLESAMPLE(10) will make the database only use 10% of the
>             rows; these are `somewhat random' in that the system will
>             use random blocks.  This should be good enough when just
>             testing queries (and much better than using TOP n).
>           </description>
>         </feature>
>       </languageFeatures>
> 
> Opinions?
> 
>         -- Markus