TABLESAMPLE?

Thu Jul 18 12:41:52 CEST 2019

+1

I agree this would be useful.

Question - what does the user want, a random percentage (P) of rows, or 
a random sample of (N) rows from the table ?

If it is the latter, then the client would need to know the total number 
of rows and calculate the percentage before making the request.

It looks like the SQLSERVER version of the function allows the user to 
specify PERCENT or ROWS.
https://docs.microsoft.com/en-us/sql/t-sql/queries/from-transact-sql?view=sql-server-2017

     TABLESAMPLE ( sample_number [ PERCENT | ROWS ] )

Which moves the percentage calculation to the server, making it easier 
for a client to request a number of rows.

Cheers,
-- Dave

--------
Dave Morris
Research Software Engineer
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------

On 2019-07-17 13:48, Markus Demleitner wrote:
> Hi all,
> 
> Since there's no ADQL-2.2-Next yet, I'd like to propose a feature
> that I could see in it here: TABLESAMPLE.
> 
> The purpose is that the server takes a more-or-less random sample of
> a table; you will specify a percentage to take in parentheses.  For
> instance, on http://dc.g-vo.org/tap you can run something like
> 
> select avg(phot_g_mean_mag) from gaia.dr2light tablesample(0.2)
> 
> and still finish within the sync timeout (and be within some 1e-4 of
> the true value, I guess).  Tablesample can be applied to each table
> separately (though there's probably not many scenarios where you'd
> want multiple of them).  For instance, you can do something like
> 
> SELECT
>    *
>    FROM gaia.dr2light as d tablesample(0.01)
>    JOIN ppmxl.main AS n
>    ON distance(d.ra, d.dec, n.raj2000, n.dej2000)< 2./3600.
> 
> which gives you a reasonably all-sky sample of ~1e5 pairs of gaia and
> ppmxl objects (try it and do a sky plot), which might give you an
> idea of positional scatter, photometric matches, or whatever.  And
> that still within the sync timeout on my box, which is a few seconds.
> 
> Much better than TOP, anyway.
> 
> 
> 
> This is fashioned after the corresponding feature of postgres.
> Postgres offers different sampling methods; essentially, row-wise and
> block-wise.  Only block-wise gives you a lot in terms of run-time,
> and I'd avoid giving any guarantees (beyond "best effort") here
> anyway, because it'll be hard to give them interoperably.  And that's
> why I'd say
> 
> tablesample(percentage)
> 
> would be a great feature that probably can be reasonably implemented
> on most databases where the difference between a naive and a fast
> implementation matters in the first place.
> 
> I've even given it an id one can use in TAPRegExt until, perhaps,
> it'll get into ADQL; capabilities fragment:
> 
>       <languageFeatures 
> type="ivo://org.gavo.dc/std/exts#extra-adql-keywords">
>         <feature>
>           <form>TABLESAMPLE</form>
>           <description>
>             Written after a table reference,
>             TABLESAMPLE(10) will make the database only use 10% of the
>             rows; these are `somewhat random' in that the system will
>             use random blocks.  This should be good enough when just
>             testing queries (and much better than using TOP n).
>           </description>
>         </feature>
>       </languageFeatures>
> 
> Opinions?
> 
>         -- Markus