TABLESAMPLE?

Thu Jul 18 13:34:15 CEST 2019

> Question - what does the user want, a random percentage (P) of rows, or a
> random sample of (N) rows from the table ?
> 
I would generally want a number of rows.

> If it is the latter, then the client would need to know the total number of rows
> and calculate the percentage before making the request.
> 
> It looks like the SQLSERVER version of the function allows the user to specify
> PERCENT or ROWS.
> https://docs.microsoft.com/en-us/sql/t-sql/queries/from-transact-sql?view=sql-
> server-2017
> 
>      TABLESAMPLE ( sample_number [ PERCENT | ROWS ] )
> 
> Which moves the percentage calculation to the server, making it easier for a
> client to request a number of rows.
> 
Yes, but it should be understood that in MS SQL specifying "N ROWS" does in general not return exactly a random sample of N rows.
It seems to be just a way to specify a percentage that on average would return the number of rows.
This is useful, but expectations should be made clear.

Gerard

> Cheers,
> -- Dave
> 
> --------
> Dave Morris
> Research Software Engineer
> Wide Field Astronomy Unit
> Institute for Astronomy
> University of Edinburgh
> --------
> 
> On 2019-07-17 13:48, Markus Demleitner wrote:
> > Hi all,
> >
> > Since there's no ADQL-2.2-Next yet, I'd like to propose a feature that
> > I could see in it here: TABLESAMPLE.
> >
> > The purpose is that the server takes a more-or-less random sample of a
> > table; you will specify a percentage to take in parentheses.  For
> > instance, on http://dc.g-vo.org/tap you can run something like
> >
> > select avg(phot_g_mean_mag) from gaia.dr2light tablesample(0.2)
> >
> > and still finish within the sync timeout (and be within some 1e-4 of
> > the true value, I guess).  Tablesample can be applied to each table
> > separately (though there's probably not many scenarios where you'd
> > want multiple of them).  For instance, you can do something like
> >
> > SELECT
> >    *
> >    FROM gaia.dr2light as d tablesample(0.01)
> >    JOIN ppmxl.main AS n
> >    ON distance(d.ra, d.dec, n.raj2000, n.dej2000)< 2./3600.
> >
> > which gives you a reasonably all-sky sample of ~1e5 pairs of gaia and
> > ppmxl objects (try it and do a sky plot), which might give you an idea
> > of positional scatter, photometric matches, or whatever.  And that
> > still within the sync timeout on my box, which is a few seconds.
> >
> > Much better than TOP, anyway.
> >
> >
> >
> > This is fashioned after the corresponding feature of postgres.
> > Postgres offers different sampling methods; essentially, row-wise and
> > block-wise.  Only block-wise gives you a lot in terms of run-time, and
> > I'd avoid giving any guarantees (beyond "best effort") here anyway,
> > because it'll be hard to give them interoperably.  And that's why I'd
> > say
> >
> > tablesample(percentage)
> >
> > would be a great feature that probably can be reasonably implemented
> > on most databases where the difference between a naive and a fast
> > implementation matters in the first place.
> >
> > I've even given it an id one can use in TAPRegExt until, perhaps,
> > it'll get into ADQL; capabilities fragment:
> >
> >       <languageFeatures
> > type="ivo://org.gavo.dc/std/exts#extra-adql-keywords">
> >         <feature>
> >           <form>TABLESAMPLE</form>
> >           <description>
> >             Written after a table reference,
> >             TABLESAMPLE(10) will make the database only use 10% of the
> >             rows; these are `somewhat random' in that the system will
> >             use random blocks.  This should be good enough when just
> >             testing queries (and much better than using TOP n).
> >           </description>
> >         </feature>
> >       </languageFeatures>
> >
> > Opinions?
> >
> >         -- Markus