TABLESAMPLE?
Gerard Lemson
glemson1 at jhu.edu
Thu Jul 18 13:34:15 CEST 2019
> Question - what does the user want, a random percentage (P) of rows, or a
> random sample of (N) rows from the table ?
>
I would generally want a number of rows.
> If it is the latter, then the client would need to know the total number of rows
> and calculate the percentage before making the request.
>
> It looks like the SQLSERVER version of the function allows the user to specify
> PERCENT or ROWS.
> https://docs.microsoft.com/en-us/sql/t-sql/queries/from-transact-sql?view=sql-
> server-2017
>
> TABLESAMPLE ( sample_number [ PERCENT | ROWS ] )
>
> Which moves the percentage calculation to the server, making it easier for a
> client to request a number of rows.
>
Yes, but it should be understood that in MS SQL specifying "N ROWS" does in general not return exactly a random sample of N rows.
It seems to be just a way to specify a percentage that on average would return the number of rows.
This is useful, but expectations should be made clear.
Gerard
> Cheers,
> -- Dave
>
> --------
> Dave Morris
> Research Software Engineer
> Wide Field Astronomy Unit
> Institute for Astronomy
> University of Edinburgh
> --------
>
> On 2019-07-17 13:48, Markus Demleitner wrote:
> > Hi all,
> >
> > Since there's no ADQL-2.2-Next yet, I'd like to propose a feature that
> > I could see in it here: TABLESAMPLE.
> >
> > The purpose is that the server takes a more-or-less random sample of a
> > table; you will specify a percentage to take in parentheses. For
> > instance, on http://dc.g-vo.org/tap you can run something like
> >
> > select avg(phot_g_mean_mag) from gaia.dr2light tablesample(0.2)
> >
> > and still finish within the sync timeout (and be within some 1e-4 of
> > the true value, I guess). Tablesample can be applied to each table
> > separately (though there's probably not many scenarios where you'd
> > want multiple of them). For instance, you can do something like
> >
> > SELECT
> > *
> > FROM gaia.dr2light as d tablesample(0.01)
> > JOIN ppmxl.main AS n
> > ON distance(d.ra, d.dec, n.raj2000, n.dej2000)< 2./3600.
> >
> > which gives you a reasonably all-sky sample of ~1e5 pairs of gaia and
> > ppmxl objects (try it and do a sky plot), which might give you an idea
> > of positional scatter, photometric matches, or whatever. And that
> > still within the sync timeout on my box, which is a few seconds.
> >
> > Much better than TOP, anyway.
> >
> >
> >
> > This is fashioned after the corresponding feature of postgres.
> > Postgres offers different sampling methods; essentially, row-wise and
> > block-wise. Only block-wise gives you a lot in terms of run-time, and
> > I'd avoid giving any guarantees (beyond "best effort") here anyway,
> > because it'll be hard to give them interoperably. And that's why I'd
> > say
> >
> > tablesample(percentage)
> >
> > would be a great feature that probably can be reasonably implemented
> > on most databases where the difference between a naive and a fast
> > implementation matters in the first place.
> >
> > I've even given it an id one can use in TAPRegExt until, perhaps,
> > it'll get into ADQL; capabilities fragment:
> >
> > <languageFeatures
> > type="ivo://org.gavo.dc/std/exts#extra-adql-keywords">
> > <feature>
> > <form>TABLESAMPLE</form>
> > <description>
> > Written after a table reference,
> > TABLESAMPLE(10) will make the database only use 10% of the
> > rows; these are `somewhat random' in that the system will
> > use random blocks. This should be good enough when just
> > testing queries (and much better than using TOP n).
> > </description>
> > </feature>
> > </languageFeatures>
> >
> > Opinions?
> >
> > -- Markus
More information about the dal
mailing list