TABLESAMPLE?
Markus Demleitner
msdemlei at ari.uni-heidelberg.de
Fri Jul 19 11:45:57 CEST 2019
Hi all,
On Thu, Jul 18, 2019 at 11:34:15AM +0000, Gerard Lemson wrote:
> > Question - what does the user want, a random percentage (P) of rows, or a
> > random sample of (N) rows from the table ?
> >
> I would generally want a number of rows.
It is probably not surprising that I prefer a percentage -- I find it
much more natural to say "Try on 1% of the data" than "Try on 10000
items".
I will not deny that I may be a bit influenced by the fact that on
top of Postgres, percentages are a lot more natural and require
fewer tricks -- but then that might again be an indication that *if*
we want to choose just one option, percentages perhaps are preferred
(as that's what the Postgres people did, which by and large I
consider amazingly smart).
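
To illustrate why a percentage needs fewer tricks, here is a minimal
Python sketch of row-level (Bernoulli) sampling. It is not how Postgres
or any TAP service implements TABLESAMPLE; the function name
bernoulli_sample and its parameters are made up for this illustration.
The point is only that each row is kept or dropped on its own, so the
sampler never needs to know the table size:

```python
import random


def bernoulli_sample(rows, percent, seed=None):
    """Keep each row independently with probability percent/100.

    Hypothetical helper for illustration only.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    rng = random.Random(seed)
    p = percent / 100.0
    # Single pass; the total number of rows is never consulted.
    return [row for row in rows if rng.random() < p]


# "Try on 1% of the data": the result size is close to 1000 for
# 100000 input rows, but it is not exactly 1000.
sample = bernoulli_sample(range(100000), 1, seed=42)
print(len(sample))
```

Asking for a fixed number of rows instead would require knowing (or
estimating) the table size first, or a second pass to trim the result.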
IMHO more importantly, though, "1%" is vague enough, whereas I
believe users will be a bit surprised if they say TABLESAMPLE(1000
ROWS) and get back 100 or 10000. So -- *if* we do rows, either in
addition to or instead of percent, I think we have to give certain
guarantees (as in: "the returned dataset must be within a factor of 2
of the requested size (provided the relation has that many rows to
begin with)"). Which is a complication that, again, I would like to
avoid unless there were a strong use case for sampling by rows.
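
As a sketch of what such a guarantee would mean in practice, the
following Python check encodes the factor-of-2 wording quoted above. The
function within_factor and its arguments are hypothetical; nothing like
it is part of any ADQL draft:

```python
def within_factor(requested, returned, table_rows, factor=2.0):
    """Check a "within a factor of N" guarantee on a row-count sample.

    Hypothetical helper; the default factor of 2 mirrors the example
    wording in the text above.
    """
    if table_rows < requested:
        # The relation does not have that many rows to begin with,
        # so the guarantee does not apply.
        return True
    return requested / factor <= returned <= requested * factor


# Requesting 1000 rows from a table of a million rows:
print(within_factor(1000, 100, 10**6))    # False: too few rows
print(within_factor(1000, 10000, 10**6))  # False: too many rows
print(within_factor(1000, 1500, 10**6))   # True
```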
Note also that, as VODataService 1.2 comes in, users will have a
reasonable way of estimating the number of rows they can expect when
using percentages, as tables there can (and hopefully will) have an
@nrows attribute.
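
A minimal sketch of that estimate in Python, assuming the service
samples row by row (Bernoulli sampling) and that the advertised nrows is
roughly right. The name expected_sample_size is made up for this
example. With block-level sampling the scatter would typically be larger
than the sigma computed here:

```python
import math


def expected_sample_size(nrows, percent):
    """Return (mean, sigma) of the sample size for Bernoulli sampling.

    nrows would come from the table metadata; percent is the
    TABLESAMPLE argument. Hypothetical helper for illustration.
    """
    p = percent / 100.0
    mean = nrows * p
    # Standard deviation of a binomial distribution B(nrows, p).
    sigma = math.sqrt(nrows * p * (1.0 - p))
    return mean, sigma


# A table advertising a million rows, sampled at 1%:
mean, sigma = expected_sample_size(1000000, 1)
print(round(mean), round(sigma, 1))
```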
-- Markus