TABLESAMPLE?

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Fri Jul 19 11:45:57 CEST 2019


Hi all,

On Thu, Jul 18, 2019 at 11:34:15AM +0000, Gerard Lemson wrote:
> > Question - what does the user want, a random percentage (P) of rows, or a
> > random sample of (N) rows from the table ?
> > 
> I would generally want a number of rows.

It is probably not surprising that I prefer a percentage -- I find it
much more natural to say "Try on 1% of the data" than "Try on 10000
items".

I will not deny that I may be a bit influenced by the fact that on
top of Postgres, percentages are a lot more natural and require
fewer tricks -- but then that might again be an indication that *if*
we want to choose just one option, percentages perhaps are preferred
(as that's what the Postgres people did, whom by and large I
consider amazingly smart).
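
To illustrate the kind of trick meant here, this is a minimal sketch of
how a service might map a requested row count onto Postgres'
percentage-based TABLESAMPLE.  The function names, the oversampling
factor and the use of an estimated table size are assumptions made for
illustration only; they do not describe any existing implementation.

```python
def percent_for_rows(requested_rows, estimated_table_rows, oversample=2.0):
    """Return a sampling percentage likely to yield at least requested_rows.

    oversample is a hypothetical safety margin, since percentage
    sampling only hits the target size approximately.
    """
    if estimated_table_rows <= 0:
        # Without a usable size estimate, fall back to reading everything.
        return 100.0
    pct = 100.0 * oversample * requested_rows / estimated_table_rows
    return min(pct, 100.0)


def postgres_row_sample_query(table, requested_rows, estimated_table_rows):
    """Build a Postgres query that samples by percentage, then truncates."""
    pct = percent_for_rows(requested_rows, estimated_table_rows)
    return (f"SELECT * FROM {table} TABLESAMPLE BERNOULLI ({pct:g}) "
            f"LIMIT {int(requested_rows)}")
```

Note that the LIMIT cuts the sample off in scan order, so the result is
probably not a uniform sample of the whole table; that is part of why
row counts take more care than percentages.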

IMHO more importantly, though, "1%" is vague enough that nobody will
expect an exact row count, whereas I believe users will be a bit
surprised if they say TABLESAMPLE(1000 ROWS) and get back 100 or
10000.  So -- *if* we do rows, either in addition to or instead of
percent, I think we have to give certain guarantees (as in: "the
returned dataset must be within a factor of 2 of the requested size
(provided the relation has that many rows to begin with)").  Which is
a complication that, again, I would like to avoid unless there were a
strong use case for sampling by rows.
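
A guarantee of that sort could be written down as a simple check.  The
following is only a sketch of the example wording above; the function
name and the configurable factor are hypothetical.

```python
def satisfies_size_guarantee(requested_rows, returned_rows, table_rows,
                             factor=2.0):
    """Check the 'within a factor of 2 of the requested size' rule."""
    # The guarantee only applies if the relation has that many rows
    # to begin with.
    if table_rows < requested_rows:
        return True
    lower = requested_rows / factor
    upper = requested_rows * factor
    return lower <= returned_rows <= upper
```

With this, asking for 1000 rows and getting back 100 or 10000 would
both count as violations, while 1500 would be acceptable.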

Note also that, as VODataService 1.2 comes in, users will have a
reasonable way of estimating the number of rows they can expect when
using percentages, as tables there can (and hopefully will) have an
@nrows attribute.
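
As a rough sketch of that estimate: given the nrows value from the
table metadata and a sampling percentage, the expected sample size is
just their product.  The spread computed below assumes that every row
is kept independently with the same probability (Bernoulli sampling);
that is an assumption of this example, and block-based sampling
methods would likely scatter more.

```python
import math


def expected_sample_size(nrows, percent):
    """Return (mean, standard deviation) of the sample size.

    Assumes independent per-row sampling, so the size follows a
    binomial distribution with n = nrows and p = percent / 100.
    """
    p = percent / 100.0
    mean = nrows * p
    std = math.sqrt(nrows * p * (1.0 - p))
    return mean, std
```

For a table with an nrows of one million, 1% gives about 10000 rows
with a standard deviation of roughly 100.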

        -- Markus

