thoughts on TAP-1.2

Mon Mar 28 22:31:39 CEST 2022

On Mon, 28 Mar 2022 at 04:11, Markus Demleitner <
msdemlei at ari.uni-heidelberg.de> wrote:

> Hi Pat,
>
> On Fri, Mar 25, 2022 at 02:09:17PM -0700, Patrick Dowler wrote:
> > astronomical catalogues they create and curate. To that end, the tables
> > are added to the tap_schema and visible in the /tables endpoint. There is
>
> All tables for all users?
>
> I'm mainly asking because I'd like to do this to let "normal" people
> do rather ephemeral (but yet persistent for a few days or weeks)
> uploads, and I'm 100% sure nobody else should or want to see them.
>

Yes, in youcat all tables go into the tap_schema.

If the caller has read permission on the schema in tap_schema, they can see
the metadata. Otherwise, the tables in that schema are invisible via both
mechanisms. I don't recall whether the schema itself is invisible or simply
looks like a directory you can't look inside, but in general permissions
on the schema hide/expose the tables (existence and metadata).

By default, a created schema is public=false and created tables are
public=false; the owner decides if they want to make the schema public
(expose metadata about tables) and if they want to make a specific table
public (expose data in the table).

One could support private ephemeral tables by not making the schema
public. We chose to give users control of permissions (including GMS support)
because of the use cases I outlined, but that would not be mandatory for
all implementers. I can even envision a way to let people create tables but
not change the permissions with our existing code, which is likely what I'd
do if we wanted to allow user ephemeral tables in our argus (CAOM) tap
service -- with our existing code I would simply have "cadcops" own the
schema and grant table creation permission to a user. Then the user could
not change schema permissions and the tables would be invisible.
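
The permission rules described above can be sketched in a few lines. This is
only an illustration, not youcat code: the Schema class, the readers and
creators sets, and all function names are hypothetical, and group (GMS)
membership is reduced to a plain set of user names.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    name: str
    owner: str
    public: bool = False  # default: table metadata is hidden
    readers: set = field(default_factory=set)   # hypothetical read grants
    creators: set = field(default_factory=set)  # hypothetical create grants
    tables: dict = field(default_factory=dict)  # table name -> public flag


def create_table(schema, caller, table):
    """Create a table if the caller owns the schema or holds a create grant."""
    if caller != schema.owner and caller not in schema.creators:
        raise PermissionError(f"{caller} may not create tables in {schema.name}")
    schema.tables[table] = False  # default: table data is not public


def set_schema_public(schema, caller, public):
    """Only the schema owner may change schema permissions."""
    if caller != schema.owner:
        raise PermissionError(f"{caller} does not own {schema.name}")
    schema.public = public


def can_see_metadata(schema, caller):
    """Table metadata is visible only with read permission on the schema."""
    return schema.public or caller == schema.owner or caller in schema.readers
```

With Schema("scratch", owner="cadcops", creators={"alice"}, readers={"alice"}),
alice can create and see her tables, cannot make the schema public, and
nobody else sees the tables at all.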

> That's not *much* of a problem for /tables, but for tap_schema it is
> *quite* a complication, and hence I'm curious what you do and how it
> works for you.
>

We do provide consistent metadata from /tables and tap_schema queries.
Yeah, that means row-level access control is injected into tap_schema
queries, and that bit can be complicated to implement. However, we had
already figured out how to do that in argus (CAOM) because of protected
metadata, so we had well-tested code to inject access control constraints
into ADQL queries that was easy enough to reuse for this. I can think of
ways to implement this that would be much simpler in less general-purpose
scenarios, and in standardising this whole thing it would be important to
write the spec so that this kind of simplification is feasible. So if a TAP
service supported ephemeral tables visible only to the owner, that simple
scenario should be easier to implement than the general case like youcat,
but have most of the mechanics in common and just "not allow" some things
(could be literally "permission denied" or "unsupported" - TBD).
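
For the simple owner-only scenario, the injection step might look roughly
like the sketch below. It is not the argus/youcat implementation: it uses
SQLite as a stand-in for the real database, rewrites the query with a naive
text substitution instead of a proper ADQL parser, and assumes two
hypothetical columns (owner_id, is_public) on tap_schema.schemas that are
not part of the standard TAP_SCHEMA.

```python
import re
import sqlite3

# Subquery exposing only tables in schemas the caller may read.
# owner_id and is_public are hypothetical columns, not standard TAP_SCHEMA.
VISIBLE_TABLES = (
    "(SELECT * FROM tap_schema.tables WHERE schema_name IN "
    "(SELECT schema_name FROM tap_schema.schemas "
    "WHERE is_public = 1 OR owner_id = :caller))"
)


def inject_access_control(query):
    """Replace references to tap_schema.tables with the filtered subquery.

    Naive text substitution: fully qualified column references such as
    tap_schema.tables.table_name are not handled.
    """
    return re.sub(r"\btap_schema\.tables\b", lambda m: VISIBLE_TABLES,
                  query, flags=re.IGNORECASE)


def demo_connection():
    """Build an in-memory stand-in for tap_schema with two schemas."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS tap_schema")
    conn.execute("CREATE TABLE tap_schema.schemas "
                 "(schema_name TEXT, owner_id TEXT, is_public INTEGER)")
    conn.execute("CREATE TABLE tap_schema.tables "
                 "(schema_name TEXT, table_name TEXT)")
    conn.executemany("INSERT INTO tap_schema.schemas VALUES (?, ?, ?)",
                     [("survey", "alice", 0), ("pub", "ops", 1)])
    conn.executemany("INSERT INTO tap_schema.tables VALUES (?, ?)",
                     [("survey", "survey.cat"), ("pub", "pub.stars")])
    conn.commit()
    return conn


def list_tables(conn, caller):
    """Run a user-style tap_schema query with the constraint injected."""
    sql = inject_access_control(
        "SELECT table_name FROM tap_schema.tables ORDER BY table_name")
    return [row[0] for row in conn.execute(sql, {"caller": caller})]
```

Here list_tables(conn, "alice") includes her private survey.cat, while any
other caller only gets pub.stars. Serving /tables from the same filtered
query would keep the two views consistent.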

> > access control that the users/projects control so they can control who can
> > the group can see/query them) until the project publishes a paper, at which
> > point they would make the table publicly queryable.
>
> Rephrasing my query above: Is it just the data that's hidden or the
> metadata, too?
>

As above, metadata is hidden if the caller doesn't have read permission on
the schema.

>
> > We do not (yet) put table metadata into the registry so I haven't thought
> > that bit through, but probably only public tables should go there and I'd
> > probably make it an additional manual step to "publish" (to registry) and
>
> At this point I'm a lot less worried about the registry than about
> what clients get from /tables and tap-schema.
>
> However, with my Registry hat on let me briefly state that for me it
> sucks if your tap_schema is different from what you give the
> registry, as that will give everyone a lot of headache when, one day,
> we want to move from GloTS (which harvests tap_schema if it can) to a
> proper Registry approach.
>
> > If you look at the details of the bulk loading, you see that it is a
> > streaming operation that directly inserts rows into the database. There's a
>
> Out of curiosity (not closely related to much anything): You're not
> batching these inserts?  And that's performing well?
>
The server side does transactions in a (currently internal) batch size to
balance performance and recovery from failure. The actual batch size isn't
exposed or controllable by users. The user could effectively make smaller
batches by splitting up the stream into a sequence of requests, but can't
make them larger.
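
A minimal sketch of that kind of batched loading, assuming SQLite as a
stand-in database, a hypothetical catalog(id, ra, dec) table, and an
arbitrary batch size. Returning the number of committed rows as a resume
point is an assumption made for the example, not a description of the
actual service.

```python
import sqlite3
from itertools import islice


def bulk_load(conn, rows, batch_size=1000):
    """Stream rows into the table, one transaction per batch.

    Returns the number of rows committed. If a batch fails, only that batch
    is rolled back and the returned count tells the caller where to resume.
    """
    it = iter(rows)
    committed = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return committed
        try:
            with conn:  # commits on success, rolls back on an exception
                conn.executemany(
                    "INSERT INTO catalog (id, ra, dec) VALUES (?, ?, ?)",
                    batch)
        except sqlite3.Error:
            return committed
        committed += len(batch)
```

A larger batch size means fewer commits but more rows to resend after a
failure, which is the performance/recovery trade-off mentioned above.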

>
> > clients could automatically recover from content failures. It's hard to
> > push 500e6 rows into a database table without failures, but that's what
>
> I find it remarkable that you seem to spend quite a bit of effort
> on defeating transactionality -- that's really what your users
> wanted?  Half-uploaded tables?  How does that work technically?  Are
> you really inserting these things outside of database transactions?
>

In the original design/planning it was very clear to me that "bulk loading"
was pretty simple and anything that exposes transaction semantics to the
client more or less brings to bear the entire database API. Users/projects
that want the latter should operate the database, have a database account,
and connect to the database with an API of their choice. And they should
create and manage the TAP service as well. We were/are willing to help
projects with that in various ways, but this functionality is for people
who don't want to do all that stuff. They typically have a big fat table
with 200 columns and a few hundred million rows that they want people to
query.

> -- Definitely interested in more use cases for user-generated database
> > content...
>
> Well, as hinted above what I'm really after is
>
>   SELECT
>   INTO my_schema.result_table
>     ra, dec, foo, bar
>   FROM some.tap_table
>   WHERE...
>
> That is, people shouldn't need to download their results if they'd
> like to reuse them later within my database.
>
Conceptually, I think it would be feasible to make this work and have the
same result.

I always think of this as INSERT ... SELECT ..., which better separates the
destination from the source values, but the two forms are equivalent.
Obviously this would be an async job of some sort, and we have thought about
async loading where row data is pulled from some source (URL), but it could
be a local query as well. The URL can effectively be a remote TAP query, so
these cases are conceptually related.
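
The INSERT ... SELECT form can be tried out locally with a small sketch.
SQLite is used as a stand-in (it has no SELECT ... INTO), the schema
prefixes are dropped, and the table contents and the min_dec cut are made
up for the example. In a TAP service this would run as an async job rather
than a direct database call.

```python
import sqlite3


def make_source(conn):
    """Create a small stand-in for some.tap_table with made-up rows."""
    conn.execute("CREATE TABLE tap_table (ra REAL, dec REAL, foo TEXT)")
    conn.executemany(
        "INSERT INTO tap_table VALUES (?, ?, ?)",
        [(10.0, -5.0, "a"), (20.0, 15.0, "b"), (30.0, 45.0, "c")])
    conn.commit()


def store_result(conn, min_dec):
    """Persist a query result server-side instead of downloading it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS result_table (ra REAL, dec REAL, foo TEXT)")
    with conn:  # one transaction for the whole INSERT ... SELECT
        conn.execute(
            "INSERT INTO result_table (ra, dec, foo) "
            "SELECT ra, dec, foo FROM tap_table WHERE dec > ?",
            (min_dec,))
    return conn.execute("SELECT COUNT(*) FROM result_table").fetchone()[0]
```

The destination column list and the source SELECT are stated separately,
which is the separation of destination from source values referred to
above.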

--
Patrick Dowler
Canadian Astronomy Data Centre
Victoria, BC, Canada