thoughts on TAP-1.2

Patrick Dowler pdowler.cadc at gmail.com
Fri Mar 25 22:09:17 CET 2022


** use cases
The primary use case for our youcat service is for projects to publish
astronomical catalogues they create and curate. To that end, the tables are
added to the tap_schema and visible in the /tables endpoint. Access control
is in the hands of the users/projects, so they decide who can
create tables, who can insert rows, and who can query (all using an external
GMS service). The general usage pattern is for tables to be protected (only
the group can see/query them) until the project publishes a paper, at which
point they would make the table publicly queryable.

We do not (yet) put table metadata into the registry so I haven't thought
that bit through, but probably only public tables should go there and I'd
probably make it an additional manual step to "publish" (to registry) and
not just have it triggered by a project admin changing a table to public
(and then back again a day later).

If you look at the details of the bulk loading, you see that it is a
streaming operation that directly inserts rows into the database. There is a
lot that can go wrong there: transient network failures, input rows
rejected because of invalid values or duplicate keys, and so on. By streaming
input directly into the tables, the client has the ability to look at
direct error messaging from the attempt to insert and can immediately query
to see the last row that was successful in order to resume. Any async
process is going to make that much harder, and very hard to standardise so
clients could automatically recover from content failures. It's hard to
push 500e6 rows into a database table without failures, but that's what
youcat users do and with the ability to diagnose and resume they can
eventually succeed.
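The resume pattern described above can be sketched roughly as follows. This is a minimal illustration, not youcat's actual API: the real HTTP calls (the streaming insert and the "last successful row" query) are replaced by injected callables, and it assumes the table has a sortable unique key to resume on.

```python
def resume_bulk_load(rows, load_batch, query_last_key,
                     key=lambda r: r[0], max_attempts=5):
    """Stream rows to the service, resuming after failures.

    On each attempt, ask the service for the largest key it has stored
    and retransmit only the rows beyond that point.
    """
    for _ in range(max_attempts):
        last = query_last_key()
        remaining = [r for r in rows if last is None or key(r) > last]
        if not remaining:
            return True
        try:
            load_batch(remaining)
            return True  # all remaining rows accepted
        except IOError:
            continue     # transient failure: re-query and resume
    return False


# Demonstration with a fake service whose connection drops partway
# through the first attempt.
class FakeService:
    def __init__(self):
        self.stored = []
        self.failed_once = False

    def load_batch(self, rows):
        if not self.failed_once:
            self.stored.extend(rows[:2])   # only two rows arrive before
            self.failed_once = True        # the connection drops
            raise IOError("connection reset")
        self.stored.extend(rows)

    def query_last_key(self):
        return max((r[0] for r in self.stored), default=None)


svc = FakeService()
ok = resume_bulk_load([(i, "obj%d" % i) for i in range(5)],
                      svc.load_batch, svc.query_last_key)
```

The point is that the client, not the service, drives recovery: because inserts are synchronous and immediately queryable, the second attempt can retransmit only rows 2-4 instead of restarting from scratch.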

Our secondary use case is at the other extreme: the DRAO Solar Flux Monitor
(not yet public/operational). This is a set of instruments that record and
persist a handful of measurements a few times each day. The process is to
add a few rows each day, so it is still "append rows to table" but at a
very small scale and never finishes. This use case is also very nicely
satisfied by our current implementation, which allows the client to
immediately detect failures and retry and, if they are feeling extra
cautious, to query for recently ingested measurements to verify success.
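That append-and-verify loop can be sketched like so; again the insert_rows/query_count callables stand in for the real synchronous insert and a verification query (e.g. a COUNT over recent timestamps), and the names are illustrative only. It assumes a batch small enough to be effectively all-or-nothing, or unique keys so a retried duplicate is rejected.

```python
def append_and_verify(measurements, insert_rows, query_count, retries=3):
    """Append a small batch of measurements, then immediately query
    back to confirm they all arrived; retry on failure."""
    for _ in range(retries):
        before = query_count()
        try:
            insert_rows(measurements)
        except IOError:
            continue
        if query_count() - before >= len(measurements):
            return True
    return False


# Demonstration with an in-memory stand-in for the table.
stored = []

def insert_rows(rows):
    stored.extend(rows)

def query_count():
    return len(stored)

ok = append_and_verify([("2022-03-25T17:00:00", 98.7),
                        ("2022-03-25T20:00:00", 101.2)],
                       insert_rows, query_count)
```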

-- Definitely interested in more use cases for user-generated database
content...

** about vospace
Both the error handling/failure/resume of real bulk loading and the trickle
of measurements from sensors benefit from a synchronous direct-to-database
approach that can be immediately queried via the TAP API. We do have a
complete vospace service (vault) that could accept/stage catalogue content
and we did look at those heady ideas, but it is at least as complex and maybe
more so. That's the primary roadblock for the "vospace" ideas and, as far
as I am aware, no one has ever made it work. We stopped thinking about that
approach during the design phase when the list of "vospace magic" things
that had to happen and the opaqueness of such a system grew too large in
comparison with the vosi-tables + bulk load approach.

I've kind of skipped the whole topic of indices, but we do have an async
(uws) job endpoint to run create index commands either before or after bulk
loading. We recommend people create a unique key index (pk) before loading
and other indices after for the typical bulk loaded catalogue use case. So
we are re-using existing APIs where they are applicable, but this part would
have looked much more complicated as "vospace magic".
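The recommended ordering might look like the sketch below. The "/youcat/table-update" path and the job parameter names are hypothetical (the real endpoint may differ); only the PHASE=RUN pattern is standard UWS, and only the ordering (unique-key index, then bulk load, then other indices) comes from the text above.

```python
def create_index_job(post, table, column, unique=False):
    """Create a UWS-style async index job via an injected `post`
    callable and set it running; returns the job id."""
    params = {"table": table, "index": column, "unique": str(unique).lower()}
    job_id = post("/youcat/table-update", params)      # create the job
    post("/youcat/table-update/%s/phase" % job_id,     # start it (UWS)
         {"PHASE": "RUN"})
    return job_id


def load_with_indices(post, bulk_load, table, pk, other_columns, rows):
    """Recommended ordering: unique-key index first, then bulk load,
    then the remaining indices."""
    create_index_job(post, table, pk, unique=True)
    bulk_load(table, rows)
    for col in other_columns:
        create_index_job(post, table, col)


# Demonstration with recording stand-ins for the HTTP calls.
calls = []

def fake_post(path, params):
    calls.append((path, params))
    return "job-%d" % len(calls)

loaded = {}

def fake_load(table, rows):
    loaded[table] = list(rows)

load_with_indices(fake_post, fake_load, "mytable", "obj_id",
                  ["ra", "dec"], [(1, 10.0, 20.0)])
```

Building the unique-key index before loading means duplicate rows are rejected at insert time (which the resume logic relies on), while deferring the other indices keeps the bulk load itself fast.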

The final thing about "vospace magic" is that for someone who is into TAP
and catalogues, requiring a vospace implementation in order to get user
content into a tap service is a big ask. First, it's an
implementation/deployment/operational burden to require a vospace someone
might not otherwise want to offer; that's a big barrier to adoption.
Second, you either (i) have your vospace service connecting to your
tap database or (ii) have some external agent with access to both the vospace
content and the tap database, which has big red bad/monolithic architecture flags
all over it; that's obviously a personal opinion, but I see a lot of
tight-coupling between two services that are already individually
complicated to operate and that's something I want to avoid. We also
thought a little about simply repurposing some parts of the vospace api
rather than having a complete vospace for this, but it just didn't seem to
buy very much here even where the concepts are the same.

-- Would like to stop hearing about how someone once thought vospace could
do this :-) unless of course someone wants to show a working service and
explain how they made it work...


--
Patrick Dowler
Canadian Astronomy Data Centre
Victoria, BC, Canada


On Mon, 21 Mar 2022 at 09:51, Dave Morris <dave.morris at metagrid.co.uk>
wrote:

>
> This is indeed one of the use cases that we had in mind for VOSpace.
>
> A section of space in a VOSpace service where the directory structure
> maps to the catalog/schema/table hierarchy of a writable database.
>
> Creating a 'file' called 'mytable' in 'mycatalog/myschema' would create
> a new table.
>
> All of the object construction and access control rules map fairly well
> onto a virtual directory structure and from a user's perspective it can
> be made really simple.
>
> To create a new database table, just drag a VOTable file from my desktop
> into 'mycatalog/myschema', and the service takes care of the rest.
>
> As a side effect, you get all of the 3rd party asynchronous transfer
> capabilities needed to transfer a multi-Tbyte result set from one
> service to another.
>
> Cheers,
> -- Dave
>
> --------
> Dave Morris
> Research Software Engineer
> Wide Field Astronomy Unit
> Institute for Astronomy
> University of Edinburgh
> --------
>
> On 2022-03-17 07:22, Markus Demleitner wrote:
> >
> > The thing that worries me a bit about the current proposal is that
> > the operations *are* fairly similar to what we offer in VOSpace, and
> > if we have two rather different APIs for what's straightforwardly
> > subsumed as remote data management, I think we should have strong
> > reasons.
> >
> > Have you considered employing VOSpace for this?  If so, why did you
> > discard it?  Could it perhaps be fixed to work for this use case?
> >
>