thoughts on TAP-1.2

Dave Morris dave.morris at metagrid.co.uk
Sat Mar 26 12:38:02 CET 2022


Yep, you are right.

This looks like a simpler/cleaner way to create and upload tables.

Cheers,
-- Dave

--------
Dave Morris
Research Software Engineer
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------

On 2022-03-25 21:09, Patrick Dowler wrote:
> ** use cases
> The primary use case for our youcat service is for projects to publish
> astronomical catalogues they create and curate. To that end, the tables
> are added to the tap_schema and visible in the /tables endpoint. There
> is access control, managed by the users/projects themselves, over who
> can create tables, who can insert rows, and who can query (all using an
> external GMS service). The general usage pattern is for tables to be
> protected (only the group can see/query them) until the project
> publishes a paper, at which point they would make the table publicly
> queryable.
> 
> We do not (yet) put table metadata into the registry so I haven't
> thought that bit through, but probably only public tables should go
> there and I'd probably make it an additional manual step to "publish"
> (to registry) and not just have it triggered by a project admin
> changing a table to public (and then back again a day later).
> 
> If you look at the details of the bulk loading, you see that it is a
> streaming operation that directly inserts rows into the database. There
> is a lot that can go wrong there: transient network failures, an input
> row rejected because of invalid values or a duplicate key, etc. By
> streaming input directly into the tables, the client can look at the
> direct error messages from the insert attempt and can immediately query
> to see the last row that succeeded in order to resume. Any async
> process is going to make that much harder, and very hard to standardise
> so that clients could automatically recover from content failures. It's
> hard to push 500e6 rows into a database table without failures, but
> that's what youcat users do, and with the ability to diagnose and
> resume they can eventually succeed.
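As a sketch of the diagnose-and-resume pattern described above: stream rows,
and on failure query for the last successfully inserted row, then continue
from there. The insert and last-id query below are toy stand-ins for the
real direct-to-database calls; all names are illustrative, not part of any
standard or of the youcat API.

```python
def bulk_load(rows, insert_row, last_loaded_id, max_attempts=10):
    """Stream rows into a table, resuming after each transient failure.

    insert_row(row)   -- stand-in for a direct database insert; may raise
    last_loaded_id()  -- stand-in for e.g. SELECT max(id) FROM mytable
    """
    attempts = 0
    while attempts <= max_attempts:
        resume_after = last_loaded_id()            # find last successful row
        pending = [r for r in rows if r["id"] > resume_after]
        if not pending:
            return attempts                        # everything loaded
        try:
            for row in pending:
                insert_row(row)
        except RuntimeError:
            attempts += 1                          # transient failure: resume
    raise RuntimeError("giving up after repeated failures")

# Toy simulation: a "table" in memory, and an insert that fails once.
table = []
fail_once = {3}                                    # id that fails on first try

def insert_row(row):
    if row["id"] in fail_once:
        fail_once.discard(row["id"])
        raise RuntimeError("transient network failure")
    table.append(row)

def last_loaded_id():
    return max((r["id"] for r in table), default=0)

rows = [{"id": i} for i in range(1, 6)]
attempts_used = bulk_load(rows, insert_row, last_loaded_id)
assert [r["id"] for r in table] == [1, 2, 3, 4, 5]  # all rows eventually landed
```

The point of the sketch is that the resume state lives in the table itself:
the client needs no extra bookkeeping beyond being able to query what it has
already loaded.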
> 
> Our secondary use case is at the other extreme: the DRAO Solar Flux
> Monitor (not yet public/operational). This is a set of instruments that
> record and persist a handful of measurements a few times each day. The
> process is to add a few rows each day, so it is still "append rows to
> table" but at a very small scale, and it never finishes. This use case
> is also very nicely satisfied by our current implementation, which
> allows the client to immediately detect failures and retry and, if they
> are feeling extra cautious, to query for recently ingested measurements
> to verify success.
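The cautious trickle-ingest pattern (append, then query back to verify,
retrying on failure) might look like this minimal sketch; the insert and
query functions are hypothetical stand-ins for TAP calls, and the
measurement values are made up:

```python
def ingest_and_verify(rows, insert_row, query_rows, retries=2):
    """Append a small batch of rows and verify they landed."""
    for _ in range(retries + 1):
        try:
            existing = query_rows()
            for row in rows:
                if row not in existing:        # don't double-insert on retry
                    insert_row(row)
        except RuntimeError:
            continue                           # transient failure: try again
        if all(row in query_rows() for row in rows):
            return True                        # verified: all rows present
    return False

# Toy simulation: an in-memory "table" and an insert that fails once.
store = []
flaky = {"failed": False}

def insert_row(row):
    if not flaky["failed"]:
        flaky["failed"] = True
        raise RuntimeError("network hiccup")
    store.append(row)

def query_rows():
    return list(store)

measurements = [{"seq": 1, "value": 10.0}, {"seq": 2, "value": 11.0}]
ok = ingest_and_verify(measurements, insert_row, query_rows)
assert ok and len(store) == 2
```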
> 
> -- Definitely interested in more use cases for user-generated database
> content...
> 
> ** about vospace
> Both the error handling/failure/resume of real bulk loading and the
> trickle of measurements from sensors benefit from a synchronous
> direct-to-database approach that can be immediately queried via the TAP
> API. We do have a complete vospace service (vault) that could
> accept/stage catalogue content, and we did look at those heady ideas,
> but it is at least as complex or maybe more so. That's the primary
> roadblock for the "vospace" ideas and, as far as I am aware, no one has
> ever made it work. We stopped thinking about that approach during the
> design phase when the list of "vospace magic" things that had to
> happen, and the opaqueness of such a system, grew too large in
> comparison with the vosi-tables + bulk load approach.
> 
> I've kind of skipped the whole topic of indices, but we do have an
> async (uws) job endpoint to run create-index commands either before or
> after bulk loading. We recommend people create a unique key index (pk)
> before loading and other indices after, for the typical bulk-loaded
> catalogue use case. So we are re-using existing APIs where they are
> applicable, but this part looked much more complicated as "vospace
> magic".
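The recommended ordering (unique key index before the bulk load, other
indices after) can be captured as a simple load plan. The step names below
are placeholders for illustration, not actual UWS job parameters:

```python
def plan_catalogue_load(pk_column, other_index_columns):
    """Order the steps for a typical bulk-loaded catalogue.

    Returns (step, column, unique) tuples: unique pk index first,
    then the bulk load, then the remaining (non-unique) indices.
    """
    steps = [("create-index", pk_column, True)]       # unique pk index first
    steps.append(("bulk-load", None, None))           # then stream the rows
    for col in other_index_columns:
        steps.append(("create-index", col, False))    # other indices after
    return steps

plan = plan_catalogue_load("source_id", ["ra", "dec"])
assert plan[0] == ("create-index", "source_id", True)
assert plan[1][0] == "bulk-load"
```

In a real client each step would be submitted as a separate async (uws) job
or bulk-load request against the service.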
> 
> The final thing about "vospace magic" is that for someone who is into
> TAP and catalogues, requiring a vospace implementation in order to get
> user content into a tap service is a big ask. First, it's an
> implementation/deployment/operational burden to require a vospace
> someone might not otherwise want to offer; that's a big barrier to
> adoption. Second, you need either (i) your vospace service connecting
> to your tap database or (ii) some external agent with access to both
> vospace content and the tap database, which has big red bad/monolithic
> architecture flags all over it. That's obviously a personal opinion,
> but I see a lot of tight coupling between two services that are already
> individually complicated to operate, and that's something I want to
> avoid. We also thought a little about simply repurposing some parts of
> the vospace api rather than having a complete vospace for this, but it
> just didn't seem to buy very much here, even where the concepts are the
> same.
> 
> -- Would like to stop hearing about how someone once thought vospace
> could do this :-) unless of course someone wants to show a working
> service and explain how they made it work...
> 
> 
> --
> Patrick Dowler
> Canadian Astronomy Data Centre
> Victoria, BC, Canada
> 
> 
> On Mon, 21 Mar 2022 at 09:51, Dave Morris <dave.morris at metagrid.co.uk>
> wrote:
> 
>> 
>> This is indeed one of the use cases that we had in mind for VOSpace.
>> 
>> A section of space in a VOSpace service where the directory structure
>> maps to the catalog/schema/table hierarchy of a writable database.
>> 
>> Creating a 'file' called 'mytable' in 'mycatalog/myschema' would
>> create a new table.
>> 
>> All of the object construction and access control rules map fairly
>> well onto a virtual directory structure and from a user's perspective
>> it can be made really simple.
>> 
>> To create a new database table, just drag a VOTable file from my
>> desktop into 'mycatalog/myschema', and the service takes care of the
>> rest.
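The path-to-table mapping Dave sketches could be as simple as splitting the
node path into catalog/schema/table coordinates; this is purely
illustrative, not part of the VOSpace specification:

```python
def path_to_table(path):
    """Map a VOSpace node path onto catalog/schema/table coordinates."""
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) != 3:
        raise ValueError("expected a catalog/schema/table path")
    catalog, schema, table = parts
    return {"catalog": catalog, "schema": schema, "table": table}

# Creating a node at this path would, under the proposal, create the table.
assert path_to_table("mycatalog/myschema/mytable") == {
    "catalog": "mycatalog",
    "schema": "myschema",
    "table": "mytable",
}
```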
>> 
>> As a side effect, you get all of the 3rd party asynchronous transfer
>> capabilities needed to transfer a multi-Tbyte result set from one
>> service to another.
>> 
>> Cheers,
>> -- Dave
>> 
>> --------
>> Dave Morris
>> Research Software Engineer
>> Wide Field Astronomy Unit
>> Institute for Astronomy
>> University of Edinburgh
>> --------
>> 
>> On 2022-03-17 07:22, Markus Demleitner wrote:
>> >
>> > The thing that worries me a bit about the current proposal is that
>> > the operations *are* fairly similar to what we offer in VOSpace, and
>> > if we have two rather different APIs for what's straightforwardly
>> > subsumed as remote data management, I think we should have strong
>> > reasons.
>> >
>> > Have you considered employing VOSpace for this?  If so, why did you
>> > discard it?  Could it perhaps be fixed to work for this use case?
>> >
>> 

