new take on resource registration best practice

Thu Oct 24 05:32:00 PDT 2013

Dear Reg-WG,

On Thu, Oct 24, 2013 at 10:04:01AM +0100, Mark Taylor wrote:
> On Wed, 23 Oct 2013, Markus Demleitner wrote:
> 
> > One possibility still is that we do nothing in VOResource.  Under the
> > assumption that there's not going to be thousands of "federated"
> > services, maybe clients could cope with resolving relationships by
> > just memoizing the most common federated services?  Maybe queries
> > against the original VOResource DM can be made natural enough that
> > this can work?  I believe the three-worlds approach I've described in
> 
> can you explain in a bit more detail how this memoizing
> (? if that's a technical term with non-obvious
> connotations I've missed them) might work?

Well, "memoizing" is a bigger word than I perhaps should have used.
The basic idea is to avoid having to query the registry for "common"
services by keeping some sort of cache on them.  "Cache" is the catch
phrase here, since of course the question is how to decide what to
cache and when to discard the cache.  This is complicated enough to
make me doubt many people will like this.

> Is the idea that one could make a RegTAP query to identify the
> (small number of) large federated TAP services, or that such a
> list is hard coded into clients?  What needs to get memoized,

What I could see is a query to rr.relation; clients could then be
free to decide whether to cache resources that serve more than 5 or
20 or whatever data collections.
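
To make that a bit more concrete, here is a minimal sketch of the
client-side decision, assuming the rows have already been fetched from
rr.relation.  The row layout (collection id, relationship type, service
id), the example ivoids, and the function name are made up for
illustration; only the "served-by" relationship type and the threshold
idea come from this thread.

```python
from collections import Counter

# Hypothetical rows as a client might get them back from a query against
# rr.relation: (ivoid of the data collection, relationship type,
# ivoid of the related service).  All values are invented.
ROWS = [
    ("ivo://example/coll1", "served-by", "ivo://example/bigtap"),
    ("ivo://example/coll2", "served-by", "ivo://example/bigtap"),
    ("ivo://example/coll3", "served-by", "ivo://example/bigtap"),
    ("ivo://example/coll4", "served-by", "ivo://example/smalltap"),
    ("ivo://example/coll4", "mirror-of", "ivo://example/other"),
]


def services_worth_caching(rows, min_collections=5):
    """Return ivoids of services that serve at least min_collections
    distinct data collections, most-used first."""
    served = Counter()
    seen = set()
    for collection, rel_type, service in rows:
        # Only served-by relations count, and each pair counts once.
        if rel_type != "served-by" or (collection, service) in seen:
            continue
        seen.add((collection, service))
        served[service] += 1
    return [svc for svc, n in served.most_common() if n >= min_collections]
```

With the default threshold of 5 nothing in ROWS qualifies; a client
that prefers a lower bar would call, say,
services_worth_caching(ROWS, min_collections=3).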

> a large chunk of VOResource for each one or just the service URL?

I believe the service URL and maybe a resource name would be plenty.
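
As a rough sketch of what such a cache could look like (class names,
the one-week default lifetime, and the injectable clock are all
assumptions of mine, not anything specified anywhere), a client would
keep just the access URL and the name per ivoid and throw entries away
once they are too old:

```python
import time
from dataclasses import dataclass


@dataclass
class CachedService:
    access_url: str
    name: str
    fetched_at: float  # value of the clock when the entry was stored


class ServiceCache:
    """Keeps access URL and name of "common" services for a limited time."""

    def __init__(self, max_age_seconds=7 * 24 * 3600, clock=time.time):
        self.max_age = max_age_seconds
        self.clock = clock  # injectable so that expiry can be tested
        self._entries = {}

    def put(self, ivoid, access_url, name):
        self._entries[ivoid] = CachedService(access_url, name, self.clock())

    def get(self, ivoid):
        """Return the cached entry, or None if it is missing or too old
        (in which case the client has to ask the registry again)."""
        entry = self._entries.get(ivoid)
        if entry is None:
            return None
        if self.clock() - entry.fetched_at > self.max_age:
            del self._entries[ivoid]
            return None
        return entry
```

The hard part mentioned above, deciding on a sensible lifetime, is of
course not solved by this; the sketch only shows where that decision
would plug in.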

While I'm speaking, let me take the liberty of commenting on some of
Marco's remarks in
<CABiOC765D8gLNG-czdd_CPnjRa56O8eL6YDZUsFYh8HzukFDKQ at mail.gmail.com>:

>> This is true to some extent -- but of course the design of the
>> registry data model (and its actual usage) has to keep both the needs
>> of the publishers and the needs of the searching users in mind.  If
>> these needs are so different that, in effect, two different data
>> models (i.e., we shove around pieces of information on resource
>> ingestion) are necessary, so be it.
>
> I'm not sure I'm getting completely the meaning of this distinction.
> Considering the registry data model as a unique, the requirements
> from the searchers and those from the publisher should live in the
> same space. I consider the distinction as an interfaces-related
> topic. The caveat, seems to me, is that the model has to allow for

...except that the interfaces of course expose *some* sort of model,
even if you go some way towards bag-of-words -- e.g., is there a
concept of publisher that I could search for?  And yes, there is of
course also the question of how to get from "physical" metadata to an
actual, usable computer interface.

So, if we define an interface that lets users find capabilities in
resources that don't have them on the publisher side, that's defining
a new data model.

I think it's fair to say that capabilities of federating services in
data collections (in one way or the other) are "good" for the users.
The question is: are they "bad" for the publishers or is Marco right
with his suggestion that the two interests aren't that far apart?

With my publisher hat on I'd say I probably wouldn't mind including
such capabilities in my (so far) data collection records.  But that's
anecdotal, and I haven't even implemented that.

> I agree it could have a minimal impact, but I think it could mess
> up things at registry maintenance.
> I mean, you add cross-resource manipulation at ingestion step:
> isn't this adding a point of failure, specially if you consider mirrored
> resources?

I've thought about this when I brooded over the data collection issue,
and you're right, these things become fairly messy with incremental
harvesting, in particular if you consider both service-for and
served-by relationships.  They become messy enough that I've not even
considered proposing this as a solution for RegTAP.

Without incremental harvesting, it's feasible but obviously ugly.

>> Therefore, I'd say these "served-by" capabilities should have special
>> standardIds (maybe just the normal standard ids with "?service-for"
>> appended?).
>>
>
> I'd prefer something that does not require parsing (am I monotonous?),
> but the idea of clearly stating the "service-for" I think would be useful
> for clients.

This is not meant for parsing.  Clients would search either for "data
served by TAP services" (with ?service-for) or "TAP services"
(without ?service-for).  In both cases they'll use the full string
opaquely.
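
A minimal sketch of what "opaquely" means here, assuming a client
already holds (ivoid, standardId) pairs from some capability listing.
The example ivoids and the function name are invented, and the
"?service-for" form is only the proposal from this thread, not an
existing standard id:

```python
TAP_STD = "ivo://ivoa.net/std/TAP"
# Proposed in this thread only; not a registered standard id.
TAP_SERVICE_FOR = TAP_STD + "?service-for"

# Hypothetical (resource ivoid, capability standardId) pairs.
CAPABILITIES = [
    ("ivo://example/bigtap", "ivo://ivoa.net/std/TAP"),
    ("ivo://example/coll1", "ivo://ivoa.net/std/TAP?service-for"),
    ("ivo://example/coll2", "ivo://ivoa.net/std/TAP?service-for"),
]


def resources_with(standard_id, capabilities):
    """Return ivoids whose capability standardId equals standard_id.

    The comparison is on the full string; nothing is split off or
    interpreted, so "?service-for" is never parsed by the client."""
    return [ivoid for ivoid, std in capabilities if std == standard_id]
```

Searching with TAP_STD then yields the TAP services themselves, while
searching with TAP_SERVICE_FOR yields the data served by them.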

The point of mogrifying existing ivo-ids is to avoid having to
register these names separately, and also to indicate to tech-savvy
humans (who can't avoid parsing) what kind of services the publishers
are talking about without having to learn a new set of ids.

> alongside best practice with IVORNs? New publishers entering the
> VO may find it useful to have some guidelines for it.
> Is this only a dream of mine?

I'd help write such a thing, but I'll not push it along, I'm
afraid...

> The second is only a question about the "case 2: data repository".
> Shouldn't each collection in it have a "part-of" relationship
> to the repository DataCollection2, like it happens with Data Center
> individual mission resources?
> If not, can you explain me why? (probably my fault, but I cannot see it).

I'm not a big fan of noting down purely physical or political
circumstances ("A and B are residing in the same data center") in
this way -- I don't see much of a use case for this information.

If, on the other hand, we're talking about "Data products from
Smartsat" -- then yes, I think there should be a part-of
relationship.  Do we have resources that would need this sort of
setup?

Finally, let me comment on something Ray said in
<alpine.DEB.2.00.1310230920040.5536 at epte>:

> There is a question, though, as to whether this kind of smarts needs
> to be widely applied across all searchable registries or let this just
> be a custom, value-added feature a registry might implement.

and again

> Will we need to *require* a registry to implement the ingestion
> behavior you described?  Is it better to get the original resource
> record in an optimal state to begin with?

Since the effects of whether or not that happens are visible to the
registry client, I am completely convinced that yes, this must be
mandated.  Maybe not in a REC initially, but clients must be able to
rely on this when posting queries, or we'll see the hell of
registries each giving a different view of the VO that we have had in
the past couple of years multiplied by... ah, well, whatever you
multiply mythical places by.

Cheers,

          Markus