Question: harvesting managed vs. all resource records

Ray Plante rplante at ncsa.uiuc.edu
Mon Apr 4 14:40:44 PDT 2005


Hey Kevin,

On Mon, 4 Apr 2005, KevinBenson wrote:
> As you say on your wiki page Ray, you can discover who the curator is by the
> Registry type of who is managing that authority id, so I am not quite sure
> what the "harvestFrom" gains you.  

In principle, I admit the difference is probably subtle, but in practice,
it can make a noticeable difference.  Here's what I think harvestFrom
gains you:

  o  You don't have to do an additional query to find out where the record 
     came from.  

  o  You are protected against the possibility that the Registry record 
     is either not up to date (i.e. doesn't contain the authority ID) or 
     is otherwise inconsistent (e.g. corrupted or missing).

  o  You can trace records that make multiple harvesting stops.  Note that 
     what is recorded in the Registry record is not exactly what 
     harvestFrom holds.  The latter will be the registry that the 
     harvester got the record from.  That registry may have gotten that 
     record from another registry (which would happen if the harvester 
     grabs all records, rather than just the managed ones).  

     We noticed some cases in the NVO in which the records exported by a 
     registry are not exactly what was originally published (and we're 
     talking about the resource metadata here).  Tracking down a problem 
     like this would benefit from harvestFrom if the record actually makes 
     multiple hops from its originator.  
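
The third point can be sketched in a few lines of Python.  This is only an 
illustration under stated assumptions: the class HarvestedRecord, the field 
harvested_from (standing in for harvestFrom), the helper names, and the 
example identifiers are all hypothetical, not part of any existing registry 
code.  It shows how a harvester that stores where it fetched each record 
could spot records that took an extra hop, without a second query:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HarvestedRecord:
    identifier: str      # hypothetical IVOA identifier, "ivo://authority/path"
    harvested_from: str  # registry this copy was fetched from (harvestFrom)
    metadata: str        # resource metadata as received


def authority_of(identifier: str) -> str:
    """Extract the authority ID: "ivo://authority/path" -> "authority"."""
    return identifier.split("://", 1)[1].split("/", 1)[0]


def made_extra_hop(record: HarvestedRecord,
                   managing_registry: Dict[str, str]) -> Optional[bool]:
    """Compare the stored source with the registry managing the authority ID.

    Returns True if the record was fetched from some other registry,
    False if it came straight from the managing registry, and None if
    the managing registry is unknown (e.g. its Registry record is
    missing or out of date).
    """
    manager = managing_registry.get(authority_of(record.identifier))
    if manager is None:
        return None
    return manager != record.harvested_from


# Assumed lookup table built from Registry records: authority ID -> registry.
managers = {"example.auth": "ivo://registry.a/registry"}

rec = HarvestedRecord("ivo://example.auth/res1",
                      "ivo://registry.b/registry",
                      "<resource/>")
hop = made_extra_hop(rec, managers)
```

Here the record is managed by one registry but was fetched from another, so 
the stored source alone is enough to flag it for comparison against the 
originally published copy.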

I think the fact that two working registries felt compelled to record this 
information internally suggests that it's a good idea.  

> Now we do need to talk about the notion
> again of <ownedAuthority> but that is later (this deals with full-full
> harvesting only so we don't keep harvesting every registry around).  

Agreed.  We should bring this up in a separate thread.

> xs:date to my knowledge is okay with time values and in fact astrogrid does
> it with a "time" with a "Z" ending and xerces seems to be okay with it.  So
> I think date should be okay, we probably should make sure status and updated
> are required attributes; possibly created as well.

Technically, including time in an xs:date is not correct.  Given your 
practice, I'll put supporting dateTime on the list of proposed changes to 
VOResource.  It will be backward-compatible.  
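
To make the distinction concrete, here is a rough Python sketch that 
classifies a value as an xs:date or an xs:dateTime by its lexical form.  
The patterns are deliberately simplified assumptions (no negative or 
five-digit years, no range checks on month, day or hour), and the function 
name lexical_type is made up for this example.  It is not a substitute for 
a validating parser:

```python
import re

# Optional timezone suffix, allowed on both types: "Z" or "+hh:mm" / "-hh:mm".
_TZ = r"(Z|[+-]\d{2}:\d{2})?"

# Simplified lexical forms (assumption: four-digit, non-negative years only).
XS_DATE = re.compile(r"\d{4}-\d{2}-\d{2}" + _TZ)
XS_DATETIME = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?" + _TZ)


def lexical_type(value: str) -> str:
    """Return "date", "dateTime" or "invalid" for the given string."""
    if XS_DATE.fullmatch(value):
        return "date"
    if XS_DATETIME.fullmatch(value):
        return "dateTime"
    return "invalid"


plain_date = lexical_type("2005-04-04")
with_time = lexical_type("2005-04-04T14:40:44Z")
```

A value with a time part only matches the dateTime pattern, which is why 
accepting both forms would keep existing date-only records valid.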

> Also I am now coming around on OAI sets, originally I was not too keen on
> them, and thought you could just do everything with ListRecords, but I do
> see where using a set to get everything the first time could be very good
> and is probably not too hard to implement plus adding oai_managed set would
> be just as easy.  I do think ListRecords need to only be managed Resources
> each time though.

Could you clarify this last sentence?  I think I hear you say that you're 
okay with defining a standard set called "ivo_managed" to just get the 
managed resources; is that right?  This could be used as an 
argument to ListRecords (as well as ListIdentifiers).  If no set argument 
were provided, all records would be returned.  In practice then, IVOA 
harvesters would usually provide set=ivo_managed as an argument to 
ListRecords.  Is this consistent with what you are thinking?
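
As a sketch of what that would look like on the wire, the Python below 
builds the two OAI-PMH request URLs.  The verb, metadataPrefix and set 
arguments are standard OAI-PMH; the base URL and the "ivo_vor" prefix value 
are placeholders assumed for the example, and "ivo_managed" is the set name 
proposed above, not something registries are known to support yet:

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str,
                     managed_only: bool = True) -> str:
    """Build an OAI-PMH ListRecords request URL.

    With managed_only=True the proposed "ivo_managed" set is requested;
    without a set argument the registry would return all of its records.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if managed_only:
        params["set"] = "ivo_managed"
    return base_url + "?" + urlencode(params)


# Hypothetical registry endpoint and metadata prefix.
BASE = "http://registry.example.org/oai"

managed_url = list_records_url(BASE, "ivo_vor")
everything_url = list_records_url(BASE, "ivo_vor", managed_only=False)
```

The same set argument would be passed to ListIdentifiers in the same way.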

cheers,
Ray




