Differing registries, and dates

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Thu Apr 1 04:47:36 PDT 2010


Dear registry list,

In checking the output of my registry RSS feed (cf.
http://vo.uni-hd.de/registryrss/q/rss/info), I noticed that some
records I had expected simply didn't show up.  

The source for this data is usually the EuroVO registry, so just to
make sure I checked if I had more luck with the STScI registry.  Sure
enough, the service I looked for was there, but many others that were
in the EuroVO registry were not.

Since my OAI harvesting tool (available at the URL above) needs some
prerequisites, I whipped up a quick shell script that shows what
worries me (xmlstarlet is a useful little utility available at
http://xmlstar.sourceforge.net/ or in Debian's xmlstarlet package):

----------------8<-------------------
#!/bin/sh
# A quick hack to compare queries for various publishing registries.

args="from=2010-03-01T00:00:00&until=2010-03-16T00:00:00"

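# Print how many identifiers an OAI endpoint lists for $args (first
# page only).  Plain "name()" syntax, since "function" is a bashism
# and the shebang asks for /bin/sh.
count() {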
	oaiEndpoint=$1
	oaiURL="$oaiEndpoint?verb=ListIdentifiers&metadataPrefix=ivo_vor&$args"
	wget -qO - "$oaiURL" |\
	xmlstarlet sel -N oai=http://www.openarchives.org/OAI/2.0/ -t \
		-v "count(//oai:identifier)"
}

echo -n "EuroVO "
count http://registry.euro-vo.org/oai.jsp

echo -n "STScI "
count http://nvo.stsci.edu/vor10/oai.aspx
---------------->8---------------------

-- as you can see, it queries a few searchable registries (is there a
good way to enumerate them?) for the identifiers they list between
2010-03-01 and 2010-03-15.

I'm ignoring resumptionTokens, but that doesn't seem to be a problem
-- the results are consistent with what my more refined tool says.
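
For completeness, here is a rough sketch of how the count could follow
resumptionTokens.  It is not the refined tool mentioned above; the
function names fetch_page and count_all are made up for illustration,
and it assumes curl (for URL-encoding the token) and xmlstarlet are
installed.  OAI-PMH treats resumptionToken as an exclusive argument, so
follow-up requests carry only the verb and the token, and an empty or
missing token marks the last page.

```sh
#!/bin/sh
# Sketch: count identifiers over all pages of a ListIdentifiers result.
# fetch_page and count_all are illustrative names, not an existing tool.

OAI_NS="oai=http://www.openarchives.org/OAI/2.0/"

# fetch_page ENDPOINT [TOKEN]: print one page of the response.
# With a token, only verb and resumptionToken may be sent.
fetch_page() {
	if [ -n "$2" ]; then
		curl -s -G "$1" --data-urlencode "verb=ListIdentifiers" \
			--data-urlencode "resumptionToken=$2"
	else
		curl -s -G "$1" --data-urlencode "verb=ListIdentifiers" \
			--data-urlencode "metadataPrefix=ivo_vor" \
			--data-urlencode "from=2010-03-01T00:00:00" \
			--data-urlencode "until=2010-03-16T00:00:00"
	fi
}

# count_all ENDPOINT: add up the per-page counts until no token comes back.
count_all() {
	total=0
	token=""
	while :; do
		page=$(fetch_page "$1" "$token")
		n=$(printf '%s' "$page" | xmlstarlet sel -N "$OAI_NS" -t \
			-v "count(//oai:header/oai:identifier)" 2>/dev/null)
		total=$((total + ${n:-0}))
		token=$(printf '%s' "$page" | xmlstarlet sel -N "$OAI_NS" -t \
			-v "normalize-space(//oai:resumptionToken)" 2>/dev/null)
		# An empty token (or an unparsable page) ends the loop.
		[ -n "$token" ] || break
	done
	echo "$total"
}
```

Usage would be along the lines of
"count_all http://registry.euro-vo.org/oai.jsp".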

The output for that particular script is

EuroVO 49
STScI 4

(on the other hand, STScI has ivo://org.gavo.dc/ppmxl/q/cone as of
now -- that was the service I was missing --, and EuroVO doesn't, so
I'm certainly not bashing the STScI registry).

For comparison, here are the results for from=20XX-03-01&until=20XX-03-15

         EuroVO         STScI
2008      56             35
2009      37             13
2010      44              4

Before I start digging into what's going on here:  Is it just that
I'm being stupid?  Is there something wrong with cross-harvesting, or
is it that I simply expect it to do something it just doesn't do?  In
my RSS builder, should I simply be querying as many searchable
registries as I can and then join by IVORN?  Or harvest the
publishing registries myself?
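
The "join by IVORN" option could be sketched as below, assuming each
registry's identifiers have already been written to a file, one IVORN
per line.  The function names union_ids and only_in are hypothetical,
and the sketch compares IVORNs byte for byte, which may or may not be
the comparison one actually wants.

```sh
#!/bin/sh
# Sketch: combine per-registry identifier lists, keyed by IVORN.
# Input files hold one IVORN per line; all names here are illustrative.

# union_ids FILE...: every IVORN seen in at least one list, once each.
union_ids() {
	LC_ALL=C sort -u "$@"
}

# only_in FILE_A FILE_B: IVORNs present in A but missing from B.
only_in() {
	tmpdir=$(mktemp -d) || return 1
	LC_ALL=C sort -u "$1" > "$tmpdir/a"
	LC_ALL=C sort -u "$2" > "$tmpdir/b"
	# comm -23 keeps only the lines unique to the first file.
	LC_ALL=C comm -23 "$tmpdir/a" "$tmpdir/b"
	rm -rf "$tmpdir"
}
```

With lists saved as, say, eurovo.txt and stsci.txt (made-up file
names), "only_in stsci.txt eurovo.txt" would print records such as
ivo://org.gavo.dc/ppmxl/q/cone that one registry has and the other
lacks, and "union_ids eurovo.txt stsci.txt" would give the merged set.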


Plus... I had suspected some interesting interference between
oai:datestamp and the updated attribute on ri:Resource to be at the
root of this.  This does not seem to be the case, but still: What's
the general opinion on them?

Here's my take:

oai:datestamp is the date the resource record was last changed (in
my system, you say "(re-)publish this resource", and the point in
time that happens becomes oai:datestamp).  This is also the point in
time relevant for the OAI operations ("from", "until").

updated, on the other hand, reflects when the "resource" itself last
changed; my intention here has been to update it only when queries
might yield different results (e.g., new data is ingested into the
underlying database; this becomes somewhat tricky because there are
computed resources for which there is no ingestion, but never mind
the details now).

This is what I gather the intention of OAI-PMH to be -- am I wrong?
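
To see how the two dates relate in practice, something like the
following could print them side by side.  It is only a sketch: the
function name show_dates is made up, it expects a GetRecord or
ListRecords response on stdin, and it assumes the resource is the
single child element of oai:metadata and carries the updated attribute.

```sh
#!/bin/sh
# Sketch: print "identifier datestamp updated" for each OAI record read
# from stdin.  show_dates is an illustrative name; needs xmlstarlet.

show_dates() {
	xmlstarlet sel -N oai=http://www.openarchives.org/OAI/2.0/ -t \
		-m "//oai:record" \
		-v "oai:header/oai:identifier" -o " " \
		-v "oai:header/oai:datestamp" -o " " \
		-v "oai:metadata/*/@updated" -n
}
```

Piping a ListRecords response through show_dates would then show, per
record, whether the harvesting date and the resource's own date
coincide or drift apart.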

Cheers,

        Markus


