Differing registries, and dates
Markus Demleitner
msdemlei at ari.uni-heidelberg.de
Thu Apr 1 04:47:36 PDT 2010
Dear registry list,
In checking the output of my registry RSS feed (cf.
http://vo.uni-hd.de/registryrss/q/rss/info), I noticed that some
records I had expected simply didn't show up.
The source for this data is usually the EuroVO registry, so just to
make sure I checked if I had more luck with the STScI registry. Sure
enough, the service I looked for was there, but many others that were
in the EuroVO registry were not.
Since my OAI harvesting tool (available at the URL above) needs some
prerequisites, I whipped up a quick shell script that shows what
worries me (xmlstarlet is a useful little utility available at
http://xmlstar.sourceforge.net/ or in Debian's xmlstarlet package):
----------------8<-------------------
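#!/bin/sh
# A quick hack to compare queries for various publishing registries.
# Datestamp window passed along with every OAI-PMH request.
args="from=2010-03-01T00:00:00&until=2010-03-16T00:00:00"
# Print the number of identifiers an OAI endpoint lists in that window
# (first page of the response only; resumptionTokens are ignored).
count() {
    oaiEndpoint=$1
    oaiURL="$oaiEndpoint?verb=ListIdentifiers&metadataPrefix=ivo_vor&$args"
    wget -qO - "$oaiURL" |
        xmlstarlet sel -N oai=http://www.openarchives.org/OAI/2.0/ -t \
            -v "count(//oai:identifier)"
}
# printf rather than "echo -n", which not every /bin/sh supports.
printf "EuroVO "
count http://registry.euro-vo.org/oai.jsp
printf "STScI "
count http://nvo.stsci.edu/vor10/oai.aspx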
---------------->8---------------------
-- as you can see, it queries a few searchable registries (is there a
good way to enumerate them?) for the identifiers they list between
2010-03-01 and 2010-03-15.
I'm ignoring resumptionTokens, but that doesn't seem to be a problem
-- the results are consistent with what my more refined tool says.
The output for that particular script is
EuroVO 49
STScI 4
(on the other hand, STScI has ivo://org.gavo.dc/ppmxl/q/cone as of
now -- that was the service I was missing --, and EuroVO doesn't, so
I'm certainly not bashing the STScI registry).
For comparison, here are the results for from=20XX-03-01&until=20XX-03-15:
        EuroVO  STScI
2008        56     35
2009        37     13
2010        44      4
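On the resumptionTokens the quick script ignores: a variant that follows them could look roughly like the sketch below. None of the numbers in this message were produced with it. The names extract_token and count_all are made up for illustration, the token is pulled out with a crude sed pattern that assumes an unprefixed resumptionToken element on a single line, and the token is not URL-encoded, which a real harvester would have to do.

```shell
#!/bin/sh
# Sketch only: count identifiers over all pages of a ListIdentifiers
# response by following OAI-PMH resumptionTokens.
args="from=2010-03-01T00:00:00&until=2010-03-16T00:00:00"

# Print the resumptionToken found in the OAI response on stdin; print
# nothing if there is none or if it is empty (which marks the last page).
# Assumption: unprefixed element, opened and closed on one line.
extract_token() {
    sed -n 's/.*<resumptionToken[^>]*>\([^<]*\)<\/resumptionToken>.*/\1/p'
}

# Print the total number of identifiers an endpoint lists in the window.
count_all() {
    oaiEndpoint=$1
    query="verb=ListIdentifiers&metadataPrefix=ivo_vor&$args"
    total=0
    while [ -n "$query" ]; do
        page=$(wget -qO - "$oaiEndpoint?$query") || return 1
        n=$(printf '%s\n' "$page" | xmlstarlet sel \
            -N oai=http://www.openarchives.org/OAI/2.0/ -t \
            -v "count(//oai:identifier)")
        total=$((total + ${n:-0}))
        token=$(printf '%s\n' "$page" | extract_token)
        # resumptionToken is an exclusive argument: follow-up requests
        # carry only the verb and the token (not URL-encoded here).
        if [ -n "$token" ]; then
            query="verb=ListIdentifiers&resumptionToken=$token"
        else
            query=""
        fi
    done
    echo "$total"
}
```

Calling count_all http://registry.euro-vo.org/oai.jsp would then stand in for the count call in the quick script.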
Before I start digging into what's going on here: Is it just that
I'm being stupid? Is there something wrong with cross-harvesting, or
is it that I simply expect it to do something that it just doesn't do? In
my RSS builder, should I simply be querying as many searchable
registries as I can and then join by IVORN? Or harvest the
publishing registries myself?
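To make the "join by IVORN" option concrete, here is a minimal sketch. It assumes each registry's identifiers have already been dumped to a file with one IVORN per line; the function names and the file names in the usage note are hypothetical, and nothing here is taken from my RSS builder.

```shell
#!/bin/sh
# Sketch only: combine identifier lists from several registries by IVORN.
# Each input file is assumed to contain one IVORN per line.

# Print every IVORN that occurs in at least one of the given files,
# once each.
union_ivorns() {
    LC_ALL=C sort -u "$@"
}

# Print the IVORNs that are in file $1 but not in file $2.
# -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns
# from file.  The exit status is grep's, so it is 1 if nothing is missing.
only_in_first() {
    LC_ALL=C sort -u "$1" | grep -vxFf "$2"
}
```

With lists saved as, say, eurovo.txt and stsci.txt, union_ivorns eurovo.txt stsci.txt would give the joined list, and only_in_first stsci.txt eurovo.txt would show what STScI has that EuroVO lacks.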
Plus... I had suspected some interesting interference between
oai:datestamp and the updated attribute on ri:Resource to be at the
root of this. This does not seem to be the case, but still: What's
the general opinion on them?
Here's my take:
oai:datestamp is the date the resource record was last changed (in
my system, you say "(re-)publish this resource", and the point in
time that happens becomes oai:datestamp). This is also the point in
time relevant for the OAI operations ("from", "until").
updated, on the other hand, reflects when the "resource" itself last
changed; my intention here has been to update it only when queries
might yield different results (e.g., new data is ingested into the
underlying database; this becomes somewhat tricky because there are
computed resources for which there is no ingestion, but never mind
the details now).
This is what I gather the intention of OAI-PMH to be -- am I wrong?
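Under this reading, whether a record shows up in a selective harvest depends on oai:datestamp alone; updated plays no part. A minimal sketch of that check follows, with a hypothetical function name and the assumption that all three timestamps are UTC strings in the same format, so that plain string comparison orders them correctly:

```shell
#!/bin/sh
# Sketch only: would a record with this oai:datestamp be returned for a
# given from/until window?  OAI-PMH treats both bounds as inclusive.
# Assumption: all arguments use the same UTC timestamp format.
in_window() {
    # $1 = oai:datestamp, $2 = from, $3 = until
    LC_ALL=C awk -v d="$1" -v f="$2" -v u="$3" \
        'BEGIN { exit !(f <= d && d <= u) }'
}
```

For example, in_window 2010-03-10T12:00:00Z 2010-03-01T00:00:00Z 2010-03-16T00:00:00Z succeeds no matter what the record's updated attribute says.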
Cheers,
Markus