Using 'google' as a data discovery repository

Norman Gray norman at astro.gla.ac.uk
Thu Oct 25 13:07:32 PDT 2012


Greetings.

During the discussion after Arnold's talk this afternoon, I suggested using 'google' as a data discovery repository. This sounds eccentric, so I should send a few pointers.

Prefatory remark 1: I think that a model of provenance for astronomical data is surely a good thing, though we should make sure that we're building on the considerable existing work in this area, rather than duplicating it.

Prefatory remark 2: It's probably useful to distinguish a preservation model from a discovery one.  The former is a model for documenting a dataset thoroughly enough for long-term preservation; this might include a variety of OAIS and related data-management and preservation considerations.  The latter might be a much lighter-weight thing, intended simply to let users _discover_ a source of data, without trying to be something that would do fine-grained selection.  The discovery model might well be a subset, or a high-level profile, of the preservation model, but the Dublin Core metadata set (for example) is surprisingly flexible and widely used, and _might_ be a better thing to build on.

Anyway: 'google'…

There is a great deal of work being done on making large and highly-structured data collections straightforwardly discoverable.  This is in the field of e-commerce, where online suppliers are _extremely_ concerned with the problem of helping you, looking for widgets, find _their_ widget shop.

  * GoodRelations <http://www.heppnetz.de/projects/goodrelations/> is a vocabulary for describing products, which is designed to be interwoven in the human-facing HTML of an online store, in such a way that it can be harvested _and understood_ by Google.  The underlying technology here is RDFa -- GoodRelations is an RDF model, which looks broadly similar to the model which Arnold described, but specialised to for-sale objects.

  * A similar thing is <http://schema.org>, which is a collection of 'microformats' -- not restricted to e-commerce -- which are intended to be embedded into web pages, with the same intention: 'google' crawls the pages, interprets the embedded information, and is therefore able to better 'understand' what is on the page, and so find it more effectively.  (Strictly, schema.org is a set of shared vocabularies, most often embedded as HTML5 microdata; I'm using 'microformats' loosely for this whole family.)  'Microformats' are the same basic idea as the RDFa which GoodRelations sits on (I tend to think of RDFa as 'microformats done right', but perhaps that's just my biases).  The 'microformats' idea is lighter-weight, but more vaguely specified.
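To make the embedding idea concrete, here is a minimal sketch in Python of what 'harvested and understood' could mean for a dataset page.  It assumes RDFa Lite attributes (`vocab`, `resource`, `property`) with Dublin Core terms as the vocabulary; the page, its URL, the dataset and the `PropertyHarvester` class are all invented for illustration, and a real crawler would use a proper RDFa processor rather than this naive one.

```python
from html.parser import HTMLParser

# Hypothetical dataset landing page: the URL, names and wording are invented.
# The markup is ordinary human-facing HTML with RDFa Lite attributes woven in.
PAGE = """
<div vocab="http://purl.org/dc/terms/" resource="http://example.org/data/survey-x">
  <h1 property="title">Example Sky Survey, data release 1</h1>
  <p property="description">Optical photometry of two million sources.</p>
  <p>Published by <span property="publisher">Example Observatory</span>.</p>
</div>
"""


class PropertyHarvester(HTMLParser):
    """Collect the text of elements that carry a 'property' attribute.

    Deliberately naive: it ignores nested elements, prefixes and 'content'
    attributes, all of which a real RDFa processor would handle.
    """

    def __init__(self):
        super().__init__()
        self.vocab = None
        self.properties = {}
        self._current = None  # name of the property being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "vocab" in attrs:
            self.vocab = attrs["vocab"]
        if "property" in attrs:
            self._current = attrs["property"]
            self.properties[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


def harvest(html):
    """Return (vocabulary URI, {property name: text}) for one page."""
    parser = PropertyHarvester()
    parser.feed(html)
    parser.close()
    return parser.vocab, {k: v.strip() for k, v in parser.properties.items()}


vocab, props = harvest(PAGE)
for name, value in props.items():
    print(vocab + name, "=", value)
```

The point is that the reader of the page sees a normal heading and two paragraphs, while a crawler sees three labelled statements about the dataset (its title, description and publisher), each identified by a Dublin Core term.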

The scenario is a decentralised one.  A repository would advertise its wares by exposing _structured_ content describing them, in a lightweight discovery model.  'Google' crawls this and indexes it, whereupon a user can simply 'google' for the sort of data she's looking for and find it.  Once found, the user might do a variety of things, including use the data (if she's a scientist), mirror it (if she's an archive) or index it (if she's ADS), and in each case she'd probably use the fuller provenance model rather than the light discovery model.
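The crawl-index-search loop in this scenario can be sketched in a few lines of Python.  This is a toy under stated assumptions, not a description of how any real search engine works: the `WEB` dictionary stands in for the web, the three sites and their contents are invented, and the `Harvester`, `crawl` and `search` names are hypothetical.  Only text marked up with a `property` attribute is indexed, so a page without structured content is simply not found.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Stand-in for the web: page URL -> HTML.  All sites and contents are invented.
WEB = {
    "http://archive-a.example.org/xray": (
        '<div vocab="http://purl.org/dc/terms/">'
        '<h1 property="title">X-ray source catalogue</h1>'
        '<p property="subject">galaxy clusters</p></div>'
    ),
    "http://archive-b.example.org/radio": (
        '<div vocab="http://purl.org/dc/terms/">'
        '<h1 property="title">Radio continuum survey</h1>'
        '<p property="subject">galaxy clusters, pulsars</p></div>'
    ),
    # A page with no structured description at all.
    "http://shop.example.com/widgets": "<h1>Widgets for sale</h1>",
}


class Harvester(HTMLParser):
    """Collect the text of every element that has a 'property' attribute."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        self._inside = "property" in dict(attrs)

    def handle_data(self, data):
        if self._inside:
            self.values.append(data)

    def handle_endtag(self, tag):
        self._inside = False


def words_of(text):
    """Lower-case a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crawl(web):
    """Visit every page and build an inverted index: word -> set of URLs."""
    index = defaultdict(set)
    for url, html in web.items():
        harvester = Harvester()
        harvester.feed(html)
        harvester.close()
        for value in harvester.values:
            for word in words_of(value):
                index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs whose structured content has every query word."""
    words = words_of(query)
    if not words:
        return []
    return sorted(set.intersection(*(index.get(w, set()) for w in words)))


index = crawl(WEB)
print(search(index, "galaxy clusters"))
print(search(index, "pulsars"))
```

Here the two archives never register with anyone: each just publishes its page, and the index is built entirely by the crawler.  Searching for "galaxy clusters" returns both archive pages, and "pulsars" returns only the second.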

Qualifications:

I don't think this is yet trivial, because I'm not sure precisely which microformats 'google' does and does not interpret.  But I know where to find out more about this.

I've put 'google' in scare-quotes above, because I don't want to give the impression that this is somehow handing the keys to the kingdom over to Google.  The 'google' here is any search engine which is smart enough to use things like schema.org microformats, and that's a larger set than Google Inc, and so is potentially longer-term.

Also, this approach doesn't _preclude_ setting up an astronomy- or IVOA-specific repository; it might be that such a service could do discipline-specific search better than 'google' can.  But by designing the discovery mechanism so that it's manifestly web-style and web-scale, we give ourselves maximum flexibility to let 'google' do what 'google' does best, for possibly minimal cost.

Best wishes,

Norman


-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK


