building a search engine

Patrick Dowler patrick.dowler at nrc-cnrc.gc.ca
Wed Oct 19 11:06:30 PDT 2005


Yes - I agree with everything Markus said... had to happen eventually :-)

I'm not sure how practical making PUBID and/or CREATORID a comma-separated
list will be... it won't scale very far, but it will be sufficient to coalesce some 10s of
actions into a single action (getting records for IDs), which may well be good enough.
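
A minimal sketch of that coalescing step, assuming the limit is the length of
an HTTP GET parameter value. The function name batch_ids and the default
budget of 1500 characters are illustrative, not anything from the SSA spec.

```python
def batch_ids(ids, max_chars=1500):
    """Group IDs into comma-separated strings of at most max_chars each.

    An ID that is longer than max_chars on its own still gets a batch
    to itself rather than being dropped.
    """
    batches = []
    current = []   # IDs in the batch being built
    length = 0     # length of ",".join(current)
    for pub_id in ids:
        # One extra character for the comma if the batch is non-empty.
        extra = len(pub_id) + (1 if current else 0)
        if current and length + extra > max_chars:
            batches.append(",".join(current))
            current, length = [], 0
            extra = len(pub_id)
        current.append(pub_id)
        length += extra
    if current:
        batches.append(",".join(current))
    return batches
```

Each returned string would be sent as one PUBID value, so a few dozen IDs
become one request. URL encoding can lengthen the value, so the budget should
leave some headroom.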

Another thing Doug mentioned that I forgot about is the need to be able to handle
arbitrarily large query results. It is pretty hard to deal with arbitrarily large XML files, both
writing and reading them. Currently, the service can just truncate output and an SE
builder would have a hard time knowing whether they had completely scanned the service content.

Having written programs to harvest metadata from our own (other) databases, the
generally useful pattern is to harvest in order of increasing mtime. So, if an SE did a
query like MTIME=t1,&TOP=1000 to get the oldest records with mtime >= t1, it
could gradually harvest all the records with repeated queries just by advancing t1.
This would work assuming that using TOP and MTIME meant getting the oldest
records. Once the SE had completely harvested a service, it could keep up to date
by doing this query periodically with a min mtime equal to the last time it checked the
service (to get new/changed records).
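
The harvesting loop described above could look roughly like this. It is a
sketch under assumptions: query(t1, top) is a hypothetical callable standing
in for an MTIME=t1,&TOP=top request, it is assumed to return the oldest
matching records first, and each record is assumed to be a dict with "pubid"
and "mtime" keys.

```python
def harvest(query, t1, top=1000):
    """Yield every record with mtime >= t1, oldest first.

    query(t1, top) is assumed to return at most `top` records with
    mtime >= t1, sorted by increasing mtime.
    """
    seen = set()  # (pubid, mtime) pairs already yielded at the boundary t1
    while True:
        page = query(t1, top)
        for rec in page:
            # MTIME=t1, means >= t1, so boundary records come back again.
            if (rec["pubid"], rec["mtime"]) not in seen:
                yield rec
        if len(page) < top:
            return  # short page: nothing older is left to fetch
        last = page[-1]["mtime"]
        if last == t1:
            # A full page of identical mtimes: advancing t1 cannot help.
            raise RuntimeError("TOP or more records share one mtime")
        seen = {(r["pubid"], r["mtime"]) for r in page if r["mtime"] == last}
        t1 = last
```

To keep up to date afterwards, the caller would remember the largest mtime it
has seen and start the next run from there. The sketch stops with an error if
TOP or more records share a single mtime, since advancing t1 cannot get past
them.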

So, could this interpretation of using MTIME and TOP (order by MTIME) be included in
the spec explicitly? I don't foresee any difficulty in implementing it...
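
On the service side, the proposed interpretation might be sketched as below.
The names parse_range and select are hypothetical, and the sketch assumes
mtime values are timestamps in one fixed ISO 8601 format, so that string
comparison matches time order. It accepts the three range forms discussed in
this thread (MTIME=lo,hi, MTIME=lo, and MTIME=,hi) and treats both bounds as
inclusive.

```python
def parse_range(value):
    """Split 'lo,hi', 'lo,' or ',hi' into (lo, hi); a missing bound is None."""
    lo, sep, hi = value.partition(",")
    if not sep:
        raise ValueError("expected a range of the form lo,hi or lo, or ,hi")
    return (lo or None, hi or None)


def select(records, mtime, top=None):
    """Return matching records ordered by increasing mtime, oldest first.

    With `top` set, only the `top` oldest matches are returned, which is
    the behaviour a harvester relies on.
    """
    lo, hi = parse_range(mtime)
    hits = [
        r for r in records
        if (lo is None or r["mtime"] >= lo)
        and (hi is None or r["mtime"] <= hi)
    ]
    hits.sort(key=lambda r: r["mtime"])
    return hits if top is None else hits[:top]
```

The point of the sketch is that truncation happens after ordering by mtime:
a client that gets TOP records knows it has the oldest ones and can continue
from the last mtime it received.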

Pat

PS - From the search engine point of view, services that generate products on the fly
aren't useful to re-index because in theory they have a response for every query and
thus an infinite number of "virtual records" to index...

On 19.10.2005 05:05, Markus Dolensky wrote:
> Hi,
> 
> Before commenting on Pat's search engine use case here's where one can 
> find the latest info:
> DAL presentations of the respective interop session at ESAC are here 
> http://www.ivoa.net/twiki/bin/view/IVOA/InterOpOct2005DAL
> - many thanks to the authors for promptly providing them. This includes 
> the minutes with action items related to Pat's proposal
> http://www.ivoa.net/internal/IVOA/InterOpOct2005DAL/dal_20051007.txt
> Finally, note that Francesco has added the sample files of his demos.
> 
> 
> Patrick Dowler wrote:
> > In Madrid I brought up the topic of having a "last modification time" on
> > records returned from SSA and SIA. The intent is to allow querying on this to
> > get new or changed records - something needed to build a search engine,
> > for example. 
> 
> My perception when adding your idea to the DAL minutes was that a query 
> parameter MTIME=<interval> and a corresponding output parameter was 
> generally considered an excellent enhancement and it's merely a matter 
> of agreeing how to do it.
> 
> 
> > 1. unique identifier that could be used sometime later to get the AccessReference
> > (ie to get the data or let a user get the data): 
> > 
> > - publisher ID is tied to the specific service, so one would need to keep the tuple of 
> > <resourceID, pubID> where resourceID lets you find the same service in the registry 
> > and pubID lets you find the record within that service.... Correct?
> 
> There is an action to clarify the meaning of CREATORID and PUBID since 
> Doug and Jonathan had slightly different expectations. Therefore, I'd 
> like to ask them to agree on a (uniform) answer to point #1.
> 
> 
> > 2. a globally unique "dataset ID" could be used, but the SE would still need to know
> > which service(s) can deliver the record and data... plus specific implementations of a
> > SE might need specific things from the record not supplied by everyone that can deliver
> > the dataset (eg. I need spatial support, time bounds, and energy bounds to build my 
> > search engine - someone else might need more or less).... 
> > 
> > To support an SE, "mtime" needs to be a query parameter of the form mtime=MIN,MAX
> > with support for mtime=MIN, (for >=) and it has to be part of each record on output. Personally
> > I would like to see these as REQUIRED.
> 
> In general, this is how such range conditions should be specified:
> example1: MTIME=lo,hi  # bounded range
> example2: MTIME=lo,    # bigger or equal to lo
> example3: MTIME=,hi    # smaller than or equal to hi
> 
> 
> > ** using/getting AccessReference
> > 
> > In addition, if I build an SE that stores <resourceID,pubID> then I will also like to have a
> > fast way to convert them into AccessReference (URLs). I'm assuming the AccessReference 
> > one gets from the query is currently valid but not guaranteed to be valid indefinitely (publishers
> > may want/need to change data delivery, which I don't think should mandate changing 
> > the modification time). Specifically, it would be nice to be able to pass a list of pubID values to
> > a service and get one response, rather than have to issue separate queries and get one response
> > (VOTable) per pubID with one record each. With http get, the length of the list would be limited, of
> > course. 
> 
> > Logically, an SE will need pubID as a REQUIRED query and output parameter. List
> > support is an optimisation.
> 
> Unless there are objections I'll turn the parameter specification of 
> PUBID and CREATORID into type 'comma separated list' in the SSA 
> interface doc. This again requires a final word on the meaning of the 
> two parameters. Presumably chances are dim that this will break already 
> existing services(?)
> 
> Let me try to work out what REQUIRED means in this context:
> A service needs to recognize query parameter MTIME. If there is no MTIME 
> value - for instance, because a mosaic is computed on the fly  (=> 
> virtual data) - then the service must not produce an error but ignore 
> MTIME(?).
> 
> - Markus
> 
> 

-- 
Patrick Dowler
Tel/Tél: (250) 363-6914                  | fax/télécopieur: (250) 363-0045
Canadian Astronomy Data Centre   | Centre canadien de donnees astronomiques
National Research Council Canada | Conseil national de recherches Canada
Government of Canada                  | Gouvernement du Canada
5071 West Saanich Road               | 5071, chemin West Saanich
Victoria, BC                                  | Victoria (C.-B.)


