building a search engine

Patrick Dowler patrick.dowler at nrc-cnrc.gc.ca
Wed Oct 19 15:05:11 PDT 2005


I thought about the "next" idea as it is common usage in webapps (as you mentioned)
but it also has issues. If you look at the link to the second page of results on
google, for example, it is the same query with a "start=10" thrown in... but that
approach requires that the google query return results in the same order every time - 
that is, you still end up with an ordering constraint on the implementation.

I'm not really crazy about defining how TOP and MTIME queries interact in a
special way... TOP already has the implicit ORDER BY SCORE associated with it.
I hesitate to suggest ORDER=MTIME because it starts to look like we're trying to
shove SQL onto the url param string :-) 

As for staging query results for later traversal, I'm not sure services would like to
implement that kind of thing. We are generally trying to do things in a stream as
much as possible because it eases resource contention and management issues.

My experience in implementing this kind of harvesting is that to do it optimally
(be able to gradually traverse a large collection and also sync changes) I always
ended up ordering by mtime: "give me the oldest N things that are newer than T".
This lets the caller batch things, restart at a suitable value of T when something
fails and thus make progress even when networks and services are not perfect, and 
keep the load on the service and client to a minimal level. I think the latter is important
as it would be hard on services to have SEs crawling around in a non-optimal way
trying to index them. 
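
The "oldest N things that are newer than T" loop could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_oldest(t, n)` is a hypothetical stand-in for a service query that returns up to n (id, mtime) pairs with mtime >= t, sorted by increasing mtime, and `make_service` is an in-memory fake for trying it out, not a real DAL client.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, float]  # (record id, mtime); hypothetical record shape


def make_service(records: List[Record]) -> Callable[[float, int], List[Record]]:
    """In-memory fake: returns the oldest n records with mtime >= t."""
    ordered = sorted(records, key=lambda r: (r[1], r[0]))

    def fetch_oldest(t: float, n: int) -> List[Record]:
        return [r for r in ordered if r[1] >= t][:n]

    return fetch_oldest


def harvest(fetch_oldest: Callable[[float, int], List[Record]],
            start: float, batch: int) -> Tuple[Dict[str, float], float]:
    """Walk a collection in mtime order, `batch` records at a time.

    Returns the harvested {id: mtime} map and the last T reached, which
    the caller can store and use as the starting T of the next sync.
    """
    if batch < 1:
        raise ValueError("batch must be at least 1")
    seen: Dict[str, float] = {}
    t = start
    while True:
        rows = fetch_oldest(t, batch)
        for rid, mtime in rows:
            seen[rid] = mtime  # records on the boundary are simply re-seen
        if len(rows) < batch:
            # Short batch: nothing newer is left on the service.
            return seen, (rows[-1][1] if rows else t)
        if rows[-1][1] == t:
            # A full batch that never gets past T cannot make progress.
            raise RuntimeError("more than `batch` records share one mtime")
        t = rows[-1][1]  # restart point if the next query fails
```

Because the query is "mtime >= T", records sharing the boundary mtime are returned twice and overwritten in the map; that is the price of not skipping any record when several have the same timestamp.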

One could use a range of mtime to step forward as long as there was a clear way to 
tell whether or not the service truncated the results. The SE would have to dynamically
tune the mtime interval to avoid truncation; it would be clumsy but feasible.
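
A rough sketch of that range-stepping alternative, again under assumptions: `query(lo, hi)` is a hypothetical call that returns the records with lo <= mtime <= hi together with a flag saying whether the service truncated the result, and `make_range_service` is an in-memory fake with a made-up row limit. The window is halved on truncation and widened again after each successful step.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, float]  # (record id, mtime); hypothetical record shape
RangeQuery = Callable[[float, float], Tuple[List[Record], bool]]


def make_range_service(records: List[Record], limit: int) -> RangeQuery:
    """In-memory fake: at most `limit` rows per query, plus a truncation flag."""
    ordered = sorted(records, key=lambda r: (r[1], r[0]))

    def query(lo: float, hi: float) -> Tuple[List[Record], bool]:
        match = [r for r in ordered if lo <= r[1] <= hi]
        return match[:limit], len(match) > limit

    return query


def harvest_by_range(query: RangeQuery, start: float, end: float,
                     width: float, min_width: float = 1e-3) -> Dict[str, float]:
    """Step through [start, end] in mtime windows, tuning the window width."""
    seen: Dict[str, float] = {}
    lo = start
    while True:
        hi = min(lo + width, end)
        rows, truncated = query(lo, hi)
        if truncated:
            # Window too wide for the service: halve it and retry from lo.
            width = (hi - lo) / 2
            if width < min_width:
                raise RuntimeError("cannot avoid truncation near mtime %r" % lo)
            continue
        for rid, mtime in rows:
            seen[rid] = mtime  # inclusive bounds mean boundary rows repeat
        if hi >= end:
            return seen
        lo = hi
        width *= 2  # try a wider window again after a clean step
```

The clumsiness shows up in the wasted truncated queries and in the failure case: if more records share a single mtime than the service will return at once, no interval is small enough and the sketch gives up.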

Pat

On 19.10.2005 12:24, Doug Tody wrote:
> Hi Pat -
> 
> I am concerned about getting too specific about things like the sort
> order of the returned results.  This can get complicated, e.g., if there
> are multiple query parameters, which do we sort by without specifying
> the sort order explicitly?  MTIME should not be a special case.  If the
> service implements TOP it would make sense to sort by the score, but this
> is not well defined, e.g., all values within a given range of MTIME could
> be assumed to have the same score for that parameter since all are within
> the specified range.  If the result of the query is very large the service
> might not be able to sort it at all.
> 
> It might be better to separate TOP from the issue of stepping through
> a large query, which is more of a transport protocol issue.  We should
> probably return an error, and perhaps preferably indicate an overflow
> condition, if the query result exceeds some maximum value.  Or (if a given
> service supports it) we could provide a general mechanism to cache the
> query response on the server and iteratively step through large queries.
> A combination of these two is probably what is needed.  The simplest
> services would only be required to reliably indicate if the query result
> overflows.  A more sophisticated service would allow additional chunks of
> the query response to be retrieved.  Webapps do this all the time of course,
> usually by providing something like a URL to fetch the "next" segment.
> 
> This is a general problem affecting any query (DAL, skyNode, registry, etc.)
> hence it would be good to have a general solution or at least a consistent
> approach.
> 
> 	- Doug
> 
> 
> 
> 
> On Wed, 19 Oct 2005, Patrick Dowler wrote:
> 
> >
> > Yes - I agree with everything Markus said... had to happen eventually :-)
> >
> > I'm not sure how practical making PUBID and/or CREATORID a comma-separated
> > list will be... it won't scale very much but will be sufficient to coalesce some 10s of
> > actions into a single action (getting records for IDs), which may well be good enough.
> >
> > Another thing Doug mentioned that I forgot about is the need to be able to handle
> > arbitrarily large query results. It is pretty hard to deal with arbitrarily large XML files, both
> > writing and reading them. Currently, the service can just truncate output and a SE
> > builder would have a hard time knowing they had completely scanned the service content.
> >
> > Having written programs to harvest metadata from our own (other) databases, the
> > generally useful pattern is to harvest in order of increasing mtime. So, if a SE did a
> > query like MTIME=t1,&TOP=1000 to get the oldest records with mtime >= t1, it
> > could gradually harvest all the records with repeated queries just by advancing t1.
> > This would work assuming that using TOP and MTIME meant getting the oldest
> > records. Once the SE had completely harvested a service, it could keep up to date
> > by doing this query periodically with a min mtime equal to the last time it checked the
> > service (to get new/changed records).
> >
> > So, could this interpretation of using MTIME and TOP (order by MTIME) be included in
> > the spec explicitly? I don't foresee any difficulty in implementing it...
> >
> > Pat
> >
> > PS-From the search engine point of view, services that generate products on the fly
> > aren't useful to re-index because in theory they have a response for every query and
> > thus an infinite number of "virtual records" to index...
> >
> > On 19.10.2005 05:05, Markus Dolensky wrote:
> > > Hi,
> > >
> > > Before commenting on Pat's search engine use case here's where one can
> > > find the latest info:
> > > DAL presentations of the respective interop session at ESAC are here
> > > http://www.ivoa.net/twiki/bin/view/IVOA/InterOpOct2005DAL
> > > - many thanks to the authors for promptly providing them. This includes
> > > the minutes with action items related to Pat's proposal
> > > http://www.ivoa.net/internal/IVOA/InterOpOct2005DAL/dal_20051007.txt
> > > Finally, note that Francesco has added the sample files of his demos.
> > >
> > >
> > > Patrick Dowler wrote:
> > > > In Madrid I brought up the topic of having a "last modification time" on
> > > > records returned from SSA and SIA. The intent is to allow querying on this to
> > > > get new or changed records - something needed to build a search engine,
> > > > for example.
> > >
> > > My perception when adding your idea to the DAL minutes was that a query
> > > parameter MTIME=<interval> and a corresponding output parameter was
> > > generally considered an excellent enhancement and it's merely a matter
> > > of agreeing how to do it.
> > >
> > >
> > > > 1. unique identifier that could be used sometime later to get the AccessReference
> > > > (ie to get the data or let a user get the data):
> > > >
> > > > - publisher ID is tied to the specific service, so one would need to keep the tuple of
> > > > <resourceID, pubID> where resourceID lets you find the same service in the registry
> > > > and pubID lets you find the record within that service.... Correct?
> > >
> > > There is an action to clarify the meaning of CREATORID and PUBID since
> > > Doug and Jonathan had slightly different expectations. Therefore, I'd
> > > like to ask them to agree on a (uniform) answer to point #1.
> > >
> > >
> > > > 2. a globally unique "dataset ID" could be used, but the SE would still need to know
> > > > which service(s) can deliver the record and data... plus specific implementations of a
> > > > SE might need specific things from the record not supplied by everyone that can deliver
> > > > the dataset (eg. I need spatial support, time bounds, and energy bounds to build my
> > > > search engine - someone else might need more or less)....
> > > >
> > > > To support an SE, "mtime" needs to be a query parameter of the form mtime=MIN,MAX
> > > > with support for mtime=MIN, (for >=) and it has to be part of each record on output. Personally
> > > > I would like to see these as REQUIRED.
> > >
> > > In general, this is how such range conditions should be specified:
> > > example1: MTIME=lo,hi  # bounded range
> > > example2: MTIME=lo,    # greater than or equal to lo
> > > example3: MTIME=,hi    # smaller than or equal to hi
> > >
> > >
> > > > ** using/getting AccessReference
> > > >
> > > > In addition, if I build an SE that stores <resourceID,pubID> then I will also like to have a
> > > > fast way to convert them into AccessReference (URLs). I'm assuming the AccessReference
> > > > one gets from the query is currently valid but not guaranteed to be valid indefinitely (publishers
> > > > may want/need to change data delivery, which I don't think should mandate changing
> > > > the modification time). Specifically, it would be nice to be able to pass a list of pubID values to
> > > > a service and get one response, rather than have to issue separate queries and get one response
> > > > (VOTable) per pubID with one record each. With http get, the length of the list would be limited, of
> > > > course.
> > >
> > > > Logically, an SE will need pubID as a REQUIRED query and output parameter. List
> > > > support is an optimisation.
> > >
> > > Unless there are objections I'll turn the parameter specification of
> > > PUBID and CREATORID into type 'comma separated list' in the SSA
> > > interface doc. This again requires a final word on the meaning of the
> > > two parameters. Presumably chances are dim that this will break already
> > > existing services(?)
> > >
> > > Let me try to work out what REQUIRED means in this context:
> > > A service needs to recognize query parameter MTIME. If there is no MTIME
> > > value - for instance, because a mosaic is computed on the fly  (=>
> > > virtual data) - then the service must not produce an error but ignore
> > > MTIME(?).
> > >
> > > - Markus
> > >
> > >
> >
> > --
> > Patrick Dowler
> > Tel/Tél: (250) 363-6914                  | fax/télécopieur: (250) 363-0045
> > Canadian Astronomy Data Centre   | Centre canadien de donnees astronomiques
> > National Research Council Canada | Conseil national de recherches Canada
> > Government of Canada                  | Gouvernement du Canada
> > 5071 West Saanich Road               | 5071, chemin West Saanich
> > Victoria, BC                                  | Victoria (C.-B.)
> >
> >
> 
> 
> 



