building a search engine

Maria A. Nieto-Santisteban nieto at skysrv.pha.jhu.edu
Wed Oct 19 14:54:35 PDT 2005


Hi,

During the VOQL session Yuji suggested an ADQL extension called OFFSET
which, combined with TOP and ORDER BY, would return the next N rows
after an offset of M rows. After some discussion, we concluded that
services handling small datasets (images, sources, etc.) might be able to
implement such a capability. However, services handling millions of
records might not be able to guarantee the OFFSET extension, depending on
the system (mainly the underlying DBMS). We all agreed this would be an
optional ADQL extension that services could implement depending on their
own characteristics and capabilities.
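
As a rough illustration only (not something agreed in the session), the
intended semantics of TOP N with ORDER BY and OFFSET M can be sketched
in Python over an in-memory list. The names select_page,
OffsetNotSupported and supports_offset are hypothetical; a real service
would push this down to its DBMS, and the flag merely stands in for the
extension being optional.

```python
from typing import Any, Callable, List, Sequence


class OffsetNotSupported(Exception):
    """Raised by a service that does not implement the optional OFFSET extension."""


def select_page(rows: Sequence[Any], order_key: Callable[[Any], Any], top: int,
                offset: int = 0, supports_offset: bool = True) -> List[Any]:
    """Return the next `top` rows after skipping `offset` rows, in ORDER BY order."""
    if top < 0 or offset < 0:
        raise ValueError("top and offset must be non-negative")
    if offset > 0 and not supports_offset:
        # The extension is optional, so a service may refuse it outright.
        raise OffsetNotSupported("this service does not implement OFFSET")
    # A stable, explicit ordering is what makes successive pages consistent.
    ordered = sorted(rows, key=order_key)
    return ordered[offset:offset + top]


rows = [5, 3, 1, 4, 2]
first = select_page(rows, order_key=lambda r: r, top=2)             # rows 1-2
second = select_page(rows, order_key=lambda r: r, top=2, offset=2)  # rows 3-4
```

The full sort in this sketch is exactly the step that a service holding
millions of records may be unable to guarantee.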

Hope this solves the question of ... "it would be good to have a general
solution or at least consistent approach."
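
For comparison, the MTIME/TOP harvesting pattern Pat describes in the
quoted message below can be sketched as follows. This is a minimal
sketch under stated assumptions: query(t1, top) is a hypothetical
stand-in for a request like MTIME=t1,&TOP=1000, assumed to return at
most top (id, mtime) records with mtime >= t1, oldest first. The names
harvest and Record are likewise made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, float]  # (hypothetical record id, mtime)


def harvest(query: Callable[[float, int], List[Record]],
            t1: float, top: int) -> Dict[str, float]:
    """Collect every record with mtime >= t1 by repeatedly advancing t1."""
    seen: Dict[str, float] = {}
    while True:
        batch = query(t1, top)
        for rec_id, mtime in batch:
            # Records at the boundary mtime are fetched twice; keying by id dedupes.
            seen[rec_id] = mtime
        if len(batch) < top:
            return seen  # a short batch means nothing is left to fetch
        last_mtime = batch[-1][1]
        if last_mtime == t1:
            # A full batch sharing a single mtime cannot be stepped past with >= alone.
            raise RuntimeError("a full batch shares one mtime; cannot advance t1")
        t1 = last_mtime


# Toy in-memory "service" used only to exercise the loop.
data = [("a", 1.0), ("b", 2.0), ("c", 2.0), ("d", 3.0), ("e", 4.0)]


def fake_query(t1: float, top: int) -> List[Record]:
    matching = [r for r in data if r[1] >= t1]
    return sorted(matching, key=lambda r: r[1])[:top]


harvested = harvest(fake_query, 0.0, 3)
```

The sketch only works if the service returns the oldest records first,
and it stalls when TOP or more records share one mtime, which is one
concrete case where an explicit OFFSET or a "next" mechanism would help.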

Cheers,

Maria


On Wed, 19 Oct 2005, Doug Tody wrote:

> Hi Pat -
> 
> I am concerned about getting too specific about things like the sort
> order of the returned results.  This can get complicated, e.g., if there
> are multiple query parameters, which do we sort by without specifying
> the sort order explicitly?  MTIME should not be a special case.  If the
> service implements TOP it would make sense to sort by the score, but this
> is not well defined, e.g., all values within a given range of MTIME could
> be assumed to have the same score for that parameter since all are within
> the specified range.  If the result of the query is very large the service
> might not be able to sort it at all.
> 
> It might be better to separate TOP from the issue of stepping through
> a large query, which is more of a transport protocol issue.  We should
> probably return an error, and perhaps preferably indicate an overflow
> condition, if the query result exceeds some maximum value.  Or (if a given
> service supports it) we could provide a general mechanism to cache the
> query response on the server and iteratively step through large queries.
> A combination of these two is probably what is needed.  The simplest
> services would only be required to reliably indicate if the query result
> overflows.  A more sophisticated service would allow additional chunks of
> the query response to be retrieved.  Webapps do this all the time of course,
> usually by providing something like a URL to fetch the "next" segment.
> 
> This is a general problem affecting any query (DAL, skyNode, registry, etc.)
> hence it would be good to have a general solution or at least a consistent
> approach.
> 
> 	- Doug
> 
> 
> 
> 
> On Wed, 19 Oct 2005, Patrick Dowler wrote:
> 
> >
> > Yes - I agree with everything Markus said... had to happen eventually :-)
> >
> > I'm not sure how practical making PUBID and/or CREATORID a comma-separated
> > list will be... it won't scale very much but will be sufficient to coalesce some 10s of
> > actions into a single action (getting records for IDs), which may well be good enough.
> >
> > Another thing Doug mentioned that I forgot about is the need to be able to handle
> > arbitrarily large query results. It is pretty hard to deal with arbitrarily large XML files, both
> > writing and reading them. Currently, the service can just truncate output and a SE
> > builder would have a hard time knowing they had completely scanned the service content.
> >
> > Having written programs to harvest metadata from our own (other) databases, the
> > generally useful pattern is to harvest in order of increasing mtime. So, if a SE did a
> > query like MTIME=t1,&TOP=1000 to get the oldest records with mtime >= t1, it
> > could gradually harvest all the records with repeated queries just by advancing t1.
> > This would work assuming that using TOP and MTIME meant getting the oldest
> > records. Once the SE had completely harvested a service, it could keep up to date
> > by doing this query periodically with a min mtime equal to the last time it checked the
> > service (to get new/changed records).
> >
> > So, could this interpretation of using MTIME and TOP (order by MTIME) be included in
> > the spec explicitly? I don't foresee any difficulty in implementing it...
> >
> > Pat
> >
> > PS-From the search engine point of view, services that generate products on the fly
> > aren't useful to re-index because in theory they have a response for every query and
> > thus an infinite number of "virtual records" to index...
> >
> > On 19.10.2005 05:05, Markus Dolensky wrote:
> > > Hi,
> > >
> > > Before commenting on Pat's search engine use case here's where one can
> > > find the latest info:
> > > DAL presentations of the respective interop session at ESAC are here
> > > http://www.ivoa.net/twiki/bin/view/IVOA/InterOpOct2005DAL
> > > - many thanks to the authors for promptly providing them. This includes
> > > the minutes with action items related to Pat's proposal
> > > http://www.ivoa.net/internal/IVOA/InterOpOct2005DAL/dal_20051007.txt
> > > Finally note that, Francesco has added the sample files of his demos.
> > >
> > >
> > > Patrick Dowler wrote:
> > > > In Madrid I brought up the topic of having a "last modification time" on
> > > > records returned from SSA and SIA. The intent is to allow queries on this to
> > > > get new or changed records - something needed to build a search engine,
> > > > for example.
> > >
> > > My perception when adding your idea to the DAL minutes was that a query
> > > parameter MTIME=<interval> and a corresponding output parameter was
> > > generally considered an excellent enhancement and it's merely a matter
> > > of agreeing how to do it.
> > >
> > >
> > > > 1. unique identifier that could be used sometime later to get the AccessReference
> > > > (ie to get the data or let a user get the data):
> > > >
> > > > - publisher ID is tied to the specific service, so one would need to keep the tuple of
> > > > <resourceID, pubID> where resourceID lets you find the same service in the registry
> > > > and pubID lets you find the record within that service.... Correct?
> > >
> > > There is an action to clarify the meaning of CREATORID and PUBID since
> > > Doug and Jonathan had slightly different expectations. Therefore, I'd
> > > like to ask them to agree on a (uniform) answer to point #1.
> > >
> > >
> > > > 2. a globally unique "dataset ID" could be used, but the SE would still need to know
> > > > which service(s) can deliver the record and data... plus specific implementations of a
> > > > SE might need specific things from the record not supplied by everyone that can deliver
> > > > the dataset (eg. I need spatial support, time bounds, and energy bounds to build my
> > > > search engine - someone else might need more or less)....
> > > >
> > > > To support an SE, "mtime" needs to be a query parameter of the form mtime=MIN,MAX
> > > > with support for mtime=MIN, (for >=) and it has to be part of each record on output. Personally
> > > > I would like to see these as REQUIRED.
> > >
> > > In general, this is how such range conditions should be specified:
> > > example1: MTIME=lo,hi  # bounded range
> > > example2: MTIME=lo,    # bigger than or equal to lo
> > > example3: MTIME=,hi    # smaller than or equal to hi
> > >
> > >
> > > > ** using/getting AccessReference
> > > >
> > > > In addition, if I build an SE that stores <resourceID,pubID> then I will also like to have a
> > > > fast way to convert them into AccessReference (URLs). I'm assuming the AccessReference
> > > > one gets from the query is currently valid but not guaranteed to be valid indefinitely (publishers
> > > > may want/need to change data delivery, which I don't think should mandate changing
> > > > the modification time). Specifically, it would be nice to be able to pass a list of pubID values to
> > > > a service and get one response, rather than have to issue separate queries and get one response
> > > > (VOTable) per pubID with one record each. With http get, the length of the list would be limited, of
> > > > course.
> > >
> > > > Logically, an SE will need pubID as a REQUIRED query and output parameter. List
> > > > support is an optimisation.
> > >
> > > Unless there are objections I'll turn the parameter specification of
> > > PUBID and CREATORID into type 'comma separated list' in the SSA
> > > interface doc. This again requires a final word on the meaning of the
> > > two parameters. Presumably chances are dim that this will break already
> > > existing services(?)
> > >
> > > Let me try to work out what REQUIRED means in this context:
> > > A service needs to recognize query parameter MTIME. If there is no MTIME
> > > value - for instance, because a mosaic is computed on the fly  (=>
> > > virtual data) - then the service must not produce an error but ignore
> > > MTIME(?).
> > >
> > > - Markus
> > >
> > >
> >
> > --
> > Patrick Dowler
> > Tel/Tél: (250) 363-6914                  | fax/télécopieur: (250) 363-0045
> > Canadian Astronomy Data Centre   | Centre canadien de donnees astronomiques
> > National Research Council Canada | Conseil national de recherches Canada
> > Government of Canada                  | Gouvernement du Canada
> > 5071 West Saanich Road               | 5071, chemin West Saanich
> > Victoria, BC                                  | Victoria (C.-B.)
> >
> >
> 
> 

-- 
------------------------------------------------
Maria A. Nieto-Santisteban (nieto at pha.jhu.edu)
Johns Hopkins University
3400 N. Charles St.
Physics & Astronomy Department
Baltimore, MD 21218 (USA)

Tel: 	1 410 516-7679  Fax: 	1 410 516-5096



More information about the dal mailing list