building a search engine

Wed Oct 19 12:24:35 PDT 2005

Hi Pat -

I am concerned about getting too specific about things like the sort
order of the returned results.  This can get complicated, e.g., if there
are multiple query parameters, which do we sort by without specifying
the sort order explicitly?  MTIME should not be a special case.  If the
service implements TOP it would make sense to sort by the score, but this
is not well defined, e.g., all values within a given range of MTIME could
be assumed to have the same score for that parameter since all are within
the specified range.  If the result of the query is very large the service
might not be able to sort it at all.

It might be better to separate TOP from the issue of stepping through
a large query, which is more of a transport protocol issue.  We should
probably return an error, and perhaps preferably indicate an overflow
condition, if the query result exceeds some maximum value.  Or (if a given
service supports it) we could provide a general mechanism to cache the
query response on the server and iteratively step through large queries.
A combination of these two is probably what is needed.  The simplest
services would only be required to reliably indicate if the query result
overflows.  A more sophisticated service would allow additional chunks of
the query response to be retrieved.  Webapps do this all the time of course,
usually by providing something like a URL to fetch the "next" segment.
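
As a rough sketch of the two service levels described above: a response
carries an overflow flag, and a more capable service also returns a handle
for the next chunk. Everything here is hypothetical (MAX_RECORDS, run_query,
fetch_all, and the "overflow"/"next" field names are not from any DAL spec),
and an integer cursor stands in for the "next" URL.

```python
from typing import Iterator, Optional

MAX_RECORDS = 3  # hypothetical per-response limit imposed by the service


def run_query(records: list, cursor: int = 0) -> dict:
    """Toy service: return at most MAX_RECORDS records starting at cursor.

    'overflow' is True when more records remain. 'next' is the cursor for
    the following chunk (standing in for a "next" URL), or None.
    """
    chunk = records[cursor:cursor + MAX_RECORDS]
    end = cursor + len(chunk)
    overflow = end < len(records)
    return {"records": chunk, "overflow": overflow,
            "next": end if overflow else None}


def fetch_all(records: list) -> Iterator:
    """Toy client: follow 'next' until the service reports no overflow."""
    cursor: Optional[int] = 0
    while cursor is not None:
        response = run_query(records, cursor)
        yield from response["records"]
        cursor = response["next"]
```

A minimal service would fill in only the overflow flag; a client that sees
it set without a "next" handle at least knows the result is incomplete.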

This is a general problem affecting any query (DAL, skyNode, registry, etc.)
hence it would be good to have a general solution or at least a consistent
approach.

	- Doug

On Wed, 19 Oct 2005, Patrick Dowler wrote:

>
> Yes - I agree with everything Markus said... had to happen eventually :-)
>
> I'm not sure how practical making PUBID and/or CREATORID a comma-separated
> list will be... it won't scale very much but will be sufficient to coalesce some 10s of
> actions into a single action (getting records for IDs), which may well be good enough.
>
> Another thing Doug mentioned that I forgot about is the need to be able to handle
> arbitrarily large query results. It is pretty hard to deal with arbitrarily large XML files, both
> writing and reading them. Currently, the service can just truncate output and an SE
> builder would have a hard time knowing they had completely scanned the service content.
>
> Having written programs to harvest metadata from our own (other) databases, the
> generally useful pattern is to harvest in order of increasing mtime. So, if an SE did a
> query like MTIME=t1,&TOP=1000 to get the oldest records with mtime >= t1, it
> could gradually harvest all the records with repeated queries just by advancing t1.
> This would work assuming that using TOP and MTIME meant getting the oldest
> records. Once the SE had completely harvested a service, it could keep up to date
> by doing this query periodically with a min mtime equal to the last time it checked the
> service (to get new/changed records).
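
The harvesting loop described above could look roughly like the sketch
below. It assumes what the paragraph asks for, namely that TOP combined with
MTIME returns the oldest matching records in increasing mtime order. The
names query and harvest, the record fields "id" and "mtime", and the use of
plain numbers for mtime are all made up for illustration.

```python
def query(store, mtime_min, top):
    """Toy service: the `top` oldest records with mtime >= mtime_min."""
    hits = [r for r in store if r["mtime"] >= mtime_min]
    return sorted(hits, key=lambda r: r["mtime"])[:top]


def harvest(store, t1, top):
    """Repeat the MTIME=t1,&TOP=top query, advancing t1 after each batch."""
    seen = {}
    while True:
        batch = query(store, t1, top)
        for rec in batch:
            # ">=" re-fetches records whose mtime equals t1, so key by id.
            seen[rec["id"]] = rec
        if len(batch) < top:
            break  # a short batch means nothing newer is left
        newest = batch[-1]["mtime"]
        if newest == t1:
            # Every record in a full batch has mtime == t1; t1 cannot move.
            raise RuntimeError("a full batch shares one mtime; raise TOP")
        t1 = newest
    return list(seen.values())
```

Two things fall out of the sketch: records at the boundary mtime come back
twice and need de-duplicating, and the loop cannot advance if TOP or more
records share a single mtime.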
>
> So, could this interpretation of using MTIME and TOP (order by MTIME) be included in
> the spec explicitly? I don't foresee any difficulty in implementing it...
>
> Pat
>
> PS-From the search engine point of view, services that generate products on the fly
> aren't useful to re-index because in theory they have a response for every query and
> thus an infinite number of "virtual records" to index...
>
> On 19.10.2005 05:05, Markus Dolensky wrote:
> > Hi,
> >
> > Before commenting on Pat's search engine use case here's where one can
> > find the latest info:
> > DAL presentations of the respective interop session at ESAC are here
> > http://www.ivoa.net/twiki/bin/view/IVOA/InterOpOct2005DAL
> > - many thanks to the authors for promptly providing them. This includes
> > the minutes with action items related to Pat's proposal
> > http://www.ivoa.net/internal/IVOA/InterOpOct2005DAL/dal_20051007.txt
> > Finally, note that Francesco has added the sample files of his demos.
> >
> >
> > Patrick Dowler wrote:
> > > In Madrid I brought up the topic of having a "last modification time" on
> > > records returned from SSA and SIA. The intent is to allow queries on this to
> > > get new or changed records - something needed to build a search engine,
> > > for example.
> >
> > My perception when adding your idea to the DAL minutes was that a query
> > parameter MTIME=<interval> and a corresponding output parameter was
> > generally considered an excellent enhancement and it's merely a matter
> > of agreeing how to do it.
> >
> >
> > > 1. unique identifier that could be used sometime later to get the AccessReference
> > > (ie to get the data or let a user get the data):
> > >
> > > - publisher ID is tied to the specific service, so one would need to keep the tuple of
> > > <resourceID, pubID> where resourceID lets you find the same service in the registry
> > > and pubID lets you find the record within that service.... Correct?
> >
> > There is an action to clarify the meaning of CREATORID and PUBID since
> > Doug and Jonathan had slightly different expectations. Therefore, I'd
> > like to ask them to agree on a (uniform) answer to point #1.
> >
> >
> > > 2. a globally unique "dataset ID" could be used, but the SE would still need to know
> > > which service(s) can deliver the record and data... plus specific implementations of a
> > > SE might need specific things from the record not supplied by everyone that can deliver
> > > the dataset (eg. I need spatial support, time bounds, and energy bounds to build my
> > > search engine - someone else might need more or less)....
> > >
> > > To support an SE, "mtime" needs to be a query parameter of the form mtime=MIN,MAX
> > > with support for mtime=MIN, (for >=) and it has to be part of each record on output. Personally
> > > I would like to see these as REQUIRED.
> >
> > In general, this is how such range conditions should be specified:
> > example1: MTIME=lo,hi  # bounded range
> > example2: MTIME=lo,    # greater than or equal to lo
> > example3: MTIME=,hi    # less than or equal to hi
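
For illustration, the three range forms above could be parsed along these
lines. This is only a sketch: parse_range and in_range are hypothetical
names, values are treated as plain numbers because the thread does not fix
a time format, the bounded form is assumed to include both ends, and a
value with no comma is rejected since its meaning is not defined here.

```python
from typing import Optional, Tuple

Bounds = Tuple[Optional[float], Optional[float]]


def parse_range(value: str) -> Bounds:
    """Parse 'lo,hi', 'lo,' or ',hi' into (lo, hi); None means open-ended."""
    lo_text, sep, hi_text = value.partition(",")
    if not sep:
        raise ValueError("expected 'lo,hi', 'lo,' or ',hi'")
    lo = float(lo_text) if lo_text.strip() else None
    hi = float(hi_text) if hi_text.strip() else None
    if lo is None and hi is None:
        raise ValueError("at least one bound is required")
    if lo is not None and hi is not None and lo > hi:
        raise ValueError("lo must not exceed hi")
    return lo, hi


def in_range(mtime: float, bounds: Bounds) -> bool:
    """True if mtime satisfies the (inclusive) bounds."""
    lo, hi = bounds
    return (lo is None or mtime >= lo) and (hi is None or mtime <= hi)
```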
> >
> >
> > > ** using/getting AccessReference
> > >
> > > In addition, if I build an SE that stores <resourceID,pubID> then I will also like to have a
> > > fast way to convert them into AccessReference (URLs). I'm assuming the AccessReference
> > > one gets from the query is currently valid but not guaranteed to be valid indefinitely (publishers
> > > may want/need to change data delivery, which I don't think should mandate changing
> > > the modification time). Specifically, it would be nice to be able to pass a list of pubID values to
> > > a service and get one response, rather than have to issue separate queries and get one response
> > > (VOTable) per pubID with one record each. With http get, the length of the list would be limited, of
> > > course.
> >
> > > Logically, an SE will need pubID as a REQUIRED query and output parameter. List
> > > support is an optimisation.
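
Since the length of an http get request limits the list, a client would
have to split its pubID values into several comma-separated lists. A rough
sketch follows, with hypothetical names (batch_ids, max_chars), counting raw
characters only and assuming the IDs themselves contain no commas:

```python
def batch_ids(pub_ids, max_chars):
    """Group IDs into comma-separated strings of at most max_chars each.

    An ID longer than max_chars is still emitted, alone in its own batch.
    """
    batches, current, length = [], [], 0
    for pid in pub_ids:
        added = len(pid) + (1 if current else 0)  # +1 for the comma
        if current and length + added > max_chars:
            batches.append(",".join(current))
            current, length = [], 0
            added = len(pid)
        current.append(pid)
        length += added
    if current:
        batches.append(",".join(current))
    return batches
```

Each returned string would then be sent as one PUBID value, giving one
response per batch rather than one per record.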
> >
> > Unless there are objections I'll turn the parameter specification of
> > PUBID and CREATORID into type 'comma separated list' in the SSA
> > interface doc. This again requires a final word on the meaning of the
> > two parameters. Presumably chances are dim that this will break already
> > existing services(?)
> >
> > Let me try to work out what REQUIRED means in this context:
> > A service needs to recognize query parameter MTIME. If there is no MTIME
> > value - for instance, because a mosaic is computed on the fly  (=>
> > virtual data) - then the service must not produce an error but ignore
> > MTIME(?).
> >
> > - Markus
> >
> >
>
> --
> Patrick Dowler
> Tel/Tél: (250) 363-6914                  | fax/télécopieur: (250) 363-0045
> Canadian Astronomy Data Centre   | Centre canadien de donnees astronomiques
> National Research Council Canada | Conseil national de recherches Canada
> Government of Canada                  | Gouvernement du Canada
> 5071 West Saanich Road               | 5071, chemin West Saanich
> Victoria, BC                                  | Victoria (C.-B.)
>
>