building a search engine

Wed Oct 19 10:59:59 PDT 2005

Hi Pat -

This is very useful; you have identified some issues that need more
careful thought.

> ** building a search engine (SE)
>
> To elaborate further, a useful SE on SSA and SIA would also need to find the
> following things for each record:
>
> 1. unique identifier that could be used sometime later to get the
> AccessReference (ie to get the data or let a user get the data):
>
> - publisher ID is tied to the specific service, so one would need to keep
> the tuple of <resourceID, pubID> where resourceID lets you find the same
> service in the registry and pubID lets you find the record within that
> service.... Correct?
>
> 2. a globally unique "dataset ID" could be used, but the SE would still
> need to know which service(s) can deliver the record and data... plus
> specific implementations of a SE might need specific things from the
> record not supplied by everyone that can deliver the dataset (eg. I need
> spatial support, time bounds, and energy bounds to build my search engine -
> someone else might need more or less)....

There are two related aspects to this problem depending upon whether
we are building a generic index or are building lists of data objects
targeting some specific type of analysis:

Indexing static data  --

By SE I think we mean a global indexing service, which indexes "atlas"
datasets belonging to some collection (i.e., static files or records in
some archive).  The SE would restrict its queries to "atlas" or "pointed"
services (or whatever we decide to call these in the future).  This is
distinct from services which compute virtual data, where what you see
depends upon what you ask for.

For this case I think what you suggest is probably the way to go.  The SE
needs to record the resourceID of the service, and the publisher dataset
ID (pubID or whatever we decide to call it) of the specific dataset as
assigned by the service.

CreatorID cannot be used for this purpose as 1) we can't guarantee that
there is one (not all data collections assign CreatorIDs), and 2) in this
case we want to index specific dataset instances from specific services.
However, if there is a CreatorID it can be used for data discovery or to
query the SE to find indexed replicas.
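
To make the static case concrete, here is a minimal sketch of what the SE
might store per dataset.  The class and field names are made up for
illustration and are not taken from any spec; the only points it tries to
capture are that the key is the <resourceID, pubID> pair and that the
CreatorID is optional and used only to find replicas.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """One indexed dataset instance (field names are hypothetical)."""
    resource_id: str                  # finds the service in the registry
    pub_id: str                       # dataset ID assigned by that service
    creator_id: Optional[str] = None  # not every collection assigns one


class DatasetIndex:
    """Index keyed on (resourceID, pubID), with a CreatorID side table."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], IndexEntry] = {}
        self._by_creator: Dict[str, List[Tuple[str, str]]] = {}

    def add(self, entry: IndexEntry) -> None:
        key = (entry.resource_id, entry.pub_id)
        old = self._by_key.get(key)
        # If the entry is being re-indexed, drop its old CreatorID link first.
        if old is not None and old.creator_id is not None:
            self._by_creator[old.creator_id].remove(key)
        self._by_key[key] = entry
        if entry.creator_id is not None:
            self._by_creator.setdefault(entry.creator_id, []).append(key)

    def get(self, resource_id: str, pub_id: str) -> Optional[IndexEntry]:
        """Look up one specific dataset instance from one specific service."""
        return self._by_key.get((resource_id, pub_id))

    def replicas(self, creator_id: str) -> List[IndexEntry]:
        """All indexed instances that share the given CreatorID."""
        return [self._by_key[k] for k in self._by_creator.get(creator_id, [])]
```

A lookup by (resourceID, pubID) always identifies exactly one entry,
whereas a lookup by CreatorID may return zero, one, or several.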

Indexing virtual data  --

A similar issue came up recently in connection with persistent virtual
directories, where a data discovery client application builds a list of
data products targeting some specific type of analysis, and comes back
sometime later to access them.  This is a different case as here we want
to deal with virtual data - we are building a filtered-down list of data
objects to be used for specific analysis, and we may have many such lists.
In this case the IDs in general will not work, as there may be multiple
virtual data products (e.g., cutouts) which are generated from the same
atlas dataset, or a virtual data product may derive from multiple atlas
datasets.

One way to address this and to provide persistence would be for the
service to represent a virtual data collection, assigning persistent
CreatorIDs to the virtual data it can generate (such an ID would
probably point to a persistent database record which tells the service
how to generate the virtual data product).  However, this is probably
too complex, at least for the moment.  The access reference generated
by a service already tells the service how to generate a virtual data
product.  Perhaps to deal with persistence we just need to be more
rigorous about specifying the time to live for an access reference (the
old SIA spec already includes this, but I don't think current services
have bothered to implement it).
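
As a rough sketch of how a client could use a time to live, assuming the
service reports one as a number of seconds alongside the access reference
(the field names below are hypothetical, and this does not reproduce how
the SIA spec actually expresses it):

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessReference:
    """A stored access reference plus the lifetime the service stated."""
    url: str
    issued_at: float               # epoch seconds when the service returned it
    ttl_seconds: Optional[float]   # None: the service stated no time to live

    def is_usable(self, now: Optional[float] = None) -> bool:
        """True while the reference is still inside its stated lifetime."""
        if self.ttl_seconds is None:
            # Conservative assumption: without a stated lifetime, re-query.
            return False
        if now is None:
            now = time.time()
        return now < self.issued_at + self.ttl_seconds
```

A client holding a saved list of virtual data products would check
is_usable() before fetching, and re-issue the query if it returns False.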

> To support an SE, "mtime" needs to be a query parameter of the form
> mtime=MIN,MAX with support for mtime=MIN, (for >=) and it has to be part
> of each record on output. Personally I would like to see these as REQUIRED.

Yes, this looks reasonable, and is consistent with the current spec.
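
For illustration, a service might parse the parameter roughly as follows.
This is only a sketch: it assumes ISO 8601 timestamps and an inclusive
upper bound, neither of which is settled above, and the function names
are made up.

```python
from datetime import datetime
from typing import Optional, Tuple


def parse_mtime(value: str) -> Tuple[datetime, Optional[datetime]]:
    """Parse 'MIN,MAX' or 'MIN,' into (lower, upper); upper None means open."""
    parts = value.split(",")
    if len(parts) != 2:
        raise ValueError("expected mtime=MIN,MAX or mtime=MIN,")
    lo_text, hi_text = (p.strip() for p in parts)
    if not lo_text:
        raise ValueError("MIN is required")
    lo = datetime.fromisoformat(lo_text)
    hi = datetime.fromisoformat(hi_text) if hi_text else None
    if hi is not None and hi < lo:
        raise ValueError("MAX is earlier than MIN")
    return lo, hi


def mtime_matches(record_mtime: datetime,
                  lo: datetime, hi: Optional[datetime]) -> bool:
    """True if a record's mtime falls in the range (upper bound inclusive)."""
    return record_mtime >= lo and (hi is None or record_mtime <= hi)
```

An SE doing incremental updates would remember the time of its last
harvest and pass it as MIN with no MAX on the next pass.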

> ** using/getting AccessReference
>
> In addition, if I build an SE that stores <resourceID,pubID> then I
> would also like to have a fast way to convert them into AccessReference
> (URLs). I'm assuming the AccessReference one gets from the query is
> currently valid but not guaranteed to be valid indefinitely (publishers
> may want/need to change data delivery, which I don't think should mandate
> changing the modification time). Specifically, it would be nice to be
> able to pass a list of pubID values to a service and get one response,
> rather than have to issue separate queries and get one response (VOTable)
> per pubID with one record each. With http get, the length of the list
> would be limited, of course.
>
> Logically, an SE will need pubID as a REQUIRED query and output
> parameter. List support is an optimisation.

We already thought of this, which is why SSA permits a query by ID.
Markus's suggestion of changing the query by ID parameters to permit a
list of IDs looks reasonable.  This approach does not scale well, but it
is simple and probably adequate for the moment.
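
On the client side, the URL length limit Pat mentions could be handled by
splitting the ID list into several GET requests.  A minimal sketch follows;
the parameter name "ID", the comma-separated list syntax, and the 2000
character limit are all assumptions for the sake of the example, not
anything the spec defines.

```python
from typing import Iterable, Iterator, List
from urllib.parse import urlencode


def _build_url(base_url: str, ids: List[str], param: str) -> str:
    # urlencode percent-escapes the commas and any unsafe characters.
    return base_url + "?" + urlencode({param: ",".join(ids)})


def batched_id_queries(base_url: str, pub_ids: Iterable[str],
                       max_url_length: int = 2000,
                       param: str = "ID") -> Iterator[str]:
    """Yield query URLs, packing as many IDs into each as the limit allows."""
    batch: List[str] = []
    for pub_id in pub_ids:
        candidate = batch + [pub_id]
        if batch and len(_build_url(base_url, candidate, param)) > max_url_length:
            # Adding this ID would overflow the URL: flush and start afresh.
            yield _build_url(base_url, batch, param)
            batch = [pub_id]
        else:
            batch = candidate
    if batch:
        yield _build_url(base_url, batch, param)
```

Each URL yields one response containing several records, so the number of
round trips drops from one per pubID to one per batch.  A single ID that
is by itself too long for the limit is still sent alone rather than dropped.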

> I really hope this can get into SSA 1.0 and hence SIA 1.1,

I don't see any problem.  The main issue has to do with the precise
semantics of the IDs, and what we decide to call them.

	- Doug