STC in VOResource records

Gretchen Greene greene at stsci.edu
Thu Dec 14 09:51:13 PST 2006


Ray,

Thank you for the summary and thanks to the discoverer of the problem before
the rest of us catch up!

I would advocate requiring the STC and VOResource to enforce uniqueness in
their schemas for these ids rather than a programmatic temporary solution.
On the question of expediency, how long would it take to make the schema
change (in VOResource? in STC?)?  It sounds like there are known methods.

This has to happen sometime,  so should we wait until all the registries and
applications are in full operation to make the schema change?  

I'm open to whatever solution is decided upon, but I lean toward having the
schemas corrected.

Since we are only now preparing to release VOResource 1.0 registries,  it is
a perfect time to get in a last change.  As for the STC schema constraint,
this conflict will not be unique to registries.  

-Gretchen

-----Original Message-----
From: owner-registry at eso.org [mailto:owner-registry at eso.org] On Behalf Of
Ray Plante
Sent: Thursday, December 14, 2006 9:57 AM
To: IVOA Registry WG
Cc: Arnold Rots
Subject: STC in VOResource records


Hi RWGers,

So we have a bit of a crisis to contend with regarding our use of STC within
a VOResource record which is standing in the way of our upgrade to RI v1.0.
To catch folks up, I'm going to summarize the problem and review some useful
input that others have made, and then try to conclude with our current set
of alternatives.

I. The Problem

We use the Space-Time Coordinates schema (STC) to describe a resource's
coverage of the sky, time, and frequency.  In STC, this is done by first 
defining "coordinate systems" for each of these things and then listing 
how the resource maps onto those systems.  A single, simple instance looks 
like this:

     <stc:STCResourceProfile
          xmlns:stc="http://www.ivoa.net/xml/STC/stc-v1.30.xsd"
          xmlns="http://www.ivoa.net/xml/STC/stc-v1.30.xsd"
          xmlns:xlink="http://www.w3.org/1999/xlink">

        <AstroCoordSystem xlink:type="simple"
                          xlink:href="ivo://STClib/CoordSys#UTC-FK5-TOPO"
                          id="UTC-FK5-TOPO"/>

        <AstroCoordArea coord_system_id="UTC-FK5-TOPO">
           <AllSky/>
        </AstroCoordArea>

     </stc:STCResourceProfile>

The <AstroCoordSystem> defines a system on the sky by referring to a 
"standard system" via the xlink attributes.  The <AstroCoordArea> 
describes the actual coverage on that system.  The two are linked through 
the id value, "UTC-FK5-TOPO", which, by convention, matches the local 
identifier part of the xlink:href attribute.

An STC description may require multiple coordinate systems to describe its 
coverage, so it needs a way to uniquely connect a particular coverage 
description to a single coordinate system.  This is done with a little 
XML magic by making <AstroCoordSystem>'s id of type xs:ID and 
<AstroCoordArea>'s coord_system_id of type xs:IDREF.  For this to work, 
there must be only one id="UTC-FK5-TOPO" in the entire document.

This is easily satisfied when we have single VOResource records; however, 
the problem comes when we concatenate records into a single document. 
If every record follows the conventional choice, there will be many 
occurrences of id="UTC-FK5-TOPO".  We could change this convention; 
however, we have to realize that the individual VOResource records are 
created independently, so some coordination is needed to ensure 
uniqueness.
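
The clash can be reproduced with a short Python sketch.  The record below is
a stripped-down stand-in (no namespaces, hypothetical <Resource> and
<Response> wrappers), not a real VOResource record, and the function only
counts attributes literally named "id"; a schema-aware parser would apply
the xs:ID rule to every attribute of that type.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily simplified record; real records carry far more.
RECORD = (
    '<Resource>'
    '<AstroCoordSystem id="UTC-FK5-TOPO"/>'
    '<AstroCoordArea coord_system_id="UTC-FK5-TOPO"/>'
    '</Resource>'
)

def duplicate_ids(xml_text):
    """Return the id values that occur more than once in the document."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)

# One record on its own is fine.
single = RECORD
# Two independently written records wrapped in one response document.
combined = "<Response>" + RECORD + RECORD + "</Response>"

print(duplicate_ids(single))    # []
print(duplicate_ids(combined))  # ['UTC-FK5-TOPO']
```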

Concatenation of VOResource records happens in two cases in the Registry 
Interface: within a harvesting response and within a search query 
response.  As Paul Harrison has pointed out, there is an analogous problem 
with VOEvent's use of STC, so this is likely to be a more general problem.

II. Discussion

Paul Harrison posted this very useful summary of suggested alternatives:

On Tue, 5 Dec 2006, Paul Harrison wrote:
> As I see it, there are several solutions to this,
>
> 1. The registry always rewrites the id and coord_system_id within a 
> single record with unique values - e.g. ascending integers for a 
> particular harvest set - this is relatively simple to implement, but 
> it is rather a shame to lose the "human readable" ids; however, the 
> document will be xml valid.
>
> 2. Gather all of the AstroCoordSystem definitions into a special 
> record and retain their human readable IDs and then do not emit the 
> individual AstroCoordSystem elements in the individual records - 
> though for a normal query to the registry (returning one record), it 
> must remember to insert the appropriate AstroCoordSystem(s) from the 
> special record. This would be an extra level of complexity in the 
> registry's housekeeping that it has not had to deal with so far 
> though.
>
> 3. Change the STC schema so that it does not use xs:ID and xs:IDREF 
> types for the cross referencing, but use xs:unique and xs:keyref 
> constraints to ensure integrity of the ids and references - this has 
> the advantage that the scope of the uniqueness can be defined rather 
> than it having to be global to the XML document, so that the ids could 
> be scoped to be unique just within each registry record. This solution 
> seems best to me as it retains XML parser checking of id uniqueness, 
> allows "human readable" ids within each record, and requires no 
> special processing by the registries.

Here are a few comments about these alternatives:

1. Rewriting IDs.

This would have to be done at both publishing time and harvesting time since
the IDs would have to be unique within the entire registry.  Note that you
can't just take another registry's id as is when you harvest; consider:

   o  you have to make sure that the remote registry's locally unique
      id doesn't clash with yours.
   o  when you reharvest a record, you don't know what has changed or
      been added, so every id must be at least examined and perhaps
      updated.

This might be made easier if we augment the id with the registry's IVOA ID;
e.g., id="nvo.ncsa/registry/5:UTC-FK5-TOPO".  In this case, we would only
need to set the ID at publishing time; subsequent rewriting is not
necessary.  Note that the ID part does not need to refer to the registry; it
could be the ID of the resource itself.  If you used the resource id, then
you shouldn't need the additional "/5".
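
A minimal Python sketch of this prefixing step follows.  The function names,
the "." separator, and the character mapping are all assumptions made for
illustration, not an agreed convention.  One caveat the sketch has to handle:
xs:ID values must be NCNames, so characters such as "/" and ":" are not
legal in them and the IVOA ID has to be mapped to something that is.  The
sketch also touches only attributes named id and coord_system_id; a real
implementation would need to cover every xs:ID and xs:IDREF attribute in
STC.

```python
import re
import xml.etree.ElementTree as ET

def ncname_prefix(ivoa_id):
    """Map an IVOA identifier to text that is legal at the start of an xs:ID.

    Assumed mapping: drop the ivo:// scheme and replace every character
    outside letters, digits, '.', '_' and '-' with '_'.
    """
    text = re.sub(r"^ivo://", "", ivoa_id)
    text = re.sub(r"[^A-Za-z0-9._-]", "_", text)
    # An NCName may not begin with a digit, '.' or '-'.
    if not re.match(r"[A-Za-z_]", text):
        text = "_" + text
    return text

def qualify_ids(record, ivoa_id):
    """Prefix id and coord_system_id values in one record, in place."""
    prefix = ncname_prefix(ivoa_id) + "."
    for el in record.iter():
        for attr in ("id", "coord_system_id"):
            value = el.get(attr)
            # Skip values that already carry the prefix, so that running
            # the function again on the same record changes nothing.
            if value is not None and not value.startswith(prefix):
                el.set(attr, prefix + value)
    return record

record = ET.fromstring(
    '<Resource>'
    '<AstroCoordSystem id="UTC-FK5-TOPO"/>'
    '<AstroCoordArea coord_system_id="UTC-FK5-TOPO"/>'
    '</Resource>'
)
qualify_ids(record, "ivo://nvo.ncsa/registry")
print(record[0].get("id"))  # nvo.ncsa_registry.UTC-FK5-TOPO
```

Note that the assumed character mapping is lossy (for example, "a/b" and
"a_b" map to the same prefix), so a real convention would need an escaping
rule that cannot collide.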

My biggest misgivings are:

   o  this requires special processing for a special subset of records
   o  we have to explain how (and why) to do this to publishers.  It's
      not simple.

These are not insurmountable.

2.  Restructure the records.

I believe Paul included this for completeness and for further illustrating
the problem.  Nevertheless, this would require significant processing by
both the sender and receiver to combine and then split the records.  So
(unless I've misunderstood something), this is not particularly appealing.

3.  Changing STC to use xs:keyref and xs:unique.

In principle this is possible because these constraints allow you to say that
combinations of values--e.g. STC id and VOResource identifier--must be
unique.  However, this would require coordination across these two schemas,
which would break their respective designs. Any use of xs:keyref within just
STC (I believe) would inevitably encounter the same problem.

III.  Current Options

We need a solution pretty much right away as this problem is standing in the
way of our registry upgrade work.  I think the simplest solution available
is Paul's suggestion #1, with the variation I suggest to incorporate the
registry's (or the resource's) IVOA ID.

Arnold could, in principle, change the STC schema not to use the xs:ID/IDREF
types.  It could retain the data model, but impose rules of uniqueness that
are outside the capabilities of an XML Schema-aware parser to check; this
would require an application-specific validator to check.  This is not
unprecedented as we have this in VOResource now.  However, I'm not sure this
is practical on a short timescale, and if the #1 solution above is viable,
then changing the STC schema may not be wise or worth the extra validator
development required.
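
To give a rough sense of what such an application-specific validator would
involve, here is a Python sketch that checks ids per record rather than per
document.  The <Resource> and <Response> element names and the absence of
namespaces are simplifying assumptions, and the two rules checked (ids
unique within a record, every coord_system_id resolving within the same
record) are one plausible reading of the uniqueness rules, not a
specification.

```python
import xml.etree.ElementTree as ET

def check_record(record):
    """Return a list of problems in one record's STC cross references."""
    problems = []
    seen = set()
    # Rule 1 (assumed): an id may appear only once within a record.
    for el in record.iter():
        value = el.get("id")
        if value is None:
            continue
        if value in seen:
            problems.append(f"duplicate id {value!r}")
        seen.add(value)
    # Rule 2 (assumed): every reference must resolve inside the same record.
    for el in record.iter():
        ref = el.get("coord_system_id")
        if ref is not None and ref not in seen:
            problems.append(f"dangling coord_system_id {ref!r}")
    return problems

def check_response(response, record_tag="Resource"):
    """Check each record separately; map record index to its problems."""
    report = {}
    for index, record in enumerate(response.iter(record_tag)):
        problems = check_record(record)
        if problems:
            report[index] = problems
    return report

GOOD = (
    '<Resource>'
    '<AstroCoordSystem id="UTC-FK5-TOPO"/>'
    '<AstroCoordArea coord_system_id="UTC-FK5-TOPO"/>'
    '</Resource>'
)
BAD = '<Resource><AstroCoordArea coord_system_id="UTC-FK5-TOPO"/></Resource>'

# Two records reuse the same id without complaint; the third has a
# reference with no matching id in its own record.
response = ET.fromstring("<Response>" + GOOD + GOOD + BAD + "</Response>")
print(check_response(response))
```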

If we assume #2 and #3 above are not viable (especially given our schedule),
the only other option is to drop the use of STC altogether from VOResource
until a solution can be found.  We still have the ability to point to a
footprint service.  Personally, I'm not ready to go here, yet.  I'm not
about to propose an alternate schema to STC (for one, this is not a quick
solution).  More importantly, I'm not ready to drop an important set of
metadata--coverage--recommended by the RM because of a technical glitch in
STC.

In conclusion, if you guys agree that solution #1 is the way to go, then we
will need to get out (quickly) a concise, unambiguous description of how to
form and use these IDs.

cheers,
Ray



