Putting the pieces together...

Thu May 13 13:52:04 PDT 2004

(I am moving my response to this to DAL since this is mainly a discussion
of SSA).

Hi Tom -

This is an excellent way of looking at the problem of how we actually
use the VO in a typical data analysis scenario.  At this level we have
three main elements:

    o	User application (VSPlot).
    o	Registry.
    o	DAL Service (SSA).

Most of the functionality you identify on the VO side is provided by SSA.
This is what SSA is all about - making it possible to write actual
real-world data analysis applications that access spectral data via the VO.

On the data modeling side the key thing here is the SSA data model.  While
the goal is to keep this "simple" (focused only on one class of data,
limited to 1D spectra, etc.) it also needs to be real enough to actually
be useful, e.g., define a useful set of spectral coordinate types, flux
vector types, deal with background, errors, and so forth.

On the DM side, we can provide something useful for actual data analysis
given only the SSA data model.  Already, though, things start to emerge
which we would like to standardize more broadly, e.g., data provenance,
data characterization (part of what is currently called the "observation"
data model).  This is a good way to develop things: start with something
useful that actually works end-to-end, and as the technology develops,
refine it to include more sophisticated standard query facilities,
standardized component data models and metadata, etc.

Regarding data coming back from various sources: data providers should
be strongly encouraged to implement services that return data in the
service-defined data model.  Since most data has to be reformatted anyway
this won't be that hard to do.  Applications will probably use the registry
to select only services which are capable of returning data in a format
defined by the SSA protocol (SSAP).  These formats all implement the SSA
data model.

Ivo - regarding your point about data quality vectors:  As you know,
the SSA data model has a data quality vector.  We don't really know what
to put in it though.  I don't think we should put anything instrumental
in nature in the general SSA data model (this can be done but it would
go into nonstandard extension records).  Simple models for the quality
vector would be binary (good or bad) or trinary (known good, known bad
or flagged, or questionable).  Perhaps once we get more experience with
real data from archives it will be possible to develop a more refined
quality model.  (Note this should not be confused with the error vectors
which we already have).
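
To make the trinary idea concrete, here is a minimal sketch in Python.
The integer codes (0, 1, 2) and the names Quality and usable_mask are
assumptions made up for illustration; nothing here is defined by the SSA
data model.

```python
from enum import IntEnum


class Quality(IntEnum):
    """Hypothetical trinary quality codes (values are illustrative only)."""
    GOOD = 0          # known good
    BAD = 1           # known bad or flagged
    QUESTIONABLE = 2  # neither confirmed good nor confirmed bad


def usable_mask(quality, accept_questionable=False):
    """Return one boolean per sample: True if the sample should be used."""
    accepted = {Quality.GOOD}
    if accept_questionable:
        accepted.add(Quality.QUESTIONABLE)
    return [Quality(q) in accepted for q in quality]


# A flux vector with a parallel quality vector of the same length.
flux = [1.2, 0.9, 3.4, 1.1]
quality = [0, 2, 1, 0]

strict = [f for f, ok in zip(flux, usable_mask(quality)) if ok]
lenient = [f for f, ok in zip(flux, usable_mask(quality, True)) if ok]
```

The binary model is just the special case where QUESTIONABLE never occurs,
so an application written against the trinary codes would handle both.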

	- Doug

On Thu, 13 May 2004, Thomas McGlynn wrote:

> [Mail note.  I've sent this to the DM and APPS groups, it's
> relevant to others, but I don't want to get 5 copies of
> every response.]
> 
> There has certainly been plenty of mail on data models, registries,
> STC, UCD's, UTYPE's, measurements, etc, but a lot of it is
> frustrating for me.  I can't seem to get a hold on how much of it
> gets used and what the consequences in terms of what the
> developer or user sees will be.  We seem to spend a lot of
> time discussing abstract data and data structures, but there
> has been very little about how software finds and uses this information.
> 
> Until these discussions are anchored in some more concrete framework
> for how the data models, UTYPES, STC are used, it may
> be very difficult to come to any real consensus on what
> they should look like.
> 
> Below I go through an extended example which shows how I think
> we might use many of the concepts that have been the
> subject of discussion.  It's clear to me that there are enormous
> largely undiscussed gaps between these data concepts and
> their use in software.
> 
> 
> ------------------------------------------------------------------
> Scenario:
> 
> Find and plot spectra of TY Pyxis in the range from the soft X-ray
> to the optical (1 A to 10000 A). Let's call the software tool
> that does this VSPlot.
> 
> Step 1.
> 
> VSPlot needs to know the location of at least one registry.
> VSPlot makes a query of the registry using a standard registry protocol.
> VSPlot asks for all SSAP services that might have data in the desired
> regimes and location.
> 
>    Issue 1.A: Is the registry query syntax different from the VOQL protocol?
>    If so what is it?
> 
>    Issue 1.B: What is the protocol for the registry query?  Is it defined
>    by some standard registry WSDL?
> 
>    Issue 1.C: VSPlot hardwires the connection between spectra and SSAP services.
>    Are we restricted to a single kind of service for each kind of data?  Do
>    we need to register the attributes of the kinds of services so that if I'm
>    interested in getting archival spectral data I might learn that there are SSA
>    services and maybe other kinds of services that I want to query?
> 
>    Issue 1.D: How does VSPlot know where the registry is?  Is there a registry
>    of registries?  What is the root of the hierarchy?
> 
> Step 2.
> 
> VSPlot parses the set of services returned from the registry.
> 
>    Issue 2.A: Where is the structure/protocol of this returned data defined?
> 
>    Issue 2.B: What is the contract regarding these services?  (I.e.,
>    given that I asked for services that meet some criteria (spectral
>    and spatial coverage), do I know that these services will actually
>    have data that meets these criteria?  Probably not I think.)
> 
>    Issue 2.C: How much information is stored in the registry
>    about each of the SSA services?
> 
> Step 3.
> 
> VSPlot queries the potential matching services one at a time to
> get links to candidate spectral data using the SSA protocol.
> 
>    Issue 3. Need definition of the SSA protocol.
> 
> Step 4.
> 
> VSPlot now has links to a list of files that may be of interest
> for plotting.  We begin a loop over this list.
> 
> VSPlot copies a spectral file into local storage.
> 
> Step 5.
> 
> VSPlot determines if the file supports the Spectrum data model.
> If the file does not support this data model it is discarded.
> 
>    Issue 5.A:  How do we find out if a data element supports a given
>    data model?  Is it required that any file returned by the SSA
>    support the Spectrum data model?  If so where do we put the mapping
>    between service types and the data models that the returned
>    data is going to support?
> 
>    Issue 5.B: Is there some list of the potential data models that
>    any file might support?
> 
> Step 6.
> 
> VSPlot looks for frame information for this file to confirm
> that it is a spectrum at the appropriate location and in the appropriate
> spectral regime for further processing.
> 
>    Issue 6.A For a FITS file I know how to do this.  I'm much less
>    clear how to do this for arbitrary data returned by an SSA service.
>    Is there a standard method associated with the Spectrum data
>    model that enables me to find this out?  Basically we're asking
>    how we discover the STC information for a given dataset and the
>    comparable spectral info.
> 
>    Issue 6.B Is coverage information (spatial and spectral) required to be
>    in a standard format?  If so what is that format?  If not do we have
>    standard conversion services or is it the responsibility of the application
>    to convert?
> 
> Step 7.
> 
> VSPlot iteratively uses the standard (in this scenario) getNextElement method defined
> in the spectrum data model to extract data from the file.
> 
>    Issue 7.A  How do we use the data model in real code?  Is the
>    data model associated with a set of Java classes that we can
>    invoke on the data?  If the data model is more than documentation
>    we need to be able to instantiate behavior in some TBD way.
>    How do we preserve language independence? (Or do we?)
> 
>    Issue 7.B Does the data model describe behavior that is defined
>    for the data element or does it indicate that the data is convertible
>    to some fiducial form?  If the latter who is responsible for the conversion?
> 
> 
> Step 8.
> 
> The user had indicated that they wanted the spectrum to be flux versus
> wavelength.  VSPlot needs to see if it can convert the data extracted
> from the file into those units.  VSPlot looks at the UCDs and Units
> associated with the spectra.  It converts columns to the desired
> units where possible.  Spectra where the data are not convertible
> are discarded.
> 
>    Issue 8.A. How does VSPlot know which column to look at as the flux-like
>    column and which as the wavelength-like column?  It could look through
>    a list of potential UCD's or UTYPE's could be invoked here.   Could
>    the UCD and UTYPE seem to conflict?
> 
>    Issue 8.B.  How do we do the transformations? Is this VSPlot's responsibility
>    or do we support standard VO transformation services?
> 
>    Issue 8.C.  This is a hard step.  How does VSPlot know enough to distinguish
>    between raw and background subtracted spectra and the myriad details like that?
>    Is this a characteristic of the flux column or of the entire spectral file?
>    This seems to be where all of the discussion of measurements and quantities
>    needs to provide some benefit to the user.
> 
>    Issue 8.D. How does VSPlot find and use the measurement data model
>    information to help here?  What functionality is associated
>    with a measurement?
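
As an illustration of the spectral-axis part of Step 8, the following is
a minimal Python sketch of the case where the application itself does
the conversion.  The unit strings ("Angstrom", "nm", "Hz", "keV") and the
function names are assumptions chosen for the example; a real service
could label its units differently.  Returning None stands for "not
convertible", in which case the spectrum would be discarded.

```python
C_ANGSTROM_PER_S = 2.99792458e18  # speed of light in Angstrom per second
HC_KEV_ANGSTROM = 12.398419843    # h*c in keV * Angstrom


def to_angstrom(value, unit):
    """Convert one spectral coordinate to wavelength in Angstrom.

    Returns None when the unit is not recognised.
    """
    if value <= 0:
        raise ValueError("spectral coordinate must be positive")
    if unit == "Angstrom":
        return value
    if unit == "nm":
        return value * 10.0
    if unit == "Hz":   # frequency: lambda = c / nu
        return C_ANGSTROM_PER_S / value
    if unit == "keV":  # photon energy: lambda = h*c / E
        return HC_KEV_ANGSTROM / value
    return None


def in_requested_range(value, unit, lo=1.0, hi=10000.0):
    """True if the coordinate falls in the scenario's 1 A to 10000 A range."""
    wavelength = to_angstrom(value, unit)
    return wavelength is not None and lo <= wavelength <= hi
```

For example, in_requested_range(500.0, "nm") is true (5000 A), while a
1 GHz radio frequency converts to a wavelength far outside the range.
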
> 
> 
> Step 9.
> 
> The data is searched for error columns using UCDs and error bars are computed.
> 
>    Issue 9.A. Errors need to be transformed if the data is transformed, but
>    the transformations can be complex.  Where is this handled?
> 
>    Issue 9.B.  How do we associate the error columns with the appropriate
>    measurements?  Again this seems to be part of the measurement discussion
>    but I need to know how this model is instantiated for it to be useful.
>    Does it use Groups in VOTables?  Are there other mechanisms?
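
To show what Issue 9.A involves, here is a minimal Python sketch of
first-order error propagation for two common transformations.  It assumes
the wavelength is exact and the flux errors are small, independent
1-sigma values; the function names are made up for the example.
Converting flux per unit frequency to flux per unit wavelength is a
linear rescaling, so the error scales by the same factor.  Taking a
logarithm is nonlinear, so the error is scaled by the derivative.

```python
import math

C_ANGSTROM_PER_S = 2.99792458e18  # speed of light in Angstrom per second


def fnu_to_flambda(wavelength_angstrom, f_nu, sigma_nu):
    """Convert F_nu (per Hz) to F_lambda (per Angstrom) with its error.

    F_lambda = F_nu * c / lambda**2; with lambda treated as exact the
    error is multiplied by the same factor.
    """
    scale = C_ANGSTROM_PER_S / wavelength_angstrom ** 2
    return f_nu * scale, sigma_nu * scale


def flux_to_log10(flux, sigma):
    """Return log10(flux) and its first-order error sigma / (flux * ln 10)."""
    if flux <= 0:
        raise ValueError("log10 is undefined for non-positive flux")
    return math.log10(flux), sigma / (flux * math.log(10.0))


f_lam, s_lam = fnu_to_flambda(5000.0, 1.0e-26, 1.0e-27)
log_f, s_log = flux_to_log10(100.0, 10.0)
```

The relative error is unchanged by the linear rescaling, which is a quick
sanity check an application can apply after transforming a spectrum.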
> 
> 
> Step 10.
> 
> The data and errors are plotted.  Fini!
> 
> -----------------------------------------------------------
> 
> This is intended to give only an example of how these
> pieces might play together.  I don't know that there is any more
> formal description of this architecture -- nor do I know
> who is responsible for one.  Without such a broader picture
> of how all these things interact it's very difficult
> to assess all the myriad proposals that show
> up in the mail.
> 
> The issues seem to repeat two themes:
> 
> How do I find and get the various data structures that we've
> been talking about?
> What functionality is associated with them?  If we're talking
> about data models as objects, then what are the methods as
> well as the fields?
> 
> 
> If we can focus on what we really want to use the quantity
> model for, or the UTYPEs, or the spectral data model, then I think
> we'll be a lot more successful at defining them -- and we'll make
> a lot of progress towards deliverable VO applications!
> 
> The places where we've been most successful in the VO are when
> we have balanced definition of structures with protocols for using
> these structures: e.g., VOTables, UCDs and CGI access in the SIAP.
> 
> 		Regards,
> 		Tom McGlynn
> 
> 
>