Comments on Canadian VO data model

Patrick Dowler patrick.dowler at nrc-cnrc.gc.ca
Tue Apr 22 14:31:27 PDT 2003


First, thanks for the detailed comments. I will address some/most of them 
below.

General comments: 

- a data model does not stand alone. It is integrally related to the query 
model one adopts. You need types that respond sensibly to queries and you 
need queries that can leverage the available types. Thus, the DM is only half 
of the story (IMO :-).

- the design goal of the Catalog interface and associated data model is to
enable describing catalog entries (observations, processes, sources, etc)
for the purpose of searching for entries that can enable one to do some
science. Thus, we have restricted the property list to things that appear 
useful for querying; there may be other very useful properties that one would 
not query on, but would use for processing (an observation, for example).


On April 22, 2003 10:40, Jonathan McDowell wrote:
> Canadian VO Data Model Comments     - Jonathan McDowell
> -------------------------------
>
> The Canadian VO have published details of the data model used to
> describe images in their archive.
> The relevant documents are at
> http://services.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/doc/cvo/
> This data model is used to describe images and potentially
> spectra and other data products returned from the CVO (the voObs
> object), and also to describe entries in the derived source
> catalogs (the voSrc object which I do not review here).
> I'm sending my comments to the whole list in the hope of prompting
> the rest of you to look at their documents too.
>
> I think there are a few changes to the voObs model which could
> make it more general. The major comments I have are:
>
> A) lack of uniformity on axes
> B) lack of information on observables.
>
> (A) First, the axes: Spatial, Temporal and Spectral. Each of these have
> a
> lot of overlap but not completely; this seems unfortunate because
> if you want to add another axis it's hard to generalize.
>
> Specifically, the relevant attributes are:
>
>           Spatial                 Temporal             Spectral
>
> Shape     _bounds_eq [deg]          NONE                 NONE
> Bounds      NONE                  _bounds [s?]         _bounds  [A]
> Sample    _sample  [deg/bin]      _sample [s?/bin]     _sample  [A/bin]
> Bins        NONE                  _bins   [bin]        _bins    [bin]
> Fill      _fill                   _fill                  NONE
> Res.      _resolution [deg]         NONE               _resolution [A]
> Nyquist   _Nyquist                  NONE               _Nyquist
> [Deprecated:]
> Span      _span      [deg?]       _span   [s?]         _span    [A]

note: _span information is accessible via an operator (aka looks like a
method call on an underlying object). Most methods on the objects usable
as values can be used in queries.

spatial_bounds is 2D-ish. temporal and spectral bounds are 1-D. I don't see
a need to separate shape and bounds since we are specifying the portion of the
coordinate system that was sampled (or a superset when fill < 1).
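The span-via-operator idea can be sketched as follows. Interval is a hypothetical stand-in for the underlying value type; size() is the method-call form mentioned later in this message.

```python
# Hypothetical sketch: the deprecated *_span is not a stored property but a
# method call on the underlying interval value, usable directly in queries.
class Interval:
    def __init__(self, lower, upper):
        self.lower = lower
        self.upper = upper

    def size(self):
        # span = upper - lower, computed on demand rather than stored
        return self.upper - self.lower

time_bounds = Interval(52000.0, 52010.0)  # e.g. [t1, t2] in MJD
span = time_bounds.size()                 # 10.0 days
```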

 

> Notes:
>
> A.1  Spatial bounds are given as polygon nodes in J2000, and
>      repeated as galactic and ecliptic. See notes on regions and bounds
> below.

Strictly an optimisation. We should have a way to specify units for the
spatial_bounds and the ExplorableCatalog service can do whatever it likes to 
convert to its internal format. Then one would have only one spatial_bounds
property...

In general, whenever we were tempted to put something in the model that is an 
optimisation, it was a mistake :-) Such things should be implementation 
details.

>     The choice of polygon nodes as the description of 2D
>     regions is a fair one for the application in question, but doesn't
> generalize
>     well to other VO uses. Eventually one should support a general VO
> region
>    (which can include a circle, for instance, not supported here).

I agree that it would be nice to use Shape rather than Polygon. We thought
to start simple (one concrete type for spatial coverage) and maybe generalise
later. Of course, it is simple enough to say Shape2D in the interface and
have the implementation only allow/deliver Polygon2D, so I can easily do this.

>     I would argue that it would be nice to have 'bounds' mean the
> extreme
>     bounds of each coordinate, as it does for the other axes.
>     As described, the spatial bounds can be a complicated polygon giving
>     the exact shape of the detector, but the temporal bounds are a
> simple
>     range giving the outer hull of the temporal window function.

This seems like an implementation detail for the catalog service. If the 
implementation wants to make a bounding box available, it can put an 
axis-aligned box in spatial_bounds and a spatial_fill < 1 to indicate that 
there is area within the bounds that was not actually sampled. If the 
implementation wants to put the full polygon, it will deliver more value to
the user (potentially) at extra cost for itself (query processing mainly).

- exposing the Shape2D.getBoundingBox method as queryable would 
accomplish the same thing
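One way to picture this split, as a minimal sketch: Shape2D, Polygon2D, and getBoundingBox are named in the discussion; everything else here is an illustrative assumption.

```python
from abc import ABC, abstractmethod

class Shape2D(ABC):
    """Abstract spatial coverage, declared at the interface level."""
    @abstractmethod
    def getBoundingBox(self):
        """Return the axis-aligned (xmin, ymin, xmax, ymax) hull."""

class Polygon2D(Shape2D):
    """The one concrete type the implementation actually delivers."""
    def __init__(self, vertices):
        self.vertices = vertices  # polygon nodes, e.g. J2000 degrees

    def getBoundingBox(self):
        xs = [x for x, _ in self.vertices]
        ys = [y for _, y in self.vertices]
        return (min(xs), min(ys), max(xs), max(ys))

# A detector footprint and its cheap "fast rather than accurate" hull
footprint = Polygon2D([(10.0, -5.0), (12.0, -5.0), (11.0, -3.0)])
bbox = footprint.getBoundingBox()  # (10.0, -5.0, 12.0, -3.0)
```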

For the time_bounds (type: interval) this is not necessarily the outer hull
of the sample. The catalog permits one to have multiple instances of a
property (essentially a list of values) such that one could specify the
time sampling at the level of detail required. For example, if I had a
stack of images taken at [t1,t2] and [t3,t4] then one could describe it 
via a single time_bounds of [t1,t4] - the outer hull - or by having multiple
instances of time_bounds (the complete list). This is part of the structural
model (EntryProp and Entry), so it isn't immediately obvious that one can
do this.

- the same multiplicity is allowed for spectral_bounds...
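A minimal sketch of the multiple-instance idea. EntryProp and Entry are only named here, not specified, so every name in this sketch is an illustrative assumption.

```python
# An entry as a list of (name, value) pairs, so a property such as
# time_bounds may occur more than once.
entry_props = [
    ("time_bounds", (52000.0, 52001.0)),  # [t1, t2]
    ("time_bounds", (52005.0, 52006.0)),  # [t3, t4]
]

def values(props, name):
    """All instances of a given property."""
    return [v for n, v in props if n == name]

def outer_hull(intervals):
    """Collapse a list of intervals to the single [t1, t4] description."""
    return (min(lo for lo, _ in intervals), max(hi for _, hi in intervals))

detailed = values(entry_props, "time_bounds")  # the complete list
hull = outer_hull(detailed)                    # the outer hull [t1, t4]
```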

>     It is useful to have this outer bounds to answer the question 'might
> this
>     dataset contain stuff of interest'. The detailed shape (detector
> polygon,
>     temporal start and stop intervals) is needed when you get to
> actually
>     analysing the data; the next step up is the sensitivity map and
> effective
>     exposure depth versus time. The detailed information should
>     accompany the data when it is retrieved, but arguably may not be
> needed
>     at the index layer that this data model seems to represent.

The usage pattern we support is for the user to say:

         [how many | show me ] observations where spatial_bounds_<coordsys>
                                            contains the point (x,y)? 

Thus, the user doesn't extract and scan these polygons to do that kind of 
thing. A bounding box is just a way of asking the above question and adding
"but do it fast rather than accurate" - which has its place - but probably not 
in the data model itself (like other optimisations).
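The usage pattern above might look like this in miniature; the Box stand-in and all names are illustrative assumptions, not the actual service API.

```python
class Box:
    """Stand-in for a spatial_bounds value (axis-aligned for brevity)."""
    def __init__(self, x1, y1, x2, y2):
        self.x1, self.y1, self.x2, self.y2 = x1, y1, x2, y2

    def contains(self, x, y):
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2

entries = [
    {"id": 1, "spatial_bounds": Box(10.0, -5.0, 12.0, -3.0)},
    {"id": 2, "spatial_bounds": Box(50.0, 20.0, 51.0, 21.0)},
]

def query_contains(entries, x, y, mode="how many"):
    """[how many | show me] observations where spatial_bounds
    contains the point (x, y)."""
    hits = [e for e in entries if e["spatial_bounds"].contains(x, y)]
    return len(hits) if mode == "how many" else hits
```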

> A.2  Why no spatial_bins ? This seems a critical piece of info
>      (e.g. 1024 x 1024 image, or 1x1-spatial-pixel spectrum...)

Didn't see the need for it... basically I don't think anyone would actually 
query on spatial_bins when they had spatial_bounds and spatial_sample
and spatial_resolution. A spatial_bins property would only be a crude 
estimate of the size of the data_product (in MB or GB). 

Basically, *_bins is useful in practice to differentiate between 1 and non-1
bins or for small numbers of bins. 

The best examples are logically associated datasets: a spectral association 
would be a field in the sky with multiple observations with different 
spectral bounds. Here the difference between spectral_bins of 1 and 5 is very 
important because 4-5 lets you do photometric redshift computations while
1-2 does not (but spectral_bins = 2 would let you compute colors). Same could 
be applied to a temporal association: a small number of bins would not allow 
detection of variable objects, but some larger time_bins value (10?) might... 
moving objects probably only need 3+ time_bins.

> A.3  Why no spectral_fill?  Not needed very often, but consistency is
>      helpful.
>      I'm not fully convinced fill is that useful a value, since usually
> what
>      you want is really to take a variable QE across the detector axis
>      into account, rather than just an on/off - although I guess in the
> temporal
>      case a simple fill number is often useful.

If spectral_fill isn't there, that is a mistake. It is needed for cases where
one co-adds some narrow band images to get a pseudo-broad-band
image, and wants to describe it with a single spectral_bounds and 
spectral_fill < 1...
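The co-addition case works out like this, as a sketch with made-up numbers:

```python
# Three narrow bands co-added into a pseudo-broad-band image, described
# by one spectral_bounds plus spectral_fill < 1. All values are made up.
bands = [(4000.0, 4100.0), (4500.0, 4600.0), (5000.0, 5100.0)]  # Angstroms

spectral_bounds = (min(lo for lo, _ in bands), max(hi for _, hi in bands))
covered = sum(hi - lo for lo, hi in bands)
spectral_fill = covered / (spectral_bounds[1] - spectral_bounds[0])
# spectral_bounds = (4000.0, 5100.0), spectral_fill = 300/1100, about 0.27
```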

> A.4  Why no temporal resolution or Nyquist?
>      For old, historical observations the accuracy of the recorded
>      observing time may be poor (I've seen data in the literature,
>      which one could imagine scanning back in, where the observational
>      date is only known to a year or so. Bad, bad referee.)

Didn't seem useful since time_sample covers it and time_Nyquist is
probably always 1. The lack of certainty in your example is part of
the error. 

I suppose scanning observations would have temporal resolution and hence
a non-unity Nyquist ratio... should be added for completeness.


> A.5  It seems a bit labored to have Nyquist as a separate attribute
>      (rather than method) since it is simply the ratio of two other
> attributes.

Agreed. The query half of the story does allow for method calls on properties
(i.e. you can specify things like polygon.area() or interval.size()). However,
nyquist() is a method on the Entry rather than a method on an EntryProp since 
it involves two EntryProp objects. The other way to handle it, which we do 
support to a limited degree, is to use algebra involving the two properties. 
I certainly agree that *_Nyquist is an optimisation and not fundamental. There 
are other such things lurking in there :-)
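As a sketch of the ratio-of-two-attributes point (the function name and signature are assumptions, not the actual Entry method):

```python
# *_Nyquist derived from two existing properties rather than stored
# as a third: e.g. spectral_resolution [A] over spectral_sample [A/bin].
def nyquist(resolution, sample):
    return resolution / sample

spectral_nyquist = nyquist(3.0, 1.5)  # 2.0
```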

>
>
> B) Observables
>
> The "content properties" attributes give derived properties of an image
> that are really the summary of a derived catalog for that image.

The intent of the optional "content" properties is to describe the observation
content in some general/aggregate/statistical fashion so that users can search 
for observations that will probably be worth examining. 


> But the huge thing that seems to be missing here is a description of
> what the pixel values in the data actually represent - I think the
> implied assumption of your model is that they are flux values in Jansky

The Jansky seems to be the fundamental flux unit. Proper unit handling would
remove the necessity of making this uniform at the interface level.

> (or if you prefer, Janskys, but please, not "Jansky's" :-))

Point taken :-)

>
> Even within this assumption, I think there's crucial information that
> could be added:
>   - actual units of image

This is something you need to know to work with the actual data, but not to
query a catalog. Thus, it is part of the Archive data product you download
to tell you how to use the data correctly.

We have tried to make a strong distinction between a Catalog (supports 
querying) and an Archive (delivers data products). An Archive makes 
data products available by publishing to an observation catalog; it has no
query facility at all. 

>   - is the photometry absolutely calibrated, or not?

As above. The archive supplies the zero-point. If we had
non-absolutely-calibrated observations, we probably want a
property saying that so that users know that the data product
will not contain the zero point. Such data is still useful for some
science cases but not others. This would be a required property.

>   - is it linear, or in magnitudes (instrumental or standard)
>   - other indications of photometric quality
>   - saturation level

We are working with the simple case of flux-calibrated observations,
ie linear. I don't think one needs to know this to query, so it is part of the 
archive data product.

> But I think one should allow for the possibility that what is in the
> data is not sky intensity but some other quantity:
>   - spatial image of spectral index (or B-V color)
>   - spatial image of ISM extinction, or Faraday rotation measure
>   - spatial image of CMB dT/T anisotropy
>   - extinction versus wavelength
>   - integrated line flux versus time
>   - radial velocity versus time
>   - observatory humidity versus time

Excellent!!! 

This is why we really need people from different groups/institutes/fields 
contributing to the model...


> So I would propose
>
>   observable_quantity: String [REQUIRED]  The quantity represented by
> the pixel values.
>               The usual value is "SKY FLUX DENSITY".
>   observable_unit:     String [REQUIRED]  The unit of above, e.g. "Jy",
> "count",
>                                           "mag".

There is a very real danger with String values: everyone creating an Entry has 
to use the same terminology or searching becomes very messy. I have on the
back burner a plan for an enumerated value type that imposes value constraints
that are known. The allowed values would be enforced when adding content to a
catalog and users could see the list of allowed values by probing the 
EntryPropMap - which they already do to see the property names, types, and
units.

Basically, it is like property names: everyone has to agree or we're screwed 
:-)


So, the type would be something like Enum.String rather than String.
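A sketch of such an enumerated string type. The name Enum.String comes from the text; this class, its methods, and the sample value list are illustrative assumptions.

```python
class EnumString:
    """A string type whose allowed values are fixed and discoverable,
    enforced when content is added to the catalog."""
    def __init__(self, allowed):
        self.allowed = frozenset(allowed)

    def validate(self, value):
        if value not in self.allowed:
            raise ValueError(f"{value!r} not in {sorted(self.allowed)}")
        return value

# e.g. the proposed observable_quantity property
observable_quantity = EnumString({"SKY FLUX DENSITY", "RADIAL VELOCITY"})
observable_quantity.validate("SKY FLUX DENSITY")  # accepted
```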

> As for the content properties:
>
> I'm intrigued by the choice of S/N = 10 for your point source
> reference. I would have thought that S/N = 3 might be more helpful
> for people who are interested in 'is there a chance my source might be
> there?'
> which I think is the most common question.

This is supposed to be an indication of depth/sensitivity. The idea is that
an astronomer can use this to find observations of sufficient depth...


> Again, one can generalize on axes. The number density things
> are crying out for generalization: How about
>  spatial_feature_density_positive_total
>  spatial_feature_density_positive_resolved
>  spectral_feature_density_positive_total
>  spectral_feature_density_positive_resolved
>  spectral_feature_density_negative_total
>  spectral_feature_density_negative_resolved
>  temporal_feature_density_positive_total
>  temporal_feature_density_positive_resolved
>  temporal_feature_density_negative_total
>  temporal_feature_density_negative_resolved

Nice!

> Negative spatial features may also be worth counting since they
> may indicate localized absorption or incorrect background
> estimate.

-- 
Patrick Dowler
Tel/Tél: (250) 363-6914 | Fax: (250) 363-0045
Canadian Astronomy Data Centre   | Centre canadien de données astronomiques
National Research Council Canada | Conseil national de recherches Canada
Government of Canada             | Gouvernement du Canada
5071 West Saanich Road           | 5071, chemin West Saanich
Victoria, BC                     | Victoria (C.-B.)


