A Virtual Observatory Data Model

Fri May 9 17:05:03 PDT 2003

                        [NOAO Data Products Program]

                              Francisco Valdes
                              fvaldes at noao.edu
                                May 9, 2003

COVER LETTER

The following (also http://iraf.noao.edu/projects/vo/dal/datamodel.html) is
a contribution for the data model discussions at the working groups meeting
next week in Cambridge. It is an extension of some earlier ideas ([1], [2])
on including celestial coordinates in the WCS for 1D and 2D spectra and on
the question about whether accessing spectra and images in the prototype VO
framework requires different protocols. Because of the deadline imposed by
the meeting the discussion is abbreviated in some areas. My hope is to
emphasize the general philosophy and approach. There are some important
ideas which I support in the Spectral Data Models draft from Jonathan
McDowell and Steve Lowe but there are some philosophical differences which I
wanted to offer. Primarily the ideas of treating images and spectra as
projections of a more general class and simplifying as much as possible by
limiting VO data to "calibrated" forms which don't require complex metadata
to interpret.

Because I decided it was more valuable to try and build a consistent
discussion from my perspective I did not have time to also critique the
McDowell and Lowe concepts. But it makes sense that before doing that one
really needs to reach consensus on whether to treat spectra of various
dimensions separately or whether to work towards an integrated
spectrum/image data model.

Good luck in the meetings. I'm sorry I can't be there.

Frank Valdes

   * [1] Spectral WCS Conventions
     About FITS WCS and how 1D and 2D spectra can include celestial
     coordinates.
   * [2] Incorporating Spectra in the Next Phase of the Virtual Observatory

1. What is a virtual observatory data model?

The first hurdle to overcome in defining virtual observatory (VO) data
models is to understand what they are and what they are not. In the
discussion given here a VO data model is the SIMPLEST abstraction of
physically calibrated, wavelength regime and detector technology independent
astronomical data.

We emphasize simplest because a key part of the VO concept is that users,
called VO observers, should not need to be experts in every regime of
astronomy but need only be educated astrophysicists. The science done by
VO observers generally involves data from various telescopes and various
energy subdisciplines. The reason for striving towards the simplest
description is to allow consensus and interoperability between a wide
variety of data providers.

The other side of the question, which should be a mantra of sorts, is:

    "VO data models are not FITS or file formats"
    "VO data models are not archived data"
    "VO data models are not instrumental data"

2. Celestial Sphere Binned Photon Observations - 4DBIN

This document defines a broad class of astronomical data called "Celestial
Sphere Binned Photon Observations". Note that the detailed definition of the
class identified by this label is more specific than the literal
interpretation of the words. The definition of the class flows from the name
as follows.

     Celestial Sphere
          Restricts the class to data about the two dimensional
          celestial sphere. There are two spatial parameters specifying
          a longitude and a latitude in some specified celestial
          system.
     Photon
          Restricts the class to data about the photon energies as
          described by an energy parameter.
     Binned
          Restricts the class to data about the number of photons
          arriving over finite regions, called bins, of the parameter
          domain. A way to look at this is that photon events are
          indistinguishable within a bin. A further restriction is that
          the bins are rectangular so they may be described by a center
          and width in each parameter.
     Observations
          Restricts the class to data about photons over a time
          described by a time parameter. Observation evokes the idea of
          detecting photons over an integration period, though
          simulation and model results can be cast into simulated
          observations.

This definition of the class has four parameters: two for celestial
position, one for energy, and one for time. These form a continuous space or
domain which is divided into a set of bins that are not necessarily
uniformly distributed or of equal size. Each bin is associated with the
number of photons it contains.
The number of photons may be expressed in various ways such as number,
energy, and flux.

This class may be thought of as data obtained through the following process.
Photons of various energies are detected as a function of time coming from
points on the sky. Each photon is tagged by four numbers from a four
dimensional continuous space. The numbers are a latitude and longitude on
the celestial sphere from which the photon arrives, the energy of the
photon, and the time. The continuous space is divided into a set of discrete
regions or bins which are indexed in some fashion. The photons are counted
in each bin. The details of the continuous energy, position, and time
parameters are lost and only the bin index and bin counts are retained.
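
The process just described can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the names Bin and
count_photons, the representation of a photon as a (longitude, latitude,
energy, time) tuple, the half-open bin intervals, and the absence of any
longitude wraparound handling are all hypothetical choices and not part of
any VO specification.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical layout for one photon event: (lon, lat, energy, time)
Photon = Tuple[float, float, float, float]


@dataclass
class Bin:
    """A rectangular bin: one center and one width per parameter."""
    center: Tuple[float, float, float, float]
    width: Tuple[float, float, float, float]

    def contains(self, photon: Photon) -> bool:
        # Half-open interval [center - w/2, center + w/2) in every parameter.
        # Assumption: no wraparound handling for longitude.
        return all(c - w / 2 <= p < c + w / 2
                   for p, c, w in zip(photon, self.center, self.width))


def count_photons(photons: Sequence[Photon], bins: Sequence[Bin]) -> List[int]:
    """Return the photon count per bin, indexed like `bins`."""
    counts = [0] * len(bins)
    for photon in photons:
        for i, b in enumerate(bins):
            if b.contains(photon):
                counts[i] += 1
                break  # assumes the bins do not overlap
    return counts


# Two bins that differ only in their energy center (made-up numbers).
bins = [Bin((10.0, 20.0, 4001.0, 150.0), (1.0, 1.0, 1.0, 300.0)),
        Bin((10.0, 20.0, 4002.0, 150.0), (1.0, 1.0, 1.0, 300.0))]
photons = [(10.1, 20.2, 4000.9, 12.0),
           (9.8, 19.9, 4001.7, 250.0),
           (10.0, 20.0, 4002.4, 299.0)]
print(count_photons(photons, bins))  # [1, 2]
```

Only the bin index and the count survive; the individual photon parameters
are discarded once the loop finishes.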

This definition makes a notable distinction between the measured quantity,
the photons, and the sampling, the bins. This distinction is often confused
or lost. The photons, sometimes thought of as the "z" axis in an image, are
the scientific content which is conveyed in standard physical units. The
sampling or binning is variable and dependent on the way the data was
obtained. The VO infrastructure or the data providers may "convert" units
for the photon values and "resample" the bins at the request of the VO
observer.

To identify data which falls into this class we define a top level tag

        VOCLASS = 4DBIN

2.1 What is the difference between VO data and observational data?

A key aspect of virtual observatory photon binned data is that the primary
bin values be calibrated to standard physically meaningful units. There are
two important reasons for this. One is to allow VO observers to easily
intercompare data with only simple physical unit conversions. The other is
to simplify the data model and limit metadata which must be supplied to
allow meaningful interpretation.

This does impose a small burden on the data providers beyond what has been
typical. For instance, optical imaging often provides data in digital units
with the conversion to photons implicit in a gain and a magnitude zeropoint.
For VO data the data provider does the gain multiplication and the
conversion of the magnitude system to photon based units. Non-optical
astronomers then don't need to understand the detector technology or the
conventions of magnitude systems, and the metadata doesn't need to include a
gain and magnitude zeropoint.
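
As a rough sketch of what this provider-side conversion might look like,
the Python below applies a gain and turns a count rate into a flux density.
Several things here are assumptions made for illustration only: the function
names, a gain expressed in electrons per digital unit with one electron per
detected photon, a zeropoint defined by m = zeropoint - 2.5 log10(rate), and
a zeropoint on the AB magnitude system, where magnitude 0 corresponds to
3631 Jy. A real provider would use whatever photometric system the data was
actually calibrated in.

```python
import math

AB_ZERO_FLUX_JY = 3631.0  # flux density of a magnitude-0 source in the AB system


def digital_units_to_electrons(adu: float, gain: float) -> float:
    """Apply the gain (electrons per digital unit).

    Assumption: one electron corresponds to one detected photon.
    """
    return adu * gain


def count_rate_to_jansky(rate: float, zeropoint: float) -> float:
    """Convert a count rate to Jy, assuming an AB zeropoint.

    Assumes m = zeropoint - 2.5 * log10(rate), with `rate` in the same
    units (e.g. digital units per second) the zeropoint was defined for.
    """
    magnitude = zeropoint - 2.5 * math.log10(rate)
    return AB_ZERO_FLUX_JY * 10.0 ** (-0.4 * magnitude)


print(digital_units_to_electrons(100.0, 2.5))  # 250.0
print(count_rate_to_jansky(1.0, 0.0))
```

Once the values are stored in physical units, neither the gain nor the
zeropoint has to travel with the data.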

In order to provide a "caveat emptor" option to the VO observers and data
providers, a top-level metadata declaration is whether the primary data
values meet the VO standard for this class:

        4DBIN.CALIBRATED = [yes|no|relative]

By asserting "no" the data may be useful but would require the VO observer
to calibrate it themselves in some way. The "relative" calibration is a way
to assert that the data is proportional to photon counts and that the
response to photon fluxes is independent of position (after taking
differences in bin sizes into account). Therefore, relative comparison
between different bins is scientifically meaningful even though an absolute
calibration is not defined.
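
One way a client might act on this declaration is sketched below in Python.
The Calibrated enumeration, the helper names, and the idea of holding the
metadata in a plain dictionary keyed by the tag name are assumptions made
for illustration; only the three values yes, no, and relative come from the
definition above.

```python
from enum import Enum


class Calibrated(Enum):
    YES = "yes"
    NO = "no"
    RELATIVE = "relative"


def parse_calibrated(metadata: dict) -> Calibrated:
    """Read the 4DBIN.CALIBRATED declaration from a metadata mapping."""
    return Calibrated(metadata["4DBIN.CALIBRATED"].strip().lower())


def supports_absolute_comparison(state: Calibrated) -> bool:
    # Only fully calibrated data can be compared across data sets directly.
    return state is Calibrated.YES


def supports_bin_to_bin_comparison(state: Calibrated) -> bool:
    # "relative" data is proportional to photon counts, so comparisons
    # between bins of the same data set remain meaningful.
    return state in (Calibrated.YES, Calibrated.RELATIVE)


state = parse_calibrated({"4DBIN.CALIBRATED": "relative"})
print(supports_absolute_comparison(state), supports_bin_to_bin_comparison(state))
```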

Note that the first sentence of this section refers to the "primary bin
values". The reason for this is that the observational and calibration
characteristics appear in the ancillary data and metadata. This is primarily
contained in the uncertainties but some other useful information may be
provided in exposure maps and data quality flags.

2.2 What is an image and a spectrum?

Inasmuch as astronomers define and distinguish between "images" and
"spectra", an image is a subclass with only a single energy bin, a single
time bin, and multiple bins in both spatial parameters. The energy bin is
often fairly wide but not always. A spectrum also has only a single time
bin, but has more than one energy bin, and one or more spatial bins.

Astronomers also typically discriminate between spectra having a single
spatial bin, called a "one-dimensional spectrum", and multiple spatial bins,
often called a "data cube". The special case of spatial bins restricted to a
curve on the celestial sphere is called a "slit spectrum".

In this document there is no distinction made between spectra and images.
However, one could choose to subclass the metadata concepts. A subclass
means using implicit and explicit conventions and defaults. The subclasses
might be:

        VOCLASS = 4DBIN.IMAGE
        VOCLASS = 4DBIN.1DSPECTRUM
        VOCLASS = 4DBIN.SLITSPECTRUM
        VOCLASS = 4DBIN.DATACUBE
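
The definitions in this section can be turned into a small decision rule
based on the number of bins along each parameter. The Python sketch below is
one possible reading, not a normative rule: the function name classify is
hypothetical, and it assumes that a slit spectrum shows up as multiple bins
along exactly one spatial array axis, which need not hold for a slit that
follows a general curve on the sky.

```python
def classify(n_lon: int, n_lat: int, n_energy: int, n_time: int) -> str:
    """Suggest a VOCLASS value from the number of bins along each parameter."""
    if min(n_lon, n_lat, n_energy, n_time) < 1:
        raise ValueError("every parameter needs at least one bin")
    if n_time != 1:
        # Images and spectra as defined here have a single time bin.
        return "4DBIN"
    if n_energy == 1:
        if n_lon > 1 and n_lat > 1:
            return "4DBIN.IMAGE"
        return "4DBIN"
    # More than one energy bin: some kind of spectrum.
    if n_lon == 1 and n_lat == 1:
        return "4DBIN.1DSPECTRUM"
    if n_lon == 1 or n_lat == 1:
        # Assumption: the slit runs along one array axis.
        return "4DBIN.SLITSPECTRUM"
    return "4DBIN.DATACUBE"


print(classify(512, 512, 1, 1))  # 4DBIN.IMAGE
print(classify(1, 1, 8, 1))      # 4DBIN.1DSPECTRUM
```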

3. Metadata

Data from the 4DBIN class fundamentally consists of a set of numbers related
to photon counts. To make sense of this set of numbers requires metadata or
conventions which describe the relationship between photon counts and the
bin value, define the bins, give the uncertainties in the values, and
supply associated attributes.

As a thought experiment, which we use to identify the metadata through a use
case, suppose one is given the set of numbers {0,6,7,2,5,3,1,4}. What do we
need to understand something about the photons observed on the sky? Along
these lines the minimal metadata necessary should be separated from optional
metadata. Here we suggest the minimal description is provided by section 3.1
on the bin geometry and section 3.2 on the bin values.

First we need a top level piece of metadata defining the class and
conventions. This type of metadata is sometimes associated with a name, such
as FITS (with SIMPLE=T). For this document we define this metadata class
domain

        VOCLASS = 4DBIN

3.1 Bin Geometry

The metadata for the bin geometry describes the mapping from the continuous
four dimensional photon parameter space to the discrete indexed bins. As
noted in section 2, the bins are required to be described by a center and
width along each of the four parameter dimensions. This constitutes the bin
geometry.

The first thing we need is a definition for the indexing of the data bin
values. There are two straightforward ways to do this. One is to use the
ordinal of the data value set. The other is to arrange the values into an
array. For the 4DBIN class the array is required to be four dimensional.

        4DBIN.INDEXING = ordinal
        4DBIN.INDEXING = array(N1,N2,N3,N4)

3.1.1 Ordinal or tabular indexing

The first method is completely general while the second requires the number
of data values to be the product of the array dimensions. At this point the
two indexing schemes seem pretty much the same. The distinction comes in how
the indices are used to map to the bin geometries in the four dimensional
parameter space. In practice, the ordinal indexing is used with a table and
the array is used for gridded bins.
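
The constraint on the array form is easy to check mechanically. The Python
sketch below parses the two 4DBIN.INDEXING values given above and verifies
that an array shape accounts for every data value. The function name
check_indexing and the choice to return an empty tuple for ordinal indexing
are assumptions made for this illustration.

```python
import math
import re


def check_indexing(indexing: str, n_values: int) -> tuple:
    """Validate a 4DBIN.INDEXING value against the number of data values.

    Returns the array shape, or an empty tuple for ordinal indexing.
    """
    if indexing.strip() == "ordinal":
        return ()  # ordinal indexing places no constraint on n_values
    match = re.fullmatch(r"array\((\d+),(\d+),(\d+),(\d+)\)",
                         indexing.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized indexing: {indexing!r}")
    shape = tuple(int(n) for n in match.groups())
    if math.prod(shape) != n_values:
        raise ValueError(f"shape {shape} does not hold {n_values} values")
    return shape


# The eight-value example set could be a 1D spectrum with eight energy bins.
print(check_indexing("array(1,1,8,1)", 8))  # (1, 1, 8, 1)
```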

In the ordinal indexing the metadata includes a table of bin geometry
values. The table is a set of numbers ordered such that each sequential set
of eight values defines a line and the line number corresponds to the data
value with matching ordinal. For example, the first eight numbers apply to
the first data value, the second eight to the second data value, and so
forth. The eight values are the bin centers in longitude, latitude, energy,
and time followed by bin widths.

In the simple 1D spectrum example we might have

  0 : 12h10m15s 32d15m10s 4001A 2003-05-07T12:10:15 1arcsec 1arcsec 1A 300s
  6 : 12h10m15s 32d15m10s 4002A 2003-05-07T12:10:15 1arcsec 1arcsec 1A 300s
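
A minimal Python sketch of how such a table could be matched to the data
values by ordinal follows. The names BinGeometry and attach_geometry are
hypothetical, and the geometry entries are kept as unparsed strings because
the encoding of coordinates and units is not defined here.

```python
from typing import List, NamedTuple, Sequence


class BinGeometry(NamedTuple):
    """Four bin centers followed by four bin widths, as unparsed strings."""
    lon: str
    lat: str
    energy: str
    time: str
    lon_width: str
    lat_width: str
    energy_width: str
    time_width: str


def attach_geometry(values: Sequence[float], table: Sequence[str]) -> List[tuple]:
    """Pair each data value with its eight geometry entries, matched by ordinal."""
    if len(table) != 8 * len(values):
        raise ValueError("table must hold eight entries per data value")
    rows = [BinGeometry(*table[i:i + 8]) for i in range(0, len(table), 8)]
    return list(zip(values, rows))


table = ("12h10m15s 32d15m10s 4001A 2003-05-07T12:10:15 1arcsec 1arcsec 1A 300s "
         "12h10m15s 32d15m10s 4002A 2003-05-07T12:10:15 1arcsec 1arcsec 1A 300s").split()
pairs = attach_geometry([0, 6], table)
print(pairs[1][1].energy)  # 4002A
```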

3.1.2 Array or raster indexing

For the array indexing we use a metadata description along the lines of the
FITS WCS. This is a complex description which we only touch on here with
attention to the restrictions imposed by the 4DBIN class. The metadata
components would include many of the basic elements of the FITS WCS
metadata. Besides the actual formalism for evaluating the bin centers and
widths another key piece of metadata is the units of the four parameters.

The main restriction on the FITS WCS formalism as it applies to the 4DBIN
class is that the axes ordering is required to be latitude, longitude,
energy, and time and so the FITS WCS is always a WCSDIM of 4. The FITS WCS
does not currently explicitly define time coordinates. But for the main data
types of interest, images and spectra with a single time bin, we simply use
a linear WCS.

The bin centers are a direct analog to the pixel centers in the FITS WCS.
There is a linear mapping from the array index to an intermediate WCS
coordinate. There is potentially a distortion transformation to an ideal
intermediate coordinate. For calibrated data typical of the VO this should
not be required except possibly to describe the path of a slit spectrum on
the sky. Finally there is a projection or standard non-linear transformation
to the final coordinates.

One new feature of the FITS WCS formalism is the use of a lookup table. This
allows for bin centers which are not uniformly arrayed in the parameter
space. It can provide similar information to the ordinal description.

The concept of bin widths is only implicit in the FITS WCS formalism. For
the array indexing metadata model defined here, the bin widths are computed
from the WCS using the idea that the WCS functions are continuous in the
index space. So the bin edges are computed by adding and subtracting
one-half to the integer indices and evaluating the parameter value at those
points. The WCS formalism is more general than simple rectangular bins so
this computation is done by varying only the index of one parameter. The
width of the bin is the average difference between the integer index center
and the two half index values.
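
The edge computation can be sketched in Python for a single parameter, with
the other three indices held fixed. The function names and the linear
wavelength axis are hypothetical. The width is taken here as the
edge-to-edge extent of the bin, which is an assumption about how the
averaging described above is meant to be normalized.

```python
from typing import Callable, Tuple


def bin_edges(wcs: Callable[[float], float], index: int) -> Tuple[float, float]:
    """Evaluate a continuous one-parameter WCS function at index -/+ one-half."""
    return wcs(index - 0.5), wcs(index + 0.5)


def bin_width(wcs: Callable[[float], float], index: int) -> float:
    """Edge-to-edge extent of the bin (one reading of the width defined above)."""
    low, high = bin_edges(wcs, index)
    return abs(high - low)


def linear_wcs(index: float) -> float:
    # Hypothetical linear wavelength axis: 4001 A at index 1, 1 A per bin.
    return 4001.0 + 1.0 * (index - 1.0)


print(bin_edges(linear_wcs, 1))  # (4000.5, 4001.5)
print(bin_width(linear_wcs, 2))  # 1.0
```

For a non-linear axis the same two evaluations apply; the bin is then simply
not symmetric about its center.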

3.2 Bin Values

Section 2.1 declares that calibrated 4DBIN data be in certain physical units
directly related to the photons and the bin sizes. The primary metadata for
the bin values is then the units. For example,

    4DBIN.VALUES.UNITS = ergs/s/cm^2/A
    4DBIN.VALUES.UNITS = photons
    4DBIN.VALUES.UNITS = Jy

The definition of the allowed units also needs to provide standards such as
calibrations to above the atmosphere.
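
As an illustration of the kind of simple physical unit conversion intended,
the Python sketch below converts between two of the example units, Jy and
ergs/s/cm^2/A, at a given wavelength. It uses the standard relation
F_lambda = F_nu c / lambda^2 with 1 Jy = 1e-23 erg/s/cm^2/Hz. The function
names are hypothetical, and evaluating the conversion at the bin center
wavelength is an assumption that is only reasonable for narrow energy bins.

```python
SPEED_OF_LIGHT_A_PER_S = 2.99792458e18  # speed of light in Angstroms per second
JY_IN_CGS = 1.0e-23                     # 1 Jy in erg/s/cm^2/Hz


def jy_to_flambda(flux_jy: float, wavelength_a: float) -> float:
    """Convert Jy to erg/s/cm^2/A at the given wavelength in Angstroms."""
    return flux_jy * JY_IN_CGS * SPEED_OF_LIGHT_A_PER_S / wavelength_a ** 2


def flambda_to_jy(flambda: float, wavelength_a: float) -> float:
    """Convert erg/s/cm^2/A to Jy at the given wavelength in Angstroms."""
    return flambda * wavelength_a ** 2 / (JY_IN_CGS * SPEED_OF_LIGHT_A_PER_S)


print(jy_to_flambda(1.0, 5000.0))
```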

When there is a significant variation in the detection of photons across an
energy bin, such as occurs with a filter in a broadband image, the
calibration must be referenced to the filter system.

    4DBIN.FILTER = Johnson(B)

Background contributions need to be described by primary metadata.

    4DBIN.VALUES.BACKGROUND = Subtracted using nearby simultaneous observations
    4DBIN.VALUES.BACKGROUND = Subtracted by CCD shuffling
    4DBIN.VALUES.BACKGROUND = None subtracted

3.4 Uncertainties

For identification purposes, such as finding sources or redshifts, and when
the magnitude of the signal is high, such as continuum shapes over decades
of energy, the uncertainties about the data bin values may not be important.
In other words, there are a number of uses for calibrated VO data that just
depend on the data units and the binning.

But for detailed measurements where detection and instrumental effects are
important, the uncertainties are a significant piece of metadata. There are
two approaches which might be provided by the data model. The more rigorous
approach would be to give statistical information about each bin (possibly
including covariances).

The statistical description of the uncertainties implicitly carries
information about exposure times, rejected data in combined observations,
variable sensitivities, and so on. Other attribute metadata may explicitly
provide the means to separate these implicit contributions to the total
uncertainties.

The other approach is to provide a functional description. This is only
really useful if the data is relatively homogeneous so that variable DQE,
bin sizes, and backgrounds are not present. A typical model describes the
variances as a function of the data values. For instance,

        V = A + B N ...

where N is the binned photon number.
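
A worked example of the functional form, truncated to the two terms written
out above, might look like the following Python sketch. The function names
are hypothetical and the coefficient values are made up for illustration.
The case A = 0, B = 1 corresponds to pure Poisson counting statistics, where
the variance equals the number of photons.

```python
import math


def variance(n_photons: float, a: float, b: float) -> float:
    """Linear variance model V = A + B N (higher-order terms omitted)."""
    return a + b * n_photons


def signal_to_noise(n_photons: float, a: float, b: float) -> float:
    """Ratio of the bin value to the square root of its modeled variance."""
    return n_photons / math.sqrt(variance(n_photons, a, b))


print(variance(100.0, 25.0, 1.0))        # 125.0
print(signal_to_noise(100.0, 0.0, 1.0))  # 10.0
```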

3.5 Attributes

This section on attributes is a catch-all for all the rest of the metadata.
This is all to be defined. However a quick list of common useful attributes
is given below.

     label/title
          a label or title provided by the observer
     object ID
          a standard object id
     instrument
          details of the telescope and instrumentation
     conditions
          information about the observing conditions
     calibrations
          details of the calibrations

     data quality
          a table of data quality indicators for:
             o uncalibrated bins due to vignetting or masking
             o poorly calibrated bins
     exposure map
          a table of effective exposure times
     exposure filter
          a table describing chopping, shuffling, sequences of combined
          exposures, etc. This is a filter function for the time
          dimension of a bin.