Ontology/Metadata/Query

Thu Mar 20 11:22:34 PST 2003

I send this now only to the Knowledge Engineering discussion group.
If after the flames die down there are still pieces left, I will
submit it to the Metadata discussion group.  If any stubs remain after
that, it can go to the VO Query Language discussion group where it
hopefully can be a basis for query development.

Let's first look at a high level model of astronomical knowledge, a
sort of ontology on which to base astronomical ontology.  After this,
we can look into a highest level data model and then see if we
can somehow tie them together so that queries for knowledge can be
transformed into queries on data via a data model.  For this paper,
I just examine the mappings between knowledge and data and do not
worry about the mechanics of when or which application takes care of
each step.

1. Ontology

1.1 Classes

Each class has a name, definition, indicative properties, and
prototypes.  Instances of a class are commonly called objects.
A definition may be held in free flowing text, but it usually includes
a set of defining properties within prescribed ranges.  There may
be other properties that are not official defining properties, but
that also typically lie within a range and so indicate possible
membership in the class.  Other classes may be subclasses or
superclasses of a given class.  There would be only one superclass for
each class if they reside in a hierarchical scheme, but allowing for
more than one permits more general and complex topical maps.  The
value range of each property of a subclass always lies within the
corresponding range of its superclass.  Prototypes are merely good,
nearby instances of the class.
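
As a rough illustration of the class description above, here is a
minimal Python sketch.  The names AstroClass, defining, indicative and
is_valid_subclass_of are hypothetical, and the numerical ranges are
made up for the example; none of this comes from an agreed vocabulary.

```python
from dataclasses import dataclass, field

Range = tuple[float, float]  # (minimum, maximum), both inclusive


@dataclass
class AstroClass:
    """Hypothetical container for one class of the ontology."""
    name: str
    definition: str
    defining: dict[str, Range] = field(default_factory=dict)    # defining properties
    indicative: dict[str, Range] = field(default_factory=dict)  # hints of membership
    prototypes: list[str] = field(default_factory=list)         # good, nearby instances
    superclasses: list["AstroClass"] = field(default_factory=list)

    def is_valid_subclass_of(self, parent: "AstroClass") -> bool:
        """True if every defining range of the parent is also declared
        here and lies inside the parent's range."""
        for prop, (p_lo, p_hi) in parent.defining.items():
            if prop not in self.defining:
                return False
            lo, hi = self.defining[prop]
            if lo < p_lo or hi > p_hi:
                return False
        return True


# Illustrative values only, not real classification criteria.
star = AstroClass("star", "self-luminous ball of gas",
                  defining={"massSolar": (0.08, 150.0)})
dwarf = AstroClass("redDwarf", "low-mass star",
                   defining={"massSolar": (0.08, 0.6)},
                   prototypes=["Proxima Centauri"],
                   superclasses=[star])
oversized = AstroClass("oversized", "deliberately inconsistent example",
                       defining={"massSolar": (100.0, 500.0)})
```

Here dwarf.is_valid_subclass_of(star) holds, while the deliberately
inconsistent class fails the check because its range spills outside
the range of star.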

1.2 Properties

Each property type has a name, definition, domain of applicability,
and range of possible values.  Instances of property types are
simply called properties.  Properties always have a name and a value
(sometimes called an argument).  A value may be a number, a string, a
numerical range, or a list of strings, numbers, or objects of a class.
The more general case of a bag of heterogeneous object types can be
avoided since there can be a separate property for each class in the
range.  The values of a property usually refer to direct measurement
values on an object or to a mathematical expression involving direct
measurement values.  Part of a property's definition may include
instructions on how to derive it from a combination of other
properties.

The domain is simply the set of classes for which the property type
is valid.  I am not sure it is absolutely necessary to actually make
such an accounting for astronomy.  If no data center responds to a
query for a specific property on a specific class, then you know all
you need to know.  However, if one provides full domain information for
a property one might prevent many fruitless, time consuming queries
from occurring.  Perhaps domains can be periodically updated from
the registry.
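
A small sketch of how domain information could head off fruitless
queries.  PropertyType, worth_querying and the hubbleStage entry are
hypothetical names chosen for the example, and treating an empty
domain as "unknown, so query anyway" is my assumption.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyType:
    """Hypothetical description of a property type."""
    name: str
    definition: str
    value_range: tuple[float, float]
    domain: set[str] = field(default_factory=set)  # class names; empty means unknown

    def worth_querying(self, class_name: str) -> bool:
        """Skip a query only when a domain is recorded and excludes the class."""
        return not self.domain or class_name in self.domain


# Illustrative entries; the range and domain are not authoritative.
hubble_stage = PropertyType("hubbleStage", "numerical morphological stage",
                            (-6.0, 11.0), domain={"galaxy"})
brightness = PropertyType("brightness", "apparent brightness", (-30.0, 40.0))
```

With this, hubble_stage.worth_querying("star") is False and no data
center needs to be contacted, while a property with no recorded
domain is always sent out.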

1.3 Modifiers

Properties may require modifiers, and these also have names and values.
An example of a modifier is withinApertureOfSize, which modifies a
brightness property or hasNeighbor.  Another example is epoch, which
states the time at which a property was measured.  Modifiers may
themselves require modifiers, as in the units of an aperture size.
Modifier values usually refer to metadata about the instrument and
about where and when it was used to make a property measurement.
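
The nesting of modifiers can be sketched as follows.  The class names
Modifier and Property and the sample values are hypothetical; only
withinApertureOfSize and epoch come from the text.

```python
from dataclasses import dataclass, field
from typing import Optional, Union

Value = Union[float, str]


@dataclass
class Modifier:
    name: str
    value: Value
    modifiers: list["Modifier"] = field(default_factory=list)  # e.g. units


@dataclass
class Property:
    name: str
    value: Value
    modifiers: list[Modifier] = field(default_factory=list)

    def modifier(self, name: str) -> Optional[Modifier]:
        """Return the first top-level modifier with this name, or None."""
        return next((m for m in self.modifiers if m.name == name), None)


# Made-up measurement: a brightness within a 5 arcsec aperture.
brightness = Property(
    "brightness", 14.2,
    modifiers=[
        Modifier("withinApertureOfSize", 5.0, [Modifier("units", "arcsec")]),
        Modifier("epoch", "1998-03-20"),
    ],
)
```

The units of the aperture are reached through the aperture modifier,
not through the property itself, which keeps them from being confused
with the units of the brightness value.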

1.4 A Property Space

Two or more properties can be combined to create a property space,
that is, an N-dimensional combination of N properties.  Rather than
expressing the range of each dimension as a simple segment on the
number line, an N-d geometric shape is required.  One way of expressing
a polygon in this space is to declare a vector with a specified order
for the N properties and then list a set of such vectors that point
to the vertices of the polygon.  More complex shapes would require
mathematical expressions describing the property shape.  Perhaps SVG
(Scalable Vector Graphics) would be helpful for describing this
as well.
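
For the two-dimensional case, the vertex-list idea can be sketched
with a standard ray-casting test.  The axis names and the vertex
values are invented for the example, and this approach does not
generalize as written to N > 2.

```python
AXES = ("colorBV", "absoluteMagV")  # hypothetical fixed ordering of the properties

# Vertices of a region, each a vector in AXES order (made-up numbers).
REGION = [(0.0, 0.0), (1.0, 0.0), (1.0, 4.0), (0.0, 4.0)]


def inside(point, polygon):
    """Ray casting: toggle a flag for every polygon edge crossed by a
    ray running from the point in the +x direction."""
    x, y = point
    is_inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                is_inside = not is_inside
    return is_inside
```

An object whose two property values form the vector (0.5, 2.0) falls
inside REGION; one at (1.5, 2.0) does not.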

1.5 Hierarchical Classification Schemes

In astronomy, there are several general sets of classes that are
arranged into hierarchical classification schemes. If we wish to
think of these in analogy to trees, then at the base of each tree is
a single object type that generalizes the scheme and every subtype of
astronomical object sits on top as branches extending from the base.
Each branch, extending from the base or extending from another branch
is called a class.  In astronomy the most prominent instance of this
is the classification of astronomical objects that has at its base
any type of object in space, astroObject.  Other examples are optical
instruments and particles (elementary particles, atoms, and molecules).
But, note well, the relationship often changes as one traverses from
one level to the next.  Some classes are related to their parent
via a subclassOf property, i.e. the properties of the child class
take a subset of the parent's values, while others are connected by
isMemberOf or isComponentOf.
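
One way to keep track of the changing relationships is to label every
link in the tree.  In this sketch the class names other than
astroObject are hypothetical, and stopping the ascent at the first
link that is not subclassOf is my own reading of the point above.

```python
# child -> (parent, relation); a strict hierarchy has one parent per class.
PARENT = {
    "star": ("astroObject", "subclassOf"),
    "stellarSystem": ("astroObject", "subclassOf"),
    "binaryStar": ("stellarSystem", "subclassOf"),
    "primaryComponent": ("binaryStar", "isComponentOf"),
}


def superclasses(name):
    """Ascend while the link is subclassOf and stop at any other
    relation, since a component need not share its parent's ranges."""
    result = []
    while name in PARENT:
        parent, relation = PARENT[name]
        if relation != "subclassOf":
            break
        result.append(parent)
        name = parent
    return result
```

So superclasses("binaryStar") gives stellarSystem and then
astroObject, while the component of a binary has no superclasses at
all in this table.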

1.6 Topical Map

A topical map is similar to a classification scheme except that a
class can have more than one parent class, thus the topology would
be considerably more complex.  A simple example of this: galaxy has
parent classes stellar system and nebula. I would suspect that property
types would require this as well.  It is probably unavoidable that
every hierarchical classification scheme turns into a topical map as
greater details are added.
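
With more than one parent allowed, ascending the map has to guard
against visiting a class twice.  A minimal sketch, using the galaxy
example from the text; the PARENTS table and the function name are
hypothetical.

```python
PARENTS = {
    "galaxy": ["stellarSystem", "nebula"],  # the example given in the text
    "stellarSystem": ["astroObject"],
    "nebula": ["astroObject"],
}


def ancestors(name):
    """All classes reachable by following parent links, each listed once."""
    seen = []
    stack = list(PARENTS.get(name, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(PARENTS.get(current, []))
    return seen
```

Here astroObject is reachable from galaxy along two paths but appears
only once in the result.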

2. Data Model

We begin (and end) with a common model of information and data:
there exist data and metadata.  Data, in the NVO context,
should refer to numerical values of the properties of astronomical
objects (real or simulated) and possibly also to instrument calibration
results.  Description and preservation of these values is, and always
has been, the primary function of data repositories in astronomy.
And, if we make this semantic distinction between data and metadata,
it becomes easier to tie astronomical knowledge to data through
class properties.  For each data value there exist several layers
of metadata (real and virtual) that help to specify the context of
the value.  Metadata elements consist of metadata terms, sometimes
referred to as keywords, and metadata values.  A metadata element can
hold a value or list of values or another element.  If we wish, we
can adopt a VO rule that the mixed case of an element having a value
and a child element is forbidden.  Then, one has terms that are used
to collect together a number of other terms, and terms that merely
hold values.  What we commonly call the items of our data centers,
the images, tables, collections, data sets, can easily be recast into
this model and then the VO designers do not need to look too hard at
the particular demarcations set by each data center for these vaguely
defined items.
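
The rule against mixed elements is easy to enforce in software.  A
sketch, in which the class name MetadataElement and the sample terms
are hypothetical:

```python
class MetadataElement:
    """A term holding either values or child elements, never both."""

    def __init__(self, term, values=None, children=None):
        values = list(values or [])
        children = list(children or [])
        if values and children:
            raise ValueError(f"element {term!r} mixes values and child elements")
        self.term = term
        self.values = values
        self.children = children

    def is_container(self):
        """True for terms that only collect other terms."""
        return bool(self.children)


# An "image" recast as a container of value-holding terms (made-up values).
image = MetadataElement("image", children=[
    MetadataElement("epoch", values=["1998-03-20"]),
    MetadataElement("filter", values=["V"]),
])
```

Whatever a data center calls an image, table or collection then
becomes just another container term in the same tree.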

A look at two common types should help to understand this.  A typical
table consists of a list of records; each record is a list of fields
that are, for the most part, property values of an object given by
the ID value, usually at the beginning of the record.  The data
are the property values and the ID value is actually a metadata
value specifying the object in study.  There is metadata for each
field explicitly giving detailed information about the name of the
property in each field plus modifiers required of the property type.
On occasion there are exceptional fields that hold metadata rather
than data, such as an epoch which modifies the measurement property.
Because queries are usually formed as constraints on coordinate
values, the data center's search tool needs to be alerted that certain
constraints are going to require looking through the tabular values.
It is therefore important that the resource take note that some of
the metadata values are in the fields of the tables.
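
A sketch of a table whose field descriptions say which columns hold
data and which hold metadata.  The "role" key, the field names and
the rows are all invented for the example.

```python
FIELDS = [
    {"name": "id", "role": "metadata"},      # identifies the object
    {"name": "brightness", "role": "data"},
    {"name": "epoch", "role": "metadata"},   # modifies the brightness measurement
]
RECORDS = [
    ("obj-1", 14.2, 1998.2),
    ("obj-2", 15.1, 2001.7),
]


def metadata_fields_in_table(fields):
    """Names a resource should advertise as metadata held in the rows."""
    return [f["name"] for f in fields if f["role"] == "metadata"]


def select(records, fields, name, predicate):
    """Scan the rows for a constraint on one named field."""
    index = [f["name"] for f in fields].index(name)
    return [r for r in records if predicate(r[index])]
```

A constraint such as epoch later than 2000 cannot be answered from
the table header alone; the search tool has to call something like
select() and walk through the rows.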

An image is a bit more deceptive.  Each cell in the image is a datum.
Where that cell lies in the sky and when the image was taken
are metadata values, but they are in a condensed form.  The headers
of the image provide the coordinate frame of reference and positioning
information so that the coordinates of the cell can be calculated by a
standard algorithm.  Nevertheless, to a high level query system,
obtaining the metadata value at that coordinate should be no different
whether the metadata for that cell is in algorithmic form in a header
or explicitly present as it would be in a table.
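
To make the point concrete, here is a sketch that assumes a FITS-like
header using the CRPIX, CRVAL and CDELT keywords and a purely linear
mapping.  Real sky projections are not linear, the header values are
made up, and the function names are hypothetical.

```python
HEADER = {
    "CRPIX1": 50.0, "CRVAL1": 150.0, "CDELT1": -0.001,
    "CRPIX2": 50.0, "CRVAL2": 2.0, "CDELT2": 0.001,
}


def cell_coordinate(header, i, j):
    """Linear approximation: value at the reference pixel plus the
    step size times the offset from the reference pixel."""
    x = header["CRVAL1"] + header["CDELT1"] * (i - header["CRPIX1"])
    y = header["CRVAL2"] + header["CDELT2"] * (j - header["CRPIX2"])
    return x, y


def coordinate_of(source, key):
    """Uniform access: a table stores the coordinate, an image computes it."""
    if source["kind"] == "table":
        return source["rows"][key]
    return cell_coordinate(source["header"], *key)
```

The caller of coordinate_of() never needs to know whether the answer
was looked up or calculated.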

Are coordinate positions metadata or data?  Are they properties
or modifiers of properties?  The answer depends on the context.
The location of an object is a property.  So if you are asking for the
location of a star there is hopefully data available on the centroid
of its brightness.  If one asks for a brightness profile, then one
is asking for brightness data and one uses the coordinate metadata
on the location of each data point to form a brightness profile.
Thus a query on brightness distributions is a property search but
the required constraints on the coordinates are modifiers.

3. Transforming Problem Statements into Data Query Language

3.1 Mappings

Keeping the discussion at this level of abstractness is helpful because
indeed one can plug in any specific metadata vocabulary and/or ontology
and it would not change anything here.  And it appears, at least from
this vantage point, that software could be developed that allows queries
for knowledge using the above high level ontological concepts and
that the connection to the data in the above data model context can
be reliably made.  Properties map into data, which reside at the top
of the metadata/data hierarchy; property modifiers map into metadata
terms close to the property elements; objects map to ID fields or
object metadata.  Classes map to class metadata, but queries that ask
to descend or ascend the hierarchy/topic map will need to be sent to
a specially prepared classification database.  If the software is
written with this level of abstraction in mind, it should be quite
easy to make modifications in metadata terminology without much fuss.
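
A sketch of what keeping the vocabulary separate from the software
might look like.  The metadata terms FLUX_V, APERTURE and DATE_OBS,
the layout of the query dictionary and the function name are all
hypothetical; swapping in another data center's vocabulary means
replacing only the table.

```python
# Hypothetical vocabulary: ontology name -> metadata term at one data center.
VOCABULARY = {
    "property": {"brightness": "FLUX_V"},
    "modifier": {"withinApertureOfSize": "APERTURE", "epoch": "DATE_OBS"},
}


def translate(query, vocabulary):
    """Rewrite a knowledge query into metadata terms.
    A name missing from the vocabulary raises KeyError."""
    return {
        "select": vocabulary["property"][query["property"]],
        "where": {vocabulary["modifier"][name]: value
                  for name, value in query["modifiers"].items()},
        "range": query["range"],
    }


knowledge_query = {
    "property": "brightness",
    "modifiers": {"withinApertureOfSize": 5.0},
    "range": (14.0, 16.0),
}
```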

A common knowledge query is to find objects that are
constrained to a given set of property values.  This then translates
into finding data values with metadata terms corresponding to the
property names or to the combination required to form the requested
property.  If a property has modifiers, it is important that both the
query and the metadata at the data centers include these as well.
And it is of course necessary that there is no mix up of modifiers
that may also be used by some other property.  If the search is
constrained by a range in some property value it would be best if
the metadata is queried for minimum/maximum or coverage information
on the property before beginning an actual hunt through the data.
Checking minimum and maximum elements within a field is straightforward,
but coverage/spectral is a bit more difficult to connect with a request
for a specific wavelength band.
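
The minimum/maximum check before a search can be sketched as a simple
interval overlap test.  The resource names and ranges are invented,
and keeping resources that publish no limits is my assumption.

```python
def ranges_overlap(query, coverage):
    """True if the closed intervals (lo, hi) share at least one point."""
    return query[0] <= coverage[1] and coverage[0] <= query[1]


def resources_to_search(resources, prop, query_range):
    """Keep resources whose advertised min/max for prop overlaps the
    query.  Resources without limits are kept, since nothing rules
    them out."""
    keep = []
    for name, coverage in resources.items():
        limits = coverage.get(prop)
        if limits is None or ranges_overlap(query_range, limits):
            keep.append(name)
    return keep


RESOURCES = {
    "catalogA": {"brightness": (8.0, 12.0)},
    "catalogB": {"brightness": (11.0, 20.0)},
    "catalogC": {},  # no min/max published
}
```

A request for brightness between 14 and 16 then goes to catalogB and
catalogC only, and catalogA is never searched.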

3.2 Quality report

Whether the user asks for it or not, it is probably a good idea for
quality information to travel with any property value.  The display
software needs to account for this and allow the user to show or hide
this information.

3.3 Multiple responses

3.3.1  Multiple property values

Several resources may respond with slightly different results.
One needs to know the user's wishes on what to do with these.  Choices
are (weighted) average, mean, mode, standard deviation, or simply
taking the best quality one.  Software MUST deal with this in some way,
even if it is to take a simple average for everything.  One nice
procedure is to show, within a table cell, all of the values and then
the requested statistic.
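
A sketch of a few of these choices, including the table cell that
shows all values followed by the statistic.  The (value, quality)
pairing, the use of quality as a weight, and the function names are
assumptions made for the example.

```python
from statistics import mean


def combine(values, method="mean"):
    """values: list of (value, quality) pairs, with quality > 0 and
    a higher number meaning better quality."""
    if method == "mean":
        return mean(v for v, _ in values)
    if method == "weighted":
        total = sum(q for _, q in values)
        return sum(v * q for v, q in values) / total
    if method == "best":
        return max(values, key=lambda pair: pair[1])[0]
    raise ValueError(f"unknown method: {method}")


def table_cell(values, method="mean"):
    """All reported values, then the requested statistic, as one string."""
    listed = ", ".join(f"{v:g}" for v, _ in values)
    return f"{listed} -> {combine(values, method):g}"


REPORTED = [(14.0, 1.0), (14.4, 3.0)]  # made-up responses from two resources
```

With these two responses the simple mean is 14.2, the quality
weighted mean is pulled toward the better measurement, and "best"
returns 14.4 outright.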

3.3.2 Multiple classification

When a query is made for the class of an object, several resources may
respond with different and perhaps contradictory classifications for
a particular object.  This could be a result of poor resolution,
observational noise, or incomplete data.  Not all multiple
classifications are contradictory since it is possible for the
property space of one class to overlap slightly with the property
space of another class.  Also, there may be incomplete observational
data so that discrimination between classes is not yet possible.

It is worthwhile to return all classes attributed to a given object
along with links to the resources.  At this time only a human can review
the observational data to assess the quality of a classification.  However,
for the sake of large statistical studies there should be a user
option to take either the most recent classification or to drop
contradictory or ambiguous classifications.  Statistics on the number
of objects dropped should be reported to the user for this mode to
be scientifically useful.
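
The two user options and the report of dropped objects can be
sketched as below.  The report layout (class name, year, resource)
and the function name are hypothetical, and this sketch treats any
disagreement as a contradiction, which is cruder than the overlapping
property space case described above.

```python
def resolve(classifications, mode="drop"):
    """classifications: {object_id: [(class_name, year, resource), ...]}.
    Mode "latest" keeps the most recent label; mode "drop" discards
    any object whose resources disagree.
    Returns (labels, number_of_objects_dropped)."""
    labels, dropped = {}, 0
    for obj, reports in classifications.items():
        names = {name for name, _, _ in reports}
        if len(names) == 1:
            labels[obj] = names.pop()
        elif mode == "latest":
            labels[obj] = max(reports, key=lambda r: r[1])[0]
        else:
            dropped += 1
    return labels, dropped


# Made-up responses from two resources, A and B.
REPORTS = {
    "obj-1": [("galaxy", 1995, "A"), ("galaxy", 2001, "B")],
    "obj-2": [("star", 1990, "A"), ("quasar", 2002, "B")],
}
```

The dropped count comes back alongside the labels so that it can be
shown to the user with the statistical result.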

3.4 Namespace differences

There are a host of subtle differences in terminology, or in its
meaning, that can exist between resources, and we politely refer to
these as namespace differences.  We know what to do when resources use
different words to mean the same thing: provide namespaced vocabularies
and then create translation software.  The more difficult problems arise
when the meanings are slightly different.  An example of this that
is often given is V_band flux.  The so-called V_band filters can
differ significantly between observers, resulting in differences in
throughput that may depend on the color of the object.

This issue can be rephrased as a question of how detailed and
complete we can make property modifiers, and whether we can provide
translation services between modified properties.  For the V_band
case, the band property modifiers need to include a specification of
the band.  Minimally, a name for the subclass of V_band filters can
be given (perhaps, OCLI/F566 or something like that).  There could be
coefficients for transformation to standard Johnson V_band of the form
(A0 + A1(B-V) + A2(B-V)^2 + A3(V-R) + A4(V-R)^2).  Finally, there
could be a table of the response curve as a function of wavelength.
The coefficients of transformation can be derived from that.
This could be a link that is attached to the value of the property.
Of course, to implement this you notice that a request for a V_band
measurement turns out to be a request for B,V,R plus transformation
coefficients etc.  Such is the price for scientifically accurate
knowledge.  With time, data centers may choose to provide photometry
pre-transformed into standardly used bands.  Then the users will
benefit from faster response time and the data center will benefit
from lighter load on their servers.
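
The polynomial above is simple to evaluate once B, V and R are in
hand.  In this sketch the coefficients are made up, the function
names are hypothetical, and I am assuming that the polynomial is an
additive correction to the instrumental V value; the text does not
say how it is applied.

```python
def correction(coeffs, b, v, r):
    """A0 + A1(B-V) + A2(B-V)^2 + A3(V-R) + A4(V-R)^2."""
    a0, a1, a2, a3, a4 = coeffs
    bv, vr = b - v, v - r
    return a0 + a1 * bv + a2 * bv**2 + a3 * vr + a4 * vr**2


def to_standard_v(coeffs, b, v, r):
    """ASSUMPTION: the polynomial is added to the instrumental V."""
    return v + correction(coeffs, b, v, r)


COEFFS = (0.1, 0.5, 0.0, -0.2, 0.0)  # illustrative values, not a real filter
```

Note that the function needs all three of B, V and R, which is the
point made above: a request for one band quietly becomes a request
for three.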

There are paths to ever improving transformation between differences
in exact meaning of data values.  However, sometimes the information has 
been lost over time and is not recoverable.  All we can do there is to
alert the user to this problem.  Perhaps we will need to color code
results (black means transformed, blue means links to transformations
are available, red means no transformation is known).