Approach to metadata for Spectral Data at the Planetary Data System (PDS)

Tue Jun 15 17:52:38 CEST 2021

On Tue, Jun 15, 2021 at 3:27 AM Markus Demleitner <
msdemlei at ari.uni-heidelberg.de> wrote:

> Hi Anne,
>
> On Mon, Jun 14, 2021 at 10:44:49AM -0400, Anne Catherine Raugh wrote:
> > The tags are documented as part of the PDS4 Information model. They are
> > part of the information model label taxonomy in what we call the
> > "Primary_Result_Summary" class. The formal definition from the
> > Information Model (IM) is here:
> >
> > https://pds.nasa.gov/datastandards/documents/dd/current/PDS4_PDS_DD_1G00.html#d5e15413
>
> This points to a ToC entry for me, which then points to
>
> https://pds.nasa.gov/datastandards/documents/dd/current/PDS4_PDS_DD_1G00.html#d5e85
>
> -- and that I find intriguing, because its terms seem to be much
> closer to what for us is the relationship between a primary dataset
> and component artefacts, which we keep in the datalink/core
> vocabulary, http://www.ivoa.net/rdf/datalink/core.  I'm tempted to
> try and match the two vocabularies to see what we might be missing.
>

Check with Baptiste Cecconi - he may have attempted that very thing
already, since we're a Planetary archive. I've been reading IVOA standards
for two weeks now and I'm still working out in my mind how the mapping for
our product might work in practice for an EPN-TAP interface. And then, of
course, there are the structure translations to understand. That's the next
set of standards on my reading list.

> Out of curiosity: Do you have statistics on how much the different
> product types are being used, both in terms of depth ("how many are
> there of each type?") and of breadth ("how many projects are offering
> products of some type?")?
>

I do not, and I am curious myself. Most of our data is still under the old
standard (called "PDS3") - little formal modeling, even less consistency,
very hard to work out what is in our own archive, and no access to the
holdings of other nodes. The PDS4 standard is the opposite - very
structured, and it would be easy to poll across nodes if we all had all our
data converted and registered. We're at least a year or two away from that.
We have over 3500 data sets to convert, each of which contains from several
hundred to several hundred thousand individual data products (a "product"
for us would be, for example, one image with its PDS4 label, bad pixel map,
quality map, and whatever other additional data unique to that image that
the label describes).

My node specializes in small bodies (I'm at the comet sub-node), so
spectroscopy is always present in at least one form for every mission we
work with, and sometimes there are multiple instruments producing spectral
data in different ranges and formats. Some of the more recent missions for
which we are the data archive are:

   - Rosetta (PSA also archives these data)
   - New Horizons
   - Deep Impact/EPOXI
   - DAWN (the NASA mission)
   - Hayabusa
   - OSIRIS-REx
   - etc.

It would be a major project to pull up even rough numbers for the data at
my own site (the comet sub-node), which is largely why it has not been
done. Once the data are migrated to PDS4 and registered, a query to the
registry would provide the numbers easily.

> Oh, and: From "Class Hierarchy" it would seem to me that your (in
> effect) vocabulary is flat; for instance, I'd have expected to see
> "Context" as somehow narrower than "Document".  Was it a conscious
> design decision that it's not?
>

Yes. As I understand it, the derivation exists in the ontology database
(the Information Model "lives" in a Protégé database - the implementation
is generated almost, if not entirely, automatically from the database), but
there was push-back from many node representatives about how complicated
the full structure looked in a label. The argument was that it was so
complicated that no user would ever use it, because they didn't want to be
bothered. So we flattened the structure, which makes it look simpler but,
in fact, makes it more complex to understand and use correctly (and also
parse and interpret). Now that people are used to looking at XML (that was
a big change from our old syntax), I think if we had that discussion again
it would end differently. So I hope we will get to revisit it for our
version 2.0.  I think it is key to discoverability across archives, not
just PDS nodes.

> Closer to what we're trying in product-type is what I'm finding in
> classes like "Array_2D_Spectrum", which also have a deep hierarchy
> (cf.
>
> https://pds.nasa.gov/datastandards/documents/dd/current/PDS4_PDS_DD_1G00.html#d5e3914
> for an example).
>
> What I couldn't quite work out there: Is there a way for a machine
> (that, I think, would treat the class names as opaque) to work out
> that Array_2D_Spectrum and Array_3D_Spectrum offer spectrally
> resolved data?  And why is there no 1D spectrum?
>

If the class name is ignored, then the signifier that a product contains
spectral data would be the presence of the Spectral_Characteristics class
in the Discipline_Area of the label. The core PDS4 namespace is extensible
by adding namespaces to provide detailed metadata in more restricted
contexts. There are a number of "Discipline Dictionaries" that are used
across nodes to provide metadata specific to a science discipline. The
Spectral Discipline Dictionary identifies spectral dimensions and defines
binning parameters.  (I maintain this dictionary, so it has a detailed user
guide on my wiki:
https://sbnwiki.astro.umd.edu/wiki/Filling_Out_the_Spectral_Dictionary_Classes.)
If a data submitter is providing spectral data (photon spectra, that is),
then the node reviewing the data should ensure that the
Spectral_Characteristics class is also included in the data labels. If this
class is present, then the data object it describes contains spectral data.
If this class is not present, then there should be no photon-spectra in the
data. (Mass spectra, time-of-flight spectra, color spectra, etc. are not
included in this dictionary.)
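
A minimal sketch of that check, for a machine that treats class names as
opaque: it parses a label as XML and reports whether any Discipline_Area
contains a Spectral_Characteristics element. The sample label is
hypothetical and heavily trimmed, and the namespace URIs in it are
assumptions; the function matches on local element names only, so it does
not depend on them.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def has_spectral_characteristics(label_xml):
    """Return True if any Discipline_Area contains Spectral_Characteristics."""
    root = ET.fromstring(label_xml)
    for elem in root.iter():
        if local_name(elem.tag) != "Discipline_Area":
            continue
        # Look anywhere below this Discipline_Area, whatever the namespace.
        for child in elem.iter():
            if local_name(child.tag) == "Spectral_Characteristics":
                return True
    return False


# Hypothetical, trimmed label for illustration only (not a valid PDS4 label).
SAMPLE_LABEL = """<Product_Observational
    xmlns="http://pds.nasa.gov/pds4/pds/v1"
    xmlns:sp="http://pds.nasa.gov/pds4/sp/v1">
  <Observation_Area>
    <Discipline_Area>
      <sp:Spectral_Characteristics/>
    </Discipline_Area>
  </Observation_Area>
</Product_Observational>"""

# Same skeleton without the spectral class.
SAMPLE_LABEL_NO_SPECTRA = """<Product_Observational>
  <Observation_Area>
    <Discipline_Area/>
  </Observation_Area>
</Product_Observational>"""
```

Note that this only tests for presence; it says nothing about which data
object in the product the class describes, which would need a closer read
of the class contents.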

There is no Array_1D_Spectrum because 1D spectra are not presented as
arrays. That looks like the single biggest problem we'll have making our
linear (that's how I think of them) spectra available through an IVOA
interface. Linear spectra are present in the archive in one of two formats:
either a table in which each row gives the measured value and all related
parameters for one spectral bin, or a table in which each row contains one
spectrum and related data for each bin. The vast majority of the time,
these tables are ASCII, not binary. And the column content varies so widely
that attempts to define standard column sets have failed. No attempt was
ever made to define standard column names, because there was no enforcement
mechanism apart from fallible (and fickle!) human review.
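
To make the two layouts concrete, here is a rough sketch that reads each
one into the same (wavelengths, values) form. Everything specific in it is
an assumption for illustration: the comma-delimited text with a header row,
the column names, and the bin centers passed in by the caller. Real tables
in the archive get their column definitions from the label, and, as noted
above, the column content varies widely.

```python
import csv
import io


def read_row_per_bin(text, wave_col, value_col):
    """Layout 1: each row describes one spectral bin of a single spectrum."""
    rows = list(csv.DictReader(io.StringIO(text)))
    waves = [float(r[wave_col]) for r in rows]
    values = [float(r[value_col]) for r in rows]
    return waves, values


def read_row_per_spectrum(text, bin_centers, id_col):
    """Layout 2: each row holds one whole spectrum, one column per bin.

    bin_centers maps a column name to that bin's center wavelength.
    Returns {spectrum id: (wavelengths, values)}.
    """
    # Order the bin columns by wavelength so every spectrum comes out sorted.
    cols = sorted(bin_centers, key=bin_centers.get)
    waves = [bin_centers[c] for c in cols]
    spectra = {}
    for row in csv.DictReader(io.StringIO(text)):
        spectra[row[id_col]] = (waves, [float(row[c]) for c in cols])
    return spectra


# Hypothetical sample tables; the column names are invented for this sketch.
PER_BIN = "wavelength,flux\n0.5,1.0\n0.6,2.0\n"
PER_SPECTRUM = "obs_id,b1,b2\nA,1.0,2.0\nB,3.0,4.0\n"
BIN_CENTERS = {"b1": 0.5, "b2": 0.6}
```

The second reader needs the bin centers from somewhere outside the table,
which is part of why a single standard column set is hard to define.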

I suspect there are some binary spectral tables in the archive I've
forgotten about, and when we get to the realm of radio data I would expect
binary rather than ASCII data. I always have to find an expert when I need
to understand what is going on in a radio data set.

> I'm asking these questions because I'd *expect* (without actual
> evidence) that "I need spectrally resolved information on X" is one
> of the more common use cases in this field of data discovery, and
> every time I think of it I reach a different opinion on whether it's
> a good idea to serve users cubes when they were presumably expecting
> to see plain ol' 1D spectra and their client programmes will just say
> "Can't open your file".  Do you have any experience with user
> expectations in one way or the other?
>

Not directly. I'm a programmer by training, rather than a researcher, so I
don't have any significant personal experience. My user community is also
different from an astrophysics community. They study small bodies - data
are very, very scarce on any particular small body, so the typical question
is "Does any such data exist?" If a researcher is lucky enough to find
something, they will then figure out how to use whatever format the data
come in. Every datum is precious.

Missions to small bodies don't have much of an effect on that, even though
they can return large amounts of data. The parameters of mission data are
well-known, and someone planning to analyze data from, say, the Alice
spectrometer on New Horizons already knows it has an oddly shaped slit and
a wavelength range in the UV and resolution that depends on which end of
the slit you're examining - or at least that's the first thing they find
out. Since no other mission has even remotely similar data on Pluto,
there's really nowhere else to go.  If you don't enter parameters that
would select the Alice spectra, you will be told no data exist.

So in the small body community, and in planetary mission data generally
(maybe not for Mars), if you are too specific in your request parameters
you are likely to find nothing because there is so little data. That seems
likely to continue for the next generation or two at least. There are a LOT
of small bodies, and it takes a long time to design and fly a mission to
gather data.  Data coverage will probably continue to be little islands in
vast oceans of nothing for some time to come.

I would expect this to be different for ground-based data as it becomes
possible to search more and more observatory holdings for small-body
images. Then there will be enough potential coverage, at least, to make
finer-grained searching a fruitful activity. For small bodies, it is
valuable to be able to go back through historic data, looking for those
precious data points.

> > It was a struggle to get metadata like this into the PDS4 Information
> > Model, because historically PDS has only ever described its data in terms
> > of its source (which instrument, which spacecraft, which mission) and its
> > target (which planet, and even non-planet targets were problematic). So I
> > view it as a start, but I hope we can do better for version 2.0.
>
> Yeah -- good metadata is data altruism, as the data creators can do
> without much of it during initial data exploitation.  So, I think
> it's never an easy sell...
>

It also takes a bit of imagination. If someone had thought at the time
"What happens 50 years from now, when NASA has flown a couple hundred
planetary missions?  How will people find data if they don't know the names
of the instruments that collected data before they were born?"  - or, in
other words, "What if this archive is successful?" - then perhaps they
would have planned for success...

Regards,

-Anne.