Datalink vocabulary additions

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Mon Jun 13 13:44:44 CEST 2016


Dear Semantics,

On Mon, Jun 06, 2016 at 06:22:17PM +0200, Mireille Louys wrote:
> at some point we will need to decide the scope of such an approach, what
> level of information we need to convey in the 'semantics' field.

I'd say this time is now, as people are deploying datalink and have
actual pieces of data to describe.

> The Provenance W3C also has some terms dedicated on the links from one
> 'document' ( called an entity) to its progenitors.

Yes, datalink by its nature is related to all data models dealing
with the more structural aspects of our data.  Once these DMs are
ready, I hope more fine-grained annotation can quite naturally be
applied to datalink documents by virtue of them being VOTables and our
VO-DML serialisation.

However, I don't think that should keep us from defining the
vocabulary of the semantics column in a way that plain datalink
clients can figure out enough to provide useful UIs.

So, from the discussion, here's what I'd suggest:

> Le 03/06/2016 à 22:28, Accomazzi, Alberto a écrit :
> >On Thu, Jun 2, 2016 at 7:38 AM, Markus Demleitner
> ><msdemlei at ari.uni-heidelberg.de>
> >wrote:
> >
> >
> >    (1) I'd like to have a term for larger chunks of metadata in separate
> >    files.  I'd need that to link to observation logs, but I could also
> >    see logs a pipeline has written, or an extensive provenance, or
> >    similar.
> >
> >    Proposed term(s): #metadata?  #documentation?  (as a child of
> >    #auxiliary, I guess)
> >
> >
> >I dislike both terms you suggest because they sound so general that they
> >could be used for most anything.  But if we have to stay general because
> >of the potentially different types of resources we need to point to, how
> >about #Documents?
> yes , I think it is very general.

Still, we need something like it, so I'd now propose the DataCite
term.  The datalink-terms.csv line would be, I think

isMetadataFor,2,Metadata,"additional documentation for this dataset (e.g., observatory logs, provenance information)"

going below auxiliary; to see the vocabulary in its current beauty,
see

https://volute.googlecode.com/svn/trunk/projects/dal/DataLink/datalink-terms/src/datalink-terms.csv
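
As a rough sketch of how such lines could be read: the following assumes
the column order term, level, label, description, which is what the
lines in this mail suggest.  The parse_terms helper and the sample text
(including the placeholder description for auxiliary) are made up for
illustration and are not part of any Volute tooling.  It also shows that
a description containing commas only stays in one piece when it is
quoted.

```python
import csv
import io

# Sample lines in the assumed layout: term, level, label, description.
# The description given for "auxiliary" is a placeholder, not the real one.
SAMPLE = '''\
auxiliary,1,Auxiliary,placeholder description
isMetadataFor,2,Metadata,"additional documentation for this dataset (e.g., observatory logs, provenance information)"
'''


def parse_terms(text):
    """Parse term lines into a list of dicts (hypothetical helper)."""
    terms = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip empty lines
        term, level, label, description = row
        terms.append({
            "term": term,
            "level": int(level),
            "label": label,
            "description": description,
        })
    return terms


terms = parse_terms(SAMPLE)
for t in terms:
    print(t["level"], t["term"], "-", t["description"])

# Without the quotes, a description with commas falls apart into
# several fields instead of one.
unquoted = next(csv.reader(
    ["isMetadataFor,2,Metadata,additional documentation (e.g., logs, provenance)"]))
print(len(unquoted), "fields instead of 4")
```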


> >    (2) I'd like to have a term for things like a rebinned (higher S/N)
> >    version of the dataset, or perhaps the data in a different
> >    waveband on a multi-band instrument, or the same observation with
> >    a different instrument setup (as in V500/COMB vs.  V1200 in
> >    Califa), etc.  Essentially: Science data that was obtained
> >    "together with" #this but that's not identical with #this.
> >
> >    Proposed term(s): #science? (but that's a bit too broad)  #alternate?
> >      (as a child of #this?)
> >
> >
> >maybe #isVariantFormOf or #isOriginalFormOf
> all three examples proposed here point to different datasets: the
> measured values have been obtained with specific settings or
> transformed from some original dataset, so to me these are
> different 'entities' in the Provenance world.  so rather

For things that are really in a single provenance chain, datalink
already has progenitor/derivation.  But at least for CALIFA, that's
not the case -- these are different measurements with (largely)
different resulting artefacts; it's just that if you're looking at
one of them, it's really likely that you want to see the... yes,
sibling, too.

> case 1 & 3 : <isDerivedFrom> as a role  and some term to qualify how it is
> derived , as a sub-category : #cutout, #regrid
> case 2:  I would propose <?siblingOf?>  a relation like "sibling",  related
> to the same observation but offering different physical properties .
> this helps to browse sister/brother datasets in the observation-dataset
> genealogy.

Sibling is an interesting concept, but I'd say the isVariantFormOf
term from DataCite captures well enough what this is about (DataCite
definition: "indicates A is a variant or different form of B, e.g.
calculated or calibrated form or different packaging"), and since
it's a term that already exists, I'd say we need a strong use case to
invent something new.  I'd hence propose, under the line for "this"
(i.e., as a child)

isVariantFormOf,2,Variant form,"the data in a different form, e.g., a different packaging"

This doesn't match my CALIFA case perfectly since, as Mireille
points out, we're really talking about different datasets here, but I
believe no user will be surprised to get these if following a
"Variant Form" link, so I'd be happy.

I'd not veto #sibling; it would admittedly be more precise.  The only
reason I'm not altogether sold is that DataCite doesn't have it, and
I'd first like to know why.

> >    (3) I'd like to have a term for a different representation of the same
> >    dataset, e.g., a spectrum that was originally a FITS image
[...]
> >#isVariantFormOf or #isOriginalFormOf
> yes, exactly same content but different representation. I agree.

Good, so (2) does the trick, and no additional terms are necessary.

> >    (4) I'd like to have a term for a previous version of a dataset.  I
> >    have that in califa, where I'd like to have *some* way to get DR1
> >    and DR2 data, but I really don't want to clutter all-VO SSA or
> >    obscore searches with these guys.  So, I'm adding links to old
> >    files (where they exist) in datalink results for new files.  This
> >    isn't really #progenitor, since the old files aren't in the
> >    provenance chain of the new files (which are generated from yet
> >    other data files).  It's... well, a previous version, and hence I'd
> >    like to see
> >
> >    Proposed term: #previous-version (as child of #auxiliary?)
> >
> >
> >we should be careful with the semantics that DataCite assigns to these but
> >#isPreviousVersionOf and #isNewVersionOf might be appropriate here
> agreed

I'd say the DataCite concepts are close enough to what VO publishers
might use this for, so I'd suggest

IsPreviousVersionOf,2,Previous version,"this dataset in a previous edition, e.g., processed with an older pipeline or as part of an older data release"
IsNewVersionOf,2,New version,"this dataset in a newer edition, e.g., processed with a newer pipeline or as part of a newer data release"

Again, I'd hang this off #this.

Hanging these off #this is perhaps not ideal; if you're skeptical,
too, I think I could see us adding a new (non-DataCite) top-level term

IsRelatedTo,1,Related dataset,"a separate dataset in some relationship to the present one while not in the provenance chain, e.g., an earlier or later version, data taken with a different instrumental setup"

#isVariantFormOf should IMHO still hang off #this.
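
To make the two placements concrete, here is a minimal sketch.  It
assumes that a level-2 line is a child of the nearest preceding level-1
line, and it reads the second option as moving the version terms below
IsRelatedTo; both are assumptions for the sake of illustration.  The
children_by_parent helper is hypothetical, and the descriptions are cut
down to "..." since only term and level matter here.

```python
import csv


def children_by_parent(lines):
    """Map each level-1 term to the level-2 terms listed after it.

    Assumption: a level-2 line belongs to the nearest preceding
    level-1 line.
    """
    tree = {}
    current = None
    for term, level, _label, _description in csv.reader(lines):
        if int(level) == 1:
            current = term
            tree[current] = []
        else:
            tree[current].append(term)
    return tree


# Option A: everything hangs off #this.
OPTION_A = [
    "this,1,This,...",
    "isVariantFormOf,2,Variant form,...",
    "IsPreviousVersionOf,2,Previous version,...",
    "IsNewVersionOf,2,New version,...",
]

# Option B: the version terms go below a new top-level IsRelatedTo,
# while isVariantFormOf stays below #this.
OPTION_B = [
    "this,1,This,...",
    "isVariantFormOf,2,Variant form,...",
    "IsRelatedTo,1,Related dataset,...",
    "IsPreviousVersionOf,2,Previous version,...",
    "IsNewVersionOf,2,New version,...",
]

for name, lines in (("A", OPTION_A), ("B", OPTION_B)):
    print("Option", name)
    for parent, children in children_by_parent(lines).items():
        print("  #" + parent, "->", ", ".join("#" + c for c in children))
```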

> >
> >
> >    That concludes the proposed concepts for this time; #fault from the
> >    original proposals I've dropped.  One other thing I'd like:
> >
> >    (5) #proc currently has "Server-side data processing result" as its
> >    explanation.  What really is in such datalink rows is, I submit,
> >    better described by "reference to a server-side processing service"
> >    -- so, can we change that explanation?
> >
> >
> Again , this processing-service is considered as an Activity in
> Provenance DM .  I think it is worth then to look also in the
> PROV-W3C ontology and see if we can combine terms .
>
> My vague understanding is that we address the same problem with
> different tools.  Probably we need to clarify the coverage of each
> on the structure side (DM) and on the semantic side (Vocabulary) .

While it's true that applying a processing service is something that
will result in an entry in a provenance description, that's quite a
bit beyond what datalink declares or even wants to declare.  The
point here simply is to say "here's a service you can use".  What it
does to the dataset's provenance structure is, I think, not up to
Datalink to define (perhaps SODA might, some day).

Either way, this is really just about a clarification of the
explanation, so here I propose replacing the current line

proc,1,Processing,server-side data processing result

with

proc,1,Processing,reference to a server-side processing service


To get to a point where such additions can be made, what's the next
step?

       -- Markus

