[Dataset] Model document update

Markus Demleitner msdemlei at ari.uni-heidelberg.de
Mon Mar 21 10:54:34 CET 2016


Dear DM,

On Fri, Mar 18, 2016 at 10:51:41AM -0400, CresitelloDittmar, Mark wrote:
> I have submitted an update to the DatasetMetadata v1.0 model document to
> the ivoa document repository.
> It is also available in volute at:
> 
> https://volute.g-vo.org/svn/trunk/projects/dm/DatasetMetadata-1.0/doc/WD-DatasetDM-1.0-20160317.pdf

Thanks for this effort -- I think factoring these common traits out
is going to help a lot in the future.  Also, and let me briefly wear
my Registry chair hat, this comes at a good time when we're about to
carefully adapt VOResource to new developments like DOIs and orcids
-- clearly much of this is reflected there, too.

But of course, I see several hairs in the soup.  Well, one is a
full-blown trunk.

(1) I don't think any STC (or "characterisation") should be embedded
here.  Because:

  (a) DatasetDM can work just fine for datasets that don't have
  (direct) STC content (think a simulated lightcurve; what STC there
  is would be heavily contrived and probably rather more confusing
  than helpful).  Keeping STC out of the DM keeps it lean and
  non-confusing for such cases, too, and improves re-usability.

  (b) In 3.2., the document says: 

    This object (characterization) may be extended and/or modified by
    specific Dataset models as needed.

  This would indicate to me that these specific Dataset models should
  be the ones to include it in the first place; it'd be very clumsy
  if they had to say (let alone actually do it in their model)
  "discard Characterisation from DatasetDM and then use this other
  thing."

Hence, I'd suggest to say that concrete Dataset DMs reference both
DatasetDM and CharDM (and then on to STC), rather than baking the
whole of the STC2 prototype into this document.  Works better, and we
won't have traces of an outdated prototype in a later REC.

Also, this would bring the document length back to somewhat more
manageable levels.

The second hair I'm finding in the soup is still substantial, but
rather a twig than a trunk:

(2) I'm always queasy about data models hanging, as it were, in
midair, without actually inducing file formats.  I'd therefore
*really* like to see one, two or three serialised instances (in
VOTables according to the mapping document), preferably in a
non-normative introduction; *every* element mentioned in the DM
should be used in at least one of these example documents.  That's
still not a guarantee that parts of the DM don't contain hidden
traps, but it's a first step.  And it gives implementors an idea what
all this is about.

The rest are minor points:

(3) I'm sure the exposition would profit if -- to the
extent possible in what's essentially a graph -- types weren't
introduced before they're used.  For instance, right now you
introduce AccessRights in 2.2, and one doesn't really understand
where this would go.  Only in 2.7.6 does one learn that this is used
in Curation.rights and then perhaps see where this fits.  Many other
attributes work the other way round, which is what I think is much
better when talking to humans.

If a more consistent sequence turns out to be impossible, there should
be references from the types to where they are used ("AccessRights
is used in Curation's rights attribute").  Or perhaps that would be a
good thing either way.

(4) Talking about Curation.rights: This now has a multiplicity 0..1.
I'm essentially happy with that, but then I'm not sure I see the
motivation for including startDate and endDate into AccessRights. If
you enter the time domain here (and I think that's over-modelling,
but that's just me), shouldn't you be able to say:

  proprietary 2013 through 2113
  free 2113 until eternity

i.e., shouldn't Curation.rights be 0..n?
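
To spell out what a 0..n Curation.rights would mean for a client, here is a minimal Python sketch.  The field names (rights, start_year, end_year), the use of plain years instead of the model's startDate and endDate, and the half-open intervals are all assumptions made for illustration; none of this is taken from the DatasetDM document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AccessRights:
    """One rights statement with a validity period (illustrative only)."""
    rights: str               # e.g. "proprietary", "free"
    start_year: int           # assumption: years stand in for startDate
    end_year: Optional[int]   # None means no end ("until eternity")


def rights_in(year: int, periods: List[AccessRights]) -> Optional[str]:
    """Return the rights term valid in `year`, or None if no period matches.

    Assumption: periods are half-open, [start_year, end_year).
    """
    for p in periods:
        if p.start_year <= year and (p.end_year is None or year < p.end_year):
            return p.rights
    return None


# The two statements from the example above, as a 0..n list
periods = [
    AccessRights("proprietary", 2013, 2113),
    AccessRights("free", 2113, None),
]
print(rights_in(2016, periods))
```

Even in this reduced form a client has to decide about interval boundaries, gaps and overlaps, which gives an idea of the complexity that comes with time-dependent rights.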

[my take: strike AccessRights and make Curation.rights point to
RightsType directly -- I don't think the potential benefit of having
this kind of thing machine-readable outweighs the cost in terms of
complexity]

(5) Curation.version -- while I appreciate independent versioning
done by the publisher is a major use case, I think this has far too
much potential for confusion.  The one case I'm aware of where
distinct publisher version works is Debian packages.  They simply
define a structure for version strings and have *one* of these. For
instance, gavodachs-server_0.9.5-8 means: This is upstream version
0.9.5, and this package is the eighth issue prepared by the packager,
presumably distinct from the previous ones by metadata of some sort.

Meaning: *If* we want this kind of thing, I'm sure we should do it in
the way proven by Debian and provide structured version numbers that
keep all information in one single string.  Anything else I'd like to
see tested in practice before we commit to any solution in a REC.
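
As a rough illustration of the single-string idea, the following Python sketch splits a Debian-style version into its upstream part and the packager's revision.  The helper name split_version is made up here, and the sketch deliberately ignores epochs and the other details of Debian's version rules; it only shows that one string can carry both pieces of information.

```python
def split_version(version):
    """Split 'UPSTREAM-REVISION' at the last hyphen.

    Hypothetical, simplified helper (no epoch handling).
    Returns (upstream, revision); revision is None when there is no hyphen.
    """
    upstream, sep, revision = version.rpartition("-")
    if not sep:
        # No hyphen at all: the whole string is the upstream version
        return version, None
    return upstream, revision


# Package name and version are separated by the underscore
name, _, version = "gavodachs-server_0.9.5-8".partition("_")
upstream, revision = split_version(version)
print(name, upstream, revision)
```

Splitting at the last hyphen means the upstream part may itself contain hyphens, while the publisher's part stays unambiguous.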

(6) I'm not happy with the inflation of places where dataset
identifiers can stand.  There's now Curation.publisherDID,
DataID.creatorDID, and DataID.datasetID.  I don't think we're doing
our users a service by multiplying the concepts here, even though I
admit that each of these has a use case.

I'd much rather see an Identifier type:

  Identifier.kind: (publisher, creator, persistent, ...)
  Identifier.form: (doi, ivoid, generic-uri,  ...)
  Identifier.value: (well, you know).

[kind and form would be open vocabularies with recommended terms
defined in the standard].

And then, you'd have a 0..n DataID.identifier attribute.

It, I claim, clarifies the relationship these things have with each
other, and the semantic tort we're exercising by pulling publisherDID
from curation (where, admittedly, it rightfully belongs) to DataID is
IMHO acceptable.
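
A minimal sketch of how such an Identifier type and a 0..n DataID.identifier attribute might look to a client, written in Python.  The class layout, the of_kind helper and the example values (which are invented placeholders, not real identifiers) are assumptions for illustration only; the actual vocabularies for kind and form would have to come from the standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Identifier:
    kind: str    # open vocabulary, e.g. "publisher", "creator", "persistent"
    form: str    # open vocabulary, e.g. "doi", "ivoid", "generic-uri"
    value: str   # the identifier itself


@dataclass
class DataID:
    # One 0..n attribute instead of publisherDID, creatorDID and datasetID
    identifier: List[Identifier] = field(default_factory=list)

    def of_kind(self, kind: str) -> List[Identifier]:
        """Return all identifiers of the given kind (hypothetical helper)."""
        return [i for i in self.identifier if i.kind == kind]


# Invented example values
data_id = DataID(identifier=[
    Identifier("publisher", "ivoid", "ivo://example.org/data?obs1"),
    Identifier("creator", "generic-uri", "urn:example:obs1"),
    Identifier("persistent", "doi", "10.0000/example"),
])
print([i.value for i in data_id.of_kind("publisher")])
```

A client interested in only one sort of identifier filters the list by kind; new kinds or forms need no change to the structure.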


(7) Publication

Here, we should be explicit about what the publication reference is.
Much as I would like the bibcode to rule supreme forever, this is
almost certainly not what is going to happen.  Either this gets a
form attribute as in (6) or we say "This should be a URI with a
scheme; use bibcode: for bibcodes, doi: for DOIs.  In a pinch,
non-URI, freetext references are ok".
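
For the second option, a client-side check could be as small as the following Python sketch.  The function name classify_reference, the set of known schemes and the example references (which are invented) are assumptions for illustration; anything without a recognised scheme is simply treated as free text.

```python
KNOWN_SCHEMES = ("bibcode", "doi")   # assumption: the schemes named above


def classify_reference(ref):
    """Return (form, value) for a publication reference.

    Hypothetical helper: 'bibcode:...' and 'doi:...' are recognised,
    everything else falls back to free text.
    """
    scheme, sep, rest = ref.partition(":")
    if sep and rest and scheme.lower() in KNOWN_SCHEMES:
        return scheme.lower(), rest
    return "freetext", ref


# Invented example references
for ref in ("doi:10.0000/example",
            "bibcode:2016Examp...1....1X",
            "Someone et al., in preparation"):
    print(classify_reference(ref))
```

Whichever option is chosen, the point is that a client can tell what sort of reference it has without guessing from the shape of the string.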

(8) ObsDataset I don't like much.  It feels like a somewhat random
collection of things that are the domain of characterization or
rather the DM that embeds the DatasetDM and things that really belong
to Provenance.  Does it really need to be in this "utility" DM?

(9) Given my experience with the extent of care that people put into
the creation of metadata (just look at what we have in the Registry
right now), I'd suggest cutting down on the inheritance tree under Party
and just having Party -- which IMHO is good enough for any use case we
have there.

So, you'd have:

Party:
  name
  email
  address
  phone
  logo

If you have to, you could have a kind attribute (person,
organisation, proxy), but I cannot really see a great use case for
that.

(10) Having said that, I think orcids will become a smash hit in the
near future if they aren't one already.  Hence, I'd add

  identifier

to the Party attributes.  The stuff on defining identifiers as in (7)
applies here, too (if we go the URI way, we should say whether we
want orcid:0000-... or http://orcid.org/0000-...)
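
To show why the choice of form matters, here is a small Python sketch of a flat Party with an identifier attribute, plus a helper that reduces the ORCID spellings to the bare number.  The class layout and the bare_orcid helper are assumptions for illustration, the https:// prefix is an addition not mentioned above, and the all-zero ORCID is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Party:
    """Flat Party without subclasses; only name is required here."""
    name: str
    email: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    logo: Optional[str] = None
    identifier: Optional[str] = None   # e.g. an ORCID in URI form


# Assumption: the spellings a client might have to accept
ORCID_PREFIXES = ("orcid:", "http://orcid.org/", "https://orcid.org/")


def bare_orcid(identifier):
    """Return the bare ORCID for a known spelling, else None."""
    for prefix in ORCID_PREFIXES:
        if identifier.startswith(prefix):
            return identifier[len(prefix):]
    return None


# Placeholder ORCID, not a real one
a = Party("A. Author", identifier="orcid:0000-0000-0000-0000")
b = Party("A. Author", identifier="http://orcid.org/0000-0000-0000-0000")
print(bare_orcid(a.identifier) == bare_orcid(b.identifier))
```

If the standard fixes one spelling, clients can compare identifier strings directly and need no such normalisation.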

Cheers,

           Markus

