[Dataset] Model document update

CresitelloDittmar, Mark mdittmar at cfa.harvard.edu
Tue Mar 29 20:18:04 CEST 2016


Markus,

Sorry for the delay in responding, and thank you for the thorough comments.


On Mon, Mar 21, 2016 at 5:54 AM, Markus Demleitner <
msdemlei at ari.uni-heidelberg.de> wrote:

> Dear DM,
>
> On Fri, Mar 18, 2016 at 10:51:41AM -0400, CresitelloDittmar, Mark wrote:
> > I have submitted an update to the DatasetMetadata v1.0 model document to
> > the ivoa document repository.
> > It is also available in volute at:
> >
> >
> > https://volute.g-vo.org/svn/trunk/projects/dm/DatasetMetadata-1.0/doc/WD-DatasetDM-1.0-20160317.pdf
>
> Thanks for this effort -- I think factoring these common traits out
> is going to help a lot in the future.  Also, and let me briefly wear
> my Registry chair hat, this comes at a good time when we're about to
> carefully adapt VOResource to new developments like DOIs and orcids
> -- clearly much of this is reflected there, too.
>
> But of course, I see several hairs in the soup.  Well, one is a
> full-blown trunk.
>
> (1) I don't think any STC (or "characterisation") should be embedded
> here.  Because:
>
>   (a) DatasetDM can work just fine for datasets that don't have
>   (direct) STC content (think a simulated lightcurve; what STC there
>   is would be heavily contrived and probably rather more confusing
>   than helpful).  Keeping STC out of the DM keeps it lean and
>   non-confusing for such cases, too, and improves re-usability.
>
>   (b) In 3.2., the document says:
>
>     This object (characterization) may be extended and/or modified by
>     specific Dataset models as needed.
>
>   This would indicate to me that these specific Dataset models should
>   be the ones to include it in the first place; it'd be very clumsy
>   if they had to say (let alone actually do it in their model)
>   "discard Characterisation from DatasetDM and then use this other
>   thing."
>
> Hence, I'd suggest to say that concrete Dataset DMs reference both
> DatasetDM and CharDM (and then on to STC), rather than baking the
> whole of the STC2 prototype into this document.  Works better, and we
> won't have traces of an outdated prototype in a later REC.
>
> Also, this would bring the document length back to somewhat more
> manageable levels.
>
>
I fully agree that STC should NOT be embedded in this document.
The basic Dataset (Section 2) has no dependence on it.
The ObsDataset extension however, does have a small dependency
on STC and makes the connection to Characterisation.  This model
has no embedded Characterisation.

The prototype STC model is included here ONLY because there is no
external document/model to point to yet, and I NEED an STC model
to support the NDCube work (which builds on ObsDataset).

Rather than putting some in here, and some in NDCube, I thought it
best to keep it all together.  Do you have another suggestion on how
to handle this dependency while STC2 is being reviewed?

re: Characterisation
Section 3 is for the ObsDataset extension, and is, therefore, one of the
specific Dataset types which is pulling in Characterisation.   Other types
(eg: SimDataset if it were cast into this framework), may or may not
pull in Characterisation, and may or may not want to extend that to
include other simulation specific characterisation.

Perhaps ObsDataset should be moved into the Observation/Experiment
package-model.  This would move all of Section 3 into Section 4 and
have the Experiment define its output dataset (ObsDataset).  The
dependencies on STC/Char would move to that package.



> The second hair I'm finding in the soup is still substantial, but
> rather a twig than a trunk:
>
>
> (2) I'm always queasy about data models hanging, as it were, in
> midair, without actually inducing file formats.  I'd therefore
> *really* like to see one, two or three serialised instances (in
> VOTables according to the mapping document), preferably in a
> non-normative introduction; *every* element mentioned in the DM
> should be used in at least one of these example documents.  That's
> still not a guarantee that parts of the DM don't contain hidden
> traps, but it's a first step.  And it gives implementors an idea what
> all this is about.
>
>
I think it is important to keep the serializations out of the model
document, but agree that we need examples.  I'm planning/working on
putting examples on the twiki page for reference.  There we can put
whatever flavor of serializations we like (FITS, VOTable, etc., using
VO-DML or other tagging), and it can evolve as the current favorite
changes with time or as serialization specs change (FITS-3.0,
VOTable 1.3).

I will require that examples be in place before this document goes to
any further stage.

I really cannot yet produce a VOTable instance according to the mapping
document because that is still in flux.  But this model does serve as a
good test bed for exercising the mapping serialization spec against a
valid vo-dml model.



> The rest are minor points:
>
> (3) I'm sure the exposition would profit if there -- to the
> extent possible in what's essentially a graph -- types weren't
> introduced before they're used.  For instance, right now you
> introduce AccessRights in 2.2, and one doesn't really understand
> where this would go.  Only in 2.7.6 one learns that this is used in
> Curation.rights and then perhaps sees where this fits.  Many other
> attributes work the other way round, which is what I think is much
> better when talking to humans.
>
> If a more consistent sequence turns out to be impossible, there should
> be references from the types to where  they are used ("AccessRights
> is used in Curation's rights attribute").  Or perhaps that would be a
> good thing either way.
>
>
I'm fairly certain an earlier version of this (or Spectral) drew a
complaint about my not sticking to a strictly alphabetical arrangement
of the objects, so I was more careful to do so here.

The expectation is that the Section diagrams show the object
tree/arrangement while the text is easy to look up alphabetically.
(The primary object is always first, the rest is alphabetical.)
Erg.. as I double-check, it seems I failed to do this in Section 4.



> (4) Talking about Curation.rights: This now has a multiplicity 0..1.
> I'm essentially happy with that, but then I'm not sure I see the
> motivation for including startDate and endDate into AccessRights. If
> you enter the time domain here (and I think that's over-modelling,
> but that's just me), shouldn't you be able to say:
>
>   proprietary 2013 through 2113
>   free 2113 until eternity
>
> i.e., shouldn't Curation.rights be 0..n?
>
> [my take: strike AccessRights and make Curation.rights point to
> RightsType directly -- I don't think the potential benefit of having
> this kind of thing machine-readable outweighs the cost in terms of
> complexity]
>
>
Curation.rights has always been singular (0..1).  Prior to this version,
it was simply an attribute of type RightsType (enum).  Most of the
changes with this version have to do with normalizing elements which
were simplified (typically to strings).  Access rights are inherently
time-dependent, so this seemed consistent with the other changes.

Retaining the multiplicity keeps the earlier expectations in place.  A
dataset is tagged with an access rights value.  A change to the access
rights constitutes a change to the dataset, generating a new version of
the dataset itself (Curation.version).

When you say 'point to RightsType directly', that would not be possible as
RightsType is a DataType.. it would be an attribute (as it was previously).
I don't follow the 'machine-readable' part of your comment.

I'm simply trying to be consistent with the level of modeling.  As I
said, this seems in line with the other changes for Instrument,
Publisher, Publication, the Party elements, etc.
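
To make the two multiplicity readings concrete, here is a minimal Python
sketch.  The class and field names are illustrative assumptions, not taken
from the model document; dates are reduced to integer years, and the
rights values "proprietary" and "free" come from the example in the
comment, not from the RightsType enumeration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AccessRights:
    """Hypothetical stand-in for the AccessRights object under discussion."""
    rights: str                       # value from the example, not the model's enum
    start_year: Optional[int] = None  # None means open-ended
    end_year: Optional[int] = None    # None means open-ended; treated as exclusive


def rights_in(year: int, periods: List[AccessRights]) -> Optional[str]:
    """Return the rights value of the first period covering `year`, if any."""
    for p in periods:
        after_start = p.start_year is None or year >= p.start_year
        before_end = p.end_year is None or year < p.end_year
        if after_start and before_end:
            return p.rights
    return None


# 0..1 reading: one rights value attached to this version of the dataset.
single = [AccessRights("proprietary", 2013, 2113)]

# 0..n reading: the whole schedule from the comment in one place.
schedule = [
    AccessRights("proprietary", 2013, 2113),
    AccessRights("free", 2113, None),
]
```

Under the 0..1 reading, the change in 2113 would be recorded by issuing a
new dataset version rather than by adding a second period.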



> (5) Curation.version -- while I appreciate independent versioning
> done by the publisher is a major use case, I think this has far too
> much potential for confusion.  The one case I'm aware of where
> distinct publisher version works is Debian packages.  They simply
> define a structure for version strings and  have *one* of these. For
> instance, gavodachs-server_0.9.5-8 means: This is upstream version
> 0.9.5, and this package is the eighth issue prepared by the packager,
> presumably distinct from the previous ones by metadata of some sort.
>
> Meaning: *If* we want this kind of thing, I'm sure we should do it in
> the way proven by Debian and provide structured version numbers that
> keep all information in one single string.  Anything else I'd like to
> see tested in practice before we commit to any solution in a REC.
>
>
I suppose we can open a discussion about doing this.  The way I see
these (DataID vs Curation) is that they are groups of Metadata which
are assigned to the Dataset by different parties.  As such, the versions
are independent.  The groups both contain an ID, Version, and Date, plus
other stuff specific to those roles.  The Creator assigns DataID; the
Curator/Publisher assigns Curation.
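
For reference, the single-string scheme from the Debian example can be
sketched as below.  This is a simplified illustration only: it splits at
the last hyphen and ignores epochs and the other details of Debian's
version rules.  The function names are made up for this sketch, and
mapping the two parts onto a creator version and a publisher revision is
an assumption, not something either model defines.

```python
from typing import Tuple


def split_version(version: str) -> Tuple[str, str]:
    """Split 'UPSTREAM-REVISION' at the last hyphen.

    Simplification: epochs ("1:...") are not handled.
    """
    upstream, sep, revision = version.rpartition("-")
    if not sep:
        # No hyphen: the whole string is the upstream version.
        return version, ""
    return upstream, revision


def join_version(upstream: str, revision: str) -> str:
    """Recombine the two parts into one version string."""
    return f"{upstream}-{revision}" if revision else upstream


# The example from the comment: upstream 0.9.5, eighth packaging issue.
upstream, revision = split_version("0.9.5-8")
```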



> (6) I'm not happy with the inflation of places where dataset
> identifiers can stand.  There's now Curation.publisherDID,
> DataID.creatorDID, and  DataID.datasetID.  I don't think we're doing
> our users a service by multiplying the concepts here, even though I
> admit that each of these have a use case.
>
> I'd much rather see an Identifier type:
>
>   Identifier.kind: (publisher, creator, persistent, ...)
>   Identifier.form: (doi, ivoid, generic-uri,  ...)
>   Identifier.value: (well, you know).
>
> [kind and form would be open vocabularies with recommended terms
> defined in the standard].
>
> And then, you'd have a 0..n DataID.identifier attribute.
>
> It, I claim, clarifies the relationship these things have with each
> other, and the semantic tort we're exercising by pulling publisherDID
> from curation (where, admittedly, it rightfully belongs) to DataID is
> IMHO acceptable.
>
>
I haven't inflated anything.  These are the same set which has been in
the prior models.  I do like the idea of using an Identifier type rather
than anyURI.  It should be more adaptable to evolving standards/forms.
I would resist the 'kind' attribute.  As I said above, these groupings
are associated with the dataset by different parties, and the
distinction is pervasive across the existing Resource documents.
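
As a point of comparison, the proposed Identifier type with a 0..n
attribute might look like the following Python sketch.  The attribute
names follow the proposal in the comment; the sample values and the
helper `of_kind` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Identifier:
    kind: str   # open vocabulary, e.g. "publisher", "creator", "persistent"
    form: str   # open vocabulary, e.g. "doi", "ivoid", "generic-uri"
    value: str


def of_kind(identifiers: List[Identifier], kind: str) -> List[Identifier]:
    """Select the identifiers of one kind from a 0..n list."""
    return [i for i in identifiers if i.kind == kind]


# Made-up identifier values, for illustration only.
ids = [
    Identifier("creator", "generic-uri", "urn:example:creator:obs-1"),
    Identifier("publisher", "ivoid", "ivo://example.org/data?obs-1"),
    Identifier("persistent", "doi", "10.1234/example"),
]

publisher_ids = of_kind(ids, "publisher")
```

Dropping `kind`, as suggested in the reply, would leave `form` and `value`
on the type and keep the creator/publisher distinction in the attribute
that holds it (DataID.creatorDID, Curation.publisherDID).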


>
> (7) Publication
>
> Here, we should be explicit about what the publication reference is.
> Much as I would like the bibcode to rule supreme forever, this is
> almost certainly not what is going to happen.  Either this gets a
> form attribute as in (6) or we say "This should be a URI with a
> scheme; use bibcode: for bibcodes, doi: for DOIs.  In a pinch,
> non-URI, freetext references are ok".
>

Isn't this what 2.9.1 says?  Is there specific language you'd like changed
there?
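
The rule quoted in (7) can be read as the following Python sketch.  It is
an interpretation of the comment, not of the wording in 2.9.1; the set of
recognised schemes and the sample references are assumptions made for
illustration.

```python
from typing import Tuple

# Schemes named in the comment; any other input is treated as free text.
KNOWN_SCHEMES = {"bibcode", "doi"}


def classify_reference(ref: str) -> Tuple[str, str]:
    """Return (form, value) for a publication reference string."""
    scheme, sep, rest = ref.partition(":")
    if sep and rest and scheme.lower() in KNOWN_SCHEMES:
        return scheme.lower(), rest
    # "In a pinch": anything else is kept as a free-text reference.
    return "freetext", ref


# Placeholder references, not real publications.
examples = [
    classify_reference("doi:10.1234/example"),
    classify_reference("bibcode:EXAMPLE"),
    classify_reference("Private communication, 2016"),
]
```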



> (8) ObsDataset I don't like much.  It feels like a somewhat random
> collection of things that are the domain of characterization or
> rather the DM that embeds the DatasetDM and things that really belong
> to Provenance.  Does it really need to be in this "utility" DM?
>
>
Well, ObsDataset is a not-quite-random collection of metadata associated
with a Dataset produced by an Observation.  Like I said above, I'm
thinking this should maybe shift over to the Observation/Experiment.

This work separates previously mashed together information, so sorting
out where each of the different pieces should go is important.  The
problem is that the separate pieces aren't yet properly modeled.
One of the next steps will be to work the Observation section into its
own model, folding in the Provenance pattern.



> (9) Given my experience with the extent of care that people put into
> the creation of metadata (just look at what we have in the Registry
> right now), I'd suggest cutting down on the inheritance tree under
> Party and just have Party -- which IMHO is good enough for any use
> case we have there.
>
> So, you'd have:
>
> Party:
>   name
>   email
>   address
>   phone
>   logo
>
> If you have to, you could have a kind attribute (person,
> organisation, proxy), but I cannot really see a great usecase for
> that.
>

I'm flexible about the depth of the Party model.  The arrangement in
the doc is based on the reading I did of typical usage.  I like the ability
to separate People from Places for various roles, but would be fine
with the simple version.

Other opinions?


> (10) Having said that, I think orcids will become a smash hit in the
> near future if they aren't one already.  Hence, I'd add
>
>   identifier
>
> to the Party attributes.  The stuff on defining identifiers as in (7)
> applies here, too (if we go the URI way, we should say whether we
> want orcid:0000-... or http://orcid.org/0000-...)
>
>
Can you elaborate?
Having an ID at the Party level could be confusing, as an individual
(me/you) could have different IDs depending on the Role we are playing
at the time.  That is why I left them up at the Role extensions
(Publisher.publisherID).
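
On the question of which written form to use, a small Python sketch shows
how the two spellings mentioned in (10) could be reduced to one bare ID.
The function name and the sample ID are made up for illustration; the
pattern only checks the four-groups-of-four layout and does not verify the
check character.

```python
import re
from typing import Optional

# Accepts a bare ID, "orcid:<id>", or "http(s)://orcid.org/<id>".
_ORCID = re.compile(
    r"^(?:orcid:|https?://orcid\.org/)?(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$"
)


def bare_orcid(text: str) -> Optional[str]:
    """Return the bare 16-character ORCID iD, or None if the form is not recognised."""
    m = _ORCID.match(text.strip())
    return m.group(1) if m else None


# Made-up iD used only to exercise the pattern.
a = bare_orcid("orcid:0000-0001-2345-6789")
b = bare_orcid("http://orcid.org/0000-0001-2345-6789")
```

Both spellings reduce to the same string, so whichever form the standard
picks, comparison can be done on the bare ID.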

Mark