<div dir="ltr"><div>Markus,<br><br></div>Sorry for the delay in responding, and thank you for the thorough comments.<br><br><div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 21, 2016 at 5:54 AM, Markus Demleitner <span dir="ltr">&lt;<a href="mailto:msdemlei@ari.uni-heidelberg.de" target="_blank">msdemlei@ari.uni-heidelberg.de</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Dear DM,<br>
<span class=""><br>
On Fri, Mar 18, 2016 at 10:51:41AM -0400, CresitelloDittmar, Mark wrote:<br>
> I have submitted an update to the DatasetMetadata v1.0 model document to<br>
> the ivoa document repository.<br>
> It is also available in volute at:<br>
><br>
> <a href="https://volute.g-vo.org/svn/trunk/projects/dm/DatasetMetadata-1.0/doc/WD-DatasetDM-1.0-20160317.pdf" rel="noreferrer" target="_blank">https://volute.g-vo.org/svn/trunk/projects/dm/DatasetMetadata-1.0/doc/WD-DatasetDM-1.0-20160317.pdf</a><br>
<br>
</span>Thanks for this effort -- I think factoring these common traits out<br>
is going to help a lot in the future. Also, and let me briefly wear<br>
my Registry chair hat, this comes at a good time when we're about to<br>
carefully adapt VOResource to new developments like DOIs and orcids<br>
-- clearly much of this is reflected there, too.<br>
<br>
But of course, I see several hairs in the soup. Well, one is a<br>
full-blown trunk.<br>
<br>
(1) I don't think any STC (or "characterisation") should be embedded<br>
here. Because:<br>
<br>
(a) DatasetDM can work just fine for datasets that don't have<br>
(direct) STC content (think a simulated lightcurve; what STC there<br>
is would be heavily contrived and probably rather more confusing<br>
than helpful). Keeping STC out of the DM keeps it lean and<br>
non-confusing for such cases, too, and improves re-usability.<br>
</blockquote><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> (b) In 3.2., the document says:<br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
This object (characterization) may be extended and/or modified by<br>
specific Dataset models as needed.<br>
<br>
This would indicate to me that these specific Dataset models should<br>
be the ones to include it in the first place; it'd be very clumsy<br>
if they had to say (let alone actually do it in their model)<br>
"discard Characterisation from DatasetDM and then use this other<br>
thing."<br>
<br>
Hence, I'd suggest to say that concrete Dataset DMs reference both<br>
DatasetDM and CharDM (and then on to STC), rather than baking the<br>
whole of the STC2 prototype into this document. Works better, and we<br>
won't have traces of an outdated prototype in a later REC.<br>
<br>
Also, this would bring the document length back to somewhat more<br>
manageable levels.<br>
<br></blockquote><div><div><br>I fully agree that STC should NOT be embedded in this document.<br>The basic Dataset (Section 2) has no dependence on it.<br>The ObsDataset extension, however, does have a small dependency <br>on STC and makes the connection to Characterisation. This model<br></div><div>has no embedded Characterisation.<br></div><div><br></div><div>The prototype STC model is included here ONLY because there is no<br></div><div>external document/model to point to yet, and I NEED an STC model <br></div><div>to support the NDCube work (which builds on ObsDataset).<br></div><div><br></div><div>Rather than putting some in here, and some in NDCube, I thought it<br></div><div>best to keep it all together. Do you have another suggestion on how<br></div><div>to handle this dependency while STC2 is being reviewed?<br></div><div><br></div><div>re: Characterisation<br></div><div>Section 3 is for the ObsDataset extension, and is, therefore, one of the<br></div><div>specific Dataset types which is pulling in Characterisation. Other types<br></div><div>(e.g., SimDataset, if it were cast into this framework) may or may not <br></div><div>pull in Characterisation, and may or may not want to extend that to <br></div><div>include other simulation-specific characterisation.<br></div>
<br></div><div>Perhaps ObsDataset should be moved into the Observation/Experiment package-model.<br></div><div>This would move all of Section 3 into Section 4, and have the Experiment defining<br></div><div>its output dataset (ObsDataset). The dependencies on STC/Char would move<br></div><div>to that package.<br></div><div><br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
The second hair I'm finding in the soup is still substantial, but<br>
rather a twig than a trunk:<br>
<br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
(2) I'm always queasy about data models hanging, as it were, in<br>
midair, without actually inducing file formats. I'd therefore<br>
*really* like to see one, two or three serialised instances (in<br>
VOTables according to the mapping document), preferably in a<br>
non-normative introduction; *every* element mentioned in the DM<br>
should be used in at least one of these example documents. That's<br>
still not a guarantee that parts of the DM don't contain hidden<br>
traps, but it's a first step. And it gives implementors an idea what<br>
all this is about.<br>
<br></blockquote><div><br></div><div>I think it is important to keep the serializations out of the model document, but<br></div><div>agree that we need examples. I'm planning/working on putting examples on the<br></div><div>twiki page for reference. There we can put whatever flavor of serializations we <br></div><div>like (FITS, VOTable, etc., using VO-DML or other tagging) and it can evolve<br></div><div>as the current favorite changes with time or as serialization specs change (FITS-3.0, VOTable 1.3).<br><br></div><div>I will be requiring that I have examples in place prior to this document going to<br></div><div>any further stage.<br></div><div><br></div><div>I really cannot yet produce a VOTable instance according to the mapping document<br></div><div>because that is still in flux. But this model does serve as a good test bed for <br></div><div>exercising the mapping serialization spec against a valid vo-dml model.<br></div><div><br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
The rest are minor points:<br>
<br>
(3) I'm sure the exposition would profit if there -- to the<br>
extent possible in what's essentially a graph -- types weren't<br>
introduced before they're used. For instance, right now you<br>
introduce AccessRights in 2.2, and one doesn't really understand<br>
where this would go. Only in 2.7.6 one learns that this is used in<br>
Curation.rights and then perhaps sees where this fits. Many other<br>
attributes work the other way round, which is what I think is much<br>
better when talking to humans.<br>
<br>
If a more consistent sequence turns out to be impossible, there should<br>
be references from the types to where they are used ("AccessRights<br>
is used in Curation's rights attribute"). Or perhaps that would be a<br>
good thing either way.<br>
<br></blockquote><div><br></div><div>I'm fairly certain an earlier version of this (or Spectral) had an issue raised about my<br></div><div>not sticking to a strictly alphabetical arrangement of the objects, so I was<br></div><div>more careful to do so here.<br><br></div><div>The expectation is that the Section diagrams show the object tree/arrangement<br></div><div>while the text is easy to look up alphabetically. (The primary object is always<br></div><div>first, the rest is alphabetical.) Erg, as I double-check, it seems I failed to <br></div><div>do this in Section 4.<br></div><div><br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
(4) Talking about Curation.rights: This now has a multiplicity 0..1.<br>
I'm essentially happy with that, but then I'm not sure I see the<br>
motivation for including startDate and endDate into AccessRights. If<br>
you enter the time domain here (and I think that's over-modelling,<br>
but that's just me), shouldn't you be able to say:<br>
<br>
proprietary 2013 through 2113<br>
free 2113 until eternity<br>
<br>
i.e., shouldn't Curation.rights be 0..n?<br>
<br>
[my take: strike AccessRights and make Curation.rights point to<br>
RightsType directly -- I don't think the potential benefit of having<br>
this kind of thing machine-readable outweighs the cost in terms of<br>
complexity]<br>
<br></blockquote><div><br></div><div>Curation.rights has always been singular (0..1). Prior to this version, it was <br></div><div>simply an attribute of type RightsType (enum). Most of the changes with <br></div><div>this version have to do with normalizing elements which were simplified <br></div><div>(typically to strings). Access rights are inherently time-dependent, so <br></div><div>this seemed consistent with the other changes.<br><br></div><div>Retaining the multiplicity keeps the earlier expectations in place. A dataset<br>is tagged with an access rights value. A change to the access rights<br>constitutes a change to the dataset, generating a new version of the <br>dataset itself (Curation.version).<br><br></div><div>When you say 'point to RightsType directly', that would not be possible as<br></div><div>RightsType is a DataType; it would be an attribute (as it was previously).<br></div><div>I don't follow the 'machine-readable' part of your comment.<br></div><div><br></div><div>I'm simply trying to be consistent with the level of modeling. As I said, this<br></div><div>seems in line with the other changes for Instrument, Publisher, Publication, <br>the Party elements, etc.<br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
(5) Curation.version -- while I appreciate independent versioning<br>
done by the publisher is a major use case, I think this has far too<br>
much potential for confusion. The one case I'm aware of where<br>
distinct publisher version works is Debian packages. They simply<br>
define a structure for version strings and have *one* of these. For<br>
instance, gavodachs-server_0.9.5-8 means: This is upstream version<br>
0.9.5, and this package is the eighth issue prepared by the packager,<br>
presumably distinct from the previous ones by metadata of some sort.<br>
<br>
Meaning: *If* we want this kind of thing, I'm sure we should do it in<br>
the way proven by Debian and provide structured version numbers that<br>
keep all information in one single string. Anything else I'd like to<br>
see tested in practice before we commit to any solution in a REC.<br>
<br></blockquote><div><br></div><div>I suppose we can open a discussion about doing this. The way I see<br></div><div>these (DataID vs Curation) is that they are groups of Metadata which <br></div><div>are assigned to the Dataset by different parties. As such, the versions<br></div><div>are independent. The groups both contain an ID, Version, Date plus <br></div><div>other stuff specific to those roles. The Creator assigns DataID; the Curator/Publisher<br></div><div>assigns Curation. <br><br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
(6) I'm not happy with the inflation of places where dataset<br>
identifiers can stand. There's now Curation.publisherDID,<br>
DataID.creatorDID, and DataID.datasetID. I don't think we're doing<br>
our users a service by multiplying the concepts here, even though I<br>
admit that each of these has a use case.<br>
<br>
I'd much rather see an Identifier type:<br>
<br>
Identifier.kind: (publisher, creator, persistent, ...)<br>
Identifier.form: (doi, ivoid, generic-uri, ...)<br>
Identifier.value: (well, you know).<br>
<br>
[kind and form would be open vocabularies with recommended terms<br>
defined in the standard].<br>
<br>
And then, you'd have a 0..n DataID.identifier attribute.<br>
<br>
It, I claim, clarifies the relationship these things have with each<br>
other, and the semantic tort we're exercising by pulling publisherDID<br>
from curation (where, admittedly, it rightfully belongs) to DataID is<br>
IMHO acceptable.<br>
<br></blockquote><div><br></div><div>I haven't inflated anything. These are the same set which has been in<br></div><div>the prior models. I do like the idea of using an Identifier type rather than<br></div><div>anyURI. Should be more adaptable to evolving standards/forms. I would<br></div><div>resist the 'kind' attribute. As I said above, these groupings are associated<br></div><div>with the dataset by different parties and the distinction is pervasive across<br></div><div>the existing Resource documents.<br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
(7) Publication<br>
<br>
Here, we should be explicit about what the publication reference is.<br>
Much as I would like the bibcode to rule supreme forever, this is<br>
almost certainly not what is going to happen. Either this gets a<br>
form attribute as in (6) or we say "This should be a URI with a<br>
scheme; use bibcode: for bibcodes, doi: for DOIs. In a pinch,<br>
non-URI, freetext references are ok".<br></blockquote><div><br></div><div>Isn't this what 2.9.1 says? Is there specific language you'd like changed there?<br></div><div><br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
(8) ObsDataset I don't like much. It feels like a somewhat random<br>
collection of things that are the domain of characterization or<br>
rather the DM that embeds the DatasetDM and things that really belong<br>
to Provenance. Does it really need to be in this "utility" DM?<br>
<br></blockquote><div><br></div>Well, ObsDataset is not quite a random collection: it is the metadata associated with<br></div><div class="gmail_quote">a Dataset produced by an Observation. Like I said above, I'm thinking this <br></div><div class="gmail_quote">should perhaps shift over to the Observation/Experiment. <br></div><div class="gmail_quote"><br></div><div class="gmail_quote">This work separates previously mashed-together information, so sorting <br></div><div class="gmail_quote">out where each of the different pieces should go is important. The <br></div><div class="gmail_quote">problem is that the separate pieces aren't yet properly modeled.</div><div class="gmail_quote">One of the next steps will be to work the Observation section into its <br></div><div class="gmail_quote">own model, folding in the Provenance pattern.<br></div><div class="gmail_quote"><br><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
(9) Given my experience with the extent of care that people put into<br>
the creation of metadata (just look at what we have in the Registry<br>
right now), I'd suggest cutting down on the inheritance tree under Party<br>
and just have Party -- which IMHO is good enough for any use case we<br>
have there.<br>
<br>
So, you'd have:<br>
<br>
Party:<br>
name<br>
email<br>
address<br>
phone<br>
logo<br>
<br>
If you have to, you could have a kind attribute (person,<br>
organisation, proxy), but I cannot really see a great usecase for<br>
that.<br></blockquote><div><br></div><div>I'm flexible about the depth of the Party model. The arrangement in<br></div><div>the doc is based on the reading I did of typical usage. I like the ability<br></div><div>to separate People from Places for various roles, but would be fine<br></div><div>with the simple version.<br><br></div><div>Other opinions?<br></div><div> <br><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
(10) Having said that, I think orcids will become a smash hit in the<br>
near future if they aren't one already. Hence, I'd add<br>
<br>
identifier<br>
<br>
to the Party attributes. The stuff on defining identifiers as in (7)<br>
applies here, too (if we go the URI way, we should say whether we<br>
want orcid:0000-... or <a href="http://orcid.org/0000-.." rel="noreferrer" target="_blank">http://orcid.org/0000-..</a>.)<br>
<br></blockquote><div><br></div><div>Can you elaborate?<br></div><div>Having an ID at the Party level could be confusing, as an individual (me/you)<br></div><div>could have a different ID depending on the Role we are playing at the time.<br></div><div>That is why I left them up at the Role extensions (Publisher.publisherID).<br></div><div><br></div>Mark<br><br><br></div></div></div></div></div></div>