Comments on the Simulation Data Model

Fri Mar 13 05:30:31 PDT 2009

Hi Rick 

> > ...
> > In Volute I have added a denormalised version of the UML diagram, 
> > under 
> > http://volute.googlecode.com/svn/trunk/projects/theory/snapdm/input/
> > SimDB_DM
> > _denormalised.xml.
> > I have mainly simplified it by turning references from 
> experiment to 
> > protocol into attributes with the name of the referenced object. In 
> > relational databases this is not so important, but for XML 
> it makes a 
> > huge difference if we do not have to model the referencing using 
> > IVOIdentifier-s for example.
> 
> I agree, a reference to a name attribute is much easier to 
> implement, and the namespace of the referenced instance is 
> usually clear. (E.g., if I reference the input parameter 
> CosmologyComovingBoxSize of an Enzo simulation, I know 
> exactly which parameter that is.) However, publisherDID may 
> be preferable to name; publisherDIDs can be typed to be 
> URL-like fields, which are less likely to get mangled than a name.
> 
Actually, in the XML schemas we already have to map UML references to an
element or attribute that stores some value identifying the referenced
object.
We have had various ideas about how to do this, but it remains one of the
main issues to discuss with the WG representatives.
One idea is to use the "ivoId" of the referenced object. This ivoId is
supposed to be assigned by the SimDB instance,
possibly using a UTYPE+key value (at least this is how our reference VO-URP
application does it).
To assign this to a referencING object one needs to retrieve the referencED
object from the SimDB and extract the ivoId.
This is bothersome.
Alternatively, we could also allow the use of the "publisherDID" of the
referencED object.
This attribute is assumed to be associated with each object that is being
published, including contained objects!
The only difference is that the publisher assigns these IDs, which must
still be based on valid IVO identifiers
and must be unique. The referencED object must still be retrieved from the
SimDB.

Note that whichever of the two solutions we choose for mapping references
(possibly we allow both), we are still MAPPING REFERENCES! That means that we
still keep the references in the logical model.

This is different from my suggestion of using "name" attributes on classes
such as ParameterSetting or
ExperimentProperty. There we make use of our knowledge of the domain model
to ensure that the names can be uniquely interpreted. For example we use the
fact that parameter names must be unique in the context of a protocol, to
infer that
a parameter setting only needs to state the name of the parameter to
uniquely identify it *within the context of the protocol that the experiment
is referring to*. So the reference is removed and in its place comes an
attribute that must be interpreted in the context of its containing objects.
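
As a minimal sketch of this idea (not the SimDB or VO-URP implementation), the
following Python fragment resolves parameter settings by name within the
context of the experiment's protocol. The class layout, the "real" datatype
label and the value 64.0 are assumptions made purely for illustration; only
the parameter name CosmologyComovingBoxSize and the Enzo example come from the
discussion above.

```python
from dataclasses import dataclass, field


@dataclass
class Protocol:
    name: str
    # Parameter names are unique within one protocol, so a dict keyed by
    # name is enough. The datatype strings are hypothetical.
    parameters: dict = field(default_factory=dict)


@dataclass
class Experiment:
    protocol: Protocol
    # Denormalised settings: parameter name -> value, no object references.
    settings: dict = field(default_factory=dict)


def resolve_settings(experiment):
    """Return (name, datatype, value) triples, failing on unknown names."""
    resolved = []
    for name, value in experiment.settings.items():
        if name not in experiment.protocol.parameters:
            raise KeyError(
                f"{name!r} is not a parameter of {experiment.protocol.name!r}"
            )
        resolved.append((name, experiment.protocol.parameters[name], value))
    return resolved


# Illustrative data only.
enzo = Protocol("Enzo", {"CosmologyComovingBoxSize": "real"})
run = Experiment(enzo, {"CosmologyComovingBoxSize": 64.0})
print(resolve_settings(run))
```

The only lookup that still needs an identifier is the one that finds the
protocol itself; everything below it is interpreted by name.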

This choice for a name attribute would seem simpler for publishers than
either the ivoId or the publisherDID.
It is closer to the actual task: "I gave the parameter with this name, that
value". An XML document can be easily generated from input parameter files
for example, without requiring a lookup on the remote SimDB, EXCEPT for the
ivoId (or publisherDID) of the protocol itself.

For a given model such as the SimDB domain model such transformations can be
made explicit in the derivation of the logical model. I have created XSLT
scripts that transform back and forth between the XML versions of the
denormalised, logical model "with attributes" and the normalised domain
model "with references". In a relational database context one could write
SQL queries to do the equivalent transformation between table schemas
derived from the two models.
Only in UML is this not easy to do.
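
To illustrate what such a back-and-forth transformation does, here is a rough
Python sketch using the standard xml.etree.ElementTree module. It is a
stand-in for the idea, not the XSLT scripts mentioned above, and the element
names (experiment, parameterSetting, parameterRef, name, value) as well as the
identifier "par-1" are hypothetical, not taken from the SimDB schemas.

```python
import xml.etree.ElementTree as ET


def swap_child(root, parent_tag, old_tag, new_tag, mapping):
    """In every <parent_tag>, replace <old_tag>key</old_tag> by
    <new_tag>mapping[key]</new_tag>, keeping the child's position."""
    for parent in list(root.iter(parent_tag)):
        old = parent.find(old_tag)
        if old is None:
            continue
        new = ET.Element(new_tag)
        new.text = mapping[old.text]
        new.tail = old.tail
        position = list(parent).index(old)
        parent.remove(old)
        parent.insert(position, new)
    return root


# Hypothetical lookup tables; in practice these would come from the protocol.
NAMES_BY_ID = {"par-1": "CosmologyComovingBoxSize"}
IDS_BY_NAME = {name: ident for ident, name in NAMES_BY_ID.items()}

doc = ET.fromstring(
    "<experiment><parameterSetting>"
    "<parameterRef>par-1</parameterRef><value>64.0</value>"
    "</parameterSetting></experiment>"
)
with_references = ET.tostring(doc, encoding="unicode")

# Normalised "with references" -> denormalised "with attributes".
swap_child(doc, "parameterSetting", "parameterRef", "name", NAMES_BY_ID)
with_names = ET.tostring(doc, encoding="unicode")

# And back again.
swap_child(doc, "parameterSetting", "name", "parameterRef", IDS_BY_NAME)
round_trip = ET.tostring(doc, encoding="unicode")
```

The round trip only works because the name-to-identifier mapping is unique
within one protocol, which is exactly the domain knowledge the denormalised
model relies on.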

So to me there are actually two issues here, one of which is complex:
1) How do we map references that remain in the model to XML and TAP?
2) Should we remove some of the references from the model
and change them into appropriate "name" attributes?

> 
> >> Here they are, in a semi-dependent order:
> >>
> >> 1) Remove the Simulator, PostProcessing, and ClusterFinder classes.
> >>
> > The problem with using an attribute to indicate the type is that 
> > particular structural constraints become harder to express. For 
> > example we might want to insist that for some protocol to be 
> > classified as a Simulator it must at least have 1 Physics 
> object in 
> > its collection. This is easy to express using the 
> cardinality property 
> > on composition relations, but requires some constraint expression 
> > language when using attributes.
> > The PostProcessing type that we had was meant to include protocols 
> > that *do not* add new physics.
> > Again, this is simply expressed now.
> 
> You have hit on one of my motivations with the word 
> "constrained". My recent experiences have led me to suggest 
> that we make the model both less complex, and less 
> constrained. Once the subclasses of PostProcessor and 
> PostProcessing were eliminated (save for halo finding), much 
> of our data (extractions and projections), no longer fit the 
> model. I'm then left with a choice of using the Simulator and 
> Simulation classes as dumping grounds for the data that 
> doesn't fit, or relaxing the model.
> 
> Since the goal is to enable myself and others to publish as 
> much data as possible, I believe a broader definition of 
> Experiment and Protocol is preferable over eliminating some 
> data because it doesn't fit the model. And, since these 
> classes don't describe any functionality (i.e., methods), 
> then it's safer to push the classification of the Protocol to 
> an attribute. Kind of like sticking a label that says "Dog" 
> on a particular "Mammal". Currently, we have a model that has 
> the labels "Dog" and "Cat", and I'm trying to add some new 
> labels, so long as we can describe any "Mammal" in sufficient detail.
> 
It was never my intention to exclude from the model any interesting
experiments/protocols that belong in SimDB.
ClusterFinder was left as the only subclass of postprocessing after removing
the others, as a proposal more than anything else.
There should be a kind of OtherProtocol or CustomProtocol as a place to
describe the protocols that are not covered by any of the currently
classified *concrete* subclasses of Protocol. Similarly for Experiments.
We did discuss having the CustomProtocol, or CustomPostprocessing, at some
point, but I cannot find the result of that in the model at the moment.

> > The main question for the domain model is whether the real 
> world can 
> > be this clearly demarcated.
> > For the logical model we need to decide whether it is 
> easier to work 
> > with one or the other.
> >
> > In the TAP interface to SimDB the difference would be that in the 
> > "typed"
> > version one can pose the query
> >
> > select * from Simulator where ...
> >
> > The alternative is to write
> >
> > select * from Protocol where type="simulator" and ...
> >
> > The latter query btw can still be asked as well, as 
> Protocol is also 
> > queriable and there will be a type column (though currently 
> generated 
> > as DTYPE by the code generator I created with Laurent).
> >
> > In the XML representation a Simulator would be represented by an 
> > element of name <aSimulator> ....
> >
> > The alternative is to have (using always elements for attribute
> > mappings)
> >
> > <aProtocol>
> > <type>simulator</type>
> > ...
> 
> Likewise, you can do a join on Physics and Protocol, and only 
> select Protocols that have associated Physics.
Of course. 
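
For concreteness, the three query styles discussed above can be compared in a
small sqlite3 sketch. This is only an illustration under assumptions: the
table layout, the Simulator view, and the rows ('HaloFinder', 'gravity') are
invented for the example, and a real TAP service may map the subclasses to
tables differently. The DTYPE column name follows the code generator mentioned
above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Protocol (id INTEGER PRIMARY KEY, name TEXT, DTYPE TEXT);
    CREATE TABLE Physics (
        id INTEGER PRIMARY KEY,
        protocol_id INTEGER REFERENCES Protocol(id),
        name TEXT
    );
    -- Assumption: the "typed" table is modelled here as a view on Protocol.
    CREATE VIEW Simulator AS
        SELECT * FROM Protocol WHERE DTYPE = 'simulator';
    INSERT INTO Protocol VALUES (1, 'Enzo', 'simulator');
    INSERT INTO Protocol VALUES (2, 'HaloFinder', 'clusterfinder');
    INSERT INTO Physics VALUES (1, 1, 'gravity');
    """
)

# "Typed" version: query the subclass directly.
typed = conn.execute("SELECT name FROM Simulator").fetchall()

# Attribute version: filter the generic Protocol table on its type column.
by_type = conn.execute(
    "SELECT name FROM Protocol WHERE DTYPE = 'simulator'"
).fetchall()

# Join version: protocols that have at least one associated Physics row.
with_physics = conn.execute(
    "SELECT DISTINCT p.name FROM Protocol p "
    "JOIN Physics ph ON ph.protocol_id = p.id"
).fetchall()

print(typed, by_type, with_physics)
```

All three return the same protocol for this toy data. The difference is where
the rule "a Simulator has Physics" lives: in the schema for the typed version,
or in each query for the other two.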

The main question we need to resolve is how explicit we want the model to be,
or how precise we want to make our taxonomy.

For one, only having Experiment and Protocol does not sit well with similar
models, for observations for example.
Those also have protocols and experiments, but we would like to be able to
make a distinction between those and the experiments in SimDB. For me the
distinction between adding/simulating new physics as opposed to "merely"
processing existing results is sufficient reason to add at least Simulation
and PostProcessing as concrete subclasses of experiment, with corresponding
subclasses of protocol. We may not need to subclass the PostProcessing
further. It gives a more natural place to hang the physics than on generic
"protocol".

To decide on this, it may be useful to have a list of concrete
examples of protocols.
I have started one on
http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/IVOATheorySimDBDM
Everybody, please add to this, and especially add examples of things
different from simulators (including sam-s) and cluster finders.

> 
> >> 2) Similarly, remove the Simulation, PostProcessing, and 
> >> ClusterDetection. The type of these experiments is defined by the 
> >> type of protocol they are created with. Again, the AppliedPhysics 
> >> collection can be moved up the Experiment class, along with the 
> >> reference to the protocol, and the execution time.
> >>
> > The decision on this will have to follow the decision about the 
> > previous suggestion.
> >
> >> 3) Remove ExperimentRepresentationObject and 
> ExperimentProperty. I've 
> >> brought these up before, and I still think they are being used to 
> > ...
> particles are 
> > named x,y,z whether star, gas or dark matter particles.
> 
> I'm willing to keep these around, but it would be nice if we 
> could simplify the references. Also, if we were to use 
> publisherDID for referencing the properties, we would be able 
> to distinguish between different properties with the same 
> name, such as "Enzo/DarkMatter/ Density" and 
> "Enzo/Grid/Density". But either solution is fine to me.
> 
I don't think that would be a good solution for queries as it depends on
arbitrary choices made by the publisher.

> >> 4) CompositeProtocol and CompositeExperiment could go (and  
> Now, to summarize (please correct me if I misstate your opinions):
> 

> 1) We agree that the model needs another pass, in part to 
> ensure that it works in practice, and in part to simplify it.
> 
Agreed.

> 2) Whether or not to collapse the Simulation and 
> PostProcessing classes into Protocol and Experiment still 
> needs to be decided.
> 
Yep. As I stated above, my preference is to make at least the distinction
between simulation and (pure) post-processing,
but possibly not more. We need input from others.

> 3) Keep ExperimentRepresentationObject and ExperimentProperty, 
> but try to simplify the reference to attributes of the Protocol.
> 
Agreed.
> 4) CompositeExperiment and CompositeProtocol should be removed.
> 
Agreed.

I will create two versions of the model with these choices: one the domain
version "with references",
the other the denormalised logical version "with attributes".

Gerard