Comments on the Simulation Data Model

Fri Mar 13 01:06:53 PDT 2009

Hi Gerard,

Thanks for the response. To make things clearer, I'm going to split  
my reply into two messages, which is what I should have done at the  
beginning. In this one, I'll cover the contents of the data model,  
and in the next I'll reply to our discussion about the note.

On Mar 12, 2009, at 3:37 AM, Gerard wrote:

> Dear Rick and others
>
> My comments below, which mainly try to explain (in the wordy way I
> cannot avoid) my motivations for particular choices.
>
>
>> I think these will simplify the model a little, without
>> losing any of the information contained in it. Also, these
>> changes may open up the possibility of describing more data
>> and protocols with the model.
>>
> I am viewing your comments from two perspectives:
> 1- "domain model", what model describes "reality" the best.
> This model is not meant to be directly used in applications (i.e.
> protocols), but to be a template for such "logical" models.
>
> 2- "logical model", which model do we want to use for the SimDB
> specification. I.e. the model from which we directly derive the
> physical representations, which for SimDB are the XML serialisation
> and the TAP query interface.
> This model has to fulfil different requirements from the domain model.
> It needs to be usable, and need not necessarily be as abstract and
> rich in concepts as the domain model.
>
> Originally we started defining a domain model, that then evolved
> (albeit somewhat implicitly) into a more usable model, and then
> migrated back towards higher abstraction.
> The current model may be too close to a domain model, and we need to
> come (asap) to a more usable form, though possibly deriving that from
> the current model.

I think this is what I'm suggesting. I don't want to lose the  
distinction between the Protocol and Experiment, but I would like to  
reduce the overall complexity of the model.

> In particular the heavy referencing that is going on between
> experiments and protocols is very hard to use, especially in XML
> messaging. A possible way to treat those is to use the "name"
> attributes that most referenced objects have, and which moreover in
> their contexts are almost always unique (names of parameters,
> representation object types inside the protocol, properties inside
> their representation object).
>
> In Volute I have added a denormalised version of the UML diagram, under
> http://volute.googlecode.com/svn/trunk/projects/theory/snapdm/input/SimDB_DM_denormalised.xml.
> I have mainly simplified it by turning references from experiment to
> protocol into attributes with the name of the referenced object. In
> relational databases this is not so important, but for XML it makes a
> huge difference if we do not have to model the referencing using
> IVOIdentifier-s for example.

I agree, a reference to a name attribute is much easier to implement,  
and the namespace of the referenced instance is usually clear. (E.g.,  
if I reference the input parameter CosmologyComovingBoxSize of an  
Enzo simulation, I know exactly which parameter that is.) However,  
publisherDID may be preferable to name; publisherDIDs can be typed as
URL-like fields, which are less likely to get mangled than a name.
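
To make the two referencing styles concrete, here is a minimal Python
sketch. It is only an illustration under stated assumptions: the class
names, the helper functions, and the "example://" identifier format are
hypothetical and are not taken from the SimDB model or from any
publisherDID specification.

```python
from dataclasses import dataclass, field


@dataclass
class InputParameter:
    name: str           # assumed unique within its protocol
    publisher_did: str  # hypothetical URL-like identifier


@dataclass
class Protocol:
    name: str
    parameters: list = field(default_factory=list)

    def parameter_by_name(self, name):
        """Resolve a reference by name; the protocol is the namespace."""
        matches = [p for p in self.parameters if p.name == name]
        if len(matches) != 1:
            raise LookupError(
                f"{name!r} matched {len(matches)} parameters in {self.name!r}"
            )
        return matches[0]


def parameter_by_did(protocols, did):
    """Resolve a reference by identifier, with no protocol context needed."""
    for protocol in protocols:
        for parameter in protocol.parameters:
            if parameter.publisher_did == did:
                return parameter
    raise LookupError(did)


# Hypothetical identifier; the real publisherDID format is not given here.
BOX_SIZE_DID = "example://enzo/parameters/CosmologyComovingBoxSize"
enzo = Protocol(
    "Enzo", [InputParameter("CosmologyComovingBoxSize", BOX_SIZE_DID)]
)

by_name = enzo.parameter_by_name("CosmologyComovingBoxSize")
by_did = parameter_by_did([enzo], BOX_SIZE_DID)
```

The name lookup needs to know which protocol it is being asked about,
while the identifier lookup does not; that is the trade-off between the
two approaches.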

> (I was planning to bring this up next time I have proper time to work
> on SimDB, which hopefully starts in a week or so and should last until
> the interop with short interruptions.)

Well, I'm glad I chose this time to bring it up, then.

>> Here they are, in a semi-dependent order:
>>
>> 1) Remove the Simulator, PostProcessing, and ClusterFinder classes.
>> All these classes provide is a very limited taxonomy.
>> Instead, add a "Class" or "Type" attribute to the Protocol
>> class. This attribute can be an enumeration, like the
>> RepresentationObject, e.g., "simulator", "initial conditions
>> generator", "cluster finder", "custom", etc. The collection
>> of Physics instances can be brought up to the Protocol level,
>> since many Protocols model physical processes, not just simulators.
>>
> This is a common choice one needs to make, whether creating a new
> type/class, or whether to use a "type" attribute that indicates the
> "type" of object.
> I just want to explain the kind of thoughts I think should go into
> making a decision one way or another.
>
> Reasons where one might choose to introduce a new subclass are when
> the structure of the corresponding type of object differs from its
> siblings. Currently the main difference between Simulator and the
> other subclasses of Protocol is indeed the collection of Physics
> objects (indicating which physical differential equations for example
> were "simulated").
> Any protocol that simulates physics would therefore have to be a
> Simulator. This includes codes that add new physics to existing
> results, for example semi-analytical galaxy formation routines.
>
> The problem with using an attribute to indicate the type is that
> particular structural constraints become harder to express. For
> example we might want to insist that for some protocol to be
> classified as a Simulator it must at least have 1 Physics object in
> its collection. This is easy to express using the cardinality
> property on composition relations, but requires some constraint
> expression language when using attributes.
> The PostProcessing type that we had was meant to include protocols
> that *do not* add new physics.
> Again this was simply expressed now.

You have hit on one of my motivations with the word "constraints". My
recent experiences have led me to suggest that we make the model both
less complex and less constrained. Once the subclasses of
PostProcessor and PostProcessing were eliminated (save for halo
finding), much of our data (extractions and projections) no longer
fit the model. I'm then left with a choice of using the Simulator and
Simulation classes as dumping grounds for the data that doesn't fit,
or relaxing the model.

Since the goal is to enable myself and others to publish as much data  
as possible, I believe a broader definition of Experiment and  
Protocol is preferable over eliminating some data because it doesn't  
fit the model. And, since these classes don't describe any  
functionality (i.e., methods), then it's safer to push the  
classification of the Protocol to an attribute. Kind of like sticking  
a label that says "Dog" on a particular "Mammal". Currently, we have  
a model that has the labels "Dog" and "Cat", and I'm trying to add  
some new labels, so long as we can describe any "Mammal" in  
sufficient detail.
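
As a rough sketch of the attribute-based approach, the Python below
keeps a single Protocol class with a "type" label and checks the "at
least one Physics object" rule in a separate validation step. The
enumeration values come from the examples in this thread; the class
layout and the validate() helper are assumptions made for illustration,
not part of the model.

```python
from dataclasses import dataclass, field

# Enumeration values taken from the examples in this thread; the real
# list is still undecided.
PROTOCOL_TYPES = {
    "simulator",
    "initial conditions generator",
    "cluster finder",
    "custom",
}


@dataclass
class Protocol:
    name: str
    type: str                                    # the "label" on the protocol
    physics: list = field(default_factory=list)  # allowed on any protocol


def validate(protocol):
    """Return a list of constraint violations (empty if there are none)."""
    problems = []
    if protocol.type not in PROTOCOL_TYPES:
        problems.append(f"unknown type {protocol.type!r}")
    # The cardinality rule that a Simulator subclass would express
    # structurally becomes an explicit check here.
    if protocol.type == "simulator" and not protocol.physics:
        problems.append("a simulator needs at least one Physics entry")
    return problems


ok = validate(Protocol("Enzo", "simulator", ["gravity"]))
missing = validate(Protocol("Enzo", "simulator"))
relaxed = validate(Protocol("projection", "custom"))
```

The constraint is still expressible, but it lives in validation code
rather than in the class structure, which is the cost Gerard describes
above.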

> The main question for the domain model is whether the real world can
> be this clearly demarcated.
> For the logical model we need to decide whether it is easier to work
> with one or the other.
>
> In the TAP interface to SimDB the difference would be that in the
> "typed" version one can pose the query
>
> select * from Simulator where ...
>
> The alternative is to write
>
> select * from Protocol where type='simulator' and ...
>
> The latter query btw can still be asked as well, as Protocol is also
> queriable and there will be a type column (though currently generated
> as DTYPE by the code generator I created with Laurent).
>
> In the XML representation a Simulator would be represented by an
> element of name
> <aSimulator>
> ....
>
> The alternative is to have (using always elements for attribute
> mappings)
>
> <aProtocol>
> <type>simulator</type>
> ...

Likewise, you can do a join on Physics and Protocol, and only select  
Protocols that have associated Physics.
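
The two query styles can be tried out with a small in-memory SQLite
database from Python. This is only a sketch: the table layout, column
names, and sample rows (including the "HaloFinder" and "SemiAnalytic"
protocols) are made up for illustration and are not the schema produced
by the SimDB code generator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Protocol (
        id   INTEGER PRIMARY KEY,
        name TEXT,
        type TEXT
    );
    CREATE TABLE Physics (
        id          INTEGER PRIMARY KEY,
        protocol_id INTEGER REFERENCES Protocol(id),
        name        TEXT
    );
    INSERT INTO Protocol VALUES
        (1, 'Enzo', 'simulator'),
        (2, 'HaloFinder', 'cluster finder'),
        (3, 'SemiAnalytic', 'custom');
    INSERT INTO Physics VALUES
        (1, 1, 'gravity'),
        (2, 1, 'hydrodynamics'),
        (3, 3, 'galaxy formation');
    """
)

# Select by the type attribute.
by_type = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM Protocol WHERE type = 'simulator' ORDER BY name"
    )
]

# Select every protocol that has associated Physics, whatever its type.
with_physics = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT p.name FROM Protocol p "
        "JOIN Physics ph ON ph.protocol_id = p.id ORDER BY p.name"
    )
]
```

In this toy data the join also returns the 'custom' protocol that adds
physics, which the type filter alone would miss.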

>> 2) Similarly, remove the Simulation, PostProcessing, and
>> ClusterDetection. The type of these experiments is defined by
>> the type of protocol they are created with. Again, the
>> AppliedPhysics collection can be moved up to the Experiment
>> class, along with the reference to the protocol, and the
>> execution time.
>>
> The decision on this will have to follow the decision about the
> previous suggestion.
>
>> 3) Remove ExperimentRepresentationObject and
>> ExperimentProperty. I've brought these up before, and I still
>> think they are being used to represent a linking table that
>> doesn't need to be explicitly declared. There are 1..*
>> references from the Experiment to a representation's
>> properties; from there the representation can be found.
>>
> Indeed we have discussed this before but it is good to mention it again.
> In TAP there does not exist such a 1..* referencing concept.
> And in our UML profile we don't have this type of aggregation, we
> introduce associative classes like the ones you mentioned.
> But the reason why I did so is that they allow one to describe
> explicitly the choice made in an experiment which of a protocol's
> possible representation objects to use.
> This is explicitly possible in many simulation codes.
> E.g. most SPH codes allow one to have pure dark matter simulations,
> but one can also add gas, and stars.
> Then for each of these choices one can make explicit choices which
> properties to calculate.
>
> Since this is possible, the most accurate model (in my opinion) is the
> one we have.
> So in the domain model I'd like to keep it.
>
> In the logical model again we have to look at different usages.
> In particular if there we want to remove many of the references and
> use names instead, as is done in the denormalised version of the model.
> In that case it becomes especially important to keep the
> ExperimentRepresentationObject to give context to the names of the
> ExperimentProperty. For in most cases the positions of particles are
> named x,y,z whether star, gas or dark matter particles.

I'm willing to keep these around, but it would be nice if we could  
simplify the references. Also, if we were to use publisherDID for
referencing the properties, we would be able to distinguish between
different properties with the same name, such as
"Enzo/DarkMatter/Density" and "Enzo/Grid/Density". But either solution
is fine with me.
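
The ambiguity can be shown in a few lines of Python. The property
records and the three lookup helpers below are hypothetical; only the
two identifier strings come from the paragraph above.

```python
# Hypothetical property records for one protocol. Two representation
# objects each define a property called "Density".
properties = [
    {"representation": "DarkMatter", "name": "Density",
     "did": "Enzo/DarkMatter/Density"},
    {"representation": "Grid", "name": "Density",
     "did": "Enzo/Grid/Density"},
]


def find_by_name(name):
    """Bare name lookup: ambiguous when names repeat."""
    return [p for p in properties if p["name"] == name]


def find_in_context(representation, name):
    """Name lookup scoped by the representation object."""
    return [
        p for p in properties
        if p["representation"] == representation and p["name"] == name
    ]


def find_by_did(did):
    """Identifier lookup: the context is carried in the identifier."""
    return [p for p in properties if p["did"] == did]


ambiguous = find_by_name("Density")
scoped = find_in_context("Grid", "Density")
by_did = find_by_did("Enzo/Grid/Density")
```

A bare name matches both records, while either the representation
object context or the identifier narrows the lookup to one.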

>> 4) CompositeProtocol and CompositeExperiment could go (and
>> therefore ChildProtocol and ChildExperiment). While I can see
>> a use case for defining a CompositeProtocol for running an
>> experiment, I'm not sure it's necessary for describing one.
>> And, it gets confusing that CompositeProtocol can define its
>> own parameters and representations, and so can the
>> ChildProtocols. This makes it unclear where to define these
>> things. Likewise for CompositeExperiment. And, the Project
>> class serves as another mechanism to aggregate experiments.
>>
> I agree with this suggestion.

Ahh, a simple one, great!

Now, to summarize (please correct me if I misstate your opinions):

1) We agree that the model needs another pass, in part to ensure that  
it works in practice, and in part to simplify it.

2) Whether or not to collapse the Simulation and PostProcessing  
classes into Protocol and Experiment still needs to be decided.

3) Keep ExperimentRepresentationObject and ExperimentProperty, but try  
to simplify the reference to attributes of the Protocol.

4) CompositeExperiment and CompositeProtocol should be removed.

--Rick