Comments on the Simulation Data Model

Thu Mar 12 03:37:32 PDT 2009

Dear Rick and others

My comments are below. They mainly try to explain (in the wordy way I cannot
avoid) my motivations for particular choices.

> I think these will simplify the model a little, without 
> losing any of the information contained in it. Also, these 
> changes may open up the possibility of describing more data 
> and protocols with the model.
> 
I am viewing your comments from two perspectives:
1- "domain model": which model describes "reality" best.
This model is not meant to be used directly in applications (i.e. protocols),
but to be a template for such "logical" models.

2- "logical model": which model do we want to use for the SimDB specification,
i.e. the model from which we directly derive the physical representations,
which for SimDB are the XML serialisation and the TAP query interface.
This model has to fulfil different requirements from the domain model.
It needs to be usable, and need not necessarily be as abstract and rich in
concepts as the domain model.

Originally we started defining a domain model, which then evolved (albeit
somewhat implicitly) into a more usable model, and then migrated back towards
higher abstraction. The current model may be too close to a domain model, and
we need to arrive as soon as possible at a more usable form, though possibly
deriving that from the current model.

In particular, the heavy referencing between experiments and protocols is
very hard to use, especially in XML messaging. A possible way to treat these
references is to use the "name" attributes that most referenced objects have,
and which moreover are almost always unique within their contexts (names of
parameters, representation object types inside the protocol, properties
inside their representation object).

In Volute I have added a denormalised version of the UML diagram, under
http://volute.googlecode.com/svn/trunk/projects/theory/snapdm/input/SimDB_DM_denormalised.xml.
I have mainly simplified it by turning references from experiment to protocol
into attributes holding the name of the referenced object. In relational
databases this is not so important, but for XML it makes a huge difference if
we do not have to model the referencing using, for example, IVOIdentifier-s.
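
To make the name-based referencing concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the class layout, the parameter names and the protocol name "code-a" are hypothetical and are not taken from the SimDB model files. The point is that an experiment refers to a protocol's parameter by its name, and the name is resolved within the context of that one protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Protocol:
    name: str
    # Parameter names are assumed to be unique within one protocol.
    parameter_names: set = field(default_factory=set)


@dataclass
class Experiment:
    protocol: Protocol
    # Denormalised form: settings are keyed by parameter name,
    # not by an identifier-based reference to a parameter object.
    settings: dict = field(default_factory=dict)

    def set_parameter(self, name, value):
        # The name only has meaning inside the context of this protocol.
        if name not in self.protocol.parameter_names:
            raise KeyError(
                f"protocol {self.protocol.name!r} has no parameter {name!r}"
            )
        self.settings[name] = value


# Hypothetical protocol and parameter names, for illustration only.
code_a = Protocol("code-a", {"softening_length", "n_particles"})
run = Experiment(code_a)
run.set_parameter("softening_length", 0.05)
```

The lookup by name replaces the explicit reference, which is what keeps the XML form light; the price is that uniqueness of names within their context has to be guaranteed.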

(I was planning to bring this up next time I have proper time to work on
SimDB, which hopefully starts in a week or so and should last until the
interop, with short interruptions.)

> Here they are, in a semi-dependent order:
> 
> 1) Remove the Simulator, PostProcessing, and ClusterFinder classes.  
> All these classes provide is a very limited taxonomy. 
> Instead, add a "Class" or "Type" attribute to the Protocol 
> class. This attribute can be an enumeration, like the 
> RepresentationObject, e.g., "simulator", "initial conditions 
> generator", "cluster finder", "custom", etc. The collection 
> of Physics instances can be brought up to the Protocol level, 
> since many Protocols model physical processes, not just simulators.
> 
This is a common choice one needs to make: whether to create a new type/class
or to use a "type" attribute that indicates the "type" of object.
I just want to explain the kind of thoughts I think should go into making the
decision one way or the other.

One reason to introduce a new subclass is that the structure of the
corresponding type of object differs from that of its siblings. Currently the
main difference between Simulator and the other subclasses of Protocol is
indeed the collection of Physics objects (indicating, for example, which
physical differential equations were "simulated").
Any protocol that simulates physics would therefore have to be a Simulator.
This includes codes that add new physics to existing results, for example
semi-analytical galaxy formation routines.

The problem with using an attribute to indicate the type is that particular
structural constraints become harder to express. For example, we might want to
insist that for a protocol to be classified as a Simulator it must have at
least 1 Physics object in its collection. This is easy to express using the
cardinality property on composition relations, but requires some constraint
expression language when using attributes.
The PostProcessing type that we had was meant to include protocols that *do
not* add new physics.
Again, this is simply expressed in the model as it is now.
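
The difference can be sketched in Python as follows. This is a rough illustration under my own assumptions, not the SimDB model itself: the `TypedProtocol` class, the `check_typed_protocol` function and the type strings are hypothetical. In the subclass version the "at least 1 Physics" rule sits with the class; in the attribute version it has to be stated as a separate constraint.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Physics:
    name: str


@dataclass
class Protocol:
    name: str


@dataclass
class Simulator(Protocol):
    # Composition with cardinality 1..*: the rule belongs to the class.
    physics: List[Physics] = field(default_factory=list)

    def __post_init__(self):
        if len(self.physics) < 1:
            raise ValueError("a Simulator needs at least one Physics object")


@dataclass
class TypedProtocol:
    # Attribute-based alternative: one class, the type is just a string.
    name: str
    type: str
    physics: List[Physics] = field(default_factory=list)


def check_typed_protocol(p: TypedProtocol) -> None:
    # The same rules, now expressed as constraints outside the structure.
    if p.type == "simulator" and not p.physics:
        raise ValueError("type 'simulator' requires at least one Physics object")
    if p.type == "post-processing" and p.physics:
        raise ValueError("type 'post-processing' must not add new physics")
```

Note that `TypedProtocol("x", "simulator")` can be constructed without any Physics; nothing complains until the separate check is run.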

The main question for the domain model is whether the real world can be
demarcated this clearly.
For the logical model we need to decide which of the two is easier to work
with.

In the TAP interface to SimDB the difference would be that in the "typed"
version one can pose the query

select * from Simulator where ...

The alternative is to write

select * from Protocol where type='simulator' and ...

The latter query can, by the way, still be asked as well, since Protocol is
also queryable and there will be a type column (though it is currently
generated as DTYPE by the code generator I created with Laurent).
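
As a rough sketch of how the two query forms relate, the following Python snippet uses an in-memory SQLite database as a stand-in for a TAP service. Only the DTYPE column comes from the discussion above; the table layout, the example rows and the idea of exposing Simulator as a view are my assumptions for illustration, not a description of what the code generator actually produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table for all protocols; DTYPE records the concrete subclass.
    CREATE TABLE Protocol (
        id    INTEGER PRIMARY KEY,
        name  TEXT,
        DTYPE TEXT
    );
    INSERT INTO Protocol (name, DTYPE) VALUES ('code-a', 'Simulator');
    INSERT INTO Protocol (name, DTYPE) VALUES ('finder-b', 'ClusterFinder');
    INSERT INTO Protocol (name, DTYPE) VALUES ('code-c', 'Simulator');

    -- The "typed" table, here simply a view over the rows of that type.
    CREATE VIEW Simulator AS
        SELECT id, name FROM Protocol WHERE DTYPE = 'Simulator';
    """
)

# "Typed" query: select from Simulator.
typed = conn.execute("SELECT name FROM Simulator ORDER BY name").fetchall()

# Attribute query: select from Protocol, filtering on the type column.
by_attribute = conn.execute(
    "SELECT name FROM Protocol WHERE DTYPE = 'Simulator' ORDER BY name"
).fetchall()
```

Both queries return the same rows; the typed form just saves the user from knowing the name and values of the type column.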

In the XML representation a Simulator would be represented by an element
named
<aSimulator> 
.... 

The alternative is to have (using always elements for attribute mappings)

<aProtocol>
<type>simulator</type>
...
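
For a client the two XML forms differ mainly in where the type is looked up. A minimal Python sketch, assuming a hypothetical `name` child element that is not part of the fragments above:

```python
import xml.etree.ElementTree as ET


def typed_form(name):
    # The element name itself carries the type.
    root = ET.Element("aSimulator")
    ET.SubElement(root, "name").text = name
    return root


def attribute_form(name, type_):
    # Generic element; the type is carried by a child element.
    root = ET.Element("aProtocol")
    ET.SubElement(root, "type").text = type_
    ET.SubElement(root, "name").text = name
    return root


def is_simulator(elem):
    # A reader that accepts both forms has to check two places.
    if elem.tag == "aSimulator":
        return True
    return elem.tag == "aProtocol" and elem.findtext("type") == "simulator"


typed_xml = ET.tostring(typed_form("code-a"), encoding="unicode")
attribute_xml = ET.tostring(attribute_form("code-a", "simulator"), encoding="unicode")
```

In the typed form an XML Schema can validate the content of `<aSimulator>` directly; in the attribute form the dependence of the content on the value of `<type>` needs an extra rule.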

> 2) Similarly, remove the Simulation, PostProcessing, and 
> ClusterDetection. The type of these experiments is defined by 
> the type of protocol they are created with. Again, the 
> AppliedPhysics collection can be moved up the Experiment 
> class, along with the reference to the protocol, and the 
> execution time.
> 
The decision on this will have to follow the decision about the previous
suggestion.

> 3) Remove ExperimentRepresentationObject and 
> ExperimentProperty. I've brought these up before, and I still 
> think they are being used to represent a linking table that 
> doesn't need to be explicitly declared. There are 1..* 
> references from the Experiment to a representation's 
> properties; from there the representation can be found.
> 
Indeed we have discussed this before, but it is good to mention it again.
In TAP no such 1..* referencing concept exists.
In our UML profile we do not have this type of aggregation either; instead we
introduce associative classes like the ones you mentioned.
But the reason I did so is that they allow one to describe explicitly the
choice, made in an experiment, of which of a protocol's possible
representation objects to use.
Many simulation codes explicitly offer this choice.
E.g. most SPH codes allow one to run pure dark matter simulations, but one can
also add gas and stars.
Then, for each of these, one can choose explicitly which properties to
calculate.

Since this is possible, the most accurate model (in my opinion) is the one
we have.
So in the domain model I'd like to keep it.

In the logical model we again have to look at the different usages,
in particular if we want to remove many of the references there and use names
instead, as is done in the denormalised version of the model.
In that case it becomes especially important to keep the
ExperimentRepresentationObject, to give context to the names of the
ExperimentProperty objects. For in most cases the positions of particles are
named x, y, z, whether they are star, gas or dark matter particles.
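
A small Python sketch of why the context matters once references become names. The object type names and property names below are hypothetical examples, not taken from any actual protocol description:

```python
from collections import defaultdict

# Properties chosen in one experiment, as (representation object, property)
# pairs. All names here are illustrative only.
chosen = [
    ("dark matter particle", "x"),
    ("dark matter particle", "y"),
    ("gas particle", "x"),
    ("gas particle", "temperature"),
]

# Without the representation object, the two "x" entries collapse into one
# and we can no longer tell which particle type they belong to.
flat_names = {prop for _, prop in chosen}

# Keeping the representation object as context keeps every choice distinct.
by_object = defaultdict(list)
for obj, prop in chosen:
    by_object[obj].append(prop)
```

Four choices were made, but the flat set of names only holds three entries; the grouping by representation object preserves all four.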

> 4) CompositeProtocol and CompositeExperiment could go (and 
> therefore ChildProtocol and ChildExperiment). While I can see 
> a use case for defining a CompositeProtocol for running an 
> experiment, I'm not sure it's necessary for describing one. 
> And, it gets confusing that CompositeProtocol can define its 
> own parameters and representations, and so can the 
> ChildProtocols. This makes it unclear where to define these 
> things. Likewise for CompositeExperiment. And, the Project 
> class serves as another mechanism to aggregate experiments.
> 
I agree with this suggestion.

> My final comment is a suggestion for the contents of the Note 
> on the data model:
> 
> 1) An overview of the model, including the packages and major 
> classes (Experiment, Protocol, Snapshots, etc.).
> 
> 2) A discussion of how characterization is treated in the model.
> 
> 3) XML Schemas for the top-level container classes (Protocol, 
> Experiment, etc.).
> 
Agreed with this.

> 4) VOTable serializations of the the container classes. (This 
> is of particular interest to me, since I've been working on 
> that as part of the SimDAP Note.)
> 
Why do you think this is necessary?
So far we have considered using TAP as one side of the protocol, and
consequently using its prescription for specifying the tabular representation
of the model.
This may come in a variety of forms: VODataService, TAP_SCHEMA tables,
possibly VOTable.
But the latter is not the main one for TAP as far as I can see.

> 5) The table and catalog definitions can in as a reference 
> for anyone building a VODataService using the model.
Is this equivalent to what I suggested in the previous comment?

> 
> 3, 4 and 5 can go in as appendices, and the automatically 
> generated documentation can either be an appendix or stay as 
> stand-alone document. Once TAP is sorted out, writing the 
> SimDB standard should mostly consist of writing an XML Schema 
> that subclasses the TAP registry schema, and providing 
> whatever description of the database schema is required for TAP.
> 

Cheers

Gerard