XML Schema for the Simulation Data Model
Gerard
gerard.lemson at mpe.mpg.de
Wed Feb 13 09:22:46 PST 2008
Hi Rick
(To all, my apologies for another long email. At the end I make some
suggestions on how we might avoid this by moving the discussion to the wiki
pages. I hope that will work.)
My comments below:
> Gladly. I appreciate the time difference, since it allows more time
> for me to respond.
Same here, though it tends to lure me into working through the day on a
reply, which then gets very long.
> Yes, I think you captured the model I had in my head while I was
> drafting the schema nicely. I am still coming around to UML, and
> haven't gotten in the habit of modeling with a program. I tend to use
> a lot of scratch paper and whiteboard space. But, I can see the
> advantages in a case like this where we need a medium for communication.
Whiteboard is by far the best tool for data modeling. UML tools simply allow
you to record the whiteboard sessions!
I have started porting the documentation of the data model to the wiki pages
(http://www.ivoa.net/twiki/bin/view/IVOA/IvoaTheory_SNAPDataModelOnlinDoc )
I have also added a page documenting the small subset of UML modeling
elements I have been using:
http://www.ivoa.net/twiki/bin/view/IVOA/UmlSyntaxRule
Both are very much work in progress.
> Yes, even with my changes, these are how things map between the
> models. This actually points out something I wanted to achieve, which
> was a reduction in the different types of things being described by
> the model.
>
I think the goal of the discussion should include decisions on which
elements can be skipped and which have to remain. There we should also get
the input of the other participants, who should compare their own models.
This is an action we agreed on in Cambridge; Patrizia and Ugo (I hope you're
reading this) have started work on this, and the French are active as well.
> Not necessarily model the pipeline, as much as record the pipeline.
Ah yes, in this IVOA work I have gotten used to using "model" with the
meaning "create a model of the metadata", so indeed recording or describing
a pipeline is what I meant.
> From your comments, I realized that I treat what SNAP calls a
> Protocol as something that is implicit, and a result of what is
> actually done, rather than being explicitly spelled out in the
> beginning. I think the need to document a Protocol to publish
> simulation results is too much overhead for little gain. And, I
> suspect a new Protocol would have to be written for every research
> project, and there would be a lot of time spent documenting one-off
> results.
>
I do not quite agree. When designing models for the VO we must be careful
with implicit knowledge. Also, in my experience, once you know something can
be modeled explicitly, but you leave it out, it has a habit of popping up
later anyway, so I always try to be as explicit as possible to start with.
We can then at the next stage try to smooth over some of the details.
The advantage of having the protocol explicit and in some detail in the
model and the SNAP registry is that it can be reused. For example, a given
public version of Enzo is a protocol that could (should) be registered once,
and then everyone using it only needs to refer to it. This allows us to ask
questions like "give me all simulations that used Enzo version a.b.c".
The level of detail with which we need to model the protocol is up for
debate. It has obtained some extra structure recently based on the use cases
from Franck. His code (protocol) has MANY input parameters and he would like
to avoid re-describing those for each simulation he runs.
But we should not overdo things.
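To make the reuse idea a bit more concrete, here is a minimal sketch of what
a registered protocol record might look like. The element names, identifier
and parameters are purely illustrative and do not come from the current
schema:

  <!-- purely illustrative sketch, element names are hypothetical -->
  <snapProtocol id="ivo://example/protocol/enzo-a.b.c">
    <name>Enzo</name>
    <version>a.b.c</version>
    <!-- input parameters are described once here; every simulation that
         uses this protocol only refers to this record -->
    <inputParameter name="OmegaMatter" datatype="real"/>
    <inputParameter name="TopGridDimensions" datatype="integer" arraysize="3"/>
  </snapProtocol>

A query like "all simulations that used Enzo version a.b.c" then reduces to
selecting the experiments whose protocol reference points at this one record.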
> > 2. The concept of ObjectType is missing in your model.
>
> Yes, at the level and ubiquity at which it appears in the SNAP model,
> it is missing, quite deliberately. I did not originally fully agree
> with Dave DeYoung's comment at the last InterOp that the SNAP was
> heavily biased towards particle simulations, but I think SPH has left
> its mark here in the ObjectType.
>
> When thinking of scale-free turbulence simulations done on a grid, my
> mind does not conjure up objects in the sense of objects in the
> simulation. Now, if a SNAPSimulation was an ObjectType (which it is
> not) I could imagine defining properties of the simulation that I
> wanted to characterize, and referencing those properties in a
> snapshot. Likewise for TargetProcess--that seems applicable for
> describing the interesting properties of a turbulence run.
>
Actually I think this may be a misunderstanding of the concept represented
by the ObjectType and its concrete sub-types, RepresentationObjectType and
TargetObjectType. Maybe the names are misleading?
ObjectType is really mainly intended to be a hook on which to hang sets of
properties that belong together, and it can serve as the base class for the
other two. Its meaning very much follows the idea of OO design, where
objects have properties and possibly child objects (in the SNAP DM since
today, upon request by "the French"). We could have called it Thing instead.
RepresentationObjectType is supposed to describe the "object(s)" that your
simulation uses when representing the part of the universe that you are
simulating. This can be particles in an n-body simulation, or clusters in a
group finder. In an image (result of an observation) it would be a pixel.
The TargetObjectType is used together with TargetProcess to describe the
goal of your experiment. The former represents the real-world "object(s)"
that are the goal of the experiment, the latter a particular physical
process that you are studying.
In your example of "scale-free turbulence simulations done on a grid" the
RepresentationObjectType would be a grid cell, and you could define the
properties that are included in your simulation, such as position,
temperature, pressure, mass, etc.
What the goal of your experiment is, is somewhat up to you to decide,
possibly taking into account what other people might be interested in. If
you study turbulence in some gas cloud you could have a target object type
instance "cloud" and a process instance for "turbulence". Once the semantics
WG comes up with usable and standard ontologies we can use those as labels
in our objects. Until then I have limited the SNAP DM to using UCDs, subject
keywords from astronomy journals and a few numerical types.
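Purely as an illustration (the element names are invented and the UCD
strings only indicative, so this is not the actual schema), the turbulence
example might then be described along these lines:

  <!-- hypothetical instance fragment, not the real SNAP schema -->
  <representationObjectType name="gridCell">
    <property name="position"    ucd="pos.cartesian"    datatype="real" arraysize="3"/>
    <property name="temperature" ucd="phys.temperature" datatype="real"/>
    <property name="pressure"    ucd="phys.pressure"    datatype="real"/>
    <property name="mass"        ucd="phys.mass"        datatype="real"/>
  </representationObjectType>
  <targetObjectType name="cloud"/>
  <targetProcess name="turbulence"/>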
One important goal of the validation of the model is to see whether we can
actually describe the goal of the experiment in this way, and whether such
descriptions provide the information for the first questions that users ask.
> As an aside, I am not sold either way on axes vs. properties. I just
> happen to find axes easier to describe to others.
I find that the natural way to think about things is that they have
properties, not axes. This is why I keep insisting on "property" as the name
for the concept. It is also used in all OO design and languages (it is even
explicit now in C#).
I have come up with a compromise which includes "axis" in the one place
where I could see its use. I have renamed the reference from
Characterisation to Property in the model from "property" to "axis". See the
latest data model on the wiki. So in its *use* of ObjectType.Property, the
Characterisation type calls it "axis". That is OK.
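To show what I mean by this naming compromise, here is a hypothetical
fragment (invented element names, not the schema itself) in which the
Characterisation refers to a Property of the object type under the role
name "axis":

  <objectType name="gridCell">
    <property id="prop.temperature" name="temperature" ucd="phys.temperature"/>
  </objectType>
  <characterisation>
    <!-- "axis" is only the name of the reference; the thing referred to
         remains a Property of the object type -->
    <axis propertyRef="prop.temperature">
      <!-- coverage, resolution, etc. of the temperature values would go here -->
    </axis>
  </characterisation>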
>
> Yes, this is the SPH imprint.
I hope it is clear from the foregoing that it is actually not SPH inspired.
> Unfortunately, I can't tell from
> looking at the model how to describe something that isn't comprised
> of distinct objects. I tried to write an instance document that was
> counting the grid cells, but that just seemed silly. And, more
> importantly, that information is available in the parameter file.
About the parameter file, each simulator code (protocol) will likely have
different ways of representing its input information and it is not in all
cases clear what the effects of certain input parameters will be. We need a
common way of expressing this kind of information. This was the motivation
for the Characterisation DM and is the motivation for similar elements in
the SNAP DM. How useful they are for discovery we need to find out.
> There could be some useful information in the number of grids in an
> adaptive mesh simulation, but I don't believe people are going to
> search for simulations of galaxy clusters with more than 100,000
> grids--they're just going to look for simulations of galaxy clusters.
They may be interested in simulations of galaxy clusters that use grid cells
to represent them, rather than SPH particles. And they may want to know what
properties you calculate for these grid cells. This is all that
representation object type is supposed to support.
I agree that the characterization of the collections of grid cells may be of
less interest at discovery time. It is in the model but could be ignored.
There is likely a place for some kind of characterization in the target
object type. For example, I may be interested in simulations that create a
cluster of about 1e14 M_solar. This was noted by Laurent and Herve as well
and is currently not (yet) included in the model.
> I think the ObjectType and ObjectCollection have their uses. But,
> there needs to be a means to describe the simulation when there is
> nothing remotely resembling an object present. Perhaps part of this
> is my perception that grids and grid cells should not be treated as a
> collection of objects. Individual grid cells have little meaning
> without the structure of the entire mesh.
It may be correct that a grid cell is different from a particle, which can
move all over the place, whereas neighbouring cells are relevant for mesh
simulations. To me this mainly means that there is a range of ways to
represent the world in simulations. Recently Volker Springel showed results
of his work on hydrocodes using Voronoi tessellations for discretising
space. Here grid cells flow past each other from one time step to the next
and neighbours may lose contact.
In any case, the object type is there to indicate what kinds of objects are
used in the simulation, and what their properties are. Apart from discovery,
this knowledge is also useful to get an indication of the content of the
data files. For a mesh these will (I suppose) represent the individual cells
in some way. This can be implicit, as could be imagined easily for a
regular, non-adaptive grid, or for the pixels in a FITS file.
One also needs to consider individual grid cells when making cut-outs for
example. So the whole grid is not always required. As another example I
often query the Millennium database for individual grid cells (the density
field is stored on a grid) that have particular properties, so I can
correlate these with properties of galaxies inside these grid cells.
> I turned this comment into a suggestion, and removed Group, and
> spelled out the references (which I think are only used for the
> characterization). I also went ahead with something I knew would have
> to be done, and split Program into its own resource. This allows for
> a much richer description of the software, the ability to reference
> its parameters, etc.
>
Ok. The protocol can/must still be worked out in more detail. I know Franck
and coworkers were thinking about this as well, I think for the case of
registering simulation codes and related codes. For the use case of
discovery we may not need everything that could be said about them, but it
cannot hurt to analyse this part of the model somewhat further.
> I've used the Resource inheritance on the assumption that it will
> make it easier for existing registries to register Simulations and
> Programs. At the very least, they can handle the common elements. And
> since (as you note later), I've removed TargetObject and
> TargetProcess, I need the Content to describe the purpose of the
> simulation.
When discussing the relation to the registry with Ray Plante, we noted that
individual simulations are in general too fine-grained to register as
individual Resources. So in general it is not true that a simulation "is a"
(in the sense of type inheritance) Resource.
But this should not be a problem. Once we have a SNAP registry, it will not
be hard to create some XSLT scripts to turn those SNAP resources for which
it is appropriate into Registry Resource documents.
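As a sketch of the kind of XSLT meant here (the namespace and element names
on the SNAP side are invented, and the registry record is heavily
simplified), such a script could look roughly like this:

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:snap="http://www.example.org/SNAP/v0.1">
    <!-- map a (hypothetical) snap:simulation onto a simplified registry
         resource record -->
    <xsl:template match="snap:simulation">
      <Resource>
        <title><xsl:value-of select="snap:name"/></title>
        <content>
          <description>
            <xsl:value-of select="snap:description"/>
          </description>
        </content>
      </Resource>
    </xsl:template>
  </xsl:stylesheet>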
For discovery I wanted to model the coarse Content model in more detail;
leaving it in would lead to redundancy, especially since many of its
features are required. Hence I moved away from inheritance, which in my
experience is one of the most easily abused features in data modeling
efforts, often causing unnatural constraints. I prefer to postpone such
modeling features to the end.
> Another comment on Program vs. Protocol and Simulator: Protocol and
> Simulator just seemed more abstract than necessary, and used new
> terms to describe things where old ones were fine (software, program,
> code, etc.). The idea of an executable with input and output just
> gets totally lost in the SNAP model. Now, I will admit that the idea
> of documenting protocols is very attractive, but perhaps that's
> something for SNAP 2.0.
>
I borrowed the term protocol from an early model
http://www.ivoa.net/internal/IVOA/IvoaDataModel/DomainModelv0.9.1.doc
which itself was inspired by a similar construct in a book on "Analysis
patterns". It can be used also to describe how one can do a telescope
observation, or calibration. SNAPProtocol-s will always include some kind of
Program, but that is not the only thing we need to describe it. The various
protocols are an attempt at classifying the different types of codes and
their maning and we need to decide how far we want to go.
A somewhat more refined classification allows one to ask questions like
"give me all the group finders you know of in the SNAP registry".
Also, implicitly it allows one to classify the different types of
experiments (and their results etc), simply by the type of protocol they
refer to.
> > 5. You do not have TargetObjectType, TargetProcess, Algorithm,
> > Physics,
> > SNAPWebService. These were all introduced explicitly to support
> > discovery
> > (first 4) and execution (SNAPWebService) in the SNAP protocol.
>
> Understood, which is why I've gone with the Resource inheritance,
> since it provides a mechanism to describe the purpose of the
> simulation. And Algorithm and Physics come from Program, where the
> Methods and InputParameters determine which equations are being
> solved and how. As for web service, I'm skipping that one until I
> have a web service that needs describing.
>
Physics is intended to represent the evolution equations describing physical
processes, which are translated into code using some numerical algorithm.
This distinguishes (in the model) a simulator from, for example, a group
finder, which has algorithms and code and is a program, but is not a
simulator.
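In a sketch (element names and values invented for illustration only), the
difference would show up roughly like this:

  <!-- a simulator: physics plus the algorithm that discretises it -->
  <simulator name="SomeHydroCode">
    <physics>gravity, hydrodynamics</physics>
    <algorithm>adaptive mesh refinement</algorithm>
  </simulator>

  <!-- a group finder: algorithm and code, but no physics element -->
  <groupFinder name="FoF">
    <algorithm>friends-of-friends percolation</algorithm>
  </groupFinder>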
> > All in all it seems though that the models are pretty compatible,
> > with the
> > SNAP model being more general and comprehensive, as one should
> > expect for a
> > model that needs general application.
>
> I agree on the compatibility, especially in light of my changes, but
> I disagree on the generality statement. Since my model only records
> the programs used, and the inputs and outputs, it can be used to
> describe a greater range of simulations (or program executions).
> Basically, because the model makes fewer assumptions about what is
> being modeled, I think it has a greater scope. What it lacks is the
> level of detail that the SNAP model has, which I'm not sure is
> required to adequately describe most simulations sufficiently to
> support discovery.
>
> For amusement I've posted a trivial example of describing a copy as a
> simulation.
> http://lca.ucsd.edu/projects/rpwagner/wiki/CPInstance
>
> I don't know if this is a good demonstration of generality, or just a
> reminder that valid XML does not necessarily imply valid meaning.
The fact that you can do almost everything with a particular model is not
necessarily a good thing. It may mean that too little information is
available to support the use cases.
For SNAP we are interested in programs that do particular kinds of things,
one of which is that they produce a representation of 3D space. Some of
these programs do so by following evolution equations from initial
conditions; others analyse the results of these and come up with an object
catalogue, etc. We furthermore want to be able to give sufficient
information to a registry so that users can make up their minds from the
metadata whether they may be interested in a particular resource.
>
> > For now I see your model as an alternative representation of (a
> > subset of)
> > the information in the full model, that can have its particular
> > application
> > area.
>
> I think subset is the right idea. And the application area I'm
> targeting is "sufficient for users to find simulations of galaxy
> clusters or compressible turbulence done by Enzo or Gadget". And this
> summarizes my needs based motivation. What I'm trying for is SNAP-
> Lite, or ReallySimpleSNAP; just enough elements to describe the
> process, and ones that existing registries can handle.
>
As I mentioned above, existing registries are not necessarily the right
place for publishing simulations. This is what we want the more detailed
"SNAP registry" for.
Also, the approach that we wanted to follow was first to create a model that
properly analyses the domain, possibly in more detail than we later need or
want for use in discovery. The current SNAP model is taking a step from such
a generic analysis model of the domain to the logical model for use in
discovery in SNAP, but is likely not quite there yet.
I think it is too early to start developing simple, lite versions of the
final model that may not do justice to various types of simulations, for
example by not supporting some of the questions we think are important.
>
> My model is much closer to SNAP than the Horizon, GalICS, or ITVO
> models. Those are all oriented around a single code. I use Enzo as
> the example, because that's most of the data I'm dealing with, but
> that does not mean it's the only code that can be represented.
>
I agree that your model, being based on a more general SNAP model, is more
general than those other models. But those models may have features that we
can learn from and provide test cases that we need or want to be able to
support.
As promised, a long email.
It would be good if we can crystallise out of this and the following
discussions a list of issues that warrant more thought.
a list of issues that warrant more thought. There is a page on the wiki
dedicated to this, written as a kind of FAQ and currently very underused:
http://www.ivoa.net/twiki/bin/view/IVOA/IvoaTheory_SNAPDataModelDiscussion
This page can also serve as a documentation of the discussions we have and
the decisions we have made. This may shortcut discussions that will no doubt
come up when we want to present the models to the wider IVOA community. If
they can read about the discussions we have had and see the arguments in a
somewhat concise manner they do not have to bring these up again.
Please feel free to add questions/comments there.
Further, let us if possible focus the discussion on the model that is under
development on the theory pages. If there are arguments for changing it
let's present these changes with their motivation explicitly. And if an
alternative model is presented, let's try to describe it in terms of
adjustments to the existing model, as we have tried doing now. Then at least
there is a consistent history and we can avoid the impression of starting
from scratch.
Best regards
Gerard