XML Schema for the Simulation Data Model

Wed Feb 13 00:30:53 PST 2008

Hi Gerard,

First, I've moved my documents to a separate page on the Twiki with  
the long-winded URL:
http://www.ivoa.net/twiki/bin/view/IVOA/IVOATheorySimulationCADACDatamodel

I've posted revisions to the--now three--schemas based on your input,  
along with some commentary and screenshots.

Second, I understand that I have come late to the discussion about
SNAP and the simulation data models, and I am not trying to slow down
the process. However, I started this effort to see if there could be
a reduced, or simplified, model that met most simulation data  
providers' needs. I'm glad you took the time to respond, since this  
is exactly the type of discussion I hoped to have.

> Thanks for posting your work on the theory mailing list.
> Please let's keep discussing the work here and not go offline too soon.

Gladly. I appreciate the time difference, since it allows more time  
for me to respond.

> I am trying to understand the relation of your schema to the SNAP data
> model and schema we are working on. To this end I have created a UML
> version of it which I have attached as a JPG. I also attached three
> JPGs of the SNAP model as today updated on the theory wiki. The updates
> are not very involved, mainly some details refined and cleaned up.

Yes, I think you captured the model I had in my head while I was  
drafting the schema nicely. I am still coming around to UML, and  
haven't gotten in the habit of modeling with a program. I tend to use  
a lot of scratch paper and whiteboard space. But, I can see the
advantages in a case like this where we need a medium for communication.

> Comparing your model with this latest version of the SNAP DM I think
> the following correspondences can be made (I am ignoring detailed
> differences in attributes etc.):
>
> Rick's model               SNAP model
> ---------------------------------------------------------------
> ProgramType                SNAPProtocol, SNAPSimulator
> SimulationType             SNAPProject (1 below)
> RunType                    SNAPSimulation (1)
> CharacterisationAxisType   Property (of ObjectType) (2 below)
> CharacterisationType       Characterisation
> ParameterType              InputParameter+ParameterSetting
> inputSnapshot              InputDataset

Yes, even with my changes, these are how things map between the  
models. This actually points out something I wanted to achieve, which  
was a reduction in the different types of things being described by  
the model.

> 1. You have a SimulationType and a RunType. The latter seems to
> correspond to a SNAPSimulation, as it contains the collection of input
> parameters and snapshots and has its own reference to a ProgramType.
> At first I assumed that your SimulationType corresponded to a number
> of SNAPSimulation-s, all with the same program and characterization.
> But from the example instance document you sent around I guess it is
> actually more like the SNAPProject. Is this correct?
> Your SimulationType has a reference to ProgramType as well. Is this
> supposed to model a kind of pipeline?

Not necessarily model the pipeline, as much as record the pipeline.  
From your comments, I realized that I treat what SNAP calls a
Protocol as something that is implicit, and a result of what is  
actually done, rather than being explicitly spelled out in the  
beginning. I think the need to document a Protocol to publish  
simulation results is too much overhead for little gain. And, I  
suspect a new Protocol would have to be written for every research  
project, and there would be a lot of time spent documenting one-off  
results.

> 2. The concept of ObjectType is missing in your model.

Yes, at the level and ubiquity at which it appears in the SNAP model,  
it is missing, quite deliberately. I did not originally fully agree  
with Dave DeYoung's comment at the last InterOp that SNAP was
heavily biased towards particle simulations, but I think SPH has left  
its mark here in the ObjectType.

When thinking of scale-free turbulence simulations done on a grid, my  
mind does not conjure up objects in the sense of objects in the  
simulation. Now, if a SNAPSimulation was an ObjectType (which it is  
not) I could imagine defining properties of the simulation that I  
wanted to characterize, and referencing those properties in a  
snapshot. Likewise for TargetProcess--that seems applicable for  
describing the interesting properties of a turbulence run.

> This makes it impossible to have multiple explicitly defined types of
> objects inside a single simulation. In the SNAP DM each object type is
> defined explicitly with its own set of properties. Note that my choice
> of using the name Property has been a point of contention by some of
> the Characterisation DM people, who wanted me to use Axis. I see you
> have chosen their side ;).

As an aside, I am not sold either way on axes vs. properties. I just  
happen to find axes easier to describe to others.

> I feel that for many simulations the ObjectType is definitely and
> explicitly present. For example some of the SPH simulations I have
> access to here have dark matter, star and gas particles, each with its
> own properties.

Yes, this is the SPH imprint. Unfortunately, I can't tell from  
looking at the model how to describe something that isn't composed
of distinct objects. I tried to write an instance document that was  
counting the grid cells, but that just seemed silly. And, more  
importantly, that information is available in the parameter file.  
There could be some useful information in the number of grids in an  
adaptive mesh simulation, but I don't believe people are going to  
search for simulations of galaxy clusters with more than 100,000  
grids--they're just going to look for simulations of galaxy clusters.

> I can see though that when someone's database is ever going to contain
> only one type of simulation, one might want to remove the extra
> "indirection" of the ObjectType.
> Obviously related to this is the absence of ObjectCollection. In the
> SNAP model this is the anchor that ties a list of characterizations to
> the properties of a particular object type. If you remove one, you can
> remove the other.
> Note that only today I added a ChildObject to the model. This is the
> outcome of an offline (sorry!) discussion with mainly Laurent Bourges
> and Herve Wozniak. They model galaxies being built from disks and
> bulges, each with their own properties.

I think the ObjectType and ObjectCollection have their uses. But,  
there needs to be a means to describe the simulation when there is  
nothing remotely resembling an object present. Perhaps part of this  
is my perception that grids and grid cells should not be treated as a  
collection of objects. Individual grid cells have little meaning  
without the structure of the entire mesh.

> 3. I assume Group and GroupedQuantity are borrowed from the Spectrum
> data model's XSD serialization? Because of single inheritance you have
> a problem with ProgramType, which can now not be a Resource. If instead
> you had made Resource a Group (impossible of course in the IVOA
> context), you could have ProgramType be a Resource as well.
> I must say I don't like Group very much. ID and IDREF are useful only
> when the element being referenced exists in the same XML document.
> This I think will often not be the case. I see it as an example of
> inheritance run wild.

I turned this comment into a suggestion, removed Group, and
spelled out the references (which I think are only used for the
characterization). I also went ahead with something I knew would have
to be done, and split Program into its own resource. This allows for
a much richer description of the software, the ability to reference
its parameters, etc.
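
To make the reference problem concrete, here is a minimal Python sketch.
The element and attribute names (axis, characterisation, axisRef) are
made up for illustration and are not taken from either schema. It mimics
a same-document ID lookup: a reference resolves only when its target
lives in the document being parsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; names are illustrative assumptions.
SIMULATION_DOC = """
<simulation>
  <axis id="density"/>
  <characterisation axisRef="density"/>
  <characterisation axisRef="temperature"/>
</simulation>
"""


def resolve_refs(xml_text):
    """Map each axisRef to the element with that id in the SAME document."""
    root = ET.fromstring(xml_text)
    # Index every element that carries an id attribute.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    # Look up each reference; a target defined elsewhere yields None.
    return {
        el.get("axisRef"): by_id.get(el.get("axisRef"))
        for el in root.iter("characterisation")
    }


resolved = resolve_refs(SIMULATION_DOC)
print(resolved["density"].tag)  # axis
print(resolved["temperature"])  # None
```

The second reference fails because "temperature" would be defined in some
other document, which is the case Gerard expects to be common.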

> Btw, I have for a while wanted to remove the inheritance of Resource
> from the SNAP data model, and done so in today's update. It is too
> restrictive I find. I think one can take a SNAP model instance and turn
> it into a Resource if one wants to register it, but that does not mean
> it "is a" resource in our model. There are more flexible ways of using
> existing models than always using inheritance. In particular the
> Content of Resource is very cumbersome.
> The SNAP model is supposed to describe the Content already.

I've used the Resource inheritance on the assumption that it will  
make it easier for existing registries to register Simulations and  
Programs. At the very least, they can handle the common elements. And  
since (as you note later), I've removed TargetObject and  
TargetProcess, I need the Content to describe the purpose of the  
simulation.

> 4. You have InputParameter and ParameterSetting merged into 1,
> ParameterType. Note that I have added an attribute "value",
> representing the "xsd:string" inheritance in your ParameterType. In an
> earlier version of the SNAP DM I had made the same choice for
> simplicity. However Franck LePetit for example argued that redefining
> the list of parameters for his simulation types would be very costly.
> If one runs parameter studies with lists of 100s of parameters it is
> better to have the parameters defined once on the Protocol (where they
> belong really), and only add the parameter settings on the experiment.
> Problem is that in XML this is often more involved, as one needs to
> somehow reference the parameter that may not exist in the same XML
> document (so IDREF will not work) etc etc.
> Again, in one's particular database I can well see people choosing one
> or the other. For the SNAP DM I have now chosen the more correct way.

This point also contributed to my splitting off Program. This way,  
all of the default parameters can be defined, and only the changed  
parameters need to be recorded.
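
As a rough illustration of that split, here is a small Python sketch.
The parameter names and values are hypothetical, not taken from any real
code or from the schemas. The program carries the full list of defaults,
a run records only what it changed, and the effective settings are the
defaults with the changes applied on top.

```python
# Hypothetical defaults defined once on the program description.
# Values are strings, mirroring a string-valued parameter type.
PROGRAM_DEFAULTS = {
    "GridSize": "64 64 64",
    "StopTime": "1.0",
    "CourantNumber": "0.5",
}


def effective_parameters(defaults, changed):
    """Combine a program's defaults with the settings recorded for one run."""
    # A run may only change parameters that the program actually defines.
    unknown = set(changed) - set(defaults)
    if unknown:
        raise KeyError(f"parameters not defined by the program: {sorted(unknown)}")
    # Later entries win, so the run's changes override the defaults.
    return {**defaults, **changed}


# The run records a single changed parameter.
run_changes = {"StopTime": "2.5"}
params = effective_parameters(PROGRAM_DEFAULTS, run_changes)
print(params["StopTime"])       # 2.5
print(params["CourantNumber"])  # 0.5
```

For a parameter study with hundreds of parameters, each run then stores
only its handful of changes rather than the whole list.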

Another comment on Program vs. Protocol and Simulator: Protocol and  
Simulator just seemed more abstract than necessary, and used new  
terms to describe things where old ones were fine (software, program,  
code, etc.). The idea of an executable with input and output just  
gets totally lost in the SNAP model. Now, I will admit that the idea  
of documenting protocols is very attractive, but perhaps that's  
something for SNAP 2.0.

> 5. You do not have TargetObjectType, TargetProcess, Algorithm, Physics,
> SNAPWebService. These were all introduced explicitly to support
> discovery (first 4) and execution (SNAPWebService) in the SNAP
> protocol.

Understood, which is why I've gone with the Resource inheritance,  
since it provides a mechanism to describe the purpose of the  
simulation. And Algorithm and Physics come from Program, where the  
Methods and InputParameters determine which equations are being  
solved and how. As for web service, I'm skipping that one until I  
have a web service that needs describing.

> All in all it seems though that the models are pretty compatible, with
> the SNAP model being more general and comprehensive, as one should
> expect for a model that needs general application.

I agree on the compatibility, especially in light of my changes, but  
I disagree on the generality statement. Since my model only records  
the programs used, and the inputs and outputs, it can be used to  
describe a greater range of simulations (or program executions).  
Basically, because the model makes fewer assumptions about what is  
being modeled, I think it has a greater scope. What it lacks is the
level of detail that the SNAP model has, which I'm not sure is
required to describe most simulations well enough to support
discovery.

For amusement I've posted a trivial example of describing a copy as a  
simulation.
http://lca.ucsd.edu/projects/rpwagner/wiki/CPInstance

I don't know if this is a good demonstration of generality, or just a  
reminder that valid XML does not necessarily imply valid meaning.
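
In the same spirit, here is a toy Python sketch of the "program plus
inputs and outputs" idea. It is not the content of the linked instance
document; the class names, fields, and file names are assumptions made
up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Program:
    """The software that was executed (hypothetical minimal description)."""
    name: str


@dataclass
class Run:
    """One execution of a program: what went in and what came out."""
    program: Program
    parameters: dict = field(default_factory=dict)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def describe(run):
    """Summarize a run as 'program: inputs -> outputs'."""
    return f"{run.program.name}: {', '.join(run.inputs)} -> {', '.join(run.outputs)}"


# A file copy recorded as if it were a simulation run.
copy_run = Run(
    program=Program("cp"),
    inputs=["original.dat"],
    outputs=["copy.dat"],
)
print(describe(copy_run))  # cp: original.dat -> copy.dat
```

Nothing in the structure knows or cares whether the program is a
hydrodynamics code or cp, which is both the appeal and the caveat.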

> For now I see your model as an alternative representation of (a subset
> of) the information in the full model, that can have its particular
> application area.

I think subset is the right idea. And the application area I'm
targeting is "sufficient for users to find simulations of galaxy
clusters or compressible turbulence done by Enzo or Gadget". And this
summarizes my needs-based motivation. What I'm trying for is
SNAP-Lite, or ReallySimpleSNAP; just enough elements to describe the
process, and ones that existing registries can handle.

> In that it would be similar to other models, for example from
> Patrizia Manzato and from the Horizon team (see the links in the
> "Existing data models..." paragraph in
> http://www.ivoa.net/twiki/bin/view/IVOA/IVOATheorySimulationDatamodel )

My model is much closer to SNAP than the Horizon, GalICS, or ITVO  
models. Those are all oriented around a single code. I use Enzo as  
the example, because that's most of the data I'm dealing with, but  
that does not mean it's the only code that can be represented.

--Rick

>
>
> ________________________________________
> From: owner-theory at eso.org [mailto:owner-theory at eso.org] On Behalf Of Rick Wagner
> Sent: Tuesday, February 12, 2008 2:08 AM
> To: theory at ivoa.net
> Subject: XML Schema for the Simulation Data Model
>
> Hi,
>
> After working to understand the current SNAP Data Model (in particular
> the current proposed XML Schema), I decided to distill it into a single
> document with fewer types. I've had some success, so I've posted the
> Schema, and a sample instance document on the Twiki attached to the
> IVOATheorySimulationDatamodel page:
>
> Schema
> http://www.ivoa.net/internal/IVOA/IVOATheorySimulationDatamodel/Simulation.xsd
>
> Sample Instance
> http://www.ivoa.net/internal/IVOA/IVOATheorySimulationDatamodel/SimulationInstance.xml
>
> At the bottom of the page there are links to screenshots of the
> elements and data types, which help to show their relations.
>
> This schema keeps the same method of characterization as the SNAP
> model, by defining the axes up front (or, at the top), but is less
> abstract. It treats a simulation and its data as the results of running
> a program with defined input parameters, and does not try to describe
> everything about the method and numerical representation. To me, these
> are things defined by the program (the software), and could be handled
> by defining a separate VOResource for "Program" or "Software Project".
>
> If this work looks interesting to anyone, I would be glad to write up
> a fuller description, and even put some documentation in the Schema
> and instance documents.
>
> I plagiarized heavily from both the SNAP Data Model and the Spectral
> Schema, so any credit should go to Gerard and the DAL, Data Modeling
> group, and annoyed comments sent my way.
>
> --Rick
>
> -------------------------------------------------------------------------
> Rick Wagner, Graduate Student Researcher
> UCSD Physics
> 9500 Gilman Drive
> La Jolla, CA 92093-0424
> Email: rwagner at physics.ucsd.edu
> WWW: http://lca.ucsd.edu/projects/rpwagner
> (858) 822-4784 Phone
> -------------------------------------------------------------------------
> Measuring programming progress by lines of code is
> like measuring aircraft building progress by weight.
> --Bill Gates
> -------------------------------------------------------------------------
>
>
>
> <RickWagner.jpg><SNAP__postprocessing.jpg><SNAP__simulation.jpg><SNAPDataModel.jpg>
