Simulation database data model : note to review

Thu Dec 2 09:26:48 PST 2010

Dear all (dm & theory)

Here are some comments on the SimDB/DM document.

The first point is that the description of the document quoted by Mireille,
"a Note about the data model defined for simulation data",
and the description of the posted document look different to me, so it is
now not clear to me what the final goal of the document is.

As far as I know from the discussions over this time, and as is quite clear
in the appendix, the datamodel is a datamodel *for a database* of
theoretical models, not a datamodel for theoretical models.
However, there are some parts in the text of the main document that
suggest something else (a datamodel of theoretical models that can be used
for a database). Although both DataModels should share some
similar classes, not all classes are relevant (or even needed) in both
cases. I think this distinction must be made explicit in the main
document. As an example, the executive summary says "it
is a model for meta-data describing simulations" instead of "it is a model
for meta-data describing databases of simulations", which is more
correct (please correct me otherwise).

If the DataModel is intended to provide a description of theoretical data,
i.e. a datamodel for theory and not just a datamodel for a database of
theoretical results, I think that some examples are needed
in the document, in particular VOTables of final theoretical products
(I have put some simple examples that could be addressed at the end of this e-mail).

However, the document clearly states that this is not the case
(with its implications and problems). I just propose to clarify
this from the beginning (maybe focusing on the DB aspect of the
model) and to state that the SimDB/DM allows, but is not intended
to define, a SimDM.

In addition, the model includes some data access fields (Sect. 4.7 and 4.8
in particular). In the case of SSAP, it is the access protocol that includes its own data
model for accessing spectra. I agree that here we cover a different
and more complex case but, again, if the DM is for a DataBase definition,
I completely understand the need for such fields. I am not sure whether they
really make sense in the case of a datamodel for simulations...

Before giving my comments on the document, I also wonder
how consistent/compatible the different DataModels must be with each other.

[Note: I cannot attend the Nara Interop, so I would
appreciate it a lot if the discussion were summarized somewhere
on the mailing lists, both theory and DM.
Thanks in advance]

a) Grammar/semantic issues:
 - In the document the word "simulation" is used with different meanings
 (simulation code, simulation result, simulation run, Simulation class,
 etc.). I think it is better not to mix everything into the single word
 "simulation" but to be more specific in each case, especially for
 simulation code and simulation result (and, for instance, to insist
 that the Simulation class is always written in bold and capitalized).

 I understand that it is quite difficult to solve this problem of
 terminology, but in such a complex document it would improve
 its readability a lot.

 - The word "protocol" is used too many times throughout the document, and
 sometimes with different meanings (see also below). I think it
 would be much more useful to look for equivalent words to avoid confusion.
 As an example, in the Executive Summary, what does "SimDB protocol" mean?
 Does it refer to the SimDB architecture? structure? access?

 After long reflection, I think that the class "Protocol" in the
 datamodel is "dangerous". In the context of the datamodel, it refers to
 the design of the experiment. In the context of IVOA it refers to
 data transmission and data access. I think it
 would be much better to look for another word in the DataModel (for
 instance, "procedure"?).

 - I would avoid using the expression "web services". I think the
 intention is to provide a reference to VO services, which fits better in
 the context of IVOA. This implies changing the "Webservice" name
 in the data model to just "service".

-----------------
b) Document specific comments:

1.- In Sect 2: History

 I think the paragraph
 "The design and execution of these simulations has become a
specialised field of  astrophysics, and is these days often 
performed in large collaborations. And while it is still true 
that their results are studied by these groups only, more and
more of these theoretical data are being published online (see for
instance the Appendix B of [28])."
 is misleading. It may cover the case of cosmological simulations,
 but not most of the theoretical data in use (synthesis models,
 isochrones and a lot of other products), which (a) are produced
 by *small* groups, (b) are WIDELY used by the
 community (7000 citations just for the Bruzual & SB99 synthesis
 models, around 15000 citations for the isochrones of the Geneva and
 Padova tracks, and more than 1000 different groups make
 use of those results) and (c) have usually been published
 online and are accessible from the WWW.
 These groups would appreciate an easy-to-use DM that
 could be implemented in the services they already
 offer to the community.

 Since, after the Victoria 2010 Interop, "simulation" in the VO refers to
 any kind of theoretical result, I think it is better to include
both situations in the text.

 Regarding the historical background of SimDB (Cambridge workshop,
 etc.), please mention that it refers JUST to cosmological simulations,
 not to the overall simulations world (i.e. non-cosmological ones).

2.- The analysis used to build up SimDB/DM (Sect. 3.3) is based on
 specific questions that the DM helps to answer.
 The basic question is what scientists would want to ask in order to find
 interesting simulations.
  Following the document's suggestion to provide questions for the case of
  non-cosmological simulations, I include some questions that scientists
 (code producers and users) have asked me when I have tried to
  convince them to include their models in the VO and/or to use the
  VO to access them. In most cases the questions are
 more focused on the use of a particular set of
 simulations than on searching for simulations.

     - How are the results parametrized?
     - Can I access grids of models? Can I access individual results?
     - Which are the input ingredients (usually, which data
           collections are used)?
     - How can I run a simulation? Can I do it on-the-fly?
     - Can I include my simulations in the VO in an easy way?
           What should I do?
     - Can I compare different simulations? Can I compare the
          simulation with my data?
     - Which simulations provide diagnostic tools? (i.e.
           distance/extinction/quasi-scale free quantities)
     - Can I combine the results of different simulations in a single
           file adapted to my needs (e.g. my own code)?

3.- In the Domain model (Sect 3.4)

I think that the difference (and its implications) between
Simulation and Postprocessing (i.e. Simulator/PostProcessor
and Simulation/PostProcessing) must be made more explicit and clear.
After reading the example and the appendix, the situation that is
intended to be mapped in the SimDB DM looks clearer,
but it is still not clear when a Procedure (i.e. Protocol) can
be defined as a Simulation or a PostProcessor.

 As an example, take the use case of synthesis codes: must they be
considered a PostProcessor of a stellar evolution code and
an atmosphere code/library (they just make a
sum over stars)? Or is a photoionization code, in some cases, a
postprocessor of synthesis codes that adds the physics
of nebular emission, if it uses them as input? Or is it a simulator
 in its own right? I think the difference is explained in Sect.
 4.3, where the special case of Simulator is explained (but
 there is no similar explanation for PostProcessor).

 Can a Postprocessor/Postprocessing be included without an associated
 simulator/simulation? It looks to be the case in the Procedure (i.e. Protocol)
 class, but not in the Experiment class.

4.- In Sect. 4.1. Packages

   The section describes the simdb/DM package, where protocol means
 procedure, doesn't it?
  I understand the need for the DAL package to be included in the
  SimDB/DM, but I am not sure it is needed if SimDB/DM is intended to
  be a DataModel for simulations.

5.- In the Sect 4.2. Resource

Another issue is the claimed child-dependency of Experiment and Result.
It may hold for cosmological simulations and
computational results, but not necessarily
for all theoretical results (especially if "simulation"
includes all theoretical results, as decided at
the Victoria 2010 Interop).

There are simulations that are just libraries of models, like
empirical atmosphere libraries (Note: although they are
based on "observations", they are theoretical results in their
own right, insofar as each element in the library represents
a theoretical class of object, in this case a stellar type).
In these cases there is no computational
"procedure" (i.e. protocol) to produce the library. It can
be argued that there is always a procedure to obtain
such a result (like a by-eye classification of stars), but it
is a case where "Experiment" and "Procedure" have a
fuzzy meaning. Similar questions arise from "priors"
like the ones used in photo-z codes: they are theoretical
abstractions but they have no formal
Simulator/PostProcessor associated. Again, there is
a "procedure" to obtain them, but such a procedure has
a fuzzy meaning, or it is difficult (even for the scientist
who produces these results) to map them in a
child-dependency way.
As a final example, there are also theoretical results that combine the
results of different computational codes and algorithms, like the
case of isochrones computed from the results of different
codes (a common example: the inclusion of one group's
evolutionary tracks up to the He-flash, the re-parametrization
of other results using semi-empirical mass-loss
rates for red giants, the use of another group's tracks
for the He-burning phase and thermal pulses, and the
final use of completely different tracks for WD evolution).
In this case there is no single code that computes the
overall stellar evolution, so real scientists mix results
from different codes (sometimes incompatible with each other)
and produce their own theoretical result. Maybe it is not
good scientific practice, but it is a common one and
sometimes the only possible one.

 Although this is a minor point in this section, it has some implications for
 the following sections (see below).

At the end of the section it is made clearly explicit that the SimDB/DM
does not provide a real SimDM. I think this should be stated more
explicitly at the beginning of the document.

6.- In Sect. 4.3 

 As explained in the previous comment, not all theoretical results are
the result of a single computer program, nor can they be represented
by a single one. If the text is presented in terms of "Simulators"
that may *or may not* be associated with a computer program,
the document will gain in clarity. Of course the case of a single
program is a good example, but the model must be a bit more generic
and not just map this quite particular case.

 My suggestion is to relax the comments about Algorithm, Physics, etc.

 I also find a bit confusing the definition of a simulator just by the
 inclusion of Physics and algorithms (even more so if it aims to
 separate the Simulator and PostProcessor classes).

 As an example, there is no physics in isochrone-based synthesis codes,
 which just provide a weighted sum of stars along
 an isochrone, but there is some physics in other synthesis codes
 (Fuel Consumption Theorem based ones) that actually
 include physics and algorithms in the computation. However, only
 synthesis code developers are aware of such a distinction.
 So, depending on who includes the code in the SimDB, it can be registered
 as a Simulator or a PostProcessor, and depending on
 who uses the SimDB database, it will be looked for in a different class....

 Again, I find the problem in the PostProcessor/Simulator
characterization.

7.- Sect 4.4

 Most of my previous comments are closely related to the
"InputParameter" class. It is defined in terms of a single software code
 base, without covering the most common use cases of mixing
code results or of "non-software based" theoretical results
 (I just refer again to the scientific literature to get a
simple idea of the most common cases I describe).

 I think that some of my comments may be solved by changing
 "InputParameters" to "ProcedureParameters" or just "Parameters"
 and relaxing the "Procedure" (i.e. Protocol in the document) definition.

 In this case it is possible to create directly a Procedure that just
 provides access to results in a user-oriented way
 (formally it can be a set of programs without physics or algorithms
 that just provide access to theoretical results, which
 can be considered a "procedure" in its own right, just as a
 semi-empirical stellar library for population synthesis is a
 "procedure" by itself).
 Another example case that does not fit in the current model, but may
 fit in the proposed extension, is a code that provides
 synthetic photometry from spectra: no ObjectType can be defined,
 nor a Target, nor input "parameters" in a
 way consistent with the current SimDB/DM model.

 However, in this case the Field class (currently
 under the ObjectType class) must be studied: this type of Procedure
 (i.e. Protocol in the document) may have no associated object class,
 nor an experiment goal (it may just provide
 access to data). The current model (in Sect. 4.6) assumes that a
 Property class (where the Field class appears) is defined
 under the ObjectType class. In the situation mentioned before, the
 Property class is a subclass of no ObjectType.

 One possible solution is to have a generic Field class that can be
 used directly in the "ProcedureParameter" class
 without the need to obtain it from the ObjectType class.
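
 A minimal sketch of this suggestion in Python, only to illustrate the idea.
 The class names Field, ObjectType and ProcedureParameter come from the
 discussion above; the attribute names (definition, object_type, properties,
 parameters) and the example values are hypothetical and are not taken from
 the SimDB/DM document:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Generic description of a quantity (hypothetical attributes)."""
    name: str
    datatype: str = "double"
    unit: Optional[str] = None


@dataclass
class ObjectType:
    """An object class that may own some Field definitions."""
    name: str
    properties: List[Field] = field(default_factory=list)


@dataclass
class ProcedureParameter:
    """A parameter that holds a Field directly; the ObjectType is optional."""
    definition: Field
    object_type: Optional[ObjectType] = None


@dataclass
class Procedure:
    name: str
    parameters: List[ProcedureParameter] = field(default_factory=list)


# A synthetic-photometry code: its parameter has no associated ObjectType.
photometry = Procedure(
    "synthetic photometry",
    [ProcedureParameter(Field("filter", datatype="char"))],
)

# A procedure whose parameter is still tied to an ObjectType, as it is now.
star = ObjectType("star", [Field("Teff", unit="K")])
isochrone = Procedure("isochrone", [ProcedureParameter(star.properties[0], star)])
```

 In this sketch both situations coexist: a parameter may point to an
 ObjectType, but it does not have to.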

 Although this suggestion would solve the previous case, it is still
 not clear how to describe a code that uses as input a file or
 a collection of files. This is a very common use case in
 theoretical astronomy research, where the results of one field (like stellar
 evolution) are used in other fields (like stellar clusters, galaxy
 evolution or cosmology).

8.- Sect 4.5

  In the proposed generalization, TargetObject and Process, like
  ObjectType, have a more general and fuzzy
  meaning than the one described in the text, which would be useful
  just as an example.

10.- 4.8 Data access service

 I am not sure whether the data access service should be included in the
 datamodel, since it is an Access Protocol
 issue. I just quote the SSAP example, which contains its own DataModel
 for accessing spectra.

 I understand that, in a formal model *for a DataBase*, it looks
 natural. However, it depends on the
 purpose the DataBase has been designed for, and applies just to that case.

Notes to the Appendix:

A.2: Quantities and Units:
There are also some efforts related to the characterization and
description of physical quantities in
general in the IVOA proposed recommendation:
http://www.ivoa.net/Documents/SSLDM/20101004/index.html

------------------

Examples to be included in the document if it is aimed at a SimDM and not
just a SimDB DM

If the DataModel is intended to provide a description of theoretical
data, i.e. a datamodel for theory
and not just a datamodel for a database of theoretical results, there should
be a simple example
of a serialization of the model in a single VOTable. I just propose two
examples:

a) A code called "myCode1 v.1" that uses the
Stefan-Boltzmann law to link stellar evolution (m, L_bol, Teff) with
atmosphere parameters (log g, log Teff)

log L_bol = cte1 + 2 log R + 4 log Teff
log g = cte2 + log m - 2 log R
  --->  log g = cte3 + log m - log L_bol + 4 log Teff
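
As an illustration of what such a code computes, here is a minimal Python
sketch. It assumes L_bol = 4 pi R^2 sigma Teff^4 and g = G m / R^2, works in
solar units, and fixes the constant with assumed solar reference values
(log g = 4.438 in cgs, Teff = 5772 K). The function name and its argument
conventions are hypothetical:

```python
import math

LOG_G_SUN = 4.438   # assumed log10 of the solar surface gravity (cgs)
TEFF_SUN = 5772.0   # assumed solar effective temperature in K


def log_g(mass, log_lbol, log_teff):
    """Return log10(g) in cgs units.

    mass     : stellar mass in solar masses
    log_lbol : log10(L_bol / L_sun)
    log_teff : log10(Teff / K)
    """
    # Eliminating R between L = 4 pi R^2 sigma Teff^4 and g = G m / R^2
    # gives g proportional to m * Teff^4 / L.
    return (LOG_G_SUN + math.log10(mass) - log_lbol
            + 4.0 * (log_teff - math.log10(TEFF_SUN)))


# Solar inputs give back the reference value LOG_G_SUN.
solar = log_g(1.0, 0.0, math.log10(TEFF_SUN))
```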

b) Another code, "myCode2 v1", that associates stellar magnitudes
(from an external grid "star magnitude grid v1",
parametrized by log g, log Teff) with isochrones (from an external grid
"isochrone grid v1", parametrized by t, m(t), L_bol(t), Teff(t)) to
produce a theoretical color-magnitude diagram.

So, what would the corresponding VOTables look like in these cases?
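
Just as a starting point for that discussion, here is a rough Python sketch
that writes the output of example (a) using only generic VOTable elements
(RESOURCE, TABLE, PARAM, FIELD, TABLEDATA). It is not the SimDB/DM
serialization: the table and column names, the UCD assignments and the
omission of the XML namespace and of any utype are all assumptions made for
illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical column names with illustrative UCDs.
COLUMNS = [
    ("m", "phys.mass"),
    ("log_L_bol", "phys.luminosity"),
    ("log_Teff", "phys.temperature.effective"),
    ("log_g", "phys.gravity"),
]


def build_votable(rows, code_name="myCode1 v.1"):
    """Serialize rows of (m, log_L_bol, log_Teff, log_g) as a VOTable string."""
    votable = ET.Element("VOTABLE", version="1.2")
    resource = ET.SubElement(votable, "RESOURCE", name=code_name)
    table = ET.SubElement(resource, "TABLE", name="atmosphere_parameters")
    # Record which code produced the table as a table-level PARAM.
    ET.SubElement(table, "PARAM", name="code", datatype="char",
                  arraysize="*", value=code_name)
    for name, ucd in COLUMNS:
        ET.SubElement(table, "FIELD", name=name, datatype="double", ucd=ucd)
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = repr(float(value))
    return ET.tostring(votable, encoding="unicode")


xml_text = build_votable([(1.0, 0.0, 3.761, 4.438)])
```

What such a table cannot say by itself is how the columns map onto the
classes of the model, which is exactly the point that an example in the
document should clarify.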

NOTE: These two examples are related to the use cases defined since the
Kyoto Interop (but I think they were proposed even earlier)

With best regards

Miguel