Simulation database data model : note to review
Gerard
gerard.lemson at mpe.mpg.de
Fri Dec 3 10:19:33 PST 2010
Hi Miguel
Thanks for your feedback, I will read it on the plane.
But I do want to give a quick first answer just before I leave.
The SimDB/DM is NOT meant to be a model for a database containing simulation
data.
SimDB/DM is a model for describing simulations, including how they were run,
what their goals were and some characterisation of their results. It
therefore can best be seen as a model for metadata describing simulations.
From this model we can (and do) derive different representations that can be
useful in particular circumstances.
For this proposal the XML schema and UTYPE representations are required.
So we have a schema that defines the format of XML documents containing such
descriptions.
We have examples of such documents, and these should go into the document
prepared by Franck.
They are not VOTables. The metadata is too hierarchical to easily fit in the
flat table structure.
We do have a relational database mapping, useful for example in a TAP
context.
I will discuss this during a DAL session. But it still is mainly about
metadata, not data.
I hope to have a chance to discuss your other comments in Nara.
Cheers
Gerard
From: dm-bounces at ivoa.net [mailto:dm-bounces at ivoa.net] On Behalf Of Miguel
Cerviño
Sent: Thursday, December 02, 2010 6:27 PM
To: dm at ivoa.net; theory at ivoa.net
Subject: RE: Simulation database data model : note to review
Dear all (dm & theory)
I include some comments on the SimDB/DM document.
The first point is that the description of the document quoted by Mireille,
"a Note about the data model defined for simulation data",
and the description in the posted document itself look different to me, and
it is now not clear to me what the final goal of the document is.
As far as I know from the discussions over this time, and as is quite clear
in the appendix, the datamodel is a datamodel *for a database* of
theoretical models, not a datamodel for theoretical models.
However, some parts of the text of the main document
suggest something else (a datamodel of theoretical models that can be used
for a database). Although both DataModels should share some
similar classes, not all classes are relevant (or even needed) in both
cases. I think that this distinction must be made explicit in the main
document. As an example, the executive summary says "it
is a model for meta-data describing simulations" instead of "it is a model
for meta-data describing databases of simulations", which would be more
correct (please correct me if I am wrong).
If the DataModel is intended to provide a description of theoretical data,
i.e. a datamodel for theory and not just a datamodel for a database of
theoretical results, I think that some examples are needed
in the document, in particular VOTables of final theoretical products
(I have put some simple examples that could be addressed at the end of this
e-mail).
However, the document clearly states that this is not the case
(with the implications and problems that follow). I just propose to clarify
this from the beginning (maybe focusing on the DB aspect of the
model) and to state that the SimDB/DM allows, but is not intended
to define, a SimDM.
In addition, the model includes some data access fields (Sect. 4.7 and 4.8
in particular). In the case of SSAP, it is the access protocol that includes
its own data model for accessing spectra. I agree that here we cover a
different and more complex case, and if the DM is for a database definition
I completely understand the need for such fields. I am not sure that they
really make sense in a datamodel for simulations...
Before including my comments on the document, I also wonder
how consistent/compatible with each other the DataModels must be.
[Note: I cannot attend the Nara Interop, so I would
be very grateful if the discussion were summarized somewhere
on the mailing lists, both theory and DM.
Thanks in advance]
a) Grammar/semantic issues:
- In the document the word "simulation" is used with different meanings
(simulation code, simulation result, simulation run, simulation class,
etc.). I think it is better not to mix everything into the single word
"simulation" but to be more specific in each case, especially for
simulation code and simulation result (and, for instance, to insist
that the Simulation class is always written in bold and capitalized).
I understand that this problem of specification is quite difficult to
solve, but in such a complex document it would greatly improve
understanding.
- The word "protocol" is used too many times throughout the document, and
sometimes with different meanings (see also below). I think it
would be much more useful to look for equivalent words to avoid confusion.
As an example, in the Executive Summary, what does "SimDB protocol" mean?
Does it refer to the SimDB architecture? structure? access?
After long reflection, I think that the class "Protocol" in the
datamodel is "dangerous". In the context of the datamodel, it refers to
the design of the experiment. In the context of IVOA, it refers to
data transmission and access. I think that it
would be much better to look for another word in the DataModel (for
instance, "procedure"?).
- I would avoid using the expression "web services". I think that the
intention is to provide a reference for VO services, which fits better
in the context of IVOA. This implies changing the "Webservice" name
in the data model to just "service".
-----------------
b) Document specific comments:
1.- In Sect 2: History
I think the paragraph
"The design and execution of these simulations has become a
specialised field of astrophysics, and is these days often
performed in large collaborations. And while it is still true
that their results are studied by these groups only, more and
more of these theoretical data are being published online (see for
instance the Appendix B of [28])."
is misleading. It may cover the case of cosmological simulations,
but not most of the theoretical data in use (synthesis models,
isochrones and a lot of other products), which (a) are produced
by *small* groups, (b) have results that are WIDELY used by the
community (7000 citations just for the Bruzual & SB99 synthesis
models, around 15000 citations for the isochrones of the Geneva and
Padova tracks, and more than 1000 different groups make
use of those results) and (c) have results that have usually been
published online and are accessible from the WWW.
These groups would welcome an easy-to-use DM that
could be implemented in the services they already
offer to the community.
Since, after the Victoria 2010 Interop, simulation in the VO refers to
any kind of theoretical result, I think that it is better to include
both situations in the text.
Regarding the historical background of SimDB (Cambridge workshop,
etc.), please mention that it refers JUST to cosmological simulations,
not to the whole world of simulations (i.e. non-cosmological ones).
2.- The analysis used to build up SimDB/DM (Sect. 3.3) is based on
specific questions that the DM should help to answer.
The basic question is what a scientist would want to ask in order to find
interesting simulations.
Following the document's suggestion of questions for the case of
non-cosmological simulations, I include some questions that scientists
(code producers and users) have asked me when I have tried to
convince them to include their models in the VO and/or to use the
VO to access them. In most cases the questions are
more focused on the use of a particular set of
simulations than on searching for simulations.
- How are the results parametrized?
- Can I access grids of models? Can I access individual results?
- What are the input ingredients (usually, which data
collections are used)?
- How can I run a simulation? Can I do it on the fly?
- Can I include my simulations in the VO in an easy way?
What should I do?
- Can I compare different simulations? Can I compare the
simulation with my data?
- Which simulations provide diagnostic tools? (i.e.
distance/extinction/quasi-scale-free quantities)
- Can I combine the results of different simulations in a single
file adapted to my needs (e.g. my own code)?
3.- In the Domain model (Sect 3.4)
I think that the difference (and its implications) between
Simulation and Postprocessing (i.e. Simulator/PostProcessor
and Simulation/PostProcessing) must be made more explicit and clear.
After reading the example and the appendix, the situation that
the SimDB DM intends to map looks clearer,
but it is still not clear when a Procedure (i.e. Protocol) can
be defined as a Simulation or a PostProcessor.
As an example, take the use case of synthesis codes: must they be
considered a PostProcessor of a stellar evolution code and
an atmosphere code/library (they just perform a
sum over stars)? Is a photoionization code, in some cases, a
postprocessor of synthesis codes that adds the physics
of nebular emission, if it uses them as input? Or is it a simulator
in its own right? I think that the difference is explained in Sect.
4.3, where the special case of Simulator is explained (but
there is no similar explanation for PostProcessor).
Can a PostProcessor/PostProcessing be included without an associated
Simulator/Simulation? This looks to be the case in the Procedure (i.e.
Protocol) class, but not in the Experiment class.
4.- In Sect. 4.1. Packages
The section describes the simdb/DM package, where protocol means
procedure, doesn't it?
I understand the need for the DAL package to be included in the
SimDB/DM, but I am not sure it is needed if SimDB/DM is intended to
be a DataModel for simulations.
5.- In the Sect 4.2. Resource
Another issue is the argued child-dependency of Experiment and Result.
It may hold for cosmological simulations and
computational results, but not necessarily
for all theoretical results (especially if simulation
includes all theoretical results, as decided at
the Victoria 2010 Interop).
There are simulations that are just libraries of models, like
empirical atmosphere libraries (Note: although they are
based on "observations", they are theoretical results in their
own right insofar as each element in the library represents
a theoretical class of object, in this case a stellar type).
In these cases there is no computational
"procedure" (i.e. protocol) to produce the library. It can
be argued that there is always a procedure to obtain
such a result (like a by-eye classification of stars), but it
is a case where "Experiment" and "Procedure" have a
fuzzy meaning. Similar questions arise from "priors"
like the ones used in photo-z codes: they are theoretical
abstractions but they have no formal
Simulator/PostProcessor associated. Again, there is
a "procedure" to obtain them, but such a procedure has
a fuzzy meaning, or it is difficult (even for the scientist
who produces these results) to map them in a
child-dependency way.
As a final example, there are also theoretical results that combine the
results of different computational codes and algorithms, like the
case of isochrones computed from the results of different
codes (as a common example: the inclusion of one group's
evolutionary tracks up to the He-flash, the re-parametrization
of other results using semi-empirical mass-loss
rates for red giants, the use of another group's tracks
for the He-burning phase and thermal pulses, and the
final use of completely different tracks for WD evolution).
In this case there is no single code that computes the
overall stellar evolution, so real scientists mix results
from different codes (sometimes incompatible with each other)
and produce their own theoretical result. Maybe it is not
good scientific practice, but it is a common one and
sometimes the only possible one.
Although this is a minor point in this section, it has some implications
for the following sections (see below).
At the end of the section it is made explicit that the SimDB/DM
does not provide a real SimDM. I think that this should be made
explicit at the beginning of the document.
6.- In Sect. 4.3
As explained in the previous comment, not all theoretical results are
the result of a single computer program, nor can they be represented
by a single one. If the text is presented in terms of "Simulators"
that may *or may not* be associated with a computer program,
the document would gain in clarity. Of course the case of a single
program is a good example, but the model must be a bit more generic
and not just map this quite particular case.
My suggestion is to relax the comments about Algorithm, Physics, etc.
I also find the definition of a simulator just by the
inclusion of Physics and algorithms a bit confusing (even more so if it
aims to separate the Simulator and PostProcessor classes).
As an example, there is no physics in isochrone-based synthesis codes,
which just provide a weighted sum of stars along
an isochrone, but there is some physics in other synthesis codes
(Fuel Consumption Theorem based ones) that actually
include physics and algorithms in the computation. However, only
synthesis code developers are aware of such a distinction.
So, depending on who includes the code in the SimDB, it can be registered
as a Simulator or a PostProcessor, and depending on
who uses the SimDB database, it will be looked for in a different class....
Again, I find the problem in the PostProcessor/Simulator
characterization.
7.- Sect 4.4
Most of my previous comments are closely related to the
"InputParameter" class. It is defined on the basis of a single software
code, without covering the most common use cases of mixing
code results or of "non-software based" theoretical results
(I just refer again to the scientific literature to get a
simple idea of the most common cases I describe).
I think that some of my comments may be solved by changing
"InputParameters" to "ProcedureParameters" or just "Parameters"
and relaxing the "Procedure" (i.e. Protocol in the document) definition.
In this case it is possible to create directly a Procedure that just
provides access to results in a user-oriented way
(formally it can be a set of programs without physics or algorithms
that just provide access to theoretical results, which
can be considered a "procedure" in its own right, just as a
semi-empirical stellar library for population synthesis is a
"procedure" by itself).
Another example case that does not fit in the current model, but may
fit in the proposed extension, is a code that provides
synthetic photometry from spectra: no ObjectType can be defined,
nor a Target, nor input "parameters" in a
way consistent with the current SimDB/DM model.
However, in this case the Field class (currently
under the ObjectType class) must be studied: this type of Procedure
(i.e. Protocol in the document) may have no associated object class,
nor an experiment goal (it just provides
access to data). The current model (in Sect. 4.6) assumes that a
Property class (where the Field class appears) is defined
under the ObjectType class. In the situation mentioned above the
Property class would be a subclass of no ObjectType.
One possible solution is to have a generic Field class that can be
used directly in the "ProcedureParameter" class
without the need to obtain it from the ObjectType class.
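A minimal sketch of this suggestion, written in Python purely for illustration. Field, ObjectType and ProcedureParameter are the class names used in this comment; their attributes (name, datatype, unit, properties, quantity, object_type) and the example values are assumptions and are not taken from the SimDB/DM document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Generic, stand-alone description of a quantity (hypothetical attributes)."""
    name: str
    datatype: str
    unit: Optional[str] = None


@dataclass
class ObjectType:
    """An object class that may own Fields, as in the current model."""
    name: str
    properties: List[Field] = field(default_factory=list)


@dataclass
class ProcedureParameter:
    """A parameter that points to a Field directly; the ObjectType is optional."""
    quantity: Field
    object_type: Optional[ObjectType] = None


# Synthetic photometry code: a parameter with no ObjectType at all.
band = Field(name="filter_band", datatype="string")
photometry_input = ProcedureParameter(quantity=band)

# Stellar evolution code: the same class still works with an ObjectType.
star = ObjectType(name="star", properties=[Field("Teff", "float", "K")])
evolution_input = ProcedureParameter(quantity=star.properties[0], object_type=star)
```

With the ObjectType made optional, the synthetic photometry case and the classical "parameter of an object" case can be described by the same class.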
Although this suggestion would solve the previous case, it is still
not clear how to describe a code that uses as input a file or
a collection of files, which is a very common use case in
theoretical astronomy research, where results from one field (like stellar
evolution) are used in other fields (like stellar clusters, galaxy
evolution or cosmology).
8.- Sect 4.5
Under the proposed generalization, TargetObject and Process, like
ObjectType, have a more general and fuzzy
meaning than the one described in the text, which would then be useful
just as an example.
10.- 4.8 Data access service
I am not sure if the data access service should be included in the
datamodel, since it is an Access Protocol
issue. I just cite the SSAP example, which contains its own DataModel
for accessing spectra.
I understand that, in a formal model *for a DataBase*, it looks
natural. However, it depends on the
purpose the DataBase has been designed for, and applies just to this case.
Notes to the Appendix:
A.2: Quantities and Units:
There are also some related efforts to characterize and
describe physical quantities in
general in this IVOA proposed recommendation:
http://www.ivoa.net/Documents/SSLDM/20101004/index.html
------------------
Examples to be included in the document if aimed at a SimDM and not
just a SimDB DM
If the DataModel is intended to provide a description of theoretical
data, i.e. a datamodel for theory
and not just a datamodel for a database of theoretical results, there should
be a simple example
of serialization of the model in a single VOTable. I just propose two
examples:
a) A code called "myCode1 v.1" that uses the
Stefan-Boltzmann Law to link stellar evolution (m, L_bol, Teff) with
atmosphere parameters (log g, log Teff)
log L_bol = cte1 + 2 log R + 4 log Teff
log g = cte2 + log m - 2 log R ---> log g = cte3 + log m - log L_bol
+ 4 log Teff
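As a rough illustration of what "myCode1" would compute, here is a minimal Python sketch. It assumes the standard definition of surface gravity, g = G m / R^2, which combined with the Stefan-Boltzmann law gives log g = cte3 + log m - log L_bol + 4 log Teff. It also assumes m and L_bol in solar units, Teff in K, and approximate solar values (log g = 4.438 in cgs, Teff = 5772 K) to fix cte3. The function name is hypothetical.

```python
import math

# Assumed solar reference values (approximate): log g in cgs, Teff in K.
LOG_G_SUN = 4.438
TEFF_SUN = 5772.0


def log_surface_gravity(log_m, log_l_bol, log_teff):
    """Return log g (cgs) from log m and log L_bol (solar units) and log Teff (K).

    Eliminating R between L_bol ~ R^2 Teff^4 and g ~ m / R^2 gives
    log g = cte3 + log m - log L_bol + 4 log Teff.
    """
    return (LOG_G_SUN + log_m - log_l_bol
            + 4.0 * (log_teff - math.log10(TEFF_SUN)))


# Solar values give back the solar reference gravity.
sun = log_surface_gravity(0.0, 0.0, math.log10(TEFF_SUN))

# A star of 1 solar mass and 100 solar luminosities at 4000 K (a giant).
giant = log_surface_gravity(0.0, 2.0, math.log10(4000.0))
```

The constant cte3 is absorbed here into the solar normalisation, so only ratios to solar values enter the calculation.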
b) And another code, "myCode2 v1", that associates the stellar magnitudes
(from an external grid "star magnitude grid v1",
parametrized by log g, log Teff) with an isochrone (from an external grid
"isochrone grid v1", parametrized as t, m(t), L_bol(t), Teff(t)) to
produce a theoretical color-magnitude diagram.
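A toy sketch of what such a code could do, in Python and for illustration only. It assumes that each isochrone point is (m, L_bol, Teff) in solar units and K at a fixed age t, that the magnitude grid maps (log g, log Teff) to a pair of magnitudes (B, V), and that a crude nearest-neighbour lookup stands in for a real interpolation. The function names, the solar constants and all grid values are invented for the example.

```python
import math

LOG_G_SUN = 4.438   # assumed solar log g (cgs)
TEFF_SUN = 5772.0   # assumed solar Teff (K)


def log_g(m, l_bol, teff):
    """log g from mass and luminosity (solar units) and Teff (K)."""
    return (LOG_G_SUN + math.log10(m) - math.log10(l_bol)
            + 4.0 * math.log10(teff / TEFF_SUN))


def nearest_magnitudes(grid, lg, lteff):
    """Return the (B, V) pair of the grid node closest to (lg, lteff)."""
    node = min(grid, key=lambda k: (k[0] - lg) ** 2 + (k[1] - lteff) ** 2)
    return grid[node]


def colour_magnitude_diagram(isochrone, grid):
    """Map each isochrone point to a (B - V, V) point."""
    cmd = []
    for m, l_bol, teff in isochrone:
        b_mag, v_mag = nearest_magnitudes(
            grid, log_g(m, l_bol, teff), math.log10(teff))
        cmd.append((b_mag - v_mag, v_mag))
    return cmd


# Invented two-node grid and two-point isochrone, just to run the sketch.
toy_grid = {(4.5, 3.76): (5.5, 4.8), (2.0, 3.60): (1.5, 0.0)}
toy_isochrone = [(1.0, 1.0, 5772.0), (1.0, 100.0, 4000.0)]
toy_cmd = colour_magnitude_diagram(toy_isochrone, toy_grid)
```

The point of the sketch is that the code has no physics of its own: its inputs are two external grids, which is the situation the VOTable serialization would need to describe.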
So, what would the corresponding VOTables look like in these cases?
NOTE: These two examples are related to the use cases defined since the
Kyoto Interop (but I think that they were proposed even earlier).
With best regards
Miguel