Simulation database data model : note to review

Gerard gerard.lemson at mpe.mpg.de
Fri Dec 3 10:19:33 PST 2010


Hi Miguel 

Thanks for your feedback, I will read it on the plane.

But I do want to give a quick first answer just before I leave.

 

The SimDB/DM is NOT meant to be a model for a database containing simulation
data.

SimDB/DM is a model for describing simulations, including how they were run,
what their goals were and some characterisation of their results. It
therefore can best be seen as a model for “metadata” describing simulations.

 

From this model we can (and do) derive different representations that can be
useful in particular circumstances.

For this proposal the XML schema and UTYPE representations are required.

So we have a schema that defines the format of XML documents containing such
descriptions.

We have examples of such documents, and these should go into the document
prepared by Franck.

They are not VOTables. The metadata is too hierarchical to easily fit in the
flat table structure.

 

We do have a relational database mapping, useful for example in a TAP
context.

I will discuss this during a DAL session. But it still is mainly about
metadata, not data.

 

I hope to have a chance to discuss your other comments in Nara.

 

Cheers

 

Gerard

 

 

 

 

From: dm-bounces at ivoa.net [mailto:dm-bounces at ivoa.net] On Behalf Of Miguel
Cerviño
Sent: Thursday, December 02, 2010 6:27 PM
To: dm at ivoa.net; theory at ivoa.net
Subject: RE: Simulation database data model : note to review

 

 





Dear all (dm & theory)







I include some comments on the SimDB/DM document.





The first point is that the description of the document quoted by Mireille,

"a Note about the data model defined for simulation data"

and the description given in the posted document itself look different to
me, and it is now not clear to me what the final goal of the document is.

 

As far as I know from the discussions over this time, and as is quite
clear in the appendix, the datamodel is a datamodel *for a database* of
theoretical models, not a datamodel for theoretical models. However, some
parts of the main document suggest something else (a datamodel of
theoretical models that can be used for a database). Although both
DataModels should share some similar classes, not all classes are relevant
(or even needed) in both cases. I think that this distinction must be made
explicit in the main document. As an example, the executive summary says
"it is a model for meta-data describing simulations" instead of "it is a
model for meta-data describing databases of simulations", which would be
more correct (please correct me if I am wrong).







If the DataModel is intended to provide a description of theoretical data,
i.e. a datamodel for theory and not just a datamodel for a database of
theoretical results, I think that some examples are needed in the
document, in particular VOTables of final theoretical products (I have put
some simple examples that could be addressed at the end of this e-mail).







However, the document clearly states that this is not the case (with its
implications and problems). I just propose to clarify this from the
beginning (maybe by focusing on the DB aspect of the model) and to state
that the SimDB/DM allows, but is not intended to define, a SimDM.







In addition, the model includes some data access fields (Sect. 4.7 and 4.8
in particular). In the case of SSAP, it is the access protocol that
includes its own data model for accessing spectra. I agree that here we
cover a different and more complex case, and if the DM is for a database
definition, I completely understand the need for such fields. I am not
sure that they really make sense in a datamodel for simulations...







 

Before giving my comments on the document, I also wonder how
consistent/compatible the different DataModels must be with each other.

 

 

[Note: I am not able to attend the Nara Interop, so I would greatly
appreciate it if the discussion were summarized somewhere on the mailing
lists, both theory and DM.

Thanks in advance]















a) Grammar/semantic issues:



 - In the document the word "simulation" is used with different meanings
 (simulation code, simulation result, simulation run, Simulation class,
 etc.). I think it is better not to fold everything into the single word
 "simulation" but to be more specific in each case, especially for
 simulation code and simulation result (and, for instance, to insist that
 the Simulation class is always written in bold and capitalized).

 I understand that this problem of specification is quite difficult to
 solve, but in such a complex document it would greatly improve
 understanding.







 - The word "protocol" is used too many times throughout the document,
 sometimes with different meanings (see also below). I think it would be
 much more useful to look for equivalent words to avoid confusion. As an
 example, in the Executive Summary, what does "SimDB protocol" mean? Does
 it refer to the SimDB architecture? Structure? Access?

 After long reflection, I think that the class "Protocol" in the datamodel
 is "dangerous". In the context of the datamodel, it refers to the design
 of the experiment. In the context of the IVOA it refers to transmission
 and data access. I think it would be much better to look for another word
 in the DataModel (for instance, "procedure"?).







 - I would avoid using the expression "web services". I think that the
 intention is to provide a reference for VO services, which fits better in
 the context of the IVOA. This implies changing the "Webservice" name in
 the data model to just "service".











-----------------



b) Document specific comments:







1.- In Sect 2: History







 I think the paragraph

 "The design and execution of these simulations has become a specialised
 field of astrophysics, and is these days often performed in large
 collaborations. And while it is still true that their results are studied
 by these groups only, more and more of these theoretical data are being
 published online (see for instance the Appendix B of [28])."

 is misleading. It may cover the case of cosmological simulations, but not
 most of the theoretical data in use (synthesis models, isochrones and
 many other products), which (a) are produced by *small* groups, (b) have
 results that are WIDELY used by the community (7000 citations just for
 the Bruzual & SB99 synthesis models, around 15000 citations for the
 isochrones of the Geneva and Padova tracks, and more than 1000 different
 groups make use of those results), and (c) have usually been published
 online and are accessible from the WWW. These groups would appreciate an
 easy-to-use DM that could be implemented in the services they already
 offer to the community.











 Since, after the Victoria 2010 interop, "simulation" in the VO refers to
 any kind of theoretical result, I think it is better to include both
 situations in the text.







 Regarding the historical background of SimDB (Cambridge workshop, etc.),
 please mention that it refers JUST to cosmological simulations, not to
 the whole world of simulations (i.e. non-cosmological ones).















2.- The analysis used to build up the SimDB/DM (Sect. 3.3) is based on
 specific questions that the DM should help to answer. The basic question
 is what scientists would want to ask in order to find interesting
 simulations.

 Following the document's suggestion to provide questions for the case of
 non-cosmological simulations, I include some questions that scientists
 (code producers and users) have asked me when I have tried to convince
 them to include their models in the VO and/or to use the VO to access
 them. In most cases the questions are more focused on the use of a
 particular set of simulations than on the search for simulations.







     - How are the results parametrized?

     - Can I access grids of models? Can I access individual results?

     - What are the input ingredients (usually, which data collections
           are used)?

     - How can I run a simulation? Can I do it on-the-fly?

     - Can I include my simulations in the VO in an easy way? What should
           I do?

     - Can I compare different simulations? Can I compare the simulation
           with my data?

     - Which simulations provide diagnostic tools (i.e.
           distance/extinction/quasi-scale-free quantities)?

     - Can I combine the results of different simulations in a single
           file adapted to my needs (e.g. my own code)?















3.- In the Domain model (Sect 3.4)







I think that the difference (and its implications) between Simulation and
Postprocessing (i.e. Simulator/PostProcessor and
Simulation/PostProcessing) must be made more explicit and clear. After
reading the example and the appendix, the situation that the SimDB DM is
intended to map looks clearer, but it is still not clear when a Procedure
(i.e. Protocol) should be defined as a Simulation or a PostProcessor.







 As an example, take the use case of synthesis codes: must they be
 considered a PostProcessor of a stellar evolution code and an atmosphere
 code/library (they just make a sum over stars)? Or take a photoionization
 code: is it, in some cases, a postprocessor of synthesis codes that adds
 the physics of nebular emission, if it uses them as input? Or is it a
 simulator in its own right? I think that the difference is explained in
 Sect. 4.3, where the special case of Simulator is explained (but there is
 no similar explanation for PostProcessor).







 Can a PostProcessor/PostProcessing be included without an associated
 Simulator/Simulation? This looks to be the case in the Procedure (i.e.
 Protocol) class, but not in the Experiment class.







4.- In Sect. 4.1. Packages







   The section describes the SimDB/DM packages, where protocol means
   procedure, doesn't it?

   I understand the need for the DAL package to be included in the
   SimDB/DM, but I am not sure that it is needed if the SimDB/DM is
   intended to be a DataModel for simulations.











5.- In the Sect 4.2. Resource







Another issue is the claimed child-dependency of Experiment and Result.
This would be the case for cosmological simulations and computational
results, but not necessarily for all theoretical results (especially if
"simulation" includes all theoretical results, as decided at the Victoria
2010 interop).











There are simulations that are just libraries of models, like empirical
atmosphere libraries (Note: although they are based on "observations",
they are theoretical results in their own right insofar as each element
in the library represents a theoretical class of object, in this case a
stellar type). In these cases there is no computational "procedure" (i.e.
protocol) to produce the library. It can be argued that there is always a
procedure to obtain such a result (like a by-eye classification of
stars), but this is a case where "Experiment" and "Procedure" have a fuzzy
meaning. Similar questions arise from "priors" like the ones used in
photo-z codes: they are theoretical abstractions, but they have no formal
Simulator/PostProcessor associated with them. Again, there is a
"procedure" to obtain them, but such a procedure has a fuzzy meaning, or
it is difficult (even for the scientist who produces these results) to
map them in a child-dependency way.



As a final example, there are also theoretical results that combine the
results of different computational codes and algorithms, like the case of
isochrones computed from the results of different codes (as a common
example: the use of one group's evolutionary tracks up to the He-flash,
the re-parametrization of other results using semi-empirical mass-loss
rates for red giants, the use of another group's tracks for the
He-burning phase and thermal pulses, and finally the use of completely
different tracks for WD evolution). In this case there is no single code
that computes the overall stellar evolution, so real scientists mix
results from different codes (sometimes incompatible with each other) and
produce their own theoretical result. Maybe this is not a good way to do
science, but it is a common one and sometimes the only possible one.







 Although this is a minor point in this section, it has some implications
 for the following sections (see below).







At the end of the section it is made explicit that the SimDB/DM does not
provide a real SimDM. I think that this should be stated more explicitly
at the beginning of the document.











6.- In Sect. 4.3 







 As explained in the previous comment, not all theoretical results are the
 result of a single computer program, nor can they all be represented by a
 single one. If the text were presented in terms of "Simulators" that may
 *or may not* be associated with a computer program, the document would
 gain in clarity. Of course the case of a single program is a good
 example, but the model must be a bit more generic and not just map this
 quite particular case.







 My suggestion is to relax the comments about Algorithm, Physics, etc.















 I also find it a bit confusing that a simulator is defined just by the
 inclusion of Physics and algorithms (even more so if this aims to
 separate the Simulator and PostProcessor classes).











 As an example, there is no physics in isochrone-based synthesis codes,
 which just provide a weighted sum of stars along an isochrone, but other
 synthesis codes (those based on the Fuel Consumption Theorem) actually
 include physics and algorithms in the computation. However, only
 developers of synthesis codes are aware of this distinction. So,
 depending on who includes the code in the SimDB, it can be registered as
 a Simulator or as a PostProcessor, and depending on who uses the SimDB
 database, it will be looked for in a different class...







 Again, I find that the problem lies in the PostProcessor/Simulator
 characterization.











7.- Sect 4.4







 Most of my previous comments are closely related to the "InputParameter"
 class. It is defined on the basis of a single software code, without
 covering the very common cases of mixing code results or of "non-software
 based" theoretical results (I refer again to the scientific literature to
 get a simple idea of the most common case I describe).







 I think that some of my comments may be solved by changing
 "InputParameters" to "ProcedureParameters" or just "Parameters", and by
 relaxing the definition of "Procedure" (i.e. Protocol in the document).











 In this case it is possible to directly create a Procedure that just
 provides access to results in a user-oriented way (formally it can be a
 set of programs, with neither physics nor algorithms, that just provide
 access to theoretical results; this can be considered a "procedure" in
 its own right, just as a semi-empirical stellar library for population
 synthesis is a "procedure" by itself).

 Another example that does not fit in the current model, but may fit in
 the proposed extension, is a code that provides synthetic photometry from
 spectra: no ObjectType can be defined, nor a Target, nor input
 "parameters", in a way consistent with the current SimDB/DM model.







 However, in this case the Field class (currently under the ObjectType
 class) must be studied: this type of Procedure (i.e. Protocol in the
 document) may have no associated object class, nor an experiment goal (it
 just provides access to data). The current model (in Sect. 4.6) assumes
 that a Property class (where the Field class appears) is defined under
 the ObjectType class. In the situation mentioned above, the Property
 class is a subclass of no ObjectType.







 One possible solution is to have a generic Field class that can be used
 directly in the "ProcedureParameter" class without the need to obtain it
 from the ObjectType class.
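
 A minimal sketch of this suggestion in Python, only to make the idea
 concrete. The attribute names (name, datatype, unit, described_by) are
 hypothetical and are not taken from the SimDB/DM document; only the
 class names Field, ObjectType and ProcedureParameter come from the
 discussion above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Generic description of a quantity, not owned by any ObjectType."""
    name: str
    datatype: str               # hypothetical attribute
    unit: Optional[str] = None  # hypothetical attribute


@dataclass
class ObjectType:
    """An object type may still list Fields as its properties."""
    name: str
    properties: List[Field] = field(default_factory=list)


@dataclass
class ProcedureParameter:
    """Parameter of a Procedure that points to a Field directly."""
    described_by: Field
    description: str = ""


@dataclass
class Procedure:
    name: str
    parameters: List[ProcedureParameter] = field(default_factory=list)


# Synthetic photometry: a parameter exists, but no ObjectType is involved.
passband = Field(name="filter", datatype="string")
synphot = Procedure(
    name="synthetic photometry",
    parameters=[ProcedureParameter(passband, "passband to integrate over")],
)

# The same Field class can still be used under an ObjectType.
star = ObjectType(name="star", properties=[Field("mass", "real", "Msun")])
```

 In this sketch the synthetic photometry case is described without
 inventing an ObjectType, while object-based simulations keep using
 Fields as properties.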











 Although this suggestion would solve the previous case, it is still not
 clear how to describe a code that uses a file or a collection of files as
 input. This is a very common use case in theoretical astronomy research,
 where the results of one field (like stellar evolution) are used in other
 fields (like stellar clusters, galaxy evolution or cosmology).











8.- Sect 4.5







  In the proposed generalization, TargetObject and Process, as well as
  ObjectType, have a more general and fuzzy meaning than the one described
  in the text, which would be useful just as an example.



















10.- 4.8 Data access service







 I am not sure that the data access service should be included in the
 datamodel, since it is an Access Protocol issue. I just cite the SSAP
 example, which contains its own DataModel for accessing spectra.







 I understand that, in a formal model *for a DataBase*, it looks natural.
 However, it depends on the purpose the DataBase has been designed for,
 and applies just to that case.











Notes to the Appendix:











A.2: Quantities and Units:



There are also some efforts related to the characterization and
description of physical quantities in general in the IVOA proposed
recommendation:

http://www.ivoa.net/Documents/SSLDM/20101004/index.html























------------------







Examples to be included in the document if it is aimed at a SimDM and not
just a SimDB DM











If the DataModel is intended to provide a description of theoretical
data, i.e. a datamodel for theory and not just a datamodel for a database
of theoretical results, there should be a simple example of serialization
of the model in a single VOTable. I propose two examples:







a) A code called "myCode1 v.1" that uses the Stefan-Boltzmann law to link
stellar evolution (m, L_bol, Teff) with atmosphere parameters
(log g, log Teff):

log L_bol = cte1 + 2 log R + 4 log Teff

log g = cte2 + log m - 2 log R
  --->  log g = cte3 + log m - log L_bol + 4 log Teff
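
As a rough sketch of what such a code would compute, assuming the
standard definitions L_bol = 4 pi R^2 sigma Teff^4 and g = G m / R^2,
eliminating R gives log g = const + log m - log L_bol + 4 log Teff. The
function name and the choice of solar units below are assumptions made
for illustration, with the constant fixed by nominal solar values
(log g = 4.438 in cgs, Teff = 5772 K).

```python
import math

# Nominal solar reference values (assumed here; they fix the constant).
LOG_G_SUN = 4.438   # log10 of the solar surface gravity in cm s^-2
TEFF_SUN = 5772.0   # solar effective temperature in K


def log_g(m, l_bol, teff):
    """Return log10 surface gravity (cgs).

    m and l_bol are in solar units, teff is in K. Uses
    log g = log g_sun + log m - log L_bol + 4 log(Teff / Teff_sun).
    """
    return (LOG_G_SUN
            + math.log10(m)
            - math.log10(l_bol)
            + 4.0 * math.log10(teff / TEFF_SUN))


# A star with solar mass, luminosity and temperature recovers the
# solar surface gravity by construction.
solar = log_g(1.0, 1.0, 5772.0)   # 4.438
```

At fixed mass and temperature, a star ten times more luminous comes out
with a log g lower by one dex, as expected for a larger radius.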











b) And another code, "myCode2 v1", that associates stellar magnitudes
(from an external grid "star magnitude grid v1", parametrized by log g,
log Teff) with an isochrone (from an external grid "isocrone grid v1",
parametrized as t, m(t), L_bol(t), Teff(t)) to produce a theoretical
color-magnitude diagram.
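
A toy sketch of example b), under stated assumptions: the magnitude grid
is taken to be a dictionary keyed by (log g, log Teff) holding (B, V)
magnitudes, the lookup is a crude nearest-neighbour match, and all
numbers are made up. None of the names or values come from an actual
grid or code.

```python
import math

LOG_G_SUN, TEFF_SUN = 4.438, 5772.0  # assumed nominal solar values


def log_g(m, l_bol, teff):
    """log10 surface gravity (cgs); m, l_bol in solar units, teff in K."""
    return (LOG_G_SUN + math.log10(m) - math.log10(l_bol)
            + 4.0 * math.log10(teff / TEFF_SUN))


def nearest_magnitudes(grid, logg, logteff):
    """Return the (B, V) pair of the grid point closest in (log g, log Teff)."""
    key = min(grid, key=lambda p: (p[0] - logg) ** 2 + (p[1] - logteff) ** 2)
    return grid[key]


def colour_magnitude_diagram(isochrone, grid):
    """Map isochrone points (m, l_bol, teff) to (B - V, V) points."""
    cmd = []
    for m, l_bol, teff in isochrone:
        b, v = nearest_magnitudes(grid, log_g(m, l_bol, teff), math.log10(teff))
        cmd.append((b - v, v))
    return cmd


# Made-up grid and isochrone, purely illustrative.
toy_grid = {(4.5, 3.75): (5.5, 4.8), (2.5, 3.65): (1.9, 0.7)}
toy_isochrone = [(1.0, 1.0, 5772.0), (1.1, 60.0, 4500.0)]
toy_cmd = colour_magnitude_diagram(toy_isochrone, toy_grid)
```

The point of the sketch is that the output depends on two external
inputs (the isochrone grid and the magnitude grid) plus a trivial
procedure, which is exactly the situation whose description in the
model is being asked about.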











So, what would the corresponding VOTables look like in these cases?











NOTE: These two examples are related to the use cases defined since the
Kyoto Interop (but I think that they were proposed even earlier)







 

 

With best regards

 

Miguel
