VOTable for simulations

Thu Aug 31 08:19:25 PDT 2006

Dear Claudio
From the VOTable spec, in particular section 2.2, I gather that they have
indeed already included support for multi-dimensional arrays. This then
seems the natural way to support at least uniform grids coming from
simulations as well. Some comments:

From their example I gather that arraysize="41x41x41x3" means "three data
cubes of dimensions 41x41x41", not "one 3D-vector valued datacube of
dimensions 41x41x41". Likewise "41x41x41" would mean "41 2D datafields of
dimension 41x41". I therefore think that a 3D vector field could, or has
to, be encoded as (for example)

<?xml version="1.0"?>
<VOTABLE xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xmlns="http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd">
 <RESOURCE name="myVectorField">
   <TABLE name="VelocityField" ID="Vel">
      <FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x"
             datatype="float" arraysize="41x41x41x1" unit="km/s" />
      <FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y"
             datatype="float" arraysize="41x41x41x1" unit="km/s" />
      <FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z"
             datatype="float" arraysize="41x41x41x1" unit="km/s" />
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/test.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>  

This makes the content of the individual field components more explicit:
each gets its own UCD, for example. I have removed the rank attribute for
the moment.
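
As a rough illustration of how a reader might consume such a file, here is
a minimal Python sketch. It assumes fixed-size arrays, big-endian 4-byte
floats (which is what I understand the VOTable BINARY serialisation to
use), and that the three components are stored back to back in FIELD
order. The function names are hypothetical, and a tiny in-memory stream
stands in for test.bin.

```python
import io
import struct

def parse_arraysize(arraysize):
    """Split a fixed-size arraysize string such as "41x41x41x1" into ints."""
    return [int(n) for n in arraysize.split("x")]

def read_components(stream, names, arraysize, fmt=">f"):
    """Read one flat array per name, stored back to back in the stream.

    Assumption: every component has the same arraysize and element format.
    """
    count = 1
    for n in parse_arraysize(arraysize):
        count *= n
    item = struct.Struct(fmt)
    components = {}
    for name in names:
        raw = stream.read(count * item.size)
        if len(raw) != count * item.size:
            raise ValueError("stream too short for field " + name)
        components[name] = [value for (value,) in item.iter_unpack(raw)]
    return components

# Tiny stand-in for test.bin: three 2x2x2 cubes holding the values 0..23.
demo = io.BytesIO(struct.pack(">24f", *range(24)))
cubes = read_components(demo, ["vx", "vy", "vz"], "2x2x2x1")
```

The arrays come back flat; mapping a flat index to (i, j, k) depends on the
index ordering, which the sketch deliberately leaves open.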
There is no way yet to specify the spatial coordinates of the grid cells.
For a grid one can in general specify the spatial coordinates in a
shorthand way, for example using a set of standard parameters such as the
FITS array keywords CRPIXn, CDELTn etc. (see
http://fits.gsfc.nasa.gov/standard21b/fits_standard.pdf 5.4.2.5). I think
we need to specify something like that here as well; it is definitely more
efficient than having separate cubes with the coordinates. Luckily our
coordinate systems will generally not require the full WCS-like formalism.
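
To make the shorthand concrete, here is a small Python sketch of the linear
FITS-style convention, in which the coordinate of a (1-based) pixel along
one axis is CRVAL + CDELT * (pixel - CRPIX). CRVALn is the FITS reference
value keyword that accompanies CRPIXn and CDELTn. The function names and
the numbers (a 41-cell axis with 2.5 Mpc spacing) are purely illustrative
assumptions.

```python
def cell_coordinate(pixel, crpix, crval, cdelt):
    """Coordinate of a 1-based pixel index along one linear axis."""
    return crval + cdelt * (pixel - crpix)

def grid_axis(naxis, crpix, crval, cdelt):
    """Coordinates of all naxis cells along one axis."""
    return [cell_coordinate(p, crpix, crval, cdelt) for p in range(1, naxis + 1)]

# Hypothetical axis: 41 cells, first cell centre at 1.25 Mpc, spacing 2.5 Mpc.
axis = grid_axis(41, crpix=1.0, crval=1.25, cdelt=2.5)
```

Three numbers per axis thus replace a full coordinate cube per axis, which
is where the efficiency comes from.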

Still, I also liked your original approach, which, as I commented in my
earlier reply, seemed to lead to a kind of XML equivalent of the FITS
image specification. I wonder whether the VOTable group has considered
putting image data in an XML form of FITS, just as they did for the FITS
binary table. I'll pose the question on their mailing list. Though we can
use the multi-dimensional array, it does not seem as natural.

Then, though it is possible to use this same formalism for particle data as
well, I think the tabular approach is more natural there in many
circumstances. In particular, in the work that I have been doing with
databases, the natural representation of more complex individual objects is
as a table, with all the properties, now including the positions, in a row.
The way to store such tabular datasets in binary form is specified exactly
in the existing VOTable spec, in section 5.3. An equivalent C-struct
oriented format in binary files is what I have encountered consistently for
more complex objects coming, for example, from the postprocessing of
cosmological simulations at the MPA in Garching.
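
A minimal Python sketch of such a row-oriented (C-struct-like) layout,
assuming a hypothetical record of six big-endian 4-byte floats (x, y, z,
vx, vy, vz) per particle; the record layout and helper names are
assumptions for illustration only.

```python
import struct

# Hypothetical row layout: x, y, z, vx, vy, vz as big-endian 4-byte floats.
ROW = struct.Struct(">6f")

def pack_rows(rows):
    """Serialise rows one after the other, one fixed-size record per particle."""
    return b"".join(ROW.pack(*row) for row in rows)

def unpack_rows(raw):
    """Decode a byte string made of consecutive fixed-size records."""
    return list(ROW.iter_unpack(raw))

rows = [(0.0, 1.0, 2.0, 10.0, 20.0, 30.0), (3.0, 4.0, 5.0, 40.0, 50.0, 60.0)]
raw = pack_rows(rows)
decoded = unpack_rows(raw)
```

Each particle occupies ROW.size bytes, so the n-th particle can be located
directly at offset n * ROW.size.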

But you're right that many people also store particle data in individual
arrays for each particle property. That maps more naturally onto your rank
1/2 examples. Making the same adjustment as above for the datacubes, I
would propose to also allow something like the following example:

<?xml version="1.0"?>
<VOTABLE xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xmlns="http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd">
 <RESOURCE name="myParticles">
   <TABLE name="Particles" ID="NBody">
      <FIELD name="x" ID="x1" ucd="pos.cartesian;pos.cartesian.x"
             datatype="float" arraysize="100000x1" unit="Mpc" />
      <FIELD name="y" ID="y1" ucd="pos.cartesian;pos.cartesian.y"
             datatype="float" arraysize="100000x1" unit="Mpc" />
      <FIELD name="z" ID="z1" ucd="pos.cartesian;pos.cartesian.z"
             datatype="float" arraysize="100000x1" unit="Mpc" />
      <FIELD name="vx" ID="vx1" ucd="phys.veloc;pos.cartesian.x"
             datatype="float" arraysize="100000x1" unit="km/s" />
      <FIELD name="vy" ID="vy1" ucd="phys.veloc;pos.cartesian.y"
             datatype="float" arraysize="100000x1" unit="km/s" />
      <FIELD name="vz" ID="vz1" ucd="phys.veloc;pos.cartesian.z"
             datatype="float" arraysize="100000x1" unit="km/s" />
      <DATA>
        <BINARY>
          <STREAM href="file:///scratch/myhome/test.bin"/>
        </BINARY>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>  

I would advocate supporting both representations for particle data, tabular
and (1D) array. In the latter case we still need something to distinguish
between particle data and image data. Your rank basically does that; only
the name might be unfortunate. We might want to be more explicit about the
kind of data that is stored: an attribute with values MESH, N_BODY, maybe?
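
Supporting both particle representations is cheap on the reading side,
since going from one to the other is essentially a transpose. A small
Python sketch, with hypothetical helper names and made-up values:

```python
def rows_to_columns(rows, names):
    """Turn per-particle rows into one array (list) per property."""
    return {name: list(column) for name, column in zip(names, zip(*rows))}

def columns_to_rows(columns, names):
    """Turn per-property arrays back into per-particle rows."""
    return list(zip(*(columns[name] for name in names)))

names = ["x", "y", "z", "vx", "vy", "vz"]
rows = [(0.0, 1.0, 2.0, 10.0, 20.0, 30.0), (3.0, 4.0, 5.0, 40.0, 50.0, 60.0)]
columns = rows_to_columns(rows, names)
```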

In your example you use an HDF5 binary file. VOTable does not support that,
though it does support FITS, I suppose as a BINARY table (see VOTable spec
section 5.2). Is there a natural mapping from VOTable keywords to HDF
metadata structures? Or shall we first concentrate on the binary
serialisations specified in VOTable?

Cheers

Gerard

________________________________________
From: Claudio Gheller [mailto:c.gheller at cineca.it] 
Sent: Thursday, August 31, 2006 2:55 PM
To: Gerard
Cc: theory at ivoa.net; Ugo Becciani; R. Smareglia
Subject: Re: VOTable for simulations

Ciao Gerard,
in the meantime I have thought a little about possible formats for the
VOTable. In fact I have come to the conclusion that there is little new to
add to the already existing VOTable specification, both for grids and for
particles.
The only parameter that I think we have to add is the "rank" parameter (it
may already exist, but I could have missed it).
Rank is the only parameter that makes grids different from particles, and
scalars from vectors. For the rest, particles are completely the same as
grids. NO different approaches are needed.

In practice:
Rank = 1 --> scalar on particles (a sequence of scalar values associated
             with the N particles, one value per particle, N values)
Rank = 2 --> vector on particles (a set of three values per particle, Nx3)
Rank = 3 --> scalar on grids (one value per grid point, NxNxN, assuming a
             cubic grid for simplicity)
Rank = 4 --> vector on grids (a set of three values per grid point, NxNxNx3)
At the moment let's consider only 3D simulations.
From the example below you can see that "rank" is redundant information
that can also be obtained directly from the "arraysize" parameter. But that
requires parsing, so it could be useful to keep it highlighted in a
specific parameter.
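
For reference, the parsing in question is short. A Python sketch, assuming
fixed-size arraysize strings only (no variable-length "*" dimension); the
function name is hypothetical:

```python
def rank_from_arraysize(arraysize):
    """Return (rank, dims) for a fixed-size arraysize string like "41x41x41x3"."""
    dims = [int(n) for n in arraysize.split("x")]
    return len(dims), dims

ranks = {size: rank_from_arraysize(size)[0]
         for size in ["10000", "10000x3", "41x41x41", "41x41x41x3"]}
```

With the four cases listed above this gives ranks 1, 2, 3 and 4
respectively.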

In this version, several variables of different sizes can be stored in the
SAME file. The file could have different formats (FITS, HDF... which must
be specified properly). I assume, for the moment, a raw binary file, where
variables are written one after the other (the standard table structure in
rows and columns is not efficient or even possible). The entry point for
each variable can easily be calculated using the "arraysize" and "datatype"
parameters. Furthermore, the order in which they are stored must be
specified. This could be the order in which the FIELDs are listed in the
VOTable.

Example: our data file contains a scalar field on a mesh, a vector field on
a mesh, and a scalar field and a vector field on particles:

<?xml version="1.0"?>
<VOTABLE xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd">
       <RESOURCE name="myTestResource">
               <TABLE name="BmTemperature" ID="MyTestTable" >
                       <FIELD name="BmTemperature" ID="myTestObject1" ucd=""
                              datatype="float" rank="3" arraysize="41x41x41"   unit="K" />
                       <FIELD name="BmVelocity"    ID="myTestObject2" ucd=""
                              datatype="float" rank="4" arraysize="41x41x41x3" unit="km/sec" />
                       <FIELD name="ParticlPos"    ID="myTestObject3" ucd=""
                              datatype="float" rank="2" arraysize="10000x3"    unit="Mpc" />
                       <FIELD name="PartDens"      ID="myTestObject4" ucd=""
                              datatype="float" rank="1" arraysize="10000"      unit="g/cm3" />
                       <DATA><BINARY>
                       <STREAM href="file:///scratch/myhome/test.h5"/>
                       </BINARY></DATA>
               </TABLE>
       </RESOURCE>
</VOTABLE>  
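
The entry points for the four FIELDs of this example can be computed as in
the following Python sketch. It assumes the variables are stored in FIELD
order with no padding or headers in between, and uses the standard VOTable
byte widths for a few datatypes; the function name is hypothetical.

```python
from math import prod  # available from Python 3.8

# Byte widths of a few VOTable datatypes.
DATATYPE_BYTES = {"short": 2, "int": 4, "long": 8, "float": 4, "double": 8}

def field_offsets(fields):
    """Map each field name to (byte offset, byte length) in a raw stream.

    Assumption: fields are written one after the other, in list order.
    """
    offsets = {}
    position = 0
    for name, datatype, arraysize in fields:
        count = prod(int(n) for n in arraysize.split("x"))
        nbytes = DATATYPE_BYTES[datatype] * count
        offsets[name] = (position, nbytes)
        position += nbytes
    return offsets

fields = [
    ("BmTemperature", "float", "41x41x41"),
    ("BmVelocity", "float", "41x41x41x3"),
    ("ParticlPos", "float", "10000x3"),
    ("PartDens", "float", "10000"),
]
offsets = field_offsets(fields)
```

A reader can then seek straight to a variable with, for instance,
stream.seek(offsets["ParticlPos"][0]).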

Let me know your opinion.
Claudio

Gerard wrote: 
Hi Claudio
Sorry for the late reply to this email. I'm Cc-ing the theory group as well.

I gather you are thinking of grid simulation data here, so this mail does
not apply to N-body. Anyway, for that I think we can use the VOTable spec as
it stands, in particular section 5.3 dealing with binary serialisation (see
http://www.ivoa.net/Documents/REC/VOTable/VOTable-20040811.pdf ).

In the case you address, would it make sense to try to mimic FITS in the
naming of keywords, so use NAXIS for rank, and NAXIS1 for size0, NAXIS2 for
size1 etc. for the dimensions? If I am not mistaken VOTable itself is based
on the FITS binary table spec, so your proposal might be seen as a
translation of a FITS datacube (IMAGE). Did we actually not think about
using FITS as is for (uniform) grid simulations? In that case your proposal
could also be used I guess, where instead of STREAM we'd have FITS as in
standard VOTable usage (though I don't know whether VOTable presumes that
the FITS file contains a table).

I am not sure whether FITS images/datacubes allow multiple values per cell
(i.e. have an array size), but I don't think so. Otherwise we could probably
generalise in that direction.
Do you propose to follow the VOTable/FITS directions on little- vs
big-endian?

Cheers

Gerard

-----Original Message-----
From: Claudio Gheller [mailto:c.gheller at cineca.it]
Sent: Thursday, July 20, 2006 12:37 PM
To: Gerard Lemson; Ugo Becciani; Alessandro Costa; Marco Comparato; R.
Smareglia
Subject: VOTable for simulations

Dear friends,

I have tried to figure out the structure of a VOTable for simulated
data. The result follows.
I made the following assumptions:

1. data are binary
2. the binary file is a raw stream of bytes, with no structure (no FITS,
no HDF...). It is external to the VOTable (at the moment I have not
considered base64 conversion, for performance reasons)
3. Each file has an associated XML descriptor. At present the descriptor
gives only the information necessary to deal with the file.
4. Each file contains ONE variable. This is suggested for the following
reasons
- data rank and size can change from variable to variable.
- complex description
- The direct association XML header file - bin file - variable is
easier to handle.
- smaller files
- files easier to handle by external applications (also not VO-compliant)
- drawback: proliferation in the number of files
However, we can consider support for more complex files or even
formats, like FITS or HDF5. But let's start with something simple.

At this point I made the Snap program create binary files (at present
still HDF5, but just for backward compatibility) and associated XMLs.
For example:
test.h5 ----> snapped data
test.h5.xml ----> associated VOTable:

<?xml version="1.0"?>
<VOTABLE xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd">
        <RESOURCE name="myTestResource">
                <TABLE name="BmTemperature" ID="MyTestTable" >
                        <FIELD name="BmTemperature" ID="myTestObject"
ucd="" datatype="float" arraysize="41x41x41" unit="Kelvin" />
                        <PARAM name="rank" datatype="int" value="3"/>
                        <PARAM name="size0" datatype="long" value="41"/>
                        <PARAM name="size1" datatype="long" value="41"/>
                        <PARAM name="size2" datatype="long" value="41"/>
                        <DATA><BINARY>
                        <STREAM href="file:///scratch/myhome/test.h5"/>
                        </BINARY></DATA>
                </TABLE>
        </RESOURCE>
</VOTABLE>

Notice that the rank and size of the dataset are expressed in the
arraysize keyword of FIELD. They are also written in the 4 PARAM fields.
This is just to avoid parsing the string to get the basic info on rank
and size, and to have them directly as numbers (with their precise
type). At present there are no UCDs and no reference to the SNAP
protocol, since neither is defined yet. I'm working on the latter...
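
Since the same information then lives in two places, a reader may want to
check that they agree. A minimal Python sketch using the standard library
XML parser; the embedded descriptor is a trimmed copy of the example above
(DATA element omitted), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"v": "http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd"}

DESCRIPTOR = """<VOTABLE xmlns="http://vizier.u-strasbg.fr/xml/VOTable-1.1.xsd">
  <RESOURCE name="myTestResource">
    <TABLE name="BmTemperature" ID="MyTestTable">
      <FIELD name="BmTemperature" ID="myTestObject" ucd="" datatype="float"
             arraysize="41x41x41" unit="Kelvin"/>
      <PARAM name="rank" datatype="int" value="3"/>
      <PARAM name="size0" datatype="long" value="41"/>
      <PARAM name="size1" datatype="long" value="41"/>
      <PARAM name="size2" datatype="long" value="41"/>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""

def params_match_arraysize(xml_text):
    """True if the rank/sizeN PARAMs agree with the FIELD's arraysize."""
    table = ET.fromstring(xml_text).find("v:RESOURCE/v:TABLE", NS)
    field = table.find("v:FIELD", NS)
    dims = [int(n) for n in field.get("arraysize").split("x")]
    params = {p.get("name"): int(p.get("value"))
              for p in table.findall("v:PARAM", NS)}
    sizes = [params["size%d" % i] for i in range(params["rank"])]
    return params["rank"] == len(dims) and sizes == dims

consistent = params_match_arraysize(DESCRIPTOR)
```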

This is the very first attempt!!! Let me know all your comments.
Claudio

--
------------------------------------
Dr. Claudio Gheller, Ph.D.
High Performance System Division
CINECA - Bologna - Italy
Tel. +39-051-6171560
Fax. +39-051-6137273
------------------------------------
