II) Formats and vocabularies
bonnarel at alinda.u-strasbg.fr
Sun Oct 26 19:48:02 PDT 2008
Part A on UCDs and utypes
--------------------------------------------------------------------------------
*Step 1 FB answers to Fabien
---------------------------------------------------------------------------------
> 1.2 Stop using UCDs and utypes. What is needed is a unique,
> straightforward identifier. UCDs and utypes are derived from complex
> theoretical considerations making them difficult to parse and
> understand. In practice, experience shows that software developers
> use both of them only as a static string identifier to identify a
> field (i.e. software doesn't try to make use of the hierarchy of
> classes). For this purpose, the JSON variable name is just enough,
> and all it needs is to be clear and self-explanatory, e.g.
> 'instrument' instead of 'meta.id;instr' or 'ssa:DataID.Instrument',
> 'centralPosition' instead of 'pos.eq.ra;meta.main' and
> 'pos.eq.dec;meta.main', etc.
----> Oh!!! I strongly disagree, as you can imagine.
1) UCDs and utypes are not at all the same thing.
A UCD is a standard denomination for a physical quantity.
If you want to see it in utype terms, it is only the utype of the general
model of physical quantities. But many pieces of data/metadata structure
may have the same UCD. It is there mainly to allow comparison of values
in different fields/variables when they give the same physical
quantity. You cannot avoid them totally.
Utypes are totally different. They are there to point to
attributes/classes of a given data model in a non-hierarchical context:
ASCII, VOTable, relational database or FITS.
Just to take the latter, a utype in the FITS context is an
IVOA-standardized FITS keyword. See the FITS serialisation of Spectrum
for details.
The fact that there is no general standard for the utype
syntax, or that there are short versions (FITS keywords) and long
versions of utypes, is another issue, and MUST NOT be confused with
the usefulness of the utype concept. A discussion on syntax will take
place in Baltimore (Mireille?).
In a hierarchical context now (XML or JSON) you definitely
don't need utypes, and the XML indeed doesn't have them...
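[Editor's illustration] To make the distinction concrete, here is a minimal
Python sketch. The identifier strings are exactly those quoted in the message
above; the dictionary layout and field names are hypothetical, not part of any
standard:

# Illustrative only: the dict layout is hypothetical, but the identifier
# strings are the ones quoted in the discussion above.

# Two services may expose the instrument name under different variable names...
field_a = {"name": "instrument",
           "ucd": "meta.id;instr",             # UCD: which (meta)quantity this is
           "utype": "ssa:DataID.Instrument"}   # utype: where it sits in the SSA model
field_b = {"name": "instr_id",
           "ucd": "meta.id;instr",
           "utype": "ssa:DataID.Instrument"}

# ...yet a client can still recognise that both carry the same quantity:
# this comparison role is exactly what the UCD is for.
assert field_a["ucd"] == field_b["ucd"]

# The utype, by contrast, points at one attribute of one particular data
# model (here SSA's DataID.Instrument), whatever the serialisation is.
print(field_a["utype"])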
--------------------------------------------------------------------------------
*Step 2 Fabien answers to FB
-------------------------------------------------------------------------------
OK, I meant in a hierarchical context only indeed.
---------------------------------------------------------------------------------
*Step 3 Juan de Dios advice
--------------------------------------------------------------------------------
Let's get into the semantics: utypes are only needed if you are
serialising into a VOTable... and in any case you would need something
similar to utypes in that case. Of course, VOTable is not the best XML
you can do with Characterisation, which I believe must be provided,
unless otherwise required, in its own XML flavour. In that case, there
is nothing to be gained by the JSON representation versus the XML one.
What we _do need_ to do is promote Char with things which really answer
user needs.
And for leaving out UCDs... I think UCDs must evolve, but I do think
there is a lot of value in them, especially from the new suggestions
that arise, and from an intersection with the emerging IVOA
vocabularies.
----------------------------------------------------------------------------------
Part B Formats: XML, VOTable and JSON
-----------------------------------------------------------------------------------
*Step 1 FB answers to Fabien
-----------------------------------------------------------------------------------
> 1.1 For increasing readability allow a serialization in the JSON
> format. JSON is very easy to parse, and also to be read by humans.
> Note that it is possible to convert JSON to xml if needed. See
> json.org for more info.
----> Very good point. After the discussions we had in
Cambridge, Thomas Boch made an attempt to take a char example in XML and
convert it to JSON. The result is much less verbose (no closing tags, for
example) and more human-readable. Of course it is just an attempt, not a
standard yet. I attach Thomas' work here....
In my opinion JSON is JAS (Just Another Serialisation), not THES (THE
Serialisation!). If people have XML libraries and are ready to read the
XML, or are happy with VOTable, let them keep their preferred format.

{
    "characterisation": {
        "characterisationAxis": [
            {
                "accuracy": {
                    "statError": {
                        "ErrorRefVal": {
                            "stc:Error2": {
                                "stc:C1": ".000055",
                                "stc:C2": ".000055"
                            }
                        },
                        "flavor": "statistical"
                    }
                },
                "axisName": "spatial",
                "calibrationStatus": "CALIBRATED",
                "coordsystem": {
                    "id": "TT-ICRS-WAVELENGTH-TOPO",
                    "xlink:href": "ivo:\/\/STClib\/CoordSys#TT-ICRS-TOPO",
                    "xlink:type": "simple"
                },
                "coverage": {
                    "bounds": {
                        "limits": {
                            "Coord2VecInterval": {
                                "stc:HiLimit2Vec": {
                                    "stc:C1": "190.37601",
                                    "stc:C2": "11.369167"
                                },
                                "stc:LoLimit2Vec": {
                                    "stc:C1": "190.37157",
                                    "stc:C2": "11.364722"
                                }
                            },
                            "coord_system_id": "TT-ICRS-WAVELENGTH-TOPO"
                        }
                    },
                    "location": {
                        "coord": {
                            "coord_system_id": "TT-ICRS-WAVELENGTH-TOPO",
                            "stc:Position2D": {
                                "stc:Name1": "RA",
                                "stc:Name2": "Dec",
                                "stc:Value2": {
                                    "stc:C1": "190.37379",
                                    "stc:C2": "11.366944"
                                }
                            }
                        }
                    }
                },
                "independentAxis": "true",
                "numBins2": {
                    "I1": "16",
                    "I2": "16"
                },
                "regularsamplingStatus": "true",
                "resolution": {
                    "resolutionRefVal": {
                        "stc:Resolution2": {
                            "stc:C1": "1.4",
                            "stc:C2": "1.4"
                        }
                    },
                    "unit": "arcsec"
                },
                "samplingPrecision": {
                    "samplingPrecisionRefVal": {
                        "samplingPeriod": {
                            "stc:C1": "1.0",
                            "stc:C2": "1.0"
                        }
                    },
                    "unit": "arcsec"
                },
                "ucd": "pos",
                "undersamplingStatus": "false",
                "unit": "deg"
            },
            {
                "axisName": "time",
                "calibrationStatus": "UNCALIBRATED",
                "coordsystem": {
                    "idref": "TT-ICRS-WAVELENGTH-TOPO"
                },
                "coverage": {
                    "location": {
                        "coord": {
                            "coord_system_id": "TT-ICRS-WAVELENGTH-TOPO",
                            "stc:Time": {
                                "stc:TimeInstant": {
                                    "stc:ISOTime": "2004-05-24T22:23:58"
                                }
                            }
                        }
                    }
                },
                "independentAxis": "true",
                "numBins1": "1",
                "ucd": "time",
                "unit": "none"
            },
            {
                "accuracy": {
                    "statError": {
                        "ErrorRefVal": {
                            "stc:Error": "0.0001"
                        },
                        "flavor": "statistical"
                    }
                },
                "axisName": "spectral",
                "calibrationStatus": "CALIBRATED",
                "coordsystem": {
                    "idref": "TT-ICRS-WAVELENGTH-TOPO"
                },
                "coverage": {
                    "bounds": {
                        "limits": {
                            "CoordScalarInterval": {
                                "stc:HiLimit": "0.56548382",
                                "stc:LoLimit": "0.4140"
                            },
                            "coord_system_id": "TT-ICRS-WAVELENGTH-TOPO"
                        }
                    },
                    "location": {
                        "coord": {
                            "coord_system_id": "TT-ICRS-WAVELENGTH-TOPO",
                            "stc:Spectral": {
                                "stc:Value": "0.4858137"
                            }
                        }
                    }
                },
                "independentAxis": "true",
                "numBins1": "2048",
                "regularsamplingStatus": "false",
                "resolution": {
                    "resolutionBounds": {
                        "resolutionLimits1": {
                            "stc:HiLimit": "101.142",
                            "stc:LoLimit": "48.3233"
                        }
                    },
                    "resolutionRefVal": {
                        "stc:Resolution": "78.6162"
                    },
                    "unit": "km\/s"
                },
                "samplingPrecision": {
                    "samplingPrecisionRefVal": {
                        "samplingPeriod": "40.0000"
                    },
                    "unit": "km\/s"
                },
                "ucd": "em",
                "undersamplingStatus": "false",
                "unit": "um"
            },
            {
                "accuracy": {
                    "statError": {
                        "ErrorBounds": {
                            "ErrorLimits1": {
                                "stc:HiLimit": "1.12e-16",
                                "stc:LoLimit": "5.80e-19"
                            }
                        },
                        "ErrorRefVal": {
                            "stc:Error": "5.63e-17"
                        },
                        "flavor": "statistical"
                    }
                },
                "axisName": "flux",
                "calibrationStatus": "UNCALIBRATED",
                "coordsystem": {
                    "id": "UNKNOWN"
                },
                "coverage": {
                    "bounds": {
                        "limits": {
                            "CoordScalarInterval": {
                                "stc:HiLimit": "1.1838107e-14",
                                "stc:LoLimit": "-2.8933970e-15"
                            },
                            "coord_system_id": "UNKNOWN"
                        }
                    },
                    "location": {
                        "coord": {
                            "coord_system_id": "UNKNOWN",
                            "stc:ScalarCoordinate": {
                                "stc:Value": "2.3519587e-17"
                            }
                        }
                    }
                },
                "independentAxis": "false",
                "numBins1": "0",
                "regularsamplingStatus": "true",
                "ucd": "phot",
                "undersamplingStatus": "false",
                "unit": "counts"
            }
        ]
    }
}
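[Editor's illustration] For reference, here is a minimal sketch of the kind of
element-to-dictionary walk such an XML-to-JSON conversion needs (Python
standard library only; this is an illustration, not Thomas' actual converter,
and namespace prefixes such as stc: would need extra handling in ElementTree):

import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    # Map an XML element to a dict of its children; leaves become their text.
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    out = {}
    for child in children:
        value = element_to_dict(child)
        # Repeated tags (e.g. several characterisationAxis elements) become a list
        if child.tag in out:
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    # Attributes such as coord_system_id or xlink:href sit alongside the children
    out.update(elem.attrib)
    return out

# Usage sketch:
# root = ET.parse("characterisation.xml").getroot()
# print(json.dumps({root.tag: element_to_dict(root)}, indent=4, sort_keys=True))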
--------------------------------------------------------------------------------
*Step 2 Fabien answers FB
--------------------------------------------------------------------------------
No problem with that (at least for XML). VOTable simply doesn't fit the
need for structured information.
Fabien's metadata example in JSON, from the same mail ....
// Beginning of the main astrox structure
{
    // Unique identifier for this astrox file. It points to a valid
    // permanent URL where this file is located
    "astroxUri": "http://archive.eso.org/astrox/MY_NGASID",
    // Last modification date (UTC) of this file, useful for automatic
    // updating by e.g. crawlers
    "timeStamp": "2008-07-01 16:34:15",
    "title": "Observation of galaxy clusters blablabla",
    "creator": "ESO/ADP",
    "creationDate": "2006-01-01 16:34:15",
    "publisher": "ESO SAF",
    "copyrights": "ESO (C) 2008",
    "collection": "QC UVES Spectra",
    "accessRights": "Anonymous",
    "originalSource": "http://www.eso.org/astrox/MY_NGASIDXX",
    "instrumentSetup":
    {
        "facility": "ESO",
        "telescope": "VLT/UT1",
        "instrument": "WFI",
        "filter": "V",
        "mode": "Image",
        "mvm":
        {
            "param1": "maValeur"
        }
    },
    // Contains high-level meta-data on the characterization of the data set
    "characterization":
    {
        // Link to the binary low-level char file
        "lowlevelcharacterizationUrl": "http://www.eso.org/astrox/MY_NGASID.astrox",
        // 2D spatial direction Ra, Dec in ICRS (deg)
        "spaceAxis":
        {
            "centralPos": [10.15, -20.18],
            // Union of convex polygons containing all the relevant data
            "boundingBox": [[[-36.52279463, -0.93122968], [-36.73541852, -0.9630006],
                             [-36.767187, -0.75040453], [-36.55457444, -0.71863535]]],
            // Define a (possibly hierarchical) preview image using the format
            // implemented in Stellarium.
            // The description of the preview should not be considered as
            // scientifically valid characterization.
            // It is somewhat subjective and should be used only for display.
            "preview":
            {
                "credit": "Grasslands Observatory",
                "imageUrl": "nebulae/default/m2.png",
                "worldCoords": [[[-36.52279463, -0.93122968], [-36.73541852, -0.9630006],
                                 [-36.767187, -0.75040453], [-36.55457444, -0.71863535]]],
                "textureCoords": [[[0,0], [1,0], [1,1], [0,1]]],
                "minResolution": 0.2148810463,
                "maxBrightness": 13.9,
                "alphaBlend": true,
                "subTiles":
                [
                    "http://mySubTile1.json",
                    "http://mySubTile2.json"
                ]
            },
            // Define a (possibly hierarchical) footprint geometry. A footprint
            // is like the thresholding of the (spatial in this case)
            // transmission curve
            "footprint":
            {
                "worldCoords": [[[-36.52279463, -0.93122968], [-36.73541852, -0.9630006],
                                 [-36.767187, -0.75040453], [-36.55457444, -0.71863535]]],
                "minResolution": 0.2148810463,
                "subFootprint":
                [
                    "http://mySubFootprint1.json",
                    "http://mySubFootprint2.json"
                ]
            }
        },
        // lambda in heliocentric standard of rest (m)
        "wavelengthAxis":
        {
            "centralPos": 1235.5,
            // Union of ranges containing all the relevant data
            "boundingBox": [[1235.6, 1235.458]],
            // Transmission curve: two arrays, in m / value between 0 and 1.
            // Zero is assumed outside the array.
            // I provided this one as an example with values between 0 and 1,
            // but I think the correct way should be to give the value in
            // maximum SNR or error
            "transmission": [[12255, 12258, 122569, 12289], [0, 0.125, 0.56, 0.236, 0]]
        },
        // time in TT (s)
        "timeAxis":
        {
            // Union of ranges containing all the relevant data
            "boundingBox": [[256368.6, 256368.6]]
        },
        // intensity (W?)
        "intensityAxis":
        {
            // Union of ranges containing all the relevant data
            "boundingBox": [[123.6, 124.6]]
        },
        // 3D observer position X,Y,Z in ICRS heliocentric (m)
        "observerPosAxis":
        {
            // Union of 3D convex polygons containing the observer position
            "boundingBox": [[[10023.6, 124.6, 1235.0], [10023.6, 124.6, 1235.0],
                             [10023.6, 124.6, 1235.0], [10023.6, 124.6, 1235.0],
                             [10023.6, 124.6, 1235.0], [10023.6, 124.6, 1235.0],
                             [10023.6, 124.6, 1235.0], [10023.6, 124.6, 1235.0]]]
        }
    },
    // The list of all the data files (including calibration) associated
    // with this data set
    "dataSources":
    {
        "mainDataSet":
        {
            "format": "application/fits",
            "FITShdu": 1,
            "url": "http://data.eso.org/myFileID.fits",
            "fileSize": 512.12,
            // Follows the numpy array interface:
            // http://numpy.scipy.org/array_interface.shtml
            "shape": [1024, 256],
            "typestr": "f"
        }
    },
    // Other astrox instances which are logically children of the current
    // one, e.g. multi-extension FITS files or elements of a survey, etc.
    // The only constraint is that the characterization of the parent
    // 'contains' the union of the characterizations of the children.
    // This is the basic concept which allows searching through trees of
    // datasets using neutral characterization descriptors (a small crawling
    // sketch follows this example).
    "subAstrox":
    {
        "chip1": "http://www.eso.org/astrox/MY_OTHER_NGAS_ID",
        "chip2": "http://www.eso.org/astrox/MY_NGASID#chip2",
        "chip3":
        {
            // A full included astrox structure can be put here as well
        }
    },
    // Info on the target if any
    "target":
    {
        // TODO in relation with the Observation data model
    },
    // FITS keywords
    "FITS":
    {
        "NAXIS": 2,
        "NAXIS1": 500,
        "NAXIS2": 500,
        "CRVAL1": 10.23,
        "CRVAL2": 156.23,
        "CRPIX1": 12.3,
        "CRPIX2": 156.36,
        "CTYPE1": "RA---TAN",
        "CTYPE2": "DEC--TAN",
        "CD1_1": 15.2,
        "CD1_2": 0.02,
        "CD2_1": 269.2,
        "CD2_2": 0.01,
        "RADECSYS": "ICRS",
        "EPOCH": "2000",
        "EQUINOX": "J2000"
        // Etc.
    },
    // ESO specific meta-data
    "ESO":
    {
        "PIName": "Gerard Dupont",
        "transmissionCurveURL": "http://myCurveService.xml",
        "programID": "XXXX.DDD-ABC",
        "OBName": "string",
        "OBID": "1256",
        "category": "SCIENCE", // ESO DPR CAT
        "mode": "SPECTRUM", // ESO DPR TECH
        "type": "OBJECT", // ESO DPR TYPE
        "processingType": "highlyProcessed",
        "accessFlag": "Anonymous"
    }
}
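[Editor's illustration] As a sketch of the containment idea stated in the
subAstrox comments above, here is a small hypothetical Python crawler. The
field names ("characterization", "wavelengthAxis", "boundingBox", "astroxUri",
"subAstrox") come from Fabien's example; the crawler itself is not part of any
proposal and assumes the descriptions are served as strict JSON (without the
// annotations used above for explanation):

import json
import urllib.request

def overlaps(ranges, lo, hi):
    # True if any [a, b] interval of the union overlaps the requested [lo, hi]
    for a, b in ranges:
        a, b = min(a, b), max(a, b)
        if a <= hi and b >= lo:
            return True
    return False

def crawl(astrox, lo, hi, matches):
    # Collect datasets whose wavelength coverage overlaps [lo, hi]. Because a
    # parent 'contains' the union of its children's characterizations, a
    # non-overlapping parent can be pruned together with its whole subtree.
    ranges = astrox.get("characterization", {}).get("wavelengthAxis", {}).get("boundingBox")
    if ranges is not None and not overlaps(ranges, lo, hi):
        return
    if "astroxUri" in astrox:
        matches.append(astrox["astroxUri"])
    for child in astrox.get("subAstrox", {}).values():
        if isinstance(child, str):              # child given as a URL: fetch it
            with urllib.request.urlopen(child) as resp:
                child = json.load(resp)
        crawl(child, lo, hi, matches)

# Usage sketch (units are those of the file, metres in the example above):
# root = json.load(open("MY_NGASID.json"))
# found = []
# crawl(root, 1235.0, 1236.0, found)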
--------------------------------------------------------------------------------
*Step 3 Fabien's answer to Anita
--------------------------------------------------------------------------------
> VOTable is generally well-liked. Hence I am against adopting yet
> another language...
If VOTable allowed storing structured information, I would also be
against a new format. But there is a real new need here, and nothing is
currently used for it, except for prototypes. We basically have the
choice between XML, JSON and other serializations. Before the choice is
made, I just wanted to point out the qualities of JSON.
--------------------------------------------------------------------------------
*Step 4 Anita replies
----------------------------------------------------------------------------------
I appreciate that, Fabien, but it is a matter of who is going to have to
understand it. Astronomers can use VOTable precisely because it is
simple. VO engineers and software experts can use whatever is best
within their domains, but if anyone outside is to use it - even data
publishers, since most archives only have part-time maintainers at best
- then it has to be something already widely used. Astronomers are not
usually going to learn a new language just for the VO. We are not yet
seen as that indispensable.
--------------------------------------------------------------------------------
*Step 5 Fabien again
--------------------------------------------------------------------------------
If the VO worked, astronomers would not have to see how it works.
They would just use the tools. So the main users of such a format are
the engineers making the tools and the data providers who expose their
data.
But the main problem is not even there. The real problem is that VOTable
simply doesn't suit the need for characterization. VOTable is good for
tabular data, such as a SIA output, but not for structured data, which
is what we have here.
(A bit off topic: JSON is actually very widely used for web
applications, there are about 10 parser libraries for each major
programming language, and it took me half a day to code my own parser.)
------------------------------------------------------------------------------
*Step 6 Igor's comment
------------------------------------------------------------------------------
1) I'm strictly against JSON or whatever pseudo-XML. Characterisation
(or any other, e.g. STC) metadata is not supposed to be
human-readable. If it looks too complex for the data providers, they
have to fire their software engineers and hire more qualified ones.
---------------------------------------------------------------------------------
*Step 7 Fabien answers to Igor
-------------------------------------------------------------------------------
The only objective way to choose a serialization format is to measure
the pros and cons for each class of users. Ease of use and
human-readability are two very important criteria for engineers.
-------------------------------------------------------------------------------
*Step 8 Gretchen on VOTABLE
------------------------------------------------------------------------------
The VOTable data model is, however, in my view more a transport mechanism
that provides a simple framework for higher-level abstraction and
generalization. The specific data models which characterize region,
time, spectral distribution, etc. need to account for the complexity
and heterogeneity, or information is lost. I don't see how this is not
obvious.
-------------------------------------------------------------------------------
*Step 9 Juan de Dios comment
-------------------------------------------------------------------------------
But I don't think JSON is much better than XML for readability, and I
think it is more fragile than XML in case of partial truncation. And
relationships (hierarchical or purely relational) have to be specified
by foreign keys, which hampers readability.
In any case, you can see JSON notation and XML notation as
complementary, one syntactic sugar for the other, not as something that
really gains you anything from the implementation or human readability
point of view.
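[Editor's illustration] To make the foreign-key point concrete with the
characterisation JSON from Part B: the time and spectral axes carry only
"idref": "TT-ICRS-WAVELENGTH-TOPO", so a reader has to resolve that key
against the axis that defines the full coordsystem. A hypothetical Python
helper, not part of any proposal:

def resolve_coordsystems(axes):
    # Collect the coordsystems that are fully defined (those carrying an "id")...
    defined = {ax["coordsystem"]["id"]: ax["coordsystem"]
               for ax in axes if "id" in ax.get("coordsystem", {})}
    # ...then follow the idref "foreign keys" of the remaining axes.
    for ax in axes:
        ref = ax.get("coordsystem", {}).get("idref")
        if ref is not None:
            ax["coordsystem"] = defined[ref]
    return axes

# Given the parsed JSON from Part B as 'char':
# axes = char["characterisation"]["characterisationAxis"]
# resolve_coordsystems(axes)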
-----------------------------------------------------
*Step 10 FB eventually to Fabien
--------------------------------------------------------------------------
Model and FORMAT are not the same thing.
Contrary to Igor, I have nothing against a JSON format for a given data model.
But it's definitely too early (at least) to say that it should replace XML.
Why should we not just add JSON as a new serialisation BESIDE the
previous ones?
(I feel like I'm repeating what I already wrote, OK ....)
--------------------------------------------------------------------------------
*Step 11 FB on VOTABLE and datamodels
--------------------------------------------------------------------------------
Gathering all this discussion on formats, it occurs to me that I didn't
say a word in favor of VOTable.
I strongly disagree with the statement that VOTable is not suited to
transporting modelled metadata. Of course structuring has limits in
VOTable, but it is not at all impossible. This has been shown IN PRACTICE
several times.
Of course the VOTable "model" is nothing more than a static version of
the relational model, and the actual metadata model semantics are conveyed
by the values and not by the tags. But utypes, GROUP combinations with
RESOURCEs, and the key/reference mechanism indeed allow some structuring.
There is a LOT of VO literature and references, as well as implementations,
doing that.
Just a couple of them:
The recent Note by Ochsenbein, Rots and McDowell,
"Referencing STC in VOTable".
The SSA recommendation describes utypes and their meaning, and
the DAL extension mechanism and its relationships with models.
The Spectrum DM recommendation provides some rules to map a data model
into VOTable.
Older references are the IVOA Notes
"DAL Query Response with Extensions: Use Cases and Implementation
Rules. Example of SIAP"
and "Data Model Serialisation in VOTable".
So besides FITS, XML and maybe JSON, VOTable is still another serious
candidate for modelled metadata serialisation. Don't rule it out.
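[Editor's illustration] Purely as a sketch of the GROUP/utype mechanism
mentioned above (the utype and UCD strings reuse identifiers already quoted in
this thread; the layout is illustrative and does not follow any official
mapping document), a few lines of Python are enough to emit such a structured
VOTable fragment:

import xml.etree.ElementTree as ET

votable = ET.Element("VOTABLE", version="1.1")
resource = ET.SubElement(votable, "RESOURCE")
# A GROUP gathers the attributes of one data model class...
group = ET.SubElement(resource, "GROUP", name="DataID", utype="ssa:DataID")
# ...and each PARAM carries both a utype (place in the model) and a UCD (quantity).
ET.SubElement(group, "PARAM", name="instrument",
              utype="ssa:DataID.Instrument",
              ucd="meta.id;instr",
              datatype="char", arraysize="*", value="WFI")

print(ET.tostring(votable, encoding="unicode"))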