Spectral DM document update

Doug Tody dtody at nrao.edu
Tue Oct 10 10:09:38 PDT 2006


Hi Anita -

Just to be clear here, we are discussing the format of actual spectral
datasets containing typically thousands of data points.  VOTable,
FITS, and native XML are all possible serializations.  SSA (the
Spectrum model) defines standard Spectrum serializations for these
file formats.  The issue under discussion is how to deal with bulk
array-oriented data in the native XML serialization.

The Spectrum data model is actually two things: metadata describing
the spectrum dataset (this part is in common with the SSA protocol
and comes back in the query response), and a model for the data itself
(this is only used in an actual Spectrum dataset instance containing
the data samples).
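As a toy illustration of that split (the element names below are invented for the sketch, not the actual Spectrum DM schema), a native-XML instance can factor the dataset-level metadata out once and then carry the bulk samples as arrays:

```python
# Toy sketch of a native-XML Spectrum instance: dataset-level metadata
# appears once, followed by the bulk data arrays.  Element names here
# are invented for illustration and are NOT the real Spectrum DM schema.
import xml.etree.ElementTree as ET

def make_spectrum_xml(wavelengths, fluxes, target="NGC 1234"):
    root = ET.Element("Spectrum")

    # Metadata common to every sample: written exactly once.
    meta = ET.SubElement(root, "Metadata")
    ET.SubElement(meta, "Target").text = target
    ET.SubElement(meta, "FluxUnit").text = "Jy"
    ET.SubElement(meta, "SpectralUnit").text = "Angstrom"

    # Bulk data: one whitespace-separated array per quantity, rather
    # than one element per sample (which is where the bloat comes from).
    data = ET.SubElement(root, "Data")
    ET.SubElement(data, "SpectralCoord").text = " ".join(map(str, wavelengths))
    ET.SubElement(data, "FluxDensity").text = " ".join(map(str, fluxes))
    return ET.tostring(root, encoding="unicode")

xml_doc = make_spectrum_xml([4000.0, 4001.0, 4002.0], [1.2, 1.3, 1.1])
```

The same metadata block is what comes back in an SSA query response; only a full dataset instance carries the Data section.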

SSA does not merely describe external, native project format spectra
since - unlike in the case of a FITS image - there is no standard
astronomical format for spectra.  It is possible in some cases to
pass through native datasets which are in some project-specific data
format, but in general we are talking about transforming external data
into the standard SSA/Spectrum defined format at access time.

The most space-efficient representation for large spectral datasets
will be a FITS binary table, but for a typical 1-D spectrum of several
thousand data points, VOTable or native XML is quite workable, is
preferable for passing complex metadata, and can be easier for modern
software to manage.  Even CSV could be adequate in some cases.
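A rough back-of-the-envelope comparison bears this out (the per-value byte counts below are illustrative assumptions, not measurements of any particular file):

```python
# Rough size estimate for a 1-D spectrum of n_points samples with two
# columns (spectral coordinate + flux density).  The per-value byte
# counts are assumptions for illustration, not measured file sizes.
def approx_sizes(n_points, cols=2):
    binary = n_points * cols * 8          # FITS binary table: 8-byte doubles
    text = n_points * cols * 15           # ~15 chars per value as text (VOTable/CSV)
    tagged = n_points * cols * (15 + 30)  # one XML element per value: add tag overhead
    return binary, text, tagged

binary, text, tagged = approx_sizes(4000)
```

Even the worst case here (a fully tagged element per value) stays well under a megabyte for a few thousand points, which is why VOTable or native XML remains workable at this scale even though the binary table is several times smaller.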

 	- Doug


On Tue, 10 Oct 2006, Anita Richards wrote:

>
> For many high-resolution science-ready spectra, you have typically thousands 
> of data points which all share the same characteristics apart from the 
> spectral coordinate and the flux density (and possibly the statistical error 
> on the flux density).
>
> In such a case, there may be e.g. 10 or 20 other pieces of metadata (times, 
> positions, position errors, spectral resolution per bin, accuracy of central 
> spectral coordinate etc. etc.) which do not need repeating.
>
> As I understand it, the Spectrum model can 'be' the data, in which case 
> there would indeed be horrible bloat; a 100 MB VOTable is far more 
> reasonable than a 1 GB one.  For some instances - especially SEDs, with 
> a few points, often with very different metadata - that is reasonable. 
> But I think I agree with Norman that, in practice, the Spectrum model 
> will be far more useful in the case of large data sets for describing 
> data which are in any recognised format (including XML) than for 
> reproducing them.
>
> cheers
> a
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Dr. Anita M. S. Richards, AstroGrid Astronomer
> MERLIN/VLBI National Facility, University of Manchester, Jodrell Bank 
> Observatory, Macclesfield, Cheshire SK11 9DL, U.K. tel +44 (0)1477 572683 
> (direct); 571321 (switchboard); 571618 (fax).
>
>
> On Tue, 10 Oct 2006, Norman Gray wrote:
>
>> 
>> Greetings.
>> 
>> Arguably, this whole discussion is moot.  If you're transporting enough 
>> data that XML efficiency becomes an issue, then you probably shouldn't be 
>> using XML -- that's not what it's for.  A Swiss army knife is a wonderful 
>> thing, but shouldn't be used for brain surgery.
>> 
>> As Doug said:
>> 
>> On 2006 Oct 9, at 16.09, Doug Tody wrote:
>> 
>>> (None of this may matter in the end as most people will probably use
>>> VOTable and FITS for spectra, but nonetheless array handling in XML
>>> is an important issue to consider).
>> 
>> While I take the second point, I would still maintain that using XML for 
>> this sort of transport is probably an abuse of tools.
>> 
>> There are ways of being efficient about XML, if that's what's really 
>> required.  I have a paper sitting here by Peter Buneman and co at 
>> Edinburgh, on `Vectorizing and Querying Large XML Repositories', DOI 
>> 10.1109/ICDE.2005.150 <http://dx.doi.org/10.1109/ICDE.2005.150>.  It 
>> describes a scheme (and points to others) for effectively compressing away 
>> the XML overhead, and transparently making it column-accessible, without 
>> actually losing the useful structuring.  Bob Mann is one of the authors and 
>> could probably say more about it.
>> 
>> If bulk data and XML structuring are both seen as vital, then something 
>> like this is, I would think, a more stable solution to the problem than the 
>> parser-inside-parser solution of having strings of numbers within XML.
>> 
>> All the best,
>> 
>> Norman
>> 
>> 
>> --
>> 
>> ----------------------------------------------------------------------------
>> Norman Gray  /  http://nxg.me.uk
>> eurovotech.org  /  University of Leicester, UK
>> 
>> 
>> 
>> 
>


