[obs-tap]access_format

Rob Seaman seaman at noao.edu
Thu Apr 21 11:51:25 PDT 2011


Interesting discussion.  I think I need to hear more to develop an opinion.  However, any discussion of FITS-based compression - or astronomical data compression in general - should be informed by recent developments in this area.  In particular, there appears to be an assumption here that compression is necessarily an external operation (like gzip) applied to some more fundamental format (like FITS).  Rather, FITS tile compression:

	http://fits.gsfc.nasa.gov/registry/tilecompression.html

is a native FITS format itself.  A FITS object (typically with extension ".fits") may be composed of individual subfiles (FITS extensions) that themselves may each be compressed or not.  The FPACK utilities for reading/writing such files:

	http://heasarc.nasa.gov/fitsio/fpack

happen to append a ".fz" extension by default, but this is merely a convenience by analogy with the ".gz" extension of gzip.  Indeed, FPACK is layered on CFITSIO and, as with jpeg and similar formats, the compressed format may be written or read completely transparently.  That is, no separate utility like fpack is required for the purpose.  Create the first copy compressed (as with jpeg) and use it as such throughout the workflow, whether or not that workflow relies on VO services or protocols.
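To make this concrete, here is a minimal Python sketch (standard library only) showing that whether an image is tile compressed is recorded in the FITS header itself, not in the file name.  Under the tile compression convention a compressed image lives in a binary table extension that carries the keyword ZIMAGE = T.  The function names and the synthetic two-header buffer are my own illustrative assumptions, and the card parsing is deliberately simplified; real code would just let CFITSIO do this.

```python
BLOCK = 2880  # FITS files are organized in 2880-byte blocks
CARD = 80     # each header card is 80 ASCII characters


def read_header(buf, offset=0):
    """Parse header cards from buf starting at offset.

    Returns (keyword -> value dict, offset of the first byte after the
    padded header).  Simplified: string values containing '/' are not handled.
    """
    cards = {}
    pos = offset
    while pos + CARD <= len(buf):
        card = buf[pos:pos + CARD].decode("ascii")
        pos += CARD
        key = card[:8].strip()
        if key == "END":
            break
        if card[8:10] == "= ":
            value = card[10:].split("/", 1)[0].strip()
            cards[key] = value.strip("'").strip()
    # Headers are padded to a whole number of 2880-byte blocks.
    end = offset + -(-(pos - offset) // BLOCK) * BLOCK
    return cards, end


def is_tile_compressed(header):
    """True if this header describes a tile-compressed image extension."""
    return header.get("XTENSION") == "BINTABLE" and header.get("ZIMAGE") == "T"


def make_header(pairs):
    """Build a synthetic, block-padded header (hypothetical helper for the demo)."""
    cards = [f"{k:<8}= {v:>20}".ljust(CARD) for k, v in pairs]
    cards.append("END".ljust(CARD))
    raw = "".join(cards).encode("ascii")
    return raw.ljust(-(-len(raw) // BLOCK) * BLOCK, b" ")


# Synthetic file: an empty primary HDU followed by a compressed-image header.
primary = make_header([("SIMPLE", "T"), ("BITPIX", "8"), ("NAXIS", "0")])
ext = make_header([
    ("XTENSION", "'BINTABLE'"), ("BITPIX", "8"), ("NAXIS", "2"),
    ("NAXIS1", "8"), ("NAXIS2", "0"), ("PCOUNT", "0"), ("GCOUNT", "1"),
    ("ZIMAGE", "T"), ("ZCMPTYPE", "'RICE_1'"),
])
buf = primary + ext

h0, off = read_header(buf)
h1, _ = read_header(buf, off)
print(is_tile_compressed(h0), is_tile_compressed(h1))
```

The same test applies whether the file is named ".fits", ".fits.fz" or anything else, which is the point: the suffix is a courtesy, the header is the authority.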

See the NOAO user notes:

	http://archive.noao.edu/doc/SDM_fpack_usernotes.html

and contained references for more discussion.  The target is moving fast, so you might want to start with a more recent paper:

	http://arxiv.org/abs/1007.1179

whose abstraction of the underlying problem emphasizes that compression is really a question of efficient data representation, not an externally applied "scheme" devoid of theoretical underpinnings.

Note that such legal FITS files (whatever the extensions) can contain compressed tables, not just compressed images, and certainly not just externally compressed blobs of bits:

	http://fits.gsfc.nasa.gov/tiletable.pdf

Which is to say that it is too constraining to attempt to require a compression suffix (like ".gz") after a bundling suffix (like ".tar" or ".zip") perhaps after a format suffix (like ".fits").  All of these roles may be combined.  On the other hand, a library like CFITSIO may well view a ".fits.gz" file as the same thing (mostly) as a ".fits" file, but the logistics may be very different.
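As a rough illustration of why the suffix is a weak signal, the following Python sketch (standard library only) classifies a payload from its leading bytes rather than from its name.  It relies on two well-known signatures: gzip streams begin with the bytes 1f 8b, and a FITS file begins with the card "SIMPLE  =".  The describe() function and the synthetic payload are my own assumptions for the example, not part of any VO or CFITSIO interface.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of any gzip stream
FITS_MAGIC = b"SIMPLE  ="     # start of the first card of a FITS primary header


def describe(payload):
    """Return (external_compression, is_fits) judged from the bytes alone."""
    container = "none"
    if payload[:2] == GZIP_MAGIC:
        container = "gzip"
        # Peel off the external wrapper before looking at the format inside.
        payload = gzip.decompress(payload)
    return container, payload.startswith(FITS_MAGIC)


# Synthetic stand-in for a FITS file: one 2880-byte block starting with SIMPLE.
plain = b"SIMPLE  =                    T".ljust(2880)

print(describe(plain))
print(describe(gzip.compress(plain)))
```

Note that a tile-compressed file written by fpack would fall in the first case: it starts with an ordinary FITS primary header, so nothing external needs to be peeled off before a FITS reader can use it.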

Are compression issues on the agenda for DM sessions in Naples?  TCG agenda?  Other working groups?

Rob
--

On Apr 20, 2011, at 4:44 AM, Laurent Michel wrote:

> I do not agree with the idea that compression must be inferred from the filename suffix for data files compressed after construction. The filename can be, for instance, a service URL which has no reason to end with a compression suffix (e.g. http://service.my.domain/getfile?compressed=yes&filname=mydata.fits).  I'm not sure that compression applied after the file is built must be reported in the access_format column, considering that this issue has been solved by all clients. Can't we simply consider that file.fits and file.fits.gz are both FITS files?
> 
> On 19/04/2011 17:10, Mireille Louys wrote:
>> 
> 
>> 4.7. Access Format (access_format)
>> The access_format column emphasizes information about the format of the data product if downloaded as a file. The values should describe (in increasing detail) the overall file format as well as the structure of the data within the file.  This data model field is important to evaluate for data discovery and data retrieval.  MIME types can be used for that in existing protocols (like HTTP). However, when dealing with observations as in an ObsTAP service, more information about the astronomical arrangement of data into predefined formats is very useful. For instance, we want to distinguish between various formats like aedm (ALMA), evla, MUSE multi-extension FITS files (IFU), etc. Providing this information speeds up the interpretation step for client applications consuming these files on the one hand, and improves data selection in the discovery step on the other.
>> 
>>     ...
>> Compression may be applied at different levels:
>> - after the data file is built
>> - after binding a bunch of files into an archive file (like in .gzip, .7zip, .gz, .tar.gz, etc.)
>> - directly on the file content (jpeg, hcompress in fits images, multi-resolution compression (.MRC files as in MR/1 application))
>> In this case the file name extension conveys the information directly on the file content.
>> No suffix means there is no compression applied.
>> An example of a combined access format could be a concatenation of the MIME short name with a compression suffix.


