DataLink issues
Petr Skoda
skoda at sunstel.asu.cas.cz
Wed Sep 26 04:05:15 PDT 2012
Hi all,
I was waiting to see more of the ongoing DataLink discussion (in particular,
I expected comments by Markus Demleitner ;-), but as it seems to be
converging towards some extra-big beast, I think a pragmatic view is
necessary:
The question that is still not answered is: what is DataLink needed or
useful for? Not what it could do in principle!
I still do not see a clear boundary between a DAL service (specific,
detailed, ...) and a discovery service for alternative representations -
which is what DL is presented as.
In May I was hoping DL could fulfil my (still pending) wish of
postprocessing spectra (cutout, normalization, resolution degradation, etc.)
in a clear and well-formalized way. After practical experience with that, I
have found a catch: the data format is a primary source of complexity. And
as I remember, the special session about DL ended with a general acceptance
of DL as an "easily clickable" fork for accessing static data.
That is why Markus and I decided to prototype the SSA getData operation
instead of relying on DL to solve it in a universal "S*APv2" way at the GDS
level, which is IMHO still too ambitious given the current state of
practical implementations of IVOA protocols in tools.
But now I see a tendency to make DL the all-purpose
"one-link-to-rule-them-all" standard.
So let's take the simple case of an SSA analogy, which is still a field I
understand well (I hope ;-); sorry for digressing into SSA:
In SSA we have a couple of parameters (PQL-like ;-) which in all current
implementations (except mine) mean a restriction of the selection - i.e. by
giving BAND, TIME and FORMAT I restrict the space of possible answers to
datasets containing the given spectral range, exposed at the given epoch,
and expressed in the given format. In my effort I am trying to introduce
postprocessing, using the parameters as a control set according to which
the postprocessing works - e.g. I command the service to cut out the region
I want using BAND, or by giving FLUXCALIB I force the data to be normalized
on the fly. To do this I need two services: one interprets the params as
control params, the other makes just the selection, as was the original
intention.
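To make the two interpretations concrete, here is a minimal Python sketch;
the base URLs are invented, only BAND, FLUXCALIB and the request names
correspond to the SSA parameters discussed above:

```python
from urllib.parse import urlencode

# Illustrative endpoints only; getData is our prototype operation.
SELECT_BASE = "http://example.org/ssa/query?"    # classic SSA: params restrict selection
PROCESS_BASE = "http://example.org/ssa/getdata?" # prototype: params control processing

def ssa_url(base, **params):
    """Build an SSA-style GET request from keyword parameters."""
    return base + urlencode(params)

# The same BAND value carries two different meanings:
# 1) discovery: "return only datasets covering 650-660 nm"
discovery = ssa_url(SELECT_BASE, REQUEST="queryData", BAND="6.5e-7/6.6e-7")
# 2) processing: "cut the spectrum to 650-660 nm and normalize on the fly"
cutout = ssa_url(PROCESS_BASE, REQUEST="getData", BAND="6.5e-7/6.6e-7",
                 FLUXCALIB="NORMALIZED")
```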
But the catch is in FORMAT. All data in our case are NATIVE (1D FITS), and
the on-the-fly processing is done when asking for FITS, VOTABLE or
COMPLIANT. But the postprocessing can work only on the bintable FITS (for
which we have simple cutouts in wavelengths - not pixels, as the WCS may be
non-linear; OTOH we cannot give the number of output pixels of the cutout
operation in metadata). This means that the postprocessing implies a format
conversion, and it is difficult to state the metadata for spectral coverage
or data size just after the basic queryData operation: the exact cutout may
be done on pixels which may represent different wavelengths than requested
in BAND, and after, say, rebinning, the amount of data results from an
algorithm that has not yet run at queryData time - only getData will know
it when finished.
So in general, the problem of virtual datasets is that we cannot describe
them in metadata before the real operation is run.
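To illustrate the point, a toy cutout on a tabulated (possibly non-linear)
wavelength axis; all numbers are invented:

```python
# A cutout selects pixels by wavelength, not by index, so on a non-linear
# axis the output size depends on the actual sampling and cannot be stated
# in the queryData metadata beforehand.
def cutout(waves, flux, wmin, wmax):
    """Keep only the pixels whose wavelength falls inside [wmin, wmax]."""
    keep = [(w, f) for w, f in zip(waves, flux) if wmin <= w <= wmax]
    return [w for w, _ in keep], [f for _, f in keep]

# Non-linear sampling: the pixel spacing varies along the axis (nm).
waves = [400.0, 400.5, 401.2, 402.3, 404.0, 407.0]
flux  = [1.0,   1.1,   0.9,   1.3,   1.2,   1.0]

# Only after the operation runs do we know how many pixels survive.
w, f = cutout(waves, flux, 400.4, 403.0)
```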
Please notice that I also use the parameter FORMAT as an operation with
unknown results ;-) But to be able to work with it in an intuitive manner,
we have to pretend the existence of all files in the different formats
(having a multiple of the number of rows in the database), and the SSA
keyword will restrict the number of returned rows (when asking for
COMPLIANT it returns only the VOTable and the bintable FITS as
application/fits; when asking for ALL or ANY it also gives the image/fits;
asking for NATIVE gives me only that one). So I have to be sure I am always
able to do the data conversion, and pretend I have it in the database.
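A minimal sketch of that pretence, with made-up rows (formats as in our
archive, where NATIVE is the 1D image/fits):

```python
# Pretend rows for a single dataset, one per deliverable format; the SSA
# FORMAT keyword then merely restricts which rows are returned.
ROWS = [
    {"format": "application/fits",          "compliant": True},   # bintable FITS
    {"format": "application/x-votable+xml", "compliant": True},   # VOTable
    {"format": "image/fits",                "compliant": False},  # native 1D FITS
]

def select_by_format(rows, fmt):
    """Filter the pretend rows according to the SSA FORMAT keyword."""
    fmt = fmt.upper()
    if fmt in ("ALL", "ANY"):
        return rows
    if fmt == "COMPLIANT":
        return [r for r in rows if r["compliant"]]
    if fmt == "NATIVE":
        return [r for r in rows if r["format"] == "image/fits"]
    return [r for r in rows if r["format"].upper() == fmt]
```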
Here you see the complications with a simple protocol like SSA (although so
far it is the best way to do some real spectroscopic analysis instead of
just displaying static data).
And now consider the ambitions of DL: accessing complex datasets,
processing them using given parameters, accessing more specific metadata
according to the nature of the dataset, even a UWS service for
computationally intensive tasks ...... !!!
I think it would be a nightmare to implement clients to work with it.
Beware that we need to state the parameters for DL, formalize their
semantics, and they would need to be able to be passed deeper to the
underlying DAL service, or even to a UWS parameter description.
All the complexity would follow from the beginning. Suppose DL on a
spectrum with a given PUBID returns the existence of a service that allows
its cutout and rebinning. How will I pass the params to it? Will the client
open a special window with a form? What if there is another link to an
image of the source, and the service allows cutout, rotation and rebinning
of the image, or other conversions ...?
So the client would need to understand all possible parameters of all
possible protocols which might be the endpoints of DL. What happens in
theoretical science is an order of magnitude more complicated.
OTOH the idea is tempting: suppose you click on discovery results, and once
you see there is an ALMA observation of the given place on the sky, you get
an automatic menu of the datalink list; you select an image in the CO line,
together with the spectrum in a given range, a visibility map, etc. .....
But this would in fact replace all the VO stuff. We could call different
tools, bound by SAMP, according to the semantic type of the returned
datalink - fantastic. But it requires a rigid description of the semantics,
and automatic selection of DAL protocols AND THEIR PARAMETERS.
-----------------------------------------
So I would suggest keeping it simple (and not stupid ;-).
IMHO DL makes sense only without the parameters - just to discover more
representations of the data available in the archive. Here I fully support
Doug! DL should serve as information about the existence of another
representation of the same dataset - e.g. a small preview in JPEG attached
to a large FITS, maybe the continuum-normalized spectrum linked to the
flux-uncalibrated one, a visibility function or uv-coverage graph linked to
a radio image, etc.
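A minimal sketch of such a parameter-free link table (all identifiers,
semantics labels and URLs are invented for illustration, not a proposed
spec):

```python
# Each dataset simply maps to its alternative representations; no
# parameters, no processing - just "here is what else exists".
LINKS = {
    "ivo://example.org/spec#0042": [
        {"semantics": "preview",    "content_type": "image/jpeg",
         "url": "http://example.org/prev/0042.jpg"},
        {"semantics": "normalized", "content_type": "application/fits",
         "url": "http://example.org/norm/0042.fits"},
        {"semantics": "raw-frame",  "content_type": "image/fits",
         "url": "http://example.org/raw/0042.fits"},
    ],
}

def links_for(pubid, semantics=None):
    """Return the alternative representations registered for a dataset."""
    rows = LINKS.get(pubid, [])
    if semantics is None:
        return rows
    return [r for r in rows if r["semantics"] == semantics]
```

A client then only needs to render this list and hand the URL to the right
tool; no parameter semantics have to be understood at all.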
I am not sure where the difference from the ObsTAP science-level category
(raw data, mosaic, ...) lies, but I am sure the DL attached to a normalized
or flux-calibrated spectrum could point to the just-extracted one, and from
it to the raw 2D image; it might associate the calibrations with the
science frame, etc. .... In fact, the whole current package of data from
observing blocks (all raw and intermediate data, calibrations, dispersion
curves, photon transfer statistics, etc. - what is currently provided by
most archives) could be directly linked .... We could also describe the
provenance of the data by pointing to the different stages of processing
via DL.
We currently have a strange situation in astronomical data archiving. On
one hand, there is a whole machinery of raw data archives, pipelines and
their resulting intermediate products - quality-check frames, calibrations,
etc. - and this is well maintained by the observatories (e.g. ESO observing
block packages). On the other hand, there is a very "enthusiastic and
amateur-level" final processing of results, put (if we are lucky) in some
archive (in the ideal case VO-compatible), and unfortunately the
scientifically usable data results are not connected with the original
stuff.
At ESO, they have an archive for the raw data of all instruments, but the
science-level products are missing or incomplete (e.g. UVES spectra) in VO
services. They want people to reduce the data themselves and return them
back. But there is again no clear solution for how to link the reduced
science data with the original archive. IMHO DL could be an ideal way to do
it in a proper VO way.
It would give the data archives a clear recommendation on how to make a raw
data archive gradually connected to the science data once available, and it
is what the journals are calling for: traceability from a remark on a
dataset in a journal all the way up to the raw data in the given VO
archive.
Just an example of why it might be useful already now on existing archives:
several days ago I was looking at a set of stacked spectra (line profiles),
and one was quite strange - so I wanted to identify what was on the raw
frame. Having DL in my SSA service (and in SPLAT), I might be able to
directly click on some general accref (a highlighted DatasetID) within the
returned list of the SSA response, see the link to the original image, and
call the viewer ds9 to see the cosmics or saturated parts on the original
frames, to reveal the origin of the problems.....
Final words: putting too much semantic burden on DL could kill the original
clever idea, and I cannot imagine the practical work of clients with the
parameters.
Best regards
Petr
*************************************************************************
* Petr Skoda Phone : +420-323-649201, ext. 361 *
* Stellar Department +420-323-620361 *
* Astronomical Institute AS CR Fax : +420-323-620250 *
* 251 65 Ondrejov e-mail: skoda at sunstel.asu.cas.cz *
* Czech Republic *
*************************************************************************