DataLink issues
Petr Skoda
skoda at sunstel.asu.cas.cz
Wed Sep 26 04:05:15 PDT 2012
Hi all,
I was waiting to see more of the ongoing DataLink discussion (in particular,
I expected comments by Markus Demleitner ;-), but as it seems to be
converging towards some extra-big beast, I think a pragmatic view is
necessary:
The question that is still not answered is: what is DataLink needed or
useful for? Not what it could do in principle!
I still do not see a clear boundary between a DAL service (specific,
detailed, ...) and a discovery service for alternative representations -
which is what DL is presented as.
In May I was hoping DL could fulfil my (still pending) wish of
postprocessing spectra (cutout, normalization, resolution degradation, etc.)
in a clear and well-formalized way. After practical experience with that, I
have found a catch: the data format is a primary source of complexity. And
as I remember, the special session about DL ended with a general acceptance
of DL as an "easily clickable" fork for accessing static data.
That is why Markus and I decided to prototype the SSA getData operation
instead of relying on DL to solve it in a universal "S*APv2" way at the GDS
level, which is IMHO still too ambitious given the current state of
practical implementations of IVOA protocols in tools.
But now I see a tendency to make DL the all-purpose
"one-link-to-rule-them-all" standard.
So let's take the simple case of an SSA analogy, which is still a field I
understand well (I hope ;-); sorry for digressing into SSA:
In SSA we have a couple of parameters (PQL-like ;-) which in all current
implementations (except mine) mean a restriction of the selection - i.e. by
giving BAND, TIME and FORMAT I restrict the space of possible answers to
datasets containing the given spectral range, exposed at the given epoch,
and expressed in the given format. In my effort I am trying to introduce
postprocessing, using the parameters as a control set according to which
the postprocessing works - e.g. I command the service to cut out the region
I want using BAND, or by giving FLUXCALIB I force the data to be normalized
on the fly. To do this I need two services: one interprets the params as
control params, the other makes just the selection, as was the original
intention.
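To make the two interpretations concrete, here is a minimal Python sketch;
the base URLs are invented, only BAND, FLUXCALIB and the request names
correspond to the SSA parameters discussed above:

```python
from urllib.parse import urlencode

# Illustrative endpoints only; getData is our prototype operation.
SELECT_BASE = "http://example.org/ssa/query?"    # classic SSA: params restrict selection
PROCESS_BASE = "http://example.org/ssa/getdata?" # prototype: params control processing

def ssa_url(base, **params):
    """Build an SSA-style GET request from keyword parameters."""
    return base + urlencode(params)

# The same BAND value carries two different meanings:
# 1) discovery: "return only datasets covering 650-660 nm"
discovery = ssa_url(SELECT_BASE, REQUEST="queryData", BAND="6.5e-7/6.6e-7")
# 2) processing: "cut the spectrum to 650-660 nm and normalize on the fly"
cutout = ssa_url(PROCESS_BASE, REQUEST="getData", BAND="6.5e-7/6.6e-7",
                 FLUXCALIB="NORMALIZED")
```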
But the catch is in FORMAT. All data in our case are NATIVE (1D FITS), and
the on-the-fly processing is done when asking for FITS, VOTABLE or
COMPLIANT. But the postprocessing can work only on the bintable FITS (for
which we have simple cutouts in wavelengths - not pixels, as the WCS may be
non-linear; OTOH we cannot give the number of output pixels of the cutout
operation in metadata). This means that the postprocessing implies a format
conversion, and it is difficult to state the metadata for spectral coverage
or data size just after the basic queryData operation: the exact cutout may
be done on pixels which may represent different wavelengths than requested
in BAND, and after, say, rebinning, the amount of data results from an
algorithm that has not yet run at queryData time - only getData will know
it when finished.
So in general, the problem of virtual datasets is that we cannot describe
them in metadata before the real operation is run.
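To illustrate the point, a toy cutout on a tabulated (possibly non-linear)
wavelength axis; all numbers are invented:

```python
# A cutout selects pixels by wavelength, not by index, so on a non-linear
# axis the output size depends on the actual sampling and cannot be stated
# in the queryData metadata beforehand.
def cutout(waves, flux, wmin, wmax):
    """Keep only the pixels whose wavelength falls inside [wmin, wmax]."""
    keep = [(w, f) for w, f in zip(waves, flux) if wmin <= w <= wmax]
    return [w for w, _ in keep], [f for _, f in keep]

# Non-linear sampling: the pixel spacing varies along the axis (nm).
waves = [400.0, 400.5, 401.2, 402.3, 404.0, 407.0]
flux  = [1.0,   1.1,   0.9,   1.3,   1.2,   1.0]

# Only after the operation runs do we know how many pixels survive.
w, f = cutout(waves, flux, 400.4, 403.0)
```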
Please notice that I also use the parameter FORMAT as an operation with
unknown results ;-) But to be able to work with it in an intuitive manner,
we have to pretend the existence of all files in the different formats
(having a multiple of the number of rows in the database), and the SSA
keyword will restrict the number of returned rows (when asking for
COMPLIANT it returns only the VOTable and the bintable FITS as
application/fits; when asking for ALL or ANY it also gives the image/fits;
asking for NATIVE gives me only that one). So I have to be sure I am always
able to do the data conversion, and pretend I have it in the database.
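A minimal sketch of that pretence, with made-up rows (formats as in our
archive, where NATIVE is the 1D image/fits):

```python
# Pretend rows for a single dataset, one per deliverable format; the SSA
# FORMAT keyword then merely restricts which rows are returned.
ROWS = [
    {"format": "application/fits",          "compliant": True},   # bintable FITS
    {"format": "application/x-votable+xml", "compliant": True},   # VOTable
    {"format": "image/fits",                "compliant": False},  # native 1D FITS
]

def select_by_format(rows, fmt):
    """Filter the pretend rows according to the SSA FORMAT keyword."""
    fmt = fmt.upper()
    if fmt in ("ALL", "ANY"):
        return rows
    if fmt == "COMPLIANT":
        return [r for r in rows if r["compliant"]]
    if fmt == "NATIVE":
        return [r for r in rows if r["format"] == "image/fits"]
    return [r for r in rows if r["format"].upper() == fmt]
```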
Here you see the complications with a simple protocol like SSA (although so
far it is the best way to do some real spectroscopic analysis instead of
just displaying static data).
And now consider the ambitions of DL: accessing complex datasets,
processing them using given parameters, accessing more specific metadata
according to the nature of the dataset, even a UWS service for
computationally intensive tasks ...... !!!
I think it would be a nightmare to implement clients to work with it.
Beware that we need to state the parameters for DL, formalize their
semantics, and they would need to be able to be passed deeper to the
underlying DAL service, or even to a UWS parameter description.
All the complexity would follow from the beginning. Suppose DL on a
spectrum with a given PUBID returns the existence of a service that allows
its cutout and rebinning. How will I pass the params to it? Will the client
open a special window with a form? What if there is another link to an
image of the source, and the service allows cutout, rotation and rebinning
of the image, or other conversions ...?
So the client would need to understand all possible parameters of all
possible protocols which might be the endpoints of DL. What happens in
theoretical science is an order of magnitude more complicated.
OTOH the idea is tempting: suppose you click on discovery results, and once
you see there is an ALMA observation of the given place on the sky, you get
an automatic menu of the datalink list; you select an image in the CO line,
together with the spectrum in a given range, a visibility map, etc. .....
But this would in fact replace all the VO stuff. We could call different
tools, bound by SAMP, according to the semantic type of the returned
datalink - fantastic. But it requires a rigid description of the semantics,
and automatic selection of DAL protocols AND THEIR PARAMETERS.
-----------------------------------------
So I would suggest keeping it simple (and not stupid ;-).
IMHO DL makes sense only without the parameters - just to discover more
representations of the data available in the archive. Here I fully support
Doug! DL should serve as information about the existence of another
representation of the same dataset - e.g. a small preview in JPEG attached
to a large FITS, maybe the continuum-normalized spectrum linked to the
flux-uncalibrated one, a visibility function or uv-coverage graph linked to
a radio image, etc.
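A minimal sketch of such a parameter-free link table (all identifiers,
semantics labels and URLs are invented for illustration, not a proposed
spec):

```python
# Each dataset simply maps to its alternative representations; no
# parameters, no processing - just "here is what else exists".
LINKS = {
    "ivo://example.org/spec#0042": [
        {"semantics": "preview",    "content_type": "image/jpeg",
         "url": "http://example.org/prev/0042.jpg"},
        {"semantics": "normalized", "content_type": "application/fits",
         "url": "http://example.org/norm/0042.fits"},
        {"semantics": "raw-frame",  "content_type": "image/fits",
         "url": "http://example.org/raw/0042.fits"},
    ],
}

def links_for(pubid, semantics=None):
    """Return the alternative representations registered for a dataset."""
    rows = LINKS.get(pubid, [])
    if semantics is None:
        return rows
    return [r for r in rows if r["semantics"] == semantics]
```

A client then only needs to render this list and hand the URL to the right
tool; no parameter semantics have to be understood at all.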
I am not sure where the difference from the ObsTAP science-level category
(raw data, mosaic, ...) lies, but I am sure the DL attached to a normalized
or flux-calibrated spectrum could point to the just-extracted one, and from
it to the raw 2D image; it might associate the calibrations with the
science frame, etc. .... In fact, the whole current package of data from
observing blocks (all raw and intermediate data, calibrations, dispersion
curves, photon transfer statistics, etc. - what is currently provided by
most archives) could be directly linked .... We could also describe the
provenance of the data by pointing to the different stages of processing
via DL.
We currently have a strange situation in astronomical data archiving. On
one hand, there is a whole machinery of raw data archives, pipelines and
their resulting intermediate products - quality-check frames, calibrations,
etc. - and this is well maintained by the observatories (e.g. ESO observing
block packages). On the other hand, there is a very "enthusiastic and
amateur-level" final processing of results, put (if we are lucky) in some
archive (in the ideal case VO-compatible), and unfortunately the
scientifically usable data results are not connected with the original
stuff.
At ESO, they have an archive for the raw data of all instruments, but the
science-level products are missing or incomplete (e.g. UVES spectra) in VO
services. They want people to reduce the data themselves and return them
back. But there is again no clear solution for how to link the reduced
science data with the original archive. IMHO DL could be an ideal way to do
it in a proper VO way.
It would give the data archives a clear recommendation on how to make a raw
data archive gradually connected to the science data once available, and it
is what the journals are calling for: traceability from a remark on a
dataset in a journal all the way up to the raw data in the given VO
archive.
Just an example of why it might be useful already now on existing archives:
several days ago I was looking at a set of stacked spectra (line profiles),
and one was quite strange - so I wanted to identify what was on the raw
frame. Having DL in my SSA service (and in SPLAT), I might be able to
directly click on some general accref (a highlighted DatasetID) within the
returned list of the SSA response, see the link to the original image, and
call the viewer ds9 to see the cosmics or saturated parts on the original
frames, to reveal the origin of the problems.....
Final words: putting too much semantic burden on DL could kill the original
clever idea, and I cannot imagine the practical work of clients with the
parameters.
Best regards
Petr
*************************************************************************
* Petr Skoda Phone : +420-323-649201, ext. 361 *
* Stellar Department +420-323-620361 *
* Astronomical Institute AS CR Fax : +420-323-620250 *
* 251 65 Ondrejov e-mail: skoda at sunstel.asu.cas.cz *
* Czech Republic *
*************************************************************************