ObsCore update discussion : adding Axes information in Obscore table

Thu Apr 16 10:16:59 CEST 2015

Dear Data Modellers,

I've not closely followed the discussion, so this may be a dumb
question, but let me still ask it:  What use cases is this to cover?

Is it, in essence, "Give me all lightcurves/spectra/spectral
cubes/polarisation images satisfying these conditions"?  If so, then
I have to say I feel

On Wed, Apr 15, 2015 at 07:50:33PM +0200, Louys Mireille wrote:
>  * s_dim1, s_dim2 = the coverage in sampling elements ( pixels) for
>    each spatial axis
>  * em_dim = the coverage in spectral elements along the energy axis
>  * t_dim = the coverage in the time axis, as number of time bins
>  * pol_dim = the coverage in the polarization axis, as number of
>    polarization states

is both a bit too much and a bit too little.

I believe it's too much because it's using 6 columns to convey the
information, and it contains lots of information that's not actually
necessary for the one use case I've outlined above.  Six columns may
not seem much, but 

gavo=# select count(*) from tap_schema.columns where table_name='ivoa.obscore';
 count 
-------
    29

in current DaCHS, so that's a 20% increase.  People outside the IVOA
frequently complain that IVOA data models are too complex, so this is
a non-trivial issue, and IMHO we should have strong ("we gain 20% in
usefulness") use cases where people really need the actual number
of pixels for a common discovery operation.  Are we sure we have
those?

At the same time, I believe it's too little, as I can easily think of
cubes that have axes that cannot be described in this way (in
astroparticle physics, one axis might be particle type, for instance;
for visibilities, I'd be reluctant to talk about spatial axes; you
could easily have three spatial dimensions with density values --
think GAIA --, etc).  I don't think we should plan on changing
obscore every time new instruments producing interesting new data
products come around.

I liked much better the idea that has been suggested at some recent
Interop.  Let me sketch it out here again (I don't know who to credit
for it -- speak up, if you're reading this):

Just add one column obs_axes (or whatever), which would contain a
string like (RE syntax)

/([a-z]+/)*

For each (non-degenerate) axis actually present, we'd have one code,
where the s, em, t, pol suggested by Mireille might suffice for now
(though I'd like some guideline what to do with visibilities).  
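
To make the format concrete, here is a minimal Python sketch of a
checker for such strings.  It assumes the pattern is meant as a
leading slash followed by slash-terminated lowercase codes (which is
what strings like /s/s/em/ suggest).  The names parse_obs_axes and
KNOWN_CODES are made up for illustration, and the code set is just
the four codes mentioned above, not an agreed vocabulary.

```python
import re

# Assumed reading of the format: a leading slash, then zero or more
# lowercase codes, each terminated by a slash, e.g. "/s/s/em/".
OBS_AXES_RE = re.compile(r"/([a-z]+/)*")

# Hypothetical vocabulary: only the codes named in this thread.
KNOWN_CODES = {"s", "em", "t", "pol"}


def parse_obs_axes(value):
    """Return the list of axis codes in value, or raise ValueError."""
    if not OBS_AXES_RE.fullmatch(value):
        raise ValueError("malformed obs_axes: %r" % value)
    # Splitting on "/" leaves empty strings at both ends; drop them.
    codes = [code for code in value.split("/") if code]
    unknown = set(codes) - KNOWN_CODES
    if unknown:
        raise ValueError("unknown axis codes: %s" % sorted(unknown))
    return codes
```

A service could run something like this at ingestion time, so that
malformed values never reach the table.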

The examples provided would then look like this:

>  * MUSE data cube
> 
>     s_dim1   = 300
>     s_dim2   = 300
>     em_dim   = 3463
>     pol_dim      = 1
>     pol_state = I
>     t_dim      = 1

/s/s/em/

>  * 2MASS: 2D image
> 
>     s_dim1   = 300
>     s_dim2   = 300
>     em_dim   = 1
>     pol_dim  = 1
>     pol_state = I
>     t_dim = 1

/s/s/

>  * STIS spectroscopy (1D):
> 
>     s_dim1   = 1
>     s_dim2   = 1
>     em_dim   = 1024
>     pol_dim  = 1
>     pol_state = I
>     t_dim = 1

/em/

>  * STIS spectroscopy (2D long slit):
> 
>     s_dim1    = 1024
>     s_dim2    = 1
>     em_dim    = 1024
>     pol_dim   = 1
>     pol_state = I
>     t_dim = 1

/s/em/

>  * ALMA:
> 
>     s_dim1    = 1000
>     s_dim2    = 1000
>     em_dim   = 3000
>     pol_dim  = 4
>     pol_state = I/U/V/Q
>     t_dim = 1

/s/s/em/pol/
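
The mapping used in these examples is mechanical: an axis gets a code
exactly when its *_dim is larger than 1.  A small Python sketch of
that rule follows.  The function name obs_axes_from_dims is
hypothetical, and the ordering s, s, em, t, pol is my assumption;
nothing above fixes where t goes relative to pol.

```python
def obs_axes_from_dims(s_dim1=1, s_dim2=1, em_dim=1, t_dim=1, pol_dim=1):
    """Build an obs_axes string from per-axis element counts.

    An axis is considered non-degenerate when it has more than one
    element.  The order of codes (s, s, em, t, pol) is an assumption.
    """
    axes = [
        ("s", s_dim1),
        ("s", s_dim2),
        ("em", em_dim),
        ("t", t_dim),
        ("pol", pol_dim),
    ]
    codes = [code for code, n_elements in axes if n_elements > 1]
    # Leading slash, then each code followed by its own slash.
    return "/" + "".join(code + "/" for code in codes)
```

For instance, obs_axes_from_dims(s_dim1=1024, em_dim=1024) gives
/s/em/ for the long-slit case.  Whatever order is chosen, it would
have to be fixed by the standard, since equality tests on the string
depend on it.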

I claim that's enough for typical discovery problems; for instance:

* Give me spectral cubes:

  WHERE obs_axes='/s/s/em/'

* Give me anything that has a spectral axis

  WHERE obs_axes LIKE '%/em/%'

* Give me time series

  WHERE obs_axes='/t/'

* Give me things that have both resolved time and resolved
  polarization

  WHERE obs_axes LIKE '%/t/%' and obs_axes LIKE '%/pol/%'
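
These conditions can be tried out on a toy table.  The sketch below
uses Python's sqlite3 module with an in-memory database rather than a
real TAP service, so it only illustrates the pattern matching, not
ADQL itself.  The table has just two columns, the obs_id values are
invented, and the "lightcurve" and "timepol" rows are made-up data
added so the time-related queries have something to match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obscore (obs_id TEXT, obs_axes TEXT)")

# obs_axes values from the examples above, plus two invented rows
# ("lightcurve", "timepol") to exercise the time-related queries.
rows = [
    ("muse_cube", "/s/s/em/"),
    ("twomass_image", "/s/s/"),
    ("stis_1d", "/em/"),
    ("stis_longslit", "/s/em/"),
    ("alma_cube", "/s/s/em/pol/"),
    ("lightcurve", "/t/"),
    ("timepol", "/t/pol/"),
]
conn.executemany("INSERT INTO obscore VALUES (?, ?)", rows)


def select_ids(where):
    """Return the sorted obs_ids matching a WHERE clause.

    The clause is pasted into the SQL as-is; fine for a demo with
    fixed strings, not for untrusted input.
    """
    query = "SELECT obs_id FROM obscore WHERE %s ORDER BY obs_id" % where
    return [row[0] for row in conn.execute(query)]


spectral_cubes = select_ids("obs_axes='/s/s/em/'")
with_spectral_axis = select_ids("obs_axes LIKE '%/em/%'")
time_series = select_ids("obs_axes='/t/'")
time_and_pol = select_ids(
    "obs_axes LIKE '%/t/%' AND obs_axes LIKE '%/pol/%'")
```

Note that the leading and trailing slashes are what make the LIKE
patterns safe: '%/t/%' cannot accidentally match a longer code that
merely contains a "t".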

The one drawback I can see is that the prevalence of % at the
beginning of patterns isn't really index-friendly, and hence queries
with *only* constraints of this type may involve all-table seqscans.
I'd claim that such queries would be fairly rare, since you'd usually
have additional constraints on position or other fields.

Again: If we have use cases that justify increasing field count by
20%, I retract this entire post.  All I'm saying is we shouldn't
increase the column count lightly.

Cheers,

           Markus