Obscore 1.1 Erratum 3: Drop obs_id non-NULL requirement

Thu Jul 7 10:40:01 CEST 2022

Dear Markus,
On 06/07/2022 at 19:50, Markus Demleitner wrote:
> Dear François,
>
> On Wed, Jul 06, 2022 at 05:44:46PM +0200, BONNAREL FRANCOIS wrote:
>> In the radio domain (see JIVE service for example) and also the High energy
>> domain we often face the case where several dataproducts are produced from
>> the same observation. But we can imagine services where some observations
>> contain several dataproducts and some others only a single one (just by
>> chance).
> Right.  But even if such archives do not want to use datalink in such
> a situation (which I suspect would almost always be the preferable
> solution now that we have datalink),

These use cases don't fit with the "main item to linked resource" 
DataLink scheme.

The basic item in ObsCore is a product whose characterization is 
consistent: s_fov, s_ra, s_dec, em_min, em_max, t_min, t_max, etc. 
should make sense and be selective enough. In the most general case 
that is not true of the observation as a whole.

An obvious example is a radio interferometry observation, where you get 
several targets for the same "observation" and several spectral windows, 
sometimes significantly apart.

>   that in no way depends on having
> obs_id mandatory, does it?
An observation is a different concept from a dataproduct/dataset. So the 
observation_id is really additional information in the most general 
case. That is the theoretical aspect. But the issue is pragmatic 
too. See below.
>> If you want to aggregate all the obs_publisher_did, or (s_ra, s_dec) or
>> whatever property of the products belonging to the same observations I think
>> the GROUP BY will fail if we relax "obs_id = null".
> Ummm... how so?  Of course, when a service that has this kind of
> thing *also* has datasets with obs_id NULL, all these will end up in
> a single aggregate, but that is, for all I can see, as good or as bad
> as any other arrangement in this situation; and even when data
> providers choose to do such a thing and users see unfavourable
> consequences, it's easy to fix by appending an "AND obs_id IS NOT
> NULL"; when people are savvy enough to reconstruct observations using
> GROUP BY, that clause will be a breeze for them.
I think I disagree there. The use case is to associate observations with 
all their derived dataproducts. Whether there happens to be a single 
dataproduct or several doesn't matter. And if there is a single one 
today, that could change another day if you are continuously 
processing your observations and producing new dataproducts.
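
To make the GROUP BY behaviour under discussion concrete, here is a 
minimal sketch in Python. It uses an in-memory SQLite table as a 
stand-in for ivoa.obscore; the table contents and identifiers are 
invented for illustration, and a real TAP service may well run a 
different database. It only shows that rows with obs_id NULL end up in 
one aggregate, and that a filter on obs_id removes that aggregate.

```python
import sqlite3

# Stand-in for ivoa.obscore with only the two columns that matter here.
# All identifiers below are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obscore (obs_id TEXT, obs_publisher_did TEXT)")
conn.executemany(
    "INSERT INTO obscore VALUES (?, ?)",
    [
        ("obsA", "ivo://example/a?p1"),
        ("obsA", "ivo://example/a?p2"),
        ("obsB", "ivo://example/b?p1"),
        (None, "ivo://example/x?p1"),  # obs_id left NULL
        (None, "ivo://example/y?p1"),  # obs_id left NULL
    ],
)

# Plain GROUP BY: all NULL obs_id rows fall into a single group.
all_groups = dict(
    conn.execute("SELECT obs_id, COUNT(*) FROM obscore GROUP BY obs_id")
)

# Same aggregate, restricted to rows that do carry an obs_id.
non_null_groups = dict(
    conn.execute(
        "SELECT obs_id, COUNT(*) FROM obscore "
        "WHERE obs_id IS NOT NULL GROUP BY obs_id"
    )
)

print(all_groups)
print(non_null_groups)
```

Note that the unrelated datasets x and y are merged into one group in 
the first query, which is exactly the situation both sides of this 
thread are arguing about.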
>
> Perhaps it would help if you wrote down a concrete use case and a
> query addressing it that has a less desirable outcome when we drop
> the requirement on *all* obscore services to have obs_id non-NULL.
> Note that of course individual data providers are still free to have
> local non-NULL constraints if their actual data holdings require
> that.

Query the associated DataLink services to get all the links of an 
observation (meaning all the dataproducts derived from this 
observation). For this we first need to get the list of 
obs_publisher_did values for each observation and use them in a 
multi-ID DataLink query.

(This would also require knowing the DataLink root URL for each service.)

Something like

"select obs_id, string_agg(obs_publisher_did, ',') as publish_did_list 
from obscore group by obs_id"

Then parse publish_did_list to build the DataLink URL 
https://Organisation/dl-root?ID=...&ID=...&ID=...
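
The two steps above could be sketched as follows in Python. This is 
only an illustration under several assumptions: an in-memory SQLite 
table replaces ivoa.obscore, SQLite's group_concat replaces string_agg, 
the identifiers are invented, and DL_ROOT is a hypothetical DataLink 
root URL. Splitting on ',' also assumes that no obs_publisher_did 
contains a comma.

```python
import sqlite3
from urllib.parse import urlencode

DL_ROOT = "https://example.org/dl-root"  # hypothetical DataLink root URL


def datalink_url(root, did_list):
    """Build a multi-ID DataLink URL from a comma-separated DID list."""
    # Assumes no DID contains a comma; each DID becomes one ID parameter.
    ids = did_list.split(",")
    return root + "?" + urlencode([("ID", did) for did in ids])


# Stand-in for ivoa.obscore; all identifiers are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obscore (obs_id TEXT, obs_publisher_did TEXT)")
conn.executemany(
    "INSERT INTO obscore VALUES (?, ?)",
    [
        ("obsA", "ivo://example/a?p1"),
        ("obsA", "ivo://example/a?p2"),
        ("obsB", "ivo://example/b?p1"),
    ],
)

# group_concat plays the role of string_agg in the query above.
rows = conn.execute(
    "SELECT obs_id, group_concat(obs_publisher_did, ',') AS publish_did_list "
    "FROM obscore GROUP BY obs_id"
).fetchall()

# One DataLink URL per observation, keyed by obs_id.
urls = {obs_id: datalink_url(DL_ROOT, did_list) for obs_id, did_list in rows}

for obs_id, url in urls.items():
    print(obs_id, url)
```

urlencode percent-encodes each identifier, which is needed because 
obs_publisher_did values typically contain characters such as ':', '/' 
and '?'.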

>> And it's easy to create obs_id from obs_publisher_did in the case of unique
>> dataproduct in an observation
> The problem is not *filling* obs_id.  The problem is *validating* the
> non-NULL requirement, which is fairly resource-intensive (a seqscan
> of the entire ivoa.obscore table, or maintaining an appropriate index
> on all tables contributing to ivoa.obscore).

I must confess, I'm not very familiar with validators and have to trust 
you there.

But anything important will require some resource consumption, and I 
still think the observation is an important concept and obs_id is very 
useful.

Can other people comment?

Cheers

François

>
> I'd still suggest we should only require this investment from our
> adopters if we actually have a good reason to do so (as in: X breaks
> if we don't).  And that I still can't see.
>
>         -- Markus