[Heig] Re: Post running meeting thoughts

Thu Apr 2 18:35:37 CEST 2026

Hi Tess,

I think whether end users want to search for specific datasets varies depending on the data collection, the types of data, and the number and size of data products in the collection.

If the data collection consists solely of individual observations that have been processed through standard data processing pipelines, and that need further user analysis requiring the set of associated data products to extract science, then I agree that the ability to search for the individual observation event list (or perhaps event bundle), with associated data products accessible using datalink, is likely sufficient.

If the data collection contains advanced data products (for example, the Chandra Source Catalog data products), the usage patterns change.  Our experience is that users doing catalog science typically identify potential sources of interest and then want to retrieve subsets of data products for those sources, often in several rounds.

For example, they may identify hundreds, thousands, or in some cases tens of thousands of candidate sources matching their search criteria, and may subsequently download (e.g.) the light curves for all of the observations of these sources (on average 3 times as many as the number of sources), do some automated pre-filtering on the light curves, and then download (e.g.) the cutout event lists surrounding the individual observation detections for further analysis.  They might subsequently come back to download the region definitions, and perhaps the individual observation PHA spectra of the detections.

This is a very different usage pattern where end users are retrieving particular data products for potentially a large number of objects, and subsequently refining the list and downloading additional data products, sometimes in multiple steps.  One reason for this approach is scale.  For example, there are roughly 100x the number of data products, and 10x the data volume, for the Chandra Source Catalog vs. the Chandra data archive data products for the set of processed science observations.

Could this be done by requiring end users to search for observations and then using datalink to access the individual data products?  Probably not, because many of our data products merge data from multiple observations and it would be very difficult to encode the necessary source — stack detection — observation detection linkages correctly.  In any case, doing queries like this en masse and then having to select subsets of datalinks is going to be much more difficult than a simple ObsCore query that directly returns the records (and access_urls) that you are looking for.
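
As a rough illustration of the "simple ObsCore query" case, the sketch below assembles the kind of ADQL that returns access_urls directly, once for light curves and once more for event lists after the candidate list has been refined.  It is only a sketch under stated assumptions: the helper name build_obscore_query, the collection label 'CSC', and the source names are hypothetical placeholders, and the mapping of light curves to dataproduct_type 'timeseries' and event lists to 'event' is assumed rather than taken from any actual service.

```python
def build_obscore_query(target_names, dataproduct_type, collection):
    """Build an ADQL query returning access URLs for one product type.

    Hypothetical helper: only the ObsCore column names are standard;
    the values passed in are placeholders chosen for illustration.
    """
    if not target_names:
        raise ValueError("at least one target name is required")

    def quote(value):
        # ADQL string literals escape a single quote by doubling it.
        return "'" + value.replace("'", "''") + "'"

    targets = ", ".join(quote(name) for name in target_names)
    return (
        "SELECT obs_id, target_name, access_url, access_format "
        "FROM ivoa.obscore "
        f"WHERE obs_collection = {quote(collection)} "
        f"AND dataproduct_type = {quote(dataproduct_type)} "
        f"AND target_name IN ({targets})"
    )


# Round 1: light curves for every candidate source (placeholder names).
candidates = ["SRC-A", "SRC-B", "SRC-C"]
light_curve_query = build_obscore_query(candidates, "timeseries", "CSC")

# Round 2: event lists only for the sources that survived pre-filtering.
shortlist = ["SRC-B"]
event_query = build_obscore_query(shortlist, "event", "CSC")

print(light_curve_query)
print(event_query)
```

Each round here is a single query whose result rows already carry the access_urls to download, with no intermediate step of selecting subsets of datalinks per observation.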

With regard to your specific question about RMFs: I don’t know that users will download RMFs without either concurrently or previously downloading the PHA.  Occasionally users will search for RMFs (and ARFs) separately from PHA spectra because they have previously retrieved the latter.  On the other hand, we do, for example, see end users downloading PSFs independently from the primary datasets.  This is likely because the catalog includes a vast set of high-quality PSFs (of order 10M) covering the entire Chandra field of view, and PSFs are rather expensive to generate.

We have specifically tried very hard to focus on data discovery in the proposed ObsCore extensions note, and have used actual experience (how we see our users wanting to work) to help guide our proposals.

Thanks,
—Ian

> On Mar 20, 2026, at 11:26, Jaffe, Tess (GSFC-6601) via heig <heig at ivoa.net> wrote:
> 
> Hi everybody,
>  
> I agree with Francois on a number of things, but especially that there is a lot of misunderstanding and misrepresentation going on here.  Nobody has ever expressed reluctance to ensure that HEA-specific ancillary products such as responses etc. are made available easily through VO protocols.  Let’s focus on what the issue actually is, because I think the discussion has lost sight of it.
>  
> In my opinion, the main issue is not whether things like response matrices are science data, are needed by the users, or should be in the VO.  I think we all agree that this is obvious.  The question is what is the best method for making them accessible in the needed context  and how far we need to customize what goes in the ObsCore table itself for different fields.  That then is a question about discoverability and complexity.   
>  
> Having an individual row in an ObsCore table enables a user to search for that one specific thing.  The best practice recommendation for the use of ObsCore is that the access_url be a datalink.  So for a given product listed in an ObsCore table, three queries are needed:  one to find the product, one to get its datalinks, and then one to download the file(s).  I cannot recall having heard of a use case where somebody was interested in finding only the RMFs from a given instrument in a given year.  (Please let me know if you have a use case for this so that we can address it directly. I can think of calibration projects, but this is an edge case that can be addressed another way.)  Users will instead want to find all of the spectra from some source/time/waveband.  That is why ObsCore has a row for such a product.  Nobody disputes that to do the scientific analysis on that spectrum requires the user to also have an RMF.  But that RMF does not need to be independently discoverable, just correctly linked to the spectrum that is of interest. 
>  
> Francois has proposed a number of solutions to this.  ObsCore has a very reasonable amount of flexibility and specificity, and it is quite important to worry about adding unnecessary complexity and size. (I myself was worried about the additional complexity of the datalink layer, but now in implementation, I’m becoming a fan.) The radio extension doc you may note proposes a number of fields that are all about discovery.  It then states, “Auxiliary datasets such as uv distribution map, dirty beam maps, frequency/amplitude plots, phase/amplitude plots are useful for astronomers to check data quality. In that case DataLink … may provide a solution to attach these auxiliary data to ObsCore records.”  That makes sense to me.  
>  
> So I suggest we follow what the radio folks are doing.  With this in mind, I think that three of the proposed columns -- T_intervals , Obs_mode , Event_type – are very clearly applicable to data discovery and should  be added to the ObsCore table.  But some of the other proposed fields would be better added in datalinks with a HEA-specific vocabulary.  We should discuss these on a case-by-case basis after having agreed on the purpose of a row in ObsCore.  
>  
> I hope this helps the discussion move along productively.
> 
> Tess
> 

—

Dr. Ian Evans
Astrophysicist
Chandra X-ray Center
Center for Astrophysics | Harvard & Smithsonian

Office: (617) 496 7846 | Cell: (617) 699 5152
60 Garden Street | MS 81 | Cambridge, MA 02138

cfa.harvard.edu <http://cfa.harvard.edu/> | Facebook <http://cfa.harvard.edu/facebook> | Twitter <http://cfa.harvard.edu/twitter> | YouTube <http://cfa.harvard.edu/youtube> | Newsletter <http://cfa.harvard.edu/newsletter>