Hotwired 3 TDIG Breakout

Mon Nov 18 21:17:39 PST 2013

Good comments so far.  Suggest focusing on the use cases up front.  Without speculating on solutions, suggest that they be kept general purpose.  Would not assume that tens of thousands of transients per LSST visit must correspond to tens of thousands of VOEvent alert packets (though might correspond to that number of cutouts or other per-event data structures - on the other hand, one could amalgamate the cutouts into a data cube).

For instance one might entertain a hierarchical event structure, perhaps even a single alert per visit to provide telescope pointing, timestamp and provenance.  This could reference a tabular structure, one row per transient/variable celestial event in the field with positional offsets.  The table would have columns referencing the cutouts, etc.
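
To make that concrete, here is a minimal sketch of what such a hierarchy might look like in memory.  Everything in it is an assumption for illustration only: the class names (VisitAlert, TransientRow), the field names, and the choice of arcsecond offsets are hypothetical and not part of any VOEvent schema or LSST design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TransientRow:
    """One row per transient/variable in the field (hypothetical layout)."""
    row_id: int
    d_ra_arcsec: float   # offset from the visit pointing, assumed units
    d_dec_arcsec: float
    cutout_ref: str      # reference to a cutout stored elsewhere in the container


@dataclass
class VisitAlert:
    """Single per-visit alert carrying pointing, timestamp and provenance."""
    visit_id: str
    ra_deg: float
    dec_deg: float
    timestamp_utc: str
    provenance: str
    rows: List[TransientRow] = field(default_factory=list)

    def add_row(self, d_ra_arcsec: float, d_dec_arcsec: float, cutout_ref: str) -> TransientRow:
        # Row ids are simply assigned in insertion order.
        row = TransientRow(len(self.rows), d_ra_arcsec, d_dec_arcsec, cutout_ref)
        self.rows.append(row)
        return row


alert = VisitAlert("visit-0001", 150.0, 2.2, "2013-11-18T21:17:39Z", "hypothetical pipeline")
alert.add_row(12.5, -3.0, "cutouts/000000.fits")
alert.add_row(-40.1, 17.8, "cutouts/000001.fits")
```

The point is only that the per-visit information is stated once, and the per-transient rows stay small.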

The container should be able to support zero, one, a few, or many VOEvent packets and associated files / data objects.  (Notionally an epub structure does seem a reasonable place to start, but it would be good to have some idea of how this scales to hundreds of thousands of sub-documents; tar certainly scales to many thousands of files.)  Or perhaps it might be a generic container mechanism that just happens to contain events.

Presumably the registry will want to know about flavors of containers, too.

Compression should be a consideration early in the process.  Note that the tradeoffs for many small files (e.g., FITS, but more general than this) are not the same as for a few large files.  Also checksums / message digests.
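
As a rough illustration of the small-file tradeoff, the sketch below compares compressing many small, similar payloads one at a time against compressing them as a single stream, and computes a SHA-256 digest per payload.  The payloads are synthetic stand-ins (a repeated header-like line plus an id), not real FITS files or VOEvents, and zlib is used only because it is in the Python standard library.

```python
import hashlib
import zlib


def make_payloads(n=200):
    # Synthetic small files sharing most of their content (an assumption).
    header = b"SIMPLE  =                    T / hypothetical header card\n" * 10
    return [header + f"OBJECT  = transient-{i:05d}\n".encode("ascii") for i in range(n)]


def per_file_size(payloads):
    # Each payload compressed on its own, as in a zip-style archive.
    return sum(len(zlib.compress(p)) for p in payloads)


def solid_size(payloads):
    # All payloads compressed as one stream, as in a tar.gz-style archive.
    return len(zlib.compress(b"".join(payloads)))


def digests(payloads):
    # One SHA-256 message digest per payload.
    return [hashlib.sha256(p).hexdigest() for p in payloads]


payloads = make_payloads()
print("per-file:", per_file_size(payloads), "bytes")
print("solid:   ", solid_size(payloads), "bytes")
```

The single stream wins on size here because the redundancy is shared across files, but the price is that one member can no longer be extracted without decompressing what precedes it.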

Might consider use cases for incremental containers.  One could imagine a workflow that emits bursts of events that are not explicitly cadenced by visit - for instance, the events might be collected / partitioned by class (e.g., moving objects, known variables, "new" transients) for different streams that might have very different duty cycles than the nominal 60s / visit.  Each visit might then generate a delta update to a prior container which is accumulated for minutes, hours or even a whole night.  This accumulation could happen early in the workflow, or perhaps well downstream.
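
A minimal sketch of the delta-update idea, assuming an uncompressed tar as the container (Python's tarfile append mode does not work on compressed archives).  The function names and the member naming scheme are hypothetical.

```python
import io
import tarfile


def append_delta(container_path, members):
    """Append {name: bytes} to a tar container, creating it if needed."""
    with tarfile.open(container_path, "a") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def list_members(container_path):
    """Return member names in the order they were accumulated."""
    with tarfile.open(container_path, "r") as tar:
        return tar.getnames()
```

Each visit would call append_delta() on the container for its stream, and the accumulated file could be compressed once when the stream is closed out.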

It should be possible to combine and divide containers, perhaps to sort / reorder them, maybe more boutique operations.
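
For what it is worth, with a plain tar these operations are only a few lines each.  The sketch below works on in-memory tar bytes and assumes members are named "class/file" so that they can be divided by class; that naming convention, and all the function names, are hypothetical.

```python
import io
import tarfile


def read_members(tar_bytes):
    """Return {name: data} for the regular files in an uncompressed tar."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for info in tar:
            if info.isfile():
                out[info.name] = tar.extractfile(info).read()
    return out


def write_members(members):
    """Write {name: data} to tar bytes, members sorted by name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(members):
            data = members[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def combine(*containers):
    """Merge several containers; later ones win on a name clash."""
    merged = {}
    for c in containers:
        merged.update(read_members(c))
    return write_members(merged)


def divide(container, key):
    """Split one container into several, grouped by key(member_name)."""
    groups = {}
    for name, data in read_members(container).items():
        groups.setdefault(key(name), {})[name] = data
    return {k: write_members(v) for k, v in groups.items()}
```

For example, divide(c, lambda name: name.split("/")[0]) would give one container per event class.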

Rob
--

On Nov 18, 2013, at 3:59 AM, John Swinbank <swinbank at transientskp.org> wrote:

> Dear all,
> 
> It seems to me there are three reasons why one might want a VOEventContainer style system:
> 
> 1. Transportation — it is possible to ship an event plus supporting data (thumbnails, etc) as a single unit, rather than relying on the recipients fetching URLs to retrieve supporting data. That potentially eases the load on the originator (who no longer needs to worry about a horde of call-back requests after issuing an alert, nor about long-term hosting), and on the recipient (who has all the information they need to respond immediately, rather than waiting to retrieve more data before they can make an educated decision).
> 
> 2. Convenience — we can store and reason about large groups of events at a time, rather than millions of individual events.
> 
> 3. Space efficiency — we can avoid duplicated information within the container.
> 
> Are there others I’ve missed?
> 
> The simplest such system would, indeed, be just to make a giant zip/tar.gz/whatever which contained all the events and whatever else you wanted to bundle with them. We should be clear if & why that’s inadequate before deciding to build something else.
> 
> Some potential issues for consideration:
> 
> - What is legitimate to include in the container? Just VOEvents (one per .xml file?) and images (FITS?)? Other XML or image formats? Arbitrary data files?
> 
> - Is it useful to attach metadata to the container? For example, describing the creator of the container and an inventory of its contents.
> 
> - How much structure do we need within the container? For example, should all events be named in a standard way? Doing so might avoid the need for an inventory: you can just use a listing of the container contents instead.
> 
> - Would the VOEvent format itself need modifying? In particular, is it legitimate for an event to use a file:// based URI, which is only valid within the context of the container?
> 
> - Do we regard the container contents as canonical, or does it act as a cache for material stored elsewhere? One could imagine using fully-blown remote URIs for references within the container, but providing a means of transforming that URI into a local one (e.g., replace "http://xx.yy/" with "file://"). The reader does the transformation, checks if the resource is available within the container, and, if not, fetches it from the remote URI.
> 
> - We could imagine saving space by making heavy use of references within the container to avoid storing duplicate information. However, my hunch is that, as Rick suggests, the amount of saved space might not be substantial, and it adds complexity. We could also go wild, and do something like a full-blown template engine: that is, provide a skeleton laying out the structure of all events within the container, as well as separate lists of content for insertion into the event. I’d imagine you could get significant savings here, but again the complexity goes up, and you can no longer (trivially) extract a single event from the container just by using zip or tar; you now have to go through the process of filling in the template to get at the data.
> 
> I have no answers to the above at the moment, and I’m not necessarily arguing that all of the above are smart or worthwhile ideas, but we should at least consider these sorts of issues and be clear about the use cases being addressed.
> 
> Cheers,
> 
> John
> 
> 
> On 17 Nov 2013, at 3:03 , Frederic V. Hessman <Hessman at Astro.physik.Uni-Goettingen.DE> wrote:
> 
>> The fundamental problem isn’t to zip or not to zip (gzipped tar files should be just as good).
>> 
>> The problem is that the "manifest" was supposed to contain basic info normally present in a complete VOEvent document so that the packaged events could reference it and hence be smaller.  This is an understandable approach from the perspective of the LSST but means that the multiple original events issued in such a package cannot be fully viewed outside of the package without the manifest.
>> 
>> This suggests to me that we should include a generic “include” mechanism for VOEvents:  the LSST promises to maintain its own list of referenced content so that a stripped down event can be reconstituted without reference to the package in which it was distributed.
>> 
>> However, I took the standard VOEvent example and looked to see how to trim it down to its bare-bones size, also with the aid of external references, but frankly, I don’t see how much gain is to be had.  It would be nice to have a brute force complete LSST VOEvent to play with….
>> 
>> Isn’t the sending of packages of complete VOEvents in a zip file enough data compression?  For this, we don’t need anything new.
>> Rick
>> 
>> 
>> On 16 Nov 2013, at 13:14, Roy Williams <roy at caltech.edu> wrote:
>> 
>>> OK I think I see now. There are two kinds of container here:
>>> 
>>> (1) How to put a single VOEvent together with images and other binary content. Zip would be a good way to do this, with images and other files referenced by file:// links. It could have a suffix .xmlx for example.
>>> 
>>> (2) How to collect together a lot of (1) into a single big object for data transfer. Here we can have a manifest.xml which is data about the collection itself, common metadata etc, together with all the .xmlx files. Then all these can be zipped up into a .xmlxx file.
>>> 
>>> Is that what we agreed?
>>> Roy
>>> 
>>> On 11/16/13 11:30 AM, Matthew Graham wrote:
>>>> Hi,
>>>> 
>>>> I agree with John - people wanted some sort of structure to collect
>>>> common metadata. Tim Jenness suggested looking at epub which is zip
>>>> plus a structured inventory.
>>>> 
>>>> Cheers,
>>>> 
>>>> Matthew
>>>> 
>>>> On Nov 16, 2013, at 10:46 AM, Roy Williams wrote:
>>>> 
>>>>> 
>>>>> On 11/15/13 5:29 PM, John Swinbank wrote:
>>>>>> - Mike introduced his proposal for a “VOEventContainer”:
>>>>>> essentially a means of bundling multiple VOEvents together with
>>>>>> supporting data such as images into a single entity. This could
>>>>>> address both issues surrounding bulk transport as well as the
>>>>>> stated aim of the LSST folks to include cut-out images with their
>>>>>> events. The proposal received a positive response from the
>>>>>> audience, but there was some quibbling over technical details.
>>>>>> Mike will introduce his proposal and kick off a discussion as to
>>>>>> its implementation on the mailing list.
>>>>> 
>>>>> I heard the opposite of this. My impression was that there was a
>>>>> rejection of the idea of inventing a special IVOA format for
>>>>> bundling content. I heard a consensus that *zip* is a perfectly
>>>>> good way to bundle multiple events, to bundle events with binary
>>>>> content such as images. That zip is well known, widely implemented,
>>>>> and trusted. It is the solution used in similar cases, such as epub
>>>>> and docx format, where an XML document is bundled with images. Try
>>>>> it yourself:
>>>>> 
>>>>>     cp document.docx document.zip
>>>>>     unzip document.zip
>>>>> 
>>>>> What I heard is that this is a solved problem, and there is no need to
>>>>> invent anything new for bundling XML files and their images.
>>>>> 
>>>>> Roy
>>>>> 
>>>>> --- Caltech LIGO roy at caltech.edu 626 395 3670
>>>>> 
>>>> 
>>> 
>>> -- 
>>> ---
>>> Caltech LIGO
>>> roy at caltech.edu
>>> 626 395 3670
>> 
>