Hotwired 3 TDIG Breakout

Tim Jenness tjenness at cornell.edu
Tue Nov 19 09:14:03 PST 2013


I’ve just managed (finally) to get myself subscribed to this list, so I haven’t seen the earlier messages. I’ll reply to the thread.

On Nov 18, 2013, at 22:17 , Rob Seaman <seaman at noao.edu> wrote:

> Good comments so far.  Suggest focusing on the use cases up front.  Without speculating on solutions, suggest that they be kept general purpose.  Would not assume that tens of thousands of transients per LSST visit must correspond to tens of thousands of VOEvent alert packets (though might correspond to that number of cutouts or other per-event data structures - on the other hand, one could amalgamate the cutouts into a data cube).
> 

One thing that has to be handled is the case of a single event that comes with cutouts and other supporting information. Is that handled by VOEvent v3, or as a VOEventContainer with a VOEvent v2 description referring to supporting binaries?


> For instance one might entertain a hierarchical event structure, perhaps even a single alert per visit to provide telescope pointing, timestamp and provenance.  This could reference a tabular structure, one row per transient/variable celestial event in the field with positional offsets.  The table would have columns referencing the cutouts, etc.
> 
> The container (notionally a epub structure does seem a reasonable place to start, but it would be good to have some idea of how this scales to hundreds of thousands of sub-documents - tar certainly scales to many thousands of files) should be able to support zero, one, a few, or many VOEvent packets and associated files / data objects.  Or perhaps it might be a generic container mechanism that just happens to contain events.
> 

I’m pretty sure that zip and tar are no different in this regard. The advantage of zip is that it’s what everyone else is doing, so programmatically unpacking the zip in your application and looking around inside it is a solved problem.
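
For example, Python’s standard zipfile module is all you need (the event.voec name and the member names here are illustrative only):

    import zipfile

    # Open a hypothetical VOEvent container (a zip file by another name)
    with zipfile.ZipFile("event.voec") as zf:
        for name in zf.namelist():    # inventory of the bundled files
            print(name)
        xml = zf.read("voevent.xml")  # pull out a single member directly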



> Presumably the registry will want to know about flavors of containers, too.
> 
> Compression should be a consideration early in the process.  Note that the tradeoffs for many small files (e.g., FITS, but more general than this) are not the same as for a few large files.  Also checksums / message digests.
> 

Well, there are two sides to this. The zip container implementation gives you standard compression automatically (which is critical for XML), and nothing should care if the FITS file itself is tile-compressed (modulo the FITS I/O library understanding it; see my earlier discussion of FITS standard versus FITS convention).



> Might consider use cases for incremental containers.  One could imagine a workflow that emits bursts of events that are not explicitly cadenced by visit - for instance, the events might be collected / partitioned by class (e.g., moving objects, known variables, "new" transients) for different streams that might have very different duty cycles than the nominal 60s / visit.  Each visit might then generate a delta update to a prior container which is accumulated for minutes, hours or even a whole night.  This accumulation could happen early in the workflow, or perhaps well downstream.
> 

This folds in somewhat with the digital signing: each time you increment the container, it gets a different signature.


> It should be possible to combine and divide containers, perhaps to sort / reorder them, maybe more boutique operations.
> 

And you get a different signature each time you do that as well. That’s fine if it’s LSST doing it and they re-sign it each time; if a broker were doing it, you’d lose the LSST signature and it would have to be re-signed by the broker.
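
As a toy illustration of why, using a plain SHA-256 digest to stand in for a real detached signature:

    import hashlib

    def container_digest(path):
        # Stand-in for a detached signature: any repackaging, reordering,
        # or incremental update changes the bytes, and hence this value.
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    print(container_digest("event.voec"))  # hypothetical container name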

> 
> On Nov 18, 2013, at 3:59 AM, John Swinbank <swinbank at transientskp.org> wrote:
> 
>> Dear all,
>> 
>> It seems to me there are three reasons why one might want a VOEventContainer style system:
>> 
>> 1. Transportation — it is possible to ship an event plus supporting data (thumbnails, etc) as a single unit, rather than relying on the recipients fetching URLs to retrieve supporting data. That potentially eases the load on the originator (who no longer needs to worry about a horde of call-back requests after issuing an alert, nor about long-term hosting), and on the recipient (who has all the information they need to respond immediately, rather than waiting to retrieve more data before they can make an educated decision).
>> 

Yes. Putting the cutout in the thing you ship around the place costs more network traffic but has many benefits downstream. It also allows the cutout on the originating server to be moved around without having to guarantee that a URL will work forever.


>> 2. Convenience — we can store and reason about large groups of events at a time, rather than millions of individual events.
>> 
>> 3. Space efficiency — we can avoid duplicated information within the container.
>> 
>> Are there others I’ve missed?
>> 
>> The simplest such system would, indeed, be just to make a giant zip/tar.gz/whatever which contained all the events and whatever else you wanted to bundle with them. We should be clear if & why that’s inadequate before deciding to build something else.
>> 
>> Some potential issues for consideration:
>> 
>> - What is legitimate to include in the container? Just VOEvents (one per .xml file?) and images (FITS?)? Other XML or image formats? Arbitrary data files?
>> 

I can’t think of any reason why we shouldn’t allow PNG versions of the cutout. We’ll have a MIME type for it.


>> - Is it useful to attach metadata to the container? For example, describing the creator of the container and an inventory of its contents.
>> 

Yes. If you unpack a .epub (just rename it to .zip and unzip it) you’ll find a ".opf" file, which is an XML file containing things like the publisher information and a manifest of all the files that are expected to be found in the epub.

http://www.idpf.org/epub/30/spec/epub30-overview.html

In EPUB 3 this seems to be called the Package Document.
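
If we adopted the same idea, reading the manifest back out of the container is trivial; a minimal sketch, assuming illustrative names (event.voec, contents.opf) and the standard OPF namespace:

    import zipfile
    import xml.etree.ElementTree as ET

    with zipfile.ZipFile("event.voec") as zf:
        root = ET.fromstring(zf.read("contents.opf"))
        # EPUB package documents use this namespace
        ns = {"opf": "http://www.idpf.org/2007/opf"}
        for item in root.findall(".//opf:item", ns):
            print(item.get("href"), item.get("media-type"))
            data = zf.read(item.get("href"))  # relative hrefs resolve against the zip root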

>> - How much structure do we need within the container? For example, should all events be named in a standard way? Doing so might avoid the need for an inventory: you can just use a listing of the container contents instead.
>> 
>> - Would the VOEvent format itself need modifying? In particular, is it legitimate for an event to use a file:// based URI, which is only valid within the context of the container?
>> 

It’s not actually a file:// URI. It would surely be done as a relative link from the root of the zip file. 

  <item href="cutout1.fits" id="cutout1" media-type="image/fits" />

just like in an HTML file with a specified BASE. There’s no reason why it couldn’t also include a DOI or an external URL (some ebooks may do that for video files that they don’t want to bundle).


>> - Do we regard the container contents as canonical, or does it act as a cache for material stored elsewhere? One could imagine using fully-blown remote URIs for references within the container, but providing a means of transforming that URL into a local one (eg, replace "http://xx.yy/" with "file://"). The reader does the transformation, checks if the resource is available within the container, and, if not, they can fetch it from the remote URI.
>> 
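
That reader-side logic would be simple enough; a sketch, with the URI-to-member mapping purely illustrative rather than a proposed standard:

    import zipfile
    import urllib.request
    from urllib.parse import urlparse

    def fetch(zf, uri):
        # Container-as-cache: try the local member first, then go remote
        local = urlparse(uri).path.lstrip("/")
        try:
            return zf.read(local)
        except KeyError:
            with urllib.request.urlopen(uri) as resp:
                return resp.read()
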
>> - We could imagine saving space by making heavy use of references within the container to avoid storing duplicate information. However, my hunch is that, as Rick suggests, the amount of saved space might not be substantial, and it adds complexity. We could also go wild, and do something like a full-blown template engine: that is, provide a skeleton laying out the structure of all events within the container, as well as separate lists of content for insertion into the event. I’d imagine you could get significant savings here, but again the complexity goes up, and you can no longer (trivially) extract a single event from the container just by using zip or tar — you now have to go through the process of filling in the template to get at the data.
>> 

It might be worth testing, but compression algorithms should be able to spot XML duplication.
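
One thing to remember when testing: zip compresses each member independently (and deflate only looks back 32 KiB), so duplication across separate event files in a zip isn’t shared, whereas a gzipped tar compresses the whole stream. A quick way to measure the within-stream case:

    import zlib

    # Crude check of how well deflate spots repeated XML in one stream
    event = b"<VOEvent><What>...</What><Where>...</Where></VOEvent>"
    stream = event * 1000
    print(len(stream), "->", len(zlib.compress(stream, 9)))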

>> 
>> On 17 Nov 2013, at 3:03 , Frederic V. Hessman <Hessman at Astro.physik.Uni-Goettingen.DE> wrote:
>> 
>>> The fundamental problem isn’t to zip or not to zip (gzipped tar files should be just as good).
>>> 

I’m not convinced a two-stage gunzip + untar is as good as a single-stage unzip. Are there any file formats in the wild using .tar.gz but hiding it? I agree that the principle should be agreed first: are we going to try to inline the FITS/image files as base64, or go for a bundling approach? Obviously I favor bundling.

>>> 
>>> 
>>> Isn’t the sending of packages of complete VOEvents in a zip file enough data compression? For this, we don’t need anything new.
>>> Rick
>>> 

I think that’s probably true.


>>> 
>>> On 16 Nov 2013, at 13:14, Roy Williams <roy at caltech.edu> wrote:
>>> 
>>>> OK I think I see now. There are two kinds of container here:
>>>> 
>>>> (1) How to put a single VOEvent together with images and other binary content. Zip would be a good way to do this, with images and other files referenced by file:// links. It could have a suffix .xmlx for example.
>>>> 

See above. I think it wouldn’t need the file://.

.xmlx sounds like a bad idea to me because it doesn’t tell me that it has anything to do with VOEvents. Something like .voec (VOEvent Container) seems more explicit.


>>>> (2) How to collect together a lot of (1) into a single big object for data transfer. Here we can have a manifest.xml which is data about the collection itself, common metadata etc, together with all the .xmlx files. Then all these can be zipped up into a .xmlxx file.
>>>> 
>>>> Is that what we agreed?

It has my vote. The trick then is to agree on how you place all this information in the directory/zip in a structured way.

You could have a contents.opf file with all the manifest information, a container.xml providing some overall structure, each VOEvent sitting in its own XML file, and all the FITS/PNG images alongside.
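
A sketch of writing such a container; every name here (contents.opf, the .voec suffix, the event file naming) is a placeholder rather than a proposal:

    import zipfile

    manifest_xml = "<package>...</package>"  # OPF-style manifest (placeholder)
    voevent_xml = "<VOEvent>...</VOEvent>"   # one event per file (placeholder)
    fits_bytes = b"..."                      # cutout data would go here

    with zipfile.ZipFile("20131119.voec", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("contents.opf", manifest_xml)
        zf.writestr("events/event0001.xml", voevent_xml)
        zf.writestr("images/cutout1.fits", fits_bytes)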

--
Tim Jenness




