Axes in Obscore

Thu Apr 23 11:57:39 CEST 2015

Dear DAL, dear DM,

On Thu, Apr 23, 2015 at 09:51:46AM +0200, Marco Molinaro wrote:
> regarding this topic I have a small use case that comes from a (currently
> custom) set of services whose aim is to allow velocity spectra analysis of
> galactic FITS cubes.

That's a perfect use case for obscore+datalink, I'd say.

> a - a super-set of FITS cubes from non-homogeneous galactic surveys and
> pointed archives in the radio band is deployed to allow velocity spectra
> analysis
> b - the first step for the user is to search in this set for available data
> along a line-of-sight, with possible filtering on a cone around it, or a
> box around it, limiting the velocity range, selecting explicitly one/more
> survey(s) by name, species, transition, ...

It seems to me most of the necessary metadata already is in obscore,
no?

> c - the search output (which includes something along the lines of a
> PublisherDID) is then used to explicitly cut the needed cube(s) to make
> data transfer affordable (in the near future merging of adjacent
> "same-survey" cubes will be also implemented)

And here I'd argue that's a Datalink thing.  There are just too many
sorts of processing one could do on data products to have any
hopes of describing them in a single database table, and datalink
lets you do exactly what you're asking for with minimal overhead on
both the table and the client (it will typically have to request one
small file per cutout, of course, but given the transfer volumes
we're talking about here on the data side that's negligible).

Conversely, just having the pixel sizes of the cube (as in the +6
columns proposal) won't really help you for your use case either, and
even if that information could, you'd still have to have some
descriptor of the access service somewhere, and so you'd have to use
datalink either way.

> The need for WCS information in the output of the search comes from the
> idea of allowing the client side to build correct cutout queries to the

Well, let me do a general plea here: "Keep data discovery and data
access separate as much as you can."  Datalink is the model to
efficiently perform that separation.

> I take the chance of this mail thread to give here also my 2 cents on the
> regexp approach Markus described in the DM-listed starting post on this
> topic (http://mail.ivoa.net/pipermail/dm/2015-April/005150.html).
> I tend to agree with Laurent's reply content.

Uh, the regex was the specification of the column contents (see
below).  True, for certain use cases you'd have to use SQL LIKE in
queries, but that's not terribly unusual either.

> I agree adding fields to a table is something we should care about, but
> packaging information is not usually my preferred solution.

Well, it's denormalisation, but obscore is all about denormalisation.
There's nothing wrong with that, many database schemata work like
that (essentially, most views are about denormalisation when you look
closely enough).

Since I'm already talking, let me reply to Laurent's mail at
http://mail.ivoa.net/pipermail/dal/2015-April/007049.html, too:

> Let me express a bit of skepticism about the idea of gathering 
> information related to several columns in one encoded string (a RegExp 
> what's more).

First, the regexp is just the formal specification of the contents of
the column.  If this sounds too PHP-ish for you, you could say the
contents are specified by a right-linear grammar (and they would be;
here's the same thing in EBNF:

axes_spec = axis_descriptor { '/' axis_descriptor }
axis_descriptor = "s" | "em" | "pol" | "t" | ...
)
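
To make the grammar a bit more tangible, here is a minimal Python
sketch of a checker for such strings.  It is an illustration under
assumptions, not part of any proposal text: the function name
parse_axes_spec is made up, and the descriptor list only contains the
four values spelled out above (the grammar leaves the list open, so
the real set would be longer).

```python
import re

# Only the four descriptors written out in the grammar above; the list
# there is open-ended ("..."), so this tuple is deliberately incomplete.
KNOWN_DESCRIPTORS = ("s", "em", "pol", "t")

_AXIS = "|".join(re.escape(d) for d in KNOWN_DESCRIPTORS)
# axes_spec = axis_descriptor { '/' axis_descriptor }
AXES_SPEC = re.compile(rf"(?:{_AXIS})(?:/(?:{_AXIS}))*")


def parse_axes_spec(spec):
    """Return the ordered list of axis descriptors in spec.

    Raises ValueError if spec does not match the grammar.
    (Hypothetical helper, for illustration only.)
    """
    if AXES_SPEC.fullmatch(spec) is None:
        raise ValueError(f"not a valid axes_spec: {spec!r}")
    return spec.split("/")


print(parse_axes_spec("s/em/pol"))
```

The point of the sketch is merely that the column content is a flat,
ordered list with a trivial separator; splitting on '/' recovers the
sequence.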

Then, I'd dispute it's related to several columns.  It's an
enumeration of axes types.  As such, you could put the thing in first
normal form by defining a second table with the structure

/------------- primary key ------------------\
pubDID                          | axis index | axis descriptor 
(foreign key into ivoa-obscore)
ivo://ab                          1            s
ivo://ab                          2            t

I believe this description is, from a DM perspective, all we need,
and indeed all we should have (i.e., require from data providers) for
discovery.
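
For illustration, the normalised layout sketched in the table above
could look like this in a throwaway SQLite database driven from
Python.  This is a sketch under assumptions: the table names obscore
and axes and the column names in axes are invented here, the obscore
stand-in carries only its identifier column, and SQLite is used just
because it needs no setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the obscore table, reduced to the identifier.
    CREATE TABLE obscore (obs_publisher_did TEXT PRIMARY KEY);

    -- Hypothetical second table: one row per (dataset, axis position).
    CREATE TABLE axes (
        pubdid          TEXT REFERENCES obscore(obs_publisher_did),
        axis_index      INTEGER,
        axis_descriptor TEXT,
        PRIMARY KEY (pubdid, axis_index)
    );

    INSERT INTO obscore VALUES ('ivo://ab');
    INSERT INTO axes VALUES ('ivo://ab', 1, 's'), ('ivo://ab', 2, 't');
""")

# Datasets that have a time axis in any position.
with_time = [row[0] for row in conn.execute(
    "SELECT DISTINCT pubdid FROM axes WHERE axis_descriptor = ?", ("t",))]


def axes_of(pubdid):
    """Return the ordered axis descriptors of one dataset."""
    return [row[0] for row in conn.execute(
        "SELECT axis_descriptor FROM axes WHERE pubdid = ? "
        "ORDER BY axis_index", (pubdid,))]


print(with_time, axes_of("ivo://ab"))
```

In this form the "any position" question is a plain equality
condition, at the price of a join whenever axes and the other obscore
metadata are needed together.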

Incidentally, the DM-clean way to have the +6-column proposal would
be to have another column axis-length there. Hence, the +6-column
proposal is a special form of denormalisation itself (to, for
instance, Arnold's chagrin).

Be that as it may, we certainly don't want a second table -- that
would be even more painful than six additional columns.

The natural denormalisation (wrt the 1st normal form) would be to use
a set, or, since we model an ordered sequence in this case, an array
of axis descriptors in the obscore table itself.  We don't have good
operators to deal with sets and arrays in columns in ADQL.  So,
arrays aren't really a promising route for denormalization given the
technology stack that we have.

What we can do, however, is simulate the array, and that's what the
grammar above and the proposed queries in the original mail are
about.  That's not revolutionary at all.  It's already done in, e.g.,
RegTAP, and before that by VAO's custom Registry interface.
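
As a rough sketch of what "simulating the array" means for a client,
here are two queries against a toy table, again in SQLite via Python.
Everything specific is an assumption made for illustration: the
column name axes, the sample rows, and the exact query shapes (these
are not the queries from the original mail, and this is SQLite SQL
rather than ADQL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for obscore; "axes" is a hypothetical column name.
conn.execute("CREATE TABLE obscore (obs_publisher_did TEXT, axes TEXT)")
conn.executemany("INSERT INTO obscore VALUES (?, ?)", [
    ("ivo://ab", "s/t"),        # invented sample rows
    ("ivo://cd", "s/em/pol"),
    ("ivo://ef", "em"),
])


def dids(sql):
    """Run a query and return the sorted identifiers it yields."""
    return sorted(row[0] for row in conn.execute(sql))


# Whole-sequence match: a plain string comparison.
exact = dids("SELECT obs_publisher_did FROM obscore WHERE axes = 's/t'")

# "Has an em axis in any position": pad the string with separators so
# the pattern can only match a complete descriptor, never a fragment
# of some longer descriptor that might be added to the list later.
with_em = dids(
    "SELECT obs_publisher_did FROM obscore "
    "WHERE '/' || axes || '/' LIKE '%/em/%'")

print(exact, with_em)
```

The second query is the kind that has to look at every row, since
the pattern starts with a wildcard.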

So, trust me on this, the proposal is sound from a theory/DM
perspective.

I give you, though, that there are open issues from a practical
perspective.  Mine are:

(1) certain types of queries (e.g., "give me all datasets that have a
certain axis type in any position") aren't really too well suited
for going through indices.

(2) there might be major *discovery* use cases that require additional
information on the axes

On (1), I've already written something in
http://mail.ivoa.net/pipermail/dm/2015-April/005150.html, which I
think hasn't been disputed yet.

On (2), I can well imagine they exist, but I'd still hope we can avoid
expanding obscore by 20% to satisfy those.  Let's identify them if
they're there, shall we?

Let me mount the soapbox once more (I'm done in a few lines): Think
of our adopters.  Based on what I hear from DaCHS' users and even a
sizable crowd on this list, I'm convinced that additional fields in
DMs are being paid for in terms of takeup (not to mention that people
tend to put junk in fields whose purpose they don't understand).

For the sake of takeup, please don't add fields without a strong,
validated use case that cannot be sanely satisfied in any other way.

Cheers,

          Markus