Post-workshop Measurement musings

Tue May 18 11:44:17 CEST 2021

Dear DM,

At yesterday's DM interop preparation workshop, I was asked to bring
forward a model for Measurement that I'd consider fine for my programme
of "cover a governing use case".

This use case for Measurement from my perspective is "plot error bars"
(which I think is easily sold to the client writers, which, full
disclosure, is what I'm convinced should guide us), with a
perspective to "automatic error propagation" in the future.

I think the current Measurements proposal will essentially work for that
when we drop a few of the boxes -- and then drop anything that is not
used by a client at the time of RFC.

What I'd like to see unconditionally dropped are the Time, Position,
Velocity, ProperMotion, and Polarization classes; they entangle the DM
with other DMs without giving a benefit I can perceive; for the rough
classification of quantities we have UCDs, and frames, photometric
metadata, and similar data can be attached directly to the columns.

For the rest, I strongly suspect you won't see implementations for the
3D errors, so I'd not be surprised if those dropped out at the
RFC implementation test.

The 2D errors I suspect may be convenient shortcuts.  But really, in the
end we'll need a proper model for correlated errors, perhaps as
envisioned by
https://github.com/msdemlei/astropy#working-with-covariance, but I'd
strongly advise postponing that to later versions -- it'll scare
adopters unnecessarily, and I think it's really only useful once we
want to do automatic error propagation (which is Sci-Fi at this point
for all I can see).

That's basically it (and I've said as much on the two RFC pages).

If, on the other hand, you ask me how I'd build the measurement/error
thing if I got to design it from scratch... Well, in some ad-hoc
notation what we ought to have is at first (where "column" could of
course be a param as well and perhaps a literal):

Measurement:
	location: the column containing the value
	label: some human-readable designation of how this annotation is
	  to be understood
	error_type: "stat" by default, or "sys", perhaps later other values;
		note that a single column can have both stat and sys annotations
	naive_error: a column containing a naive, symmetrical error
	naive_lower: a column containing a naive lower bound
	naive_upper: a column containing a naive upper bound
	naive_plus: a column containing a naive upper error
	naive_minus: a column containing a naive lower error

"Naive" here means that we don't actually say what this is (as in "one
sigma" or so); that's not known or specified in many sorts of data, and
while humans will eventually have to figure it out if they want to
interpret the error bars, it's not important for the first governing use
case.  Everything except location is optional, and data providers would
be encouraged to give only one of naive_error, (naive_upper/_lower), or
(naive_plus/_minus) in one annotation.
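
To illustrate how little a client would need for the "plot error bars"
case, here is a minimal Python sketch.  The dataclass, the function
name error_bars, and the order in which the three representations are
tried are my assumptions for illustration; nothing here is part of any
proposal.  Columns are stood in for by plain lists of numbers.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Measurement:
    """Hypothetical in-memory form of the annotation sketched above."""
    location: Sequence[float]
    label: Optional[str] = None
    error_type: str = "stat"
    naive_error: Optional[Sequence[float]] = None
    naive_lower: Optional[Sequence[float]] = None
    naive_upper: Optional[Sequence[float]] = None
    naive_plus: Optional[Sequence[float]] = None
    naive_minus: Optional[Sequence[float]] = None


def error_bars(m: Measurement) -> Optional[Tuple[List[float], List[float]]]:
    """Return (minus, plus) bar lengths per row, or None without errors.

    The precedence (symmetric, then bounds, then plus/minus) is an
    assumption; providers are meant to give only one representation.
    """
    values = list(m.location)
    if m.naive_error is not None:
        err = [abs(e) for e in m.naive_error]
        return err, list(err)
    if m.naive_lower is not None and m.naive_upper is not None:
        # Bounds are absolute values, so convert them to offsets.
        minus = [v - lo for v, lo in zip(values, m.naive_lower)]
        plus = [hi - v for v, hi in zip(values, m.naive_upper)]
        return minus, plus
    if m.naive_plus is not None and m.naive_minus is not None:
        return ([abs(e) for e in m.naive_minus],
                [abs(e) for e in m.naive_plus])
    return None
```

The (minus, plus) pair is the shape that typical plotting libraries
accept for asymmetric error bars, which is about all this use case
asks for.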

If we find a client that wants to plot error ellipses, we'd add

Measurement2D:
	location1: columns containing the position
	location2:
	semiMajor:
	semiMinor:
	posAngle:

as in current Measurement's ellipse (or whatever the client writer
says).
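
For what it's worth, turning such an annotation into something
plottable is a few lines.  In the following sketch I assume that
posAngle is given in degrees, counted counter-clockwise from the
location1 axis; an actual model would have to fix that convention (and
astronomers will probably want north through east).  The function name
and the parameter n are made up for illustration.

```python
import math


def ellipse_outline(x0, y0, semi_major, semi_minor, pos_angle_deg, n=64):
    """Return n (x, y) points on the error ellipse around (x0, y0).

    Assumption: pos_angle_deg is measured counter-clockwise from the
    first (location1) axis.
    """
    theta = math.radians(pos_angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    points = []
    for i in range(n):
        phi = 2 * math.pi * i / n
        # Point on the axis-aligned ellipse...
        u = semi_major * math.cos(phi)
        v = semi_minor * math.sin(phi)
        # ...rotated by the position angle and shifted to the centre.
        points.append((x0 + u * cos_t - v * sin_t,
                       y0 + u * sin_t + v * cos_t))
    return points
```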

That would be it for the first round.

Once we've figured out how to talk to the client writers, I expect
they'll want to learn about correlated errors.  For that, there'd be a
class

Correlation:
	error1: column that contains the first error
	error2: column that contains the second error
	correlation_coeff: the correlation coefficient of the two errors
		(the normalised off-diagonal entry of the covariance matrix)

(and possibly other representations of correlations as requested by the
client writers).
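
As a small illustration of what a client could do with that, the
following Python sketch builds a 2x2 covariance matrix from two errors
and propagates them into the error of a sum.  It assumes that
correlation_coeff is the dimensionless coefficient rho between -1 and
1 and that the errors can be treated as standard deviations, which is
more than the "naive" errors above promise.  The function names are
hypothetical.

```python
import math


def covariance_matrix(sigma1, sigma2, rho):
    """Build [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]] (assumes |rho| <= 1)."""
    cov = rho * sigma1 * sigma2
    return [[sigma1 ** 2, cov], [cov, sigma2 ** 2]]


def sigma_of_sum(sigma1, sigma2, rho):
    """Error of a + b: sqrt(var_a + var_b + 2*cov_ab)."""
    c = covariance_matrix(sigma1, sigma2, rho)
    return math.sqrt(c[0][0] + c[1][1] + 2 * c[0][1])
```

With rho = 0 this reduces to the familiar addition in quadrature; with
rho = -1 and equal errors, the error of the sum vanishes.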

And then, when we want to actually enable error calculus, I expect we
need to represent actual distributions.  I'm just mentioning this here
to show one way in which that could be done.  I'm pretty sure we'll want
something else in the end, but that would need to be worked out between
consumers (client writers) and producers (data providers) strictly based
on actual use cases.

Having said that, we could extend Measurement (meaning: even with
distributions, data providers should still provide some naive error
measure) by saying:

	dist_func: (from a vocabulary)
	dist_pars: array of DistPar

and 

DistPar:
	name: (literal, depending on dist_func)
	value: something

For instance, a Gaussian-distributed column z could have

(Measurement) {
	location: z
	naive_error: z_err
	dist_func: "normal"
	dist_pars: [
		{name: "mu", value: z}
		{name: "sigma", value: z_err}
	]
}

I think defining all the various distributions as separate classes
wouldn't help the client writers enough to make it worthwhile.  Just
having a master list (vocabulary?) of what dist_funcs have what
dist_pars ought to do the trick -- if a client doesn't know a specific
dist_func, it's hosed whatever we do.
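
To show what I mean by a master list, here is a minimal Python sketch.
The dictionary contents just mirror the examples in this mail, and the
names DIST_VOCABULARY and check_dist are invented for illustration; a
real vocabulary would of course live somewhere official.

```python
# Hypothetical master list: which dist_func takes which dist_pars.
DIST_VOCABULARY = {
    "normal": {"mu", "sigma"},
    "deviation_histogram": {"sampling_points", "sampling_values"},
}


def check_dist(dist_func, dist_pars):
    """Return a list of problems; an empty list means "usable".

    dist_pars is a list of dicts with "name" and "value" keys,
    mirroring the DistPar notation above.
    """
    if dist_func not in DIST_VOCABULARY:
        # The client cannot do anything sensible with this distribution.
        return [f"unknown dist_func {dist_func!r}"]
    given = {par["name"] for par in dist_pars}
    expected = DIST_VOCABULARY[dist_func]
    problems = [f"missing dist_par {name!r}"
                for name in sorted(expected - given)]
    problems += [f"unexpected dist_par {name!r}"
                 for name in sorted(given - expected)]
    return problems
```

A client that gets a non-empty list back could still fall back to the
naive error, which is one reason to keep requiring it.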

One important special case would be non-parametric distributions,
perhaps like this:

(Measurement) {
	location: z
	naive_error: 0.5
	dist_func: "deviation_histogram"
	dist_pars: [
		{name: "sampling_points", value: [-1, -0.5, 0, 0.5, 1]}
		{name: "sampling_values", value: [0.01, 0.2, 0.68, 0.1, 0.01]}
	]
}

-- but as I said, that's just Sci-Fi I'm inventing here to show that we
*can* extend this to support actual error calculus once we've worked out
the basic cases.

           -- Markus