[meas] RFC comment - MD #8,9,10,11

Mon Sep 30 14:18:35 CEST 2019

Hi DM,

On Sat, Sep 21, 2019 at 10:44:15AM -0400, CresitelloDittmar, Mark wrote:
> 8) would like to remove Bounds?D, Ellipse, Ellipsoid, and CovarianceMatrix
> 
> These are in the model as basic forms of multi-dimensional error
> types which do occur in our data.  I admit that the specific
> products that I'm working with do use them, so they could be

should that be "do *not* use them"?

> removed without directly affecting that.

> However, the risk of removing the multi-d types is that it then becomes
> very tempting to move the relation between the coordinate value(s) and
> their Error (as you do in the alternate syntax in #11).  This is a topic
> we've gone through more than once I think.. and I strongly resist changing
> that relation.  As soon as you need to incorporate the multi-d errors (like
> Gaussian), it needs to be associated with the coordinate pair.

I'd say (and FX said about as much, I think) that we'll probably need
to explicitly model both scalars and vectors, and I guess in Meas
that should be generic vectors; in an ideal world we would leave the
actual model to VOTable.

*I*'d then first look at the error models for scalars, because that's
what *I* have plenty of examples for.  I'm a lot less certain about
vectors, as I'm not aware of a single example where I have
non-trivial (as in: correlated) errors for vector or matrix values --
and where I have errors on array-valued columns, the arrays aren't
vectors but just collections of values -- as are the values, then.
(Example: the bp_flux and bp_flux_error in the table
gaia.dr2epochflux on http://dc.g-vo.org/tap.)

> If the CovarianceMatrix representation is wrong.. it should be
> corrected or removed.  This isn't really in my wheelhouse, but I
> thought I looked into it enough to have it properly represented.
> If not, I'd need someone to provide the corrected model.

"Wrong" is a harsh word -- you're just storing quite a few values
twice (25% in 2x2, 33% in 3x3, asymptotically 50% as the matrix size
grows).  As I said, you could fix that by just keeping the upper
right triangle of the matrices, as in:

SymmetricMatrix2x2:
  m11
  m12
  m22

SymmetricMatrix3x3:
  m11
  m12
  m13
  m22
  m23
  m33

-- but frankly, I'd still consider this a rather cumbersome notation,
in particular since we'd have to define one class per vector space
dimension (and at least 6D -- space and derivatives -- seem rather
obvious to me).
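
To illustrate the redundancy numbers and the upper-triangle storage,
here is a minimal Python sketch.  The function names are made up for
this mail and are not part of any model or library; the matrix values
are just sample numbers.

```python
def pack_upper(matrix):
    """Return the upper triangle (diagonal included), row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def unpack_upper(packed, n):
    """Rebuild the full symmetric n x n matrix from the packed triangle."""
    full = [[0.0] * n for _ in range(n)]
    values = iter(packed)
    for i in range(n):
        for j in range(i, n):
            # each off-diagonal value is written to both mirror cells
            full[i][j] = full[j][i] = next(values)
    return full


def redundant_fraction(n):
    """Fraction of the n*n cells that only repeat another cell."""
    return (n * (n - 1) // 2) / (n * n)


# sample values only
cov = [[0.1, 0.0, 0.01],
       [0.0, 0.2, 0.5],
       [0.01, 0.5, 0.3]]
packed = pack_upper(cov)  # order: m11, m12, m13, m22, m23, m33
```

redundant_fraction(2) is 0.25 and redundant_fraction(3) is 1/3; for
the 6D case it is 15/36, so a full 6x6 matrix would carry 15
duplicated cells.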

This extra notation is particularly annoying because we already have
a way to express all sorts of arrays: VOTable's PARAM/@arraysize and
PARAM/@value (ok, I give you that for strings that... has room for
improvement).  Hence, I'd frankly say: "how the matrix is described
(in terms of its size and content) is up to the serialisation
format."  And then put any constraints (e.g., "a covariance matrix
must be a symmetric matrix with as many rows as the vector it is an
error for") into human-readable text.  Such things aren't easily
expressible in modelling languages, but we shouldn't uglify our
models too far in order to coax validators into validating things
beyond their complexity class.

> could model it as a list of cells, but would still need 2 types with fixed
> lengths (#cells).  That is an easy switch, but more bulky in direct
> serializations.

Well, I, for one, am aiming for VOTable, and I'd very much hope that
an array would be a VOTable array then (similar concerns apply for
FITS tables).

Questions like these are why I'm so sure we need to wait for the
actual serialisation rules.  I do expect that people will only care
if they see that, right now, they'd probably write 9 (ok, 6,
exploiting the symmetry) params with humongous labels rather than
something like

  <PARAM name="covMat" datatype="double" arraysize="3x3" value=
    "0.1  0    0.01 
    0     0.2  0.5
    0.01  0.5  0.3"/>

(say).
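
For what it's worth, a client could read such a PARAM with very
little code.  The following is only a sketch using Python's standard
library; read_matrix_param is a made-up helper, and it assumes the
usual VOTable convention that the first arraysize dimension varies
fastest (which makes no difference for a symmetric matrix).

```python
import xml.etree.ElementTree as ET

PARAM_XML = '''<PARAM name="covMat" datatype="double" arraysize="3x3" value=
  "0.1  0    0.01
  0     0.2  0.5
  0.01  0.5  0.3"/>'''


def read_matrix_param(xml_text):
    """Parse a 2D PARAM into a list of rows (hypothetical helper)."""
    param = ET.fromstring(xml_text)
    # "3x3" -> [3, 3]; assumed: first dimension varies fastest
    n_fast, n_slow = (int(d) for d in param.get("arraysize").split("x"))
    flat = [float(tok) for tok in param.get("value").split()]
    if len(flat) != n_fast * n_slow:
        raise ValueError("value does not match arraysize")
    return [flat[k * n_fast:(k + 1) * n_fast] for k in range(n_slow)]


cov_mat = read_matrix_param(PARAM_XML)
```

The symmetry constraint itself would still be checked by the client
(or just stated in prose), not by the serialisation.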

> 10) points out that Gaussian, and other distributions, are missing, but
> also seems to indicate that we don't want to get into these at this point.
> 
> So, we're not adding distributions yet.. please.

But I'd say we have to say that we're aware of this deficiency and
say something about the limitations ensuing from that.  It would also
make the model more useful if we said in Symmetrical's description
something like "Annotators should see that value-radius ..
value+radius covers about 70% of the value's distribution (`1
sigma')" and analogously in Asymmetrical "The interval value-minus ..
value+plus should cover about 70% of the value's distribution."

I suspect that in some (not so far) future version we'll have to
introduce error classes that go beyond the standard "1 sigma"
everyone loves.  For a first version we might get away with just
doing "1 sigma".  Or we extend the class to allow people to say
something like "2 sigma" ("interval specified corresponds to about
95% of the value distribution") right away -- does someone have
tables that have that?  Software that would want to consume something
like that?
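
As a reminder of where the "about 70%" and "about 95%" come from:
*if* one assumes a Gaussian distribution (which the model as it
stands does not say), the coverage of a k-sigma interval can be
computed as in this small Python sketch.

```python
import math


def coverage(k):
    """Probability that a Gaussian variable lies within k sigma of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"{k} sigma: {coverage(k):.4f}")
# 1 sigma: 0.6827
# 2 sigma: 0.9545
# 3 sigma: 0.9973
```

For non-Gaussian errors these figures do not hold, which is part of
why the limitation should be spelled out in the text.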

> 11) general issue with the manner that correlation is modeled, suggests a
>       ra   = GenericMeasure(value=20,
>         Error(id=err-ra, statError=1e-7))
>       dec = GenericMeasure(value=30,
>         Error(id=err-dec, statError=1e-7))
>       Correlation(err1=err-ra, err2=err-dec, coeff=0.5)
> 
> The problem with this sort of ad hoc serialization representation is that
> I'm not sure what you mean in terms of the model.  It looks like maybe
> this..
> 
> [image: correlation.png]

Essentially, yes: A Correlation is a pair of errors and a numeric
value, the correlation coefficient.

> Which I don't quite understand... really implies only 1D errors.

Yes.  But of course every vector or matrix consists of scalars, and
so if we go this way *and* didn't do any explicit modelling of errors
for vectors or matrices (which I'd consider reasonable for a first
version), we could still annotate these by adding an index attribute
to the Correlation class; again, it would really help if we had the
mapping document to illustrate the consequences of the different
design choices (which is why I suggest to first get that out of the
door in emergency mode).

So, instead of having a Matrix2x2 error for a vector x:

  m11 = 0.4
  m12 = 0.2
  m21 = 0.2
  m22 = 0.3

you would say

Correlation(err1=x, index1=1, err2=x, index2=1, value=0.4)
Correlation(err1=x, index1=1, err2=x, index2=2, value=0.2)
Correlation(err1=x, index1=2, err2=x, index2=2, value=0.3)

-- I give you it's not exactly pretty, but given that it's a
universal and common mechanism for all sorts of correlated errors I
suspect it's a good deal overall.
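
To show that nothing is lost this way, here is a rough Python sketch
of how a client could reassemble the matrix from such records.  The
Correlation tuple is only a stand-in for the proposed class, and
assemble_matrix is a made-up helper; neither is part of any existing
model.

```python
from collections import namedtuple

# Hypothetical stand-in for the proposed Correlation class.
Correlation = namedtuple("Correlation", "err1 index1 err2 index2 value")


def assemble_matrix(records, err, size):
    """Collect all records relating components of `err` into a matrix."""
    matrix = [[None] * size for _ in range(size)]
    for rec in records:
        if rec.err1 != err or rec.err2 != err:
            continue  # record belongs to some other pair of errors
        i, j = rec.index1 - 1, rec.index2 - 1  # indices are 1-based
        # symmetry: one record fills both mirror cells
        matrix[i][j] = matrix[j][i] = rec.value
    return matrix


records = [
    Correlation("x", 1, "x", 1, 0.4),
    Correlation("x", 1, "x", 2, 0.2),
    Correlation("x", 2, "x", 2, 0.3),
]
matrix = assemble_matrix(records, "x", 2)
```

Here matrix comes out as [[0.4, 0.2], [0.2, 0.3]], i.e., the same
content as the Matrix2x2 with three records instead of four cells.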

> Anyway.. if the correlation isn't done right in this model, it is probably
> best to skip it until we have a concrete example to work.  Perhaps with the
> catalogue/Source properties thread which is getting input from Gaia (I
> think).

Right -- which is why I'd so much like to see that as a use case.
In short, that would be:

  The Gaia satellite observes ra, dec, parallax, pmra, pmdec, and
  radial velocity, and photometry for a large number of stars.  The
  reduction correlates the first five of these values.  A client
  wants to work out the resulting covariance matrix without having to
  know about the specific Gaia data model.

In case you want to watch the situation live: Check
gaiadr2.gaia_source on your favourite Gaia-carrying TAP service, and
check the columns having a UCD starting with stat.correlation (in
ADQL: something like

select * from tap_schema.columns
where 
ucd like 'stat.correlation;%'
and table_name='gaiadr2.gaia_source'

).  Right now, you'll see dec_parallax_corr, dec_pmra_corr,
ra_dec_corr, ra_parallax_corr, ra_pmdec_corr, ra_pmra_corr,
dec_pmdec_corr, parallax_pmdec_corr, parallax_pmra_corr,
pmra_pmdec_corr.
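
What the client in the use case has to do with these columns is
roughly the following (a Python sketch).  It assumes the usual
relation cov_ij = corr_ij * sigma_i * sigma_j, takes the sigmas from
the <name>_error columns, and ignores units and the cos(dec) factor
on ra; gaia_covariance and the sample row are made up.

```python
PARAMS = ["ra", "dec", "parallax", "pmra", "pmdec"]


def gaia_covariance(row):
    """Build the 5x5 covariance matrix from one gaia_source row (a dict)."""
    n = len(PARAMS)
    cov = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(PARAMS):
        sigma_a = row[a + "_error"]
        cov[i][i] = sigma_a ** 2
        for j in range(i + 1, n):
            b = PARAMS[j]
            # column names follow the pattern <a>_<b>_corr
            rho = row[f"{a}_{b}_corr"]
            cov[i][j] = cov[j][i] = rho * sigma_a * row[b + "_error"]
    return cov


# Made-up sample row: all correlations zero except ra/dec.
row = {p + "_error": s
       for p, s in zip(PARAMS, [0.04, 0.03, 0.05, 0.08, 0.07])}
row.update({f"{a}_{b}_corr": 0.0
            for i, a in enumerate(PARAMS) for b in PARAMS[i + 1:]})
row["ra_dec_corr"] = 0.5
cov = gaia_covariance(row)
```

The point is that the function above hard-codes Gaia's column naming;
an annotation would have to let a generic client discover the same
pairs without it.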

*If* we accept Gaia as something we'd like to describe, then I think
we're faced with the choice of introducing a CovMatrix5x5 (and a way
to reference these fields from in there) -- or just introducing a
Correlation class (which, as shown above, might then also let us drop
CovMatrix2x2 and CovMatrix3x3; since it seems to me that Bounds?D and
Ellips* are just special cases of these, they could then also go --
wouldn't that be nice?).

         -- Markus