[VEP-003]: datalink/core#sibling

Tue Jan 7 10:45:59 CET 2020

Hi François,

On Fri, Dec 20, 2019 at 05:00:57PM +0100, François Bonnarel wrote:
> Le 20/12/2019 à 08:34, Markus Demleitner a écrit :
> > New Term: sibling
> > Action: Addition
> > Label: Sibling Data
> > Description: Data products derived from the same progenitor as #this.
> >    This could be a lightcurve for an object catalog derived from repeated
> >    observations, the dataset processed using a different pipeline, or the
> >    like.
> If I compare this to the initial VEP-001 "associated-data" proposal
> and to the use case exposed in the other thread I wonder if
> "sibling" is the right word.  I'm not sure we can always identify a
> common progenitor for what I called the "Main" and what I called
> the "Target" (see the other thread for what I mean there) in the
> use cases VEP-001 was supposed to solve.

Can you describe the cases where you can't see the common progenitor?
Perhaps that would help us work out if

(a) #sibling is too special and needs to be generalised

(b) #sibling is useful and at the right level of generalisation, but
a second term is required for something related but not quite
identical, or

(c) #sibling isn't useful at all and should be replaced by something
else.

> That's why instead of "associated_data" or "sibling" I proposed
> "Observation_Result_of_source".

Hm... I have to say I don't like it.  Why?  Well, datalink/core is a
vocabulary of properties, i.e., of things that in a simple
subject-predicate-object sentence work as predicates (with a minimum
of embellishment).  A datalink response row with columns ID,
semantics, and access_url thus expands to

  <ID> has-a-<semantics>  <access_url>

as in

  <ivo://example.edu/data?a/b/c> has-a-preview <http://example.edu/prev/a/b/c>

As there's little that's as practical as a good theory, I'd like to
try really hard to make sure that new terms match that pattern.  And,
well,

  X has-an-observation-result-of-source Y

is at least severely counter-intuitive.
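
To make the expansion rule concrete, here is a minimal Python sketch of
how a client could turn a datalink row into such a sentence.  The
function names and the dict-based row are illustrative assumptions of
mine, not part of Datalink or any library, and the a/an choice is only
there to reproduce the examples above.

```python
# Sketch only: a datalink row is modelled as a plain dict with the three
# columns discussed above.  All helper names here are hypothetical.

def article_for(term):
    """Pick 'a' or 'an' so the generated predicate reads naturally."""
    return "an" if term[0].lower() in "aeiou" else "a"


def row_to_sentence(row):
    """Expand a row to '<ID> has-a-<semantics> <access_url>'."""
    # Terms are written with a leading '#' in this thread; strip it.
    term = row["semantics"].lstrip("#")
    return "<{}> has-{}-{} <{}>".format(
        row["ID"], article_for(term), term, row["access_url"])


row = {
    "ID": "ivo://example.edu/data?a/b/c",
    "semantics": "#preview",
    "access_url": "http://example.edu/prev/a/b/c",
}
print(row_to_sentence(row))
```

Feeding a candidate term through something like this is a cheap way to
check whether it still reads as a predicate about the dataset named in
ID.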

I think what you're implicitly trying to do here is change the domain
of the datalink predicates, i.e., change what set X can be drawn
from.  So far, since ID in Datalink columns is a publisher dataset
identifier, it was implicit that all datalink/core properties had the
set of datasets (as defined by SSAP, say) as domain.

If I understand your intent correctly, then appending -of-source to
the term tries to change this at least for this term to say "well,
this term's domain isn't datasets at all, it's 'sources'".  I think
that goes far beyond the question of how to name or define a single
term; this is a large change in how clients should interpret datalink
results, and, indeed, it's a large change in what dataset identifiers
are supposed to mean.

Frankly, that's all a bit unnerving to me -- I mean, perhaps it's a
good idea to assign ivoids to "sources", but I'd rather wait with
that until we have defined what we think a source is (i.e., probably
the definition of a source DM).

Luckily, I think for what triggered VEP-001 and VEP-003 -- linking
gaia_source table rows to Gaia spectra and time series -- we don't
need to go all that profound.  A Gaia catalogue row may relate to a
source in a sense that we will have to make more precise in a source
DM, but it very certainly works just fine as a dataset.  Not a large
one, but still a dataset, complete with a pubDID.

This dataset is derived from a set of observations which also yielded
epoch photometry, RP/BP spectra, etc.  And in that sense at least for
this use case it seems #sibling is exactly on point.
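
As a rough illustration of that use case, the following Python sketch
shows what the datalink rows for one catalogue row might look like and
how a client could pick out the siblings.  Every identifier and URL in
it is made up for the example (nothing here is a real Gaia pubDID or
service endpoint), and links_with is a hypothetical helper.

```python
# Sketch only: all IDs and URLs below are invented placeholders.
PUB_DID = "ivo://example.org/gaia?source/1"

rows = [
    # The catalogue row is the dataset in ID; products derived from the
    # same observations are attached to it as siblings.
    {"ID": PUB_DID, "semantics": "#sibling",
     "access_url": "http://example.org/epoch-photometry/1"},
    {"ID": PUB_DID, "semantics": "#sibling",
     "access_url": "http://example.org/rp-bp-spectra/1"},
    {"ID": PUB_DID, "semantics": "#preview",
     "access_url": "http://example.org/preview/1"},
]


def links_with(rows, pub_did, term):
    """Return the access URLs of all rows for pub_did tagged with term."""
    return [r["access_url"] for r in rows
            if r["ID"] == pub_did and r["semantics"] == term]


siblings = links_with(rows, PUB_DID, "#sibling")
print(siblings)
```

The point is that the client never has to guess what kind of thing ID
names: it is a dataset throughout, and #sibling only says how the
linked products relate to it.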

Which of course doesn't mean I'm claiming we're done already; as I said
above: if you have different cases, it might well be that using some
other concept might work better in the long run, which is why I was
asking for them.  Let's just make sure we don't needlessly blur what
datalink rows actually mean, because that's going to hurt all clients
down the line: Computers are bad at guessing.

         -- Markus