UCD question: what UCD word to use for percentiles? stat.percentile proposal

Bo Milvang-Jensen milvang at astro.ku.dk
Mon Mar 21 17:30:37 CET 2022


Hi Mark et al.,

While I liked the idea of being able to indicate which percentile was meant,
you make a good point, and I now tend to agree with you.

Somewhat related: the UCD word stat.error has the definition "Statistical
error", which does not specify whether this should be a 1-sigma
error or may be any type of error. So if UCD words were
introduced for 1,2,3 sigma percentiles below and above the median, one
might then also want UCD words for the symmetric 1,2,3 sigma error bars,
and for the asymmetric ones (i.e. below or above, say, the median).

Kind regards, Bo

On 3/21/22 5:13 PM, Mark Taylor wrote:
> Hi Bo et al.,
>
> I'm not a semantics expert, but my feeling is that trying to go into
> more detail than stat.median and stat.percentile would be a mistake.
> As noted, designating them by number of standard deviations away
> from the mean requires assumptions about the distribution, and
> in any case providing any fixed set of numeric values is bound to
> disappoint data providers who have different percentiles available,
> unless a very large number of them is provided.  If the UCD mechanism
> provided some way to associate numeric values with the semantics
> it would be nice to do that here, but it doesn't (we've encountered
> this before with e.g. HEALPix pixel ID at depth N).
>
> So my suggestion would be to stick with just adding stat.percentile
> (or maybe stat.quantile) which is enough information to tell a human
> or computer *roughly* how to treat such a quantity.
>
> Mark
>
> On Thu, 17 Mar 2022, Bo Milvang-Jensen wrote:
>
>> Dear Mireille, Sebastien, IVOA Semantics group and colleagues,
>>
>> Thank you very much for giving my question so much thought. Your proposed new
>> words are clearly useful for my use case. My comments are:
>>
>> The proposed new word stat.percentile is clearly a good idea.
>>
>> The proposed new words stat.percentile.1sigma (and 2sigma and 3sigma) are
>> also useful (and something I had not thought about myself), as they provide
>> more information about what percentile is meant. Your scheme of adding either
>> stat.min or stat.max, as in
>> stat.percentile.1sigma;stat.min
>> stat.percentile.1sigma;stat.max
>> works, but I am not sure it's the most satisfying solution. As far as I can
>> see, one would never use stat.percentile.1sigma without adding either stat.min
>> or stat.max, so I would therefore create separate words for the percentiles
>> below and above the median, e.g.
>> stat.percentile.lower1sigma
>> stat.percentile.upper1sigma
>> And similarly for 2sigma and 3sigma. I am not sure what the best wording would
>> be. If you want to use more characters, one could insert the word "median", as
>> in "1sigmabelowmedian". (And instead of lower/upper one could use
>> below/above.) One could also have another level
>> (stat.percentile.1sigma.lower), which could be more readable.
>>
>> I want to note that e.g. the 16% percentile is only guaranteed to be located 1
>> standard deviation ("sigma") below the median (and mean) for a normal
>> distribution, whereas for asymmetric distributions that would not be the case.
>> (Disclaimer: I am not a statistics expert.) It should therefore be
>> understood that these new UCD words can be applied to the percentiles that in
>> a normal distribution would correspond to 1,2,3 sigma below/above the median,
>> but which in the concrete case may not have that property.
>>
>> I think that the 1sigma/2sigma/3sigma naming is fine. If you instead wanted to
>> have the actual numbers, a problem is the dot in e.g. 2.5%. Instead of per
>> cent one could use per mille. I have looked up what the percentiles (in per
>> mille!) are for a normal distribution for -3,-2,-1,+1,+2,+3 sigma:
>> 1.3499
>> 22.75013
>> 158.655255
>> 841.344745
>> 977.24987
>> 998.6501
>> So one could create the words
>> stat.percentile.1permille
>> stat.percentile.23permille
>> stat.percentile.159permille
>> stat.percentile.841permille
>> stat.percentile.977permille
>> stat.percentile.999permille
>> But I am not sure it is more elegant. (And I note that my catalogue (not
>> created by me) has e.g. the 2.5% percentile and not 2.3%, which would be the
>> logical choice.)
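
The per mille values and the candidate word names above can be reproduced with a short sketch like the following. It assumes a standard normal distribution and uses Python's standard library; the "permille" naming pattern is only the one floated in this message, not an approved UCD word.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

sigmas = [-3, -2, -1, 1, 2, 3]
# Cumulative probability at each sigma offset, expressed in per mille.
per_mille = [1000.0 * standard_normal.cdf(k) for k in sigmas]

# Candidate words, with the value rounded to the nearest integer.
words = [f"stat.percentile.{round(v)}permille" for v in per_mille]

for k, v, w in zip(sigmas, per_mille, words):
    print(f"{k:+d} sigma  {v:9.4f}  {w}")
```

Rounding to whole per mille is what turns 22.75 into 23 and 158.66 into 159, which is also why a catalogue column at 2.5% (25 per mille) would not match any of these words.
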
>>
>> I would like to use the new proposed UCD words (either directly what you
>> wrote, or a modified version based on what I suggest now) in my catalogues for
>> publication in ESO's Phase 3. How long would it take before the new words
>> would be approved? I suppose they need to be approved before ESO can accept
>> them. I can say that we found a small problem with one column in the
>> catalogue, so the final version will probably not be ready for another 1-2
>> weeks, as the main author is finishing his PhD thesis these days.
>>
>> Kind regards, Bo
>>
>> On 3/17/22 12:39 PM, Mireille LOUYS wrote:
>>> Hi Bo, Hi semantics,
>>>
>>> We have re-examined your use case together with S. Derriere and A. Preite
>>> Martinez and also checked how VizieR handles percentiles.
>>>
>>> There is indeed currently no proper way to describe with UCDs that a
>>> measurement is associated with some percentile
>>> of a statistical model/distribution.
>>> Creating a new word could help describe these values:
>>> Q stat.percentile    Percentile in a statistical distribution
>>> We could also have a few more precise words to address exactly what you are
>>> trying to describe:
>>> Q stat.percentile.1sigma    Percentile corresponding to one standard
>>> deviation from the median
>>> Q stat.percentile.2sigma    Percentile corresponding to two standard
>>> deviations from the median
>>>
>>> With these words, we could use:
>>> ucd="src.redshift.phot;stat.percentile.2sigma;stat.min"  for EAZY  2.5%
>>> percentile of photo-z
>>> ucd="src.redshift.phot;stat.percentile.1sigma;stat.min"  for EAZY  16%
>>> percentile of photo-z AND LePhare photo-z lower limit, 68% conf. level
>>>
>>> ucd="src.redshift.phot;stat.median"  for EAZY  50% percentile of photo-z
>>>
>>> ucd="src.redshift.phot;stat.percentile.1sigma;stat.max"  for EAZY  84%
>>> percentile of photo-z AND LePhare photo-z upper limit, 68% conf. level
>>>
>>> ucd="src.redshift.phot;stat.percentile.2sigma;stat.max"  for EAZY  97.5%
>>> percentile of photo-z
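
As a rough sketch of how the assignments listed above could be applied to catalogue columns, the lookup below maps a nominal percentile to the UCD string proposed in this message. The names PROPOSED_SUFFIX and proposed_ucd are hypothetical, the words are a proposal rather than approved vocabulary, and the sketch assumes that the upper 2-sigma entry refers to the 97.5% percentile (the counterpart of 2.5%).

```python
# Hypothetical lookup: nominal percentile (in per cent) -> UCD suffix
# as proposed in this thread (not approved vocabulary).
PROPOSED_SUFFIX = {
    2.5: "stat.percentile.2sigma;stat.min",
    16.0: "stat.percentile.1sigma;stat.min",
    50.0: "stat.median",
    84.0: "stat.percentile.1sigma;stat.max",
    97.5: "stat.percentile.2sigma;stat.max",  # assumed counterpart of 2.5
}

def proposed_ucd(primary: str, percentile: float) -> str:
    """Return the UCD string for a primary word and a nominal percentile.

    Falls back to the generic stat.percentile word when the percentile
    has no dedicated entry in the table.
    """
    suffix = PROPOSED_SUFFIX.get(percentile, "stat.percentile")
    return f"{primary};{suffix}"

for pct in (2.5, 16, 50, 84, 97.5, 25):
    print(pct, proposed_ucd("src.redshift.phot", pct))
```

A percentile without a dedicated word, such as 25%, falls back to plain stat.percentile, so the numeric value itself still has to be carried elsewhere, for example in the column description.
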
>>>
>>> In the UCD vocabulary, maybe an extra word would cover all possible cases:
>>> Q stat.percentile.3sigma   Percentile corresponding to three standard
>>> deviations from the median
>>> I hope this helps.
>>> I have created a VEP-UCD for this term, and will circulate it in the UCD
>>> Board to discuss it for adoption.
>>>
>>> Tell us whether you can use this, and send us your feedback, if any.
>>> Thanks in advance.
>>>
>>> Mireille & Sebastien
>>> CDS, Strasbourg
>>> ----------------
>>> ------------------------------------------------------------------------
> --
> Mark Taylor  Astronomical Programmer  Physics, Bristol University, UK
> m.b.taylor at bristol.ac.uk          http://www.star.bristol.ac.uk/~mbt/

