VOUnits: _another_ version, based on implementation feedback

Francois Ochsenbein Francois.Ochsenbein at astro.unistra.fr
Wed Nov 6 09:12:51 PST 2013


Greetings,

I've followed the discussion on VOUnits and I agree that the last version
(2013102) is a real improvement for clarity and for removing ambiguities.
I have however a few remarks:

* allowing both "unrecognised" units and "quoted" units (by the way, why the
   single quote ('), and not the double quote ("), which is more common for a
   quotation?): isn't there some contradiction? At least in validation
   procedures, allowing only explicitly _known units_ and _quoted units_ would
   produce more reliable documents, assuming that _quoted units_ are defined
   somewhere in the document (e.g. a VOTable) which makes use of such
   non-standard units.

* still about _quoted units_: while not explicitly specified in the document,
   I imagine these can be combined in expressions like m'MoonMass'/yr, as looks
   to be possible from the grammar? _Quoted units_ become quite useful for
   representing some "natural" unit in certain models (e.g. the gravitational
   potential in a galaxy).

* About the units listed in Table 2: I tend to agree with Arnold that the abbreviations
   "Ba" and "ta" proposed for the Besselian and tropical years look strange -- if such
   units are required, "Byr"  and "tyr" would likely reach a better consensus;

* still about Table 2, the "B" for "byte" also looks quite unusual; I understand
   that the authors wish to allow units like "MB/s" or "MiB/s", but recommending
   "B" alone as meaning "byte" looks bizarre (capitalized unit symbols refer to
   people's names, like Joule, Kelvin, Hertz, etc). I would feel more comfortable if
   "byte" were recommended for the byte unit, with a statement that multiples of
   "byte" can be written with "B" instead (in other words, multiples of "B" are
   bytes, while sub-multiples are bels). Maybe "B" alone (without prefix) should
   just be forbidden?

* some widespread physical constants like the speed of light (c), Planck's (h),
   Boltzmann's (k), or the gravitational (G) constant -- not to mention pi -- are
   frequently used in units (e.g. MeV/c2 for masses); the document says a few words
   about their usage for transformations (section 3.3), but are these constants forbidden
   in units? Note that c and k are unambiguous, but h is commonly used for the cosmological
   factor, and G collides with Gauss.
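
To make the first two points concrete, here is a minimal sketch in Python of
the strict validation I have in mind. Everything in it is an assumption made
for illustration: the KNOWN_UNITS set is a tiny hypothetical stand-in for the
tables in the document, the regular expression is a simplification rather
than the VOUnits grammar, and prefixes on unquoted units are ignored.

```python
import re

# Hypothetical, deliberately tiny set of known symbols (not the VOUnits tables).
KNOWN_UNITS = {"m", "s", "yr", "kg", "Hz"}

# A symbol is either a quoted unit (optionally preceded by prefix letters)
# or a bare run of letters. This is a simplification, not the real grammar.
SYMBOL = re.compile(r"[A-Za-z]*'([^']+)'|([A-Za-z]+)")

def undeclared_units(unit_string, declared_quoted):
    """Return the symbols that a strict validator would reject."""
    rejected = []
    for match in SYMBOL.finditer(unit_string):
        quoted, bare = match.group(1), match.group(2)
        if quoted is not None:
            # Quoted units must be defined somewhere in the enclosing document.
            if quoted not in declared_quoted:
                rejected.append("'" + quoted + "'")
        elif bare not in KNOWN_UNITS:
            # Bare symbols must be explicitly known.
            rejected.append(bare)
    return rejected
```

With this sketch, undeclared_units("m'MoonMass'/yr", {"MoonMass"}) returns an
empty list, while undeclared_units("furlong/s", set()) returns ["furlong"].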

All the best,
François

On 05/11/2013 12:08, Norman Gray wrote:
>
> Greetings, Semantics people.
>
> Thanks to Markus for kicking off a very useful discussion of the units document
> this week.  Like Markus, I believe that the remaining decisions are not
> profound, but do need some explicit consensus on the mailing list.
>
> [this is another long one, but it's not particularly intricate, after the first section, so you can probably skim it fairly quickly]
>
> In rough order of contentiousness...
>
> ----
>
> Prefixes on quoted units
>
> Myself, I share everyone's mild distaste, but I think these are pretty
> much unavoidable (and not perhaps quite as bad as you may think).
>
> The logic is:
>
>    * we want to allow 'unknown units' (because otherwise we have to have a long
>      list of approved units, which will be permanently out of date, and which
>      will never satisfy everyone)
>
>    * but that means we have to allow SI prefixes on unknown units (because
>      if we don't, then "MBa" (mega-besselian-year) will be syntactically valid or
>      invalid depending on whether or not 'Ba' is listed as 'known', which means
>      that data providers will have to memorise the list of known units)
>
>    * so we have to have a way of quoting units (or else 'martianDay', presuming
>      that's an 'unknown' unit, would have to be interpreted as milli-artianDay)
>
>    * so we must, I think, allow prefixes on those quoted units (or else we have
>      to write "'martianDay'" for those units, but remember to drop the quotes and
>      write "kmartianDay" when we talk about 1000s of them (and then how do we
>      parse the unit of 1000 days on Io, the "kioDay"?)).
>
>
> The reason that SI (and IEEE binary) prefixes don't explicitly appear
> in the grammar (ie, there's no 'si-prefix base-unit-string' pattern) is
> because there's an intrinsic ambiguity here with, for example, the 'Pa'
> or the 'mag'.  The only reason we don't parse this as the peta-'a' or
> milli-'ag' is because we have the _semantic_ knowledge that 'Pa' and
> 'mag' are members of a small, but not negligible, set of special cases.
> That is, the (high-ish level) semantics of some unit strings interfere,
> in a rather irritating way, with the otherwise purely (low-level) syntactic
> issues involved in parsing the string.
>
> [[[ Technical aside (for those who haven't had the pleasure of working with
> parser-generators): the way such a parser framework works is that one function,
> the 'lexer', identifies 'terminals' in an input, such as STRING and
> SIGNED_INTEGER, and reports them in sequence to the actual parser, which uses
> the 'grammar' to decide that the string of lexemes is or is not an allowed
> sequence.  The 'grammar' is still purely syntactic, with no semantics attached.
>
> This isn't just a yacc problem, by the way: essentially the same problem would
> appear using any parsing technology, so it's an artefact of the desire for a
> machine-readable grammar, in contrast to specifying the grammar in text and
> requiring implementers to create a hand-written parser. ]]]
>
> That means that a lexer can't be given the task of spotting the 'si-prefix'
> strings, and the prefixes have to be identified in a sub-parse of the
> STRING or QUOTED_STRING which emerges from the lexer.  Put another way,
> a 'semantic sub-parse' _has_ to have a (brief) look at the unit string
> _before_ we split it into base unit and prefix.
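
A minimal sketch of such a sub-parse for unquoted symbols, in Python. The
two tables are small illustrative subsets chosen for this example, and the
function name split_prefix is hypothetical; this is not how either library
is actually written.

```python
# Illustrative subsets only; the real tables are in the VOUnits document.
KNOWN_UNITS = {"m", "g", "a", "Pa", "mag", "yr"}
PREFIXES = {"m": 1e-3, "k": 1e3, "M": 1e6, "P": 1e15}

def split_prefix(symbol):
    """Split an unquoted unit symbol into (prefix, base)."""
    # Semantic knowledge comes first: a known unit is never split,
    # so 'Pa' is the pascal and 'mag' is the magnitude.
    if symbol in KNOWN_UNITS:
        return ("", symbol)
    # Otherwise a leading prefix letter is read as a prefix,
    # whether or not the remaining base is a known unit.
    if len(symbol) > 1 and symbol[0] in PREFIXES:
        return (symbol[0], symbol[1:])
    return ("", symbol)
```

Under these assumptions split_prefix("Pa") gives ("", "Pa"), whereas
split_prefix("martianDay") gives ("m", "artianDay"), which is exactly the
misreading that quoting is meant to prevent.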
>
> It's that sub-parse that ensures that only permitted prefixes appear,
> and it's the same sub-parse that ensures that the prefix before the
> QUOTED_STRING is only one of the permitted ones.  So yes, Markus's example
> of gargantuan'jupiterMass' does indeed _appear_ to be valid according to
> the yacc grammar, but the text of the specification, and the library, forbids it.
>
> (actually, the text could be a bit more explicit about that.  How about:
>
>> Quoted units can take prefixes (they are `unknown units', so there are
>> no restrictions on their usage), so that \unit{m'furlong'} is a
>> milli-furlong, and \unit{m'm'} is a milli-`m'.  As with 'known units',
>> the only permissible prefixes are those of \prettyref{tab:vouscalefactors}.
>
> and I should highlight, in the grammar appendix, that this is an extra-syntactic
> constraint)
>
> That is, Markus's remark:
>
>> believe if what we want to do here is allow prefixes on quoted units,
>> things should look somewhat like
>>
>> siPrefix: "u" | "c" | "d" | "da" | "h" |...
>> unit: ...
>>   | siPrefix QUOTED_STRING
>>
>> -- and that would be ugly because we don't otherwise talk about SI
>> prefixes in the grammar, and I'd not feel too good about introducing
>> them now.
>>
>> If, on the other hand, the "gargantuan" above should only blow up
>> during unit interpretation, we have another error type that would
>> come out of the parser, something like "invalid SI prefix", and
>> that's arguably a complication of the interface, not to speak of the
>> parser function for the unit production.
>
> ... is perfectly correct.
>
> Regarding the interface, when presented with "gargantuan'jupiterMass'",
> the C library reports "parse error: units parsing error: Impossible
> prefix before quoted unit", and the Java library "Error parsing units:
> unity parser error at character 11: error creating unit -- bad prefix?:
> gargantuan".
>
> That is, the gargantuan'jupiterMass' _is_ reported as a syntax error, rather than
> having to be asked about (as is the case for the known/unknown unit distinction).
>
> (well, _now_ the C version reports that; before 10 minutes ago, it produced an
> assertion error!)
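
The extra-syntactic check on quoted units could look roughly like the
following Python sketch. The prefix set is an illustrative subset, and both
parse_quoted and BadPrefixError are hypothetical names invented here; they
do not reproduce the interface or the messages of the C or Java library.

```python
import re

# Illustrative subset of the permitted scale-factor prefixes.
PREFIXES = {"m", "k", "M", "G", "da"}

# Anything before the opening quote is a candidate prefix.
QUOTED = re.compile(r"([^']*)'([^']+)'")

class BadPrefixError(ValueError):
    """Hypothetical error type, not the one used by either library."""

def parse_quoted(symbol):
    """Split a quoted unit symbol into (prefix, name), checking the prefix."""
    match = QUOTED.fullmatch(symbol)
    if match is None:
        raise ValueError("not a quoted unit: " + symbol)
    prefix, name = match.groups()
    # The grammar accepts any string here; this test is the semantic part.
    if prefix and prefix not in PREFIXES:
        raise BadPrefixError("impossible prefix before quoted unit: " + prefix)
    return (prefix, name)
```

Here parse_quoted("M'jupiterMass'") returns ("M", "jupiterMass") and
parse_quoted("k'ioDay'") returns ("k", "ioDay"), while
parse_quoted("gargantuan'jupiterMass'") raises BadPrefixError.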
>
> Turning to the rationale for these...
>
> Markus:
>
>> So, it may still be on the mostly harmless side of specification
>> prose, but I'd still say we shouldn't just do it because we can --
>> unless somebody clearly speaks out in favour of it (so we have a
>> target for pointing fingers later:-) I'd still prefer if it weren't
>> there.
>
> But as Rick says, the motivation here is to ensure robust parsing of
> unusual units.  If we forbid prefixes on quoted units, then we're saying
> that quoted units are very significantly different from unquoted but
> unknown ones.  That means that we forbid for example M'jupiterMass' --
> that looks pretty harmless to me, and so forbidding it doesn't sound
> like a great idea.
>
> I do say in the text (as Markus quotes), "this is not often likely to be a good
> idea."  I think that's true, but I can imagine it will _sometimes_ be a good idea.
>
> The only downsides to this are, it seems to me, that it makes the grammar
> less pretty (which I can live with), and makes the internal sub-parse
> marginally more complicated (but that's an implementer's problem).
>
> ----
>
> Quoted function names
>
> I _think_ that the idea of quoted function names was introduced (by me?) largely
> out of symmetry with the quoted units.  I can't (come to think of it) think of
> any reason why we'd want to distinguish
>
> log(Hz)
>
> from
>
> 'log'(Hz)
>
> Can anyone else?  Also, since there are no prefixes allowed on functions (!)
> there's no other ambiguity.  I'd be happy to remove this from the grammar unless
> anyone can reconstruct why we thought this was a good plan.  Hmm: looking back
> through the (very good) discussion of 25 July to 1 August this year, I can see
> no mention of this, and this may just have been a brainstorm on my part.
>
> Markus says:
>
>> Whether it's a good idea to allow arbitrary function names is of
>> course yet another matter.  Do we really want km(adu/s) and
>> km.(adu/s) both be well-formed but having a completely different
>> semantics?  Shouldn't log, ln, exp, and sqrt be good enough for
>> anyone?
>
> Having unknown functions is for cases such as "dB(adu/s)", which seems
> defensible for much the same reasons that allowing unknown units is, in
> the end, defensible.
>
> I hadn't thought about "km.(adu/s)".  I'm inclined to say that that's a
> curiosity, but that the ambiguity is tolerable.
>
> ----
>
> deka and friends
>
> Markus suggests:
>
>> In the light of this ambiguity, we leave the parse of da.*
>> unspecified.  This means that unit authors SHOULD not apply the
>> deci-prefix to units starting with a and not apply the deka-prefix
>> at all.
>
> I'd vote for that.
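
The ambiguity can be made visible with a small Python sketch that lists
every way of reading an unquoted, unknown symbol. The prefix table is an
illustrative subset and candidate_splits is a hypothetical helper, not part
of any VOUnits library.

```python
# Illustrative subset of prefixes, including both deci ('d') and deka ('da').
PREFIXES = {"d": 1e-1, "da": 1e1, "m": 1e-3, "k": 1e3}

def candidate_splits(symbol):
    """List every (prefix, base) reading of an unquoted, unknown unit symbol."""
    # The symbol can always be read as a whole, with no prefix.
    splits = [("", symbol)]
    for prefix in sorted(PREFIXES):
        # A prefix must leave a non-empty base unit behind it.
        if symbol.startswith(prefix) and len(symbol) > len(prefix):
            splits.append((prefix, symbol[len(prefix):]))
    return splits
```

For example, candidate_splits("dadu") returns
[("", "dadu"), ("d", "adu"), ("da", "du")]: the string cannot say whether a
deci-adu or a deka-du was meant, which is why the parse is left unspecified.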
>
> I do wish the august designers of the SI prefixes had thought a little bit more
> about the consequences here (similar two-letter arcana: the German eszett letter
> causes pain to Unicode implementers because a string including an eszett -- say
> 'faß' -- _changes the number of characters_ when it's uppercased to 'FASS', and
> German is either unique or almost unique in this property).
>
> We shall have to hope that ex-austro-hungarians -- thanks to Markus and Marco
> for this -- won't feel themselves persecuted by their inability to do their
> shopping by sending VOTables to their grocers.
>
> ----
>
> Explicitly Unknown units
>
> At present, the document states that a 'unit string' of
> "?" is not a valid unit, but that this should be recognised by the
> 'application layer', which will avoid then parsing the unit.  Markus
> suggested, and Arnold endorsed, the idea of a more obvious string, such
> as "UNKNOWN".  We could add a section just before '2.12 General rationale',
> as below.
>
> I also realise that we say rather little about dimensionless quantities.  At
> present, the string "" is not a valid VOUnits string, according to the grammar,
> even though table 14 says this is the recommended way of indicating a
> dimensionless quantity.  I feel we should mark this more positively.
>
> How about the following:
>
>> \section{Indicating dimensionless and unknown units}
>>
>> This specification reserves the unit \texttt{UNKNOWN}, which may not
>> appear in a VOUnits unit-string except as discussed here.  A unit-string
>> consisting of the string \texttt{UNKNOWN}, alone, indicates that a
>> quantity has unknown units.  This string should be recognised
>> case-insensitively by an application, as a separate step before attempting
>> any VOUnits parsing.
>>
>> A unit string consisting of the string \texttt{-}, alone, indicates that a
>> quantity is dimensionless.
>
> I'm in two minds about whether "-" should be explicitly recognised beforehand,
> or whether I should add it to the grammar.  It's probably fairly natural to add
> it to the grammar.
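
If both strings are recognised beforehand, the application-layer step might
look like this Python sketch. The function name and the three return labels
are hypothetical, and the sketch assumes "alone" means an exact match with
no surrounding whitespace.

```python
def classify_unit_string(unit_string):
    """Pre-check a unit string before handing it to a VOUnits parser.

    Returns 'unknown', 'dimensionless', or 'parse' (hypothetical labels).
    """
    # The reserved word is matched case-insensitively, and only when alone.
    if unit_string.casefold() == "unknown":
        return "unknown"
    # A lone hyphen marks a dimensionless quantity.
    if unit_string == "-":
        return "dimensionless"
    # Anything else goes on to the real parser.
    return "parse"
```

So classify_unit_string("Unknown") returns "unknown",
classify_unit_string("-") returns "dimensionless", and
classify_unit_string("km/s") returns "parse".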
>
> ----
>
> Odd units
>
> Arnold:
>
>> I do lament disallowing "cy": it's common and clear, and I'm not impressed
>> by "hyr", even less by "ha".
>
> Indeed.  I think the 'cy' got ruled out in the cross-fire between BIPM,
> the ISO and the IAU.  I wanted to avoid any units that weren't in at
> least _one_ of those three -- I'm still (with decreasing plausibility)
> trying to keep this document conservative.
>
>> I am not entirely comfortable with disallowing "Ba" and "ta"
>> (although I am about equally uncomfortable with allowing them).
>> The question is: what do you propose to do when someone asks
>> for putting a catalog that measures time in one of those units into a
>> VOTable?
>
> They're both perfectly allowed, as 'unknown units'.
>
> So the answer to your question "what do you propose to do...?" is "fine --
> there's nothing stopping you putting 'Ba' as a unit if you want to, as long as
> you believe the recipient will know how to interpret them".  I think most people
> who receive a VOTable (etc) with a 'Ba' column will be people who know and care
> what a besselian year is.
>
> I therefore propose removing these somewhat arcane units from the list of 'known'
> ones, given that the document is now blessing the presence of 'unknown' units in
> unit strings (and neither of them accidentally has an SI prefix, so there's no
> ambiguity, and neither would require the 'quotes' treatment).
>
>> There is a body of data files that uses a "Vanguard unit of time"
>> which actually is a centi-day - but the centi-day is disallowed.
>
> I'd be inclined to merely commiserate here, rather than go so flatly
> against the BIPM on an SI base unit.  Especially since I _can_ just about
> imagine a Candela appearing in an astronomical context.
>
> ----
>
> Underscores in strings
>
> I also think we should leave underscores out of strings.
>
> ----
>
> Future versions
>
> I take Markus's point that a new document version seems a rather heavyweight
> way to add new units.  However simply bumping 1.0 to 1.1, or even to
> 1.0.1 might be enough, and could be done in a very short time.  So I
> could add language to that effect:
>
>> Future versions of this specification may add to the set of known units, by
>> releasing a minor update (for example 1.n to 1.(n+1)).
>
> The DocStd document <http://www.ivoa.net/documents/DocStd/> does indicate (Sect
> 1.2) that such an update should go through the whole PR/RFC/TCG process.
>
> The current document takes its list of known units from the list in
> src/grammar/known-units.csv at <https://bitbucket.org/nxg/unity>, and one way of
> updating the list of units would be to declare that this file has some normative
> value.
>
> ----
>
> Summary (at last!)
>
> Marco puts it well:
>
> - I prefer no underscores
> - I like the "unknown" instead of '?'
> - I don't think we need prefixes to 'quoted' units
> - I'll prefer having no quoted functions
> - I have no particular opinion on the deci/deka problem, but I like Markus' rewording:
>
> The prefixes-to-quoted-units issue seems to be the outstanding question.  But
> apart from that, I agree with this very short list!
>
> I therefore propose the following list of changes
>
>    * wording changes as indicated above
>    * remove 'Ba' and 'ta' from the list of known units
>    * remove quoted function names from the grammar
>    * add spec text to say that "unknown" and "-" are to be used to indicate unknown and dimensionless units
>
> All the best,
>
> Norman
>
>

-- 
======================================================================
Francois Ochsenbein   ---   Observatoire Astronomique de Strasbourg
ochsenbein at evc.net    ---   francois.ochsenbein at astro.unistra.fr
+33 (0)3 88 77 81 17  ---   +33 (0)3 68 85 24 29
======================================================================

