VOUnits: _another_ version, based on implementation feedback

Arnold Rots arots at cfa.harvard.edu
Wed Nov 6 06:47:27 PST 2013

I would favor leaving ta and Ba in deprecated state.
It's the least of the uncomfortable states.

  - Arnold

On Tue, Nov 5, 2013 at 6:08 AM, Norman Gray <norman at astro.gla.ac.uk> wrote:

> Greetings, Semantics people.
> Thanks to Markus for kicking off a very useful discussion of the units
> document this week.  Like Markus, I believe that the remaining decisions
> are not profound, but do need some explicit consensus on the mailing list.
> [this is another long one, but it's not particularly intricate, after the
> first section, so you can probably skim it fairly quickly]
> In rough order of contentiousness...
> ----
> Prefixes on quoted units
> Myself, I share everyone's mild distaste, but I think these are pretty
> much unavoidable (and not perhaps quite as bad as you may think).
> The logic is:
>   * we want to allow 'unknown units' (because otherwise we have to have
>     a long list of approved units, which will be permanently out of date,
>     and which will never satisfy everyone)
>   * but that means we have to allow SI prefixes on unknown units (because
>     if we don't, then "MBa" (mega-besselian-year) will be syntactically
>     valid or invalid depending on whether or not 'Ba' is listed as
>     'known', which means that data providers will have to memorise the
>     list of known units)
>   * so we have to have a way of quoting units (or else 'martianDay',
>     presuming that's an 'unknown' unit, would have to be interpreted as
>     milli-artianDay)
>   * so we must, I think, allow prefixes on those quoted units (or else we
>     have to write "'martianDay'" for those units, but remember to drop
>     the quotes and write "kmartianDay" when we talk about 1000s of them
>     (and then how do we parse the unit of 1000 days on Io, the
>     "kioDay"?)).
> The reason that SI (and IEEE binary) prefixes don't explicitly appear
> in the grammar (ie, there's no 'si-prefix base-unit-string' pattern) is
> because there's an intrinsic ambiguity here with, for example, the 'Pa'
> or the 'mag'.  The only reason we don't parse this as the peta-'a' or
> milli-'ag' is because we have the _semantic_ knowledge that 'Pa' and
> 'mag' are members of a small, but not negligible, set of special cases.
> That is, the (high-ish level) semantics of some unit strings interfere,
> in a rather irritating way, with the otherwise purely (low-level) syntactic
> issues involved in parsing the string.
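
A minimal Python sketch of the ambiguity described above. The prefix tuple is only an illustrative subset of the scale-factor table, and `KNOWN_UNITS`, `candidate_splits` and `resolve` are hypothetical names, not part of the unity libraries. The order of preference in `resolve` is one plausible policy, assumed here for illustration, not something the specification is known to mandate.

```python
# Illustrative subset of the SI prefixes (assumption: not the full table).
SI_PREFIXES = ("da", "d", "c", "m", "u", "k", "M", "G", "P")

# Hypothetical stand-in for the 'semantic knowledge' of known units.
KNOWN_UNITS = {"Pa", "mag", "m", "a", "adu"}


def candidate_splits(unit):
    """Return every purely syntactic (prefix, base) reading of a unit string."""
    splits = [("", unit)]  # the whole string read as an unprefixed unit
    for prefix in SI_PREFIXES:
        if unit.startswith(prefix) and len(unit) > len(prefix):
            splits.append((prefix, unit[len(prefix):]))
    return splits


def resolve(unit):
    """Pick one reading, using the known-units table to break ties."""
    if unit in KNOWN_UNITS:
        return ("", unit)  # 'Pa' is the pascal, not the peta-'a'
    prefixed = candidate_splits(unit)[1:]
    for prefix, base in prefixed:
        if base in KNOWN_UNITS:
            return (prefix, base)
    # Unknown base unit: a leading prefix letter is still read as a prefix.
    return prefixed[0] if prefixed else ("", unit)


for text in ("Pa", "mag", "MBa", "martianDay", "dadu"):
    print(text, candidate_splits(text), "->", resolve(text))
```

Syntax alone gives 'Pa' and 'mag' two readings each, and gives 'dadu' three; only the table lookup in `resolve` chooses between them, which is why the lexer cannot do this job on its own.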
> [[[ Technical aside (for those who haven't had the pleasure of working
> with parser-generators): the way such a parser framework works is that
> one function, the 'lexer', identifies 'terminals' in an input, such as
> STRING and SIGNED_INTEGER, and reports them in sequence to the actual
> parser, which uses the 'grammar' to decide that the string of lexemes is
> or is not an allowed sequence.  The 'grammar' is still purely syntactic,
> with no semantics attached.  This isn't just a yacc problem, by the way:
> essentially the same problem would appear using any parsing technology,
> so it's an artefact of the desire for a machine-readable grammar, in
> contrast to specifying the grammar in text and requiring implementers to
> create a hand-written parser. ]]]
> That means that a lexer can't be given the task of spotting the 'si-prefix'
> strings, and the prefixes have to be identified in a sub-parse of the
> STRING or QUOTED_STRING which emerges from the lexer.  Put another way,
> a 'semantic sub-parse' _has_ to have a (brief) look at the unit string
> _before_ we split it into base unit and prefix.
> It's that sub-parse that ensures that only permitted prefixes appear,
> and it's the same sub-parse that ensures that the prefix before the
> QUOTED_STRING is only one of the permitted ones.  So yes, Markus's example
> of gargantuan'jupiterMass' does indeed _appear_ to be valid according to
> the yacc grammar, but the text of the specification, and the library,
> forbids it.
> (actually, the text could be a bit more explicit about that.  How about:
> > Quoted units can take prefixes (they are `unknown units', so there are
> > no restrictions on their usage), so that \unit{m'furlong'} is a
> > milli-furlong, and \unit{m'm'} is a milli-`m'.  As with 'known units',
> > the only permissible prefixes are those of
> > \prettyref{tab:vouscalefactors}.
> and I should highlight, in the grammar appendix, that this is an
> extra-syntactic constraint)
> That is, Markus's remark:
> > believe if what we want to do here is allow prefixes on quoted units,
> > things should look somewhat like
> >
> > siPrefix: "u" | "c" | "d" | "da" | "h" |...
> > unit: ...
> >  | siPrefix QUOTED_STRING
> >
> > -- and that would be ugly because we don't otherwise talk about SI
> > prefixes in the grammar, and I'd not feel too good about introducing
> > them now.
> >
> > If, on the other hand, the "gargantuan" above should only blow up
> > during unit interpretation, we have another error type that would
> > come out of the parser, something like "invalid SI prefix", and
> > that's arguably a complication of the interface, not to speak of the
> > parser function for the unit production.
> ... is perfectly correct.
> Regarding the interface, when presented with "gargantuan'jupiterMass'",
> the C library reports "parse error: units parsing error: Impossible
> prefix before quoted unit", and the Java library "Error parsing units:
> unity parser error at character 11: error creating unit -- bad prefix?:
> gargantuan".
> That is, the gargantuan'jupiterMass' _is_ reported as a syntax error,
> rather than having to be asked about (as is the case for the
> known/unknown unit distinction).
> (well, _now_ the C version reports that; before 10 minutes ago, it
> produced an assertion error!)
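
A short Python sketch of the kind of sub-parse described above, in which a bad prefix before a quoted unit is rejected as a syntax error. The names `parse_quoted_unit`, `UnitSyntaxError` and `PERMITTED_PREFIXES` are hypothetical, the prefix set is an illustrative subset, and the error text is not the wording used by the C or Java library.

```python
import re

# Illustrative subset of the permitted prefixes (assumption: not the full table).
PERMITTED_PREFIXES = {"da", "d", "c", "m", "u", "k", "M", "G"}

# Optional run of letters, then a non-empty single-quoted unit name.
_QUOTED_UNIT = re.compile(r"([A-Za-z]*)'([^']+)'")


class UnitSyntaxError(ValueError):
    """Raised when a unit string is rejected by the sub-parse."""


def parse_quoted_unit(text):
    """Split an optionally prefixed quoted unit into (prefix, base)."""
    match = _QUOTED_UNIT.fullmatch(text)
    if match is None:
        raise UnitSyntaxError(f"not a quoted unit: {text!r}")
    prefix, base = match.groups()
    # The grammar accepts any leading letters; this check is the
    # extra-syntactic constraint that only real prefixes are allowed.
    if prefix and prefix not in PERMITTED_PREFIXES:
        raise UnitSyntaxError(f"bad prefix before quoted unit: {prefix!r}")
    return prefix, base


print(parse_quoted_unit("M'jupiterMass'"))
try:
    parse_quoted_unit("gargantuan'jupiterMass'")
except UnitSyntaxError as err:
    print("rejected:", err)
```

With this shape the caller sees a single error type for both malformed input and impossible prefixes, so no separate "invalid SI prefix" result is added to the interface.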
> Turning to the rationale for these...
> Markus:
> > So, it may still be on the mostly harmless side of specification
> > prose, but I'd still say we shouldn't just do it because we can --
> > unless somebody clearly speaks out in favour of it (so we have a
> > target for pointing fingers later:-) I'd still prefer if it weren't
> > there.
> But as Rick says, the motivation here is to ensure robust parsing of
> unusual units.  If we forbid prefixes on quoted units, then we're saying
> that quoted units are very significantly different from unquoted but
> unknown ones.  That means that we forbid for example M'jupiterMass' --
> that looks pretty harmless to me, and so forbidding it doesn't sound
> like a great idea.
> I do say in the text (as Markus quotes), "this is not often likely to be
> a good idea."  I think that's true, but I can imagine it will _sometimes_
> be a good idea.
> The only downsides to this are, it seems to me, that it makes the grammar
> less pretty (which I can live with), and makes the internal sub-parse
> marginally more complicated (but that's an implementer's problem).
> ----
> Quoted function names
> I _think_ that the idea of quoted function names was introduced (by me?)
> largely out of symmetry with the quoted units.  I can't (come to think of
> it) think of any reason why we'd want to distinguish
> log(Hz)
> from
> 'log'(Hz)
> Can anyone else?  Also, since there are no prefixes allowed on functions
> (!) there's no other ambiguity.  I'd be happy to remove this from the
> grammar unless anyone can reconstruct why we thought this was a good plan.
> Hmm: looking back through the (very good) discussion of 25 July to
> 1 August this year, I can see no mention of this, and this may just have
> been a brainstorm on my part.
> Markus says:
> > Whether it's a good idea to allow arbitrary function names is of
> > course yet another matter.  Do we really want km(adu/s) and
> > km.(adu/s) both be well-formed but having a completely different
> > semantics?  Shouldn't log, ln, exp, and sqrt be good enough for
> > anyone?
> Having unknown functions is for cases such as "dB(adu/s)", which seems
> defensible for much the same reasons that allowing unknown units is, in
> the end, defensible.
> I hadn't thought about "km.(adu/s)".  I'm inclined to say that that's a
> curiosity, but that the ambiguity is tolerable.
> ----
> deka and friends
> Markus suggests:
> > In the light of this ambiguity, we leave the parse of da.*
> > unspecified.  This means that unit authors SHOULD not apply the
> > deci-prefix to units starting with a and not apply the deka-prefix
> > at all.
> I'd vote for that.
> I do wish the august designers of the SI prefixes had thought a little
> bit more about the consequences here (similar two-letter arcana: the
> German eszett letter causes pain to Unicode implementers because a string
> including an eszett -- say 'faß' -- _changes the number of characters_
> when it's uppercased to 'FASS', and German is either unique or almost
> unique in this property).
> We shall have to hope that ex-austro-hungarians -- thanks to Markus and
> Marco for this -- won't feel themselves persecuted by their inability to
> do their shopping by sending VOTables to their grocers.
> ----
> Explicitly Unknown units
> At present, the document states that a 'unit string' of
> "?" is not a valid unit, but that this should be recognised by the
> 'application layer', which will then avoid parsing the unit.  Markus
> suggested, and Arnold endorsed, the idea of a more obvious string, such
> as "UNKNOWN".  We could add a section just before '2.12 General rationale',
> as below.
> I also realise that we say rather little about dimensionless quantities.
> At present, the string "" is not a valid VOUnits string, according to the
> grammar, even though table 14 says this is the recommended way of
> indicating a dimensionless quantity.  I feel we should mark this more
> positively.
> How about the following:
> > \section{Indicating dimensionless and unknown units}
> >
> > This specification reserves the unit \texttt{UNKNOWN}, which may not
> > appear in a VOUnits unit-string except as discussed here.  A unit-string
> > consisting of the string \texttt{UNKNOWN}, alone, indicates that a
> > quantity has unknown units.  This string should be recognised
> > case-insensitively by an application, as a separate step before
> > attempting any VOUnits parsing.
> >
> > A unit string consisting of the string \texttt{-}, alone, indicates
> > that a quantity is dimensionless.
> I'm in two minds about whether "-" should be explicitly recognised
> beforehand, or whether I should add it to the grammar.  It's probably
> fairly natural to add it to the grammar.
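
A minimal Python sketch of the "separate step before parsing" option in the proposed text. The function name `classify_unit_string` and the three result labels are hypothetical. Handling "-" here rather than in the grammar is only one of the two choices weighed above.

```python
def classify_unit_string(text):
    """Screen a unit string before handing it to the VOUnits parser.

    Returns a (kind, payload) pair, where kind is 'unknown',
    'dimensionless' or 'parse'; payload is the string still to be parsed.
    """
    # The reserved word is matched case-insensitively and only when alone.
    if text.upper() == "UNKNOWN":
        return ("unknown", None)
    # A lone hyphen marks a dimensionless quantity.
    if text == "-":
        return ("dimensionless", None)
    # Anything else goes on to the real parser untouched.
    return ("parse", text)


for example in ("UNKNOWN", "unknown", "-", "km/s"):
    print(repr(example), classify_unit_string(example))
```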
> ----
> Odd units
> Arnold:
> > I do lament disallowing "cy": it's common and clear, and I'm not
> > impressed by "hyr", even less by "ha".
> Indeed.  I think the 'cy' got ruled out in the cross-fire between BIPM,
> the ISO and the IAU.  I wanted to avoid any units that weren't in at
> least _one_ of those three -- I'm still (with decreasing plausibility)
> trying to keep this document conservative.
> > I am not entirely comfortable with disallowing "Ba" and "ta"
> > (although I am about equally uncomfortable with allowing them).
> > The question is: what do you propose to do when someone asks
> > for putting a catalog that measures time in one of those units into a
> > VOTable?
> They're both perfectly allowed, as 'unknown units'.
> So the answer to your question "what do you propose to do...?" is "fine
> -- there's nothing stopping you putting 'Ba' as a unit if you want to, as
> long as you believe the recipient will know how to interpret them".  I
> think most people who receive a VOTable (etc) with a 'Ba' column will be
> people who know and care what a besselian year is.
> I therefore propose removing these somewhat arcane units from the list of
> 'known' ones, given that the document is now blessing the presence of
> 'unknown' units in unit strings (and neither of them accidentally has an
> SI prefix, so there's no ambiguity, and neither would require the
> 'quotes' treatment).
> > There is a body of data files that uses a "Vanguard unit of time"
> > which actually is a centi-day - but the centi-day is disallowed.
> I'd be inclined to merely commiserate here, rather than go so flatly
> against the BIPM on an SI base unit.  Especially since I _can_ just about
> imagine a Candela appearing in an astronomical context.
> ----
> Underscores in strings
> I also think we should leave underscores out of strings.
> ----
> Future versions
> I take Markus's point that a new document version seems a rather
> heavyweight way to add new units.  However, simply bumping 1.0 to 1.1, or
> even to 1.0.1, might be enough, and could be done in a very short time.
> So I could add language to that effect:
> > Future versions of this specification may add to the set of known
> > units, by releasing a minor update (for example 1.n to 1.(n+1)).
> The DocStd document <http://www.ivoa.net/documents/DocStd/> does indicate
> (Sect 1.2) that such an update should go through the whole PR/RFC/TCG
> process.
> The current document takes its list of known units from the list in
> src/grammar/known-units.csv at <https://bitbucket.org/nxg/unity>, and one
> way of updating the list of units would be to declare that this file has
> some normative value.
> ----
> Summary (at last!)
> Marco puts it well:
> - I prefer no underscores
> - I like the "unknown" instead of '?'
> - I don't think we need prefixes to 'quoted' units
> - I'll prefer having no quoted functions
> - I have no particular opinion on the deci/deka problem, but I like
> Markus' rewording:
> The prefixes-to-quoted-units issue seems to be the outstanding question.
> But apart from that, I agree with this very short list!
> I therefore propose the following list of changes
>   * wording changes as indicated above
>   * remove 'Ba' and 'ta' from the list of known units
>   * remove quoted function names from the grammar
>   * add spec text to say that "unknown" and "-" are to be used to indicate
>     unknown and dimensionless units
> All the best,
> Norman
> --
> Norman Gray  :  http://nxg.me.uk
> SUPA School of Physics and Astronomy, University of Glasgow, UK
