VOUnits: _another_ version, based on implementation feedback

Norman Gray norman at astro.gla.ac.uk
Tue Nov 5 03:08:17 PST 2013


Greetings, Semantics people.

Thanks to Markus for kicking off a very useful discussion of the units document
this week.  Like Markus, I believe that the remaining decisions are not
profound, but do need some explicit consensus on the mailing list.

[this is another long one, but it's not particularly intricate, after the first section, so you can probably skim it fairly quickly]

In rough order of contentiousness...

----

Prefixes on quoted units

Myself, I share everyone's mild distaste, but I think these are pretty
much unavoidable (and not perhaps quite as bad as you may think).

The logic is:

  * we want to allow 'unknown units' (because otherwise we have to have a long
    list of approved units, which will be permanently out of date, and which
    will never satisfy everyone)

  * but that means we have to allow SI prefixes on unknown units (because
    if we don't, then "MBa" (mega-besselian-year) will be syntactically valid or
    invalid depending on whether or not 'Ba' is listed as 'known', which means
    that data providers will have to memorise the list of known units)

  * so we have to have a way of quoting units (or else 'martianDay', presuming
    that's an 'unknown' unit, would have to be interpreted as milli-artianDay)

  * so we must, I think, allow prefixes on those quoted units (or else we have
    to write "'martianDay'" for those units, but remember to drop the quotes and
    write "kmartianDay" when we talk about 1000s of them (and then how do we
    parse the unit of 1000 days on Io, the "kioDay"?)).


The reason that SI (and IEEE binary) prefixes don't explicitly appear
in the grammar (ie, there's no 'si-prefix base-unit-string' pattern) is
that there's an intrinsic ambiguity here with, for example, the 'Pa'
or the 'mag'.  The only reason we don't parse these as the peta-'a' or
milli-'ag' is that we have the _semantic_ knowledge that 'Pa' and
'mag' are members of a small, but not negligible, set of special cases.
That is, the (high-ish level) semantics of some unit strings interfere,
in a rather irritating way, with the otherwise purely (low-level) syntactic
issues involved in parsing the string.

[[[ Technical aside (for those who haven't had the pleasure of working with
parser-generators): the way such a parser framework works is that one function,
the 'lexer', identifies 'terminals' in an input, such as STRING and
SIGNED_INTEGER, and reports them in sequence to the actual parser, which uses
the 'grammar' to decide that the string of lexemes is or is not an allowed
sequence.  The 'grammar' is still purely syntactic, with no semantics attached.

This isn't just a yacc problem, by the way: essentially the same problem would
appear using any parsing technology, so it's an artefact of the desire for a
machine-readable grammar, in contrast to specifying the grammar in text and
requiring implementers to create a hand-written parser. ]]]

That means that a lexer can't be given the task of spotting the 'si-prefix'
strings, and the prefixes have to be identified in a sub-parse of the
STRING or QUOTED_STRING which emerges from the lexer.  Put another way,
a 'semantic sub-parse' _has_ to have a (brief) look at the unit string
_before_ we split it into base unit and prefix.
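
As a minimal sketch of that sub-parse (in Python, not the code of either
library): the tables below are tiny hypothetical subsets chosen for
illustration, and the name sub_parse is invented here.  The point is only
the ordering, with the known-unit lookup happening before any attempt to
peel off a prefix.

```python
# Hypothetical tables: a tiny subset chosen for illustration only.
KNOWN_UNITS = {"m", "g", "s", "a", "Pa", "mag", "Hz"}
SI_PREFIXES = ["da", "P", "M", "k", "m", "c", "d", "u"]  # two-letter prefix first


def sub_parse(token):
    """Split an unquoted STRING token into (prefix, base, base_is_known)."""
    # Step 1: the semantic look. A whole-token match against the known
    # units wins, so 'Pa' is the pascal and never the peta-'a'.
    if token in KNOWN_UNITS:
        return ("", token, True)
    # Step 2: only now try to peel off a prefix.
    for prefix in SI_PREFIXES:
        base = token[len(prefix):]
        if token.startswith(prefix) and base:
            return (prefix, base, base in KNOWN_UNITS)
    # Step 3: no prefix applies, so this is an unprefixed unknown unit.
    return ("", token, False)


print(sub_parse("mag"))         # ('', 'mag', True)
print(sub_parse("MBa"))         # ('M', 'Ba', False)
print(sub_parse("martianDay"))  # ('m', 'artianDay', False)
```

The last line shows why quoting is needed: without quotes, 'martianDay'
can only be read as the milli-artianDay.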

It's that sub-parse that ensures that only permitted prefixes appear,
and it's the same sub-parse that ensures that the prefix before the
QUOTED_STRING is only one of the permitted ones.  So yes, Markus's example
of gargantuan'jupiterMass' does indeed _appear_ to be valid according to
the yacc grammar, but the text of the specification, and the library, forbids it.

(actually, the text could be a bit more explicit about that.  How about:

> Quoted units can take prefixes (they are `unknown units', so there are
> no restrictions on their usage), so that \unit{m'furlong'} is a
> milli-furlong, and \unit{m'm'} is a milli-`m'.  As with 'known units', 
> the only permissible prefixes are those of \prettyref{tab:vouscalefactors}.

and I should highlight, in the grammar appendix, that this is an extra-syntactic
constraint)
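
To make that extra-syntactic constraint concrete, here is a hedged sketch
in Python.  The prefix set is a hypothetical subset of the scale-factor
table, and the function name and error messages are invented for this
example; they are not those of the C or Java libraries.

```python
import re

# Hypothetical subset of the permitted scale-factor prefixes.
PERMITTED_PREFIXES = {"da", "P", "G", "M", "k", "h", "d", "c", "m", "u", "n"}

# An optional run of letters, then a single-quoted name.
QUOTED_UNIT = re.compile(r"([A-Za-z]*)'([^']+)'")


def parse_quoted_unit(text):
    """Return (prefix, name) for PREFIX'name', or raise ValueError."""
    match = QUOTED_UNIT.fullmatch(text)
    if match is None:
        raise ValueError("not a quoted unit: " + text)
    prefix, name = match.groups()
    # The extra-syntactic constraint: the pattern accepts any letters
    # here, so the permitted-prefix check has to happen in this sub-parse.
    if prefix and prefix not in PERMITTED_PREFIXES:
        raise ValueError("bad prefix before quoted unit: " + prefix)
    return (prefix, name)


print(parse_quoted_unit("m'furlong'"))      # ('m', 'furlong')
print(parse_quoted_unit("M'jupiterMass'"))  # ('M', 'jupiterMass')
```

With this check in place, gargantuan'jupiterMass' matches the pattern but
is rejected with a ValueError, which mirrors the behaviour described for
the libraries.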

That is, Markus's remark:

> believe if what we want to do here is allow prefixes on quoted units,
> things should look somewhat like
> 
> siPrefix: "u" | "c" | "d" | "da" | "h" |...
> unit: ...
>  | siPrefix QUOTED_STRING
> 
> -- and that would be ugly because we don't otherwise talk about SI
> prefixes in the grammar, and I'd not feel to good about introducing
> them now.
> 
> If, on the other hand, the "gargantuan" above should only blow up
> during unit interpretation, we have another error type that would
> come out of the parser, something like "invalid SI prefix", and
> that's arguably a complication of the interface, not to speak of the
> parser function for the unit production.

... is perfectly correct.

Regarding the interface, when presented with "gargantuan'jupiterMass'",
the C library reports "parse error: units parsing error: Impossible
prefix before quoted unit", and the Java library "Error parsing units:
unity parser error at character 11: error creating unit -- bad prefix?:
gargantuan".

That is, gargantuan'jupiterMass' _is_ reported as a syntax error by the parser,
rather than being something the caller has to ask about separately (as is the
case for the known/unknown unit distinction).

(well, _now_ the C version reports that; before 10 minutes ago, it produced an
assertion error!)

Turning to the rationale for these...

Markus:

> So, it may still be on the mostly harmless side of specification
> prose, but I'd still say we shouldn't just do it because we can --
> unless somebody clearly speaks out in favour of it (so we have a
> target for pointing fingers later:-) I'd still prefer if it weren't
> there.

But as Rick says, the motivation here is to ensure robust parsing of
unusual units.  If we forbid prefixes on quoted units, then we're saying
that quoted units are very significantly different from unquoted but
unknown ones.  That means that we forbid for example M'jupiterMass' --
that looks pretty harmless to me, and so forbidding it doesn't sound
like a great idea.

I do say in the text (as Markus quotes), "this is not often likely to be a good
idea."  I think that's true, but I can imagine it will _sometimes_ be a good idea.

The only downsides to this are, it seems to me, that it makes the grammar
less pretty (which I can live with), and makes the internal sub-parse
marginally more complicated (but that's an implementer's problem).

----

Quoted function names

I _think_ that the idea of quoted function names was introduced (by me?) largely
out of symmetry with the quoted units.  I can't (come to think of it) think of
any reason why we'd want to distinguish 

log(Hz)

from 

'log'(Hz)

Can anyone else?  Also, since there are no prefixes allowed on functions (!)
there's no other ambiguity.  I'd be happy to remove this from the grammar unless
anyone can reconstruct why we thought this was a good plan.  Hmm: looking back
through the (very good) discussion of 25 July to 1 August this year, I can see
no mention of this, and this may just have been a brainstorm on my part.

Markus says:

> Whether it's a good idea to allow arbitrary function names is of
> course yet another matter.  Do we really want km(adu/s) and
> km.(adu/s) both be well-formed but having a completely different
> semantics?  Shouldn't log, ln, exp, and sqrt be good enough for
> anyone?

Having unknown functions is for cases such as "dB(adu/s)", which seems
defensible for much the same reasons that allowing unknown units is, in
the end, defensible.

I hadn't thought about "km.(adu/s)".  I'm inclined to say that that's a
curiosity, but that the ambiguity is tolerable.

----

deka and friends

Markus suggests:

> In the light of this ambiguitiy, we leave the parse of da.*
> unspecified.  This means that unit authors SHOULD not apply the
> deci-prefix to units starting with a and not apply the deka-prefix
> at all.

I'd vote for that.
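
For anyone who wants to see the ambiguity spelled out, a small Python
sketch follows.  The prefix tuple is a hypothetical subset and
candidate_splits is a name invented for this example; it simply lists
every way of reading the start of a string as a prefix.

```python
# Hypothetical subset of prefixes, for illustration only.
PREFIXES = ("da", "d", "k", "m")


def candidate_splits(token):
    """List every (prefix, base) reading of an unquoted unit string."""
    return [(p, token[len(p):]) for p in PREFIXES
            if token.startswith(p) and len(token) > len(p)]


print(candidate_splits("km"))   # [('k', 'm')]
print(candidate_splits("dag"))  # [('da', 'g'), ('d', 'ag')]
```

Any string starting with 'da' has two readings, and since unknown units
are allowed, the syntax alone can't rule out the deci-'ag'.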

I do wish the august designers of the SI prefixes had thought a little bit more
about the consequences here (similar two-letter arcana: the German eszett letter
causes pain to Unicode implementers because a string including an eszett -- say
'faß' -- _changes the number of characters_ when it's uppercased to 'FASS', and
German is either unique or almost unique in this property).

We shall have to hope that ex-austro-hungarians -- thanks to Markus and Marco
for this -- won't feel themselves persecuted by their inability to do their
shopping by sending VOTables to their grocers.

----

Explicitly Unknown units

At present, the document states that a 'unit string' of
"?" is not a valid unit, but that this should be recognised by the
'application layer', which will then avoid parsing the unit.  Markus
suggested, and Arnold endorsed, the idea of a more obvious string, such
as "UNKNOWN".  We could add a section just before '2.12 General rationale',
as below.

I also realise that we say rather little about dimensionless quantities.  At
present, the string "" is not a valid VOUnits string, according to the grammar,
even though table 14 says this is the recommended way of indicating a
dimensionless quantity.  I feel we should mark this more positively.

How about the following:

> \section{Indicating dimensionless and unknown units}
>
> This specification reserves the unit \texttt{UNKNOWN}, which may not
> appear in a VOUnits unit-string except as discussed here.  A unit-string
> consisting of the string \texttt{UNKNOWN}, alone, indicates that a
> quantity has unknown units.  This string should be recognised
> case-insensitively by an application, as a separate step before attempting
> any VOUnits parsing.
>
> A unit string consisting of the string \texttt{-}, alone, indicates that a
> quantity is dimensionless.

I'm in two minds about whether "-" should be explicitly recognised beforehand,
or whether I should add it to the grammar.  It's probably fairly natural to add
it to the grammar.
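
A sketch of what the 'separate step' could look like in an application, in
Python.  The function name and its return values are invented for this
example, and it assumes the first option, ie that "-" is recognised
beforehand rather than handled in the grammar.

```python
def classify_unit_string(unit_string):
    """Return 'unknown', 'dimensionless', or 'parse' (hand on to the parser)."""
    # Recognised case-insensitively, and only when the string stands alone.
    if unit_string.upper() == "UNKNOWN":
        return "unknown"
    if unit_string == "-":
        return "dimensionless"
    # Everything else goes to the real VOUnits parser.
    return "parse"


print(classify_unit_string("Unknown"))  # unknown
print(classify_unit_string("-"))        # dimensionless
print(classify_unit_string("km/s"))     # parse
```

A string such as "UNKNOWN/s" falls through to the parser, which would then
be expected to reject it, since the reserved word may not appear inside a
larger unit-string.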

----

Odd units

Arnold:

> I do lament disallowing "cy": it's common and clear, and I'm not impressed
> by "hyr", even less by "ha".

Indeed.  I think the 'cy' got ruled out in the cross-fire between BIPM,
the ISO and the IAU.  I wanted to avoid any units that weren't in at
least _one_ of those three -- I'm still (with decreasing plausibility)
trying to keep this document conservative.

> I am not entirely comfortable with disallowing "Ba" and "ta"
> (although I am about equally uncomfortable with allowing them).
> The question is: what do you propose to do when someone asks
> for putting a catalog that measures time in one of thsoe units into a
> VOTable?

They're both perfectly allowed, as 'unknown units'.

So the answer to your question "what do you propose to do...?" is "fine --
there's nothing stopping you putting 'Ba' as a unit if you want to, as long as
you believe the recipient will know how to interpret them".  I think most people
who receive a VOTable (etc) with a 'Ba' column will be people who know and care
what a besselian year is.

I therefore propose removing these somewhat arcane units from the list of 'known'
ones, given that the document is now blessing the presence of 'unknown' units in
unit strings (and neither of them accidentally has an SI prefix, so there's no
ambiguity, and neither would require the 'quotes' treatment).

> There is a body of data files that uses a "Vanguard unit of time"
> which actually is a centi-day - but the centi-day is disallowed.

I'd be inclined to merely commiserate here, rather than go so flatly
against the BIPM on an SI base unit.  Especially since I _can_ just about
imagine a candela appearing in an astronomical context.

----

Underscores in strings

I also think we should leave underscores out of strings.

----

Future versions

I take Markus's point that a new document version seems a rather heavyweight
way to add new units.  However, simply bumping 1.0 to 1.1, or even to
1.0.1, might be enough, and could be done in a very short time.  So I
could add language to that effect:

> Future versions of this specification may add to the set of known units, by
> releasing a minor update (for example 1.n to 1.(n+1)).

The DocStd document <http://www.ivoa.net/documents/DocStd/> does indicate (Sect
1.2) that such an update should go through the whole PR/RFC/TCG process.

The current document takes its list of known units from the list in
src/grammar/known-units.csv at <https://bitbucket.org/nxg/unity>, and one way of
updating the list of units would be to declare that this file has some normative
value.

----

Summary (at last!)

Marco puts it well:

- I prefer no underscores
- I like the "unknown" instead of '?'
- I don't think we need prefixes to 'quoted' units
- I'll prefer having no quoted functions
- I have no particular opinion on the deci/deka problem, but I like Markus' rewording:

The prefixes-to-quoted-units issue seems to be the outstanding question.  But
apart from that, I agree with this very short list!

I therefore propose the following list of changes

  * wording changes as indicated above
  * remove 'Ba' and 'ta' from the list of known units
  * remove quoted function names from the grammar
  * add spec text to say that "unknown" and "-" are to be used to indicate unknown and dimensionless units

All the best,

Norman


-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK


