VOUnits RFC
Norman Gray
norman at astro.gla.ac.uk
Tue Jul 30 07:10:18 PDT 2013
Markus and all, hello.
On 2013 Jul 30, at 09:24, Markus Demleitner wrote:
> Nice discussion, and a big thanks to all dropping in. As the
> troublemaker who started it, I feel compelled to reply to some of the
> mails. Sorry for disturbing threads, but I think that's preferable
> to lots of small mails.
Good idea -- I'll reply to these points all together.
> So, first, Norman in <83AB7956-24D4-411A-994D-60028AD9BEC6 at astro.gla.ac.uk>
>> And the specification _is_ concerned with the FITS syntax because (as I've
>> stressed above), it's a goal that any VOUnits-compatible units string would
>> also be a syntactically valid FITS unit string (and the same for CDS and
>> nearly so for OGIP). Permitting a scaling factor would break that.
>
> As you've pointed out elsewhere, it's not an intersection anyway, so
> another slight breakage wouldn't hurt, would it?
That is true. The difference is that the other compatibility breakage is forced rather than voluntary. That aside, allowing leading scale factors would mean that there would be two cases, rather than just one, where a FITS-compatible units parser wouldn't be guaranteed to work on a VOUnits string. I don't know how much worse that would be.
> Then, replying to Tom's <51F681A6.40804 at nasa.gov>
>> In VOTables (and elsewhere) we don't have, AFAIK, a comparable scaling
>> capability nor is it likely any time soon. Since I perceive that astronomers
>> are oft enamored of non-SI units, we'd be requiring wholesale rescaling of
>> values in tables for tables to be able to use this convention. I don't see
>> that happening.
>>
>> A point of clarification: I'm not positive I follow where the rescaling would
>> be necessary. Do you mean that at present VOTables can use "1.9x10+27kg" as a
>> unit string (because they use CDS-format unit strings), but couldn't if there
>> was an immediate switch to VOUnits strings, and therefore that the content of
>> the VOTable would have to be scaled when it's generated?
>
> That, at least, is 50% of what I am worried about. You see, as far
> as I understood things, VOUnit is supposed to say what's allowed in
> VOTable unit strings (in the end, at least). Without the scale
> factors, quite a few of my unit strings will become invalid, at least
> until we'd have the great unit translation and I'd have my data
> providers' strange units pushed in there. It will come as no
> surprise that I don't like that.
Are VOTables often _stored_ rather than generated on the fly? That is, are you presenting an archival problem (stored VOTables will stop being valid) or a behaviour problem (generated VOTables will change the syntax of unit strings)?
Is this actually a problem? Do VOTable parsers actually try to parse the unit strings? If so, they're presumably going to have to be pretty tolerant, if they have to cope with the mish-mash of units you've listed in your ADQL query. If they're tolerant, then they can tolerate a change of mandated unit syntax.
> Norman lists VOTable units strings as a use case and then adds:
>
>> Other places where you might want a unit string are:
>>
>> * in a structured comment in a RDBMS or other schema, documenting a column;
>> * in a request to a web service (SOAP or otherwise), indicating the desired
>> units of the result; or
>> * in an annotation (RDFa-style) to a number in a web page; et cetera.
>
> plus registry metadata; that's not very different from the
> "structured comment" thing, but you suddenly have these things in an
> RDBMS with RegTAP. This, by the way, lets you assess where we're
> coming from in terms of units declared to the registry:
>
> http://dc.zah.uni-heidelberg.de/__system__/adql/query/form?__nevow_form__=genForm&query=select%20distinct%20unit%20from%20rr.table_column%20where%20unit%20like%20%27%25.%25%27&_TIMEOUT=5&_FORMAT=HTML&submit=Go
Urghh. I presume that list has been case-folded in some way, since I see no uppercase characters at all. That aside, I see a good fraction of those units (about 15% of them) are invalid according to the CDS spec.
>
>> These seem to leave us with two alternatives for VOUnits:
>>
>> 1. permit numerical scale-factors, and thus units of "1.9e27kg" (or whatever
>> f.p. syntax we choose); or
>>
>> 2. forbid numerical scale-factors, but permit 'unrecognised units', such as
>> 'jupMass'.
>>
>>
>> Option (1) means that we effectively smuggle a TSCALn behaviour into the unit
>> string.
>>
>> Option (1) also breaks consistency with FITS unit strings.
>
> ...but maintains VOTable's capability to represent everything that
> FITS binary tables can, which would otherwise get lost. I'm pretty
> sure I know what I prefer...
That's true, provided that conversion from a FITS file to a VOTable (as opposed to from a DB) is a significant use-case. I wouldn't disagree, but is that the case?
>
>> The problem with (1) is that this loses the information that this is a
>> 'jupiter mass', and leaves it as being some apparently random scaling factor.
>> That's not a problem if the data is going into a pipeline and nowhere else,
>> but it could be a problem in some of the other cases. If I found this
>> 1.9e27kg as a unit column in a structured comment, I'd probably want to
>> strangle someone. If I want my results in units of jupiter masses, and so
>
> Well, to figure out how to convert the value to kg, it's perfect, so
> you'd have little reason to strangle someone. I think what bugs you
> is that *provenance* is lost. That is regrettable, true, but I'm
> pretty sure overloading units with a part of provenance is making it
> unsuitable for both.
Well I would want to strangle someone, if what I wanted was to figure out how to convert the figures to jupMass, because that's what they started out as, in the data provider's database, and why I asked for the table, and because they've been mangled to units of 1.9e27kg only because the know-it-alls writing the VOUnits specification decided that they didn't like my favourite unit.
And...
> This is also what I'd say to Rob's statement from
> <64294334-B5A6-495D-9459-698436CBBCEA at noao.edu> (where I'd like to
> stress that I agree there's a problem worth solving, it's just that
> VOUnits is the wrong place):
>
>> A more fundamental issue is that often measurements are calibrated in
>> terms of other measurements. Quoting something as 1.5 jupMass might
>> not just be a handy way to provide a sense of scale, but it might be
>> that as measurements are refined of what the mass of Jupiter actually
>> is, that the number quoted (in the table or what have you) ought be
>> adjusted to suit. Examples abound such as the Hubble constant, etc.
Well, I very much want to stay away from Provenance in this discussion (I've talked a bit about provenance with Marco and Markus in the CoSADIE context, and I think there's a problem that we don't have to re-solve there). I'll stress again that I also want to stay away from any discussion about Quantities, and this is part of my (and Rob's?) nervousness about the numerical factor in the unit string.
Permitting 'unknown unit' strings is a sort of loose provenance, yes, but that's not the motivation.
The idea here is to specify the simplest unit string that can do useful work; precluding 'unknown units' makes this too simple to be useful.
> On Mon, Jul 29, 2013 at 08:30:31AM -0700, Rob Seaman wrote:
>> On Jul 29, 2013, at 7:52 AM, Tom McGlynn <Thomas.A.McGlynn at nasa.gov> wrote:
>>> In practice when reading these our software will read
>>> 1.2
>>> and
>>> 1.2345678901234567891234567801233e33
>>> with equal facility
>>
>> So either an arbitrary precision library must be used or the
>> handling of units must permit scale factors only as opaque
>> literals?
>>
>>> and whether the second really has vastly more precision than the
>>> first is unknowable and unaddressed by this standard.
>
> The question of precision is, I would argue, beside the scope of
> units -- I give you we should have had a quantity data model ages
> ago, but alas we don't have, and trying to shoehorn this into units
> will break units while not actually answering natural questions like
> "what's the error on this, and what kind of error is this".
Yes: this is the can of worms that would be opened if a numerical factor were to appear.
> The question of determining equality is an interesting one, though.
> If the floating point prefixes were the only thing holding this up,
> I'd say that's a heavy blow. However, we don't actually say how to
> compare units in the current draft, and so I'd claim making that
> comparison "harder" is a weak argument.
We don't, indeed! We should, and I think it's pretty trivial to define equality. I'll add language to that effect to the document now. I suspect the thing would be to avoid complication and say that for equality, any scale factors must be equal as floats.
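As a rough sketch of what "equal as floats" could mean in practice (this is not draft text): split off any leading scale factor, compare the factors as floats, and compare the rest of the string textually. The prefix pattern and the names split_scale and units_equal are placeholders of mine, not anything in the document.

```python
import re

# Placeholder prefix syntax (an assumption, not part of any draft):
# one digit, a point, digits, 'e', a mandatory sign, digits.
_PREFIX = re.compile(r"([0-9]\.[0-9]+e[-+][0-9]+)(.*)")

def split_scale(unit):
    """Split a unit string into (scale factor as float, remaining text)."""
    m = _PREFIX.fullmatch(unit)
    if m is None:
        return 1.0, unit          # no prefix: implicit factor of 1
    return float(m.group(1)), m.group(2)

def units_equal(a, b):
    """Equal if the scale factors are equal as floats and the rest matches textually."""
    scale_a, rest_a = split_scale(a)
    scale_b, rest_b = split_scale(b)
    return scale_a == scale_b and rest_a == rest_b
```

Under this rule "1.9e+27kg" and "1.90e+27kg" compare equal, because the two literals denote the same float. The textual comparison of the remainder is the crude part: it would treat "kg.m" and "m.kg" as different, which a real definition of equality would have to address.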
>> I don't disagree with the notion of borrowing from earlier
>> standards, but there are implications. Still haven't heard
>> comments on embedding the scale factors other than as prefixes, as
>> denominators (not an unknown usage), etc.
>
> Doesn't help expressivity, complicates standard: I'd say let's not.
I see no rationale for permitting scale factors elsewhere than at the beginning.
> Ah, come on. Basically all formal languages developed in the last 30
> years and in measurable usage agree on what floating point literals
> look like. Let's just follow them.
Except they don't. Are digits required both before and after the decimal point? Is the exponent marker 'e', 'E', 'd' or 'D'? And so on, in a very fiddly way. It may be fundamentally simple, but it's neither trivial nor particularly standard, since each language defines them in slightly different ways (I haven't done a census...).
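One data point, and certainly not a census: the snippet below asks Python's built-in float() which of a handful of spellings it accepts. The candidate list is my own choice for illustration.

```python
def python_accepts(literal):
    """Return True if Python's built-in float() parses the literal."""
    try:
        float(literal)
    except ValueError:
        return False
    return True

# Spellings of roughly the same number in styles found in various languages.
candidates = ["1.9e27", "1.9E+27", "1.e27", ".19e28", "1.9d27", "1.9D27"]
results = {text: python_accepts(text) for text in candidates}

for text, ok in results.items():
    print(f"{text:10s} {'accepted' if ok else 'rejected'}")
```

Python takes the first four, including the forms with no digits on one side of the point, but rejects the Fortran-style 'd' and 'D' exponent markers. Another language's parser would draw the line elsewhere, which is the point.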
> So... May I try to suggest a compromise?
>
> How about if we say: VOUnit allows a single floating point scaling
> factor at the start of the unit string. For serializing into FITS,
> the scaling factor must be split off into a TSCALn card (or absorbed
> into the value in the unfortunate event the value is a card value).
> For serializing from FITS binary tables, TSCALn cards SHOULD be
> preserved into the unit strings rather than being baked into the
> values.
If it is the case (as you argue, Markus, above) that it is an important use-case to be able to convert a FITS file to a VOTable (that is, moving the FITS file's TSCALn to a numerical prefix), then I'm rather persuaded that it's necessary to include a numerical prefix. We could ensure interoperability by demanding a very simple form for the prefix, such as /[0-9]\.[0-9]+e[-+][0-9]+/
However:
* I think it would be good to include language in the spec that deprecates this in most cases, as OGIP does, for example; and
* I think it's still necessary to permit 'unknown units' to deal with the 'jupMass' case.
Would people here agree about the importance of this use-case, and this as the resolution?
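To make the proposal concrete, here is a minimal sketch of the FITS direction of Markus's compromise, using the very simple prefix form suggested above. The function name split_for_fits and the returned dictionary layout are hypothetical; TSCALn and TUNITn are the usual FITS binary-table keywords for a column's scale and unit.

```python
import re

# The deliberately restrictive prefix form suggested above:
# exactly one digit, a point, digits, lowercase 'e', a mandatory sign, digits.
STRICT_PREFIX = re.compile(r"[0-9]\.[0-9]+e[-+][0-9]+")

def split_for_fits(unit, column):
    """Split a leading scale factor off a unit string for column number `column`.

    Returns a dict of FITS-style keywords. TSCALn is present only when
    the unit string starts with a prefix in the strict form.
    """
    m = STRICT_PREFIX.match(unit)
    if m is None:
        return {f"TUNIT{column}": unit}
    return {
        f"TSCAL{column}": float(m.group(0)),
        f"TUNIT{column}": unit[m.end():],
    }

print(split_for_fits("1.9e+27kg", 3))
print(split_for_fits("jupMass", 1))
```

A string such as "1.9e27kg" has no sign in the exponent, so it is not recognised as a prefix under this pattern and is passed through unchanged; a real implementation would presumably want to reject it instead. The sketch also ignores the case where the column already carries a TSCALn, when the two factors would have to be multiplied.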
> And then on to Quantity and Provenance data models. The phenomena
> they describe deserve to be done right, not implicitly in unit
> strings.
Quantity and Provenance come later! (and I reserve the right to sit at the side and throw peanuts).
All the best,
Norman
--
Norman Gray : http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK