Definitions, costs and use-cases
Norman Gray
norman at astro.gla.ac.uk
Sat Sep 22 03:12:03 PDT 2007
Rob and all,
Rob:
> My previous question about use cases sank without a trace. Perhaps
> I can try again. Here you are making a "case for using this tool",
> not asserting a use case. Before learning to use a tool, one needs
> to know, as Edwin Starr says: What is it good for?
Let's knock this on the head. You want definitions, costs and use-
cases: we've got 'em.
I think various people _have_ discussed these before, possibly
implicitly, but certainly at length, so I expect that in places I am
going to repeat what others have said. I hope that having it in one
place will help tie this up, so we can move toward more specific
questions, which is where I think Andrea has been trying to herd us.
This has turned into rather a long message. The 'costs' section may
go into too much detail, but the definitions and use-case sections
should be reasonably compact.
In fact, this message has the feel of a first draft of a wiki page.
Should we be having this discussion in a different form?
Definitions
We won't get much from 200-year-old dictionary definitions, or
technical abstracts from completely different domains (philosophical
ethics, as I recall), so here's the standard definition from computer
science (all together, class!):
an ontology is a formal specification of a shared conceptualisation
That is:
conceptualisation = a set of things/concepts/types, as appropriate
shared = ...which at least one other person agrees with
specification = ... and which you've written down
formal = ...in a machine-readable way
The various terms folksonomy, vocabulary, thesaurus, taxonomy and
ontology all have slightly different definitions (they overlap in
practice), but all exist on a single spectrum, or ladder, from
informal and suggestive at one end, to formal and expressive at the
other. The 'ontology' range is further subdivided ('RDFS' is at one
end and 'OWL' at the other -- let's not worry just now).
'Semantics' is just the stuff you're doing after you've grokked
whatever syntax you're using, and 'semantic search' is 'trying to do
better than simple string matching'. Yes, Google does do rather well
with 'simple string matching', but that's because (a) they don't have
any choice, as there isn't a great deal of semantically rich material
on the wild wild web, outside of specialised domains such as ours;
and (b) they have money and kit to throw at the problem of guessing
meaning from string coincidences.
Costs
Processing costs: Processors become more (computationally) expensive
as you go from less to more formal. Handling a folksonomy requires
strcmp(3); handling an ontology requires one of several types of
reasoner[1].
However, processors become much more efficient as you go towards the
more formal end, since you have to work quite hard with strcmp,
tolower and friends, and be quite clever, to extract much meaning
like 'this resource is more specific than that resource'. That sort
of thing is much more immediate, further up the ladder.
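
To make that contrast concrete, here is a minimal Python sketch. The
terms and the BROADER table are invented for illustration and are not
drawn from any real vocabulary; same_tag and is_more_specific are
hypothetical helper names. The first function is all that string
comparison gives you; the second answers 'is this more specific than
that?' by following explicit 'broader' links.

```python
# Toy 'broader term' links; the terms are invented for illustration.
BROADER = {
    "black hole": "compact object",
    "neutron star": "compact object",
    "compact object": "astronomical object",
}

def same_tag(a, b):
    """Bottom of the ladder: case-insensitive string comparison."""
    return a.strip().lower() == b.strip().lower()

def is_more_specific(term, other):
    """Further up the ladder: follow 'broader' links from term to other."""
    seen = set()
    current = BROADER.get(term)
    while current is not None and current not in seen:
        if current == other:
            return True
        seen.add(current)  # guard against accidental cycles in the table
        current = BROADER.get(current)
    return False
```

With only same_tag, 'black hole' and 'compact object' are simply
different strings; with the links written down, the relationship is a
lookup rather than a guess.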
Acquisition costs: Folksonomies[2] are a big deal currently because
they offer a way of talking about the only vaguely semantic
information realistically available on the web. Adding richer
information is dramatically more expensive (issues of education,
hassle, payoff to the tagger), so might be worth it only for small,
high-value, data collections (such as the registry?), or collections
which already have most of the structure visible (example?).
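
For concreteness, a folksonomy in the sense used here (a free keyword
list with counts of occurrences) needs nothing more than a counter.
This is a minimal Python sketch; the function name folksonomy and the
normalisation it applies (trimming and lower-casing) are assumptions
made for the example.

```python
from collections import Counter

def folksonomy(tagged_items):
    """Count how often each free-text tag occurs across all items."""
    counts = Counter()
    for tags in tagged_items:
        # Normalise lightly so 'Galaxy' and 'galaxy ' count as one tag.
        counts.update(tag.strip().lower() for tag in tags)
    return counts
```

Everything beyond this (which tags mean the same thing, which are
broader or narrower) is exactly the richer information that costs more
to acquire.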
Note that not every application necessarily benefits from more
expressive structure. Myself, I think that SKOS (taxonomy/thesaurus)
provides most of what you really need, and can reasonably acquire, to
support searching. Ed is a more unequivocal enthusiast for OWL. The
CDS ontology can support automatic classification ('if this object
has these properties then it must be of this type'), but not every
reasoner can cope with it.
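
The shape of that 'if this object has these properties then it must be
of this type' inference can be sketched in a few lines of Python. This
is only an illustration of the idea: the rules and property names are
invented, and neither the CDS ontology nor an OWL reasoner works by
matching sets like this.

```python
# Hypothetical rules: a type name and the properties that imply it.
RULES = {
    "Pulsar": {"is_compact", "emits_periodic_pulses"},
    "XRayBinary": {"is_binary", "emits_xrays"},
}

def classify(properties):
    """Return, sorted, every type whose required properties are all present."""
    props = set(properties)
    return sorted(name for name, required in RULES.items() if required <= props)
```

A real reasoner derives such conclusions from the class definitions in
the ontology itself, which is where the extra computational cost comes
from.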
The upside, from the point of view of acquisition costs, is that most
of the sciences, with their journal keywords, and the systematising
mindset of their users, can probably get on to the second rung for
free. The much-lamented lack of interest in the IAU keyword list
suggests that getting on to the third rung might be a struggle. The
existence of the registry indicates that the people running archives
can be persuaded to supply reasonably extensive/expensive semantic
information; the prospect of this bringing users to them, and the
embarrassment of their logo not appearing where it ought, are what
will persuade them to do this _and_ get it right. The largish
number of errors in registry entries shows that the benefits -- custom
and visibility -- have not yet been perceived to match the costs.
Opportunity and development costs: Developing (which means agreeing
on) a new vocabulary or an ontology is very hard work, and very
expensive (as we all know...); it should therefore be avoided as much
as possible. Repurposing an existing vocabulary is much better: even
if it's not perfect, the benefits of it actually existing outweigh
the costs of the fit being a little loose.
Reuse is better than redevelopment for other software as well (news
just in: sin is bad!), but the costs are especially high for
vocabulary development, since it necessarily consumes the effort and
good temper of multiple people simultaneously, and it probably
involves the time of valuable domain experts (you can't just hire
someone).
Reusing an existing vocabulary should be cheap, and might consist of
nothing more than some Perl magic to put the right type of pointy
brackets round the items in your vocabulary list.
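
A minimal sketch of that kind of wrapping, in Python rather than Perl,
might look like the following. The RDF and SKOS namespace URIs are the
standard ones; the base URI http://example.org/vocab# and the rule for
turning a keyword into a URI are assumptions made for the example, as
is the function name keywords_to_skos.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape, quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

def keywords_to_skos(keywords, base="http://example.org/vocab#"):
    """Wrap each keyword in a skos:Concept element (RDF/XML)."""
    lines = ["<rdf:RDF xmlns:rdf=%s xmlns:skos=%s>"
             % (quoteattr(RDF_NS), quoteattr(SKOS_NS))]
    for kw in keywords:
        label = kw.strip()
        # Hypothetical URI scheme: base + keyword, spaces to underscores,
        # anything else awkward percent-encoded.
        uri = base + quote(label.replace(" ", "_"))
        lines.append("  <skos:Concept rdf:about=%s>" % quoteattr(uri))
        lines.append("    <skos:prefLabel>%s</skos:prefLabel>" % escape(label))
        lines.append("  </skos:Concept>")
    lines.append("</rdf:RDF>")
    return "\n".join(lines)
```

Calling keywords_to_skos(["black hole", "quasar"]) returns a small
RDF/XML document with one concept per keyword. It records no broader
or narrower links; those would still have to come from whatever
structure the original list has.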
The tools and APIs for supporting reasoning (ie, working at the top
end of the ladder) are rather hard to use, in my experience, for a
mixture of reasons: what they're doing can be rather confusing, and
they're still aimed at a fairly specialised developer community, so
there isn't the sort of tutorial and community support that would let
Joe Developer just pick up a tool and start creating. What that
means, I think, is that where those tools are useful, they should be
well hidden as services or as middleware, and the community should
have a fairly explicit plan about how it will maintain them in the
medium term.
At the bottom end of the ladder, there are much more approachable
tools for handling and storing RDF (though I haven't yet had to
actually API-call an RDF parser, and most of my work in this area has
been using XSLT).
That presumes you're using RDF. The benefits have been rehearsed
elsewhere this month, so I'll skip them here. The main cost of not
doing so is that you cut yourself off from the rest of the world.
Use-cases
Mathilda is reading a paper online. She types the (A&A) keywords for
the paper into VOExplorer and asks for 'more like this'. VOExplorer
calls out to a service which finds the AOIM and Simbad equivalences
of the A&A keywords, and uses the former to query a suitable service
to find some pretty pictures, and the latter to query Simbad,
presenting the two lists to Mathilda. There aren't many pretty
pictures, so Mathilda asks to expand the search, and VOExplorer asks
for pretty pictures corresponding to a more general term, found
either directly in the AOIM vocabulary, or finding a more general
Simbad term and finding the AOIM equivalent of that. The Simbad
query, on the other hand, has produced far too many hits, so
VOExplorer looks down the tree of Simbad terms which are 'narrower',
and asks 'you were looking for compact objects: do you mean black
holes, quasars, or...?' Once she has established a suitable keyword
or keywords, she can make queries using the equivalent terms in
whichever vocabularies the registry or VOEvent keywords are drawn
from. She finds some heterodyne observations (apologies if this is
astronomical nonsense, but...), but she's an X-ray person, so is a
bit vague, and curious, about just what that is -- but oooh, there's
a link to DBpedia/wikipedia, so she goes there on the off-chance the
article is decent. The mechanism that brought in the link to DBpedia
is the same one that is linking a growing collection of non-
specialist semantic resources (see the 'linking open data' project:
<http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData>).
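
The three operations that scenario relies on (translate a keyword into
another vocabulary, broaden a term, offer narrower terms) can be
sketched in Python as below. Every identifier here is invented: the
'aa:' and 'simbad:' prefixes, the term names and the two tables merely
stand in for the real vocabularies and the mappings between them.

```python
# Invented equivalences between an A&A-style keyword and a Simbad-style term.
MAPPING = {"aa:compact objects": "simbad:CompactObject"}

# Invented 'broader' links within the Simbad-style vocabulary.
BROADER = {
    "simbad:BlackHole": "simbad:CompactObject",
    "simbad:Quasar": "simbad:CompactObject",
    "simbad:CompactObject": "simbad:Object",
}

def translate(keyword):
    """Find the equivalent term in the other vocabulary, if one is recorded."""
    return MAPPING.get(keyword)

def broaden(term):
    """Too few hits: step up to the more general term."""
    return BROADER.get(term)

def narrow(term):
    """Too many hits: list the narrower terms to offer the user."""
    return sorted(child for child, parent in BROADER.items() if parent == term)
```

Here narrow("simbad:CompactObject") yields the list from which a 'do
you mean black holes, quasars, or...?' prompt could be built.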
Most of the components of that are in place already, in the sense
that the vocabularies exist and services can be queried using them.
VOExplorer already makes a callout to a skeleton service which
doesn't do anything useful yet, but will be expanded starting next
year (funding's just arrived). The CDS people (Alexandre in
particular) have already demoed an application using the Ontology of
Astro Object types which does something similar to the business of
broadening and narrowing the Simbad queries.
All the best,
Norman
[1] A 'reasoner' is something which can, for example, deduce that an
instance of a given subtype is also an instance of the type.
[2] A 'folksonomy' is a del.icio.us or Flickr-style cloud of
keywords, or the keywords on eBay or a conference abstract, where
people ask themselves `what keyword would other people use to search
for this?'. 'Folksonomy' is the same as 'free keyword list with
counts of occurrences', but is fewer characters to type.
--
------------------------------------------------------------
Norman Gray : http://nxg.me.uk
eurovotech.org : University of Leicester, UK