Definitions, costs and use-cases

Sat Sep 22 03:12:03 PDT 2007

Rob and all,

Rob:

> My previous question about use cases sank without a trace.  Perhaps  
> I can try again.  Here you are making a "case for using this tool",  
> not asserting a use case.  Before learning to use a tool, one needs  
> to know, as Edwin Starr says:  What is it good for?

Let's knock this on the head.  You want definitions, costs and use- 
cases: we've got 'em.

I think various people _have_ discussed these before, possibly  
implicitly, but certainly at length, so I expect that in places I am  
going to repeat what others have said.  I hope that having it in one  
place will help tie this up, so we can move toward more specific  
questions, which is where I think Andrea has been trying to herd us.

This has turned into rather a long message.  The 'costs' section may  
go into too much detail, but the definitions and use-case sections  
should be reasonably compact.

In fact, this message has the feel of a first draft of a wiki page.   
Should we be having this discussion in a different form?

Definitions

We won't get much from 200-year-old dictionary definitions, or  
technical abstracts from completely different domains (philosophical  
ethics, as I recall), so here's the standard definition from computer  
science (all together, class!):

     an ontology is a formal specification of a shared conceptualisation

That is:

     conceptualisation = a set of things/concepts/types, as appropriate
     shared = ...which at least one other person agrees with
     specification = ... and which you've written down
     formal = ...in a machine-readable way

The various terms folksonomy, vocabulary, thesaurus, taxonomy and  
ontology all have slightly different definitions (they overlap in  
practice), but all exist on a single spectrum, or ladder, from  
informal and suggestive at one end, to formal and expressive at the  
other.  The 'ontology' range is further subdivided ('RDFS' is at one  
end and 'OWL' at the other -- let's not worry just now).

'Semantics' is just the stuff you're doing after you've grokked  
whatever syntax you're using, and 'semantic search' is 'trying to do  
better than simple string matching'.  Yes, Google does do rather well  
with 'simple string matching', but that's because (a) they don't have  
any choice, as there isn't a great deal of semantically rich material  
on the wild wild web, outside of specialised domains such as ours;  
and (b) they have money and kit to throw at the problem of guessing  
meaning from string coincidences.

Costs

Processing costs: Processors become more (computationally) expensive  
as you go from less to more formal.  Handling a folksonomy requires  
strcmp(3); handling an ontology requires one of several types of  
reasoner[1].

However, processors become much more efficient as you go towards the  
more formal end, since you have to work quite hard with strcmp,  
tolower and friends, and be quite clever, to extract much meaning  
like 'this resource is more specific than that resource'.  That sort  
of thing is much more immediate further up the ladder.
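
To make that contrast concrete, here is a minimal Python sketch (mine, not taken from any existing tool). The term names and the BROADER table are invented for illustration; a real vocabulary would supply its own links. String comparison can only say whether two labels are the same, while one 'broader' relation answers 'is this more specific than that?' directly.

```python
# Hypothetical 'broader' links: each term points at its more general parent.
BROADER = {
    "black hole": "compact object",
    "neutron star": "compact object",
    "pulsar": "neutron star",
    "compact object": "astronomical object",
}


def same_label(a, b):
    """What plain string matching gives you: case-insensitive equality."""
    return a.lower() == b.lower()


def is_narrower(term, ancestor):
    """True if `term` is strictly more specific than `ancestor`."""
    seen = set()  # guards against accidental cycles in the table
    current = BROADER.get(term)
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)
        current = BROADER.get(current)
    return False
```

With this table, same_label("pulsar", "compact object") tells you nothing useful, whereas is_narrower("pulsar", "compact object") walks two links and succeeds.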

Acquisition costs: Folksonomies[2] are a big deal currently because  
they offer a way of talking about the only vaguely semantic  
information realistically available on the web.  Adding richer  
information is dramatically more expensive (issues of education,  
hassle, payoff to the tagger), so might be worth it only for small,  
high-value, data collections (such as the registry?), or collections  
which have most of the structure visible already (example?).

Note that not every application necessarily benefits from more  
expressive structure.  Myself, I think that SKOS (taxonomy/thesaurus)  
provides most of what you really need, and can reasonably acquire, to  
support searching.  Ed is a more unequivocal enthusiast for OWL.  The  
CDS ontology can support automatic classification ('if this object  
has these properties then it must be of this type'), but not every  
reasoner can cope with it.
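
The 'if this object has these properties then it must be of this type' idea can be sketched in a few lines of Python. This is only a toy under stated assumptions: the type names and property names in RULES are invented, and it is not how the CDS ontology or any real reasoner is implemented.

```python
# Hypothetical rules: a type is inferred when an object has ALL the listed
# properties.  Names are made up for illustration.
RULES = {
    "Pulsar": {"compact", "rotating", "pulsed_emission"},
    "CompactObject": {"compact"},
}


def classify(properties):
    """Return, sorted, every type whose required properties are all present."""
    props = set(properties)
    return sorted(t for t, required in RULES.items() if required <= props)
```

A real reasoner does considerably more than this subset test, which is part of why not every one can cope with a rich ontology.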

The upside, from the point of view of acquisition costs, is that most  
of the sciences, with their journal keywords, and the systematising  
mindset of their users, can probably get on to the second rung for  
free.  The much-lamented lack of interest in the IAU keyword list  
suggests that getting on to the third rung might be a struggle.  The  
existence of the registry indicates that the people running archives  
can be persuaded to supply reasonably extensive/expensive semantic  
information; the prospect of this bringing users to them, and the  
embarrassment of their logo not appearing where it ought, are what  
will persuade them to do this _and_ get it right.  The largeish  
number of errors in registry entries shows that the benefits -- custom  
and visibility -- have not yet been perceived to match the costs.

Opportunity and development costs: Developing (which means agreeing  
on) a new vocabulary or an ontology is very hard work, and very  
expensive (as we all know...); it should therefore be avoided as much  
as possible.  Repurposing an existing vocabulary is much better: even  
if it's not perfect, the benefits of it actually existing outweigh  
the costs of the fit being a little loose.

Reuse is better than redevelopment for other software as well (news  
just in: sin is bad!), but the costs are especially high for  
vocabulary development, since it necessarily consumes the effort and  
good temper of multiple people simultaneously, and it probably  
involves the time of valuable domain experts (you can't just hire  
someone).

Reusing an existing vocabulary should be cheap, and might consist of  
nothing more than some Perl magic to put the right type of pointy  
brackets round the items in your vocabulary list.
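
As a rough idea of how little that might involve, here is a sketch in Python rather than Perl. It assumes the target is SKOS in RDF/XML, with one skos:Concept per keyword; the base URI and the rule for minting concept URIs (spaces to underscores) are my assumptions, not an agreed convention.

```python
from xml.sax.saxutils import escape, quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def keywords_to_skos(keywords, base_uri):
    """Wrap each keyword in a skos:Concept carrying a skos:prefLabel."""
    lines = [
        '<rdf:RDF xmlns:rdf="%s"' % RDF_NS,
        '         xmlns:skos="%s">' % SKOS_NS,
    ]
    for kw in keywords:
        label = kw.strip()
        # Assumption: mint a URI by replacing spaces with underscores.
        uri = base_uri + label.replace(" ", "_")
        lines.append("  <skos:Concept rdf:about=%s>" % quoteattr(uri))
        lines.append(
            '    <skos:prefLabel xml:lang="en">%s</skos:prefLabel>' % escape(label)
        )
        lines.append("  </skos:Concept>")
    lines.append("</rdf:RDF>")
    return "\n".join(lines)
```

Calling keywords_to_skos(["black holes"], "http://example.org/vocab#") gives a flat list of concepts with no broader/narrower links; adding those is where the real effort would go.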

The tools and APIs for supporting reasoning (ie, working at the top  
end of the ladder) are rather hard to use, in my experience, for a  
mixture of reasons: what they're doing can be rather confusing, and  
they're still aimed at a fairly specialised developer community, so  
there isn't the sort of tutorial and community support that would let  
Joe Developer just pick up a tool and start creating.  What that  
means, I think, is that where those tools are useful, they should be  
well hidden as services or as middleware, and the community should  
have a fairly explicit plan about how it will maintain them in the  
medium term.

At the bottom end of the ladder, there are much more approachable  
tools for handling and storing RDF (though I haven't yet had to  
actually API-call an RDF parser, and most of my work in this area has  
been using XSLT).

That presumes you're using RDF.  The benefits have been rehearsed  
elsewhere this month, so I'll skip them here.  The main cost of not  
doing so is that you cut yourself off from the rest of the world.

Use-cases

Mathilda is reading a paper online.  She types the (A&A) keywords for  
the paper into VOExplorer and asks for 'more like this'.  VOExplorer  
calls out to a service which finds the AOIM and Simbad equivalences  
of the A&A keywords, and uses the former to query a suitable service  
to find some pretty pictures, and the latter to query Simbad,  
presenting the two lists to Mathilda.  There aren't many pretty  
pictures, so Mathilda asks to expand the search, and VOExplorer asks  
for pretty pictures corresponding to a more general term, found  
either directly in the AOIM vocabulary, or finding a more general  
Simbad term and finding the AOIM equivalent of that.  The Simbad  
query, on the other hand, has produced far too many hits, so  
VOExplorer looks down the tree of Simbad terms which are 'narrower',  
and asks 'you were looking for compact objects: do you mean black  
holes, quasars, or...?'  Once she has established a suitable keyword  
or keywords, she can make queries using the equivalent terms in  
whichever vocabularies the registry or VOEvent keywords are drawn  
from.  She finds some heterodyne observations (apologies if this is  
astronomical nonsense, but...), but she's an X-ray person, so is a  
bit vague, and curious, about just what that is -- but oooh, there's  
a link to DBpedia/wikipedia, so she goes there on the off-chance the  
article is decent.  The mechanism that brought in the link to DBpedia  
is the same one that is linking a growing collection of non- 
specialist semantic resources (see the 'linking open data' project:  
<http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData>).
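
The mechanics of that scenario (translate, broaden, narrow) might look something like the Python sketch below. Everything in the tables is invented: the prefixed term names, the equivalences and the hierarchy links are placeholders, and a real service would read them from the published vocabularies rather than from dictionaries.

```python
# Hypothetical cross-vocabulary equivalences (A&A keyword -> target term).
AA_TO_AOIM = {"stars: neutron": "aoim:NeutronStar"}
AA_TO_SIMBAD = {"stars: neutron": "simbad:CompactObject"}

# Hypothetical hierarchy links inside each target vocabulary.
AOIM_BROADER = {"aoim:NeutronStar": "aoim:Star"}
SIMBAD_NARROWER = {
    "simbad:CompactObject": ["simbad:BlackHole", "simbad:NeutronStar"],
}


def translate(keywords, mapping):
    """Map keywords into another vocabulary, dropping those with no equivalent."""
    return [mapping[k] for k in keywords if k in mapping]


def broaden(term, broader):
    """Too few hits: step up to the more general term, or None at the top."""
    return broader.get(term)


def narrow(term, narrower):
    """Too many hits: offer the more specific terms as choices."""
    return list(narrower.get(term, []))
```

The point of the sketch is that each step is a table lookup once the vocabularies and their mappings exist; the expensive part is agreeing on the tables.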

Most of the components of that are in place already, in the sense  
that the vocabularies exist and services can be queried using them.   
VOExplorer already makes a callout to a skeleton service which  
doesn't do anything useful yet, but will be expanded starting next  
year (funding's just arrived).  The CDS people (Alexandre in  
particular) have already demoed an application using the Ontology of  
Astro Object types which does something similar to the business of  
broadening and narrowing the Simbad queries.

All the best,

Norman

[1] A 'reasoner' is something which can, for example, deduce that an  
instance of a given subtype is also an instance of the type.

[2] A 'folksonomy' is a del.icio.us or Flickr-style cloud of  
keywords, or the keywords on eBay or a conference abstract, where  
people ask themselves 'what keyword would other people use to search  
for this?'.  'Folksonomy' is the same as 'free keyword list with  
counts of occurrences', but is fewer characters to type.
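
Taking that last definition literally, a folksonomy is about this much Python. The normalisation (trim and lower-case) is my assumption; real tagging sites each make their own choices.

```python
from collections import Counter


def folksonomy(tagged_items):
    """Count how often each free-text keyword occurs across all items."""
    counts = Counter()
    for tags in tagged_items:
        # Assumption: normalise only by trimming whitespace and lower-casing.
        counts.update(tag.strip().lower() for tag in tags)
    return counts
```

The result's most_common() method gives the tag cloud ordering directly.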

-- 
------------------------------------------------------------
Norman Gray  :  http://nxg.me.uk
eurovotech.org  :  University of Leicester, UK