Taxonomy issues

Fri Sep 27 11:36:31 PDT 2002

On Fri, 27 Sep 2002 14:44:49 +0100
"Tony Linde" <tol at star.le.ac.uk> wrote:

> Is it possible using DAML+OIL or equivalents to classify the same object
> as two apparently contradictory things? I assume not. But both authors
> must be allowed to classify an object as different things. So how is
> this issue resolved?

Taxonomy hierarchies are trees -- a restricted kind of directed,
acyclic graph -- but in general, classification schemes need not be.
I believe it is possible to construct ontologies that are not trees,
in order to handle the classification conundrum that you mention.

I have been playing with a classification scheme in which one simply
assigns classification tags (call them "classes") to
entities. Initially, there is no class hierarchy; you must explicitly
call out all the classes to which an entity belongs, rather than
assume that attribution of a particular class implies attribution of
other classes (as would be the case in some implicit class hierarchy).

A collection of these explicitly-classified entities -- call it a
training set -- can be used to construct a graph that represents the
correlations between classes.
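
As a rough illustration only, here is a minimal Python sketch of
building such a graph. It assumes that "correlation" means how often
two classes are attached to the same entity, scored with the Jaccard
index; the actual measure is not stated above. The function name
build_correlation_graph and the sample entities and tags are
hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_correlation_graph(training_set):
    """Return {(class_a, class_b): strength} for classes that co-occur.

    Assumption: strength is the Jaccard index, i.e. the number of
    entities carrying both classes divided by the number carrying
    either one. Pair keys are sorted alphabetically.
    """
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in training_set.values():
        unique = sorted(set(tags))
        tag_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    graph = {}
    for (a, b), both in pair_counts.items():
        either = tag_counts[a] + tag_counts[b] - both
        graph[(a, b)] = both / either
    return graph


# Hypothetical training set: every class is called out explicitly.
training_set = {
    "M31": {"galaxy", "spiral", "radio source"},
    "M87": {"galaxy", "elliptical", "radio source"},
    "Crab Nebula": {"supernova remnant", "radio source"},
}
graph = build_correlation_graph(training_set)
```

Nothing in this structure forces a tree: any class can be linked to
any other, so one entity can carry classes from several branches of
a conventional taxonomy.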

This correlation graph can be fed back into the entity set to "fill
in" implicit classifications for those entities for which such class
tags are missing; the system suggests classes for these entities. Such
classifications may be formally accepted or denied, which provides more
information to the graph constructor.
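
The "fill in" step might look something like the sketch below. It is
an assumption-laden illustration, not the system described here: the
graph is a plain dictionary of class pairs to strengths, the function
name suggest_classes and the 0.5 threshold are hypothetical, and a
candidate class is scored by its strongest link to any class the
entity already has.

```python
def suggest_classes(tags, graph, threshold=0.5):
    """Suggest classes missing from `tags`, strongest first.

    Assumption: a candidate's score is the strongest edge linking it
    to any class the entity already carries; candidates scoring below
    `threshold` are dropped.
    """
    scores = {}
    for (a, b), strength in graph.items():
        # Edges are undirected, so check both orientations.
        for have, candidate in ((a, b), (b, a)):
            if have in tags and candidate not in tags:
                scores[candidate] = max(scores.get(candidate, 0.0), strength)
    return sorted(
        ((c, s) for c, s in scores.items() if s >= threshold),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical correlation graph with made-up strengths.
graph = {
    ("galaxy", "radio source"): 0.67,
    ("galaxy", "spiral"): 0.5,
    ("radio source", "supernova remnant"): 0.33,
}
suggestions = suggest_classes({"spiral"}, graph)
```

An accepted suggestion could simply be added to the entity's tags
before the graph is rebuilt. How a denial should be fed back is left
open in this sketch.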

Pretty soon, you have an emergent classification scheme that is quite
useful: you reach a point where you don't have to call out all of the
explicit classifications.

Consider two search engines, Google and Yahoo. Yahoo started out as a
hierarchy of web pages: you traverse the tree to find web sites of
interest, and at each node are a number of web sites that share the
same classification. Google makes no attempt to classify pages, but
rather assumes that pages connected via a hyperlink are related
somehow; Google constructs a correlation graph for the web.

For a constrained knowledge domain -- e.g., "Astronomy" -- an a priori
ontology certainly exists, and needs to be used to verify the validity
of the correlation graph -- the emergent ontology. (When you type in a
search phrase for Google, you often get lots of unrelated items. It
is usually a mistake to assign any scientific significance to these
"outliers".)

Currently, my graph constructor simply solves for correlation. But it
should be possible to solve more-sophisticated constraints; I could
have a set of production rules that modify the correlation
strengths. One could provide an explicit (a priori) ontology to the
system by means of such rules. But I suspect that it would be better
to simply seed the system with a training set that expressed the a
priori ontology.
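
To make the production-rule idea concrete, here is one hypothetical
shape such rules could take. This is only a sketch: it assumes the
graph is a dictionary of alphabetically sorted class pairs to
strengths, and the names apply_rules, exclusive and boost are
invented for illustration. Each rule is a function that takes a pair
and its current strength and returns the adjusted strength.

```python
def apply_rules(graph, rules):
    """Return a copy of `graph` with every rule applied in order."""
    adjusted = dict(graph)
    for rule in rules:
        for pair, strength in list(adjusted.items()):
            adjusted[pair] = rule(pair, strength)
    return adjusted


def exclusive(a, b):
    """Hypothetical rule: classes a and b never apply to the same entity."""
    target = tuple(sorted((a, b)))
    return lambda pair, strength: 0.0 if pair == target else strength


def boost(a, b, factor):
    """Hypothetical rule: scale the a-b strength by `factor`, capped at 1.0."""
    target = tuple(sorted((a, b)))
    return lambda pair, strength: (
        min(1.0, strength * factor) if pair == target else strength
    )


# Made-up strengths, as might come out of a noisy training set.
graph = {
    ("elliptical", "spiral"): 0.2,
    ("galaxy", "spiral"): 0.5,
}
rules = [exclusive("spiral", "elliptical"), boost("galaxy", "spiral", 1.5)]
adjusted = apply_rules(graph, rules)
```

Seeding with a training set, by contrast, needs no separate rule
machinery: the a priori ontology is expressed as ordinary classified
entities and flows through the same graph constructor.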

When I attended the Semantic Web Workshop at Stanford last year, there
was a clear distinction between the knowledge-representation (KR), "a
priori ontology" people and the "emergent behavior" people. The KR
folks build systems that make sense, but sometimes cannot handle
flexible or fuzzy classification schemas. The emergent behavior folks
build systems that can perhaps handle the fuzziness of the real world,
but occasionally spout nonsense. Hmmm.

I suppose I'm an "emergent behavior" person. I have played with these
things for years; I once (briefly!) worked for a shiny new Internet
start-up company that tried to deliver a web search system very
similar to the classification scheme I describe here: Collaborative
Filtering. The system worked remarkably well. (And never made it to
product stage...)

-- boyd

Boyd Waters
National Radio Astronomy Observatory
PO BOX 0
Socorro NM 87801
http://www.nrao.edu/~bwaters