handling metadata with multiple values
Ray Plante
rplante at poplar.ncsa.uiuc.edu
Wed Aug 6 01:45:06 PDT 2003
Hey Marco,
On Wed, 6 Aug 2003, Marco C. Leoni wrote:
> quick question: what was wrong with the first choice (<SUBJECT>
> <item>...</item> <item>...</item> </SUBJECT>) ?
Two main reasons: first, you don't need these <item> tags in practice
(rather, they get in the way); second, they make SUBJECT conceptually
more opaque as a piece of metadata.
The motivation for the above pattern is to have a node that contains all
the subjects together; however in practice, we found that with typical
techniques for extracting metadata from XML, multiple nodes of the same
type are usually packaged up together anyway. For example, when using DOM
on:
<SUBJECT>...</SUBJECT>
<SUBJECT>...</SUBJECT>
<SUBJECT>...</SUBJECT>
one would use getElementsByTagName('SUBJECT') on the parent node to get
all the subjects, returning them in a NodeList object. You can use this
technique for all elements, whether you expect multiple values or not. On
the other hand, with the <item> pattern, you have to treat the multiple
value case as a special case, first getting the <SUBJECT> node, and then
returning the <item> nodes as a list. Thus, these <item>s really just get
in the way.
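
As a rough illustration of the DOM point above, here is a small Python
sketch using the standard library's xml.dom.minidom. The CONTENT and TITLE
elements, the subject values, and the values() helper are made up for the
example; they are not taken from any actual registry schema.

```python
from xml.dom.minidom import parseString

# Preferred pattern: repeated SUBJECT elements (sample values are invented).
FLAT = """<CONTENT>
  <SUBJECT>galaxies</SUBJECT>
  <SUBJECT>quasars</SUBJECT>
  <TITLE>Example survey</TITLE>
</CONTENT>"""

# <item> pattern: one SUBJECT node wrapping the individual values.
WRAPPED = """<CONTENT>
  <SUBJECT><item>galaxies</item><item>quasars</item></SUBJECT>
  <TITLE>Example survey</TITLE>
</CONTENT>"""


def values(parent, tag):
    """Hypothetical helper: text of every <tag> element under parent."""
    return [node.firstChild.data for node in parent.getElementsByTagName(tag)]


# Flat pattern: the same call serves single- and multi-valued elements.
flat_root = parseString(FLAT).documentElement
flat_subjects = values(flat_root, "SUBJECT")
flat_titles = values(flat_root, "TITLE")

# <item> pattern: multi-valued elements need an extra, special-case step.
wrapped_root = parseString(WRAPPED).documentElement
subject_node = wrapped_root.getElementsByTagName("SUBJECT")[0]
wrapped_subjects = values(subject_node, "item")
```

Both routes end up with the same list of subjects; the difference is that
the flat pattern needs no knowledge of which elements are multi-valued.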
The same is true with other techniques. With data binding tools--such as
JAXB, Castor, and MS XSD--multiple, sequential occurrences of SUBJECT will
automatically be parsed into a list container (e.g. ArrayList). When
using XPath, "SUBJECT" will return all of the subjects.
(Perhaps Wil can give the concrete example that he encountered when we
were working on our prototype registry.)
My second reason is that <item>s clutter the meaning behind the metadata
model. I would like to see schemas in which all our elements carry
meaning that can be pieced together to create more complex meaning.
XPaths, as pointers into the data model, can be a very effective way of
carrying that meaning. A good example would be
"RESOURCE/CONTENT/SUBJECT", which points to a subject of the resource's
content. If you use my preferred pattern for listing these (with no
<item>s), then this path points to the actual subject values; however, in
the <item> pattern, you have to use "RESOURCE/CONTENT/SUBJECT/item" to get
the values. I don't like this because "item" adds no additional meaning
to the path--it just clutters it. (Really, this second reason is just
an abstract form of the first reason.)
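
To make the path comparison concrete, here is a minimal Python sketch using
the limited XPath support in xml.etree.ElementTree. The outer REGISTRY
element, the subject values, and the select() helper are assumptions added
only so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Preferred pattern (REGISTRY wrapper and values are invented for the demo).
FLAT = """<REGISTRY>
  <RESOURCE>
    <CONTENT>
      <SUBJECT>galaxies</SUBJECT>
      <SUBJECT>quasars</SUBJECT>
    </CONTENT>
  </RESOURCE>
</REGISTRY>"""

# <item> pattern.
WRAPPED = """<REGISTRY>
  <RESOURCE>
    <CONTENT>
      <SUBJECT><item>galaxies</item><item>quasars</item></SUBJECT>
    </CONTENT>
  </RESOURCE>
</REGISTRY>"""


def select(xml_text, path):
    """Hypothetical helper: elements matched by path, relative to the root."""
    return ET.fromstring(xml_text).findall(path)


# The meaningful path leads straight to the values.
flat_values = [el.text for el in select(FLAT, "RESOURCE/CONTENT/SUBJECT")]

# With <item>s, the same path only reaches the container element...
wrapped_containers = select(WRAPPED, "RESOURCE/CONTENT/SUBJECT")

# ...and the path must be extended with a step that carries no meaning.
wrapped_values = [
    el.text for el in select(WRAPPED, "RESOURCE/CONTENT/SUBJECT/item")
]
```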
Note that all of the above applies to this non-preferred pattern as well:
<SUBJECTS>
<SUBJECT>...</SUBJECT>
<SUBJECT>...</SUBJECT>
<SUBJECT>...</SUBJECT>
</SUBJECTS>
The extra layer is not needed.
(Not quite a quick answer for a quick question ;-)
cheers,
Ray