handling metadata with multiple values

Wed Aug 6 01:45:06 PDT 2003

Hey Marco,

On Wed, 6 Aug 2003, Marco C. Leoni wrote:
>     quick question: what was wrong with the first choice (<SUBJECT> 
> <item>...</item> <item>...</item> </SUBJECT>) ?

Two main reasons: first, you don't need these <item> tags in practice 
(if anything, they get in the way); second, they make SUBJECT 
conceptually more opaque as a piece of metadata. 

The motivation for the above pattern is to have a node that contains all 
the subjects together; however in practice, we found that with typical 
techniques for extracting metadata from XML, multiple nodes of the same 
type are usually packaged up together anyway.  For example, when using DOM 
on: 

   <SUBJECT>...</SUBJECT>
   <SUBJECT>...</SUBJECT>
   <SUBJECT>...</SUBJECT>

one would use getElementsByTagName('SUBJECT') on the parent node to get 
all the subjects, which are returned in a NodeList object.  You can use 
this technique for all elements, whether you expect multiple values or 
not.  On the other hand, with the <item> pattern, you have to treat the 
multiple-value case as a special case, first getting the <SUBJECT> node, 
and then returning the <item> nodes as a list.  Thus, these <item>s 
really just get in the way.
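
Here is a rough sketch of that difference using Python's xml.dom.minidom, 
which exposes the same getElementsByTagName call.  The subject values, 
the TITLE element, and the values() helper are made up for illustration; 
they are not from our schema or prototype.

```python
from xml.dom.minidom import parseString

# Sample documents; the subject values and TITLE element are invented.
FLAT = """<CONTENT>
  <SUBJECT>galaxies</SUBJECT>
  <SUBJECT>quasars</SUBJECT>
  <TITLE>A survey</TITLE>
</CONTENT>"""

ITEMS = """<CONTENT>
  <SUBJECT>
    <item>galaxies</item>
    <item>quasars</item>
  </SUBJECT>
  <TITLE>A survey</TITLE>
</CONTENT>"""


def values(parent, tag):
    """Hypothetical helper: text of every element named `tag` under `parent`."""
    return [node.firstChild.data for node in parent.getElementsByTagName(tag)]


# Flat pattern: one call works for repeated and single-valued elements alike.
flat_root = parseString(FLAT).documentElement
flat_subjects = values(flat_root, "SUBJECT")
flat_titles = values(flat_root, "TITLE")

# <item> pattern: the multi-valued element needs a special two-step lookup,
# first the <SUBJECT> container, then the <item> nodes inside it.
items_root = parseString(ITEMS).documentElement
subject_node = items_root.getElementsByTagName("SUBJECT")[0]
item_subjects = values(subject_node, "item")
```

Both routes end up with the same list of subjects; the <item> version 
just needs the extra step, and only for the multi-valued elements.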

The same is true with other techniques.  With data binding tools--such as 
JAXB, Castor, and MS XSD--multiple, sequential occurrences of SUBJECT will 
automatically be parsed into a list container (e.g. ArrayList).  When 
using XPath, "SUBJECT" will return all of the subjects.  

(Perhaps Wil can give the concrete example that he encountered when we 
were working on our prototype registry.)

My second reason is that <item>s clutter the meaning behind the metadata
model.  I would like to see schemas in which all our elements carry
meaning that can be pieced together to create more complex meaning.  
XPaths, as pointers into the data model, can be a very effective way of
carrying that meaning.  A good example would be
"RESOURCE/CONTENT/SUBJECT", which points to a subject of the resource's
content.  If you use my preferred pattern for listing these (with no
<item>s), then this path points to the actual subject values; however, in
the <item> pattern, you have to use "RESOURCE/CONTENT/SUBJECT/item" to get
the values.  I don't like this because "item" adds no additional meaning 
to the path--it just clutters it.  (Really, this second reason is just 
an abstract form of the first reason.)
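
To make the path comparison concrete, here is a small sketch using the 
limited XPath support in Python's xml.etree.ElementTree.  The REGISTRY 
wrapper element, the subject values, and the wrap() and texts() helpers 
are assumptions added only so the example runs; ElementTree evaluates 
paths relative to the root element, so the wrapper lets the path start 
at RESOURCE.

```python
import xml.etree.ElementTree as ET


def wrap(content_body):
    """Hypothetical helper: put sample content inside RESOURCE/CONTENT.

    REGISTRY is an invented outer element, present only so that the
    path below can begin with RESOURCE.
    """
    return ET.fromstring(
        "<REGISTRY><RESOURCE><CONTENT>"
        + content_body
        + "</CONTENT></RESOURCE></REGISTRY>"
    )


def texts(root, path):
    """Text content of every element matched by `path` under `root`."""
    return [element.text for element in root.findall(path)]


# Sample subject values are invented for illustration.
flat = wrap("<SUBJECT>galaxies</SUBJECT><SUBJECT>quasars</SUBJECT>")
items = wrap("<SUBJECT><item>galaxies</item><item>quasars</item></SUBJECT>")

# Flat pattern: the meaningful path lands directly on the values.
flat_values = texts(flat, "RESOURCE/CONTENT/SUBJECT")

# <item> pattern: the same path reaches only a container with no text,
# and the values need the extra, meaning-free "/item" step.
container_texts = texts(items, "RESOURCE/CONTENT/SUBJECT")
item_values = texts(items, "RESOURCE/CONTENT/SUBJECT/item")
```

In the flat case the path that carries the meaning is also the path that 
returns the data; in the <item> case those two things come apart.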

Note that all of the above applies to this non-preferred pattern as well:

  <SUBJECTS>
    <SUBJECT>...</SUBJECT>
    <SUBJECT>...</SUBJECT>
    <SUBJECT>...</SUBJECT>
  </SUBJECTS>

The extra layer is not needed.

(Not quite a quick answer for a quick question ;-)

cheers,
Ray