BogieLand - Presentation: Metadata and XML



Metadata and XML: Improving the Findability of Information


Presented at the European Information Development Conference 2004
(November 10, 2004 - Wiesbaden, Germany)

Now that the Extensible Markup Language (XML), a World Wide Web Consortium (W3C) specification, is familiar within technical communication, its application is starting to solve important problems in information spaces. One of the major problems is the findability of information and documents, and a powerful way to address it is the use of metadata. This talk outlines a few of the ways in which metadata and XML work together to increase findability.


Metadata

For many, metadata is simply data about data, but it is better understood as any statement about an information resource. Metadata describes an asset and provides a meaningful set of attributes that can be used to further classify or consume content. It is the foundation of all information retrieval. The best-known metadata vocabulary is the Dublin Core, maintained by the Dublin Core Metadata Initiative (DCMI). According to the DCMI, the most useful metadata about a document are its keywords, since they are the only element that explicitly describes what the document is about. Other metadata are useful for managing documents and for helping users decide which of their search hits to look at more closely.
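As a rough illustration of keywords and other descriptive metadata expressed in XML, the following Python sketch builds a small Dublin Core record. The Dublin Core namespace URI is the published one; the wrapper element name `metadata`, the function name, and the chosen field values are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields):
    """Build a wrapper element holding one dc:* child per (name, value) pair."""
    record = ET.Element("metadata")  # wrapper name is an assumption, not part of Dublin Core
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record

record = dublin_core_record([
    ("title", "Metadata and XML"),
    ("subject", "metadata"),      # keywords go in the repeatable dc:subject element
    ("subject", "findability"),
    ("date", "2004-11-10"),
    ("language", "en"),
])

# Pull the keywords back out, as a search index might do.
keywords = [e.text for e in record.findall(f"{{{DC_NS}}}subject")]
xml_text = ET.tostring(record, encoding="unicode")
```

Here `keywords` holds the terms that say what the document is about, while the remaining elements serve the managing and hit-selection purposes mentioned above.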

Controlled Vocabulary

A controlled vocabulary (CV) is the simplest of all metadata applications. A CV is an organized list of words and phrases, or a notation system, that is used first to tag content and then to find it through navigation or search. The most basic, and often overlooked, form of controlled vocabulary is a consistent labeling system.
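A consistent labeling system can be sketched as a lookup from the variants authors actually type to a single preferred label. The vocabulary entries and function names below are hypothetical and only illustrate the idea.

```python
# Hypothetical vocabulary: preferred label -> accepted variants (lower case).
VOCABULARY = {
    "XML": {"xml", "extensible markup language"},
    "metadata": {"metadata", "meta data", "meta-data"},
    "taxonomy": {"taxonomy", "taxonomies"},
}

# Invert the vocabulary into a lookup table from any variant to its preferred label.
LOOKUP = {
    variant: label
    for label, variants in VOCABULARY.items()
    for variant in variants
}

def normalize(term):
    """Return the preferred label for a term, or None if it is not in the vocabulary."""
    return LOOKUP.get(term.strip().lower())

def tag(raw_terms):
    """Map author-supplied terms to preferred labels, dropping unknown terms and duplicates."""
    labels = []
    for term in raw_terms:
        label = normalize(term)
        if label is not None and label not in labels:
            labels.append(label)
    return labels
```

Applying the same `normalize` step to search queries means content and queries meet on the same labels.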


Taxonomy

A taxonomy is a hierarchical structure for the classification or organization of data. Taxonomies were historically used by biologists to classify plants and animals according to a set of natural relationships; in content management and information architecture, they serve as a tool for organizing content. Perhaps the greatest benefit of taxonomies is improved searching. Strictly speaking, a taxonomy is a subject-based classification that arranges the terms of a controlled vocabulary into a hierarchy and does nothing further, though in practice the term 'taxonomy' is applied to more complex structures as well.
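One way a hierarchy can improve searching is by letting a query for a broad term also match content tagged with narrower terms. The sketch below assumes a small, made-up taxonomy and document set; the names are hypothetical.

```python
# Hypothetical taxonomy: each term maps to its narrower (child) terms.
TAXONOMY = {
    "documentation": ["user documentation", "developer documentation"],
    "user documentation": ["tutorial", "online help"],
    "developer documentation": ["API reference"],
}

def descendants(term):
    """Return the term plus everything below it in the hierarchy, depth first."""
    result = [term]
    for child in TAXONOMY.get(term, []):
        result.extend(descendants(child))
    return result

def search(documents, term):
    """Find documents tagged with the term or with any narrower term."""
    wanted = set(descendants(term))
    return sorted(name for name, doc_tag in documents.items() if doc_tag in wanted)

# Hypothetical documents, each tagged with one taxonomy term.
DOCUMENTS = {
    "getting-started.xml": "tutorial",
    "rest-api.xml": "API reference",
    "context-help.xml": "online help",
}
```

A search for "user documentation" then finds the tutorial and the online help even though neither carries that exact tag.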

Faceted Classification

In faceted classification, a document is described along several independent axes by picking one term from each facet. A faceted classification scheme is a special case of a controlled vocabulary. The eXchangeable Faceted Metadata Language (XFML) is a language designed for exchanging such metadata, with the metadata of topics arranged into facets. You need a taxonomy of the domain before you can exchange the metadata. Generating your navigation from XFML separates content from navigation, which makes it easier to adjust and evolve your taxonomies.
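The "one term from each facet" rule can be sketched in a few lines of Python. This is not XFML itself, only an in-memory illustration of the model; the facets, terms, and document names are invented for the example.

```python
# Hypothetical facets, each with its own small controlled vocabulary.
FACETS = {
    "audience": {"end user", "administrator", "developer"},
    "format": {"tutorial", "reference", "how-to"},
    "product": {"editor", "server"},
}

def classify(**terms):
    """Validate a classification: exactly one known term for every facet."""
    if set(terms) != set(FACETS):
        raise ValueError("one term is required for each facet")
    for facet, term in terms.items():
        if term not in FACETS[facet]:
            raise ValueError(f"{term!r} is not a term in facet {facet!r}")
    return dict(terms)

DOCUMENTS = {
    "install-guide": classify(audience="administrator", format="how-to", product="server"),
    "api-docs": classify(audience="developer", format="reference", product="server"),
    "first-steps": classify(audience="end user", format="tutorial", product="editor"),
}

def filter_by(documents, **selection):
    """Return the names of documents that match every selected facet value."""
    return sorted(
        name
        for name, facets in documents.items()
        if all(facets[facet] == term for facet, term in selection.items())
    )
```

Because navigation is computed from the classification rather than stored with the content, adding or renaming a facet changes only the `FACETS` table and the tags, not the documents themselves.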


Thesaurus

A thesaurus shows all of the relationships between concepts and terms, that is, how one term is associated with another. Thesauri provide a much richer vocabulary for describing terms than taxonomies do. The key feature of a thesaurus is these relationships, or associations, between terms.
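The relationships in a thesaurus are conventionally typed as broader term (BT), narrower term (NT), related term (RT), and used for (UF, a non-preferred synonym). A minimal sketch of using them to widen a search query follows; the entries and the function name are hypothetical.

```python
# Hypothetical thesaurus entries keyed by preferred term.
# BT = broader term, NT = narrower term, RT = related term, UF = used for.
THESAURUS = {
    "metadata": {
        "BT": ["information"],
        "NT": ["keywords"],
        "RT": ["indexing"],
        "UF": ["meta data"],
    },
    "keywords": {
        "BT": ["metadata"],
        "NT": [],
        "RT": ["tagging"],
        "UF": ["index terms"],
    },
}

def expand_query(term, relations=("NT", "RT", "UF")):
    """Return the term followed by the terms linked to it through the given relations."""
    expanded = [term]
    entry = THESAURUS.get(term, {})
    for relation in relations:
        for linked in entry.get(relation, []):
            if linked not in expanded:
                expanded.append(linked)
    return expanded
```

Choosing which relations to follow is a design decision: narrower terms and synonyms usually raise recall safely, while broader terms can make the result set much larger.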


Ontology

Ontologies in computer science came out of artificial intelligence and have generally been closely associated with logical inference and similar techniques, but they have recently begun to be applied to information retrieval. The ontology concept is closely tied to the W3C's Semantic Web initiative. An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of the basic concepts in the domain and the relations among them. The OWL Web Ontology Language is intended for situations where the information contained in documents needs to be processed by applications, as opposed to situations where the content only needs to be presented to humans.
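To give a feel for what "machine-interpretable" buys, here is a deliberately tiny sketch of one kind of inference: deriving every class an individual belongs to from subclass statements. It is plain Python, not OWL, and the class and individual names are hypothetical; the subclass chain is assumed to contain no cycles.

```python
# Hypothetical ontology fragment: class -> its direct superclass.
SUBCLASS_OF = {
    "Tutorial": "Document",
    "Document": "InformationResource",
}

# Hypothetical individuals: individual -> its directly asserted class.
INSTANCE_OF = {"getting-started": "Tutorial"}

def superclasses(cls):
    """Follow subclass statements upward and return all inferred superclasses."""
    chain = []
    while cls in SUBCLASS_OF:  # assumes the subclass chain has no cycles
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain

def types_of(individual):
    """Return the asserted class followed by every class inferred from it."""
    direct = INSTANCE_OF[individual]
    return [direct] + superclasses(direct)
```

An application can therefore answer "is this an information resource?" for an item that was only ever tagged as a tutorial.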

Topic Maps

Topic Maps have been described as 'the GPS of the information universe' and are an ontology framework for information retrieval. Topic Maps is a model for describing knowledge structures and associating them with any kind of information resource. Topic Maps and RDF (see below) address the same problem but are separate specifications. Both models are suitable for solving the knowledge management problem, but the ideas that inspired them were different. RDF was developed with the Semantic Web in mind, while Topic Maps was born as a practical way to build indexes of information resources. XML Topic Maps (XTM) provides a model and grammar for representing the structure of information resources used to define topics, and the associations (relationships) between topics.
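The core of the model (topics, associations between topics, and occurrences that link topics to information resources) can be sketched as a small in-memory structure. This is not the XTM grammar; the class names, association kinds, role names, and example.org URLs are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    name: str
    occurrences: list = field(default_factory=list)  # links to information resources

@dataclass
class Association:
    kind: str
    roles: dict  # role name -> name of the topic playing that role

class TopicMap:
    def __init__(self):
        self.topics = {}
        self.associations = []

    def add_topic(self, name, *occurrences):
        self.topics[name] = Topic(name, list(occurrences))

    def associate(self, kind, **roles):
        """Relate existing topics; every role player must already be a topic."""
        for topic_name in roles.values():
            if topic_name not in self.topics:
                raise KeyError(topic_name)
        self.associations.append(Association(kind, roles))

    def related(self, name):
        """Return the topics that share at least one association with the given topic."""
        found = []
        for assoc in self.associations:
            players = list(assoc.roles.values())
            if name in players:
                found.extend(p for p in players if p != name and p not in found)
        return found

tm = TopicMap()
tm.add_topic("XML", "http://example.org/docs/xml-intro.html")
tm.add_topic("XTM")
tm.add_topic("Topic Maps", "http://example.org/docs/topic-maps.html")
tm.associate("syntax-for", syntax="XTM", model="Topic Maps")
tm.associate("based-on", derived="XTM", base="XML")
```

The map works like a back-of-book index: a reader navigates from topic to topic through associations and only then follows an occurrence to the underlying resource.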

The Resource Description Framework

The Resource Description Framework (RDF) is an alternative to Topic Maps, but it is often considered too complex for mere mortals. RDF is a W3C Recommendation for expressing metadata about any kind of target, from real-life objects to abstract entities, and it is particularly useful for internet resources such as documents or server-side processes. RDF integrates a variety of applications, from library catalogs and world-wide directories, to syndication and aggregation of news, software, and content, to personal collections, using XML as an interchange syntax. The RDF specifications provide a lightweight ontology system to support the exchange of knowledge on the Web.
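RDF statements are subject-predicate-object triples, and XML is one way to interchange them. The following sketch serializes a couple of triples with literal objects as RDF/XML using only the Python standard library. The RDF and Dublin Core namespace URIs are the published ones; the example.org subject URI and the function name are hypothetical, and the predicate splitting assumes namespaces that end in a slash.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

# Each statement is a (subject, predicate, object) triple; objects here are plain literals.
TRIPLES = [
    ("http://example.org/talks/metadata-and-xml", DC_NS + "title", "Metadata and XML"),
    ("http://example.org/talks/metadata-and-xml", DC_NS + "date", "2004-11-10"),
]

def to_rdf_xml(triples):
    """Serialize literal-valued triples as RDF/XML, one rdf:Description per subject."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    descriptions = {}
    for subject, predicate, literal in triples:
        if subject not in descriptions:
            node = ET.SubElement(root, f"{{{RDF_NS}}}Description")
            node.set(f"{{{RDF_NS}}}about", subject)
            descriptions[subject] = node
        # Split the predicate URI into namespace and local name
        # (assumes a slash-terminated namespace such as Dublin Core's).
        namespace, local = predicate.rsplit("/", 1)
        prop = ET.SubElement(descriptions[subject], f"{{{namespace}/}}{local}")
        prop.text = literal
    return ET.tostring(root, encoding="unicode")

rdf_xml = to_rdf_xml(TRIPLES)
```

A real application would more likely use a dedicated RDF library; the point here is only that the triple model maps onto ordinary namespaced XML.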