Modern metadata through a DITA lens
In structured authoring, metadata is best defined as “data about content”¹. It may describe what the content is about, who wrote it, or what it is intended to achieve. It is used to make content searchable, discoverable, manageable, interchangeable, analyzable, and more easily integrated with other systems. Structured authoring allows us to apply metadata to granular chunks of content. For example, in DITA XML, it can be applied to maps and topics, but also to elements and attributes within those topics.
Just as DITA made structured content far easier to adopt and work with, standards around metadata allow different systems and organization- or industry-specific vocabularies to be linked together successfully. This post talks about some of those standards in relation to structured authoring, as well as the basic principles behind all metadata.
Principles of metadata
The NISO standards body distinguishes four types of metadata (http://www.niso.org/publications/press/understanding_metadata). The first three are:
- Descriptive (for finding and understanding information, e.g. values from a subject taxonomy)
- Administrative (technical and legal information, e.g. data about image formats to use in rendering, or copyright data to decide when and where content can be used)
- Structural (the relationship of resources to other resources and container objects)
As NISO points out, these are not mutually exclusive. For example, the same piece of metadata that describes an article’s author for internal administration purposes may also enable the author byline to be displayed when the article is published.
NISO lists markup languages as the fourth type, because these languages mix metadata and content together to a greater extent than other content storage approaches do. The metadata in markup languages can still be classified under the first three types, but it is applied inline to nested, granular chunks of content: block and phrase-level elements. This allows ways of working with content that are not possible with traditional document- or page-oriented content management. For example:
- Phrases and keywords can be automatically hyperlinked to definitions or related information without risking broken links when content changes, moves, or is deleted.
- In search interfaces and elsewhere, more relevant paragraphs and blocks of information (sometimes known as microcontent) can be suggested — not just the title and short description or other snippets derived from naive indexing of literal strings.
- In conversational interfaces, useful short answers (microcontent again) can be sourced from long articles provided that the inline elements in those articles are suitably annotated.
- By transforming the markup into web pages enriched with the Schema.org vocabulary, search engines can understand the content better and use snippets of information in link previews.
Key ideas and terms
In any publishing environment, metadata follows the same basic pattern: a series of statements are made about a piece of content. We can break each statement into three components:
- The content itself, or an identifier that represents it
- Some data about the content
- A component that represents the type of relationship between the content and the data that describes it
In traditional data modelling terms, the content is an Entity; the relationship an Attribute; and the data a Value of that attribute. In Linked Data terminology / RDF, the same conceptual components are understood as analogous to natural language statements and are called Subject, Predicate, and Object. However, in an RDF statement the Object can alternatively represent another entity, and the Predicate a relationship with that entity. As Entity / Attribute / Value are more specific to the case of describing content directly, I'll use those terms for the rest of this post.
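As a minimal sketch of this pattern, the statements can be modelled as simple three-part records. Python is used here purely for illustration; the `Statement` type, the `values_for` helper, and the identifiers `topic-123` and `map-456` are hypothetical and not part of DITA or any library.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    """One metadata statement: entity, attribute, value."""
    entity: str     # the content, or an identifier that represents it
    attribute: str  # the type of relationship between content and data
    value: str      # the data about the content


# Hypothetical identifiers and values, for illustration only.
STATEMENTS = [
    Statement("topic-123", "author", "Francis-Noël Thomas"),
    Statement("topic-123", "goal", "factory reset"),
    Statement("map-456", "market", "EMEA"),
]


def values_for(entity: str, attribute: str) -> List[str]:
    """Return every value stated for one attribute of one entity."""
    return [s.value for s in STATEMENTS
            if s.entity == entity and s.attribute == attribute]


print(values_for("topic-123", "author"))
```

In RDF terms, the same three fields would be the Subject, Predicate, and Object, with the difference that the Object could also be the identifier of another entity.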
As in any technical field, a variety of terms are in use for similar ideas. To help with definitions for the rest of the post, here are three important ideas, with some of the terms associated with them:
|Schema / Ontology / Vocabulary||In a broad sense, these define containers of information (entities / classes) and the relationships (properties / elements) between these containers and the data they hold. When “vocabulary” is used on its own, it typically has this sense (see a recent discussion).|
|Taxonomy / Thesaurus / Controlled vocabulary / Controlled list||These structures define key concepts that can be used as data in metadata statements. This is often descriptive data about the resource (entity) that is the subject of the statement, for example the themes of a document, the user goals it addresses, or the products it relates to. A taxonomy concept does not become metadata until it is used as part of a statement: in an element or field, as a property of the resource it describes. The compound term “controlled vocabulary” always means a taxonomy rather than an ontology / schema.|
|Linked Data / Semantic Web||These terms describe an increasingly popular approach to defining and using metadata schemas and taxonomies. In the NISO document’s list of “Notable Metadata Languages: Examples in Broad Use”, nearly all use this approach, or at least have a published version that uses it. The approach is to express and publish semantic data so that the entities within are unambiguously defined and so that different schemas and usages (whether within an organization or across the web) are easily linked. It uses web technologies to achieve this, particularly the use of URIs (now IRIs) as identifiers. These are typically in URL format, which can cause confusion for people new to the Linked Data approach: they may link to useful information about the identified entity, but they don’t have to. A key technology in Linked Data is RDF — the framework that provides the subject - predicate - object terminology used above.|
The following sections outline two key use cases for metadata in DITA: describing content with taxonomy, and understanding DITA structures in relation to external schemas or ontologies.
Describing DITA content with taxonomy and controlled lists
We often need to tag content using values from an external source such as an enterprise or industry taxonomy. This makes it easier for all users, internal and external, to find the content, for example via navigational hierarchies or faceted search interfaces. It can also power intelligent content recommendations, ease information sharing, and help with analyzing data on how the content is being used.
In structured content, particularly in markup languages, taxonomical metadata can describe content chunks at various levels of granularity. In DITA, these levels could be root maps, sub-maps, topics, block elements, and phrase-level elements. For example:
|Entity||Attribute||Value|
|(a DITA map)||market||EMEA|
|(a DITA map)||product||Acme flip phone|
|(a topic)||goal||factory reset|
|(a topic)||author||Francis-Noël Thomas|
|(an inline mention of a company name, for example in a
These examples can be expressed in DITA in various ways². The topic author could be stated using the available <prolog> metadata. This illustrates a powerful ability of markup languages: nested text containers (elements) can provide metadata describing their parent objects.
<task id="topic">
  <title>…</title>
  <prolog>
    <author>Francis-Noël Thomas</author>
  </prolog>
  …
</task>
If the author value will be manually keyed in (or if the editing environment can restrict prolog element values to an allowable list), this could be a good option. Most authoring tools display element text by default and allow it to be directly edited.
However, if the author name is sourced from an external list of values, we might need to use a unique identifier for that name (which also makes life easier if the author changes name at some point). This could be placed in the <author> element still or in an attribute. Attributes are often better locations for data that is not intended to be readable or translatable. In addition, their values can be controlled by Subject Scheme maps.
Thus, using a specialized attribute, we could either add the value to the <author> element in the prolog or, since the prolog structure then becomes rather redundant semantically, add it directly to the topic's root element (here, <task>):
<task id="topic" author="francis-noel-thomas">
  <title>…</title>
  …
</task>
If we are working with external systems, it may be necessary to use a URI as the unique identifier for the author:
<task id="topic" author="http://example.com/users/0800200c9a66">
  <title>…</title>
  …
</task>
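To illustrate how a Subject Scheme map can control the values of such an attribute, here is a rough Python sketch. The scheme content, the second author key `jane-doe`, and the `allowed_values` helper are assumptions made for this example. A real DITA processor or editor handles Subject Scheme far more completely (multiple keys per `keys` attribute, `<elementdef>` restrictions, nested schemes); this only shows the basic idea of binding an attribute to a controlled list.

```python
import xml.etree.ElementTree as ET
from typing import Set

# Assumed Subject Scheme content: binds the specialized author attribute
# to the keys listed under the "authors" subject. DOCTYPE omitted.
SCHEME = """
<subjectScheme>
  <subjectdef keys="authors">
    <subjectdef keys="francis-noel-thomas"/>
    <subjectdef keys="jane-doe"/>
  </subjectdef>
  <enumerationdef>
    <attributedef name="author"/>
    <subjectdef keyref="authors"/>
  </enumerationdef>
</subjectScheme>
"""

TOPIC = '<task id="topic" author="francis-noel-thomas"><title>Example</title></task>'


def allowed_values(scheme_xml: str, attribute: str) -> Set[str]:
    """Collect the keys that the scheme allows for one attribute."""
    root = ET.fromstring(scheme_xml)
    allowed: Set[str] = set()
    for enum in root.iter("enumerationdef"):
        attr_def = enum.find("attributedef")
        if attr_def is None or attr_def.get("name") != attribute:
            continue
        for ref in enum.findall("subjectdef"):
            category = ref.get("keyref")
            # Find the subject that defines this category, then its children.
            for parent in root.iter("subjectdef"):
                if parent.get("keys") != category:
                    continue
                for child in parent.iter("subjectdef"):
                    if child is not parent and child.get("keys"):
                        allowed.add(child.get("keys"))
    return allowed


author = ET.fromstring(TOPIC).get("author")
print(author in allowed_values(SCHEME, "author"))
```

An authoring tool that understands Subject Scheme would typically offer these keys as a pick list rather than validating after the fact.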
Benefits of unique identifiers for controlled values
As in the author example above, we often need to insert standard snippets of text or data in metadata statements. The values are defined in controlled lists of terms or taxonomies. This means that we can effectively exchange content between systems that have access to the same lists or taxonomies, for example web delivery platforms with faceted search interfaces, analytics tools, or taxonomy management systems. It may seem that the easiest way to use a value in metadata is simply to use the readable string, for example “Francis-Noël Thomas” as the author of a DITA topic, or “factory reset” as the topic's subject value: the concept that the topic is about. However, relying solely on readable strings presents problems when exchanging content between systems or working with an external taxonomy management tool:
- If there are several strings for the same idea (concept) in different languages, which should be the authoritative one?
- If the string is changed, for example due to capitalization rule changes, how do we update that across all the different pieces of content that refer to it?
- If the string is taken from a hierarchical taxonomy or an even richer structure, how do we share changes to that structure without having to update all the uses of the string?
Many systems such as older content management systems, discussion forum software, and some CRMs use literal strings for controlled values, and integration is always difficult as a result.
The smarter, future-proof approach is to use a unique identifier for each value from a controlled vocabulary. It is this identifier, not the associated string, that is embedded in any content that needs to use it. In this way, any updates to the string do not require changes to the content. It is far easier to integrate systems and keep them in sync. In fact, a single identifier can represent a “concept” that includes readable labels in various languages and even alternative labels — synonyms — for the same idea. Clearly, for usability and accuracy, system users have to be able to work with the readable labels instead of pasting in IDs from a list. Many systems allow exactly that.
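The idea can be sketched in a few lines of Python. Everything here is hypothetical: the `Concept` type, the identifier `goal-0042`, the French label, and the synonyms are invented for illustration and do not come from any particular taxonomy tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """A controlled value: one stable identifier, many readable labels."""
    identifier: str
    pref_labels: Dict[str, str]  # language code -> preferred label
    alt_labels: Dict[str, List[str]] = field(default_factory=dict)  # synonyms


# Hypothetical taxonomy with a single concept.
TAXONOMY = {
    "goal-0042": Concept(
        identifier="goal-0042",
        pref_labels={"en": "factory reset", "fr": "réinitialisation d'usine"},
        alt_labels={"en": ["hard reset", "master reset"]},
    ),
}


def label_for(identifier: str, lang: str, fallback: str = "en") -> str:
    """Resolve an identifier stored in content to a readable label."""
    concept = TAXONOMY[identifier]
    return concept.pref_labels.get(lang, concept.pref_labels[fallback])


def find_identifier(text: str, lang: str) -> Optional[str]:
    """Let users search by label or synonym, and store only the identifier."""
    needle = text.casefold()
    for concept in TAXONOMY.values():
        labels = list(concept.alt_labels.get(lang, []))
        if lang in concept.pref_labels:
            labels.append(concept.pref_labels[lang])
        if needle in (label.casefold() for label in labels):
            return concept.identifier
    return None


print(find_identifier("Hard Reset", "en"))
print(label_for("goal-0042", "fr"))
```

Because content stores only `goal-0042`, changing the English label in the taxonomy would not require touching any topic that uses it.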
In taxonomies based on RDF standards such as SKOS, the identifiers are all URIs; typically in URL format, such as:
(Their primary purpose is still to identify concepts, although a best practice in Linked Data is to have the URL point to some useful resource about the concept.) However, there is no easy way to work with URIs in DITA Subject Scheme maps, nor is there a good way to embed URIs in attributes, at least in out-of-the-box, unspecialized DITA. Discussions in the DITA Technical Committee have raised some promising opportunities to deal with these challenges.
Understanding DITA elements in relation to external ontologies
So far, we have looked at the basic model for expressing data about a chunk of content — both the data itself, and the relationship that it has with the content (i.e. what property of the content is being expressed). However, there need to be some rules for what relationships can be expressed, and perhaps also what forms the data can take. A complete metadata schema expresses:
- The types of content that the data can be applied to
- For each type of content, what relationships or attributes may be expressed (or, in some cases, must be present)
- For each property, what values are allowed: what format do they need to be in, and are they keyed in by the author, generated automatically by the system, or sourced from a controlled list?
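The three parts of a schema listed above can be sketched as a small data structure plus a check. This is an illustration only: the `PropertyRule` type, the `validate` function, and the rules and values in `SCHEMA` are all hypothetical, and a real schema would also cover value formats and system-generated values.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class PropertyRule:
    """One property that may (or must) be stated about a content type."""
    name: str
    required: bool = False
    allowed: Optional[FrozenSet[str]] = None  # None means free text


# Hypothetical schema: content type -> rules for its properties.
SCHEMA: Dict[str, List[PropertyRule]] = {
    "topic": [
        PropertyRule("author", required=True),
        PropertyRule("goal", allowed=frozenset({"factory reset", "pair device"})),
    ],
    "map": [
        PropertyRule("market", allowed=frozenset({"EMEA", "APAC", "AMER"})),
    ],
}


def validate(content_type: str, metadata: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the metadata conforms."""
    rules = SCHEMA.get(content_type)
    if rules is None:
        return [f"unknown content type: {content_type}"]
    known = {rule.name for rule in rules}
    problems = [f"{name}: not allowed on {content_type}"
                for name in metadata if name not in known]
    for rule in rules:
        if rule.name not in metadata:
            if rule.required:
                problems.append(f"{rule.name}: required")
            continue
        if rule.allowed is not None and metadata[rule.name] not in rule.allowed:
            problems.append(f"{rule.name}: value not in controlled list")
    return problems


print(validate("topic", {"goal": "reboot"}))
```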
In DITA, some of the more mechanical rules for structuring metadata are provided in the DITA standard itself, and typically enforced in the authoring tool via the DTD or other schema format. For example, the content model for metadata elements in a topic <prolog>, and the general rules for using XML attributes, are controlled in this way.
Other rules, particularly concerning the meaning of metadata, tend to be maintained in human-readable rather than machine-readable form. If an authoring team decides that the <category> element should be used to store subject matter / aboutness metadata, this guideline may be defined in a team wiki alongside other authoring conventions.
However, as organizations realize the potential of integrating structured content environments with external systems for content delivery, semantic search, customer relationship management, master data management and so on, it becomes more important that DITA elements (and hence their content) can be understood in relation to external ontologies that link these systems together.
For example, the author metadata above could be understood as the Creator element from the widely known, though perhaps dated, Dublin Core metadata schema. Equally, it could be the Schema.org author property of the resulting generated page.
Schema.org is a very popular vocabulary for marking up web content, primarily to help external search engines such as Google understand the content, rank it better for relevant queries, and in some cases display enhanced result snippets. It provides both the general, whole-document level descriptive and administrative metadata that you would expect to find in other schemas such as Dublin Core, but it also mirrors aspects of DITA in its ability to mark up granular fragments of content in a semantically meaningful way. For example, the schema:HowTo entity (perhaps the equivalent of a task topic) can have a schema:HowToStep (equivalent of <step>) that itself relates to a schema:HowToDirection (equivalent of <cmd>). Using our previous framework, a simplified view of this metadata could be:
|(a task topic)||http://schema.org/HowToDirection||(content of the <cmd> element)|
In fact, there are a number of statements involved to describe this piece of metadata. A more accurate way to visualize it is:
This structure can be parsed and used by services such as Google and Bing to highlight relevant data in search results. There are also some non-search-engine applications of Schema.org being discussed and developed. An obvious use case for DITA in a web publishing environment (particularly a commercial one) is to map these semantic elements to their Schema.org markup equivalents.
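As a rough illustration of such a mapping, the following Python sketch converts a small DITA task into Schema.org JSON-LD, following the HowTo, HowToStep, and HowToDirection correspondence described earlier. The sample task content and the `task_to_howto` function are assumptions made for this example; this is not how the DITA Open Toolkit or any existing plugin works, and a production transform would more likely be written in XSLT.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Sample task content, invented for this example.
TASK = """
<task id="factory-reset">
  <title>Reset the phone</title>
  <taskbody>
    <steps>
      <step><cmd>Open Settings.</cmd></step>
      <step><cmd>Select Factory reset.</cmd></step>
    </steps>
  </taskbody>
</task>
"""


def task_to_howto(task_xml: str) -> Dict[str, Any]:
    """Map task -> HowTo, step -> HowToStep, cmd -> HowToDirection."""
    task = ET.fromstring(task_xml)
    steps = []
    for step in task.iter("step"):
        cmd = step.find("cmd")
        if cmd is None:
            continue  # skip malformed steps in this simplified sketch
        steps.append({
            "@type": "HowToStep",
            "itemListElement": {
                "@type": "HowToDirection",
                "text": "".join(cmd.itertext()).strip(),
            },
        })
    return {
        "@context": "https://schema.org",
        "@type": "HowTo",
        "name": task.findtext("title"),
        "step": steps,
    }


print(json.dumps(task_to_howto(TASK), indent=2))
```

The resulting JSON could be embedded in the generated page inside a `<script type="application/ld+json">` element.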
However, there is currently no reliable, consistent way to map DITA elements to their equivalents in external vocabularies or ontologies. The DITA Open Toolkit does in fact generate some Dublin Core markup in HTML based on certain prolog metadata elements, but there is no easy way for external applications to work with that mapping directly in the DITA content. As Linked Data techniques become increasingly prevalent, both for standard ontologies and organization-specific ones, the current approach of ad-hoc mappings shows its limitations.
For sure, it is not the DITA standard’s responsibility to incorporate every new ontology or vocabulary that comes along. However, it could be very useful for information architects to have a mechanism in DITA to indicate their own mappings from DITA elements (core or specialized) to entities and properties in external vocabularies. Simply enabling RDFa support in DITA could be one way, although it could face significant adoption and usage challenges. Perhaps a better mapping method would be to extend Subject Scheme to define mappings between DITA elements and URIs for externally defined entities and properties. Discussions in the DITA Technical Committee on these and other approaches may reveal a sensible way forward to make DITA, in the Linked Data sense, more truly semantic.
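To make the idea of an element-to-vocabulary mapping concrete, here is a hypothetical sketch in Python. No such mechanism exists in the DITA standard today; the `ELEMENT_MAPPINGS` table, the `to_triples` function, and the `http://example.com/topics/` base are invented for illustration. The property IRIs themselves are the published Dublin Core Terms and Schema.org identifiers.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical mapping table: DITA element name -> external property IRIs.
ELEMENT_MAPPINGS = {
    "title": ["http://purl.org/dc/terms/title", "http://schema.org/name"],
    "author": ["http://purl.org/dc/terms/creator", "http://schema.org/author"],
}

# Sample topic content, invented for this example.
TOPIC = """
<task id="factory-reset">
  <title>Reset the phone</title>
  <prolog><author>Francis-Noël Thomas</author></prolog>
</task>
"""


def to_triples(topic_xml: str,
               base: str = "http://example.com/topics/") -> List[Tuple[str, str, str]]:
    """Emit (subject, predicate, object) statements for mapped elements."""
    root = ET.fromstring(topic_xml)
    subject = base + root.get("id")
    triples = []
    for element in root.iter():  # document order
        text = "".join(element.itertext()).strip()
        for predicate in ELEMENT_MAPPINGS.get(element.tag, []):
            triples.append((subject, predicate, text))
    return triples


for triple in to_triples(TOPIC):
    print(triple)
```

Keeping the mapping in one table, separate from the content, is the same design choice that an extended Subject Scheme would make: information architects could change the target vocabulary without editing topics.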
- Good overview of metadata uses and some common standards (PDF):
- More on benefits of IDs/URIs instead of literal taxonomy term labels (PDF): http://eprints.rclis.org/16818/1/SP_clarke_zeng_isqv24no1.pdf
- Excellent presentation that describes the fundamental structures behind all metadata (PDF): https://web.archive.org/web/20130126101115/http://www.metalounge.org/_literature_52579/Stephen_Machin_–_ON_METADATA_AND_METACONTENT
- Useful discussion of the similarities and differences between schemas (in terms of data modeling) and ontologies
¹ Difficulties of defining metadata
Many people have observed that metadata is hard to define. In databases or content repositories, in some way all data is related, so what is it about certain bits of data that makes them "meta"? Deane Barker points out that in web content management it can no longer be defined as being "external", because in XML, metadata is stored alongside content. Neither can it be separated according to purpose: for display, management, or otherwise.
Stephen Machin solved the problem in a way by using an older, narrower definition of metadata: just the design of a database, that is, the schema that sets rules for containers (entities) and properties or attributes. For the current sense of metadata, he used the term metacontent: content about content; the data that goes in the attributes. Yet this doesn't help us much from a pragmatic point of view, and it is the broader sense of metadata that is commonly used now.
Here, I'm using the term to mean "data about content", including content that describes other content. It doesn't mean that that "metacontent" will never be content in its own right. It's just that in the current, functional context, that's how it's behaving. If it acts as metadata, I’m calling it metadata.
² Storing metadata directly in XML content or attached to it
The examples here are of metadata stored directly in the XML of a document: as element or attribute values. However, when structured content is managed in a database, it is often possible to attach metadata externally to the content, as database fields. For example, some component content management systems (CCMSs), popular in DITA and other structured documentation formats, allow taxonomy tags to be attached to discrete database objects such as maps/books and topics/sections. This can have benefits. The object may have already been through the workflow and been approved and released, in the context of one publication or deliverable. Subsequently, it may be used in a new context requiring additional metadata, such as applying to a new market or product. If the metadata is stored in the XML, that object may need to go back through the draft — review — approval workflow, just to add new metadata. On the other hand, if the metadata is stored externally, it can be updated without needing to modify the content itself.
The limitation of most CMSs/databases is that there is no easy way to apply metadata to smaller chunks within a topic or section. If this is needed, the best solution is still to embed the metadata in the markup, most feasibly as attributes in XML.