Data for humans vs. data for machines

This paper focuses on the written information that human beings generate and use, but doesn't exclude information for machines. The reason we can't overlook information for machines is that much of the information we produce for human consumption must also be ‘machine-friendly,’ though we may not realize it when we're composing our documents.

The company policy manual, for example, may be a Microsoft Word document that we've formatted to look good in print, but at some point we'll need to publish it on the company intranet. If the software that converts the document from Word to HTML cannot recognize a logical markup structure, the person who does the conversion must either supply the missing elements or post a broken document.

In a Word document, ‘logical markup structure’ refers to the hierarchy of information that we create by applying styles.

To the average person, a document with styles may look exactly the same as a document with similar formatting but no styles. For a computer, on the other hand, they would be radically different documents. The computer would ‘understand’ the document that lacks styles as a cluster of words with no logical breaks or organization other than punctuation and paragraphs. For computers, a document without styles is not really a document.
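The difference can be illustrated with a minimal sketch. It assumes a hypothetical in-memory model in which each paragraph carries an optional style name plus its visual formatting; the Paragraph class, the outline function and the sample text are invented for illustration and are not Word's actual file format or API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Paragraph:
    """Hypothetical paragraph model: text, visual formatting, optional style."""
    text: str
    style: Optional[str] = None  # e.g. "Heading 1"; None means no style applied
    bold: bool = False
    font_size: int = 11


def outline(paragraphs: List[Paragraph]) -> List[Tuple[int, str]]:
    """Return (level, text) pairs for paragraphs tagged with a heading style."""
    result = []
    for p in paragraphs:
        if p.style and p.style.startswith("Heading "):
            level = int(p.style.split()[-1])
            result.append((level, p.text))
    return result


# Formatted with styles: the hierarchy is recorded explicitly.
styled = [
    Paragraph("Vacation Policy", style="Heading 1", bold=True, font_size=16),
    Paragraph("Eligibility", style="Heading 2", bold=True, font_size=13),
    Paragraph("All full-time employees are eligible.", style="Normal"),
]

# Formatted by hand: identical appearance, but no style information.
unstyled = [
    Paragraph("Vacation Policy", bold=True, font_size=16),
    Paragraph("Eligibility", bold=True, font_size=13),
    Paragraph("All full-time employees are eligible."),
]

print(outline(styled))    # [(1, 'Vacation Policy'), (2, 'Eligibility')]
print(outline(unstyled))  # []
```

Both lists would look the same on paper, yet only the styled one gives the program a hierarchy to work with.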

Those of you who are familiar with Word may also know that Word has an export-to-Web function. Unfortunately, this feature cannot add logical structure to a document that lacks styles. Word will create a web page that looks like its print cousin but has no more structure than the original. The day when machines are able to associate physical appearance with logical structure, as humans do, is probably still about twenty years away.

For now, machines must rely on us to define the logical divisions of a document for them. Unless we format documents properly in the first place (with styles rather than manual formatting), or the person who converts the document knows exactly what the author had in mind and takes the time to add styles, the result will be a web page that other machines will interpret in unpredictable ways.
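As a rough sketch of why the conversion depends on styles, consider a toy converter. It assumes each paragraph arrives as a (style, text, bold) tuple; the style names, the HEADING_TAGS mapping and the to_html function are hypothetical and do not describe how Word's own export works.

```python
from html import escape

# Assumed mapping from style names to HTML heading elements.
HEADING_TAGS = {"Heading 1": "h1", "Heading 2": "h2", "Heading 3": "h3"}


def to_html(paragraphs):
    """Convert (style, text, bold) tuples to HTML, one element per line."""
    lines = []
    for style, text, bold in paragraphs:
        # Paragraphs without a recognized heading style fall back to <p>.
        tag = HEADING_TAGS.get(style, "p")
        body = escape(text)
        if tag == "p" and bold:
            # Manual bold survives only as appearance, not as structure.
            body = f"<b>{body}</b>"
        lines.append(f"<{tag}>{body}</{tag}>")
    return "\n".join(lines)


styled = [
    ("Heading 1", "Vacation Policy", True),
    ("Normal", "All full-time employees are eligible.", False),
]
manual = [
    (None, "Vacation Policy", True),
    (None, "All full-time employees are eligible.", False),
]

print(to_html(styled))
print(to_html(manual))
```

With styles, the title becomes an h1 element that other programs can recognize as a heading. With manual formatting, it becomes a bold paragraph, and any program that reads the page has to guess whether it is a heading at all.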

Documents that are not machine-friendly are usually not human-friendly either. In the next section, we'll discuss the three components of written documents (content, markup and metadata) and how these components can work together to ensure consistent delivery of quality information to people and computers.