Prospects for Improved Web-Search Methods

Part of the hype of XML has been that web search engines will finally understand what a document means by looking at its markup. For instance, you can search for the movie Sneakers and just get back hits about the movie without having to sort through "Internet Wide Area `Tiger Teamers' mailing list," "Children's Side Zip Sneakers Recalled by Reebok," "Infant's `Little Air Jordan' Sneakers Recalled by NIKE," "Sneakers.com - Athletic shoes from Nike, Reebok, Adidas, Fila, New," and the 32,395 other results that Google pulled up on this search that had nothing to do with the movie.[6]

[6] In fairness to Google, four of the first ten hits it returned were about the movie.

In practice, this is still vapor, mostly because few web pages are available on the frontend in XML, even though more and more backends are XML. The search-engine robots only see the frontend HTML. As this slowly changes, and as the search engines get smarter, we should see more and more useful results. Meanwhile, you can add XML hints to your HTML pages that knowledgeable search engines can take advantage of, using the Resource Description Framework (RDF), the Dublin Core, and the robots processing instruction.

RDF

The Resource Description Framework (RDF, http://www.w3.org/RDF/) can be understood as an XML encoding for a particularly simple data model. An RDF document describes resources. Each resource has zero or more properties. Each property has a name and a value. The value may itself be another resource.

The root element of an RDF document is an RDF element. Each resource the RDF element describes is represented as a Description element whose about attribute contains a URI or other identifier pointing to the resource described. Each child element of the Description element represents a property of the resource. The contents of that child element are the value of that property. All RDF elements like RDF and Description are placed in the http://www.w3.org/1999/02/22-rdf-syntax-ns# namespace. Property values generally come from other namespaces.

For example, suppose we want to say that the tutorial XML in a Nutshell has the authors W. Scott Means and Elliotte Rusty Harold. In other words, we want to say that the resource identified by the URI urn: has one author property with the value "W. Scott Means" and another author property with the value "Elliotte Rusty Harold." Example 7-10 does this.

Example 7-10. A simple RDF document saying that W. Scott Means and Elliotte Rusty Harold are the authors of XML tutorial

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description about="urn:">
<author>Elliotte Rusty Harold</author>
<author>W. Scott Means</author>
</rdf:Description>
</rdf:RDF>

In this simple example the values of the author properties are merely text. However, they could be XML as well. Indeed, they could be other RDF elements.
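To make the data model concrete, here is a minimal Python sketch that reads a document shaped like Example 7-10 and lists each resource, property name, and value. It assumes the standard library's xml.etree.ElementTree parser and handles only plain-text property values; the function name rdf_triples is invented for this illustration and is not part of any RDF toolkit.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A copy of Example 7-10, indented for readability.
EXAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description about="urn:">
    <author>Elliotte Rusty Harold</author>
    <author>W. Scott Means</author>
  </rdf:Description>
</rdf:RDF>
"""

def rdf_triples(xml_text):
    """Return a list of (resource, property, value) tuples.

    Hypothetical helper: it only looks at rdf:Description children
    of the root element and treats every property value as text.
    """
    root = ET.fromstring(xml_text)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        # The example uses an unqualified about attribute;
        # a namespace-qualified rdf:about is accepted as well.
        resource = desc.get("about") or desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            triples.append((resource, prop.tag, (prop.text or "").strip()))
    return triples

for resource, name, value in rdf_triples(EXAMPLE):
    print(resource, name, value)
```

Each author element becomes one tuple, which mirrors the statement that a resource has zero or more properties, each with a name and a value.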

There's more to RDF, including containers, schemas, and nested properties. However, this will be sufficient description for web metadata.

Dublin Core

The Dublin Core, http://purl.org/dc/, is a standard set of fifteen information items with specified semantics that reflect the sort of data you'd be likely to find in a card catalog or annotated bibliography. These are:

Title
Fairly self-explanatory; this is the name by which the resource is known. For instance, the title of this tutorial is "XML tutorial."
Creator
The person or organization who created the resource, e.g., a painter, author, illustrator, composer, and so on. For instance, the creators of this tutorial are W. Scott Means and Elliotte Rusty Harold.
Subject
A list of keywords, very likely from some other vocabulary such as the Dewey Decimal System or Yahoo categories, identifying the topics of the resource. For instance, using the Library of Congress Subject Headings vocabulary, the subject of this tutorial is "XML (Document markup language)."
Description
Typically, a brief amount of text describing the content of the resource in prose, but it may also include a picture, a table of contents, or any other description of the resource. For instance, a description of this tutorial might be "A brief tutorial on and quick reference to XML and related technologies and specifications."
Publisher
The name of the person, company, or organization who makes the resource available. For instance, the publisher of this tutorial is "Anonymous & Associates."
Contributor
A person or organization who made some contribution to the resource but is not the primary creator of the resource. For example, the editors of this tutorial, Laurie Petrycki, Simon St.Laurent, and Jeni Tennison, might be identified as contributors, as would Susan Hart, the artist who drew the picture on the cover.
Date
The date when the resource was created or published, normally given in the form YYYY-MM-DD. For instance, this tutorial's date might be 2002-04-23.
Type
The abstract kind of resource such as image, text, sound, or software. For instance, a description of this tutorial would have the type text.
Format
For physical objects such as printed books, the physical dimensions of the resource. For instance, the paper version of XML in a Nutshell has the dimensions 6" x 9". For digital objects such as web pages, this is typically the MIME media type. For instance, an online version of this tutorial would have the Format text/html.
Identifier
A formal identifier for the resource, such as an ISBN, a URI, or a Social Security number. This tutorial's identifier is "0596002920."
Source
The resource from which the present resource was derived. For instance, the French translation of this tutorial might reference the original English version as its source.
Language
The language in which this resource is written, typically an ISO-639 language code, optionally suffixed with a hyphen and an ISO-3166 country code. For instance, the language for this tutorial is en-US. The language for the French translation of this tutorial might be fr-FR.
Relation
A reference to a resource that is in some way related to the current one, generally using a formal identifier, such as a URI or an ISBN. For instance, this might refer to the web page for this tutorial.
Coverage
The location, time, or jurisdiction the resource covers. For instance, the coverage of this tutorial might be the U.S., Canada, Australia, the U.K., and Ireland. The coverage of the French translation of this tutorial might be France, Canada, Haiti, Belgium, and Switzerland. Generally these will be listed in some formal syntax such as country codes.
Rights
Information about copyright, patent, trademark, and other restrictions on the content of the resource. For instance, a rights statement about this tutorial may say "copyleft 2002 Anonymous."

Dublin Core can be encoded in a variety of forms including HTML META tags and RDF. Here we concentrate on its encoding in RDF. Typically, each resource is described with an rdf:Description element. This element contains child elements for as many of the Dublin Core information items as are known about the resource. The name of each of these elements matches the name of one of the fifteen Dublin Core properties. These are placed in the http://purl.org/dc/elements/1.1/ namespace. Example 7-11 shows an RDF-encoded Dublin Core description of this tutorial.

Example 7-11. An RDF-encoded Dublin Core description for XML tutorial

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description about="urn:">
<dc:Title>XML tutorial</dc:Title>
<dc:Creator>W. Scott Means</dc:Creator>
<dc:Creator>Elliotte Rusty Harold</dc:Creator>
<dc:Subject>XML (Document markup language)</dc:Subject>
<dc:Description>A brief tutorial on and quick reference to XML and related technologies and specifications</dc:Description>
<dc:Publisher>Anonymous &amp; Associates</dc:Publisher>
<dc:Contributor>Laurie Petrycki</dc:Contributor>
<dc:Contributor>Simon St.Laurent</dc:Contributor>
<dc:Contributor>Jeni Tennison</dc:Contributor>
<dc:Contributor>Susan Hart</dc:Contributor>
<dc:Date>2002-04-23</dc:Date>
<dc:Type>text</dc:Type>
<dc:Format>6" x 9"</dc:Format>
<dc:Identifier>0596002920</dc:Identifier>
<dc:Language>en-US</dc:Language>
<dc:Relation></dc:Relation>
<dc:Coverage>US UK ZA CA AU NZ</dc:Coverage>
<dc:Rights>copyleft 2002 Anonymous &amp; Associates</dc:Rights>
</rdf:Description>
</rdf:RDF>
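As a rough illustration of how a search engine or other client might consume such a description, the following Python sketch collects the Dublin Core properties of each described resource into a dictionary. It assumes the standard library's xml.etree.ElementTree parser; the function name dublin_core and the trimmed sample document are invented for this illustration, and repeated items such as Creator are simply gathered into lists.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# A trimmed copy of Example 7-11 (only a few items kept).
SAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description about="urn:">
    <dc:Title>XML tutorial</dc:Title>
    <dc:Creator>W. Scott Means</dc:Creator>
    <dc:Creator>Elliotte Rusty Harold</dc:Creator>
    <dc:Publisher>Anonymous &amp; Associates</dc:Publisher>
    <dc:Language>en-US</dc:Language>
  </rdf:Description>
</rdf:RDF>
"""

def dublin_core(xml_text):
    """Map each described resource to {item name: [values]}.

    Hypothetical helper: child elements outside the Dublin Core
    namespace are ignored, and values are treated as plain text.
    """
    prefix = f"{{{DC_NS}}}"
    records = {}
    root = ET.fromstring(xml_text)
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        resource = desc.get("about") or desc.get(f"{{{RDF_NS}}}about")
        items = records.setdefault(resource, {})
        for child in desc:
            if child.tag.startswith(prefix):
                name = child.tag[len(prefix):]
                items.setdefault(name, []).append((child.text or "").strip())
    return records

record = dublin_core(SAMPLE)["urn:"]
print(record["Title"], record["Creator"])
```

Keeping every value in a list avoids special-casing items that may legitimately appear more than once, such as Creator and Contributor.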

There is as yet no standard for how an RDF document should be associated with the XML document it describes. One possibility is for the rdf:RDF element to be embedded in the document it describes, for instance, as a child of the BookInfo element of the DocBook source for this tutorial. Another possibility is that servers provide this meta information through an extra-document channel. For instance, a standard protocol could be defined that would allow search engines to request this information for any page on the site. Alternatively, a convention could be adopted so that for any URL xyz on a given website, the URL xyz/meta.rdf would contain the RDF-encoded Dublin Core metadata for that URL.
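The xyz/meta.rdf idea is only a possible convention, not an adopted standard, but a client following it would need very little code. The Python sketch below derives the metadata URL from a page URL. The function name metadata_url and the www.example.com address are invented for this illustration, and dropping any query string or fragment is an assumption of the sketch rather than part of the convention.

```python
from urllib.parse import urlsplit, urlunsplit

def metadata_url(page_url):
    """Return the hypothetical xyz/meta.rdf location for page_url.

    Assumption: the query string and fragment are discarded, and a
    trailing slash on the path is ignored before appending meta.rdf.
    """
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/") + "/meta.rdf"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))

print(metadata_url("http://www.example.com/books/xml"))
```

A robot could then request that URL and fall back to ordinary indexing if the server does not provide it.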

Robots

In HTML the robots META tag tells search engines and other robots whether they're allowed to index a page. Walter Underwood has proposed the following processing instruction as an equivalent for XML documents:

<?robots index="yes" follow="no"?>

Robots will look for this in the prolog of any XML document they encounter. The syntax of this particular processing instruction is two pseudoattributes, one named index and one named follow, whose values are either yes or no. If the index pseudoattribute has the value yes, then the page will be indexed by a search-engine robot. If index has the value no, then it won't be. Similarly, if follow has the value yes, then links from the document will be followed. If follow has the value no, then they won't be.
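A robot that honors this proposal has to find the processing instruction in the prolog and split its data into pseudoattributes, because an XML parser reports the content of a processing instruction as a single string. The Python sketch below shows one way this might be done with the standard library's xml.dom.minidom. The function name robots_policy, the regular expression, and the sample document are invented for this illustration; the proposal as described here does not say what a robot should do when the instruction or one of its pseudoattributes is missing, so the sketch just returns what it finds.

```python
import re
from xml.dom import minidom

# Matches name="value" or name='value' inside the PI data.
PSEUDO_ATTR = re.compile(r"""(\w+)\s*=\s*(["'])(.*?)\2""")

def robots_policy(xml_text):
    """Return the robots pseudoattributes as a dict, or None.

    Hypothetical helper: only processing instructions that appear
    before the root element (the prolog) are considered.
    """
    doc = minidom.parseString(xml_text)
    for node in doc.childNodes:
        if node is doc.documentElement:
            break  # the prolog ends where the root element begins
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "robots"):
            return {name: value
                    for name, _quote, value in PSEUDO_ATTR.findall(node.data)}
    return None

SAMPLE = '<?robots index="yes" follow="no"?>\n<page><link href="next.xml"/></page>'
print(robots_policy(SAMPLE))
```

Returning the raw yes and no strings leaves the decision about defaults to the caller.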