XML Transformations

One of the cooler things about XML is the ability to transform it into something else. With the wealth of web-capable devices these days (computers, personal organizers, phones, DVRs, etc.), you never know what flavor of markup you need to deliver. Sometimes HTML works, sometimes XHTML (the XML flavor of HTML) is required, sometimes the Wireless Markup Language (WML) is supported, and sometimes you need something else entirely. In all of these cases, though, the basic data being displayed is the same; it's just the formatting and presentation that changes. A great technique is to store the data in an XML document, and then transform that XML into various formats for display.

As useful as XML transformations can be, though, they are not simple to implement. In fact, rather than trying to specify the transformation of XML in the original XML 1.0 specification, the W3C has put out three separate recommendations to define how XML transformations work: XSL, XSLT, and XPath. Because these three specifications are tied together tightly and are almost always used in concert, there is rarely a clear distinction between them. This can often make for a discussion that is easy to understand, but not necessarily technically correct. For example, the term XSLT, which refers specifically to XSL Transformations, is often applied to both XSL and XPath. In the same fashion, XSL is often used as a grouping term for all three technologies.

In this section, I distinguish among the three recommendations, and remain true to the letter of the specifications outlining these technologies. However, in the interest of clarity, I use XSL and XSLT interchangeably to refer to the complete transformation process throughout the rest of the tutorial. Although this may not follow the letter of these specifications, it certainly follows their spirit, and it avoids lengthy definitions of simple concepts when you already understand what I mean.

XSL

XSL is the Extensible Stylesheet Language. It is defined as a language for expressing stylesheets. This broad definition is broken down into two parts:

XSL is a language for transforming XML documents.
XSL is an XML vocabulary for specifying the formatting of XML documents.

The definitions are similar, but one deals with moving from one XML document form to another, while the other focuses on the actual presentation of content within each document. Perhaps a clearer definition would be to say that XSL handles the specification of how to transform a document from format A to format B. The components of the language handle the processing and identification of the constructs used to do this.

XSL and trees

The most important concept to understand in XSL is that all data within XSL processing stages is held in tree structures. In fact, the rules you define using XSL are themselves held in a tree structure. This allows simple processing of the hierarchical structure of XML documents. Templates are used to match the root element of the XML document being processed. Then "leaf" rules are applied to "leaf" elements, filtering down to the most nested elements. At any point in this progression, elements can be processed, styled, ignored, copied, or have a variety of other things done to them.

Figure: Tree operations within XSL

A nice advantage of this tree structure is that it allows the grouping of XML documents to be maintained. If element A contains elements B and C, and element A is moved or copied, the elements contained within it receive the same treatment. This makes the handling of large data sections that need to receive the same treatment fast and easy to notate concisely in the XSL stylesheet. You will see more about how this tree is constructed when I talk specifically about XSLT in the next section.
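The subtree behavior described above can be tried out from Java with the JAXP API (javax.xml.transform) that ships with the JDK. What follows is only a minimal sketch: the input document, the element names (root, a, b, c, d, copied), and the class name TreeCopyDemo are made up for illustration and are not part of the feed used elsewhere in this section.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TreeCopyDemo {
    // Hypothetical input: element a contains elements b and c; d is a sibling of a.
    static final String XML =
        "<root><a><b>one</b><c>two</c></a><d>three</d></root>";

    // Hypothetical stylesheet: copies a (and everything nested inside it), ignores d.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/root'>"
      + "<copied><xsl:copy-of select='a'/></copied>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet text to the XML text and returns the result as a string.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // b and c travel along with a; d never reaches the output.
        System.out.println(transform(XML, XSL));
    }
}
```

The stylesheet never mentions b or c, yet both end up in the output because they are part of the subtree rooted at a.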

Formatting objects

The XSL specification is almost entirely concerned with defining formatting objects. A formatting object is based on a large model, not surprisingly called the formatting model. This model is all about a set of objects that are fed as input into a formatter. The formatter applies the objects to the document, and what results is a new document that consists of all or part of the data from the original XML document in a format specific to the objects the formatter used.

Because this is such a vague, shadowy concept, the XSL specification attempts to define a concrete model to which these objects should conform. In other words, a large set of properties and vocabulary make up the set of features that formatting objects can use. These include the types of areas that may be visualized by the objects; the properties of lines, fonts, graphics, and other visual objects; inline and block formatting objects; and a wealth of other syntactical constructs.

Formatting objects are used heavily when converting textual XML data into binary formats such as PDF files, images, or document formats such as Microsoft Word. For transforming XML data to another textual format, these objects are seldom used explicitly. Although an underlying part of the stylesheet logic, formatting objects are rarely invoked directly, since the resulting textual data often conforms to another predefined markup language such as HTML.

Because most enterprise apps today are based at least in part on web architecture and use a browser as a client, I spend the most time looking at transformations to HTML and XHTML. While formatting objects are covered only lightly, the topic is broad enough to merit its own coverage in a separate tutorial. For further information, consult the XSL specification at http://www.w3.org/TR/xsl.

XSLT

The second component of XML transformations is XSL Transformations (XSLT). XSLT is the language that specifies the conversion of a document from one format to another (where XSL defined the means of that specification). The syntax used within XSLT is generally concerned with textual transformations that do not result in binary data output. For example, XSLT is instrumental in generating HTML or WML from an XML document. In fact, the XSLT specification outlines the syntax of an XSL stylesheet more explicitly than the XSL specification itself!

Just as in the case of XSL, an XSLT stylesheet is always well-formed XML. The XSLT specification also includes a (non-normative) DTD fragment that delineates the allowed constructs. For this reason, you should only have to learn new syntax to use XSLT, and not new structural rules (if you know how XML is structured, you know how XSLT is structured). Just as in XSL, XSLT is based on a hierarchical tree structure of data, where nested elements are leaves, or children, of their parents.

XSLT provides a mechanism for matching patterns within the original XML document, and applying formatting to that data. This results in anything from outputting XML data without the unwanted element names to inserting the data into a complex HTML table and displaying it to the user with highlighting and coloring. XSLT also provides syntax for many common operators, such as conditionals, copying of document tree fragments, advanced pattern matching, and the ability to access elements within the input XML data in an absolute and relative path structure. All these constructs are designed to ease the process of transforming an XML document into a new format.

XPath

As the final piece of the XML transformations puzzle, XPath provides a mechanism for referring to the wide variety of element and attribute names and values in an XML document. As I mentioned earlier, many XML specifications are now using XPath, but this discussion is concerned primarily with its use in XSLT. With the complex structure that an XML document can have, locating one specific element or set of elements can be difficult. It is made more difficult because access to a set of constraints that outlines the document's structure cannot be assumed; documents that are not validated must be able to be transformed just as valid documents can. To accomplish this addressing of elements, XPath defines syntax in line with the tree structure of XML, and the XSLT processes and constructs that use it.

Referencing any element or attribute within an XML document is most easily accomplished by specifying the path to the element relative to the current element being processed. In other words, if element B is the current element and element C and element D are nested within it, a relative path most easily locates them. This is similar to the relative paths used in operating system directory structures.

At the same time, XPath also defines addressing for elements relative to the root of a document. This covers the common case of needing to reference an element not within the current element's scope; in other words, an element that is not nested within the element being processed.

Finally, XPath defines syntax for actual pattern matching: find an element whose parent is element E and that has a sibling element F. This fills in the gaps left between the absolute and relative paths. In all these expressions, attributes can be used as well, with similar matching abilities:

<!-- Match the element named link underneath the current element -->
<xsl:value-of select="link" />
<!-- Match the element named title nested within the channel element -->
<xsl:value-of select="channel/title" />
<!-- Match the description element using an absolute path -->
<xsl:value-of select="/rdf:RDF/description" />
<!-- Match the resource attribute of the current element -->
<xsl:value-of select="@rdf:resource" />
<!-- Match the resource attribute of the errorReportsTo element -->
<xsl:value-of select="/rdf:RDF/channel/admin:errorReportsTo/@rdf:resource" />
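Relative, absolute, and attribute paths like these can also be tried outside a stylesheet, using the javax.xml.xpath API included in the JDK. The sketch below is an illustration only: it uses a small made-up document without namespaces (the feed, channel, item, and id names, the example.com URL, and the class name XPathDemo are all assumptions) so that no namespace mapping is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathDemo {
    // Hypothetical document, deliberately free of namespaces.
    static final String XML =
        "<feed>"
      + "<channel id='main'><title>News</title><link>http://example.com/</link></channel>"
      + "<item><title>First</title></item>"
      + "<item><title>Second</title></item>"
      + "</feed>";

    // Parses XML text into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Evaluates an expression against a context node and returns the first matching node.
    static Node node(String expression, Object context) {
        try {
            return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expression, context, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    // Evaluates an expression against a context node and returns its string value.
    static String string(String expression, Object context) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, context);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        Node channel = node("/feed/channel", doc);   // plays the role of the "current" element

        System.out.println(string("title", channel));             // relative path
        System.out.println(string("/feed/item/title", channel));  // absolute path, first match
        System.out.println(string("@id", channel));               // attribute of the current element
    }
}
```

The absolute path ignores the context node entirely, which is why it can reach an item element that is not nested inside channel. When an expression matches several nodes, asking for a string yields the value of the first one in document order.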

Because the input document is often not fixed, an XPath expression can result in the evaluation of no input data, one input element or attribute, or multiple input elements and attributes. This ability makes XPath very useful; it also causes the introduction of some additional terms. The result of evaluating an XPath expression can be a node set. This name is in line with the idea of a hierarchical structure, which is dealt with in terms of leaves and nodes. The resultant node set can be empty, have a single member, or have 5 or 10 members. It can be transformed, copied, ignored, or have any other legal operation performed on it. Instead of a node set, evaluating an XPath expression could result in a Boolean value, a numerical value, or a string value.

In addition to expressions that select node sets, XPath defines several functions that operate on node sets, like not() and count(). These functions take in a node set as input and operate upon that node set (strictly speaking, not() takes a Boolean argument, so a node set passed to it is first converted to true if it is non-empty and false otherwise). All of these expressions and functions are part of the XPath specification and XPath implementations; however, XPath is also often used to signify any expression that conforms to the specification itself. As with XSL and XSLT, this makes it easier to talk about XSL and XPath, though it is not always technically correct.

With all that in mind, you're at least somewhat prepared to take a look at a simple XSL stylesheet, shown in Example 1-2.

Example 1-2. XSL stylesheet for Example 1-1

<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
 xmlns:rss="http://purl.org/rss/1.0/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/rdf:RDF">
<p>
 <a><xsl:attribute name="href">
 <xsl:value-of select="rss:channel/rss:link"/>
 </xsl:attribute>
 <xsl:value-of select="rss:channel/rss:title"/></a>
</p>
<p>
<!-- Make the date presentable -->
 <xsl:variable name="datetime" select="rss:channel/dc:date"/>
 <xsl:variable name="day" select="substring($datetime, 9, 2)"/>
 <xsl:variable name="month" select="substring($datetime, 6, 2)"/>
 <xsl:variable name="year" select="substring($datetime, 0, 5)"/>
 <xsl:value-of select="concat($day, '/', $month, '/', $year)"/> - <xsl:value-of select="substring($datetime, 12, 5)"/>
</p>
<dl>
<xsl:for-each select="rss:item">
 <dt>
 <a><xsl:attribute name="href">
 <xsl:value-of select="rss:link"/>
 </xsl:attribute>
 <xsl:value-of select="rss:title"/></a>
 </dt>
 <dd>
 <xsl:value-of select="rss:description"
 disable-output-escaping="yes" />
 <!-- Format the publish date -->
 (<xsl:variable name="pubdate" select="dc:date"/>
 <xsl:variable name="pubday" select="substring($pubdate, 9, 2)"/>
 <xsl:variable name="pubmonth" select="substring($pubdate, 6, 2)"/>
 <xsl:variable name="pubyear" select="substring($pubdate, 0, 5)"/>
 <xsl:value-of select="concat($pubday, '/', $pubmonth, '/', $pubyear)"/> - <xsl:value-of select="substring($pubdate, 12, 5)"/>)
 </dd>
</xsl:for-each>
</dl>
<p>
 <xsl:value-of select="rss:channel/dc:rights"/>
</p>
</xsl:template>
</xsl:stylesheet>

Template matching

The basis of all XSL work is template matching. For any element on which you want some sort of output to occur, you generally provide a template that matches the element. You signify a template with the template keyword, and provide the name of the element to match in its match attribute:

<xsl:template match="/rdf:RDF">
<p>
 <a><xsl:attribute name="href">
 <xsl:value-of select="rss:channel/rss:link"/>
 </xsl:attribute>
 <xsl:value-of select="rss:channel/rss:title"/></a>
</p>
 <!-- etc... -->
</xsl:template>

Here, the RDF element (in the rdf-associated namespace) is being matched (the leading / is an XPath construct that anchors the match at the root of the document). When an XSL processor encounters the RDF element, the instructions within this template are carried out. In the example, several HTML formatting tags are output (the p and a tags). Be sure to distinguish your XSL elements from other elements (such as HTML elements) with proper use of namespaces.

You can use the value-of construct to obtain the value of an element, and provide the element name to match through the select attribute. In the example, the character data within the title element is extracted and used as the text of the link, and the link element supplies the link's target.

On the other hand, when you want to cause the templates associated with an element's children to be applied, use apply-templates. Be sure to do this, or nested elements can be ignored! You can specify the elements to apply templates to using the select attribute; by specifying a value of * for that attribute, the remaining templates are applied to all child elements.
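To see apply-templates hand processing from a parent template down to the templates for its children, here is a small sketch run through JAXP. The document and its element names (feed, channel, item, title), the ul/li output, and the class name ApplyTemplatesDemo are assumptions chosen for brevity; they are not the RSS feed or stylesheet from Example 1-2.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ApplyTemplatesDemo {
    // Hypothetical input document (no namespaces, to keep the sketch short).
    static final String XML =
        "<feed>"
      + "<channel><title>News</title></channel>"
      + "<item><title>First</title></item>"
      + "<item><title>Second</title></item>"
      + "</feed>";

    // One template matches the document element and delegates to the
    // template for item; channel is never selected, so it is skipped.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/feed'>"
      + "<ul><xsl:apply-templates select='item'/></ul>"
      + "</xsl:template>"
      + "<xsl:template match='item'>"
      + "<li><xsl:value-of select='title'/></li>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet text to the XML text and returns the result as a string.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XML, XSL));
    }
}
```

If the apply-templates instruction were removed from the first template, the item template would never fire and the list would come out empty, which is the "nested elements can be ignored" situation in practice.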

Looping

You'll also often find a need for looping in XSL:

<xsl:for-each select="rss:item">
 <dt>
 <a><xsl:attribute name="href">
 <xsl:value-of select="rss:link"/></xsl:attribute>
 <xsl:value-of select="rss:title"/></a>
 </dt>
 <dd>
 <xsl:value-of select="rss:description"
 disable-output-escaping="yes" />
 <!-- Format the publish date -->
 (<xsl:variable name="pubdate" select="dc:date"/>
 <xsl:variable name="pubday" select="substring($pubdate, 9, 2)"/>
 <xsl:variable name="pubmonth" select="substring($pubdate, 6, 2)"/>
 <xsl:variable name="pubyear" select="substring($pubdate, 0, 5)"/>
 <xsl:value-of select="concat($pubday, '/', $pubmonth, '/', $pubyear)"/> - <xsl:value-of select="substring($pubdate, 12, 5)"/>)
 </dd>
</xsl:for-each>

Here, I'm looping through each element named item using the for-each construct. In Java, this would be:

for (Iterator i = item.iterator( ); i.hasNext( ); ) {
 Object current = i.next( ); // advance to the next item
 // take action on each item
}

Within the loop, the "current" element becomes the next item element encountered. For each item, I output the description (the entry text) using the value-of construct. Take particular note of the disable-output-escaping attribute. In the XML, the description element has HTML content, which makes liberal use of entity references:

When the shoot was done, my daughter Holly, who had been doing her homework in the room next door, and occasionally coming out to laugh at me, helped use up the last few pictures on the roll. She looks like she's having fun. I think I look a little dazed.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://www.neilgaiman.com/journal/neil_8313036.jpg" /&gt;&lt;br /&gt;&lt;br /&gt;This is the one we're going to be using on the tutorial jacket of ANANSI BOYS.

Normally, value-of outputs text just as it is in the XML document being processed. The result would be that this escaped HTML would stay escaped, so a browser would display the markup (the br and img tags) as literal text instead of rendering it.

Figure: With output escaping on, HTML content within XML elements often won't look correct

To ensure that your output is not escaped, set disable-output-escaping to yes.

Be sure you think this through. I used to get confused, thinking that I wanted to set this attribute to no so that escaping would not happen. However, a value of no results in escaping being enabled (not being disabled). Make sure you get this straight, or you'll have some odd results.

Setting this attribute to yes and rerunning the transform results in output that a browser renders as HTML.

Figure: With escaping turned off, output shows up as HTML, which is almost certainly the desired result
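The difference between the two settings is easy to compare side by side with a few lines of Java. This is a sketch under stated assumptions: the tiny item/description document and the class name EscapingDemo are invented for the comparison, and the stylesheet is built from a string so the attribute value can be switched. Keep in mind that the XSLT 1.0 specification does not require processors to support disabling output escaping, so results can differ from one processor to another.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class EscapingDemo {
    // Hypothetical input: the description carries escaped HTML, as in the feed.
    static final String XML =
        "<item><description>Hello&lt;br/&gt;world</description></item>";

    // Builds a stylesheet whose value-of uses the given
    // disable-output-escaping setting ("yes" or "no").
    static String stylesheet(String disableOutputEscaping) {
        return "<xsl:stylesheet version='1.0'"
             + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
             + "<xsl:template match='/item'>"
             + "<dd><xsl:value-of select='description'"
             + " disable-output-escaping='" + disableOutputEscaping + "'/></dd>"
             + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Applies the stylesheet text to the XML text and returns the result as a string.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // "no" (the default): the markup stays escaped in the output.
        System.out.println(transform(XML, stylesheet("no")));
        // "yes": the markup is written out as real tags.
        System.out.println(transform(XML, stylesheet("yes")));
    }
}
```

Running both variants against the same input makes the naming trap concrete: yes means "yes, disable escaping", not "yes, escape".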

Performing a transform

Before leaving XSL (at least for now), I want to show you how to easily perform transformations from the command line. This is a useful tool for quick-and-dirty tests; in fact, it's how I generated the screenshots used in this chapter. Download Xalan-J from the Xalan web site, http://xml.apache.org/xalan-j. Expand the archive (on my Windows laptop, I use c:/java/xalan-j_2_6_0). Then add xalan.jar, xercesImpl.jar, and xml-apis.jar to your classpath. Finally, run the following command:

java org.apache.xalan.xslt.Process -IN [XML filename]
 -XSL [XSL stylesheet]
 -OUT [output filename]

For example, to generate the HTML output for Neil Gaiman's feed, I used the tool like this:

> java org.apache.xalan.xslt.Process -IN gaiman-blogger_rss.xml -XSL rdf.xsl -OUT test.html

You'll get a file (test.html in this case) in the directory in which you run the command. Use this tool often; it will really help you figure out how XSL works, and what effect small changes have on output.
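If you would rather drive the same kind of transform from your own code, the JAXP API in the JDK (javax.xml.transform) covers the basic case. The following is a minimal sketch, not a replacement for the Xalan command-line tool: the class name TransformFile and its three positional arguments are assumptions, and the XSLT processor used is whichever one TransformerFactory finds at runtime.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TransformFile {
    // Applies the stylesheet in xsl to the document in xml and writes the result to out.
    static void transform(File xml, File xsl, File out) {
        // try-with-resources makes sure the output file is flushed and closed.
        try (OutputStream stream = new FileOutputStream(out)) {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(xsl));
            transformer.transform(new StreamSource(xml), new StreamResult(stream));
        } catch (TransformerException | IOException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println(
                "Usage: java TransformFile <XML file> <XSL stylesheet> <output file>");
            System.exit(1);
        }
        transform(new File(args[0]), new File(args[1]), new File(args[2]));
    }
}
```

Assuming the class has been compiled, running java TransformFile gaiman-blogger_rss.xml rdf.xsl test.html would play the same role as the Xalan command shown earlier.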