Document Traversal

After parsing an XML document, you generally need to find some piece of information contained within the document. dom4j provides several different options for moving through the Document object and its children.

Iterator, Lists, and Index-Based Access

Just as in DOM and JDOM, dom4j's Document and Element interfaces have a variety of methods for getting child nodes. In dom4j's case, the basic methods to get child nodes are actually declared in the Branch interface, which both Document and Element extend. The figure below contains a UML diagram of the Branch, Document, and Element interfaces. For clarity, the methods to add and remove nodes have been omitted.

Node access methods on Branch, Document, and Element interfaces

After looking at DOM and JDOM, some of these method names may seem a bit unusual: attributes( ) versus getAttributes( ), content( ) versus getChildNodes( ) and getContent( ), and so on. dom4j does not consistently follow the JavaBeans method naming conventions. Beyond these naming differences, though, the methods are largely the same as what we have already seen in those APIs.

Another cosmetic difference is that to access a namespace-qualified element or attribute in dom4j, you create a QName object encapsulating both the local name and the namespace. Compare this to DOM and JDOM, where getElementsByTagNameNS( ) and getChildren( ) each accept the local name and namespace as two separate parameters. Using these methods, it is easy to write code that, for example, outputs the value of an attribute named location on all of an Element's children:

public void outputLocationAttributes(Element parent) {
    for (Iterator it = parent.elementIterator( ); it.hasNext( ); ) {
        Element child = (Element) it.next( );
        String value = child.attributeValue("location");
        if (value == null) {
            System.out.println("No location attribute");
        } else {
            System.out.println("Location attribute value is " + value);
        }
    }
}

Note that in this example, I'm using the elementIterator( ) method. This utility method returns a java.util.Iterator for the List returned by elements( ). If you don't want to use the Iterator interface and prefer to use index-based access, the same code could be written using the nodeCount( ) and node( ) methods as:

public void outputLocationAttributes2(Element parent) {
    for (int i = 0; i < parent.nodeCount( ); i++) {
        Node node = parent.node(i);
        // node( ) returns every child node, so skip anything that isn't an Element
        if (node instanceof Element) {
            Element child = (Element) node;
            String value = child.attributeValue("location");
            if (value == null) {
                System.out.println("No location attribute");
            } else {
                System.out.println("Location attribute value is " + value);
            }
        }
    }
}

This is more verbose but will use less memory and could be faster because fewer List and Iterator objects are created. The significance of these optimizations depends upon the size and complexity of your document.
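For contrast with the QName approach described earlier, the DOM side of that comparison can be sketched with the parser bundled in the JDK. This is only an illustration: the namespace URI, the sample document, and the class name DomNamespaceLookup are assumptions made for this sketch, and no dom4j classes are involved. Note that getElementsByTagNameNS() matches all descendants, not just direct children.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomNamespaceLookup {
    // Hypothetical namespace URI and sample document, for illustration only.
    static final String PLACES_NS = "http://example.com/places";
    static final String SAMPLE_XML =
        "<p:places xmlns:p='" + PLACES_NS + "'>"
        + "<p:place location='north'/>"
        + "<p:place/>"
        + "<place location='no-namespace'/>"
        + "</p:places>";

    /** Returns the location attribute (or null) of every place element in PLACES_NS. */
    public static List<String> locations() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the *NS methods to match
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE_XML)));

        // DOM takes the namespace URI and the local name as two separate arguments.
        NodeList places = doc.getElementsByTagNameNS(PLACES_NS, "place");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < places.getLength(); i++) {
            Element place = (Element) places.item(i);
            // DOM's getAttribute() returns "" for a missing attribute, so check first.
            result.add(place.hasAttribute("location") ? place.getAttribute("location") : null);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(locations());
    }
}
```

The unprefixed place element is not in the namespace, so it is skipped; only the two p:place elements are reported.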

XPath

As mentioned above, dom4j has two different ways of evaluating XPath expressions. Like JDOM, dom4j has a dedicated XPath type: the XPath interface, diagrammed in the figure below. Instances of this interface are created with the createXPath( ) methods of DocumentFactory or the createXPath( ) method of DocumentHelper.

dom4j's XPath interface

dom4j's XPath support uses the Jaxen library, to which dom4j delegates the actual evaluation of XPath expressions. The NamespaceContext, FunctionContext, and VariableContext interfaces referenced by dom4j's XPath interface are Jaxen interfaces in the org.jaxen package.

Some of these methods resemble the identically named methods of JDOM's XPath class, specifically selectNodes( ), selectSingleNode( ), valueOf( ), and numberValueOf( ). What is unique about dom4j's XPath interface is the ability to specify a second XPath expression for sorting either a List of Node objects (the sort( ) methods) or the result of an expression (the two- and three-argument selectNodes( ) methods). Consider the XML document in Example 10-1.

Example 10-1. XML list of tutorials

<?xml version="1.0" encoding="UTF-8"?>
<books>
 <book>
 <title>Java &amp; XML</title>
 <pubDate>2006</pubDate>
 </book>
 <book>
 <title>Learning UML</title>
 <pubDate>2003</pubDate>
 </book>
 <book>
 <title>XML in a Nutshell</title>
 <pubDate>2004</pubDate>
 </book>
 <book>
 <title>Apache Cookbook</title>
 <pubDate>2003</pubDate>
 </book>
</books>

If you wanted to get a list of the book titles sorted by publication date, you could do this by creating two separate XPath expressions, one to get the book elements and one to sort them, and then use them like this:

package javaxml3;

import java.io.File;
import java.util.Iterator;
import java.util.List;

import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.dom4j.XPath;
import org.dom4j.io.SAXReader;

public class SortingXPath {
    public static void main(String[] args) throws Exception {
        Document doc = new SAXReader( ).read(new File("books.xml"));
        // one expression selects the book elements, the other supplies the sort key
        XPath tutorialPath = DocumentHelper.createXPath("//book");
        XPath sortPath = DocumentHelper.createXPath("pubDate");
        List tutorials = tutorialPath.selectNodes(doc, sortPath);
        for (Iterator it = tutorials.iterator( ); it.hasNext( );) {
            Element tutorial = (Element) it.next( );
            System.out.println(tutorial.elementText("title"));
        }
    }
}

This outputs the titles in ascending order, starting with Learning UML and ending with Java & XML. There's no built-in mechanism to support descending sorting. Instead, you can use the static reverse( ) method of java.util.Collections to reverse the order of the list.

The three-argument version of selectNodes( ) removes Node objects with duplicate sort values from the resulting List when its third argument is true; if it's false, duplicates aren't removed. If the call to selectNodes( ) in the example above were:

List tutorials = tutorialPath.selectNodes(doc, sortPath, true);

then only three titles would be output. Apache Cookbook would be excluded because it has the same publication date as Learning UML.

In addition to the XPath interface, the Node interface has a handful of methods that allow you to evaluate XPath expressions by simply passing a String to one of these methods. Example 10-2 contains the XPath-specific methods of the Node interface.

Example 10-2. XPath methods in the Node interface

public interface Node {
    // non-XPath methods removed
    List selectNodes(String xpathExpression);
    Object selectObject(String xpathExpression);
    List selectNodes(String xpathExpression, String comparisonXPathExpression);
    List selectNodes(String xpathExpression, String comparisonXPathExpression,
            boolean removeDuplicates);
    Node selectSingleNode(String xpathExpression);
    String valueOf(String xpathExpression);
    Number numberValueOf(String xpathExpression);
    boolean matches(String xpathExpression);
    // non-XPath methods removed
}

Behind the scenes, implementations of the Node interface will generally use the XPath interface to evaluate the expressions passed to these methods. Because these methods deal with Strings, each call will generally result in the creation of a new XPath object. Thus, if you're going to be repeatedly evaluating the same XPath expression, an XPath object is the better choice, as your expression will get compiled only one time. In addition, these methods can't deal with namespaces, variables, or custom functions, so if those features are necessary, the XPath interface is your only choice. But that doesn't mean that these methods are useless; they're convenient and result in fewer lines of code.

Before we leave XPath, there are a few more methods from the Node interface that warrant mentioning. The methods getPath( ) and getUniquePath( ) return an XPath expression that would evaluate to a List of Nodes containing the current Node. The getUniquePath( ) method goes a step further than getPath( ) and adds indexing to ensure that the resulting XPath expression will evaluate to only this Node. In addition to zero-argument versions, both getPath( ) and getUniquePath( ) are overloaded to accept an Element, in which case the result will be a relative XPath expression from the passed Element to the current Node. Looking back at the document in Example 10-1, if the object named book is the book element for Learning UML, then Table 10-1 contains the results of these methods.

Method call	Result
book.getPath( )	/books/book
book.getUniquePath( )	/books/book[2]
book.getPath(doc.getRootElement( ))	book
book.getUniquePath(doc.getRootElement( ))	book[2]
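The compile-once advice given earlier is not specific to dom4j. As a rough sketch of the same idea, the example below uses the JDK's bundled javax.xml.xpath API as a stand-in for dom4j's XPath objects (an assumption made so the code runs without extra libraries). The class name CompileOnceXPath and the method titlesFrom2003 are hypothetical; the sample data mirrors Example 10-1.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CompileOnceXPath {
    // Same content as Example 10-1, inlined so the sketch is self-contained.
    public static final String BOOKS_XML =
        "<books>"
        + "<book><title>Java &amp; XML</title><pubDate>2006</pubDate></book>"
        + "<book><title>Learning UML</title><pubDate>2003</pubDate></book>"
        + "<book><title>XML in a Nutshell</title><pubDate>2004</pubDate></book>"
        + "<book><title>Apache Cookbook</title><pubDate>2003</pubDate></book>"
        + "</books>";

    /** Compiles the expression once, then evaluates it against every document passed in. */
    public static List<String> titlesFrom2003(String... xmlDocuments) throws Exception {
        // Compiled a single time, outside the loop.
        XPathExpression expr = XPathFactory.newInstance().newXPath()
                .compile("/books/book[pubDate='2003']/title");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

        List<String> titles = new ArrayList<>();
        for (String xml : xmlDocuments) {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Reuse the compiled expression for each document.
            NodeList matches = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
            for (int i = 0; i < matches.getLength(); i++) {
                titles.add(matches.item(i).getTextContent());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titlesFrom2003(BOOKS_XML, BOOKS_XML));
    }
}
```

In dom4j the equivalent move would be to hold on to the object returned by createXPath( ) rather than passing the same String to a Node method on every call.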

Using the Visitor Pattern

The final traversal option within dom4j, its support for the Visitor Pattern, is unique among the object-model APIs discussed in this tutorial. If anything, it's most similar to SAX. As described above, the Visitor Pattern in dom4j is used by creating an implementation of the org.dom4j.Visitor interface. As you can see in the UML diagram below, the Visitor interface defines a visit( ) method for each node type.

The Visitor interface

Since you generally only care about a few of these node types, dom4j includes the org.dom4j.VisitorSupport class, which implements all the methods from the Visitor interface with empty method bodies. This lets your classes extend VisitorSupport and only override the methods for the types with which you are concerned. In prior chapters, we've discussed the common use case of needing to change the namespace of all elements within an XML document. Implementing this with the Visitor interface looks like this:

import java.util.ListIterator;

import org.dom4j.Element;
import org.dom4j.Namespace;
import org.dom4j.QName;
import org.dom4j.VisitorSupport;

class NamespaceChangingVisitor extends VisitorSupport {
    private Namespace from;
    private Namespace to;

    public NamespaceChangingVisitor(Namespace from, Namespace to) {
        this.from = from;
        this.to = to;
    }

    public void visit(Element node) {
        // compare URIs only; the prefix is irrelevant to namespace identity
        Namespace ns = node.getNamespace( );
        if (ns.getURI( ).equals(from.getURI( ))) {
            QName newQName = new QName(node.getName( ), to);
            node.setQName(newQName);
        }
        // we also need to remove the namespace declaration
        ListIterator namespaces = node.additionalNamespaces( ).listIterator( );
        while (namespaces.hasNext( )) {
            Namespace additionalNamespace = (Namespace) namespaces.next( );
            if (additionalNamespace.getURI( ).equals(from.getURI( ))) {
                namespaces.remove( );
            }
        }
    }
}

Note that the visitor compares namespace URIs rather than calling equals( ) on the Namespace objects. The equals( ) method of the Namespace class will only return true if the URIs and the prefixes are equal. From an XML standpoint, this is incorrect. Namespaces in XML are equal if their URIs are equal. The prefix is merely a shortcut.

Using our Visitor class is largely a matter of parsing the XML and creating the Namespace objects we need to pass to the constructor of NamespaceChangingVisitor:

import org.dom4j.Document;
import org.dom4j.Namespace;
import org.dom4j.Visitor;
import org.dom4j.io.SAXReader;

public class VisitorExample {
    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println(
                "Usage: javaxml3.VisitorExample [doc] [old ns] [new prefix] [new ns]");
            System.exit(0);
        }
        Document doc = new SAXReader( ).read(args[0]);
        Namespace oldNs = Namespace.get(args[1]);
        Namespace newNs;
        // a new prefix of "-" means the new namespace gets no prefix
        if (args[2].equals("-")) {
            newNs = Namespace.get(args[3]);
        } else {
            newNs = Namespace.get(args[2], args[3]);
        }
        Visitor visitor = new NamespaceChangingVisitor(oldNs, newNs);
        doc.accept(visitor);
        System.out.println(doc.asXML( ));
    }
}

It's also common to implement the Visitor interface with an anonymous inner class:

Visitor visitor = new VisitorSupport( ) {
    public void visit(Element node) {
        System.out.println(node.getName( ));
    }
};
doc.accept(visitor);
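To make the role of VisitorSupport more concrete, here is a minimal plain-Java sketch of the same design: a visitor interface with one visit( ) method per node type, an adapter class with empty method bodies, and an anonymous subclass that overrides only the method it cares about. Every type in it (MiniVisitor, MiniVisitorSupport, MiniElement, MiniText) is hypothetical and invented for this illustration; none of them are dom4j classes, and the element-before-children traversal order is simply a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VisitorSketch {
    // One visit() overload per node type, as in a typical visitor interface.
    interface MiniVisitor {
        void visit(MiniElement element);
        void visit(MiniText text);
    }

    interface MiniNode {
        void accept(MiniVisitor visitor);
    }

    static class MiniText implements MiniNode {
        final String value;
        MiniText(String value) { this.value = value; }
        public void accept(MiniVisitor visitor) { visitor.visit(this); }
    }

    static class MiniElement implements MiniNode {
        final String name;
        final List<MiniNode> children;
        MiniElement(String name, MiniNode... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
        public void accept(MiniVisitor visitor) {
            visitor.visit(this);               // visit this element first...
            for (MiniNode child : children) {  // ...then walk its children
                child.accept(visitor);
            }
        }
    }

    // Adapter with empty bodies, so subclasses override only what they need.
    static class MiniVisitorSupport implements MiniVisitor {
        public void visit(MiniElement element) { }
        public void visit(MiniText text) { }
    }

    /** Collects element names in traversal order using an anonymous visitor. */
    public static List<String> elementNames() {
        MiniNode root = new MiniElement("books",
            new MiniElement("book", new MiniElement("title", new MiniText("Learning UML"))),
            new MiniElement("book", new MiniElement("title", new MiniText("Apache Cookbook"))));
        final List<String> names = new ArrayList<>();
        root.accept(new MiniVisitorSupport() {
            @Override
            public void visit(MiniElement element) {
                names.add(element.name); // text nodes fall through to the empty default
            }
        });
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames());
    }
}
```

The anonymous subclass never mentions MiniText, which is the convenience the adapter buys: adding a new node type to the interface only requires a new empty method in the support class, not a change to every visitor.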