Reading and Writing with dom4j

Document input and output is probably where JDOM and dom4j are closest. Both treat input and output as covering two things: reading and writing XML documents from and to sources such as files, URLs, and String objects, and interfacing with other XML APIs. Both JDOM and dom4j, for example, have classes (SAXWriter for dom4j and SAXOutputter for JDOM) for firing SAX event method calls based on the structure of a Document object. One additional, critical thing that JDOM and dom4j have in common is that neither is an XML parser. I mentioned this in the last chapter, but it's worth repeating: both JDOM and dom4j use a parser object provided by some other package. Both can use different parsers (SAX, DOM, StAX, etc.), but SAX is used most commonly. In the case of SAX and DOM, by default, both JDOM and dom4j will use the SAX or DOM parser retrieved through the JAXP factories. This means that in dom4j, as in JDOM, if you run into parsing problems, it's likely that the source of your problem is the underlying SAX parser.

Parsing a Document

As noted above, dom4j is not an XML parser and must use a separate parser to produce Document objects. In general, you will use a SAX parser through the dom4j class org.dom4j.io.SAXReader . A call to one of SAXReader's read( ) methods will create an instance of org.xml.sax.XMLReader and pass it an implementation of the ContentHandler interface that has calls to DocumentFactory to create the dom4j object tree. The code to parse a java.io.File looks something like:

// assume we got a path as a command-line argument
File file = new File(args[0]);
SAXReader reader = new SAXReader( );
Document doc = reader.read(file);

Through various constructor arguments, it's possible to create a SAXReader instance that does validation,^[*] uses an alternate DocumentFactory implementation, or uses a specific SAX implementation. In addition, there are a variety of setter methods (setValidating( ), setDocumentFactory( ), etc.) to set these properties and others on the SAXReader object. See the dom4j Javadocs for a complete listing. The read( ) method is overloaded to accept one of the following inputs:

^[*] More precisely, SAXReader asks the underlying SAX parser to validate.

A java.io.File object
A java.net.URL object
A system ID as a String object (which could be a URL or a filename)
A java.io.InputStream object
A java.io.Reader object
A java.io.InputStream object and a system ID for resolving relative URLs
A java.io.Reader object and a system ID for resolving relative URLs
An org.xml.sax.InputSource object
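As a rough illustration of the setter methods and one of the read( ) overloads listed above, the following sketch configures a SAXReader and parses from a java.io.InputStream. The class name ReaderConfigSketch and the sample XML are made up for the example; wrapping DocumentException in an unchecked exception is just one way to keep the sketch short.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentFactory;
import org.dom4j.io.SAXReader;

// Hypothetical class name; not part of dom4j.
final class ReaderConfigSketch {

    // Builds a reader using setters rather than constructor arguments.
    static SAXReader newReader() {
        SAXReader reader = new SAXReader();
        reader.setValidating(false); // do not ask the SAX parser to validate
        reader.setDocumentFactory(DocumentFactory.getInstance());
        return reader;
    }

    // Uses the read(InputStream) overload.
    static Document parse(InputStream in) {
        try {
            return newReader().read(in);
        } catch (DocumentException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<config><entry/></config>".getBytes(StandardCharsets.UTF_8);
        Document doc = parse(new ByteArrayInputStream(bytes));
        System.out.println(doc.getRootElement().getName());
    }
}
```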

If you have a String object that you want to parse as XML, you can either wrap that String in a java.io.StringReader or pass the String to the utility method DocumentHelper.parseText( ). The parseText( ) method determines the proper encoding, parses your String, and returns the resulting Document object. As with JDOM, dom4j includes a class, org.dom4j.io.DOMReader, which converts an instance of org.w3c.dom.Document to an instance of org.dom4j.Document. dom4j also includes reader classes that use the StAX and XMLPull APIs discussed in the last chapter. These classes, org.dom4j.io.STAXEventReader and org.dom4j.io.XPP3Reader, respectively, require additional JAR files (included with the dom4j distribution) in the classpath. Classpath modifications aside, both classes are similar enough to SAXReader that you can easily take either for a test drive.
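The two ways of parsing a String described above might look something like the following sketch. The class name ParseStringSketch and the sample XML are assumptions made for illustration, and the checked DocumentException is rethrown unchecked only to keep the example compact.

```java
import java.io.StringReader;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.io.SAXReader;

// Hypothetical class name; not part of dom4j.
final class ParseStringSketch {

    // Option 1: wrap the String in a StringReader and hand it to SAXReader.
    static Document parseWithReader(String xml) {
        try {
            return new SAXReader().read(new StringReader(xml));
        } catch (DocumentException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Option 2: let DocumentHelper.parseText() do the work.
    static Document parseWithHelper(String xml) {
        try {
            return DocumentHelper.parseText(xml);
        } catch (DocumentException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<greeting lang=\"en\">hello</greeting>"; // made-up sample
        System.out.println(parseWithReader(xml).getRootElement().getName());
        System.out.println(parseWithHelper(xml).getRootElement().getText());
    }
}
```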

Creating a Document Object

As discussed above, creation of a Document object is done with the DocumentFactory class or one of its subclasses. There is also a DocumentHelper class that provides static methods for creating Document, Element, Attribute, etc. objects. These static methods call the corresponding method on an instance of DocumentFactory. This is simply a shortcut, so that instead of writing:

DocumentFactory factory = DocumentFactory.getInstance( );
Document doc = factory.createDocument( );

You can write:

Document doc = DocumentHelper.createDocument( );

Not using the DocumentFactory object directly does save us a line of code, but we've lost our ability to use a different factory class, which we'll explore in more depth later in this chapter. Once you've created your Document object, it's easy to add nodes to it by calling one of the methods named add( ). Based on the XML specification, you can add as many Comment and ProcessingInstruction objects to a Document as you choose, but only one Element. In addition to the add( ) methods, the Branch interface, which Document extends, includes a group of methods named addElement( ), which accept a QName object, a local name, or a local name and a namespace URI. When you call one of the addElement( ) methods, the createElement( ) method on DocumentFactory is called and the resulting Element object is set as the root element of the Document object. As a result, these two blocks of code do the same thing:

// block 1: the long way
Element myElement = factory.createElement("name");
doc.add(myElement);
// block 2: the short way
doc.addElement("name");

The addElement( ) methods return the object that was newly created, which allows you to chain method calls such as:

doc.addElement("root").addElement("child").addElement("innerChild");

This produces a document that, when serialized to a file, looks like this:

<root>
  <child>
    <innerChild/>
  </child>
</root>

Similar shortcut methods also exist for creating Comment and ProcessingInstruction objects as part of the Document interface and Comment, ProcessingInstruction, Attribute, CData, Entity, Namespace, and Text objects as part of the Element interface.
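To make the shortcut methods a little more concrete, here is a minimal sketch that builds a small document with addComment( ), addElement( ), addAttribute( ), and addText( ), then serializes it with asXML( ). The class name BuildDocumentSketch and the element and attribute names are invented for the example.

```java
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

// Hypothetical class name; not part of dom4j.
final class BuildDocumentSketch {

    static Document build() {
        Document doc = DocumentHelper.createDocument();
        // Shortcut on Document: creates and adds a Comment node.
        doc.addComment("generated by BuildDocumentSketch");
        // addElement() returns the new Element, so calls can be chained.
        Element root = doc.addElement("root");
        root.addElement("child")
            .addAttribute("id", "1")
            .addText("some text");
        return doc;
    }

    public static void main(String[] args) {
        // asXML() applies no indentation or other formatting.
        System.out.println(build().asXML());
    }
}
```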

Namespaces and qualified names

In dom4j, as with other XML APIs, the names of elements and attributes are expressed as a triple of a local name, a namespace prefix, and a namespace URI. The namespace prefix and namespace URI are encapsulated in the class org.dom4j.Namespace, which is itself encapsulated in the class org.dom4j.QName. Instances of both Namespace and QName are immutable: once instantiated, their properties cannot be modified. And although both provide public constructors, it is recommended that instances be obtained through the static get( ) methods on both classes. The get( ) methods make use of object caches such that repeated calls to get( ) with the same parameters will return the same object. This leads to lower memory usage, faster object comparisons, and more consistent XML. The figure "QName and Namespace classes" contains diagrams of the QName and Namespace classes. Namespace implements the Node interface, and Namespace objects can be added to Element objects like any other node type. However, for clarity, I have omitted those methods from the diagram. The public constructors are also omitted to discourage their use.

QName and Namespace classes

When moving between XML APIs, be sure to not confuse org.dom4j.QName with the JAXP class javax.xml.namespace.QName. The two classes are not related.
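A short sketch of obtaining Namespace and QName instances through the static get( ) methods, and then using the QName to create a namespace-qualified root element, might look like this. The class name QNameSketch, the prefix bk, and the URI http://example.com/books are all placeholders chosen for the example.

```java
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Namespace;
import org.dom4j.QName;

// Hypothetical class name; not part of dom4j.
final class QNameSketch {

    // Placeholder namespace URI, used only for illustration.
    static final String BOOKS_URI = "http://example.com/books";

    // Uses the static get() methods rather than the public constructors.
    static QName bookName() {
        Namespace ns = Namespace.get("bk", BOOKS_URI);
        return QName.get("book", ns);
    }

    // addElement(QName) creates a root element in the given namespace.
    static Document build() {
        Document doc = DocumentHelper.createDocument();
        doc.addElement(bookName());
        return doc;
    }

    public static void main(String[] args) {
        QName first = bookName();
        QName second = bookName();
        // Per the caching described above, this is expected to be the same object.
        System.out.println(first == second);
        System.out.println(first.getQualifiedName());
        System.out.println(build().asXML());
    }
}
```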

Document Output

As with JDOM, document output in dom4j encompasses both XML serialization (writing an XML document or document component as a series of characters to a file, the console, a String object, etc.) and passing dom4j objects to other XML APIs. Of all the APIs covered in this tutorial, dom4j has the simplest mechanism for producing a String from a Document object:

String output = doc.asXML( );

That's it. The asXML( ) method is part of the Node interface and thus is available on every interface in the dom4j object model. None of the asXML( ) methods apply any formatting to the output and, other than for Document objects, the character encoding will always be UTF-8. If you want to format your output, you need to use the class org.dom4j.io.XMLWriter. XMLWriter does what its name suggests: it writes XML objects to either a java.io.Writer or a java.io.OutputStream. When creating an XMLWriter instance, you can pass in an org.dom4j.io.OutputFormat object, which controls how the XML objects are written. dom4j includes three formatting definitions:

Default: Created by new OutputFormat( ). Raw format. No indentation or newlines added. XML declaration with encoding always written with a newline following it.
Pretty Print: Created by OutputFormat.createPrettyPrint( ). Newlines and indentation of two spaces applied between elements. Text is trimmed and normalized.
Compact: Created by OutputFormat.createCompactFormat( ) . Default format with text trimming and normalization added.

You can pass an OutputFormat object to the constructor of XMLWriter. However, there is no setOutputFormat( ) or getOutputFormat( ) method; once set, you cannot change the format of an XMLWriter. You could create an OutputFormat object, pass it to a constructor of XMLWriter, and then call one of the mutator methods on the OutputFormat instance, but this is not recommended and could lead to inconsistent behavior. Save yourself a headache and configure your OutputFormat object before creating an XMLWriter.
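The following sketch shows one way to pass a fully configured OutputFormat to an XMLWriter, writing to a java.io.StringWriter so the result can be inspected. The class name FormatSketch and its helper methods are invented for the example; rethrowing IOException as UncheckedIOException is a convenience, not a dom4j convention.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;

// Hypothetical class name; not part of dom4j.
final class FormatSketch {

    // Serializes a document using a format that was configured beforehand.
    static String serialize(Document doc, OutputFormat format) {
        StringWriter out = new StringWriter();
        try {
            XMLWriter writer = new XMLWriter(out, format);
            writer.write(doc);
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    static Document sample() {
        Document doc = DocumentHelper.createDocument();
        doc.addElement("root").addElement("child").addElement("innerChild");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(serialize(sample(), OutputFormat.createPrettyPrint()));
        System.out.println(serialize(sample(), OutputFormat.createCompactFormat()));
    }
}
```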

In the current version of dom4j as of the time of writing (Version 1.6.1), there's a bug in the asXML( ) implementation for Attribute objects: attribute values are not escaped. If you have an element such as:

<element name="some &quot;value&quot;"/>

The result of calling asXML( ) on the name attribute will be:

name="some "value""

To accurately output Attribute objects, use the write( ) method on XMLWriter.
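A minimal sketch of that workaround, assuming the write(Attribute) overload on XMLWriter and a StringWriter as the destination, could look like the following. The class name AttributeOutputSketch is made up for the example.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

import org.dom4j.Attribute;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.dom4j.io.XMLWriter;

// Hypothetical class name; not part of dom4j.
final class AttributeOutputSketch {

    // Writes a single Attribute through XMLWriter so its value is escaped.
    static String writeAttribute(Attribute attribute) {
        StringWriter out = new StringWriter();
        try {
            XMLWriter writer = new XMLWriter(out);
            writer.write(attribute);
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Element element = DocumentHelper.createElement("element");
        element.addAttribute("name", "some \"value\"");
        System.out.println(writeAttribute(element.attribute("name")));
    }
}
```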

Formatting options

The OutputFormat object can be customized further beyond the three default formatting definitions. In fact, it's useful to think of those definitions as templates. For example, if you wanted to have indentation and whitespace normalization but exclude the XML declaration, it is simpler to write:

OutputFormat format = OutputFormat.createPrettyPrint( );
format.setSuppressDeclaration(true);

The complete list of customizations is as follows:

lineSeparator: Which character or characters should be used when the newlines are added to the output. This setting applies only to newlines created while outputting, not to newlines already extant in Text nodes.
newlines: Indicates whether newlines will be added between elements.
encoding: What character encoding to use for output. Defaults to UTF-8.
omitEncoding: If this is true, the encoding is omitted from the XML declaration. Has no effect if the XML declaration isn't written.
suppressDeclaration: If this is true, the XML declaration is not output.
newLineAfterDeclaration: If this is true, a newline is output between the XML declaration and first node of the document.
expandEmptyElements: If this is true, elements without any child nodes are output as <name></name> instead of <name/>.
trimText: If this is true, leading and trailing whitespace is removed from text nodes and all the interior whitespace is normalized.
padText: If whitespace is being trimmed, it's possible that word boundaries will disappear. Calling setPadText(true) will partially disable whitespace trimming so that if a text node's content begins or ends with spaces and the node is immediately preceded or followed by an element, a single space is kept between the element and the text. This is helpful when outputting HTML, to produce hello <b>and</b> goodbye rather than hello<b>and</b>goodbye, for example.
indent: The String to use for indentation. If this is null (the default), no indentation is done.
XHTML: Used by the XMLWriter subclass org.dom4j.io.HTMLWriter , which outputs a Document as HTML or XHTML. If this value is true, the output of HTMLWriter will be well-formed XML. Specifically, this means that CDATA sections will be output with the CDATA delimiters. Otherwise, just the text of the CDATA section will be output.
newLineAfterNTags: Like XHTML, this is used only by HTMLWriter. If this is set to a positive number and newlines is false, then a newline will be output after the set number of close tags. This is useful when you are trying to output as many tags on one line as possible but don't want your output all on one line.
attributeQuoteCharacter: This property allows you to specify whether a single or double quote will be used as the character before and after attribute values. If you try to pass any other value, a java.lang.IllegalArgumentException is thrown.
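As a sketch of how several of these properties can be combined, the example below starts from the pretty-print template and then suppresses the XML declaration, expands empty elements, and indents with a tab. The class name CustomFormatSketch and the particular combination of settings are arbitrary choices for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;

// Hypothetical class name; not part of dom4j.
final class CustomFormatSketch {

    // Start from a template, then adjust individual properties.
    static OutputFormat customFormat() {
        OutputFormat format = OutputFormat.createPrettyPrint();
        format.setSuppressDeclaration(true);   // no <?xml ...?> line
        format.setExpandEmptyElements(true);   // <name></name> instead of <name/>
        format.setIndent("\t");                // indent with a tab
        return format;
    }

    // Configure the format fully before handing it to the XMLWriter.
    static String serialize(Document doc) {
        StringWriter out = new StringWriter();
        try {
            XMLWriter writer = new XMLWriter(out, customFormat());
            writer.write(doc);
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Document doc = DocumentHelper.createDocument();
        doc.addElement("root").addElement("child").addElement("innerChild");
        System.out.println(serialize(doc));
    }
}
```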

Outputting to other APIs

In addition to XMLWriter, dom4j comes with three other writer classes that allow you to output a dom4j object to a different XML API. These classes are org.dom4j.io.DOMWriter, to output to an org.w3c.dom.Document object; org.dom4j.io.SAXWriter, to output to an org.xml.sax.ContentHandler and (optionally) an org.xml.sax.ext.LexicalHandler; and org.dom4j.io.STAXEventWriter, to output to a javax.xml.stream.util.XMLEventConsumer. These interfaces have been examined in the prior chapters, so I won't go into too much detail here other than to say that DOMWriter is limited to outputting Document objects whereas SAXWriter and STAXEventWriter can output any Node object.
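For instance, a round trip between dom4j and the W3C DOM using DOMWriter and DOMReader might be sketched as follows. The class name DomBridgeSketch is invented, and the sketch assumes a JAXP DOM implementation is available, as it is in a standard JDK.

```java
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.io.DOMReader;
import org.dom4j.io.DOMWriter;

// Hypothetical class name; not part of dom4j.
final class DomBridgeSketch {

    // dom4j Document -> W3C DOM Document
    static org.w3c.dom.Document toW3c(Document doc) {
        try {
            return new DOMWriter().write(doc);
        } catch (DocumentException e) {
            throw new IllegalStateException("Could not create DOM document", e);
        }
    }

    // W3C DOM Document -> dom4j Document
    static Document fromW3c(org.w3c.dom.Document domDoc) {
        return new DOMReader().read(domDoc);
    }

    public static void main(String[] args) {
        Document doc = DocumentHelper.createDocument();
        doc.addElement("root").addElement("child");

        org.w3c.dom.Document domDoc = toW3c(doc);
        System.out.println(domDoc.getDocumentElement().getNodeName());

        Document back = fromW3c(domDoc);
        System.out.println(back.getRootElement().getName());
    }
}
```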