Serialization - XML - Java Programming Language

Typically, I'd come up with some clever example for using DOM at this point, and use it to demonstrate how the API works. However, DOM leaves a rather gaping hole, and filling that hole proves to be a good DOM tutorial, as well as having practical value. This hole, of course, is serialization. Serialization is the process of taking an XML document in memory, represented as a DOM tree, and writing it to disk (or to a stream). If you're lucky enough to have a parser that implements the DOM Level 3 Load and Save module, then outputting a DOM tree isn't a problem for you. Most parsers don't provide that support (or slap "experimental" all over it), and it becomes a real problem for DOM programming.
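
For readers whose parser does ship the Load and Save module, here is a minimal sketch of what "not a problem" looks like. It assumes a JAXP parser whose DOMImplementation exposes the "LS" feature (the parser bundled with current JDKs does); the class name LoadSaveSketch and its parse( ) helper are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the chapter's example code.
class LoadSaveSketch {

    /** Parses XML text into a DOM tree (helper for the demo only). */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes a DOM tree with the DOM Level 3 Load and Save module. */
    static String toXml(Document doc) {
        // getFeature( ) returns null when the implementation has no LS support
        Object feature = doc.getImplementation().getFeature("LS", "3.0");
        if (!(feature instanceof DOMImplementationLS)) {
            throw new UnsupportedOperationException("DOM Level 3 LS not available");
        }
        LSSerializer serializer = ((DOMImplementationLS) feature).createLSSerializer();
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(parse("<greeting>hello</greeting>")));
    }
}
```

If getFeature( ) comes back empty on your parser, you are in exactly the situation the rest of this section addresses.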

Getting a DOM Parser

Before you can serialize a DOM tree representing some XML, though, you need to read that XML in the first place. Since you'll usually be reading XML from a file, I'll show you how to do just that. Example 5-1 is a sample class that takes an XML filename, and loads the document into a DOM tree, represented by the org.w3c.dom.Document interface.

Example 5-1. This test class reads in an XML document and loads it into a DOM tree

package javaxml3;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.xml.sax.InputSource;
import org.w3c.dom.Document;
// Parser import
import org.apache.xerces.parsers.DOMParser;
public class SerializeTester {
 // File to read XML from
 private File inputXML;
 // File to serialize XML to
 private File outputXML;
 public SerializeTester(File inputXML) {
 this.inputXML = inputXML;
 }
 public void test(OutputStream outputStream) throws Exception {
 DOMParser parser = new DOMParser( );
 // Get the DOM tree as a Document object
 // Serialize
 }
 public static void main(String[] args) {
 if (args.length != 2) {
 System.out.println(
 "Usage: java javaxml3.SerializeTester " +
 "[XML document to read] " +
 "[filename to write output to]");
 return;
 }
 try {
 SerializeTester tester = new SerializeTester(new File(args[0]));
 tester.test(new FileOutputStream(new File(args[1])));
 } catch (Exception e) {
 e.printStackTrace( );
 }
 }
}

This example obviously has a couple of pieces missing, represented by the two comments in the test( ) method. I'll supply those in the next two sections, first explaining how to get a DOM tree object, and then detailing the DOMSerializer class, which will do all the heavy lifting.

The DOM Document Object

Remember that in SAX, the focus of interest in the parser was the lifecycle of the process, as all the callback methods provided hooks into the data as it was being parsed. In DOM, the focus of interest lies in the output from the parsing process. Until the entire document is parsed, the XML data is not in a usable state. The output of a parse intended for use with the DOM interface is an org.w3c.dom.Document object. This object acts as a handle to the tree your XML data is in, and in terms of the element hierarchy, it is equivalent to one level above the root element in your XML document. In other words, it owns each and every element in the XML document.

As in SAX, the key method for parsing XML is, unsurprisingly, parse( ); this time the method is on the DOMParser class, though, instead of the SAXParser class. However, DOM requires an additional method to obtain the Document object result from the XML parsing. For DOMParser, this method is named getDocument( ). A simple addition to the SerializeTester class, then, makes reading in XML possible:

public void test(OutputStream outputStream) throws Exception {
 DOMParser parser = new DOMParser( );
 // Get the DOM tree as a Document object
 parser.parse(new InputSource(
 new FileInputStream(inputXML)));
 Document doc = parser.getDocument( );
 // Serialize
 }

This of course assumes you are using Xerces, as the import statement at the beginning of the source file indicates:

import org.apache.xerces.parsers.DOMParser;

If you are using a different parser, you'll need to change this import to your vendor's DOM parser class. Then consult your vendor's documentation to determine which of the parse( ) mechanisms you need to employ to get the DOM result of your parse. Although there is some variance in getting this result, all the uses of this result that we look at are standard across the DOM specification, so you should not have to worry about any other implementation curveballs in the rest of this chapter.
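
If you would rather not tie your code to one vendor's parser class at all, the standard JAXP API offers a neutral route. The sketch below is one way to do it, not the approach used in the rest of this chapter; the class name JaxpParseSketch is invented for the example, and turning on namespace awareness is my own assumption about what you probably want.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of the chapter's example code.
class JaxpParseSketch {

    /** Parses a stream with whatever DOM parser JAXP locates at runtime. */
    static Document parse(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // assumption: namespaces matter to you
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DocumentBuilder.parse( ) returns the Document directly,
        // so no separate getDocument( ) call is needed
        return builder.parse(new InputSource(in));
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<book><title>DOM</title></book>".getBytes(StandardCharsets.UTF_8);
        Document doc = parse(new ByteArrayInputStream(xml));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```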

SAX and DOM at Play

You might have noticed that I supplied the parse( ) method (used for DOM parsing, mind you) with a SAX construct, org.xml.sax.InputSource. That might seem surprising, until you realize that DOM parsers often use SAX to handle their parsing! You've already seen that SAX is a fast and efficient method for parsing, so many DOM parsers will actually read in data with SAX, and build up a DOM tree. Even in cases where SAX isn't used wholesale under the covers, the SAX API still offers useful constructs, like InputSource, for representing input and output data.

Serializer Preliminaries

I've been throwing the term "serialization" around quite a bit, and should probably make sure you know what I mean. When I say serialization, I simply mean outputting XML. This could be a file (using a Java File), an OutputStream, or a Writer. There are certainly more output forms available in Java, but these three cover most of the bases (in fact, the latter two do, as a File can be easily converted to a Writer). In this case, the serialization taking place is in an XML format; the DOM tree is converted back to a well-formed XML document in a textual format. Example 5-2 is the skeleton for the DOMSerializer class. It imports all the needed classes to get the code going, and defines the different entry points (for a File, OutputStream, and Writer) to the class. It also handles setting a few variables that are used in output: the indentation (if any), the encoding, and the line separator (important so that the output works across multiple platforms).

Example 5-2. DOMSerializer handles the preliminary details of DOM output

package javaxml3;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
// DOM imports
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class DOMSerializer {
 /** Indentation to use (default is no indentation) */
 private String indent = "";
 /** Line separator to use (default is for UNIX) */
 private String lineSeparator = "\n";
 /** Encoding for output (default is UTF-8) */
 private String encoding = "UTF8";
 public void setLineSeparator(String lineSeparator) {
 this.lineSeparator = lineSeparator;
 }
 public void setEncoding(String encoding) {
 this.encoding = encoding;
 }
 public void setIndent(int numSpaces) {
 StringBuffer buffer = new StringBuffer( );
 for (int i=0; i<numSpaces; i++)
 buffer.append(" ");
 this.indent = buffer.toString( );
 }
 public void serialize(Document doc, OutputStream out)
 throws IOException {
 Writer writer = new OutputStreamWriter(out, encoding);
 serialize(doc, writer);
 }
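 public void serialize(Document doc, File file)
 throws IOException {
 // A plain FileWriter would use the platform default encoding;
 // honor the configured encoding instead
 Writer writer = new OutputStreamWriter(
 new java.io.FileOutputStream(file), encoding);
 serialize(doc, writer);
 }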
 public void serialize(Document doc, Writer writer)
 throws IOException {
 // Serialize document
 }
}

One nice facet of DOM is that all of the DOM structures that represent XML (including the Document object) extend the DOM org.w3c.dom.Node interface. This enables the coding of a single method that handles serialization of all DOM node types. Within that method, you can differentiate between node types, but by accepting a Node as input, it enables a very simple way of handling all DOM types. Additionally, it sets up a methodology that allows for recursion, any programmer's best friend. Add the serializeNode( ) method shown here, as well as the initial invocation of that method in the serialize( ) method:

public void serialize(Document doc, Writer writer)
 throws IOException {
 // Start serialization recursion with no indenting
 serializeNode(doc, writer, "");
 writer.flush( );
}
private void serializeNode(Node node, Writer writer, String indentLevel)
 throws IOException {
}

In the serializeNode( ) method, an indentLevel variable is used; this sets the method up for recursion. In other words, the serializeNode( ) method can indicate how much the node being worked with should be indented, and when recursion takes place, can add another level of indentation (using the indent member variable). Starting out (within the initial call to serialize( )), there is an empty String for indentation; with an indent of two spaces (set through setIndent(2)), the next level is indented by two spaces, the level after that by four, and so on. If setIndent( ) is never called, the indent member stays an empty String and no indentation is added at any level. Of course, as recursive calls unravel, things head back up to no indentation. All that's left now is to handle the various node types.
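
To see the indentLevel idea in isolation, here is a small sketch that walks a DOM tree and prints only element names, one per line, each indented by its depth. The class IndentSketch and its outline( ) methods are hypothetical; they mirror the indentLevel + indent pattern but are not the serializer itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the chapter's example code.
class IndentSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Returns one line per element, indented by nesting depth. */
    static String outline(Document doc, String indent) {
        StringBuilder out = new StringBuilder();
        outline(doc, indent, "", out); // recursion starts with no indentation
        return out.toString();
    }

    private static void outline(Node node, String indent,
                                String indentLevel, StringBuilder out) {
        String childLevel = indentLevel;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indentLevel).append(node.getNodeName()).append('\n');
            // each nested element gets one more unit of indentation
            childLevel = indentLevel + indent;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            outline(children.item(i), indent, childLevel, out);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.print(outline(parse("<a><b><c/></b><d/></a>"), "  "));
    }
}
```

Because the indentation is passed down as a parameter rather than stored in a field, it "unwinds" for free as each recursive call returns.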

Working with Nodes

Once in the serializeNode( ) method, the first task is to determine what type of node has been passed in. Although you could approach this with a Java methodology, using the instanceof keyword and Java reflection, the DOM language bindings for Java make this task much simpler. The Node interface defines a helper method, getNodeType( ), which returns an int value. This value can be compared against a set of constants (also defined within the Node interface), and the type of Node being examined can be quickly and easily determined. This also fits very naturally into the Java switch construct, which can be used to break up serialization into logical sections. The code here covers almost all DOM node types; although the Node interface defines some additional node types, these are the most common, and the concepts here can be applied to the less common node types as well:

private void serializeNode(Node node, Writer writer, String indentLevel)
 throws IOException {
 // Determine action based on node type
 switch (node.getNodeType( )) {
 case Node.DOCUMENT_NODE:
 break;
 case Node.ELEMENT_NODE:
 break;
 case Node.TEXT_NODE:
 break;
 case Node.CDATA_SECTION_NODE:
 break;
 case Node.COMMENT_NODE:
 break;
 case Node.PROCESSING_INSTRUCTION_NODE:
 break;
 case Node.ENTITY_REFERENCE_NODE:
 break;
 case Node.DOCUMENT_TYPE_NODE:
 break;
 }
}

This code is fairly useless; however, it helps to see all the DOM node types laid out here in a line, rather than mixed in with all the code needed to perform actual serialization. I want to get to that now, though, starting with the first node passed into this method, an instance of the Document interface.
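
If you want to watch getNodeType( ) at work before any real serialization code is in place, a throwaway class like the following can help. It is only a sketch: NodeTypeSketch, describe( ), and childTypes( ) are names I made up, and the labels returned are plain strings of my choosing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the chapter's example code.
class NodeTypeSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Maps a node's type constant to a readable label. */
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:               return "Document";
            case Node.ELEMENT_NODE:                return "Element";
            case Node.TEXT_NODE:                   return "Text";
            case Node.CDATA_SECTION_NODE:          return "CDATA";
            case Node.COMMENT_NODE:                return "Comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "PI";
            case Node.ENTITY_REFERENCE_NODE:       return "EntityReference";
            case Node.DOCUMENT_TYPE_NODE:          return "DocumentType";
            default:                               return "Other";
        }
    }

    /** Lists the types of a node's direct children, in document order. */
    static String childTypes(Node parent) {
        StringBuilder out = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (i > 0) out.append(", ");
            out.append(describe(children.item(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<r>text<!--c--><e/></r>");
        System.out.println(describe(doc));
        System.out.println(childTypes(doc.getDocumentElement()));
    }
}
```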

Document nodes

Because the Document interface is an extension of the Node interface, it can be used interchangeably with the other node types. However, it is a special case, as it contains the root element as well as the XML document's DTD and some other special information not within the XML element hierarchy. As a result, you need to extract the root element and pass that back to the serialization method (starting recursion). Additionally, the XML declaration itself is printed out:

case Node.DOCUMENT_NODE:
 Document doc = (Document)node;
 writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
 writer.write(lineSeparator);
 serializeNode(doc.getDocumentElement( ), writer, "");
 break;

Since the code needs to access a Document-specific method (as opposed to one defined in the generic Node interface), the Node implementation must be cast to the Document interface. Then invoke the object's getDocumentElement( ) method to obtain the root element of the XML input document, and in turn, pass that on to the serializeNode( ) method, starting the recursion and traversal of the DOM tree.

Accessing the XML declaration

DOM Level 2 does not provide you access to the XML declaration. That's why the case statement for Document nodes manually outputs a declaration. However, if you're working with a parser that supports DOM Level 3, you get a few additional methods on the Document interface:

public String getXmlVersion( );
public void setXmlVersion(String version);
public boolean getXmlStandalone( );
public void setXmlStandalone(boolean standalone);
public String getXmlEncoding( );

It is intentional (and correct) that there is no setXmlEncoding( ) option. That attribute of the XML declaration is essentially read-only, as any outgoing encoding is handled by your coding language, and should not be set explicitly by the DOM API.

If you've got access to these methods, make this change to your code:

case Node.DOCUMENT_NODE:
 Document doc = (Document)node;
/** DOM Level 2 code
 writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
*/
 writer.write("<?xml version=\"");
 writer.write(doc.getXmlVersion( ));
 writer.write("\" encoding=\"UTF-8\" standalone=\"");
 if (doc.getXmlStandalone( ))
 writer.write("yes");
 else
 writer.write("no");
 writer.write("\"");
 writer.write("?>");
 writer.write(lineSeparator);
 serializeNode(doc.getDocumentElement( ), writer, "");
 break;

To access DOM Level 3 core functionality in Xerces, either build Xerces from source, using the jars-dom3 target, or download a Xerces distribution built with DOM Level 3 (as I write this, that distribution is in beta release). For more on DOM Level 3 support in Xerces, check out http://xml.apache.org/xerces2-j/faq-dom.html and http://xml.apache.org/xerces2-j/dom3.html.

Take special note of how getXmlStandalone( ) is used; the method returns either true or false, but the XML declaration defines the standalone attribute as accepting either "yes" or "no". If you're not careful, it's easy to just dump the value of getXmlStandalone( ) into the document, which will cause errors in parsing. You've also got to make sure that the standalone attribute comes after encoding, or you'll get nasty errors as well. Something else that should look a little odd to you is the hard-wiring of the encoding (it's set to UTF-8). While it's possible to get the document's encoding through getXmlEncoding( ), it's not safe to assume that encoding is being used for writing. Instead, UTF-8 is applied, making sure that what is written out of Java can be read correctly.

As an example of this problem, I had a version of DOMSerializer that did output the parsed document's encoding. In one case, the document specified us-ascii as the encoding, and that's what my serializer output. However, when reading back in the serialized document, parsing failed: there were UTF-8 characters, output from my UTF-8 Java Writer, in a document that said it contained only ASCII characters. I learned my lesson, and ensured that the encoding named in the XML declaration was the encoding I was actually writing in.
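
One way to keep the declaration and the Writer from drifting apart is to derive both from a single value. The sketch below does that with a java.nio.charset.Charset and then parses the bytes back as a check. EncodingSketch, write( ), and readBack( ) are hypothetical names, and the text passed in is assumed to contain no markup characters, since nothing is escaped here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class; not part of the chapter's example code.
class EncodingSketch {

    /** Writes a tiny document whose declaration names the charset in use. */
    static byte[] write(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            // the same Charset drives both the declaration and the bytes
            writer.write("<?xml version=\"1.0\" encoding=\""
                    + charset.name() + "\"?>\n");
            writer.write("<note>" + text + "</note>"); // assumes no markup in text
        }
        return bytes.toByteArray();
    }

    /** Parses the bytes back and returns the root element's text. */
    static String readBack(byte[] xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml))
                .getDocumentElement()
                .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = write("caf\u00e9", StandardCharsets.ISO_8859_1);
        System.out.println(readBack(xml));
    }
}
```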

Element nodes

The most common task in serialization is to take a DOM Element and print out its name, attributes, and value, and then serialize its children. First you need to get the name of the XML element, which is available through the getNodeName( ) method on the Node interface. You can then grab the children of the current element and serialize these as well. A Node's children can be accessed through the getChildNodes( ) method, which returns an instance of a DOM NodeList. It is trivial to obtain the length of this list, and then iterate through the children calling the serializeNode( ) method on each. There's also quite a bit of logic that ensures correct indentation and new lines; these are really just formatting issues, and I won't spend time on them here:

case Node.ELEMENT_NODE:
 String name = node.getNodeName( );
 writer.write(indentLevel + "<" + name);
 writer.write(">");
 // recurse on each child
 NodeList children = node.getChildNodes( );
 if (children != null) {
   if ((children.item(0) != null) &&
       (children.item(0).getNodeType( ) == Node.ELEMENT_NODE))
     writer.write(lineSeparator);
   for (int i=0; i<children.getLength( ); i++)
     serializeNode(children.item(i), writer,
         indentLevel + indent);
   if ((children.item(0) != null) &&
       (children.item(children.getLength( )-1)
           .getNodeType( ) == Node.ELEMENT_NODE))
     writer.write(indentLevel);
 }
 writer.write("</" + name + ">");
 writer.write(lineSeparator);
 break;

Of course, astute readers (or DOM experts) will notice that I left out something important: the element's attributes! These are the only pseudoexceptions to the strict tree that DOM builds; attributes are not children (leaf nodes) of elements, but exist as properties of the element. They should be an exception, though, since an attribute is not really a child of an element; it's (sort of) lateral to it. In any case, the attributes of an element are available through the getAttributes( ) method on the Node interface. This method returns a NamedNodeMap, and that too can be iterated through. Each Node within this list can be polled for its name and value, and suddenly the attributes are handled! Enter the code as shown here to take care of this:

case Node.ELEMENT_NODE:
 String name = node.getNodeName( );
 writer.write(indentLevel + "<" + name);
 NamedNodeMap attributes = node.getAttributes( );
 for (int i=0; i<attributes.getLength( ); i++) {
   Node current = attributes.item(i);
   writer.write(" " + current.getNodeName( ) + "=\"");
   print(writer, current.getNodeValue( ));
   writer.write("\"");
 }
 writer.write(">");
 // recurse on each child
 NodeList children = node.getChildNodes( );
 if (children != null) {
   if ((children.item(0) != null) &&
       (children.item(0).getNodeType( ) == Node.ELEMENT_NODE))
     writer.write(lineSeparator);
   for (int i=0; i<children.getLength( ); i++)
     serializeNode(children.item(i), writer,
         indentLevel + indent);
   if ((children.item(0) != null) &&
       (children.item(children.getLength( )-1)
           .getNodeType( ) == Node.ELEMENT_NODE))
     writer.write(indentLevel);
 }
 writer.write("</" + name + ">");
 writer.write(lineSeparator);
 break;
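
One detail worth knowing about NamedNodeMap is that it makes no promise about the order of its items, so attributes may come out in a different order than they appeared in the source. The sketch below copies an element's attributes into a sorted map, which is one way (my own choice, not a DOM requirement) to get stable output. AttributeSketch is a hypothetical class.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the chapter's example code.
class AttributeSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Copies an element's attributes into a map sorted by attribute name. */
    static Map<String, String> attributes(Element element) {
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node current = attributes.item(i);
            result.put(current.getNodeName(), current.getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<img src=\"a.png\" alt=\"A\"/>");
        System.out.println(attributes(doc.getDocumentElement()));
    }
}
```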

I've snuck in a new method here: print( ). Since there are a lot of special characters in XML (&, <, >, and so forth), those need to be handled differently. That's the job of the utility print( ) method:

private void print(Writer writer, String s) throws IOException {
 if (s == null) return;
 for (int i=0, len=s.length( ); i<len; i++) {
   char c = s.charAt(i);
   switch(c) {
     case '<':
       writer.write("&lt;");
       break;
     case '>':
       writer.write("&gt;");
       break;
     case '&':
       writer.write("&amp;");
       break;
     case '"':
       // attribute values are written inside double quotes
       writer.write("&quot;");
       break;
     case '\r':
       // keep carriage returns from being normalized away on re-parse
       writer.write("&#xD;");
       break;
     default:
       writer.write(c);
   }
 }
}

You'll see this used in printing out element text as well, which leads us to the next Node type.
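
If you want to experiment with the escaping rules on their own, a String-returning variant is easier to poke at than one that needs a Writer. This is a sketch under my own assumptions: EscapeSketch and escape( ) are hypothetical, and the double-quote case is included because the result may end up inside a double-quoted attribute value.

```java
// Hypothetical demo class; not part of the chapter's example code.
class EscapeSketch {

    /** Returns s with XML special characters replaced by references. */
    static String escape(String s) {
        if (s == null) return "";
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break; // for attribute values
                case '\r': out.append("&#xD;");  break; // survives re-parsing
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b & \"c\""));
    }
}
```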

Text and CDATA nodes

Next on the list of node types is Text nodes. Output is quite simple, as you only need to use the now-familiar getNodeValue( ) method of the DOM Node interface to get the textual data and print it out; the same is true for CDATA nodes, except that the data within a CDATA section should be enclosed within the CDATA XML semantics (surrounded by <![CDATA[ and ]]>) and need not use the special character handling in the print() method above. You can add the logic for those two cases now:

case Node.TEXT_NODE:
 print(writer, node.getNodeValue( ));
 break;
case Node.CDATA_SECTION_NODE:
 writer.write("<![CDATA[");
 writer.write(node.getNodeValue( ));
 writer.write("]]>");
 break;
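
The difference between the two cases is easiest to see on a small input that mixes escaped text with a CDATA section. The sketch below assumes the parser keeps CDATA sections as separate nodes, which is the default for a JAXP DocumentBuilderFactory (coalescing is off unless you turn it on). CdataSketch and content( ) are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the chapter's example code.
class CdataSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes only the Text and CDATA children of an element. */
    static String content(Element element) {
        StringBuilder out = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:
                    // plain text must be escaped on the way out
                    out.append(child.getNodeValue()
                            .replace("&", "&amp;")
                            .replace("<", "&lt;")
                            .replace(">", "&gt;"));
                    break;
                case Node.CDATA_SECTION_NODE:
                    // CDATA content is written verbatim between its delimiters
                    out.append("<![CDATA[")
                       .append(child.getNodeValue())
                       .append("]]>");
                    break;
                default:
                    break; // other node types are ignored in this sketch
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<r>1 &lt; 2 <![CDATA[<b>]]></r>");
        System.out.println(content(doc.getDocumentElement()));
    }
}
```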

Comment nodes

Dealing with comments in DOM is about as simple as it gets. The getNodeValue( ) method returns the text within the <!-- and --> XML constructs. That's really all there is to it:

case Node.COMMENT_NODE:
 writer.write(indentLevel + "<!-- " +
 node.getNodeValue( ) + " -->");
 writer.write(lineSeparator);
 break;

Processing instruction nodes

Moving on to the next DOM node type: the DOM bindings for Java define an interface to handle processing instructions that are within the input XML document, rather obviously called ProcessingInstruction. This is useful, as these instructions do not follow the same markup model as XML elements and attributes, but are still important for apps to know about. The PI node in the DOM is a little bit of a break from what you have seen so far: to fit the syntax into the Node interface model, the getNodeValue( ) method returns all data instructions within a PI in one String. This allows quick output of the PI; however, you still need to use getNodeName( ) to get the name of the PI.

If you were writing an app that received PIs from an XML document, you might prefer to use the actual ProcessingInstruction interface; although it exposes the same data, the method names (getTarget( ) and getData( )) are more in line with a PI's format.
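
Here is what that looks like in practice: a sketch that finds the first top-level PI in a document and rebuilds it from getTarget( ) and getData( ). PiSketch and firstInstruction( ) are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the chapter's example code.
class PiSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the first top-level PI as markup, or null if there is none. */
    static String firstInstruction(Document doc) {
        NodeList nodes = doc.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) node;
                // getTarget( ) matches getNodeName( ), getData( ) matches getNodeValue( )
                return "<?" + pi.getTarget() + " " + pi.getData() + "?>";
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<?xml-stylesheet href=\"s.css\" type=\"text/css\"?><r/>");
        System.out.println(firstInstruction(doc));
    }
}
```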

With this understanding, you can add in the code to print out any PIs in supplied XML documents:

case Node.PROCESSING_INSTRUCTION_NODE:
 writer.write("<?" + node.getNodeName( ) +
 " " + node.getNodeValue( ) +
 "?>");
 writer.write(lineSeparator);
 break;

While the code to deal with PIs is perfectly workable, there is a problem. In the case that handled document nodes, all the serializer did was pull out the document element and recurse. The problem is that this approach ignores any other child nodes of the Document object, such as top-level PIs and any DOCTYPE declarations. Those node types are actually lateral to the document element (root element), and are ignored. Instead of just pulling out the document element, then, the following code serializes all child nodes on the supplied Document object:

case Node.DOCUMENT_NODE:
 Document doc = (Document)node;
 writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
 writer.write(lineSeparator);
 // recurse on each top-level node
 NodeList nodes = node.getChildNodes( );
 if (nodes != null)
 for (int i=0; i<nodes.getLength( ); i++)
 serializeNode(nodes.item(i), writer, "");
 // serializeNode(doc.getDocumentElement( ), writer, "");
 break;

DocumentType nodes

With this in place, the code can deal with DocumentType nodes, which represent a DOCTYPE declaration. Like PIs, a DTD declaration can be helpful in exposing external information that might be needed in processing an XML document. However, since there can be public and system IDs as well as other DTD-specific data, the code needs to cast the Node instance to the DocumentType interface to access this additional data. Then, use the helper methods to get the name of the Node, which returns the name of the element in the document that is being constrained, the public ID (if it exists), and the system ID of the DTD referenced. It then adds any internal subset information. Using this information, the original DTD can be serialized:

case Node.DOCUMENT_TYPE_NODE:
 DocumentType docType = (DocumentType)node;
 String publicId = docType.getPublicId( );
 String systemId = docType.getSystemId( );
 String internalSubset = docType.getInternalSubset( );
 writer.write("<!DOCTYPE " + docType.getName( ));
 if (publicId != null)
   writer.write(" PUBLIC \"" + publicId + "\" ");
 else
   writer.write(" SYSTEM ");
 writer.write("\"" + systemId + "\"");
 if (internalSubset != null)
   writer.write(" [" + internalSubset + "]");
 writer.write(">");
 writer.write(lineSeparator);
 break;
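
One case the code above does not guard against is a DOCTYPE with no external identifier at all (for example, one that carries only an internal subset), where getSystemId( ) returns null. The following sketch is a null-safe variant that builds the declaration as a String; DoctypeSketch and declaration( ) are hypothetical names, and the DocumentType objects are created through the standard DOMImplementation.createDocumentType( ) method so that nothing has to be fetched from the network.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.DocumentType;

// Hypothetical demo class; not part of the chapter's example code.
class DoctypeSketch {

    static DOMImplementation dom() throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .getDOMImplementation();
    }

    /** Rebuilds a DOCTYPE declaration, skipping identifiers that are null. */
    static String declaration(DocumentType docType) {
        StringBuilder out = new StringBuilder("<!DOCTYPE ").append(docType.getName());
        String publicId = docType.getPublicId();
        String systemId = docType.getSystemId();
        if (publicId != null) {
            // XML requires a system literal after PUBLIC, so fall back to ""
            out.append(" PUBLIC \"").append(publicId).append("\" \"")
               .append(systemId != null ? systemId : "").append('"');
        } else if (systemId != null) {
            out.append(" SYSTEM \"").append(systemId).append('"');
        }
        String internalSubset = docType.getInternalSubset();
        if (internalSubset != null) {
            out.append(" [").append(internalSubset).append(']');
        }
        return out.append('>').toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(declaration(dom().createDocumentType(
            "html", "-//W3C//DTD XHTML 1.0 Strict//EN",
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd")));
        System.out.println(declaration(dom().createDocumentType("note", null, null)));
    }
}
```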

Entity Reference nodes

All that's left at this point is handling entities and entity references. In this chapter, I will skim over entities and focus on entity references; more details on entities and notations are in the next chapter. For now, a reference can simply be output with the & and ; characters surrounding it:

case Node.ENTITY_REFERENCE_NODE:
 writer.write("&" + node.getNodeName( ) + ";");
 break;

There are a few surprises that may trip you up when it comes to the output from a node such as this. The definition of how entity references should be processed within DOM allows a lot of latitude, and also relies heavily on the underlying parser's behavior. In fact, most XML parsers have expanded and processed entity references before the XML document's data ever makes its way into the DOM tree. Often, when expecting to see an entity reference within your DOM structure, you will find the text or values referenced rather than the entity reference itself.
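
You can check what your own parser does with a quick experiment. The sketch below counts EntityReference nodes in a parsed tree, once with entity expansion on and once with it off, using the standard DocumentBuilderFactory.setExpandEntityReferences( ) switch. EntitySketch and its members are hypothetical, and I deliberately don't promise specific counts, because the result depends on the parser and its version.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the chapter's example code.
class EntitySketch {

    /** A document that declares and uses one internal entity. */
    static final String SAMPLE =
        "<!DOCTYPE memo [<!ENTITY co \"Acme Inc.\">]><memo>From &co;</memo>";

    static Document parse(String xml, boolean expand) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setExpandEntityReferences(expand);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Counts EntityReference nodes at or below the given node. */
    static int countEntityReferences(Node node) {
        int count = node.getNodeType() == Node.ENTITY_REFERENCE_NODE ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            count += countEntityReferences(children.item(i));
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        for (boolean expand : new boolean[] {true, false}) {
            Document doc = parse(SAMPLE, expand);
            System.out.println("expand=" + expand
                + " references=" + countEntityReferences(doc)
                + " text=" + doc.getDocumentElement().getTextContent());
        }
    }
}
```

If the count is zero in both runs, your ENTITY_REFERENCE_NODE case will simply never fire for that parser, and the replacement text will arrive as ordinary Text nodes.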

And that's it! As I mentioned, there are a few other node types, but covering them isn't worth the trouble at this point; you get the idea about how DOM works.

The Results

With the DOMSerializer class complete, all that's left is to invoke the serializer's serialize( ) method in the test class. To do this, add the following lines to the SerializeTester class:

public void test(OutputStream outputStream) throws Exception {
 DOMParser parser = new DOMParser( );
 // Get the DOM tree as a Document object
 parser.parse(new InputSource(
 new FileInputStream(inputXML)));
 Document doc = parser.getDocument( );
 // Serialize
 DOMSerializer serializer = new DOMSerializer( );
 serializer.setIndent(2);
 serializer.serialize(doc, outputStream);
}

I ran this program on a couple of files, most notably an XML version of the DOM Level 3 Load and Save module specification (http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/xml-source.xml). A section of the rather large output is shown here:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- $Id$ -->
<!DOCTYPE spec PUBLIC "-//W3C//DTD Specification V2.2-Based DOM//EN" "http://www.w3.org/2002/08/xmlspec-v22-dom.dtd">
<spec role="public" w3c-doctype="rec">
 <!-- *************************************************************************
 * FRONT MATTER *
 *************************************************************************
 -->
 <!-- ******************************************************
 | filenames to be used for each section |
 ******************************************************
 -->
<?command-options --map Copyright-Notice copyright-notice
--map Introduction introduction
--map TOC expanded-toc
--map Core core
--map Events events
--map idl idl-definitions
--map ecma-binding ecma-script-binding
--map java-binding java-binding
--map Index def-index
--map Objects object-index
--map References references
--map Errors errors
--map Level-3-AS abstract-schemas
--map Load-Save load-save
...

You may notice that there is quite a bit of extra whitespace in the output; that's because the serializer adds some new lines every time writer.write(lineSeparator) appears in the code. Of course, the underlying DOM tree has some new lines in it as well, which are reported as Text nodes. The end result in many of these cases is the double line breaks, as seen in the output.

Let me be very clear that the DOMSerializer class shown in this chapter is for example purposes and is not a good production solution. While you are welcome to use the class in your own apps, realize that several important options are left out, like setting advanced options for indentation, new lines, and line wrapping. Additionally, entities are handled only in passing (complete treatment would be twice as long as this chapter already is!). Your parser probably has its own serializer class, if not multiple classes, that perform this task at least as well, if not better, than the example in this chapter. However, you now should understand what's going on under the hood in those classes. As a matter of reference, if you are using Apache Xerces, the classes to look at are in org.apache.xml.serialize. Some particularly useful ones are XMLSerializer, XHTMLSerializer, and HTMLSerializer. Check them out; they offer a good solution until DOM Level 3 support is more common.
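
Another vendor-neutral option, if your environment includes the JAXP transformation API (javax.xml.transform), is an identity transform: a Transformer created without a stylesheet simply copies its source to its result. The sketch below is one way to use that for serialization; TransformerSketch is a hypothetical class, and the choice of UTF-8 is my own assumption.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the chapter's example code.
class TransformerSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes a DOM tree by running it through an identity transform. */
    static String toXml(Document doc) throws TransformerException {
        // newTransformer( ) with no stylesheet yields an identity transform
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(parse("<a><b>x</b></a>")));
    }
}
```

OutputKeys also defines INDENT and OMIT_XML_DECLARATION properties, though how indentation is applied varies from one implementation to the next.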