Modifying and Creating XML

The biggest limitation of SAX is that you cannot change any of the XML structure you encounter, at least not without using filters and writers. Those aren't intended for wholesale document changes anyway, so you'll need another API when you want to modify XML. DOM fits the bill nicely, as it provides both XML creation and modification facilities. In working with DOM, the process of creating an XML document is quite different from changing an existing one, so I'll take them one at a time.

This section gives you a fairly realistic example to mull over. If you've ever been to an online auction site like eBay, you know that the most important aspects of the auction are the ability to find items and the ability to find out about items. These functions depend on a user entering a description of an item, and the auction site using that information. The better auction sites allow users to enter some basic information as well as actual HTML descriptions, which means savvy users can bold, italicize, link, and add other formatting to their items' descriptions. This provides a good case for using DOM.

Setting Up an Input Servlet

To get started, a little bit of groundwork is needed. Example 5-3 shows a servlet that displays a simple HTML form that takes basic information about an item to be listed on an auction site. This would obviously be dressed up more for a real site, but you get the idea.

Example 5-3. This servlet-generated form submits the data it collects to itself

package javaxml3;
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletConfig;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
// DOM imports
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Element;
import org.w3c.dom.Text;
// Parser import
import org.apache.xerces.dom.DOMImplementationImpl;
public class UpdateItemServlet extends HttpServlet {
 private String outputDir;
 public void init(ServletConfig config) throws ServletException {
 super.init(config);
 outputDir = config.getInitParameter("OutputDirectory");
 if (outputDir == null) outputDir = "";
 }
 public void doGet(HttpServletRequest req, HttpServletResponse res)
 throws ServletException, IOException {
 // Set the content type before getting the writer
 res.setContentType("text/html");
 PrintWriter out = res.getWriter( );
 // The form posts back to this servlet (assumed definition of target)
 String target = req.getRequestURI( );
 // Output HTML
 out.println("<html>");
 out.println(" <head><title>Input/Update Item Listing</title></head>");
 out.println(" <body>");
 out.println(" <h1 align='center'>Input/Update Item Listing</h1>");
 out.println(" <p align='center'>");
 out.println(" <form method='POST' action='" + target + "'>");
 out.println(" Item ID (Unique Identifier): <br />");
 out.println(" <input name='id' type='text' maxLength='10' />" +
 "<br /><br />");
 out.println(" Item Name: <br />");
 out.println(" <input name='name' type='text' maxLength='50' />" +
 "<br /><br />");
 out.println(" Item Description: <br />");
 out.println(" <textarea name='description' rows='10' cols='30' " +
 "wrap='wrap' ></textarea><br /><br />");
 out.println(" <input type='reset' value='Reset Form'>&nbsp;&nbsp;");
 out.println(" <input type='submit' value='Add/Update Item' />");
 out.println(" </form>");
 out.println(" </p>");
 out.println(" </body>");
 out.println("</html>");
 out.close( );
 }
}

Notice that the target of this form submission is actually the servlet itself. Of course, since the submission is made via POST, the doPost( ) method is called, rather than this doGet( ) method being invoked over and over. Also note the init( ) method, which grabs an init-param (from your servlet context's web.xml file). This parameter specifies an output directory where the XML files will be written.

If you've never worked with a servlet engine, that probably all seemed like another language. If you want to run this example, and are unfamiliar with servlet basics, pick up Jason Hunter's Java Servlet Programming, and throw in a side of Ian Darwin and Jason Brittain's Tomcat: The Definitive Guide, both O'Reilly books.

In the doPost( ) method, the request parameters need to be read in, and put into a DOM tree (showcasing DOM's ability to create XML, which is the whole point of this exercise). Then, using the DOMSerializer class, the DOM tree is written out to a file, preserved for other app components to use:

public void doPost(HttpServletRequest req, HttpServletResponse res)
 throws ServletException, IOException {
 // Get parameter values
 String id = req.getParameter("id");
 String name = req.getParameter("name");
 String description = req.getParameter("description");
 // Create new DOM tree
 DOMImplementation domImpl = new DOMImplementationImpl( );
 Document doc = domImpl.createDocument(null, "item", null);
 Element root = doc.getDocumentElement( );
 // ID of item (as attribute)
 root.setAttribute("id", id);
 // Name of item
 Element nameElement = doc.createElement("name");
 Text nameText = doc.createTextNode(name);
 nameElement.appendChild(nameText);
 root.appendChild(nameElement);
 // Description of item
 Element descriptionElement = doc.createElement("description");
 Text descriptionText = doc.createTextNode(description);
 descriptionElement.appendChild(descriptionText);
 root.appendChild(descriptionElement);
 // Serialize DOM tree
 DOMSerializer serializer = new DOMSerializer( );
 String filename = outputDir + "item-" + id + ".xml";
 File outputFile = new File(filename);
 serializer.serialize(doc, outputFile);
 // Print confirmation
 PrintWriter out = res.getWriter( );
 res.setContentType("text/html");
 out.println("<HTML><BODY>");
 out.println("<p>Thank you for your submission. " +
 "Your item has been processed.</p>");
 out.println("<p>Your item was saved as " + outputFile.getAbsolutePath( ) + "</p>");
 out.println("</BODY></HTML>");
 out.close( );
}

Make sure the DOMSerializer class is in your servlet's context classpath. In my setup, using Tomcat, my context is called javaxml3, in a directory named javaxml3 under the webapps directory. In my WEB-INF/classes directory, there is a javaxml3 directory (for the package), and the DOMSerializer.class and UpdateItemServlet.class files are within that directory. You should also ensure that a copy of your parser's JAR files (xercesImpl.jar, xml-apis.jar, and xmlParserAPIs.jar in my case) is in the classpath of your engine. In Tomcat, you can simply drop a copy in Tomcat's common/lib directory. Then restart Tomcat and everything should work.

Once you've got your servlet in place and the servlet engine started, browse to the servlet and let the GET request your browser generates load the HTML input form. Fill out this form, as I have in the figure below.

Form submission with HTML description

For a better example, enter in HTML for the description; I entered this:

This custom-built Simpson GA (<i>Grand Auditorium</i>) is a <b>beautiful</b> instrument. It features <b>master grade</b> Ziricote for the back and sides, and a <b>AAA</b> redwood top. It doesn't get <i>any</i> better than this. For more on this and other Simpson instruments, visit <a href="http://web.archive.org/web/www.simpsonguitars.com">Jason Simpson</a> online.

Creating a New DOM Tree

There are two basic approaches to creating a DOM tree from scratch:

Create a new instance of the org.w3c.dom.Document class.
Create a new instance of the org.w3c.dom.DOMImplementation class.

In either case, you're actually going to need to create an instance of an implementation of these classes, as both are interfaces. After working with SAX, you should realize that these implementations are what parsers like Xerces provide. For example, Xerces provides the DocumentImpl class to implement Document, and the DOMImplementationImpl class to implement DOMImplementation.

Both of these classes are in the org.apache.xerces.dom package.

So the choice becomes one of functionality. You've already seen what Document provides, and it would seem the obvious choice (why involve another class if you don't need to?). However, DOMImplementation offers you the ability to create a DocumentType, and therefore set a DOCTYPE declaration on your XML; this alone is worth using DOMImplementation. Further, DOMImplementation provides the hasFeature( ) method, which is critical for working with DOM modules (the focus of ). Once you've got an instance of DOMImplementation, things are pretty simple. Take a look at the relevant code again:

// Create new DOM tree
DOMImplementation domImpl = new DOMImplementationImpl( );
Document doc = domImpl.createDocument(null, "item", null);
Element root = doc.getDocumentElement( );
// ID of item (as attribute)
root.setAttribute("id", id);
// Name of item
Element nameElement = doc.createElement("name");
Text nameText = doc.createTextNode(name);
nameElement.appendChild(nameText);
root.appendChild(nameElement);
// Description of item
Element descriptionElement = doc.createElement("description");
Text descriptionText = doc.createTextNode(description);
descriptionElement.appendChild(descriptionText);
root.appendChild(descriptionElement);
// Serialize DOM tree
DOMSerializer serializer = new DOMSerializer( );
String filename = outputDir + "item-" + id + ".xml";
File outputFile = new File(filename);
serializer.serialize(doc, outputFile);

First, the createDocument( ) method is used to get a new Document instance. The first argument to this method is the namespace URI for the document's root element. For simplicity's sake, this is omitted by passing in null. The second argument is the name of the root element itself, which is simply item. The last argument is a DocumentType instance, and I again pass in a null value since there isn't one in this particular example.

If you did want a DOCTYPE declaration, you could create a DocumentType with the createDocumentType( ) method on DOMImplementation.
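To make that concrete, here is a minimal sketch of creating a document that carries a DOCTYPE. It rests on a few assumptions that are not part of the servlet: the DOMImplementation is obtained through JAXP's DocumentBuilderFactory rather than the Xerces class, and the system ID item.dtd is a made-up name used only for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

public class DocTypeExample {

    // Builds an empty <item> document that carries a DOCTYPE declaration.
    // The system ID "item.dtd" is hypothetical; substitute your own DTD.
    public static Document createItemDocument() throws Exception {
        // Obtain a DOMImplementation through JAXP (an assumption of this sketch)
        DOMImplementation domImpl = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .getDOMImplementation();

        // Arguments: qualified name of the root element, public ID, system ID
        DocumentType docType = domImpl.createDocumentType("item", null, "item.dtd");

        // Pass the DocumentType as the third argument instead of null
        return domImpl.createDocument(null, "item", docType);
    }

    public static void main(String[] args) throws Exception {
        Document doc = createItemDocument();
        System.out.println("Root element: " + doc.getDocumentElement().getTagName());
        System.out.println("System ID: " + doc.getDoctype().getSystemId());
    }
}
```

The name passed to createDocumentType( ) should match the root element name passed to createDocument( ), since a DOCTYPE names the root element of the document it describes.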

With a DOM tree to operate upon, it's simple enough to retrieve the new root element. Once you've got that, add an attribute with the ID of the item using setAttribute( ).

Things begin to get even simpler now; each type of DOM construct can be created using the Document object as a factory. To create the name and description elements, use createElement( ). The same approach is used to create textual content for each; since an element has no content of its own but instead has children that are Text nodes, the createTextNode( ) method is the right choice. This method takes in the text for the node, which works out to be the item name and the description. You might be tempted to use the createCDATASection( ) method and wrap this text in CDATA tags, since there is HTML within this element. However, DOMSerializer handles the special characters (like < and &) in its print( ) method, so there's no need to worry about it.

Once you've gotten all of these nodes created, all that's left is to link them together. Use appendChild( ), appending the elements to the root, and the textual content of the elements to the correct parent. Finally, the whole document is passed into the DOMSerializer class from the last chapter and written out to an XML file on disk.

I have assumed that the user is entering well-formed HTML; in other words, XHTML. In a production app, you would probably run this input through JTidy (http://www.sf.net/projects/jtidy) to ensure this; for this example, I'll just assume the input is XHTML.
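If you only want to detect input that isn't well-formed, rather than repair it the way JTidy would, one rough approach is to wrap the submitted text in a temporary root element and hand it to a parser. The sketch below does that with JAXP; the class name, the method name, and the wrapper element name fragment are all my own inventions for illustration. Keep in mind that HTML-only entities such as &nbsp; will be rejected, since they aren't declared in plain XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class WellFormedCheck {

    // Returns true if the fragment parses as well-formed XML once it is
    // wrapped in a temporary root element (the name "fragment" is arbitrary).
    public static boolean isWellFormed(String fragment) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // Rethrow problems instead of letting the parser report them itself
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });

        try {
            builder.parse(new InputSource(
                new StringReader("<fragment>" + fragment + "</fragment>")));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, so it is not well-formed
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("a <b>beautiful</b> instrument"));
        System.out.println(isWellFormed("a <b>beautiful instrument"));
    }
}
```

This only tells you whether the input is acceptable; it does nothing to clean it up, which is the job a tool like JTidy is meant for.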

Try this servlet out, and then browse to the directory you specified in web.xml. The output from my input is shown in Example 5-4.

Example 5-4. The information is converted to XML and written to a file

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<item >
<name>Simpson GA Guitar</name>
<description>This custom-built Simpson GA (&lt;i&gt;Grand Auditorium&lt;/i&gt;) is a &lt;b&gt;beautiful&lt;/b&gt; instrument. It features &lt;b&gt;master grade&lt;/b&gt; Ziricote for the back and sides, and a &lt;b&gt;AAA&lt;/b&gt; redwood top. It doesn't get &lt;i&gt;any&lt;/i&gt; better than this. For more on this and other Simpson instruments, visit &lt;a href="http://web.archive.org/web/www.simpsonguitars.com"&gt;Jason Simpson&lt;/a&gt; online.</description>
</item>

As I mentioned before, you can see that DOMSerializer handled escaping all special characters.
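If you don't have DOMSerializer handy, you can see the same escaping behavior with the identity Transformer that ships with JAXP. This is a sketch under that assumption, not the serializer used in this chapter, and the class and method names are hypothetical. It puts HTML markup into a Text node and serializes the result to a string.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EscapingExample {

    // Builds <item><description>...</description></item> and returns it
    // as a string, with the description stored in a single Text node.
    public static String serializeItem(String description) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .newDocument();
        Element root = doc.createElement("item");
        doc.appendChild(root);

        Element descriptionElement = doc.createElement("description");
        descriptionElement.appendChild(doc.createTextNode(description));
        root.appendChild(descriptionElement);

        // An identity transform copies the DOM tree to the output unchanged,
        // escaping markup characters found in text content along the way
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeItem("a <b>beautiful</b> instrument, R&D approved"));
    }
}
```

Because the markup lives in a Text node, it is data rather than structure; a parser reading the file back will hand you the original string, angle brackets and all.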

Bootstrapping with DOM Level 3

Those of you who are really into vendor-neutral code, and avoiding being tied to a specific parser product, are probably turned off by working directly with Xerces classes:

DOMImplementation domImpl = new org.apache.xerces.dom.DOMImplementationImpl( );

If this bothers you as much as it does me, then DOM Level 3 should be of great interest to you. The newest version of the DOM specification allows you to use a factory that provides a vendor's implementation of DOMImplementation (I know, I know, it's a bit confusing). Vendors can set system properties or provide their own versions of this factory so that it returns the implementation class they want. The resulting code to create DOM trees then looks like this:

import org.w3c.dom.Document;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
// Class declaration and other Java constructs
DOMImplementationRegistry registry =
 DOMImplementationRegistry.newInstance( );
DOMImplementation domImpl = registry.getDOMImplementation("XML 3.0");
Document doc = domImpl.createDocument(null, "item", null);
// And so on...

Even though you're requesting an implementation that supports "XML 3.0", you're not referring to Version 3.0 of the XML specification; you're requesting Level 3 of the XML module, which is essentially a core DOM Level 3 implementation. This is a rather ill-named feature, but hopefully they'll change that soon.
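Here is a small, complete sketch of the bootstrapping approach, assuming a Java runtime that bundles the org.w3c.dom.bootstrap package. The class name is hypothetical. It adds two details worth knowing about: newInstance( ) declares several checked exceptions, and getDOMImplementation( ) returns null when no registered implementation supports the requested features.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;

public class BootstrapExample {

    // Creates an empty <item> document without naming any parser class.
    public static Document createItemDocument() throws Exception {
        // newInstance() can throw ClassNotFoundException, InstantiationException,
        // IllegalAccessException, or ClassCastException
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();

        // A null result means no implementation supports the requested features
        DOMImplementation domImpl = registry.getDOMImplementation("XML 3.0");
        if (domImpl == null) {
            throw new IllegalStateException("No DOM implementation supports XML 3.0");
        }
        return domImpl.createDocument(null, "item", null);
    }

    public static void main(String[] args) throws Exception {
        Document doc = createItemDocument();
        System.out.println("Root element: " + doc.getDocumentElement().getTagName());
        // hasFeature() reports whether a given module and level are supported
        System.out.println("Core 2.0 supported: "
            + doc.getImplementation().hasFeature("Core", "2.0"));
    }
}
```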

There are several other classes and interfaces in the org.w3c.dom.bootstrap package worth checking out. Until DOM Level 3 is in primetime, though, I'd rather focus on features that you can use immediately.

Modifying a DOM Tree

The process of changing an existing DOM tree is slightly different from the process of creating one; in general, it involves loading the DOM from some source, traversing the tree, and then making changes. These changes are usually either to structure or content. If the change is to structure, it becomes a matter of creation again:

// Add a url element to the root
Element root = doc.getDocumentElement( );
Element url = doc.createElement("url");
url.appendChild(doc.createTextNode("http://www.simpsonguitars.com"));
root.appendChild(url);

The process of changing existing content is a little different, although not overly complex. Example 5-5 is a modified version of the UpdateItemServlet. This version reads the supplied ID and tries to load an existing file if it exists. If so, it doesn't create a new DOM tree, but instead modifies the existing one.

Example 5-5. Modifying a DOM tree involves searching for a particular node and then changing its value

package javaxml3;
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletConfig;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.xml.sax.SAXException;
// DOM imports
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
// Parser import
import org.apache.xerces.dom.DOMImplementationImpl;
import org.apache.xerces.parsers.DOMParser;
public class ModifyItemServlet extends HttpServlet {
 // doGet( ) and init( ) methods are unchanged from UpdateItemServlet
 public void doPost(HttpServletRequest req, HttpServletResponse res)
 throws ServletException, IOException {
 // Get parameter values
 String id = req.getParameter("id");
 String name = req.getParameter("name");
 String description = req.getParameter("description");
 // See if this file exists
 Document doc = null;
 String filename = outputDir + "item-" + id + ".xml";
 File outputFile = new File(filename);
 if (!outputFile.exists( )) {
 // Create new DOM tree
 DOMImplementation domImpl = new DOMImplementationImpl( );
 doc = domImpl.createDocument(null, "item", null);
 Element root = doc.getDocumentElement( );
 // ID of item (as attribute)
 root.setAttribute("id", id);
 // Name of item
 Element nameElement = doc.createElement("name");
 Text nameText = doc.createTextNode(name);
 nameElement.appendChild(nameText);
 root.appendChild(nameElement);
 // Description of item
 Element descriptionElement = doc.createElement("description");
 Text descriptionText = doc.createTextNode(description);
 descriptionElement.appendChild(descriptionText);
 root.appendChild(descriptionElement);
 } else {
 // Load document
 try {
 DOMParser parser = new DOMParser( );
 // toURI( ) escapes characters that the deprecated toURL( ) does not
 parser.parse(outputFile.toURI( ).toString( ));
 doc = parser.getDocument( );
 Element root = doc.getDocumentElement( );
 // Name of item
 NodeList nameElements = root.getElementsByTagName("name");
 Element nameElement = (Element)nameElements.item(0);
 Text nameText = (Text)nameElement.getFirstChild( );
 nameText.setData(name);
 // Description of item
 NodeList descriptionElements = root.getElementsByTagName("description");
 Element descriptionElement = (Element)descriptionElements.item(0);
 // Remove and recreate description
 root.removeChild(descriptionElement);
 descriptionElement = doc.createElement("description");
 Text descriptionText = doc.createTextNode(description);
 descriptionElement.appendChild(descriptionText);
 root.appendChild(descriptionElement);
 } catch (SAXException e) {
 // Print error
 PrintWriter out = res.getWriter( );
 res.setContentType("text/html");
 out.println("<HTML><BODY>Error in reading XML: " +
 e.getMessage( ) + ".</BODY></HTML>");
 out.close( );
 return;
 }
 }
 // Serialize DOM tree
 DOMSerializer serializer = new DOMSerializer( );
 serializer.serialize(doc, outputFile);
 // Print confirmation
 PrintWriter out = res.getWriter( );
 res.setContentType("text/html");
 out.println("<HTML><BODY>");
 out.println("<p>Thank you for your submission. " +
 "Your item has been processed.</p>");
 out.println("<p>Your item was saved as " + outputFile.getAbsolutePath( ) + "</p>");
 out.println("</BODY></HTML>");
 out.close( );
 }
}

The changes are fairly simple, nothing that should confuse you. The outputFile is created earlier in the doPost( ) method, so the code can check and see if it already exists. If not, the method behaves just like doPost( ) in UpdateItemServlet, with no changes. If the XML already exists (indicating the item has already been submitted), the XML file is loaded and read into a DOM tree. At that point, some basic tree traversal begins.

Traversing a DOM Tree

The code grabs the root element, and then uses the getElementsByTagName( ) method to locate all elements named name and then all named description. In each case, the returned NodeList will have only one item.

This assumption is safe because we authored the code that creates the XML. In many cases, you can't be sure of things like this, and will have to iterate through the returned NodeList item by item.

You can access this item using the item( ) method on the NodeList, and supplying 0 as the argument (the indexes are all zero-based).
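When you can't count on a single match, the safer pattern is to walk the whole NodeList using getLength( ) and item( ). The following sketch shows that loop; the class name, the helper methods, and the sample XML in main( ) are all hypothetical, and getTextContent( ) is a DOM Level 3 convenience that I'm assuming your parser supports.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeListExample {

    // Collects the text of every descendant element with the given name.
    public static List<String> textOfAll(Document doc, String tagName) {
        List<String> values = new ArrayList<String>();
        NodeList matches = doc.getDocumentElement().getElementsByTagName(tagName);
        // Indexes are zero-based, so the loop runs from 0 to getLength() - 1
        for (int i = 0; i < matches.getLength(); i++) {
            values.add(matches.item(i).getTextContent());
        }
        return values;
    }

    // Parses an XML string into a DOM tree (helper for this sketch only).
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<items><item><name>First</name></item>"
            + "<item><name>Second</name></item></items>");
        for (String name : textOfAll(doc, "name")) {
            System.out.println(name);
        }
    }
}
```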

You could have gotten the children of the root through getChildNodes( ), and peeled off the first and second elements. However, using the element names makes the code much clearer.
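There is a second reason to prefer element names: the list of children often contains more than elements. If the XML was indented, the whitespace between tags shows up as Text nodes, so "first" and "second" child are rarely what you expect. Here is a sketch of the filtering you'd need; the class and method names are hypothetical, as is the sample document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChildNodesExample {

    // Returns only the children that are elements, skipping whitespace
    // Text nodes, comments, and anything else that isn't an Element.
    public static List<Element> childElements(Element parent) {
        List<Element> elements = new ArrayList<Element>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                elements.add((Element) child);
            }
        }
        return elements;
    }

    // Parses an XML string into a DOM tree (helper for this sketch only).
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<item>\n <name>Simpson GA Guitar</name>\n"
            + " <description>A guitar</description>\n</item>");
        Element root = doc.getDocumentElement();
        System.out.println("All child nodes: " + root.getChildNodes().getLength());
        System.out.println("Child elements: " + childElements(root).size());
    }
}
```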

The code gets the name element's textual content by invoking getFirstChild( ). Since we know that the name element has a single Text node, the code can cast the result directly to the appropriate type (Text). Finally, the setData( ) method replaces the existing value with the new name, the updated information the user supplied through the form.

An equally effective, albeit different, approach is used for the description element. Instead of changing its value, the code just replaces the node wholesale. This is mostly for demonstration purposes; however, if you need to replace an element and all of its children, this is a cleaner and quicker approach.

It's no accident that this code is hardwired to the format the XML was written out to. In fact, most DOM modification code relies on at least some understanding of the content to be dealt with. Knowing how the XML is structured is a tremendous advantage. Methods like getFirstChild( ) can be used and the result cast to a specific type, rather than needing lengthy type checking and switch blocks.
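The two approaches can be tried outside a servlet with a few lines of standalone code. The sketch below is a variation on the servlet's logic, not a copy of it: it uses replaceChild( ), which swaps the new element in at the old element's position rather than moving it to the end of the root, and it guards against a name element that has no Text child. The class name, method names, and sample XML are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class ModifyExample {

    // Updates the name in place and replaces the description wholesale.
    public static void update(Document doc, String name, String description) {
        Element root = doc.getDocumentElement();

        // Approach 1: change the data of the existing Text node
        Element nameElement = (Element) root.getElementsByTagName("name").item(0);
        Node first = nameElement.getFirstChild();
        if (first instanceof Text) {
            ((Text) first).setData(name);
        } else {
            // No leading Text node (for example, an empty element): add one
            nameElement.appendChild(doc.createTextNode(name));
        }

        // Approach 2: build a new element and swap it in for the old one
        Element oldDescription =
            (Element) root.getElementsByTagName("description").item(0);
        Element newDescription = doc.createElement("description");
        newDescription.appendChild(doc.createTextNode(description));
        root.replaceChild(newDescription, oldDescription);
    }

    // Parses an XML string into a DOM tree (helper for this sketch only).
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<item id=\"1\"><name>Old name</name>"
            + "<description>Old description</description></item>");
        update(doc, "New name", "New description");
        Element root = doc.getDocumentElement();
        System.out.println(root.getElementsByTagName("name").item(0).getTextContent());
        System.out.println(root.getElementsByTagName("description").item(0).getTextContent());
    }
}
```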

For cases when the structure or format is unknown, the DOM Level 2 Traversal module is a better solution; that and other DOM modules are covered in .

Once the creation or modification is complete, the resulting DOM tree is serialized back to XML, and the process can repeat.