I'll start with the DOM Level 2 modules. You should expect to find support for most of these (with the repeated exception of HTML-related modules) in most modern DOM-compliant parsers.

Traversal

First up on the list is the DOM Level 2 Traversal module. This module provides tree-walking capability in a highly customizable manner. In particular, the DOM Traversal module is useful when you don't know, or aren't sure about, the structure of an XML document you're parsing. The whole of the traversal module is contained within the org.w3c.dom.traversal package. Just as everything within core DOM begins with a Document interface, everything in DOM Traversal begins with the org.w3c.dom.traversal.DocumentTraversal interface. This interface provides two methods:

NodeIterator createNodeIterator(Node root, int whatToShow, NodeFilter filter,
 boolean expandEntityReferences);
TreeWalker createTreeWalker(Node root, int whatToShow, NodeFilter filter,
 boolean expandEntityReferences);

Most DOM implementations that support traversal choose to have their org.w3c.dom.Document implementation class implement the DocumentTraversal interface as well; in Xerces, you can use the default Document implementation, and you're all set. DocumentTraversal is shown along with the rest of the traversal classes in the figure below.
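As a minimal sketch (assuming doc is a Document produced by a traversal-capable parser such as Xerces, as in the examples that follow), getting at the traversal methods is just a cast:

DocumentTraversal traversal = (DocumentTraversal)doc;
// traversal.createNodeIterator( ) and traversal.createTreeWalker( )
// are now available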

[Figure: DOM Traversal module]

There are just three other interfaces to worry about (all in the org.w3c.dom.traversal package); all focus on selecting certain DOM nodes and working with the results of that selection. NodeFilter does just what it sounds like: it provides a means of selecting only certain nodes based on filtering criteria. A NodeIterator provides a list view of the nodes iterated over, and TreeWalker provides a tree view of that same data.

Selecting nodes

One of the more popular applications in today's web-centric world is a spider, or crawler, that searches and indexes web pages. Google has also begun to add more and more power to its search engine, all in an effort to return the most relevant results for a given set of search terms. Along those same lines, it's a fairly common task to try to parse a given web page and determine what its subject is. While sometimes it's enough to simply extract words from the title of the page (within the title HTML element), that's not always sufficient. But how else can software make an educated guess at the focus of a site's content? One rather hackish approach (which, just coincidentally, serves the purposes of this text) is to key in on words formatted in a certain way. For example, you could grab all elements within a document that are in italics (<i>) or bold (<b>) text, as well as any headings (say, <h1>, <h2>, and <h3>). While this is somewhat crude, you'd be surprised at the number of times this sort of approach yields useful data. The resulting terms and phrases could then be used as a pool of data for a search application.

Of course, reading in HTML line by line and searching for a certain set of tags is a huge pain (and doesn't take advantage of structured markup). What you really want is to parse the HTML into a DOM tree, and then specify a custom traversal, selecting only the nodes that meet certain criteria. This is a perfect application of the DOM Traversal module.

Java Tip

I'm making the rather large (and potentially unsafe) assumption that you're dealing with XHTML: well-formed HTML. In cases where you don't have XHTML, just run the page through Tidy to clean up the tagging before filtering.
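As a rough sketch of that cleanup step, assuming the JTidy library (org.w3c.tidy.Tidy) and hypothetical filenames:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.w3c.tidy.Tidy;

// Convert messy HTML into XHTML before any DOM parsing (JTidy)
Tidy tidy = new Tidy( );
tidy.setXHTML(true);   // emit XHTML rather than HTML
tidy.setQuiet(true);   // suppress progress messages
tidy.parse(new FileInputStream("page.html"),
           new FileOutputStream("page.xhtml"));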

Keep in mind that narrowing down the set of Nodes that must be filtered is always a good idea; in the case of HTML, you only need to search the contents of the body element, as everything outside of that element is inconsequential for this example. The example that follows, then, reads in a file supplied on the command line, parses the file into a DOM tree, locates the body element, and does some basic filtering.

Example: Once you've got a DOM tree, you can set a basic NodeFilter and then traverse over the results of that filtering, using either NodeIterator or TreeWalker

package javaxml3;

import java.io.File;

// DOM imports
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;

// Vendor parser
import org.apache.xerces.parsers.DOMParser;
public class HTMLIndexer {
 public void index(String filename) throws Exception {
 // Parse into a DOM tree
 File file = new File(filename);
 DOMParser parser = new DOMParser( );
 parser.parse(file.toURL( ).toString( ));
 Document doc = parser.getDocument( );
 // Get node to start iterating with
 Element root = doc.getDocumentElement( );
 NodeList bodyElementList = root.getElementsByTagName("body");
 Element body = (Element)bodyElementList.item(0);
 // Get a NodeIterator
 NodeIterator i = ((DocumentTraversal)doc)
 .createNodeIterator(body, NodeFilter.SHOW_ALL, null, true);
 Node n;
 while ((n = i.nextNode( )) != null) {
 if (n.getNodeType( ) == Node.ELEMENT_NODE) {
 System.out.println("Encountered Element: '" + n.getNodeName( ) + "'");
 } else if (n.getNodeType( ) == Node.TEXT_NODE) {
 System.out.println("Encountered Text: '" + n.getNodeValue( ) + "'");
 }
 }
 }
 public static void main(String[] args) {
 if (args.length == 0) {
 System.out.println("No HTML files to search through specified.");
 return;
 }
 try {
 HTMLIndexer indexer = new HTMLIndexer( );
 for (int i=0; i<args.length; i++) {
 System.out.println("Processing file: " + args[i]);
 indexer.index(args[i]);
 }
 } catch (Exception e) {
 e.printStackTrace( );
 }
 }
}

As you can see, I've created a NodeIterator, supplying the body element as the starting point for iteration. The constant value passed as the filter instructs the iterator to show all nodes. You could just as easily provide values like NodeFilter.SHOW_ELEMENT and NodeFilter.SHOW_TEXT, which would show only elements or textual nodes, respectively. I haven't yet provided a NodeFilter implementation (I'll get to that next), and I allowed for entity reference expansion. What's nice about all this is that the iterator, once created, doesn't have just the child nodes of body; it actually has all nodes under body, even when nested multiple levels deep.

Java Tip This ability to select child nodes without knowing their structure makes the Traversal module handy for dealing with an unknown XML structure, which is exactly the case when working with HTML.

At this point, you still have all the nodes, which is not what you want. I added some code (the last while loop) to show you how to print out the element and text node results. You can run the code as is, but it's not really useful; you're going to get every bit of content within the body element. Instead, the code needs to provide a filter, so it only picks up elements with the formatting desired: the text within an i or b tag, or a heading element. You can provide this customized behavior by supplying a custom implementation of the NodeFilter interface, which defines only a single method:

public short acceptNode(Node n);

This method should always return NodeFilter.FILTER_SKIP, NodeFilter.FILTER_REJECT, or NodeFilter.FILTER_ACCEPT. The first skips the examined node but continues to iterate over its children; the second rejects the examined node and its children (only applicable with TreeWalker); and the third accepts and passes on the examined node. This behaves a lot like SAX, in that you can intercept nodes as they are being iterated over and decide whether they should be passed on to the calling method. Add the following nonpublic class to the HTMLIndexer.java source file:

class ImportantWordsFilter implements NodeFilter {
 public short acceptNode(Node n) {
 if (n.getNodeType( ) == Node.TEXT_NODE) {
 Node parent = n.getParentNode( );
 if ((parent.getNodeName( ).equalsIgnoreCase("b")) ||
 (parent.getNodeName( ).equalsIgnoreCase("i")) ||
 (parent.getNodeName( ).equalsIgnoreCase("h1")) ||
 (parent.getNodeName( ).equalsIgnoreCase("h2")) ||
 (parent.getNodeName( ).equalsIgnoreCase("h3"))) {
 return FILTER_ACCEPT;
 }
 }
 // If we got here, not interested
 return FILTER_SKIP;
 }
}

This is basic core DOM code, and shouldn't pose any difficulty to you. First, the code ignores anything but text nodes; the text of the formatted elements is desired, not the elements themselves. Next, the parent is determined, and since it's safe to assume that Text nodes have Element node parents, the code immediately invokes getNodeName( ). If the element name matches one of the "important" elements, the code returns FILTER_ACCEPT. Otherwise, FILTER_SKIP is returned. All that's left now is a change to the iterator creation call, instructing it to use the new filter implementation, and to the output, both in the existing index( ) method of the HTMLIndexer class:

// Get a NodeIterator
NodeIterator i = ((DocumentTraversal)doc)
 .createNodeIterator(body, NodeFilter.SHOW_ALL, new ImportantWordsFilter( ), true);
Node n;
while ((n = i.nextNode( )) != null) {
 System.out.println("Search phrase found: '" + n.getNodeValue( ) + "'");
}
Java Tip Some astute readers will wonder what happens when a NodeFilter implementation conflicts with the constant supplied to the createNodeIterator( ) method (in this case, that constant is NodeFilter.SHOW_ALL). The constant filter is applied first, and then the resulting list of nodes is passed to the filter implementation. If I had supplied the constant NodeFilter.SHOW_ELEMENT, I would not have gotten any search phrases, because my filter would never have received any Text nodes to examine, only Element nodes. Be careful to use the two together in a way that makes sense. In this example, I could have safely used NodeFilter.SHOW_TEXT as well.
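To make that concrete, here's the equivalent (and slightly leaner) creation call; since ImportantWordsFilter only ever accepts Text nodes, the SHOW_TEXT constant can safely do the first cut:

// SHOW_TEXT hands only Text nodes to the custom filter,
// which is all it examines anyway
NodeIterator i = ((DocumentTraversal)doc)
 .createNodeIterator(body, NodeFilter.SHOW_TEXT, new ImportantWordsFilter( ), true);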

Now the class is useful and ready to run. I ran it on the HTML from the front page of Ajaxian.com and got these results:

Processing file: Ajaxian-05242005.xhtml
Search phrase found: 'May 24, 2005'
Search phrase found: 'Ajax Slashdotted Again'
Search phrase found: 'JavaScript Threading and Continuations'
Search phrase found: 'Dean Edwards' IE7'
Search phrase found: 'Ajax Usability Mistakes'
Search phrase found: 'Not giving immediate visual cues for clicking widgets:'
Search phrase found: 'Breaking the back button:'
Search phrase found: 'Changing state with links (GET requests):'
Search phrase found: 'Blinking and changing parts of the page unexpectedly:'
Search phrase found: 'Not using links I can pass to friends or bookmark:'
Search phrase found: 'Too much code makes the browser slow:'
Search phrase found: 'Inventing new UI conventions:'
Search phrase found: 'Not cascading local changes to other parts of the page:'
Search phrase found: 'Asynchronously performing batch operations'
Search phrase found: 'Scrolling the page and making me lose my place:'
Search phrase found: 'May 23, 2005'
Search phrase found: 'Google Maps Platform: ChicagoCrime.org'
Search phrase found: 'Showcase: Lace - Ajaxian Chat Service'
Search phrase found: 'Server to Client callback via mod_pubsub'
Search phrase found: 'May 20, 2005'
Search phrase found: 'Oracle ADF Faces gets Ajaxian, er Partial Page Rendering'
Search phrase found: 'JavaServer Faces Ajaxian Components'
Search phrase found: 'Thoughts on Rich Clients and Ajax'
Search phrase found: 'May 19, 2005'
Search phrase found: 'XHR Server Validation with DWR'
Search phrase found: 'JavaScript'
Search phrase found: 'Servlet: web.xml'
Search phrase found: 'May 18, 2005'
Search phrase found: 'AjaxPatterns.org'
Search phrase found: 'Example: Live Preview'
Search phrase found: 'Search'
Search phrase found: 'Recent Entries'
Search phrase found: 'Contact Us'
Search phrase found: 'Resources'
Search phrase found: 'Feeds'
Search phrase found: 'Archives'

These turn out to be remarkably useful; note how often Ajax, patterns, JavaScript, rich clients, and related terms turn up!

Java Tip You could refine this further by setting up a pool of common terms to reject in your NodeFilter implementation. For example, you could return NodeFilter.FILTER_REJECT for terms like "Archives" and "Contact Us", as well as dates, to try to eliminate commonly appearing terms that aren't applicable; see the sketch that follows.
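A minimal sketch of that refinement; the stop list here is purely illustrative, and you'd tune it for your own pages. The class can sit in the same source file, alongside ImportantWordsFilter:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class RefinedWordsFilter implements NodeFilter {
 // Hypothetical stop list of boilerplate terms to throw away
 private static final Set STOP_TERMS = new HashSet(
  Arrays.asList(new String[] {"Archives", "Contact Us", "Search", "Feeds"}));

 public short acceptNode(Node n) {
  if (n.getNodeType( ) != Node.TEXT_NODE) {
   return FILTER_SKIP;
  }
  if (STOP_TERMS.contains(n.getNodeValue( ).trim( ))) {
   return FILTER_REJECT;  // common boilerplate; not applicable
  }
  Node parent = n.getParentNode( );
  String name = parent.getNodeName( );
  if (name.equalsIgnoreCase("b") || name.equalsIgnoreCase("i") ||
      name.equalsIgnoreCase("h1") || name.equalsIgnoreCase("h2") ||
      name.equalsIgnoreCase("h3")) {
   return FILTER_ACCEPT;
  }
  return FILTER_SKIP;
 }
}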

Walking filtered DOM trees

The TreeWalker interface is almost exactly the same as the NodeIterator interface; the only difference is that you get a tree view instead of a list view. This is primarily useful if you want to deal with only a certain type of node within a tree; for instance, you want to see a DOM tree with only elements, or without any comments. By using a constant filter value (such as NodeFilter.SHOW_ELEMENT) and a filter implementation (like one that returns FILTER_SKIP for all comments), you can construct a view of a DOM tree without extraneous information. The TreeWalker interface provides all the basic DOM node operations, such as firstChild( ), parentNode( ), nextSibling( ), and of course getCurrentNode( ).
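As a quick sketch, reusing the doc and body variables from the earlier example, here's a recursive walk over an elements-only view of the tree (printSubtree( ) is a hypothetical helper, shown below the creation call):

import org.w3c.dom.traversal.TreeWalker;

// Create a TreeWalker that exposes only Element nodes under body
TreeWalker walker = ((DocumentTraversal)doc)
 .createTreeWalker(body, NodeFilter.SHOW_ELEMENT, null, true);
printSubtree(walker, "");

public void printSubtree(TreeWalker walker, String indent) {
 Node current = walker.getCurrentNode( );
 System.out.println(indent + current.getNodeName( ));
 // firstChild( )/nextSibling( ) navigate the filtered view, not the raw tree
 for (Node child = walker.firstChild( ); child != null;
      child = walker.nextSibling( )) {
  printSubtree(walker, indent + "  ");
 }
 walker.setCurrentNode(current);  // restore position for the caller
}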

Range

The DOM Level 2 Range module is one of the least commonly used modules, probably due to a lack of understanding of the module rather than a lack of usefulness. This module provides a way to deal with a set of content within a document, en masse. Once you've defined that range of content, you can insert into it, copy it, delete parts of it, and manipulate it in various ways. The most important thing to start with is realizing that "range" in this sense refers to a number of pieces of a DOM tree grouped together. It does not refer to a set of allowed values, where a high and low or start and end are defined. Therefore, DOM Range has nothing at all to do with validation of data values. Like traversal, working with the Range module involves a new DOM package: org.w3c.dom.ranges. There are actually only two interfaces and one exception within this package, so it won't take you long to get your bearings (the figure below shows the UML for the package).

[Figure: DOM Range module]

First is the analog to Document (and DocumentTraversal): org.w3c.dom.ranges.DocumentRange. Like the DocumentTraversal interface, DocumentRange is implemented by Xerces's Document implementation class. And also like DocumentTraversal, it has very few interesting methods; in fact, only one:

public Range createRange( );

All other range operations operate upon the Range class (rather, an implementation of the interface, but you get the idea). Once you've got an instance of the Range interface, you can set the starting and ending points, and edit away. Just as the Traversal module is ideal for working with documents in which the structure is unknown, so is the Range module. You can set ranges based on starting and ending points, even if you don't know what comes between those two points. For example, it's very easy to clear out all of the contents of an HTML page's body element:

// Parse into a DOM tree
File file = new File(filename);
DOMParser parser = new DOMParser( );
parser.parse(file.toURL( ).toString( ));
Document doc = parser.getDocument( );

// Get the body element
Element root = doc.getDocumentElement( );
NodeList bodyElementList = root.getElementsByTagName("body");
Element body = (Element)bodyElementList.item(0);

// Nuke everything in the body tag
Range range = ((DocumentRange)doc).createRange( );
range.setStartBefore(body.getFirstChild( ));
range.setEndAfter(body.getLastChild( ));
range.deleteContents( );

// Release contents of the range
range.detach( );

To remove all the content, I first create a new Range, using the DocumentRange cast.

Java Tip You'll need to add import statements for the DocumentRange and Range classes, of course; both are in the org.w3c.dom.ranges package.

Once the range is created, set the starting and ending points. Since I want all content within the body element, I start before the first child of that Element node (using setStartBefore( )), and end after its last child (using setEndAfter( )).

Java Tip There are other, similar methods for this task, like setStartAfter( ) and setEndBefore( ).
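There's also a convenience method that selects everything inside a given node in one call; this one-line sketch is equivalent to the two boundary calls above:

// Select all of body's contents in a single call
range.selectNodeContents(body);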

With the range set up, it's simple to call deleteContents( ). Just like that, not a bit of content is left. Finally, let the JVM know that it can release any resources associated with the Range by calling detach( ). While this step is commonly overlooked, it can really help in lengthy bits of code that could use the extra resources. Another option is to use extractContents( ) instead of deleteContents( ). This method removes content, but returns the content that has been removed. For instance, you might grab a blog entry from an XML feed, and move it from the current listings to an archive section:

// Parse into a DOM tree
File file = new File(filename);
DOMParser parser = new DOMParser( );
parser.parse(file.toURL( ).toString( ));
Document doc = parser.getDocument( );

// Get the current blog element
Element root = doc.getDocumentElement( );
NodeList blogElementList = root.getElementsByTagName("blog");
Element currentBlog = (Element)blogElementList.item(0);

// Extract the first entry from the current blog
Range range = ((DocumentRange)doc).createRange( );
range.setStartBefore(currentBlog.getFirstChild( ));
range.setEndAfter(currentBlog.getFirstChild( ));
Node entry = range.extractContents( );

// Insert the entry into the archived blogs listings
NodeList archiveBlogList = root.getElementsByTagName("archived-blogs");
Element archiveBlog = (Element)archiveBlogList.item(0);
archiveBlog.insertBefore(entry, archiveBlog.getFirstChild( ));

// Release contents of the range
range.detach( );
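If you want a copy rather than a move, Range also offers cloneContents( ), which duplicates the selected content and leaves the original document untouched. A minimal variation on the code above:

// Copy the entry instead of removing it
Node copy = range.cloneContents( );
archiveBlog.insertBefore(copy, archiveBlog.getFirstChild( ));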

Events, Views, and Style

Aside from the HTML module, which I'll talk about next, there are three other DOM Level 2 modules: Events, Views, and Style. I'm not going to cover these three in depth in this tutorial, largely because I believe they are more useful for client programming. So far, I've focused on server-side programming, and I'm going to stay in that vein throughout most of the tutorial. These three modules are most often used in client software such as IDEs, web pages, and the like. Still, I want to briefly touch on each so you'll still be on top of the DOM heap at the next alpha-geek soirée.

Events

The Events module provides just what you are probably expecting: a means of "listening" to a DOM document. The relevant classes are in the org.w3c.dom.events package, and the class that gets things going is DocumentEvent. No surprise here; compliant parsers (like Xerces) implement this interface in the same class that implements org.w3c.dom.Document. The interface defines only one method:

public Event createEvent(String eventType);

The string passed in is the type of event; valid values in DOM Level 2 are "UIEvents", "MutationEvents", and "MouseEvents". Each of these has a corresponding interface: UIEvent, MutationEvent, and MouseEvent. The figure below provides a visual take on this module. You'll note, in looking at the Xerces Javadoc, that Xerces provides only the MutationEvent interface, the only event type it supports. When an event is "fired" off, it can be handled (or "caught") by an EventListener.

[Figure: DOM Events module]

This is where the DOM core support comes in: a parser supporting DOM events should have its org.w3c.dom.Node implementation also implement the org.w3c.dom.events.EventTarget interface, so every node can be the target of an event. This means that you have the following method available on any Node, inherited from the EventTarget interface:

public void addEventListener(String type, EventListener listener, boolean capture);

To use the module, create a new EventListener implementation. You need to implement only a single method:

public void handleEvent(Event event);

Register that listener on any and all nodes you want to work with. The code in this method typically does some useful task, like emailing users that their information has changed (in some XML file), revalidating the XML (think XML editors), or asking users if they are sure they want to perform the requested action. At the same time, you'll want your code to trigger a new Event on certain actions, like the user clicking on a node in an IDE and entering new text, or deleting a selected element. When the Event is triggered, it is passed to the available EventListener instances, starting with the active node and moving up. This is where your listener's code executes, if the event types match; if they don't, the propagation continues. When your code does see a matching event and executes, it can stop the propagation, or continue to bubble the event up the chain, allowing it to be (possibly) handled by other registered listeners.
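Here's a minimal sketch of that flow, assuming a parser whose nodes implement EventTarget (Xerces does, for mutation events); "DOMNodeInserted" is one of the standard DOM Level 2 mutation event types, and doc and body are reused from the earlier examples:

import org.w3c.dom.events.Event;
import org.w3c.dom.events.EventListener;
import org.w3c.dom.events.EventTarget;

class InsertionLogger implements EventListener {
 public void handleEvent(Event event) {
  System.out.println("Mutation event fired: " + event.getType( ));
 }
}

// Register for node insertions anywhere under the body element
EventTarget target = (EventTarget)body;
target.addEventListener("DOMNodeInserted", new InsertionLogger( ), false);

// A subsequent DOM edit fires the event automatically
body.appendChild(doc.createElement("p"));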

Views

Next on the list is DOM Level 2 Views. The reason I don't cover Views in much detail is that, really, there is very little to be said. From every reading I can make of the (one-page!) specification, it's simply a basis for future work, perhaps in vertical markets. The specification defines only two interfaces, both in the org.w3c.dom.views package. Here's the first:

package org.w3c.dom.views;
public interface AbstractView {
 public DocumentView getDocument( );
}

And here's the second:

package org.w3c.dom.views;
public interface DocumentView {
 public AbstractView getDefaultView( );
}

The figure below gives you a UML-ish view of these, for the visually inclined.

[Figure: DOM Views module]

Seems a bit cyclical, doesn't it? A single source document (a DOM tree) can have multiple views associated with it. In this case, view refers to a presentation, like a styled document (after XSL or CSS has been applied), or perhaps a version with Flash and one without. By implementing the AbstractView interface, you can define your own customized versions of displaying a DOM tree. For example, consider this subinterface:

package javaxml3;
import org.w3c.dom.views.AbstractView;
public interface StyledView extends AbstractView {
 public void setStylesheet(String stylesheetURI);
 public String getStylesheetURI( );
}

I've left out any implementing class, but you can see how this interface could be used to provide stylized views of a DOM tree. Additionally, a compliant parser implementation would have the org.w3c.dom.Document implementation implement DocumentView, allowing you to query a document for its default view. It's expected that in a later version of the specification, you will be able to register multiple views for a document, and more closely tie a view or views to a document.

Java Tip Look for this to be fleshed out more as browsers like Netscape, Mozilla, and Internet Explorer provide these sorts of views of XML.

Style

Finally, there is the Style module, referred to as simply CSS; you can check the specification out on the W3C web site. This module provides a binding that allows CSS stylesheets to be represented by DOM constructs. Everything of interest is in the org.w3c.dom.stylesheets and org.w3c.dom.css packages (see the figures below). The former contains generic base classes, and the latter provides specific application of those classes to CSS. Both are primarily used for showing a client a styled document.

[Figure: DOM Stylesheets module]

[Figure: DOM Stylesheet implementation for CSS]

You use this module exactly like you use the core DOM interfaces: you get a Style-compliant parser, parse a stylesheet, and use the CSS language bindings. This is particularly handy when you want to parse a CSS stylesheet and apply it to a DOM document. You're working from the same basic set of concepts, if that makes sense to you (and it should; when you can do two things with an API instead of one, that's generally good). Again, I only briefly touch on the Style module, because it's entirely accessible through the Javadoc. The classes are aptly named (CSSValueList, Rect, CSSDOMImplementation) and are close enough to their XML DOM counterparts that I'm confident you'll have no problem using them if you need to.
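As a small sketch of those bindings, assuming you've already obtained a CSSStyleSheet instance from a Style-compliant implementation (how you parse the stylesheet in the first place varies by vendor):

import org.w3c.dom.css.CSSRule;
import org.w3c.dom.css.CSSRuleList;
import org.w3c.dom.css.CSSStyleRule;
import org.w3c.dom.css.CSSStyleSheet;

// sheet is a CSSStyleSheet from your Style-compliant implementation
CSSRuleList rules = sheet.getCssRules( );
for (int j = 0; j < rules.getLength( ); j++) {
 CSSRule rule = rules.item(j);
 if (rule.getType( ) == CSSRule.STYLE_RULE) {
  CSSStyleRule styleRule = (CSSStyleRule)rule;
  System.out.println("Selector: " + styleRule.getSelectorText( ));
  System.out.println("  Style: " + styleRule.getStyle( ).getCssText( ));
 }
}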

HTML

For HTML, DOM provides a set of interfaces that model the various HTML elements. For example, you can use the HTMLDocument class, the HTMLAnchorElement, and the HTMLSelectElement (all in the org.w3c.dom.html package) to represent their analog tags in HTML (<HTML>, <A>, and <SELECT> in this case). All of these provide convenience methods like setTitle( ) (on HTMLDocument), setHref( ) (on HTMLAnchorElement), and getOptions( ) (on HTMLSelectElement). Further, these all extend core DOM structures like Document and Element, and so can be used as any other DOM node could. The HTML package has more than 50 interfaces in it; the figure below contains the UML for a few of them.

[Figure: Sampling of DOM HTML interfaces]

Personally, I find these classes a bit cumbersome; for example, I know DOM well enough (and you should by now, too) to prefer calling getFirstChild( ) or setNodeValue( ) to remembering all the HTML-specific methods, like setLink( ) or getEnctype( ). The only time I find the HTML module of much practical use is in creating new documents that are to be output as HTML. In these cases, it's sometimes nice to use calls like HTMLFormElement.setAction( ), because there's no mistaking what the method does (in this case, it's also a lot nicer than creating an attribute called action, and then setting its value). Unfortunately, there's a rather nasty catch when working with the HTML module: you become tied to a specific implementation very quickly. There is no factory for creating HTML elements, so you have to write code like this:

HTMLFormElement form1 = new my.vendor.html.HTMLFormElementImpl( );

This ties you very quickly to your specific parser, especially when you consider that most HTML documents have hundreds of elements (meaning hundreds of these vendor-specific instantiations). For all of these reasons, I've found the HTML module to be a real pain to work with in general application programming.
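For what it's worth, here's what that lock-in looks like with Xerces, whose HTML DOM implementation lives in org.apache.html.dom (a sketch; the page content and action URL are hypothetical):

import org.w3c.dom.html.HTMLDocument;
import org.w3c.dom.html.HTMLFormElement;
// Xerces-specific implementation class
import org.apache.html.dom.HTMLDocumentImpl;

HTMLDocument htmlDoc = new HTMLDocumentImpl( );
htmlDoc.setTitle("Generated Page");

// From here on the code is vendor-neutral again:
// createElement on an HTMLDocument yields HTML-specific types
HTMLFormElement form = (HTMLFormElement)htmlDoc.createElement("form");
form.setAction("/submit");  // clearer than setAttribute("action", ...)
// getBody( ) returns the document body (Xerces creates one on demand)
htmlDoc.getBody( ).appendChild(form);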

Java Tip

All that said, if you've got a closed-box solution, like a piece of software that is packaged and sold, the HTML module might be perfect. You probably don't need to worry much about changing a parser, as your company has invested money into one, and any major rewrite would affect the whole product cycle, anyway. So there are uses of the HTML module that are very legitimate.