Parsing with StAX - XML - Java Programming Language

Reading an XML document with the two StAX reader interfaces is relatively similar. Both XMLStreamReader and XMLEventReader provide an interface similar to java.util.Iterator. XMLEventReader extends Iterator, whereas XMLStreamReader has methods named hasNext( ) and next( ), just as Iterator does, but its next( ) method returns an int, not an Object. Because of this relation to Iterator, the primary use of either interface looks like one of the event loops in Examples 8-1 and 8-2.

Example 8-1. Basic XMLStreamReader event loop

while (streamReader.hasNext( )) {
 int eventTypeID = streamReader.next( );
 // do something
}

Example 8-2. Basic XMLEventReader event loop

while (eventReader.hasNext( )) {
 XMLEvent event = (XMLEvent) eventReader.next( );
 // do something with event
}

Creating a Reader

As described above, javax.xml.stream.XMLInputFactory is used to create instances of XMLStreamReader and XMLEventReader. XMLInputFactory has six overloaded methods named createXMLStreamReader( ) for creating XMLStreamReader instances and seven overloaded methods named createXMLEventReader( ) for creating XMLEventReader instances (the seventh creates an XMLEventReader that wraps an already-created XMLStreamReader). The parameters that can be passed to these create methods are:

A java.io.InputStream
A java.io.InputStream and a character encoding
A java.io.InputStream and a system ID to use for resolving relative URIs
A java.io.Reader
A java.io.Reader and a system ID to use for resolving relative URIs
A javax.xml.transform.Source

The last of these, javax.xml.transform.Source, is optional. If an implementation does not provide support for Source inputs, both createXMLEventReader( ) and createXMLStreamReader( ) will throw a java.lang.UnsupportedOperationException. One case of an implementation that does not support Source inputs is the reference implementation. If you already have a String object in memory containing your XML document, you can use java.io.StringReader as follows:

// I have the document as a String named documentAsString
StringReader stringReader = new StringReader(documentAsString);
XMLInputFactory inputFactory = XMLInputFactory.newInstance( );
XMLStreamReader reader = inputFactory.createXMLStreamReader(stringReader);

By default, XMLInputFactory will create XMLStreamReader and XMLEventReader instances that are nonvalidating and namespace aware. These defaults can be changed by calling the setProperty( ) method on XMLInputFactory. We'll discuss this later in the "Factory Properties" section.
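As a minimal sketch of the steps above, the following class creates an XMLStreamReader from an in-memory String and runs the basic event loop, collecting each event type ID. The class and method names are illustrative and not part of the original examples, and the sketch assumes the StAX implementation bundled with the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
public class StringReaderLoopSketch {

    /** Returns the event type IDs that next() reports for the given document. */
    public static List<Integer> eventTypes(String documentAsString)
            throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        XMLStreamReader reader = inputFactory
                .createXMLStreamReader(new StringReader(documentAsString));
        List<Integer> types = new ArrayList<>();
        try {
            // The same loop shape as the basic XMLStreamReader event loop
            while (reader.hasNext()) {
                types.add(reader.next());
            }
        } finally {
            reader.close();
        }
        return types;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Prints the numeric IDs defined in XMLStreamConstants
        System.out.println(eventTypes("<greeting>hello</greeting>"));
    }
}
```

The IDs in the returned list can be compared against the constants in javax.xml.stream.XMLStreamConstants.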

XMLStreamReader

XMLStreamReader is the parsing interface from the cursor API. As mentioned above, it does not extend java.util.Iterator, but it does resemble it: as with Iterator, you call hasNext( ) to determine whether there are more events to process in the document. A UML diagram of the full XMLStreamReader interface follows.

The XMLStreamReader interface

That's a lot of methods, but in practice, you'll use relatively few of them. More importantly, not all of the methods can be called for every event. For example, if you call getText( ), which returns the text content of an event, on an END_ELEMENT event, a java.lang.IllegalStateException is thrown because, to put it simply, you are in an illegal state to call getText( ). Table 8-3 lists the available methods for each event type.

Table 8-3. Available XMLStreamReader methods by event type

Event type	Methods available
All event types	`getProperty( )`, `hasNext( )`, `require( )`, `close( )`, `getNamespaceURI( )`, `isStartElement( )`, `isEndElement( )`, `isCharacters( )`, `isWhiteSpace( )`, `getNamespaceContext( )`, `getEventType( )`, `getLocation( )`, `hasText( )`, `hasName( )`
START_ELEMENT	`next( )`, `getName( )`, `getLocalName( )`, `hasName( )`, `getPrefix( )`, `nextTag( )`, `getAttributeXXX( )`, `isAttributeSpecified( )`, `getNamespaceXXX( )`, `getElementText( )`
END_ELEMENT	`next( )`, `getName( )`, `getLocalName( )`, `hasName( )`, `getPrefix( )`, `nextTag( )`, `getNamespaceXXX( )`
PROCESSING_INSTRUCTION	`next( )`, `getPITarget( )`, `getPIData( )`, `nextTag( )`
CHARACTERS	`next( )`, `getTextXXX( )`, `nextTag( )`
COMMENT	`next( )`, `getTextXXX( )`, `nextTag( )`
SPACE	`next( )`, `getTextXXX( )`, `nextTag( )`
START_DOCUMENT	`next( )`, `getEncoding( )`, `getVersion( )`, `isStandalone( )`, `standaloneSet( )`, `getCharacterEncodingScheme( )`, `nextTag( )`
END_DOCUMENT	`next( )`, `getText( )`, `nextTag( )`
ENTITY_REFERENCE	`next( )`, `getLocalName( )`, `getText( )`, `nextTag( )`
ATTRIBUTE	`next( )`, `nextTag( )`, `getAttributeXXX( )`, `isAttributeSpecified( )`
DTD	`next( )`, `getText( )`, `nextTag( )`
CDATA	`next( )`, `getTextXXX( )`, `nextTag( )`
NAMESPACE	`next( )`, `nextTag( )`, `getNamespaceXXX( )`
NOTATION_DECLARATION	Not defined
ENTITY_DECLARATION	Not defined
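The restriction described before Table 8-3 can be observed directly. The sketch below advances to the first END_ELEMENT event and then calls getText( ), reporting whether an IllegalStateException was raised. The class name is hypothetical, and the behavior shown is what the JDK's bundled StAX implementation is assumed to do.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, for illustration only.
public class IllegalStateSketch {

    /** Advances to the first END_ELEMENT and reports what getText() does there. */
    public static String getTextOnEndElement(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.END_ELEMENT) {
                // hasText() is listed for all event types, so it can guard getText()
                boolean hasText = reader.hasText();
                try {
                    reader.getText();
                    return "no exception, hasText=" + hasText;
                } catch (IllegalStateException e) {
                    return "IllegalStateException, hasText=" + hasText;
                }
            }
        }
        return "no END_ELEMENT found";
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(getTextOnEndElement("<note>text</note>"));
    }
}
```

Checking hasText( ) before calling getText( ) is one way to avoid the exception when the event type is not known in advance.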

Earlier in this book, we built a Java app that created a tree view of an XML document. We had a class called SAXTreeViewer that extended javax.swing.JFrame and another class called JtreeHandler that implemented two SAX interfaces: ContentHandler and ErrorHandler. SAXTreeViewer created the tree's model, instantiated JtreeHandler and a SAX parser, and then asked the SAX parser to parse a document. Meanwhile, JtreeHandler was responsible for receiving the parser events and creating nodes in the tree model. To do the same thing with the StAX cursor API, let's start by reusing much of the boilerplate code from that example. Example 8-3 contains the boilerplate code that creates the Swing components we'll use to display our tree as well as the code to obtain the XMLStreamReader implementation.

Example 8-3. StAXStreamTreeViewer

package javaxml3;
import java.awt.BorderLayout;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTree;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.DefaultTreeModel;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
public class StAXStreamTreeViewer extends JFrame {
 /** The base tree to render */
 private JTree jTree;
 /** Tree model to use */
 DefaultTreeModel defaultTreeModel;
 public StAXStreamTreeViewer( ) {
 // Handle Swing setup
 super("StAX Tree Viewer");
 setSize(600, 450);
 }
 public void init(File file) throws XMLStreamException, FileNotFoundException {
 DefaultMutableTreeNode base = new DefaultMutableTreeNode(
 "XML Document: " + file.getAbsolutePath( ));
 // Build the tree model
 defaultTreeModel = new DefaultTreeModel(base);
 jTree = new JTree(defaultTreeModel);
 // Construct the tree hierarchy
 buildTree(defaultTreeModel, base, file);
 // Display the results
 getContentPane( ).add(new JScrollPane(jTree), BorderLayout.CENTER);
 }
 // Swing-related variables and methods, including
 // setting up a JTree and basic content pane
 public static void main(String[] args) {
 try {
 if (args.length != 1) {
 System.out.println("Usage: java javaxml3.StAXStreamTreeViewer "
 + "[XML Document]");
 return;
 }
 StAXStreamTreeViewer viewer = new StAXStreamTreeViewer( );
 File f = new File(args[0]);
 viewer.init(f);
 viewer.setVisible(true);
 } catch (Exception e) {
 e.printStackTrace( );
 }
 }
 public void buildTree(DefaultTreeModel treeModel,
 DefaultMutableTreeNode current, File file)
 throws XMLStreamException, FileNotFoundException {
 FileInputStream inputStream = new FileInputStream(file);
 XMLInputFactory inputFactory = XMLInputFactory.newInstance( );
 XMLStreamReader reader = inputFactory
 .createXMLStreamReader(inputStream);
 // parse away!
 }
}

The START_DOCUMENT event

When you create an XMLStreamReader instance, the reader positions itself at the start of the document. This is represented by the START_DOCUMENT event. As a result, if we were to insert:

 System.out.println(reader.getEventType( ));

at the end of the buildTree( ) method in Example 8-3, the number 7 would be printed to the console, as 7 is the value of XMLStreamConstants.START_DOCUMENT. A quick glance back at Table 8-3 shows that we can get a few interesting bits of information about this document's declaration that weren't available in the SAX version:

public void buildTree(DefaultTreeModel treeModel,
 DefaultMutableTreeNode current, File file)
 throws XMLStreamException, FileNotFoundException {
 FileInputStream inputStream = new FileInputStream(file);
 XMLInputFactory inputFactory = XMLInputFactory.newInstance( );
 XMLStreamReader reader = inputFactory
 .createXMLStreamReader(inputStream);
 addStartDocumentNodes(reader, current);
 // parse rest of document
 }
 private void addStartDocumentNodes(XMLStreamReader reader,
 DefaultMutableTreeNode current) {
 DefaultMutableTreeNode version = new DefaultMutableTreeNode(
 "XML Version: " + reader.getVersion( ));
 current.add(version);
 DefaultMutableTreeNode standalone = new DefaultMutableTreeNode(
 "Standalone? " + reader.isStandalone( ));
 current.add(standalone);
 DefaultMutableTreeNode standaloneSet = new DefaultMutableTreeNode(
 "Was Standalone Set? " + reader.standaloneSet( ));
 current.add(standaloneSet);
 DefaultMutableTreeNode encoding = new DefaultMutableTreeNode(
 "Encoding: " + reader.getEncoding( ));
 current.add(encoding);
 DefaultMutableTreeNode declaredEncoding = new DefaultMutableTreeNode(
 "Declared Encoding Scheme: "
 + reader.getCharacterEncodingScheme( ));
 current.add(declaredEncoding);
 }

Note that through these methods, you are able to discover the standalone and encoding values from the XML declaration and, in addition, which of these values were in the declaration. You may also run across XML files that do not have a declaration at all. In these cases, the result of getVersion( ) will be null. If we have an XML document with this declaration:

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>

the output of StAXStreamTreeViewer looks like the following figure.

Tree viewer output for XML declaration
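The same declaration values can be inspected without any Swing code. The following sketch, which uses a hypothetical class name and assumes the JDK's bundled StAX implementation, reads the document from a String and summarizes what the reader exposes while it is positioned at START_DOCUMENT. It leaves out getEncoding( ), since the input encoding is not meaningful when the document is supplied through a java.io.Reader.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, for illustration only.
public class DeclarationSketch {

    /** Summarizes the XML declaration values visible at START_DOCUMENT. */
    public static String describe(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        // A freshly created reader is positioned at START_DOCUMENT
        if (reader.getEventType() != XMLStreamConstants.START_DOCUMENT) {
            throw new IllegalStateException("reader is not at START_DOCUMENT");
        }
        return "version=" + reader.getVersion()
                + ", standalone=" + reader.isStandalone()
                + ", standaloneSet=" + reader.standaloneSet()
                + ", declaredEncoding=" + reader.getCharacterEncodingScheme();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(describe(
                "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\" ?><doc/>"));
    }
}
```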

Parsing the rest of the document

For the purpose of this example, we'll parse the actual document inside a while loop. As long as hasNext( ) returns true, there is more of the document to be parsed. Inside this loop, the cursor is advanced through the document by calls to the next( ) method. If next( ) is called when the cursor is already at the end of the document (that is, when hasNext( ) returns false), a java.util.NoSuchElementException is thrown. The result of the call to next( ) is the event type ID.

 private void parseRestOfDocument(XMLStreamReader reader,
 DefaultMutableTreeNode current) throws XMLStreamException {
 while (reader.hasNext( )) {
 int type = reader.next( );
 System.out.println(type);
 }
 }

This method produces a sequence of numbers. The exact sequence depends upon the XML document you are parsing. The simplest document, one containing a single empty element, produces a 1 for the START_ELEMENT event, then a 2 for the END_ELEMENT event, and finally an 8 for the END_DOCUMENT event. For our example app, we want to create a new tree node on START_ELEMENT events. These nodes should contain a subnode for the namespace, if any, of the element as well as subnodes for each attribute. On an END_ELEMENT event, we need to walk back up the tree so that the parent of the current node becomes the new current node. To provide different handling for each event type, we will use the switch construct. Our new parseRestOfDocument( ) method is in Example 8-4.

Example 8-4. New parseRestOfDocument( ) method

private void parseRestOfDocument(XMLStreamReader reader,
 DefaultMutableTreeNode current) throws XMLStreamException {
 while (reader.hasNext( )) {
 int type = reader.next( );
 switch (type) {
 case XMLStreamConstants.START_ELEMENT:
 DefaultMutableTreeNode element = new DefaultMutableTreeNode(
 "Element: " + reader.getLocalName( ));
 current.add(element);
 current = element;
 // Determine namespace
 if (reader.getNamespaceURI( ) != null) {
 String prefix = reader.getPrefix( );
 if (prefix == null) {
 prefix = "[None]";
 }
 DefaultMutableTreeNode namespace = new DefaultMutableTreeNode(
 "Namespace: prefix = '" + prefix + "', URI = '"
 + reader.getNamespaceURI( ) + "'");
 current.add(namespace);
 }
 if (reader.getAttributeCount( ) > 0) {
 for (int i = 0; i < reader.getAttributeCount( ); i++) {
 DefaultMutableTreeNode attr = new DefaultMutableTreeNode(
 "Attribute (name = '"
 + reader.getAttributeLocalName(i)
 + "', value = '"
 + reader.getAttributeValue(i) + "')");
 String attURI = reader.getAttributeNamespace(i);
 if (attURI != null) {
 String attPrefix = reader.getAttributePrefix(i);
 if (attPrefix == null || attPrefix.equals("")) {
 attPrefix = "[None]";
 }
 DefaultMutableTreeNode attNs = new DefaultMutableTreeNode(
 "Namespace: prefix = '" + attPrefix
 + "', URI = '" + attURI + "'");
 attr.add(attNs);
 }
 current.add(attr);
 }
 }
 break;
 case XMLStreamConstants.END_ELEMENT:
 current = (DefaultMutableTreeNode) current.getParent( );
 break;
 default:
 System.out.println(type);
 }
 }
}

The node manipulation code is relatively similar to that used in the SAX version. The main difference is in how the node data is populated. When creating the element node, we set its title to be "Element:" followed by the local name of the element. To get the local name, we call getLocalName( ) on the reader object itself; to get the namespace URI of the element, we call getNamespaceURI( ); and so on. The following figure shows the result of this newly added code when parsing the default web.xml file from the Apache Tomcat servlet container. You will also still see some event IDs being written to the console for events other than START_ELEMENT and END_ELEMENT. At a minimum, this will be an 8 for the END_DOCUMENT event.

Tree viewer output with namespaces
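For readers who want to try the START_ELEMENT and END_ELEMENT handling without a Swing window, the following sketch applies the same idea to plain text. A depth counter plays the role of the current tree node, and each element and attribute becomes an indented line. The class name and output format are assumptions made for this illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, for illustration only.
public class OutlineSketch {

    /** Builds an indented outline of the elements and attributes in a document. */
    public static String outline(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        StringBuilder out = new StringBuilder();
        int depth = 0; // stands in for the "current" tree node
        while (reader.hasNext()) {
            switch (reader.next()) {
            case XMLStreamConstants.START_ELEMENT:
                indent(out, depth).append("Element: ")
                        .append(reader.getLocalName()).append('\n');
                depth++; // descend, like current = element
                for (int i = 0; i < reader.getAttributeCount(); i++) {
                    indent(out, depth).append("Attribute: ")
                            .append(reader.getAttributeLocalName(i))
                            .append(" = ")
                            .append(reader.getAttributeValue(i)).append('\n');
                }
                break;
            case XMLStreamConstants.END_ELEMENT:
                depth--; // walk back up, like current = current.getParent()
                break;
            default:
                break; // other events are ignored in this sketch
            }
        }
        return out.toString();
    }

    private static StringBuilder indent(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        return out;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.print(outline("<a x=\"1\"><b/></a>"));
    }
}
```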

You may be asking why there are events named NAMESPACE and ATTRIBUTE defined in Table 8-1 since we're able to get the namespace and attribute information from the various getNamespaceXXX( ) and getAttributeXXX( ) methods on the reader object. In fact, when parsing a full document, you'll never run across the NAMESPACE and ATTRIBUTE events. They exist because in some cases, the attribute may be returned directly, outside the context of an element. The only example of this given in the StAX specification is XPath. You also use these events when creating documents with the writer interfaces as discussed in the "Document Output with StAX" section later in this chapter.

Getting character data

The CHARACTERS and CDATA events occur when character data is encountered in the document. Retrieving the character data as a String is done by simply calling getText( ) on the reader. There are also methods for retrieving this data as a character array.

 case XMLStreamConstants.CHARACTERS:
 case XMLStreamConstants.CDATA:
 DefaultMutableTreeNode data = new DefaultMutableTreeNode(
 "Character Data: '" + reader.getText( ) + "'");
 current.add(data);
 break;
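The character-array accessors mentioned above are getTextCharacters( ), getTextStart( ), and getTextLength( ). The sketch below, with a hypothetical class name, shows one way to combine them. The returned array should be treated as a buffer owned by the parser, so only the region described by the start offset and length is read.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, for illustration only.
public class TextCharactersSketch {

    /** Returns the text of the first CHARACTERS event, read via the char array. */
    public static String firstText(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Coalescing asks the parser to deliver adjacent text as a single event
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory
                .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                // Only [start, start + length) belongs to the current event
                char[] buffer = reader.getTextCharacters();
                return new String(buffer, reader.getTextStart(),
                        reader.getTextLength());
            }
        }
        return null; // the document contained no character data
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(firstText("<a>hello</a>"));
    }
}
```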

Whitespace handling

An event called SPACE occurs when the parser encounters ignorable whitespace in the document. If you do not have a DTD for your XML document, the parser cannot determine which whitespace is ignorable and which is significant. As a result, without a DTD, all whitespace results in CHARACTERS events, but with a DTD, the SPACE event occurs as appropriate. For example, the XML document in Example 8-5, which has no DTD, produces CHARACTERS events for all of its whitespace, as the figure following the example shows.

Example 8-5. XML document with whitespace

<?xml version="1.0"?>
<person>
 <name>
 <first_name>Alan</first_name>
 <last_name>Turing</last_name>
 </name>
 <profession>computer scientist</profession>
 <profession>mathematician</profession>
 <profession>cryptographer</profession>
</person>

Whitespace as CHARACTERS events without a DTD

But when we provide an internal DTD that strictly limits where whitespace is significant (inside the first_name, last_name, and profession elements), as in Example 8-6, we get SPACE events for the ignorable whitespace, leading to the output displayed in the figure following the example.

Example 8-6. XML document with internal DTD defining ignorable whitespace

<?xml version="1.0"?>
<!DOCTYPE person [
 <!ELEMENT first_name (#PCDATA)>
 <!ELEMENT last_name (#PCDATA)>
 <!ELEMENT profession (#PCDATA)>
 <!ELEMENT name (first_name, last_name)>
 <!ELEMENT person (name, profession*)>
]>
<person>
 <name>
 <first_name>Alan</first_name>
 <last_name>Turing</last_name>
 </name>
 <profession>computer scientist</profession>
 <profession>mathematician</profession>
 <profession>cryptographer</profession>
</person>

Whitespace as SPACE events with a DTD

In addition to the cleaner output, you will see the number 6 (the event ID for SPACE) appear in the console. To get rid of this, we can add a new case for the SPACE event:

 case XMLStreamConstants.SPACE:
 // let's ignore this
 break;

Because there are frequently cases in which no DTD is defined but the document contains a significant amount of whitespace you know you can ignore (even if the XML specification thinks you can't ignore it), StAX provides an easy-to-use mechanism to skip a CHARACTERS event if it contains only whitespace. If we replace the section of parseRestOfDocument( ) that handles the CHARACTERS event with the code in Example 8-7, we will get the same output for Examples 8-5 and 8-6.

Example 8-7. New CHARACTERS handling code

 case XMLStreamConstants.CHARACTERS:
 if (!reader.isWhiteSpace( )) {
 DefaultMutableTreeNode data = new DefaultMutableTreeNode(
 "Character Data: '" + reader.getText( ) + "'");
 current.add(data);
 }
 break;
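The isWhiteSpace( ) check can also be tried outside the tree viewer. The sketch below, with a hypothetical class name, collects only the CHARACTERS events that contain something other than whitespace from a DTD-less document. Coalescing is switched on as a precaution so that each run of text arrives as one event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, for illustration only.
public class WhitespaceSketch {

    /** Returns the text of every CHARACTERS event that is not all whitespace. */
    public static List<String> significantText(String xml)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory
                .createXMLStreamReader(new StringReader(xml));
        List<String> texts = new ArrayList<>();
        while (reader.hasNext()) {
            int type = reader.next();
            // Without a DTD, indentation arrives as CHARACTERS, so filter it here
            if (type == XMLStreamConstants.CHARACTERS && !reader.isWhiteSpace()) {
                texts.add(reader.getText());
            }
        }
        return texts;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<person>\n <name>\n  <first_name>Alan</first_name>\n"
                + "  <last_name>Turing</last_name>\n </name>\n"
                + " <profession>mathematician</profession>\n</person>";
        System.out.println(significantText(xml));
    }
}
```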

Two more events

The last two events to mention are COMMENT and DTD. These are relatively self-explanatory. In both cases, you can use the getText( ) method to return the content of the comment or DTD. The content of the DTD will be the full DOCTYPE block. If the document contains a reference to an external DTD, the content will be along the lines of <!DOCTYPE contacts SYSTEM "name.dtd">, whereas if the DTD is internal, such as in the example above, the content would be a String containing the following:

<!DOCTYPE person [
 <!ELEMENT first_name (#PCDATA)>
 <!ELEMENT last_name (#PCDATA)>
 <!ELEMENT profession (#PCDATA)>
 <!ELEMENT name (first_name, last_name)>
 <!ELEMENT person (name, profession*)>
]>

XMLEventReader

XMLEventReader is the parsing interface from the event iterator API. Unlike XMLStreamReader, it does extend java.util.Iterator. The full XMLEventReader interface is shown in the following figure.

The XMLEventReader interface

As you can see, this is a much simpler interface than XMLStreamReader. This is because all of the access methods (getLocalName( ), getAttributeXXX( ), getNamespaceXXX( ), etc.) are encapsulated in the XMLEvent object. There is one cursor-type method, getElementText( ), which, as should be expected, throws an IllegalStateException if the current event isn't a START_ELEMENT event. Other than this, all document access is done through a subinterface of XMLEvent. XMLEvent is a base interface, and the StAX API defines 12 interfaces that extend it. The correspondence between these interfaces and the event types defined in Table 8-1 is listed in Table 8-4. All interfaces are in the package javax.xml.stream.events.

Table 8-4. Event interfaces

Event type name	Event interface name
START_ELEMENT	StartElement
END_ELEMENT	EndElement
PROCESSING_INSTRUCTION	ProcessingInstruction
CHARACTERS	Characters
COMMENT	Comment
SPACE	Characters
START_DOCUMENT	StartDocument
END_DOCUMENT	EndDocument
ENTITY_REFERENCE	EntityReference
ATTRIBUTE	Attribute
DTD	DTD
CDATA	Characters
NAMESPACE	n/a
NOTATION_DECLARATION	NotationDeclaration
ENTITY_DECLARATION	EntityDeclaration

As you can see, the Characters interface can represent three different events: CHARACTERS, SPACE, and CDATA. To differentiate between them, the Characters interface defines methods called isCData( ) and isIgnorableWhiteSpace( ). In general, these interfaces define methods similar to those defined on the XMLStreamReader interface. However, each interface defines only the methods appropriate to the event it represents. As a result, it's nearly impossible to trigger an IllegalStateException, because the invalid methods simply don't exist.

Other than simple name changes (for example, XMLStreamReader.getPITarget( ) maps to ProcessingInstruction.getTarget( )), the most significant difference between the methods on XMLStreamReader and the methods exposed by the various event interfaces is that you are required to use javax.xml.namespace.QName. As we saw above, XMLStreamReader defines a method called getName( ) that returns the QName for an element, as well as methods called getLocalName( ), getPrefix( ), and getNamespaceURI( ). In the StartElement interface, only getName( ) is defined, and you must call getLocalPart( ), getPrefix( ), and getNamespaceURI( ) on the resulting QName object to read these values.

When you obtain an instance of XMLEvent through a call to XMLEventReader.nextEvent( ), you can determine the type of event through three different mechanisms:

Call XMLEvent.getEventType( ) and compare to one of the values from XMLStreamConstants.
Call one of the is methods on XMLEvent, such as isStartElement( ).
Use the instanceof operator.

For example, these three blocks of code do exactly the same thing:

// Block one using ==
if (event.getEventType( ) == XMLStreamConstants.START_ELEMENT) {
    System.out.println("I'm a start element event!");
}
// Block two using isStartElement
if (event.isStartElement( )) {
    System.out.println("I'm a start element event!");
}
// Block three using instanceof
if (event instanceof StartElement) {
    System.out.println("I'm a start element event!");
}

These are interchangeable, and which you use is largely a matter of style. The one exception to this is if you want to use the switch construct. In this context, you must use the getEventType( ) method. Example 8-8 contains the StAX Tree Viewer example app rewritten to use XMLEventReader rather than XMLStreamReader, with the same functionality as the XMLStreamReader version.

Example 8-8. StAXEventTreeViewer

package javaxml3;
import java.awt.BorderLayout;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.Iterator;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTree;
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.DefaultTreeModel;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.DTD;
import javax.xml.stream.events.StartDocument;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import org.xml.sax.InputSource;
public class StAXEventTreeViewer extends JFrame {
 /** The base tree to render */
 private JTree jTree;
 /** Tree model to use */
 DefaultTreeModel defaultTreeModel;
 public StAXEventTreeViewer( ) {
 // Handle Swing setup
 super("StAX Tree Viewer");
 setSize(600, 450);
 }
 public void init(File file) throws XMLStreamException,
 FileNotFoundException {
 DefaultMutableTreeNode base = new DefaultMutableTreeNode(
 "XML Document: " + file.getAbsolutePath( ));
 // Build the tree model
 defaultTreeModel = new DefaultTreeModel(base);
 jTree = new JTree(defaultTreeModel);
 // Construct the tree hierarchy
 buildTree(defaultTreeModel, base, file);
 // Display the results
 getContentPane( ).add(new JScrollPane(jTree), BorderLayout.CENTER);
 }
 // Swing-related variables and methods, including
 // setting up a JTree and basic content pane
 public static void main(String[] args) {
 try {
 if (args.length != 1) {
 System.out.println("Usage: java javaxml3.StAXEventTreeViewer "
 + "[XML Document]");
 return;
 }
 StAXEventTreeViewer viewer = new StAXEventTreeViewer( );
 File f = new File(args[0]);
 viewer.init(f);
 viewer.setVisible(true);
 } catch (Exception e) {
 e.printStackTrace( );
 }
 }
 public void buildTree(DefaultTreeModel treeModel,
 DefaultMutableTreeNode current, File file)
 throws XMLStreamException, FileNotFoundException {
 XMLInputFactory inputFactory = XMLInputFactory.newInstance( );
 XMLEventReader reader = inputFactory
 .createXMLEventReader(new FileInputStream(file));
 while (reader.hasNext( )) {
 XMLEvent event = reader.nextEvent( );
 switch (event.getEventType( )) {
 case XMLStreamConstants.START_DOCUMENT:
 StartDocument startDocument = (StartDocument) event;
 DefaultMutableTreeNode version = new DefaultMutableTreeNode(
 "XML Version: " + startDocument.getVersion( ));
 current.add(version);
 DefaultMutableTreeNode standalone = new DefaultMutableTreeNode(
 "Standalone? " + startDocument.isStandalone( ));
 current.add(standalone);
 DefaultMutableTreeNode standaloneSet = new DefaultMutableTreeNode(
 "Was Standalone Set? " + startDocument.standaloneSet( ));
 current.add(standaloneSet);
 DefaultMutableTreeNode encoding = new DefaultMutableTreeNode(
 "Was Encoding Set? " + startDocument.encodingSet( ));
 current.add(encoding);
 DefaultMutableTreeNode decEnc = new DefaultMutableTreeNode(
 "Declared Encoding: "
 + startDocument.getCharacterEncodingScheme( ));
 current.add(decEnc);
 break;
 case XMLStreamConstants.START_ELEMENT:
 StartElement startElement = (StartElement) event;
 QName elementName = startElement.getName( );
 DefaultMutableTreeNode element = new DefaultMutableTreeNode(
 "Element: " + elementName.getLocalPart( ));
 current.add(element);
 current = element;
 if (!elementName.getNamespaceURI( ).equals("")) {
 String prefix = elementName.getPrefix( );
 if (prefix.equals("")) {
 prefix = "[None]";
 }
 DefaultMutableTreeNode namespace = new DefaultMutableTreeNode(
 "Namespace: prefix = '" + prefix + "', URI = '"
 + elementName.getNamespaceURI( ) + "'");
 current.add(namespace);
 }
 for (Iterator it = startElement.getAttributes(); it.hasNext( );) {
 Attribute attr = (Attribute) it.next( );
 DefaultMutableTreeNode attribute = new DefaultMutableTreeNode(
 "Attribute (name = '"
 + attr.getName().getLocalPart( )
 + "', value = '" + attr.getValue( ) + "')");
 String attURI = attr.getName().getNamespaceURI( );
 if (!attURI.equals("")) {
 String attPrefix = attr.getName().getPrefix( );
 if (attPrefix.equals("")) {
 attPrefix = "[None]";
 }
 DefaultMutableTreeNode attNs = new DefaultMutableTreeNode(
 "Namespace: prefix = '" + attPrefix
 + "', URI = '" + attURI + "'");
 attribute.add(attNs);
 }
 current.add(attribute);
 }
 break;
 case XMLStreamConstants.END_ELEMENT:
 current = (DefaultMutableTreeNode) current.getParent( );
 break;
 case XMLStreamConstants.CHARACTERS:
 Characters characters = (Characters) event;
 if (!characters.isIgnorableWhiteSpace( )
 && !characters.isWhiteSpace( )) {
 String data = characters.getData( );
 if (data.length( ) != 0) {
 DefaultMutableTreeNode chars = new DefaultMutableTreeNode(
 "Character Data: '" + characters.getData( )
 + "'");
 current.add(chars);
 }
 }
 break;
 case XMLStreamConstants.DTD:
 DTD dtde = (DTD) event;
 DefaultMutableTreeNode dtd = new DefaultMutableTreeNode(
 "DTD: '" + dtde.getDocumentTypeDeclaration( ) + "'");
 current.add(dtd);
 break;
 default:
 System.out.println(event.getClass().getName( ));
 }
 }
 }
}
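The three ways of identifying an event and the QName accessors used in Example 8-8 can be exercised in a much smaller program. The following sketch, with a hypothetical class name, verifies that the three checks agree for every event and then describes the first start element using only its QName.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name, for illustration only.
public class EventChecksSketch {

    /** Describes the first start element as "localPart|prefix|namespaceURI". */
    public static String firstStartElement(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            // The three mechanisms should always give the same answer
            boolean byType =
                    event.getEventType() == XMLStreamConstants.START_ELEMENT;
            boolean byMethod = event.isStartElement();
            boolean byInstance = event instanceof StartElement;
            if (byType != byMethod || byMethod != byInstance) {
                throw new IllegalStateException("the three checks disagree");
            }
            if (byMethod) {
                // All name parts are read from the QName, not from the event
                QName name = event.asStartElement().getName();
                return name.getLocalPart() + "|" + name.getPrefix() + "|"
                        + name.getNamespaceURI();
            }
        }
        return null; // no start element found
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(firstStartElement("<p:a xmlns:p=\"urn:x\"/>"));
    }
}
```

Note that QName reports a missing prefix or namespace as an empty string rather than null, which is why Example 8-8 compares against "" instead of testing for null.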

XMLEventReader advantages

The difference between StAXEventTreeViewer and StAXStreamTreeViewer is largely cosmetic. The event iterator API becomes significantly more useful when you want to encapsulate your event processing code. To write code, such as that in Example 8-9, that calls a method to process an event with XMLStreamReader, you have to pass the reader itself to the event processing method.

Example 8-9. Encapsulation problem with XMLStreamReader

 // create an instance of XMLStreamReader and call it reader
 while (reader.hasNext( )) {
 int eventTypeID = reader.next( );
 if (eventTypeID == XMLStreamConstants.START_ELEMENT) {
 if (reader.hasNext( )) {
 processStartElement(reader);
 eventTypeID = reader.next( );
 System.out.println("Event Type ID following START_ELEMENT is "
 + eventTypeID);
 processAfterStartElement(reader);
 } else {
 processStartElement(reader);
 }
 } else {
 processOther(reader);
 }
}

This creates a simple problem: we have no way of ensuring that the processing methods don't change the state of the reader. In Example 8-9, we assume processStartElement( ) won't call next( ) on the XMLStreamReader instance. If it does, then the result of the call to next( ) that follows processStartElement( ) will not be the event type ID for the event following the START_ELEMENT event. What's worse, the processStartElement( ) method could advance the cursor to the end of the document, in which case that call to next( ) would throw a java.util.NoSuchElementException. Although this assumption can be documented, it cannot be enforced at compile time. An implementation of this using XMLEventReader avoids this issue entirely because it passes XMLEvent objects to the event processing methods, as seen in Example 8-10. Since they don't have a reference to the reader object, there is no way for them to change the reader's state.

Example 8-10. Encapsulation with XMLEventReader

while (reader.hasNext( )) {
    XMLEvent event = reader.nextEvent( );
    if (event.isStartElement( )) {
        if (reader.hasNext( )) {
            processStartElement(event);
            event = reader.nextEvent( );
            System.out.println("Event Type ID following START_ELEMENT is "
                + event.getEventType( ));
            processAfterStartElement(event);
        } else {
            processStartElement(event);
        }
    } else {
        processOther(event);
    }
}
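The hazard described for Example 8-9 is easy to reproduce. In the sketch below, all names are hypothetical. One handler only reads from the current event, while the other calls next( ) on the shared XMLStreamReader. The calling loop records the start elements it sees, and with the misbehaving handler it silently misses one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class and method names, for illustration only.
public class CursorHazardSketch {

    /** A handler that (wrongly) advances the shared cursor. */
    private static void misbehavingHandler(XMLStreamReader reader)
            throws XMLStreamException {
        if (reader.hasNext()) {
            reader.next();
        }
    }

    /** A handler that only reads from the current event. */
    private static void wellBehavedHandler(XMLStreamReader reader) {
        reader.getLocalName();
    }

    /** Returns the start elements that the calling loop itself observes. */
    public static List<String> startElementsSeen(String xml, boolean misbehave)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> seen = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                seen.add(reader.getLocalName());
                if (misbehave) {
                    misbehavingHandler(reader); // moves the cursor behind our back
                } else {
                    wellBehavedHandler(reader);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(startElementsSeen("<a><b/></a>", false));
        System.out.println(startElementsSeen("<a><b/></a>", true));
    }
}
```

Nothing in the method signatures distinguishes the two handlers, which is the point: the compiler cannot tell which one is safe to call.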

Other Traversal Options

In addition to traversing the XML document event by event using the next( ) and nextEvent( ) methods, StAX defines a few additional methods for traversing an XML document. These complement, rather than replace, the next( ) and nextEvent( ) methods. These methods are nextTag( ), require( ), and peek( ).

nextTag( )

The first of these methods, nextTag( ), is defined for both XMLStreamReader and XMLEventReader. For XMLStreamReader, nextTag( ) advances the cursor past any "insignificant" events until it reaches the next START_ELEMENT or END_ELEMENT event. For XMLEventReader, the same advancement occurs, and the new current event is returned. The StAX JavaDocs define insignificant events as SPACE, COMMENT, and PROCESSING_INSTRUCTION events as well as CHARACTERS or CDATA events that are composed only of whitespace. If a CHARACTERS or CDATA event is encountered that contains something other than whitespace, an XMLStreamException is thrown. This is helpful when processing part of an XML document in a linear, as opposed to looping, fashion. Obtaining the text within the first_name element from Example 8-11 without using nextTag( ) would look something like Example 8-12.
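Before turning to the person document, the skipping and exception behavior just described can be checked in isolation. The following sketch, with hypothetical class and method names, calls nextTag( ) a fixed number of times and records what each call lands on, stopping at the first XMLStreamException.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name, for illustration only.
public class NextTagSketch {

    /** Calls nextTag() up to the given number of times and records each result. */
    public static String trace(String xml, int calls) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> steps = new ArrayList<>();
        for (int i = 0; i < calls; i++) {
            try {
                // Skips whitespace, comments, and processing instructions
                reader.nextTag();
            } catch (XMLStreamException e) {
                // Raised when anything else (such as real text) is in the way
                steps.add("XMLStreamException");
                break;
            }
            steps.add((reader.isStartElement() ? "start:" : "end:")
                    + reader.getLocalName());
        }
        return String.join(" ", steps);
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(trace("<a> <!-- c --> <b/> </a>", 4));
        System.out.println(trace("<a> <!-- c --> <b>x</b></a>", 3));
    }
}
```

In the second document, the third call fails because the text inside b is not whitespace, so next( ) or getElementText( ) is the appropriate call once the cursor is on a text-bearing element.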

Example 8-11. Person XML document

<?xml version="1.0"?>
<person>
 <name>
 <first_name>Alan</first_name>
 <last_name>Turing</last_name>
 </name>
 <profession>computer scientist</profession>
 <profession>mathematician</profession>
 <profession>cryptographer</profession>
</person>

Example 8-12. Person parsing with next( )

package javaxml3;
import java.io.File;
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
public class NextExample {
 public static void main(String[] args) throws Exception {
 if (args.length != 1) {
 System.out.println("Usage: java javaxml3.NextExample "
 + "[XML Document]");
 return;
 }
 File file = new File(args[0]);
 XMLInputFactory inputFactory = XMLInputFactory.newInstance( );
 XMLStreamReader reader = inputFactory
 .createXMLStreamReader(new FileInputStream(file));
 int eventTypeID = reader.next( );
 // skip past any initial whitespace (6 is XMLStreamConstants.SPACE)
 while (reader.getEventType( ) == 6)
 reader.next( );
 // the cursor is now at the person start element
 eventTypeID = reader.next( );
 // the cursor is now at the whitespace between person and name
 eventTypeID = reader.next( );
 // the cursor is now at the name start element
 eventTypeID = reader.next( );
 // the cursor is now at the whitespace between name and first_name
 eventTypeID = reader.next( );
 // the cursor is now at the first_name start element
 eventTypeID = reader.next( );
 // the cursor should now be at the text within the first_name element
 System.out.println("Hello " + reader.getText( ));
 }
}

There's an inconsistency between various StAX implementations that is being accommodated in Example 8-12: some implementations report a SPACE event for the whitespace between the XML declaration and the start of the document's content, which could be a comment, processing instruction, or (most commonly) an element. The specification is unfortunately vague on this point.

The code is more brittle than it should be. Adding a comment or processing instruction anywhere before the first_name element would result in either the wrong text being output (a newline and four spaces in this case) or an exception such as IllegalStateException, which getText( ) throws when the cursor is not at a text event. Clearly we can do better. Using nextTag( ) eliminates some of the calls to next( ) and increases our code's ability to ignore comments and processing instructions. The rewritten class is in Example 8-13.

Example Person parsing with nextTag( )

package javaxml3;

import java.io.File;
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class NextTagExample {
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("Usage: java javaxml3.NextTagExample "
                    + "[XML Document]");
            return;
        }
        File file = new File(args[0]);
        XMLInputFactory inputFactory = XMLInputFactory.newInstance( );
        XMLStreamReader reader = inputFactory
                .createXMLStreamReader(new FileInputStream(file));
        int eventTypeID = reader.nextTag( );
        // the cursor is now at the person start element
        eventTypeID = reader.nextTag( );
        // the cursor is now at the name start element
        eventTypeID = reader.nextTag( );
        // the cursor is now at the first_name start element
        eventTypeID = reader.next( );
        // the cursor should now be at the text within the first_name element
        System.out.println("Hello " + reader.getText( ));
    }
}

This code is still too error-prone. If the order of the first_name and last_name elements were reversed, we would just output the value of the last_name element. All we've done is output the text of the third element in the document. We can use require( ) to ensure that the third element is the correct one.
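Before moving on, the exception behavior of nextTag( ) described earlier is easy to see in isolation. The following is a minimal sketch rather than one of the chapter's examples: the class name NextTagTextSketch, the helper method secondNextTagThrows( ), and the two inline documents are our own inventions for illustration, and the documents are supplied through java.io.StringReader instead of a file.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name; not part of the chapter's example code.
class NextTagTextSketch {

    // Moves the cursor to the root start element, then reports whether
    // the following call to nextTag() throws an XMLStreamException.
    static boolean secondNextTagThrows(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // the cursor is now at the root start element
            try {
                reader.nextTag();
                return false; // only insignificant events were skipped
            } catch (XMLStreamException expected) {
                return true; // non-whitespace text was in the way
            }
        } catch (XMLStreamException e) {
            // Anything failing before the second nextTag() is unexpected here.
            throw new IllegalStateException("Document could not be read", e);
        }
    }

    public static void main(String[] args) {
        // Whitespace and a comment sit between the two start tags.
        System.out.println(secondNextTagThrows("<a> <!-- note --> <b/></a>"));
        // Real text sits between the two start tags.
        System.out.println(secondNextTagThrows("<a>text<b/></a>"));
    }
}
```

With the StAX implementation bundled with the JDK, this should print false for the first document, where only whitespace and a comment are skipped, and true for the second, where the text gets in the way. The wording of the exception message varies by implementation.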

require( )

XMLStreamReader defines a method named require( ) that compares the cursor's position within the document to a set of expected values. If the cursor's position does not match all of the expected values, a javax.xml.stream.XMLStreamException is thrown. Otherwise, the method returns normally. At the minimum, require( ) compares the current event type ID with an event type ID passed to it. Additionally, you can pass a namespace URI and a local name to require( ). If either of these parameters is null, that comparison is not done. Here are a few sample calls to require( ):

require(START_ELEMENT, null, null): Succeeds if the current event type ID is START_ELEMENT
require(END_ELEMENT, "http://www.example.com/ns1", null): Succeeds if current event type ID is END_ELEMENT and the current namespace URI is http://www.example.com/ns1
require(START_ELEMENT, null, "name"): Succeeds if the current event type ID is START_ELEMENT and the current local name is name
require(END_ELEMENT, "http://www.example.com/ns1", "name"): Succeeds if the current event type ID is END_ELEMENT, the current namespace URI is http://www.example.com/ns1, and the current local name is name
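The sample calls above can be tried against a single namespaced element. This is only a sketch under a few assumptions: the class RequireSketch, its helper methods, and the inline document are hypothetical names introduced here, and the helper turns the XMLStreamException thrown by require( ) into a boolean purely to make the results easy to print.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name; not part of the chapter's example code.
class RequireSketch {
    static final String NS = "http://www.example.com/ns1";

    // Returns a reader whose cursor is at the start tag of a namespaced name element.
    static XMLStreamReader readerAtNameElement() {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
                    new StringReader("<n:name xmlns:n=\"" + NS + "\">Alan</n:name>"));
            reader.nextTag();
            return reader;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Document could not be read", e);
        }
    }

    // Converts the exception-based contract of require() into true/false.
    static boolean satisfies(XMLStreamReader reader, int type, String ns, String local) {
        try {
            reader.require(type, ns, local);
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        XMLStreamReader reader = readerAtNameElement();
        // Event type only; both name comparisons are skipped.
        System.out.println(satisfies(reader, XMLStreamConstants.START_ELEMENT, null, null));
        // Event type, namespace URI, and local name must all match.
        System.out.println(satisfies(reader, XMLStreamConstants.START_ELEMENT, NS, "name"));
        // Wrong event type: the cursor is at a start tag, not an end tag.
        System.out.println(satisfies(reader, XMLStreamConstants.END_ELEMENT, null, null));
        // Right event type, wrong local name.
        System.out.println(satisfies(reader, XMLStreamConstants.START_ELEMENT, null, "first_name"));
    }
}
```

A failed call to require( ) does not move the cursor, so the same reader can be checked repeatedly, as main( ) does here.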

This require( ) method is useful in cases where you have a defined XML syntax, but no DTD is available for validation purposes. In these cases, without some way to verify that the document follows your expectations, your code may throw nonintuitive exceptions like NullPointerException and IllegalStateException. Alternatively, it may just not do what you expect it to do. For example, we could have a document similar to Example 8-11, but with the first_name and last_name elements swapped so that it looks like:

<?xml version="1.0"?>
<person>
 <name>
 <last_name>Turing</last_name>
 <first_name>Alan</first_name>
 </name>
 <profession>computer scientist</profession>
 <profession>mathematician</profession>
 <profession>cryptographer</profession>
</person>

In this case, the code in Example 8-13 would output "Hello Turing" instead of the expected "Hello Alan." To ensure that we're only outputting the first name, we can add calls to require( ) to produce the code in Example 8-14.

Example Person parsing with nextTag() and require( )

package javaxml3;

import java.io.File;
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class RequireExample {
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("Usage: java javaxml3.RequireExample "
                    + "[XML Document]");
            return;
        }
        File file = new File(args[0]);
        XMLInputFactory inputFactory = XMLInputFactory.newInstance( );
        XMLStreamReader reader = inputFactory
                .createXMLStreamReader(new FileInputStream(file));
        int eventTypeID = reader.nextTag( );
        // the cursor is now at the person start element
        reader.require(XMLStreamConstants.START_ELEMENT, null, "person");
        eventTypeID = reader.nextTag( );
        // the cursor is now at the name start element
        reader.require(XMLStreamConstants.START_ELEMENT, null, "name");
        eventTypeID = reader.nextTag( );
        // the cursor is now at the first_name start element
        reader.require(XMLStreamConstants.START_ELEMENT, null, "first_name");
        eventTypeID = reader.next( );
        // the cursor should now be at the text within the first_name element
        System.out.println("Hello " + reader.getText( ));
    }
}

Because our XML does not use namespaces, we have to pass null as the namespace URI parameter. If we passed the empty string ("") instead, it would match only elements that explicitly declare an empty default namespace, such as:

<name xmlns=""/>

Running this class against a document in which last_name comes before first_name throws an XMLStreamException with a helpful message such as this:

LocalName first_name specified did not match with current local name

Don't worry if the exception message you get is worded differently. Different implementations are free to form these messages however they see fit.

You can also use the getLocation( ) method on XMLStreamReader to provide the explicit location within the document where the comparison failed.

boolean outputElementText = false;
try {
    reader.require(START_ELEMENT, null, "first_name");
    outputElementText = true;
} catch (XMLStreamException e) {
    System.out.println("Assertion failed. " + e.getMessage( )
            + " at " + reader.getLocation().getLineNumber( ) + ":"
            + reader.getLocation().getColumnNumber( ));
}
if (outputElementText)
    System.out.println(reader.getElementText( ));

This outputs:

Assertion failed. LocalName first_name specified did not match with current local name at 4:16

Location Interface

XMLStreamReader and XMLEvent both have a method named getLocation( ), which returns an instance of javax.xml.stream.Location. Like org.xml.sax.Locator, Location provides the following accessors:

getLineNumber( )
getColumnNumber( )
getPublicId( )
getSystemId( )

In addition, Location provides a getCharacterOffset( ) method, which returns the current location as expressed in the number of characters from the beginning of the document. One thing to note about the result of getLocation( ) is that it returns an object representing the position at the end of the current event. Consider XML such as this:

<identifier type="number"
>12345</identifier>

When the current event is the START_ELEMENT event for the identifier element, the Location object returned by getLocation( ) will be on the second line of this fragment, because that is where the start tag ends.
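To see what a particular implementation reports, the accessors can simply be printed. The sketch below is an illustration only: LocationSketch, locationAtStartElement( ), and describe( ) are hypothetical names, and the fragment from the text is embedded as a string. No expected output is shown because the exact line, column, and offset values depend on the StAX implementation in use.

```java
import java.io.StringReader;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class name; not part of the chapter's example code.
class LocationSketch {
    // The two-line fragment from the text, with the line break inside the start tag.
    static final String XML = "<identifier type=\"number\"\n>12345</identifier>";

    // Returns the Location reported while the cursor is at the first start element.
    static Location locationAtStartElement(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // the cursor is now at the identifier start element
            return reader.getLocation();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Document could not be read", e);
        }
    }

    // Formats the positional accessors; the public and system IDs are omitted
    // because a document read from a StringReader typically has neither.
    static String describe(Location location) {
        return "line " + location.getLineNumber()
                + ", column " + location.getColumnNumber()
                + ", offset " + location.getCharacterOffset();
    }

    public static void main(String[] args) {
        System.out.println(describe(locationAtStartElement(XML)));
    }
}
```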

peek( )

The peek( ) method of XMLEventReader returns what will be the result of the next invocation of nextEvent( ) or next( ). It does not affect the result of this future invocation. This method is useful for making the processing of the current event conditional on the next event such as in Example 8-15.

Example Usage of the peek( ) method

XMLEvent event = reader.nextEvent( );
XMLEvent next = reader.peek( );
if (next.isStartElement( )) {
    processEventWithChild(event);
} else {
    processEvent(event);
}
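A self-contained variation on the same idea is sketched below. It is an illustration under our own assumptions: the class PeekSketch and the method elementsWithLeadingChild( ) are hypothetical, and instead of calling processing methods it records the names of elements whose start tag is immediately followed by another start tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name; not part of the chapter's example code.
class PeekSketch {

    // Collects the local names of elements whose start tag is immediately
    // followed by the start tag of a child element.
    static List<String> elementsWithLeadingChild(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue;
                }
                // Look ahead without consuming: the peeked event is returned
                // again by the next call to nextEvent().
                XMLEvent next = reader.peek();
                if (next != null && next.isStartElement()) {
                    names.add(event.asStartElement().getName().getLocalPart());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Document could not be read", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementsWithLeadingChild("<a><b>t</b><c/></a>"));
    }
}
```

For the document in main( ), only a qualifies: b is followed by text and c by its own end tag. Note that whitespace between tags would count as an intervening event, so an indented document would need the whitespace events skipped first.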

StAX Filters

The StAX API has built-in support for event filtering in both the cursor and event iterator APIs. EventFilter and StreamFilter are separate interfaces for the two APIs, but a filter class will commonly implement both. Both interfaces define a method named accept( ). Each has a single parameter, which is an XMLEvent in the case of EventFilter and an XMLStreamReader in the case of StreamFilter. Whether using EventFilter or StreamFilter, the accept( ) method is called for each event or change to the cursor position; in practice, that means whenever next( ), nextEvent( ), or peek( ) is called. If accept( ) returns true, the event is returned to the caller of next( ) or nextEvent( ). If accept( ) returns false, the cursor position is advanced until accept( ) returns true or the end of the document is reached.

The StreamFilter interface documentation states that the filter should not change the state of the reader. In other words, do not call next( ), nextTag( ), or close( ). Calling hasNext( ), require( ), or most other methods is acceptable.

Example 8-16 contains a filter that implements both EventFilter and StreamFilter to filter out all events other than START_ELEMENT and END_ELEMENT. It's common to have a single class implement both interfaces, as the acceptance criteria are generally the same for both. Here, both interface methods delegate to acceptInternal( ).

Example Filter implementing EventFilter and StreamFilter

package javaxml3;

import javax.xml.stream.EventFilter;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

public class ElementOnlyFilter implements EventFilter, StreamFilter {
    /* implementation of EventFilter interface */
    public boolean accept(XMLEvent event) {
        return acceptInternal(event.getEventType( ));
    }

    /* implementation of StreamFilter interface */
    public boolean accept(XMLStreamReader reader) {
        return acceptInternal(reader.getEventType( ));
    }

    /* internal utility method */
    private boolean acceptInternal(int eventType) {
        return eventType == XMLStreamConstants.START_ELEMENT
                || eventType == XMLStreamConstants.END_ELEMENT;
    }
}

To create a filtered instance of XMLStreamReader or XMLEventReader, there are two methods on XMLInputFactory named createFilteredReader( ). One of these accepts an instance of XMLStreamReader and an instance of StreamFilter and the other accepts an instance of XMLEventReader and an instance of EventFilter. It's possible to nest filters through repeated calls to createFilteredReader( ) such as:

XMLInputFactory inputFactory = XMLInputFactory.newInstance( );
XMLStreamReader reader = inputFactory
        .createXMLStreamReader(new FileInputStream(file));
reader = inputFactory.createFilteredReader(reader, new ElementOnlyFilter( ));
reader = inputFactory.createFilteredReader(reader, new OtherFilter( ));
reader = inputFactory.createFilteredReader(reader, new AnotherFilter( ));
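The same nesting works with the XMLEventReader overload of createFilteredReader( ). The sketch below is a hedged illustration, not one of the chapter's examples: NestedFilterSketch and startElementNames( ) are hypothetical names, and the two filters are written inline as lambdas (which works because EventFilter declares a single method) rather than as named classes like the OtherFilter and AnotherFilter placeholders above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class name; not part of the chapter's example code.
class NestedFilterSketch {

    // Stacks two filters on one reader and returns the local names of
    // the events that survive both.
    static List<String> startElementNames(String xml) {
        // Inner filter: keep only element events (the same criteria as Example 8-16).
        EventFilter elementsOnly = event -> event.isStartElement() || event.isEndElement();
        // Outer filter: of those, keep only the start tags.
        EventFilter startTagsOnly = XMLEvent::isStartElement;

        List<String> names = new ArrayList<>();
        try {
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(xml));
            reader = inputFactory.createFilteredReader(reader, elementsOnly);
            reader = inputFactory.createFilteredReader(reader, startTagsOnly);
            while (reader.hasNext()) {
                // Every event that reaches this point passed both filters,
                // so it is safe to treat it as a StartElement.
                names.add(reader.nextEvent().asStartElement().getName().getLocalPart());
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Document could not be read", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(startElementNames("<a><b>t</b><c/></a>"));
    }
}
```

Here the second filter alone would have produced the same result; the point is only that each call to createFilteredReader( ) wraps the reader returned by the previous one.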

Example 8-17 contains a sample class that counts the number of events in a document with an unfiltered reader, then counts the number of events in the same document with a reader filtered using the ElementOnlyFilter class from Example 8-16, and then compares the two counts.

Example Filter usage with XMLStreamReader

package javaxml3;

import java.io.File;
import java.io.FileInputStream;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamFilterExample {
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("Usage: java javaxml3.StreamFilterExample "
                    + "[XML Document]");
            return;
        }
        File file = new File(args[0]);
        XMLInputFactory inputFactory = XMLInputFactory.newInstance( );
        XMLStreamReader reader = inputFactory
                .createXMLStreamReader(new FileInputStream(file));
        int unfilteredCount = countEvents(reader);
        System.out.println("Unfiltered Count = " + unfilteredCount);
        // reinitialize the reader
        reader = inputFactory.createXMLStreamReader(new FileInputStream(file));
        // create the filter and filtered reader
        StreamFilter filter = new ElementOnlyFilter( );
        reader = inputFactory.createFilteredReader(reader, filter);
        int filteredCount = countEvents(reader);
        System.out.println("Filtered Count = " + filteredCount);
        System.out.println("Filter removed "
                + (unfilteredCount - filteredCount) + " events");
    }

    private static int countEvents(XMLStreamReader reader)
            throws XMLStreamException {
        int counter = 1;
        while (reader.hasNext( )) {
            reader.next( );
            counter++;
        }
        return counter;
    }
}

Rewriting the class to use XMLEventReader rather than XMLStreamReader would be as simple as replacing Stream with Event and changing the initial value of counter to 0 (as the first event from an XMLEventReader isn't read until the first call to next( ) or nextEvent( )).
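A rough sketch of that event-based variant follows. It makes some assumptions of its own so that it can stand alone: the class name EventFilterCountSketch and the method unfilteredAndFilteredCounts( ) are hypothetical, the document is passed in as a string rather than read from a file, and an inline lambda with the same acceptance criteria stands in for ElementOnlyFilter.

```java
import java.io.StringReader;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;

// Hypothetical class name; not part of the chapter's example code.
class EventFilterCountSketch {

    // Counts events by draining the reader. The counter starts at 0 because
    // an XMLEventReader has not read anything until nextEvent() is called.
    static int countEvents(XMLEventReader reader) throws XMLStreamException {
        int counter = 0;
        while (reader.hasNext()) {
            reader.nextEvent();
            counter++;
        }
        return counter;
    }

    // Returns { unfiltered count, filtered count } for the given document.
    static int[] unfilteredAndFilteredCounts(String xml) {
        try {
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(xml));
            int unfilteredCount = countEvents(reader);

            // Stand-in for ElementOnlyFilter: accept only start and end tags.
            EventFilter elementsOnly = event -> event.isStartElement() || event.isEndElement();
            // A reader cannot be rewound, so create a fresh one to filter.
            reader = inputFactory.createFilteredReader(
                    inputFactory.createXMLEventReader(new StringReader(xml)), elementsOnly);
            int filteredCount = countEvents(reader);

            return new int[] { unfilteredCount, filteredCount };
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Document could not be read", e);
        }
    }

    public static void main(String[] args) {
        int[] counts = unfilteredAndFilteredCounts("<person><name>Alan</name></person>");
        System.out.println("Unfiltered Count = " + counts[0]);
        System.out.println("Filtered Count = " + counts[1]);
        System.out.println("Filter removed " + (counts[0] - counts[1]) + " events");
    }
}
```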

As discussed in "The START_DOCUMENT event" previously in this chapter, when a regular XMLStreamReader is created, the current event is a START_DOCUMENT event. However, if a filter is applied to an XMLStreamReader, this may no longer be the case if the filter doesn't accept the START_DOCUMENT event. Unfortunately, this behavior is not defined in the specification, and some implementations (including the reference implementation) keep the START_DOCUMENT as the first event whereas some implementations (including Sun's SJSXP) advance the cursor until the current event is the first acceptable event. This ambiguity does not exist for EventFilters, as you must call next( ) or nextEvent( ) to obtain the first event.

StAX filters are limited to accepting or rejecting events from the reader. Unlike the XMLFilter interface that's part of SAX, there is no way to modify the input document as it's parsed. The NamespaceFilter class from the "Filters and Writers" section cannot be implemented as a StAX filter. There is, however, a way of implementing this type of functionality through the XMLEventAllocator interface, as we'll see in the "Factory Properties" section later in this chapter.