Content Handlers - XML - Java Programming Language

To let an app do something useful with XML data, you must register handlers with the SAX parser. A handler is nothing more than a set of callbacks that SAX defines; a group, if you will, of related events to which you might want to attach code. There are four core handler interfaces defined by SAX 2.0: org.xml.sax.ContentHandler, org.xml.sax.ErrorHandler, org.xml.sax.DTDHandler, and org.xml.sax.EntityResolver.

In this chapter, I will discuss ContentHandler and ErrorHandler. I'll leave discussion of DTDHandler and EntityResolver for the next chapter; it is enough for now to understand that EntityResolver and DTDHandler work just like the other handlers, but just group different behaviors.

Your classes implement one or more of these handlers and fill in the callback methods with working code (or, if you desire, no code at all; this effectively ignores a certain type of event). You then register your handler implementations using setContentHandler( ), setErrorHandler( ), setDTDHandler( ), and setEntityResolver( ), all on the XMLReader interface. The reader then invokes the callback methods on the appropriate handlers during parsing.

The handler classes are all passed into the XMLReader interface, and then used during parsing to trigger programmer-defined behaviors
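The registration step can be sketched in a few lines. This is only an illustration, not the chapter's SAXTreeViewer code: the class names HandlerRegistrationSketch and ElementCounter are made up for the example, the handler extends org.xml.sax.helpers.DefaultHandler (which supplies empty implementations of all four handler interfaces) purely to keep the listing short, and the reader is obtained through JAXP's SAXParserFactory rather than the approach used later in this chapter.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class HandlerRegistrationSketch {

    /** Hypothetical handler: counts elements and ignores every other event. */
    static class ElementCounter extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elements++;
        }
    }

    /** Parses the given XML text and returns the number of elements seen. */
    static int countElements(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            // One object plays all four handler roles in this sketch
            ElementCounter handler = new ElementCounter();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.setDTDHandler(handler);
            reader.setEntityResolver(handler);

            // The reader now calls back into the handler as it parses
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.elements;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Calling HandlerRegistrationSketch.countElements with a small document string is enough to see the callbacks fire; nothing happens until parse( ) is invoked.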

For the SAXTreeViewer example, start by implementing the ContentHandler interface. ContentHandler, as the name implies, details events related to the content of an XML document: elements, attributes, character data, etc. Add the following class to the end of your SAXTreeViewer.java source listing:

class JTreeHandler implements ContentHandler {
 /** Tree Model to add nodes to */
 private DefaultTreeModel treeModel;
 /** Current node to add sub-nodes to */
 private DefaultMutableTreeNode current;
 public JTreeHandler(DefaultTreeModel treeModel, DefaultMutableTreeNode base) {
 this.treeModel = treeModel;
 this.current = base;
 }
 // ContentHandler callback implementations
}

Most of this early version is Swing-related. The handler will respond to each SAX event by adding a node to the Swing tree, building up a visual representation of the XML document.

Don't bother trying to compile the source file at this point; you'll get a ton of errors about methods defined in ContentHandler not being implemented. The rest of this section walks through each of these methods, looking at the various ContentHandler callbacks and implementing each in turn.

Each of these callbacks returns void; it's not OO programming, but it gets the job done

The Document Locator

The first callback you need to implement is setDocumentLocator( ); this allows you to save a reference to an org.xml.sax.Locator for use within your other SAX events. When a callback event occurs, the class implementing a handler often needs access to the location of the SAX parser within an XML file. The Locator interface has several useful methods, such as getLineNumber( ) and getColumnNumber( ), that return the current location of the parsing process within an XML file when invoked. Since this might be handy to use later, the code shown here saves the provided Locator instance to a member variable:

class JTreeHandler implements ContentHandler {
 /** Hold onto the locator for location information */
 private Locator locator;
 // Constructor
 public void setDocumentLocator(Locator locator) {
 // Save this for later use
 this.locator = locator;
 }
}

The Locator should be used only within the scope of the ContentHandler implementation; data it reports outside of the parsing process is unpredictable (and useless, anyway).
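To make the use of a saved Locator concrete, here is a minimal sketch that records the line number reported when each element's start tag is seen. The names LocatorSketch, LineRecorder, and elementLines are hypothetical, and DefaultHandler is used instead of a full ContentHandler implementation only to keep the example short. Note that the Locator is consulted only from inside a callback, as the text advises.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LocatorSketch {

    /** Hypothetical handler: remembers the line on which each element starts. */
    static class LineRecorder extends DefaultHandler {
        private Locator locator;
        final Map<String, Integer> lines = new LinkedHashMap<>();

        @Override
        public void setDocumentLocator(Locator locator) {
            // Save this for later use
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // A parser is not obliged to supply a Locator, so guard against null
            if (locator != null) {
                lines.put(localName, locator.getLineNumber());
            }
        }
    }

    /** Returns element local name -> line number, in document order. */
    static Map<String, Integer> elementLines(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            LineRecorder handler = new LineRecorder();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.lines;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

The same saved reference is what you would use to build more helpful error messages, since it tells you where in the document the parser was when something went wrong.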

The Beginning and the End of a Document

In any lifecycle process, there must always be a beginning and an end. These important events should each occur once: the former before all other parsing events and the latter after all other events. This rather obvious fact is critical to your apps, as it allows you to know exactly when parsing begins and ends. SAX provides callback methods for each of these events, aptly named startDocument( ) and endDocument( ). startDocument( ) is called before any other parsing callbacks, including the callback methods within other SAX handlers, such as DTDHandler. In other words, startDocument( ) is not only the first method called within ContentHandler, but also within the entire parsing process, aside from setDocumentLocator( ). This ensures a definite beginning to parsing, and lets the app perform any tasks it needs to before parsing takes place. endDocument( ) is always the last method called, again across all handlers. This includes situations in which errors occur that cause parsing to halt.

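Even if an unrecoverable error occurs, the ErrorHandler's callback method is invoked, and then a final call to endDocument( ) is meant to complete the attempted parsing. Be aware, though, that the SAX API documentation itself acknowledges an ambiguity on this point and advises clients not to assume that endDocument( ) will be invoked after a fatal error has been reported, so don't make critical cleanup depend solely on that callback.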
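The ordering for a well-formed document can be observed with a small sketch that simply logs the name of each callback as it arrives. The class and method names (DocumentLifecycleSketch, eventOrder) are invented for illustration, and the handler extends DefaultHandler for brevity.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DocumentLifecycleSketch {

    /** Returns the names of the callbacks in the order the parser made them. */
    static List<String> eventOrder(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void setDocumentLocator(Locator locator) {
                events.add("setDocumentLocator");
            }

            @Override
            public void startDocument() {
                events.add("startDocument");
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("startElement:" + localName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement:" + localName);
            }

            @Override
            public void endDocument() {
                events.add("endDocument");
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Running this against any small document shows startDocument arriving ahead of the first element and endDocument arriving after the last one closes.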

In the example code, no visual event should occur with these methods; however, as with implementing any interface, the methods must still be present:

public void startDocument( ) throws SAXException {
 // No visual events occur here
}
public void endDocument( ) throws SAXException {
 // No visual events occur here
}

Both of these callback methods can throw SAXExceptions. SAXException is the only type of checked exception that SAX callbacks ever throw, which gives you another standard interface to the parsing behavior. However, these exceptions often wrap other exceptions that indicate what problems have occurred. For example, if an XML file is parsed over the network via a URL, and the connection becomes invalid, a java.net.SocketException can occur. Within the SAX reader, the original exception is caught and rethrown as a SAXException, with the originating exception stuffed inside the new one. This allows your apps to trap for one standard exception, while allowing specific details of what errors occurred within the parsing process to be wrapped and made available to the calling program through this standard exception. The SAXException class provides a method called getException( ) that returns the underlying Exception (if one exists).
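The wrapping pattern works the same way when your own callback code is the source of the problem. The following sketch assumes a handler that hits a lower-level failure inside startElement( ); the failure itself (an IOException with the message "disk full") is simulated, and the names WrappedExceptionSketch and rootCause are hypothetical. The caller traps the single standard exception type and then digs out the original cause with getException( ).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class WrappedExceptionSketch {

    /** Parses the XML and returns the exception wrapped by the SAXException, or null. */
    static Exception rootCause(String xml) {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts)
                    throws SAXException {
                // Simulated lower-level failure, wrapped in the standard SAX exception
                IOException cause = new IOException("disk full");
                throw new SAXException("could not handle element " + localName, cause);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return null; // parsing finished without any exception
        } catch (SAXException e) {
            // One standard exception to trap; the details are inside it
            return e.getException();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("could not set up or read input", e);
        }
    }
}
```

Keep in mind that getException( ) returns null when the SAXException was not created around another exception, so check the result before using it.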

Processing Instructions

Processing instructions (PIs) were mentioned earlier as a bit of a special case within XML. They are not considered XML elements, and are handled differently by being made available to the calling app. Because of these special characteristics, SAX defines a specific callback for handling processing instructions. This method, processingInstruction( ), receives the target of the processing instruction and any data sent to the PI. For this chapter's example, the PI can be converted to a new node and displayed in the tree viewer:

public void processingInstruction(String target, String data)
 throws SAXException {
 DefaultMutableTreeNode pi = new DefaultMutableTreeNode("PI (target = '" + target +
 "', data = '" + data + "')");
 current.add(pi);
}

This method allows an app to receive instructions and set variable values, or even execute additional code to perform app-specific processing. For example, the Apache Cocoon publishing framework might set flags to perform transformations on the data once it is parsed, or to display the XML as a specific content type. This method, like the other SAX callbacks, throws a SAXException when errors occur.

This method will not receive notification of the XML declaration:

<?xml version="1.0" standalone="yes"?>

In fact, SAX provides no means of getting at this information (and you'll find out that it's not currently part of DOM Level 2, either!). The general underlying principle is that this information is for the XML parser or reader, not the consumer of the document's data. For that reason, it's not exposed to the developer.
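A short sketch shows both behaviors together: an ordinary processing instruction is reported with its target and data, while the XML declaration at the top of the same document produces no callback. The names ProcessingInstructionSketch and collect are hypothetical, and DefaultHandler stands in for a full ContentHandler implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ProcessingInstructionSketch {

    /** Returns one "target -> data" entry per processing instruction reported. */
    static List<String> collect(String xml) {
        final List<String> instructions = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void processingInstruction(String target, String data) {
                instructions.add(target + " -> " + data);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return instructions;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

An app that acts on PIs would typically switch on the target string and parse the data itself, since SAX hands the data over as one uninterpreted string.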

Namespace Callbacks

From the discussion of namespaces in Chapters 1 and 2, you should be starting to realize their importance and impact on parsing and handling XML. Alongside XML Schema, XML Namespaces is one of the more significant concepts added to XML since the original XML 1.0 Recommendation. With SAX 2, support for namespaces was introduced at the element level. This allows a distinction to be made between the namespace of an element, associated with an element prefix and URI, and the local name of an element. Local name refers to the unprefixed name of an element. For example, the local name of the rdf:li element is simply li. The namespace prefix is rdf, and the namespace URI might be declared as http://www.w3.org/1999/02/22-rdf-syntax-ns#. You'll also see the term Q name (sometimes written QName), which refers to the prefixed name of the element; so li is the local name in this case, and rdf:li is the Q name.

There are two SAX callbacks specifically dealing with namespaces. These callbacks are invoked when the parser reaches the beginning and end of a prefix mapping. Although this is a new term, it is not a new concept; a prefix mapping is simply the association between a prefix and a namespace URI that an element declares with an xmlns attribute. The declaring element is often the root element (which may have multiple mappings) but can be any element within an XML document that declares an explicit namespace. For example:

<catalog>
 <books>
 <book title="XML in a Nutshell" xmlns:xlink="http://www.w3.org/1999/xlink">
 <cover xlink:type="simple" xlink:show="onLoad" xlink:href="xmlnutcover.jpg" alt="Java XML in a Nutshell" />
 </book>
 </books>
</catalog>

In this case, an explicit namespace is declared several element nestings deep within the document. That prefix and URI mapping (in this case, xlink and http://www.w3.org/1999/xlink, respectively) is then available to elements and attributes within the declaring element. The startPrefixMapping( ) callback is passed the namespace prefix, as well as the URI associated with that prefix. The mapping is considered "closed" or "ended" when the element that declared the mapping is closed, which triggers the endPrefixMapping( ) callback.

The only twist to these callbacks is that they don't quite behave in the sequential manner in which SAX is usually structured; the prefix mapping callback occurs directly before the callback for the element that declares the namespace, and the ending of the mapping results in an event just after the close of the declaring element. However, it actually makes a lot of sense: for the declaring element to be able to use the declared namespace mapping, the mapping must be available before the element's callback. It works in just the opposite way for ending a mapping: the element must close (as it may use the namespace), and then the namespace mapping can be removed from the list of available mappings.

In the JTreeHandler, there aren't any visual events that should occur within these two callbacks. However, a common practice is to store the prefix and URI mappings in a data structure. You will see in a moment that the element callbacks report the namespace URI, but not the namespace prefix. If you don't store these prefixes yourself (reported through startPrefixMapping( )), they won't be available in your element callback code. The easiest way to do this is to use a Map, add the reported prefix and URI to this Map in startPrefixMapping( ), and then remove them in endPrefixMapping( ). This can be accomplished with the following code additions:

/** Store URI to prefix mappings */
private Map namespaceMappings;
public JTreeHandler(DefaultTreeModel treeModel, DefaultMutableTreeNode base) {
 this.treeModel = treeModel;
 this.current = base;
 this.namespaceMappings = new HashMap( );
}
public void startPrefixMapping(String prefix, String uri) {
 // No visual events occur here.
 namespaceMappings.put(uri, prefix);
}
public void endPrefixMapping(String prefix) {
 // No visual events occur here.
 for (Iterator i = namespaceMappings.keySet( ).iterator( ); i.hasNext( ); ) {
 String uri = (String)i.next( );
 String thisPrefix = (String)namespaceMappings.get(uri);
 if (prefix.equals(thisPrefix)) {
 namespaceMappings.remove(uri);
 break;
 }
 }
}

Notice that I used the URI as a key to the mappings, rather than the prefix. The startElement( ) callback reports the namespace URI for the element, not the prefix. So keying on URIs makes those lookups faster. However, as you see in endPrefixMapping( ), it does add a little bit of work to removing the mapping when it is no longer available.

The solution shown here is far from a complete one in terms of dealing with more complex namespace issues. It's perfectly legal to reassign prefixes to new URIs for an element's scope, or to assign multiple prefixes to the same URI. In the example, this would result in widely scoped namespace mappings being overwritten by narrowly scoped ones (in the case where identical URIs were mapped to different prefixes). In a more robust app, you would want to store prefixes and URIs separately, and have a method of relating the two without causing overwriting. However, you get the idea in the example of how to handle namespaces in the general sense.
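One possible way to avoid the overwriting problem is to keep a stack of URIs per prefix, pushing in startPrefixMapping( ) and popping in endPrefixMapping( ), so that an inner redeclaration temporarily shadows the outer one and the outer binding reappears when the inner scope ends. The sketch below is only one such design, not the approach used by SAXTreeViewer; the names ScopedPrefixSketch, ScopedHandler, currentUri, and resolve are all hypothetical. It also assumes the parser supplies a non-empty Q name to startElement( ), which SAX does not strictly require under default settings.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ScopedPrefixSketch {

    /** Hypothetical handler: tracks the URI each prefix is bound to, per scope. */
    static class ScopedHandler extends DefaultHandler {
        private final Map<String, Deque<String>> bindings = new HashMap<>();
        final List<String> seen = new ArrayList<>();

        @Override
        public void startPrefixMapping(String prefix, String uri) {
            // A new declaration shadows any outer binding for the same prefix
            bindings.computeIfAbsent(prefix, k -> new ArrayDeque<>()).push(uri);
        }

        @Override
        public void endPrefixMapping(String prefix) {
            // Leaving the declaring element restores the outer binding, if any
            Deque<String> stack = bindings.get(prefix);
            if (stack != null && !stack.isEmpty()) {
                stack.pop();
            }
        }

        /** URI currently bound to the prefix, or null if it is unbound. */
        String currentUri(String prefix) {
            Deque<String> stack = bindings.get(prefix);
            return (stack == null || stack.isEmpty()) ? null : stack.peek();
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            int colon = qName.indexOf(':');
            String prefix = (colon < 0) ? "" : qName.substring(0, colon);
            seen.add(qName + " -> " + currentUri(prefix));
        }
    }

    /** Returns "qName -> URI bound to its prefix" for each element, in order. */
    static List<String> resolve(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            ScopedHandler handler = new ScopedHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.seen;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

If you would rather not maintain this bookkeeping yourself, SAX also ships a helper class, org.xml.sax.helpers.NamespaceSupport, that manages nested namespace contexts for you.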

Element Callbacks

More than half of SAX callbacks have nothing to do with XML elements, attributes, and data. Remember, the process of parsing XML is intended to do more than simply provide your app with the XML data; it should give the app instructions from XML PIs so your app knows what actions to take, let the app know when parsing begins and when it ends, and even tell it when there is whitespace that can be ignored!

Of course, there certainly are SAX callbacks intended to give you access to the XML data within your documents. The three primary callbacks involved in accessing that data are the start and end of elements and the characters( ) callback. These tell you when the start tag for an element has been parsed, what data was found within that element, and when the closing tag for that element is reached.

The first of these, startElement( ), gives an app information about an XML element and any attributes it may have. The parameters to this callback are the name of the element (in various forms) and an org.xml.sax.Attributes instance. The Attributes interface (or, rather, your parser's implementation of the interface) holds references to all of the attributes within an element. It allows easy iteration through the element's attributes in a form similar to a Vector. In addition to being able to reference an attribute by its index (used when iterating through all attributes), it is possible to reference an attribute by its name. Of course, by now you should be a bit cautious when you see the word "name" referring to an XML element or attribute, as it can mean various things. In this case, either the attribute's Q name can be used, or the combination of its local name and namespace URI if a namespace is used. There are also helper methods such as getURI(int index) and getLocalName(int index) that help give additional namespace information about an attribute. Used as a whole, the Attributes interface provides a comprehensive set of information about an element's attributes.

The Attributes interface isn't a Java collection (unfortunately), but it does provide collection-like behavior
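Both styles of access, by index and by name, can be seen in the following sketch. It is an illustration under a few assumptions: the class and method names (AttributesSketch, attributesOf) and the key format used in the result map are invented for the example, and the sample document is a trimmed variant of the catalog fragment shown earlier.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class AttributesSketch {

    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Collects every attribute in the document, keyed as "{namespaceURI}localName". */
    static Map<String, String> attributesOf(String xml) {
        final Map<String, String> found = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Access by index: walk the attributes like a list
                for (int i = 0; i < atts.getLength(); i++) {
                    String key = "{" + atts.getURI(i) + "}" + atts.getLocalName(i);
                    found.put(key, atts.getValue(i));
                }
                // Access by name: namespace URI plus local name
                String href = atts.getValue(XLINK_NS, "href");
                if (href != null) {
                    found.put("lookup:xlink:href", href);
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return found;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Notice that an attribute without a prefix, such as alt, has an empty namespace URI; unprefixed attributes do not inherit a namespace from their element.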

In addition to its attributes, you get several forms of the element's name. This again is in deference to XML namespaces. The namespace URI of the element is supplied first. This places the element in its correct context across the document's complete set of namespaces. Then the local name of the element is supplied, which is the unprefixed element name. In addition (and for backward compatibility), the Q name of the element is supplied. Now, back to the actual implementation of startElement( ). First, a new node is created and added to the tree with the local name of the element. Then, that node is set as the current node, so all nested elements and attributes are added as leaves.

Technically, an attribute is not nested within an element. Attributes are usually said to be on the element, and usually describe the element. That's a bit tricky to display though, so I've opted for simply nesting them; you're a smart reader and will know what I mean, though, won't you?

Next, the namespace is determined, using the supplied namespace URI and the namespaceMappings object (to get the prefix) that you just added to the code from the "Namespace Callbacks" section. This is added as a node, as well. Finally, the code iterates through the Attributes interface, adding each (with local name and namespace information) as a child node. The code to accomplish all this is shown here:

public void startElement(String namespaceURI, String localName,
 String qName, Attributes atts)
 throws SAXException {
 DefaultMutableTreeNode element =
 new DefaultMutableTreeNode("Element: " + localName);
 current.add(element);
 current = element;
 // Determine namespace
 if (namespaceURI.length( ) > 0) {
 String prefix =
 (String)namespaceMappings.get(namespaceURI);
 if (prefix == null || prefix.equals("")) {
 prefix = "[None]";
 }
 DefaultMutableTreeNode namespace =
 new DefaultMutableTreeNode("Namespace: prefix = '" +
 prefix + "', URI = '" + namespaceURI + "'");
 current.add(namespace);
 }
 // Process attributes
 for (int i=0; i<atts.getLength( ); i++) {
 DefaultMutableTreeNode attribute =
 new DefaultMutableTreeNode("Attribute (name = '" +
 atts.getLocalName(i) +
 "', value = '" +
 atts.getValue(i) + "')");
 String attURI = atts.getURI(i);
 if (attURI.length( ) > 0) {
 String attPrefix = (String)namespaceMappings.get(attURI);
 if (attPrefix == null || attPrefix.equals("")) {
 attPrefix = "[None]";
 }
 DefaultMutableTreeNode attNamespace =
 new DefaultMutableTreeNode("Namespace: prefix = '" +
 attPrefix + "', URI = '" + attURI + "'");
 attribute.add(attNamespace);
 }
 current.add(attribute);
 }
}

The end of an element is much easier to code. Since there is no need to give any visual information, all that must be done is to walk back up the tree one node, leaving the element's parent as the new current node:

public void endElement(String namespaceURI, String localName,
 String qName)
 throws SAXException {
 // Walk back up the tree
 current = (DefaultMutableTreeNode)current.getParent( );
}

Element Data

Once the beginning and end of an element block are identified and the element's attributes are enumerated, the next piece of important information is the actual data contained within the element itself. This generally consists of additional elements, textual data, or a combination of the two. When other elements appear, the callbacks for those elements are initiated, and a type of pseudorecursion happens: elements nested within elements result in callbacks "nested" within callbacks. At some point, though, textual data will be encountered. Typically the most important information to an XML client, this data is usually either what you show to the client or what you process to generate a client response. In SAX, textual data within elements is sent to your app via the characters( ) callback. This method provides your app with an array of characters as well as a starting index and the length of the characters to read. Generating a String from this array and applying the data is a piece of cake:

public void characters(char[] ch, int start, int length)
 throws SAXException {
 String s = new String(ch, start, length);
 DefaultMutableTreeNode data =
 new DefaultMutableTreeNode("Character Data: '" + s + "'");
 current.add(data);
}

Seemingly a simple callback, this method often results in a significant amount of confusion. A SAX parser may choose to return all contiguous character data in one invocation, or split this data up into multiple method invocations. For any given element, this method will be called not at all (if no character data is present within the element) or one or more times. Parsers implement this behavior differently, often using algorithms designed to increase parsing speed. Never count on having all the textual data for an element within one callback method; conversely, never assume that multiple callbacks would result from one element's contiguous character data.
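A common way to cope with this is to accumulate text in a buffer and only act on it when the element ends. The following is a minimal sketch of that idea rather than part of the tree viewer; the names TextAccumulatorSketch, TextCollector, and leafText are hypothetical, and the approach as written only suits elements that contain text and no child elements.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class TextAccumulatorSketch {

    /** Hypothetical handler: joins however many characters() calls an element gets. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        final List<String> texts = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            buffer.setLength(0); // start fresh for each element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called zero, one, or many times per element
            buffer.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (buffer.length() > 0) {
                texts.add(localName + "=" + buffer);
            }
            buffer.setLength(0);
        }
    }

    /** Returns "element=text" for every element that directly ended with text. */
    static List<String> leafText(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            TextCollector handler = new TextCollector();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.texts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Because the text is assembled in endElement( ), the result is the same whether the parser delivers the title in one chunk or splits it, for instance around the entity reference in a title such as "As You &amp; Like It".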

Getting Ahead of the Data

The characters( ) callback method accepts a character array, as well as start and length parameters, to signify which index to start at and how far to read into the array. This can cause some confusion; a common mistake is to include code like this to read from the character array:

public void characters(char[] ch, int start, int length)
 throws SAXException {
 for (int i=0; i<ch.length; i++)
 System.out.println(ch[i]);
}

The mistake here is in reading from the beginning to the end of the character array, instead of from start to start+length. This common mistake results from years of iterating through arrays, whether in Java, C, or another language. However, in the case of a SAX event, this can cause quite a problem. SAX parsers are required to pass starting and length values on the character array that any loop constructs should use to read from the array. This allows lower-level manipulation of textual data to occur in order to optimize parser performance, such as reading data ahead of the current location as well as array reuse. This is all legal behavior within SAX, as the expectation is that a wrapping app will not try to "read past" the length parameter sent to the callback. Mistakes like the one in this example can result in gibberish data being output to the screen or used within your app, and are almost always problematic. The loop construct looks very normal and compiles without a hitch, so this can be a very tricky problem to track down. Remember to always simply convert this data to a String and use it directly:

String data = new String(ch, start, length);
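If you do need to walk the characters one at a time, the loop bounds must come from the start and length parameters. The sketch below shows a corrected loop; the class and method names (CharacterSliceSketch, readSlice) are hypothetical, and the array is built by hand to imitate a parser buffer that holds other data before and after the reported text.

```java
class CharacterSliceSketch {

    /** Reads only the region the parser reported: ch[start] up to ch[start + length - 1]. */
    static String readSlice(char[] ch, int start, int length) {
        StringBuilder out = new StringBuilder(length);
        for (int i = start; i < start + length; i++) {
            out.append(ch[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Imitates a reused parser buffer: only the middle five chars are "ours"
        char[] buffer = "XXhelloYY".toCharArray();
        System.out.println(readSlice(buffer, 2, 5));
    }
}
```

The result is identical to new String(ch, start, length), which remains the simpler choice whenever you just want the text.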

Sequencing Mixups

As you write SAX event handlers, be sure to keep your mind in a hierarchical mode. In other words, you should not get in the habit of thinking that an element owns its data and child elements, but only that it serves as a parent. Also keep in mind that the parser is moving along, handling elements, attributes, and data as it comes across them. This can make for some surprising results. Consider the following XML document fragment:

<parent>This element has <child>embedded text</child> within it.</parent>

Forgetting that SAX parses sequentially, making callbacks as it sees elements and data, and forgetting that the XML is viewed as hierarchical, you might make the assumption that the output here would group all of the parent element's text together, separate from the child element.

If you're not careful, your assumptions about XML will result in a mismatch between what you see and what you get

This seems logical, as the parent element completely "owns" the child element. But what actually occurs is that a callback is made at each SAX event point, so the parent's text arrives in pieces, interleaved with the callbacks for the child element.

If you keep the parsing cycle in mind, you'll keep element data with its correct element, rather than a parent element further up the hierarchy
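The actual sequence of events for that fragment can be captured with a small logging handler. This is only a sketch with invented names (EventSequenceSketch, events); adjacent characters( ) calls are merged before logging so that the result does not depend on how a particular parser chooses to split text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EventSequenceSketch {

    /** Returns element and text events in the order the parser reported them. */
    static List<String> events(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            // Emit any text gathered since the last element boundary
            private void flushText() {
                if (text.length() > 0) {
                    events.add("text:" + text);
                    text.setLength(0);
                }
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                flushText();
                events.add("start:" + localName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flushText();
                events.add("end:" + localName);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

For the parent and child fragment, the log shows the parent's text split in two, with the child's start, text, and end events in between.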

Whitespace

Whitespace is also often reported by the characters( ) method. This introduces additional confusion, as another SAX callback, ignorableWhitespace( ), also reports whitespace. You can avoid this confusion by remembering this simple rule: if no schema is referenced, ignorableWhitespace( ) will never be invoked.

As explained earlier, the generic term "schema" refers to any constraint model, including DTDs, XML Schema, RELAX NG, etc.

A schema details the content model for an element. Consider Example 3-2, the DTD for Jon Bosak's XML versions of Shakespeare's plays.

Example 3-2. This rather simple DTD constrains the entire body of Shakespeare's works, modeled in XML

<!ENTITY amp "&#38;#38;">
<!ELEMENT PLAY (TITLE, FM, PERSONAE, SCNDESCR, PLAYSUBT, INDUCT?,
 PROLOGUE?, ACT+, EPILOGUE?)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT FM (P+)>
<!ELEMENT P (#PCDATA)>
<!ELEMENT PERSONAE (TITLE, (PERSONA | PGROUP)+)>
<!ELEMENT PGROUP (PERSONA+, GRPDESCR)>
<!ELEMENT PERSONA (#PCDATA)>
<!ELEMENT GRPDESCR (#PCDATA)>
<!ELEMENT SCNDESCR (#PCDATA)>
<!ELEMENT PLAYSUBT (#PCDATA)>
<!ELEMENT INDUCT (TITLE, SUBTITLE*, (SCENE+|(SPEECH|STAGEDIR|SUBHEAD)+))>
<!ELEMENT ACT (TITLE, SUBTITLE*, PROLOGUE?, SCENE+, EPILOGUE?)>
<!ELEMENT SCENE (TITLE, SUBTITLE*, (SPEECH | STAGEDIR | SUBHEAD)+)>
<!ELEMENT PROLOGUE (TITLE, SUBTITLE*, (STAGEDIR | SPEECH)+)>
<!ELEMENT EPILOGUE (TITLE, SUBTITLE*, (STAGEDIR | SPEECH)+)>
<!ELEMENT SPEECH (SPEAKER+, (LINE | STAGEDIR | SUBHEAD)+)>
<!ELEMENT SPEAKER (#PCDATA)>
<!ELEMENT LINE (#PCDATA | STAGEDIR)*>
<!ELEMENT STAGEDIR (#PCDATA)>
<!ELEMENT SUBTITLE (#PCDATA)>
<!ELEMENT SUBHEAD (#PCDATA)>

As an example, the FM element can only have P elements within it. Any whitespace between the start of the FM element and the start of a P element is therefore ignorable. It doesn't mean anything, because the DTD says not to expect any character data (whitespace or otherwise). The same thing applies for whitespace between the end of an ACT element and the start of another ACT element, as the parent (PLAY) cannot contain character data; therefore, any whitespace can be ignored. However, without a constraint specifying that information to a parser, that whitespace cannot be interpreted as meaningless. Without a DTD, these various whitespaces would trigger the characters( ) callback, where previously they triggered the ignorableWhitespace( ) callback. Thus whitespace is never simply ignorable, or nonignorable; it all depends on what (if any) constraints are referenced. Change the constraints, and you might change the meaning of the whitespace.

Let's dive even deeper. In the case where an element can only have other elements within it, things are reasonably clear. Whitespace in between elements is ignorable. However, consider a mixed content model:

<!ELEMENT LINE (#PCDATA | STAGEDIR)*>

In this model, there is no whitespace between the starting and ending LINE tags that will ever be reported as ignorable (with or without a DTD or schema reference). That's because it's impossible to distinguish between whitespace used for readability and whitespace that is supposed to be in the document. For example:

<SPEAKER>CELIA</SPEAKER>
<LINE>
 <STAGEDIR>Reads</STAGEDIR>
</LINE>
<LINE>Why should this a desert be?</LINE>

In this XML fragment, the whitespace between the opening LINE element and the opening STAGEDIR element is not ignorable, and therefore reported through the characters( ) callback. Be prepared to closely monitor both of the character-related callbacks.

Ignorable Whitespace

With all that whitespace discussion done, adding an implementation for the ignorableWhitespace( ) method is a piece of cake. Since the whitespace reported is ignorable, the code does just that: it ignores it.

public void ignorableWhitespace(char[] ch, int start, int length)
 throws SAXException {
 // This is ignorable, so don't display it
}

Whitespace is reported in the same manner as character data; it can be reported with one callback, or a SAX parser may break up the whitespace and report it over several method invocations. In either case, adhere closely to the precautions about not making assumptions or counting on whitespace as textual data in order to avoid troublesome bugs in your apps.
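To see how the presence of a DTD changes which callback receives whitespace, you can count the whitespace delivered to each one. The sketch below uses invented names (WhitespaceSketch, whitespaceCounts) and a tiny FM document with an optional internal DTD subset, rather than the full play DTD. How much is routed to ignorableWhitespace( ) when a DTD is present depends on the parser, so treat the split between the two counts as something to observe rather than a guarantee.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class WhitespaceSketch {

    /**
     * Returns a two-element array:
     * [0] whitespace-only characters reported through characters(),
     * [1] characters reported through ignorableWhitespace().
     */
    static int[] whitespaceCounts(String xml) {
        final int[] counts = new int[2];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Count only chunks that consist entirely of whitespace
                if (new String(ch, start, length).trim().isEmpty()) {
                    counts[0] += length;
                }
            }

            @Override
            public void ignorableWhitespace(char[] ch, int start, int length) {
                counts[1] += length;
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return counts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String body = "<FM>\n  <P>text</P>\n</FM>";
        String dtd = "<!DOCTYPE FM [<!ELEMENT FM (P+)><!ELEMENT P (#PCDATA)>]>";
        int[] without = whitespaceCounts(body);
        int[] with = whitespaceCounts(dtd + body);
        System.out.println("no DTD:   characters=" + without[0] + ", ignorable=" + without[1]);
        System.out.println("with DTD: characters=" + with[0] + ", ignorable=" + with[1]);
    }
}
```

Without the DTD, all of the whitespace between the FM and P tags arrives through characters( ); with it, a parser that reads the DTD can tell that FM holds only elements and is free to report the same whitespace as ignorable.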

Entities

Entities often are used to refer to another fragment of XML, as well as special characters like & and >. When your XML document is parsed, those entities that do reference other files are usually expanded and inserted into the document flow. However, nonvalidating parsers are not required to resolve entity references, and instead may skip them; you can also usually configure your parser to intentionally skip entities. In both cases, SAX accounts for this with a callback that is issued when an entity is skipped. The callback gives the name of the entity, which can be included in the viewer's output:

public void skippedEntity(String name) throws SAXException {
 DefaultMutableTreeNode skipped =
 new DefaultMutableTreeNode("Skipped Entity: '" + name + "'");
 current.add(skipped);
}

You won't see this callback executed often; most established parsers will not skip entities, even if they are not validating. Apache Xerces, for example, never invokes this callback; instead, the entity reference is expanded and the result included in the XML data returned to your app.

The parameter passed to the callback does not include the leading ampersand and trailing semicolon in the entity reference. For &header;, only the name of the entity reference, header, is passed to skippedEntity( ).

The Results

Finally, you need to register the content handler implementation with the XMLReader you've instantiated. This is done with setContentHandler( ):

public void buildTree(DefaultTreeModel treeModel,
 DefaultMutableTreeNode base, String xmlURI)
 throws IOException, SAXException {
 // Create instances needed for parsing
 XMLReader reader =
 XMLReaderFactory.createXMLReader( );
 ContentHandler jTreeHandler = new JTreeHandler(treeModel, base);
 // Register content handler
 reader.setContentHandler(jTreeHandler);
 // Register error handler
 // Parse
 InputSource inputSource = new InputSource(xmlURI);
 reader.parse(inputSource);
}

Now compile the SAXTreeViewer.java source file. Once done, you may run the SAX viewer demonstration on the XML sample file created earlier. Also, make sure that you have added your working directory to the classpath. The complete Java command should read:

/usr/local/writing/javaxml3>java javaxml3.SAXTreeViewer as_you.xml

Java 5 users, you'll get a warning about unchecked operations when you compile SAXTreeViewer. You can easily fix those with parameterized collections; I chose not to show that, as the code wouldn't compile on pre-Java 5.0 environments.

This should result in a Swing window firing up, loaded with the XML document's content. Exactly what you see will depend on which nodes you have expanded.

Shakespeare's plays are obviously rather long, but SAX still parsed this file remarkably quickly

Now you know how SAX handles a well-formed XML document. You should also have a pretty good understanding of the document callbacks that occur within the parsing process, and how your app uses these callbacks to get information about an XML document as the document is parsed. Before moving on, though, I want to address the issue of what happens when your XML document is not well formed, and the errors that can result from this condition.