Consuming SAX2 Events

Contents:

More About ContentHandler
The LexicalHandler Interface
Exposing DTD Information
Turning SAX Events into Data Structures
XML Pipelines

Most of the power of SAX is exposed through event callbacks. In previous chapters you've seen some of the most widely used event callbacks as well as how to ensure that all the callbacks are generated and reported to application code.

This chapter presents the rest of the standard SAX event-handling interfaces (including the extension handlers), then talks about some of the common ways that event consumers use those interfaces. These interfaces are primarily implemented by application code that consumes events and needs to solve particular problems. You might also write custom event producers, which call these interfaces directly rather than expecting some type of XMLReader to issue them.

More About ContentHandler

In "Basic ContentHandler Events", in "Introducing SAX2", we looked at the most important APIs used to handle XML document content. Some other APIs were deferred to this section because they aren't used as widely. Depending on what problems you're solving, you may rely heavily on some of these additional methods.

Other ContentHandler Methods

Five ContentHandler callbacks were discussed in Chapter 2: "Essential ContentHandler Callbacks" explained how characters and element boundaries were reported, and "ContentHandler and Prefix Mappings" explained how namespace-prefix scopes were reported. But the interface has six other methods. Here's what they do and when you'll want to use them:

The Locator Interface

This useful interface is sometimes overlooked. It gives information that is essential for providing location-sensitive diagnostics and is often given to SAXParseException constructors. That same information is also needed to resolve relative URIs in document content or attribute values (such as xml:base). Parsers provide one instance of this interface, which can be used inside event callbacks to find what entity triggered the event and approximately where. Use that locator only during such callbacks. The interface has only four methods: getPublicId(), getSystemId(), getLineNumber(), and getColumnNumber().

One common use for a locator is to report an error detected while an application processes document content. The SAXParseException class has two constructors that take locator parameters. (The descriptive string is always first, the locator is second, and an optional "root cause" exception is third.) Once you create such an exception, it can be thrown directly, which always terminates a parse. Or you can pass it to an ErrorHandler to centralize error-handling policy in your application:

// "locator" was saved when setDocumentLocator() was called earlier
// or was initialized to null; this is safe in both cases
try {
    ... engine.setWarpFactor (11); ...
} catch (DriveException e) {
    SAXParseException spe = new SAXParseException (
        "The warp engine's gonna blow!", locator, e);
    // report the wrapped exception, which carries the location
    errHandler.error (spe);
    // we'll get here whenever such problems are ignored
}
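
As a fuller illustration of the same pattern, here is a minimal runnable sketch. The class name LocatorReportingHandler, the helper firstErrorLine(), and the rule that a "forbidden" element counts as an application error are all hypothetical, invented for this example. It assumes the JAXP parser bundled with the JDK, which supplies a locator.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: treats any <forbidden> element as an application error.
class LocatorReportingHandler extends DefaultHandler {
    private Locator locator;               // stays null if the parser provides none
    private final ErrorHandler errHandler; // central error-handling policy

    LocatorReportingHandler(ErrorHandler errHandler) {
        this.errHandler = errHandler;
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;            // save it for use in later callbacks
    }

    @Override
    public void startElement(String uri, String lname, String qname, Attributes atts)
            throws SAXException {
        if ("forbidden".equals(qname)) {
            // The constructor accepts a null locator, so this is safe either way.
            errHandler.error(new SAXParseException("forbidden element", locator));
        }
    }

    // Parses the text and returns the line of the first reported error, or -1.
    static int firstErrorLine(String xml) throws Exception {
        final int[] line = { -1 };
        ErrorHandler collector = new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) {
                if (line[0] < 0) {
                    line[0] = e.getLineNumber();
                }
            }
            public void fatalError(SAXParseException e) throws SAXException {
                throw e;
            }
        };
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new LocatorReportingHandler(collector));
        reader.setErrorHandler(collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return line[0];
    }
}
```

Because the error is routed through the ErrorHandler rather than thrown, the parse continues after the report. An application that wants to stop would throw the exception from error() instead.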

To resolve relative URIs in document content -- for example, one found in an <xhtml:a href="..."/> reference in a link checker -- you'd use code like this (ignoring xml:base complications):

public void startElement (String uri, String lname, String qname, Attributes atts)
throws SAXException
{
    if (xhtmlURI.equals (uri)) {
        if ("a".equals (lname)) {
            String href = atts.getValue ("href");
            if (href != null) {
                // ASSUMES: locator is nonnull
                // java.net.URI resolves href against the entity's URI
                System.out.println ("Found href to: "
                    + URI.create (locator.getSystemId ()).resolve (href));
            }
            // else presumably <xhtml:a name="..."/>
        }
    }
    ...
}
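
The fragment above can be fleshed out into something runnable. The following is only a sketch under stated assumptions: the class HrefCollector and its collect() helper are hypothetical names, the document is read from a String with a system ID supplied by the caller, and a missing locator or system ID simply leaves the href unresolved.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical link collector: gathers the targets of XHTML <a href="..."/>.
class HrefCollector extends DefaultHandler {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    final List<String> links = new ArrayList<>();
    private Locator locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String lname, String qname, Attributes atts) {
        if (!XHTML_NS.equals(uri) || !"a".equals(lname)) {
            return;
        }
        // Unprefixed attributes are in no namespace, hence the empty URI.
        String href = atts.getValue("", "href");
        if (href == null) {
            return;                        // presumably <xhtml:a name="..."/>
        }
        String base = (locator == null) ? null : locator.getSystemId();
        // Without a base URI, keep the reference as written.
        links.add(base == null ? href : URI.create(base).resolve(href).toString());
    }

    // Parses XML text as if it had been fetched from systemId.
    static List<String> collect(String xml, String systemId) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // JAXP factories default to false
        XMLReader reader = factory.newSAXParser().getXMLReader();
        HrefCollector handler = new HrefCollector();
        reader.setContentHandler(handler);
        InputSource in = new InputSource(new StringReader(xml));
        in.setSystemId(systemId);
        reader.parse(in);
        return handler.links;
    }
}
```

URI.create() throws an unchecked IllegalArgumentException for a malformed base, so a real link checker would probably want to catch that and report it through its ErrorHandler.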

Some of the XMLReader implementations cannot possibly call ContentHandler.setDocumentLocator() with a Locator. When parsing in-memory data structures, such as a DOM document, a locator will normally be meaningless. When parsing in-memory buffers like a String (with a StringReader), there won't usually be a URI in the locator.

If your application supports the layered xml:base convention (which lets documents "lie" about their true locations for purposes of resolving relative URIs), it will need to track those attributes itself, as part of a context stack mechanism. (An example of such a stack is shown later, in Example 5-1.) Such attributes can sometimes help make up for SAX event sources that can't provide locator information, such as DOM-to-SAX producers. But they can confuse things too: in the following example, xml:base would apply to the top element and its direct children, but not to anything within the external entity reference. (Let's assume, for the sake of discussion, that no other element has an xml:base attribute.)

<top xml:base="http://www.example.com/moved/doc2.xml">
<xhtml:a href="abc.xml"/>
<xhtml:div> &external; </xhtml:div>
<xhtml:a href="xyz.xml"/>
</top>

When character content of an element is reported, characters from different external entities will get different callbacks, so the locator can be used to tell those different entities apart from each other.
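
The xml:base tracking described above might be sketched as follows. This is a deliberately simplified illustration, not the context stack shown in Example 5-1: the names BaseTrackingHandler and resolveAll() are hypothetical, href attributes are resolved on any element, and external entity boundaries are ignored, so it would give wrong answers for content that comes from an external entity.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps a stack of base URIs, one entry per open element.
class BaseTrackingHandler extends DefaultHandler {
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    private final Deque<URI> bases = new ArrayDeque<>();
    final List<String> resolved = new ArrayList<>();

    BaseTrackingHandler(String documentUri) {
        bases.push(URI.create(documentUri)); // bottom of the stack: the document itself
    }

    @Override
    public void startElement(String uri, String lname, String qname, Attributes atts) {
        URI base = bases.peek();
        String override = atts.getValue(XML_NS, "base");
        if (override != null) {
            base = base.resolve(override);   // xml:base may itself be relative
        }
        bases.push(base);
        String href = atts.getValue("", "href");
        if (href != null) {
            resolved.add(base.resolve(href).toString());
        }
    }

    @Override
    public void endElement(String uri, String lname, String qname) {
        bases.pop();                         // leave this element's scope
    }

    // Parses XML text and returns every href, resolved against its base in scope.
    static List<String> resolveAll(String xml, String documentUri) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);     // needed to look up xml:base by namespace
        XMLReader reader = factory.newSAXParser().getXMLReader();
        BaseTrackingHandler handler = new BaseTrackingHandler(documentUri);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.resolved;
    }
}
```

A more complete version would also watch entity boundaries (for example, by comparing the locator's system ID between callbacks) and push the entity's own URI when content comes from a different external entity.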

Internationalization Concerns

One of the goals of XML was to bring Unicode into widespread use so that the Web could really become worldwide in terms of people, not just technology. This brings several concerns into text management. You may not need to worry about these if you're working only in ASCII or with just one character encoding. While you're just starting out with Java and XML, you should certainly avoid worrying about these details. Other users of SAX2 will need to understand these issues. Since they surface primarily with ContentHandler event callbacks, we briefly summarize them here.

If your application works with MathML, or in various languages whose character sets gained support in Unicode 3.1 through the so-called Astral Planes, you will need to know that what Java calls a char is not really the same thing as a Unicode character or an XML character. If you aren't using such languages, you'll probably be able to ignore this issue for a while. Still, you might want to read about Unicode 3.1 to learn more about this and minimize trouble later. By the time you read this, the W3C may even have completed its "Blueberry" XML update, intended to allow the use of some such characters within XML names.

Characters whose Unicode code point is above U+FFFF (the maximum 16-bit code point) are mapped to two Java char values, called a surrogate pair. Each char value is in a range reserved for surrogates, with a high surrogate always immediately followed by a low surrogate. (This is called a big-endian sequence.) Surrogate pairs can show up in several places in XML, and hence in SAX2: in character content, processing instructions, attribute values (including defaults in the DTD), and comments.
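
To make the mapping concrete, here is a small sketch that uses only plain char arithmetic. The class SurrogateDemo and its methods are hypothetical names. The sample character is U+1D538 (a mathematical double-struck capital A, of the kind MathML content might contain), whose surrogate pair is the two char values 0xD835 and 0xDD38.

```java
// Hypothetical helper showing how one character above U+FFFF occupies two chars.
class SurrogateDemo {
    // Ranges that Unicode reserves for the two halves of a surrogate pair.
    static boolean isHigh(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    static boolean isLow(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    // Combines a high and a low surrogate into the code point they encode.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Counts characters, treating each well-formed surrogate pair as one.
    static int countCharacters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            boolean pair = isHigh(s.charAt(i))
                && i + 1 < s.length()
                && isLow(s.charAt(i + 1));
            if (pair) {
                i++;                       // skip the low half of the pair
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "x\uD835\uDD38";        // 'x' followed by U+1D538
        System.out.println(s.length());    // number of char values
        System.out.println(countCharacters(s));
    }
}
```

For the string in main(), length() reports three char values while countCharacters() reports two characters, which is exactly the mismatch between a Java char and an XML character.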

At this time, Java does not have APIs to explicitly support characters using surrogate pairs, although character arrays and java.lang.String will hold them as if the char values weren't part of the same character. The java.lang.Character class doesn't recognize surrogate pairs. The best precaution seems to be to prefer APIs that talk in terms of slices of character arrays (or Strings), rather than in terms of individual Java char values. This approach also handles other situations where more than one char value is needed per character.
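
The slice-oriented style recommended above can be sketched like this. TextCollector and textOf() are hypothetical names, and the sketch assumes the JAXP parser bundled with the JDK. The handler never inspects an individual char; it appends each reported slice whole, so the two halves of a surrogate pair stay adjacent in the result.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates character content as slices, never char by char.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append the whole slice; parsers may deliver text in several callbacks.
        text.append(ch, start, length);
    }

    // Returns all character content of the document, concatenated in order.
    static String textOf(String xml) throws Exception {
        TextCollector collector = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), collector);
        return collector.text.toString();
    }
}
```

Calling textOf("<m>x = &#x1D538;</m>") yields a string that ends in the surrogate pair for U+1D538, without the handler needing to know that the character reference expanded to two char values.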

Depending on the character encodings you're using and the applications you're implementing, you may also need to pay attention to the W3C Character Model (http://www.w3.org/TR/charmod/ at this writing) and Unicode Normalization Form C. Briefly, these aim to eliminate undesirable representations of characters and to handle some other cases where Unicode characters aren't the same as XML characters or a Java char, such as composite characters. For example, many accented characters are represented by composing two or more Unicode characters. Systems work better when they only need to handle one way to represent such characters, and Form C addresses that problem.
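
As a brief illustration of Form C, the sketch below uses java.text.Normalizer, which was added to the platform in Java 6 and so postdates the APIs discussed in this chapter. The class name NfcDemo and its two helpers are hypothetical. The sample shows an accented "e" written two ways: as the single character U+00E9, and as a plain "e" followed by the combining acute accent U+0301.

```java
import java.text.Normalizer;

// Hypothetical helper: reduces equivalent spellings of a character to one form.
class NfcDemo {
    // Converts text to Unicode Normalization Form C (composed form).
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Reports whether text is already in Form C.
    static boolean isNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";        // one precomposed character
        String decomposed = "e\u0301";     // base letter plus combining accent
        System.out.println(composed.equals(decomposed));
        System.out.println(composed.equals(toNfc(decomposed)));
    }
}
```

The first comparison in main() is false and the second is true: the two spellings differ as char sequences until the decomposed one is normalized. An event consumer that compares or indexes text could normalize it once, after accumulating the slices delivered to characters().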