Parsing with SAX - XML - Java Programming Language

Without spending any further time on the preliminaries, it's time to code. As a sample to familiarize you with SAX, this chapter details the SAXTreeViewer class. This utility uses SAX to parse an XML document and displays the document visually as a Swing JTree.

If you don't know anything about Swing, don't worry; I don't focus on it, and use it only for visual purposes. The focus will remain on SAX, and on how events within parsing can be used to perform customized actions.

The first thing you need to do in any SAX-based app is get an instance of a class that implements the SAX org.xml.sax.XMLReader interface; remember, this is why you downloaded a SAX-compliant parser in the first place.

Instantiating a Reader

SAX provides the org.xml.sax.XMLReader interface for all SAX-compliant XML parsers to implement. For example, the Xerces SAX parser implementation, org.apache.xerces.parsers.SAXParser, implements the XMLReader interface. If you have access to the source of your parser, you should see the same interface implemented in your parser's main SAX parser class. Each XML parser must have one class (and sometimes has more than one) that implements this interface, and that is the class you need to instantiate to allow for parsing XML:

// Instantiate a Reader
XMLReader reader = new org.apache.xerces.parsers.SAXParser( );
// Do something with the parser
reader.parse(uri);

If you're new to SAX, you may be wondering why XMLReader isn't called Parser. In fact, it was in SAX 1.0, but SAX 2.0 introduced so many changes that the Parser interface had to be deprecated and replaced by one with a new name. As a result, you'll call the parse( ) method on the XMLReader interface.

This approach ties you tightly to your parser vendor, though; you can use SAX's org.xml.sax.helpers.XMLReaderFactory to get away from this:

XMLReader reader = XMLReaderFactory.createXMLReader( );

Just set the org.xml.sax.driver system property, and you can get your vendor's XMLReader implementation, without importing your vendor's classes:

java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser [MyClassName]

Even better, most vendors will set this property internally, meaning you don't have to worry about this system property at all; just call createXMLReader( ), and go.

As you might expect, Apache Xerces is one of these vendors.
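To see which implementation the factory actually hands back in your environment, a minimal sketch like the following may help. The class name ReaderFactoryDemo and the method readerClassName( ) are made up for illustration; the sketch assumes that some SAX driver is discoverable, whether through the system property or a built-in default. Recent JDKs mark XMLReaderFactory as deprecated, so the warning is suppressed here.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper class, not part of SAXTreeViewer.
@SuppressWarnings("deprecation") // XMLReaderFactory is deprecated on recent JDKs
public class ReaderFactoryDemo {

    // Returns the class name of whatever XMLReader the factory selects.
    public static String readerClassName() {
        try {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            return reader.getClass().getName();
        } catch (SAXException e) {
            // Thrown when no driver class can be located or instantiated
            throw new IllegalStateException("No SAX driver could be loaded", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("XMLReader implementation: " + readerClassName());
    }
}
```

Running this with and without the -Dorg.xml.sax.driver property set should show whether your code is picking up the vendor class you expect.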

With that in mind, it's worth looking at a more realistic app. Example 3-1 is the skeleton for the SAXTreeViewer class, which allows viewing of an XML document as a graphical tree.

Example 3-1. This class sets up an XMLReader and then lists the basic parsing steps

public class SAXTreeViewer extends JFrame {

    // Swing-related variables and methods, including
    // setting up a JTree and basic content pane

    public void buildTree(DefaultTreeModel treeModel,
                          DefaultMutableTreeNode base, String xmlURI)
        throws IOException, SAXException {

        // Create instances needed for parsing
        XMLReader reader =
            XMLReaderFactory.createXMLReader( );

        // Register content handler

        // Register error handler

        // Parse
    }

    public static void main(String[] args) {
        try {
            if (args.length != 1) {
                System.out.println(
                    "Usage: java javaxml3.SAXTreeViewer " +
                    "[XML Document]");
                return;
            }
            SAXTreeViewer viewer = new SAXTreeViewer( );
            viewer.init(args[0]);
            viewer.setVisible(true);
        } catch (Exception e) {
            e.printStackTrace( );
        }
    }
}

In this and the rest of this tutorial's examples, I've tried to cut down all but the crucial portions of code. Import statements and code that isn't related to the concepts at hand (in this case, Swing details) have been excised, and relegated to the online examples.

The buildTree( ) method is where we'll be spending our time in this chapter; you can already see I've placed a few comments to outline the basic steps involved in parsing with SAX.

Parsing the Document

Once a reader is loaded and ready for use, use the parse( ) method to parse XML; this method accepts either an org.xml.sax.InputSource or a simple string URI. It's a much better idea to use the SAX InputSource class, because it can be constructed with an I/O InputStream, a Reader, or a string URI.
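The three ways of constructing an InputSource can be sketched as follows. The class InputSourceKinds, its method names, and the sample XML string are illustrative assumptions, not part of the SAXTreeViewer code.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

// Hypothetical demo class showing the three InputSource constructors.
public class InputSourceKinds {

    // Made-up sample document used for the stream-based sources
    static final String XML = "<play><title>Othello</title></play>";

    // From a string URI: the URI becomes the source's system ID
    public static InputSource fromUri(String uri) {
        return new InputSource(uri);
    }

    // From a byte stream (java.io.InputStream): no system ID is set
    public static InputSource fromBytes() {
        byte[] bytes = XML.getBytes(StandardCharsets.UTF_8);
        return new InputSource(new ByteArrayInputStream(bytes));
    }

    // From a character stream (java.io.Reader): no system ID is set
    public static InputSource fromChars() {
        return new InputSource(new StringReader(XML));
    }

    public static void main(String[] args) {
        System.out.println("URI source system ID:    "
            + fromUri("file:///tmp/othello.xml").getSystemId());
        System.out.println("Byte source system ID:   " + fromBytes().getSystemId());
        System.out.println("Reader source system ID: " + fromChars().getSystemId());
    }
}
```

Only the URI-based source carries a system ID from the start; the two stream-based sources report null until one is set explicitly.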

U-R-What?

A URI is a uniform resource identifier. As the name suggests, it provides a standard means of identifying (and thereby locating, in most cases) a specific resource; this resource is almost always some sort of XML document, for the purposes of this tutorial. URIs are also related to URLs, uniform resource locators. In fact, a URL is always a URI (although the reverse is not true). So in the examples in this and other chapters, you could specify a filename or a URL, like http://www.ibiblio.org/xml/examples/shakespeare/othello.xml, and either would be accepted.

Because the code loads an XML document, either locally or remotely, a java.io.IOException may result, and must be either caught or declared. In addition, an org.xml.sax.SAXException will be thrown if problems occur while parsing the document. Notice that the buildTree( ) method declares that it can throw both of these exceptions:

public void buildTree(DefaultTreeModel treeModel,
                      DefaultMutableTreeNode base, String xmlURI)
    throws IOException, SAXException {

    // Create instances needed for parsing
    XMLReader reader =
        XMLReaderFactory.createXMLReader( );

    // Register content handler

    // Register error handler

    // Parse
    InputSource inputSource = new InputSource(xmlURI);
    reader.parse(inputSource);
}
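To make the difference between the two exception types concrete, here is a small sketch that reports which one a given input triggers. The class ParseOutcomeDemo, the outcome( ) method, and the sample inputs (including the deliberately nonexistent file URI) are assumptions for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical demo class, separate from SAXTreeViewer.
@SuppressWarnings("deprecation") // XMLReaderFactory is deprecated on recent JDKs
public class ParseOutcomeDemo {

    // Parses the source and reports which kind of failure, if any, occurred.
    public static String outcome(InputSource source) {
        try {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.parse(source);
            return "parsed";
        } catch (IOException e) {
            // The document could not be read at all
            return "IOException";
        } catch (SAXException e) {
            // The document was read, but parsing it failed
            return "SAXException";
        }
    }

    public static void main(String[] args) {
        // Well-formed document
        System.out.println(outcome(
            new InputSource(new StringReader("<play><act/></play>"))));
        // Not well-formed: <act> is never closed
        System.out.println(outcome(
            new InputSource(new StringReader("<play><act></play>"))));
        // Made-up location that should not exist on any machine
        System.out.println(outcome(
            new InputSource("file:///no-such-directory-saxdemo/missing.xml")));
    }
}
```

Because no error handler is registered yet, some parsers also print a short message to standard error for the malformed document before throwing the exception.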

Using InputSource for input

The advantage of using an InputSource instead of directly supplying a URI is simple: InputSource can provide more information to the parser. An InputSource encapsulates information about a single object, the document to parse. In situations where a system identifier, public identifier, or stream may all be tied to one URI, using an InputSource for encapsulation can become very handy. The class has accessor and mutator methods for its system ID and public ID, a character encoding, a byte stream (java.io.InputStream), and a character stream (java.io.Reader). SAX also guarantees that the parser will never modify an InputSource passed as an argument to the parse( ) method, so the original input is still available unchanged after its use by a parser or XML-aware app. To put this in perspective, consider parsing a document with a simple DTD reference:

<!DOCTYPE PLAY SYSTEM "play.dtd">

By using an InputSource and wrapping the supplied XML URI, you have implicitly set the system ID of the document. This effectively sets up the path to the document for the parser and allows it to resolve all relative paths within that document, like the play.dtd file. If, instead of setting this ID, you parsed an I/O stream, the parser couldn't locate the DTD (as it has no frame of reference); you can simulate this by changing the code in the buildTree( ) method to what is shown here:

// Parse
InputSource inputSource = new InputSource(new java.io.FileInputStream(
    new java.io.File(xmlURI)));
reader.parse(inputSource);

You'll now get the following exception when running the viewer:

/usr/local/writing/javaxml3>java javaxml3.SAXTreeViewer /usr/local/contents.xml
org.xml.sax.SAXParseException: File "file:///usr/local/writing/javaxml3/play.dtd" not found.

While this seems a little silly (wrapping a URI in a file and I/O stream), it's actually quite common to see people using I/O streams as input to parsers. You just need to set a system ID for the XML stream (using the setSystemId( ) method on InputSource). So the above code sample could be "fixed" by changing it to the following:

// Parse
InputSource inputSource = new InputSource(new java.io.FileInputStream(
    new java.io.File(xmlURI)));
// Give the parser a base location for resolving relative references
inputSource.setSystemId(xmlURI);
reader.parse(inputSource);
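The effect of the system ID can be reproduced in isolation with a sketch like the one below. It writes a tiny document and DTD into a temporary directory and then parses the document as a stream, once with a system ID and once without. Everything here is assumed for illustration: the class SystemIdDemo, its methods, and the file names. It also assumes that the parser loads external DTDs by default (as the Xerces-based parser in the JDK does) and that no file named saxdemo-play.dtd exists in the current working directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical demo class, separate from SAXTreeViewer.
@SuppressWarnings("deprecation") // XMLReaderFactory is deprecated on recent JDKs
public class SystemIdDemo {

    // Writes play.xml and the DTD it references into a fresh temp directory.
    public static Path writeSample() {
        try {
            Path dir = Files.createTempDirectory("saxdemo");
            Files.write(dir.resolve("saxdemo-play.dtd"),
                "<!ELEMENT PLAY (#PCDATA)>".getBytes(StandardCharsets.UTF_8));
            Path xml = dir.resolve("play.xml");
            String doc = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE PLAY SYSTEM \"saxdemo-play.dtd\">\n"
                + "<PLAY>Othello</PLAY>";
            Files.write(xml, doc.getBytes(StandardCharsets.UTF_8));
            return xml;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses the file as a byte stream; returns true if parsing succeeded.
    public static boolean parseStream(Path xmlFile, boolean withSystemId) {
        try (InputStream in = Files.newInputStream(xmlFile)) {
            InputSource source = new InputSource(in);
            if (withSystemId) {
                // Gives the parser a base for resolving the relative DTD path
                source.setSystemId(xmlFile.toUri().toString());
            }
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.parse(source);
            return true;
        } catch (IOException | SAXException e) {
            // Without a system ID, the relative DTD reference cannot be found
            return false;
        }
    }

    public static void main(String[] args) {
        Path xml = writeSample();
        System.out.println("With system ID:    " + parseStream(xml, true));
        System.out.println("Without system ID: " + parseStream(xml, false));
    }
}
```

Under the stated assumptions, the parse with a system ID should succeed and the one without should fail, because the parser falls back to resolving the DTD against the working directory.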

Not much going on...

If you compile and run the program now, nothing of any real interest seems to happen. Despite appearances, though, the XML document is parsed.

By default, Xerces looks for any DTD referred to in a DOCTYPE listing. This means that you'll need to be able to access the DTD referred to in any XML document you parse, either locally or via the network. Otherwise, you'll get an error indicating the DTD isn't available. Xerces won't actually validate XML by default, but does require the DTD referenced be accessible.

However, you've provided no callbacks to take action during the parsing; without these callbacks, a document is simply parsed quietly. Parser callbacks let you insert action into the program flow, and turn the rather boring, quiet parsing of an XML document into an app that can react to the data, elements, attributes, and structure of the document being parsed, as well as interact with other programs and clients along the way.
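As a first taste of what a callback looks like, here is a minimal sketch that counts elements as the parser reports them. It is not the SAXTreeViewer handler; the class ElementCounter and its countElements( ) method are made up for illustration, and the sketch extends org.xml.sax.helpers.DefaultHandler so that only one callback needs to be written.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical handler: counts every element start the parser reports.
@SuppressWarnings("deprecation") // XMLReaderFactory is deprecated on recent JDKs
public class ElementCounter extends DefaultHandler {

    private int count;

    // Called by the parser each time it encounters an opening tag
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        count++;
    }

    // Parses the XML text and returns how many elements were seen.
    public static int countElements(String xml) {
        try {
            ElementCounter counter = new ElementCounter();
            XMLReader reader = XMLReaderFactory.createXMLReader();
            // Register the callback object before parsing begins
            reader.setContentHandler(counter);
            reader.parse(new InputSource(new StringReader(xml)));
            return counter.count;
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // One <play> element plus two <act> elements
        System.out.println(countElements("<play><act/><act/></play>")); // prints 3
    }
}
```

The parse( ) call is the same as before; the only difference is that the reader now has somewhere to send its events.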