A Handler Base Class

SAX doesn't distinguish between different elements; it leaves that burden up to you. You have to sort out the element name in the start_element( ) handler, and maybe use a stack to keep track of element hierarchy. Don't you wish there were some way to abstract that stuff? Ken MacLeod has done just that with his XML::Handler::Subs module.

This module defines an object that branches handler calls to more specific handlers. If you want a handler that deals only with <title> elements, you can write that handler and it will be called. The name of a handler for a start tag must begin with s_, followed by the element's name (replace special characters with an underscore). End tag handlers are named the same way, but start with e_ instead of s_.
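To make the naming convention concrete, here is a minimal, self-contained sketch of how such branching could work. It is not the module's own code: the package names Sketch::Dispatcher and Title_watcher, the handler_name( ) helper, and the Log field are all invented for illustration, and the exact set of characters treated as "special" is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a dispatching base class (not the module's code).
package Sketch::Dispatcher;

sub new {
    my $class = shift;
    return bless( {@_}, $class );
}

# Build a method name from a prefix and an element name. Treating every
# character outside [a-zA-Z0-9_] as "special" is an assumption of this sketch.
sub handler_name {
    my( $self, $prefix, $name ) = @_;
    ( my $safe = $name ) =~ s/[^a-zA-Z0-9_]/_/g;
    return $prefix . $safe;
}

# Generic callbacks: branch to s_NAME or e_NAME when the subclass defines one.
sub start_element {
    my( $self, $element ) = @_;
    my $method = $self->handler_name( 's_', $element->{Name} );
    $self->$method( $element ) if( $self->can( $method ));
}

sub end_element {
    my( $self, $element ) = @_;
    my $method = $self->handler_name( 'e_', $element->{Name} );
    $self->$method( $element ) if( $self->can( $method ));
}

# A subclass that cares only about <title> and <xhtml:body>.
package Title_watcher;
our @ISA = ( 'Sketch::Dispatcher' );

sub s_title      { my $self = shift; push @{ $self->{Log} }, 'start title' }
sub e_title      { my $self = shift; push @{ $self->{Log} }, 'end title' }
sub s_xhtml_body { my $self = shift; push @{ $self->{Log} }, 'start xhtml:body' }

package main;

my $handler = Title_watcher->new( Log => [] );

# Feed a few hand-made events in place of a real parser.
$handler->start_element( { Name => 'html' } );        # no s_html: ignored
$handler->start_element( { Name => 'title' } );       # calls s_title
$handler->end_element(   { Name => 'title' } );       # calls e_title
$handler->start_element( { Name => 'xhtml:body' } );  # calls s_xhtml_body
$handler->end_element(   { Name => 'html' } );        # no e_html: ignored

print "$_\n" for @{ $handler->{Log} };
```

Elements without a matching s_ or e_ method are simply skipped, so a subclass needs to define handlers only for the elements it cares about.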

That's not all. The base object also has a built-in stack and provides accessor methods to check where you are in the document. The $self->{Names} variable refers to a stack of element names. Use the method in_element( $name ) to test whether the innermost open element is named $name, or within_element( $name ) to test whether the parser is inside an element named $name at any depth.
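The difference between looking at the top of the stack and searching the whole stack is easy to miss, so here is a small sketch of both queries. It follows the behavior described in the module's documentation, where in_element( ) matches the innermost open element and a companion method, within_element( ), counts matching elements at any depth. The Sketch::Stack package is hypothetical and stands in for the real base class.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the base class's element stack (not the module's code).
package Sketch::Stack;

sub new { return bless( { Names => [] }, shift ) }

# Push on start tags, pop on end tags.
sub start_element { my( $self, $el ) = @_; push @{ $self->{Names} }, $el->{Name} }
sub end_element   { my( $self, $el ) = @_; pop @{ $self->{Names} } }

# True only if the innermost open element is $name.
sub in_element {
    my( $self, $name ) = @_;
    return @{ $self->{Names} } && $self->{Names}[-1] eq $name ? 1 : 0;
}

# Number of open elements named $name, at any depth.
sub within_element {
    my( $self, $name ) = @_;
    return scalar grep { $_ eq $name } @{ $self->{Names} };
}

package main;

# Simulate being at the text inside <html><body><h1><em>.
my $stack = Sketch::Stack->new;
$stack->start_element( { Name => $_ } ) for qw( html body h1 em );

my $innermost_is_h1 = $stack->in_element( 'h1' );      # innermost is <em>
my $inside_h1       = $stack->within_element( 'h1' );  # <h1> is an ancestor

print "in_element(h1): $innermost_is_h1\n";
print "within_element(h1): $inside_h1\n";
```

With the stack at html, body, h1, em, the first query is false and the second is true, which is why a handler that wants all the text under an element, including text in nested inline elements, should search the whole stack.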

To try this out, let's write a program that does something element-specific. Given an HTML file (which must be well-formed XML for the parser to accept it), the program outputs everything inside an <h1> element, even inline elements used for emphasis. The code, shown in Example 5-7, is breathtakingly simple.

Example 5-7. A program subclassing the handler base

use XML::Parser::PerlSAX;
use XML::Handler::Subs;

#
# initialize the parser
#
my $parser = XML::Parser::PerlSAX->new( Handler => H1_grabber->new( ) );
$parser->parse( Source => {SystemId => shift @ARGV} );

##
## Handler object: H1_grabber
##
package H1_grabber;
use base( 'XML::Handler::Subs' );

sub new {
    my $type = shift;
    my $self = {@_};
    return bless( $self, $type );
}

#
# handle start of document
#
sub start_document {
    my $self = shift;
    # let the base class set up its element stack first
    $self->SUPER::start_document( @_ );
    print "Summary of file:\n";
}

#
# handle start of <h1>: output bracket as delineator
#
sub s_h1 {
    print "[";
}

#
# handle end of <h1>: output bracket as delineator
#
sub e_h1 {
    print "]\n";
}

#
# handle character data
#
sub characters {
    my( $self, $props ) = @_;
    my $data = $props->{Data};
    # within_element( ) is true at any depth inside <h1>;
    # in_element( ) would match only when <h1> is the innermost element
    print $data if( $self->within_element( 'h1' ));
}

Let's feed the program a test file:

<html>
<head><title>The Life and Times of Fooby</title></head>
<body>
<h1>Fooby as a child</h1>
<p>...</p>
<h1>Fooby grows up</h1>
<p>...</p>
<h1>Fooby is in <em>big</em> trouble!</h1>
<p>...</p>
</body>
</html>

This is what we get on the other side:

Summary of file:
[Fooby as a child]
[Fooby grows up]
[Fooby is in big trouble!]

Even the text inside the <em> element was included, because the characters( ) handler checks for an enclosing <h1> at any depth rather than looking only at the innermost element. XML::Handler::Subs is definitely a useful module to have when doing SAX processing.