A Handler Base Class
SAX doesn't distinguish between different elements; it leaves that burden up to you. You have to sort out the element name in the start_element( ) handler, and maybe use a stack to keep track of element hierarchy. Don't you wish there were some way to abstract that stuff? Ken MacLeod has done just that with his XML::Handler::Subs module.
This module defines an object that branches handler calls to more specific handlers. If you want a handler that deals only with <title> elements, you can write that handler and it will be called. The name of a handler dealing with a start tag must begin with s_, followed by the element's name (characters that are not legal in a Perl subroutine name, such as hyphens and colons, are replaced with underscores). End tag handlers follow the same rule, but their names start with e_ instead of s_.
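The naming convention can be pictured with a small stand-alone sketch. The TinyDispatch class below is hypothetical and is not the module's source; it assumes only that a handler name is built by prefixing s_ or e_ to the element name, replacing every character outside letters, digits, and underscore with an underscore, and that the handler is looked up with can( ).

```perl
use strict;
use warnings;

# Hypothetical stand-in for a dispatching base class (not the real module).
package TinyDispatch;

sub new { my $class = shift; return bless { Log => [] }, $class }

# Build a handler name such as s_title or e_dc_title.
sub handler_name {
    my ($self, $prefix, $name) = @_;
    (my $safe = $name) =~ s/[^A-Za-z0-9_]/_/g;
    return $prefix . $safe;
}

# Call the element-specific start handler if the subclass defines one.
sub start_element {
    my ($self, $element) = @_;
    my $sub = $self->handler_name('s_', $element->{Name});
    return $self->can($sub) ? $self->$sub($element) : 0;
}

# Call the element-specific end handler if the subclass defines one.
sub end_element {
    my ($self, $element) = @_;
    my $sub = $self->handler_name('e_', $element->{Name});
    return $self->can($sub) ? $self->$sub($element) : 0;
}

# A subclass that cares only about <title> and <dc:title>.
package TitleWatcher;
our @ISA = ('TinyDispatch');

sub s_title    { my $self = shift; push @{ $self->{Log} }, 'start title';    return 1 }
sub e_title    { my $self = shift; push @{ $self->{Log} }, 'end title';      return 1 }
sub s_dc_title { my $self = shift; push @{ $self->{Log} }, 'start dc:title'; return 1 }

package main;

my $h = TitleWatcher->new;
$h->start_element({ Name => 'title' });
$h->end_element({ Name => 'title' });
$h->start_element({ Name => 'dc:title' });   # colon becomes an underscore
$h->start_element({ Name => 'p' });          # no s_p handler, so nothing happens
print "$_\n" for @{ $h->{Log} };
```

Elements without a matching handler simply fall through, which is what lets a subclass define handlers for only the handful of elements it cares about.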
That's not all. The base object also has a built-in stack and provides accessor methods to check where you are in the document. The $self->{Names} variable refers to a stack of element names. Use the method in_element( $name ) to test whether $name is the innermost element currently open, or within_element( $name ) to test whether the parser is inside an element named $name at any depth.
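The difference between checking the innermost element and checking any enclosing element is easy to see on a plain array. The sketch below does not use the module at all; the helper names innermost_is( ) and depth_of( ) are hypothetical and only mirror the two kinds of stack test.

```perl
use strict;
use warnings;

# Names of the open elements, outermost first, as they would sit on the stack
# while the parser reads the text inside <em> in
# <html><body><h1>...<em>...</em></h1></body></html>.
my @names = qw( html body h1 em );

# True only when $name is the innermost open element.
sub innermost_is {
    my ($name, @stack) = @_;
    return @stack && $stack[-1] eq $name ? 1 : 0;
}

# Number of open elements called $name, at any depth.
sub depth_of {
    my ($name, @stack) = @_;
    return scalar grep { $_ eq $name } @stack;
}

printf "innermost is h1: %d\n", innermost_is('h1', @names);
printf "innermost is em: %d\n", innermost_is('em', @names);
printf "open h1 count:   %d\n", depth_of('h1', @names);
```

While the parser is inside <em>, only the depth-style test still reports the surrounding <h1>, so that is the one to use when nested inline markup should count as part of its parent.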
To try this out, let's write a program that does something element-specific. Given a well-formed HTML file, the program outputs everything inside an <h1> element, including the text of inline elements used for emphasis. The code, shown in Example 5-7, is breathtakingly simple.
Example 5-7. A program subclassing the handler base
use XML::Parser::PerlSAX;
use XML::Handler::Subs;

#
# initialize the parser
#
my $parser = XML::Parser::PerlSAX->new( Handler => H1_grabber->new( ) );
$parser->parse( Source => {SystemId => shift @ARGV} );

##
## Handler object: H1_grabber
##
package H1_grabber;
use base( 'XML::Handler::Subs' );

sub new {
    my $type = shift;
    my $self = {@_};
    return bless( $self, $type );
}

#
# handle start of document
#
sub start_document {
    my $self = shift;
    $self->SUPER::start_document( @_ );   # let the base class set up its stack
    print "Summary of file:\n";
}

#
# handle start of <h1>: output bracket as delineator
#
sub s_h1 {
    print "[";
}

#
# handle end of <h1>: output bracket as delineator
#
sub e_h1 {
    print "]\n";
}

#
# handle character data
#
sub characters {
    my( $self, $props ) = @_;
    my $data = $props->{Data};
    # within_element( ) is true at any depth inside <h1>;
    # in_element( ) would match only when <h1> is the innermost element
    print $data if( $self->within_element( 'h1' ));
}
Let's feed the program a test file:
<html>
  <head><title>The Life and Times of Fooby</title></head>
  <body>
    <h1>Fooby as a child</h1>
    <p>...</p>
    <h1>Fooby grows up</h1>
    <p>...</p>
    <h1>Fooby is in <em>big</em> trouble!</h1>
    <p>...</p>
  </body>
</html>
This is what we get on the other side:
Summary of file:
[Fooby as a child]
[Fooby grows up]
[Fooby is in big trouble!]
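Even the text inside the <em> element was included, because the characters( ) handler checks the element stack for an enclosing <h1> at any depth rather than looking only at the immediate parent. XML::Handler::Subs is definitely a useful module to have when doing SAX processing.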