The HTML Modules
HTML modules provide an interface to parse HTML documents. After you parse the document, you can print or display it according to the markup tags or extract specific information such as hyperlinks.
The HTML::Parser module provides methods for, literally, parsing HTML. It can handle HTML text from a string or file and can separate out the syntactic structures and data. You shouldn't use HTML::Parser directly, however, since its interface hasn't been designed to make your life easy when you parse HTML. It's merely a base class from which you can build your own parser to deal with HTML in any way you want. If you don't want to roll your own HTML parser or parser class, there are HTML::TokeParser and HTML::TreeBuilder, both of which are covered in this chapter.
HTML::TreeBuilder is a class that parses HTML into a syntax tree. In a syntax tree, each element of the HTML, such as container elements with beginning and end tags, is stored relative to other elements. This preserves the nested structure and behavior of HTML and its hierarchy.
A syntax tree of the TreeBuilder class is formed of connected nodes that represent each element of the HTML document. These nodes are saved as objects from the HTML::Element class. An HTML::Element object stores all the information from an HTML tag: the start tag, end tag, attributes, plain text, and pointers to any nested elements.
The remaining classes of the HTML modules use the syntax tree and its nodes of element objects to output useful information from the HTML documents. The format classes, such as HTML::FormatText and HTML::FormatPS, allow you to produce text and PostScript from HTML. The HTML::LinkExtor class extracts all of the links from a document. Additional modules provide means for replacing HTML character entities and implementing HTML tags as subroutines.
HTML::Parser
This module implements the base class for the other HTML modules.
A parser object is created with the new constructor:
    $p = HTML::Parser->new( );
The constructor takes no arguments. The parser object provides methods that read in HTML from a string or a file. The string-reading method can take data in several smaller chunks if the HTML is too big. Each chunk of HTML will be appended to the object, and the eof method indicates the end of the document.
$p->eof( ) Indicates the end of a document and flushes any buffered text. Returns the parser object.
$p->parse(string) Reads HTML into the parser object from a given string.
$p->parse_file(file) Reads HTML into the parser object from the given file.
The following list shows the internal methods contained in HTML::Parser.
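comment(comment) Invoked on comments from HTML (text between <!-- and -->).
end(tag, origtext) Invoked on end tags (those with the </tag> form).
start(tag, $attr, attrseq, origtext) Invoked on start tags. The first argument, tag, is the name of the tag. The second, $attr, is a reference to a hash of the tag's attributes and their values. The third, attrseq, is a reference to an array of the attribute names in the order in which they appeared. The fourth, origtext, is the original text of the tag.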
text(text) Invoked on plain text in the document. The text is passed unmodified and may contain newlines. Character entities in the text are not expanded.
xml_mode(bool) Enabling this attribute changes the parser to allow some XML constructs such as empty element tags and XML processing instructions. It also disables forcing tag and attribute names to lowercase when they are reported.
HTML::TokeParser
As we said, you should use a subclassed HTML parser if you want a better interface to HTML parsing features than what HTML::Parser gives you. HTML::TokeParser by Gisle Aas is one such example. While HTML::TokeParser is actually a subclass of HTML::PullParser, it can help you do many useful things, such as link extraction and HTML checking.
In short, HTML::TokeParser breaks an HTML document into tokens, attributes, and content, in which the HTML <a href="http://url">link</a> breaks down as follows:
    token: a
    attrib: href
    content: http://url
    content: link
    token: /a
For example, you can use HTML::TokeParser to extract links from a string that contains HTML:
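The listing that belongs here did not survive in this copy. As a stand-in, here is a minimal sketch of link extraction with HTML::TokeParser. The input string, the URLs in it, and the variable names are illustrative assumptions, not the original example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical input; any string containing HTML would do.
my $html = '<p>See <a href="http://www.example.com/">Example</a> and '
         . '<a href="http://www.example.org/docs">the docs</a>.</p>';

# Passing a reference to a scalar makes the parser read the string itself.
my $p = HTML::TokeParser->new(\$html)
    or die "Could not create parser";

my @links;
# get_tag('a') skips ahead to the next <a> start tag.
while (my $tag = $p->get_tag('a')) {
    # For a start tag, the result is [tagname, \%attr, \@attrseq, origtext].
    my $href = $tag->[1]{href};
    next unless defined $href;
    # Collect the link text up to the closing </a>.
    my $label = $p->get_trimmed_text('/a');
    push @links, [$href, $label];
}

printf "%s (%s)\n", @$_ for @links;
```

Filtering with get_tag('a') keeps the loop short, because every other tag and all intervening text are skipped without further checks.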
HTML::TokeParser methods
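new( ) Constructor. Takes a filename, filehandle, or reference to a scalar as its argument. The argument represents the content that will be parsed. If a plain scalar is present, new treats it as the name of the file to parse; if a reference to a scalar is present, new treats the referenced string as the HTML itself.
get_tag( ) Returns the next start or end tag in a document. If there are no remaining start or end tags, get_tag returns undef.
get_text( ) Returns all text found at the current position. If the next token is not text, get_text returns a zero-length string.
get_token( ) Returns the next token found in the HTML document, or undef if no tokens remain.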
Consider the following code:
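The original listing is missing from this copy. The sketch below is a reconstruction under stated assumptions: it dumps the items of every token returned by get_token, and the input string (including its URL) is hypothetical, chosen only to resemble the token listing in this section.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical input, not the original example's HTML.
my $html = '<p><a href="http://www.example.com/">My name is Nate!</a></p>';

my $p = HTML::TokeParser->new(\$html)
    or die "Could not create parser";

my @types;   # token types in document order
while (my $token = $p->get_token) {
    # Each token is an array reference; item 0 is the token type
    # (S for a start tag, E for an end tag, T for text, and so on).
    push @types, $token->[0];
    for my $i (0 .. $#$token) {
        # Some items may be undefined; print those as empty strings.
        my $item = defined $token->[$i] ? $token->[$i] : '';
        print "token[$i]: $item\n";
    }
}
```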
The items in each token (in the HTML) are displayed as follows:
    token[0]: S
    token[1]: a
    token[2]: HASH(0x8146d3c)
    token[3]: ARRAY(0x814a380)
    token[4]: <a href="http://web.archive.org/web/blah">
    token[0]: T
    token[1]: My name is Nate!
    token[2]:
    token[0]: E
    token[1]: a
    token[2]: </a>
    token[0]: E
    token[1]: p
    token[2]: </p>
get_trimmed_text( ) Works the same as get_text, but removes leading and trailing whitespace and collapses each run of whitespace into a single space.
unget_token( ) Useful for pushing tokens back to the parser so they can be reused the next time you call get_token.
HTML::Element
The HTML::Element module provides methods for dealing with nodes in an HTML syntax tree. You can get or set the contents of each node, traverse the tree, and delete a node.
HTML::Element objects are used to represent elements of HTML. These elements include start and end tags, attributes, contained plain text, and other nested elements.
The constructor for this class requires the name of the tag for its first argument. You may optionally specify initial attributes and values as hash elements in the constructor. For example:
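The example itself is missing from this copy. A minimal sketch of the constructor call described above might look like the following; the URL and link text are placeholders, not values from the original.

```perl
use strict;
use warnings;
use HTML::Element;

# Tag name first, then attribute/value pairs (placeholder URL).
my $h = HTML::Element->new('a', href => 'http://www.example.com/');

# Add some text content so the element serializes to a complete link.
$h->push_content('Example site');

print $h->tag, "\n";            # the tag name
print $h->attr('href'), "\n";   # the value of the href attribute
print $h->as_HTML, "\n";        # the element rendered as HTML
```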
The new element is created for the anchor tag, <a>.
The following methods are provided for objects of the HTML::Element class.
$h->as_HTML( ) Returns the HTML string that represents the element and its children.
$h->attr(name [,value]) Sets or retrieves the value of attribute name in the current element.
$h->content( ) Returns the content contained in this element as a reference to an array that contains plain-text segments and references to nested element objects.
$h->delete( ) Deletes the current element and all of its child elements.
$h->delete_content( ) Removes the content from the current element.
$h->dump( ) Prints the tag name of the element and all its children to STDOUT. Useful for debugging. The structure of the document is shown by indentation.
$h->endtag( ) Returns the original text of the end tag, including the "</" and ">" characters.
$h->extract_links([types]) Retrieves the links contained within an element and all of its child elements. This method returns a reference to an array in which each element is a reference to an array with two values: the value of the link and a reference to the element in which it was found. You may specify the tags from which you want to extract links by providing their names in a list of types.
$h->implicit([boolean]) Indicates whether the element was contained in the original document (false) or whether it was assumed to be implicit (true) by the parser. Implicit tags are elements that the parser included to conform to proper HTML structure, such as an ending paragraph tag (</p>).
$h->insert_element($element, implicit) Inserts the object $element at the current position in the tree (see pos) and updates the position to point to the inserted element. The implicit argument is a Boolean that indicates whether the element is implicit (true) or appeared in the original HTML (false).
$h->is_empty( ) Returns true if the current object has no content.
$h->is_inside(tag1 [,tag2, ...]) Returns true if the tag for this element is contained inside one of the tags listed as arguments.
$h->parent([$new]) Without an argument, returns the parent object for this element. If given a reference to another element object, this element is set as the new parent object and is returned.
$h->pos([$element]) Sets or retrieves the current position in the syntax tree of the current object. The returned value is a reference to the element object that holds the current position. The "position" object is an element contained within the tree that has the current object ($h) at its root.
$h->push_content(content) Inserts the specified content into the current element.
$h->starttag( ) Returns the original text of the start tag for the element. This includes the "<" and ">" characters and the attributes.
$h->tag([name]) Sets or retrieves the tag name of the element.
$h->traverse(sub, [ignoretext]) Traverses the current element and all of its children, invoking the callback routine sub for each element.
HTML::TreeBuilder
The HTML::TreeBuilder class provides a parser that creates an HTML syntax tree. Each node of the tree is an HTML::Element object. This class inherits from both HTML::Parser and HTML::Element, so methods from both of those classes can be used on its objects.
The methods provided by HTML::TreeBuilder control how the parsing is performed. Values for these methods are set by providing a Boolean value for their arguments.
$p->ignore_text(boolean) If set to true, text content of elements will not be included in elements of the parse tree. The default is false.
$p->ignore_unknown(boolean) If set to true, unknown tags in the HTML are ignored. If set to false, they are represented as elements in the parse tree.
$p->implicit_tags(boolean) If set to true, the parser will try to deduce implicit tags such as missing elements or end tags that are required to conform to proper HTML structure. If false, the parse tree will reflect the HTML as is.
$p->warn(boolean) If set to true, the parser will make calls to warn with messages describing syntax errors in the document.
HTML::FormatPS
The HTML::FormatPS module converts an HTML parse tree into PostScript. The formatter object is created with the new constructor.
You can now give parsed HTML to the formatter and produce PostScript output for printing. HTML::FormatPS does not handle table or form elements at this time. The method for this class is format:
    use HTML::TreeBuilder;
    use HTML::FormatPS;
    # $somefile holds the name of the HTML file to parse
    $html = HTML::TreeBuilder->new->parse_file($somefile);
    $formatter = HTML::FormatPS->new( );
    print $formatter->format($html);
The following list describes the attributes that can be set in the constructor:
HTML::FormatText
The HTML::FormatText module takes a parsed HTML file and outputs a plain-text version of it. Character attributes such as bold or italic fonts and font sizes are not preserved. This module is similar to FormatPS in that the constructor takes attributes for formatting, and the format method produces the output:
    $formatter = HTML::FormatText->new(leftmargin => 10, rightmargin => 80);
The constructor can take two parameters: leftmargin and rightmargin. The format method returns the formatted text:
    print $formatter->format($html);