Plain Old XML, a SOAP::Lite Alternative


PoXML is a drop-in replacement, of sorts, for the SOAP::Lite-less.


PoXML is a bit of home-brewed trickery for those who don't have the SOAP::Lite [Tip #52] Perl module at their disposal. Perhaps you had more than enough trouble installing it yourself.

Any Perl guru will insist that module installation is as simple as can be. That said, any other Perl guru will be forced to admit that it's an inconsistent experience and often harder than it should be.

PoXML is a drop-in replacement - to a rather decent degree - for SOAP::Lite. It treats Google's SOAP as plain old XML, using the LWP::UserAgent module to make HTTP requests and XML::Simple to parse the XML response. And best of all, it requires little more than a two-line alteration to the target tip.
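
To make the "plain old XML" idea concrete, here is a minimal sketch of the two cleanup steps PoXML applies before parsing: dropping the HTTP headers and dropping namespace prefixes. The canned response is an assumption, a heavily abbreviated stand-in with a made-up value rather than real output from Google. It uses only core Perl, so it runs without LWP or XML::Simple.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A canned, abbreviated stand-in for an HTTP response carrying SOAP.
# Element names mirror the Google response; the value 42 is made up.
my $soap_response = join "\n",
    'HTTP/1.1 200 OK',
    'Content-Type: text/xml; charset=utf-8',
    '',
    q{<?xml version='1.0' encoding='UTF-8'?>},
    '<SOAP-ENV:Envelope>',
    '<SOAP-ENV:Body>',
    '<ns1:doGoogleSearchResponse>',
    '<return><estimatedTotalResultsCount>42</estimatedTotalResultsCount></return>',
    '</ns1:doGoogleSearchResponse>',
    '</SOAP-ENV:Body>',
    '</SOAP-ENV:Envelope>';

# Drop the HTTP headers: everything before the XML declaration
$soap_response =~ s/^.+?(<\?xml)/$1/s;

# Drop element namespace prefixes, so <SOAP-ENV:Body> becomes <Body>
$soap_response =~ s!(<\/?)[\w-]+?:([\w-]+?)!$1$2!g;

print $soap_response, "\n";
```

What is left is an ordinary XML document with plain element names such as Envelope, Body, and doGoogleSearchResponse, which is the shape the module hands to XML::Simple.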

The Code

The heart of this tip is PoXML.pm, a little Perl module best saved into the same directory as your tips.

# PoXML.pm
# PoXML [pronounced "plain old xml"] is a dire-need drop-in
# replacement for SOAP::Lite designed for Google Web API tips.
package PoXML;
use strict;
no strict "refs";
# LWP for making HTTP requests, XML for parsing Google SOAP
use LWP::UserAgent;
use XML::Simple;
# Create a new PoXML
sub new {
    my $self = {};
    bless($self);
    return $self;
}
# Replacement for the SOAP::Lite-based doGoogleSearch method
sub doGoogleSearch {
    my($self, %args);
    ($self, @args{qw/ key q start maxResults filter restrict
        safeSearch lr ie oe /}) = @_;
    # grab SOAP request from __DATA__
    my $tell = tell(DATA);
    my $soap_request = join '', <DATA>;
    seek(DATA, $tell, 0);
    $soap_request =~ s/\$(\w+)/$args{$1}/ge; #interpolate variables
    # Make (POST) a SOAP-based request to Google
    my $ua = LWP::UserAgent->new;
    my $req = HTTP::Request->new(
        POST => 'http://api.google.com/search/beta2');
    $req->content_type('text/xml');
    $req->content($soap_request);
    my $res = $ua->request($req);
    my $soap_response = $res->as_string;
    # Drop the HTTP headers and so forth until the initial xml element
    $soap_response =~ s/^.+?(<\?xml)/$1/migs;
    # Drop element namespaces for tolerance of future prefix changes
    $soap_response =~ s!(<\/?)[\w-]+?:([\w-]+?)!$1$2!g;
    # Parse the XML
    my $results = XMLin($soap_response);
    # Normalize and drop the unnecessary encoding bits
    my $return = $results->{'Body'}->{'doGoogleSearchResponse'}->{return};
    foreach ( keys %{$return} ) {
        $return->{$_}->{content} and
            $return->{$_} = $return->{$_}->{content} || '';
    }
    my @items;
    foreach my $item ( @{$return->{resultElements}->{item}} ) {
        foreach my $key ( keys %$item ) {
            $item->{$key} = $item->{$key}->{content} || '';
        }
        push @items, $item;
    }
    $return->{resultElements} = \@items;
    my @categories;
    foreach my $key ( keys %{$return->{directoryCategories}->{item}} ) {
        $return->{directoryCategories}->{$key} =
            $return->{directoryCategories}->{item}->{$key}->{content} || '';
    }
    # Return nice, clean, usable results
    return $return;
}
1;
# This is the SOAP message template sent to api.google.com. Variables
# signified with $variablename are replaced by the values of their
# counterparts sent to the doGoogleSearch subroutine.
__DATA__
<?xml version='1.0' encoding='UTF-8'?>
<SOAP-ENV:Envelope
 xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
 xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
 xmlns:xsd="http://www.w3.org/1999/XMLSchema">
 <SOAP-ENV:Body>
  <ns1:doGoogleSearch xmlns:ns1="urn:GoogleSearch"
   SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
   <key xsi:type="xsd:string">$key</key>
   <q xsi:type="xsd:string">$q</q>
   <start xsi:type="xsd:int">$start</start>
   <maxResults xsi:type="xsd:int">$maxResults</maxResults>
   <filter xsi:type="xsd:boolean">$filter</filter>
   <restrict xsi:type="xsd:string">$restrict</restrict>
   <safeSearch xsi:type="xsd:boolean">$safeSearch</safeSearch>
   <lr xsi:type="xsd:string">$lr</lr>
   <ie xsi:type="xsd:string">$ie</ie>
   <oe xsi:type="xsd:string">$oe</oe>
  </ns1:doGoogleSearch>
 </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
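
The template-filling step can be tried on its own. The sketch below applies the same kind of $variablename substitution to a three-line excerpt of the template. Note that the module inserts values as-is, so a query containing & or < would produce malformed XML. The xml_escape helper here is a hypothetical addition, not part of PoXML.pm, and the argument values are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in PoXML.pm): escape characters that would
# otherwise break the XML when they appear in an argument value.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Made-up argument values, keyed by the names used in the template
my %args = ( key => 'insert key here', q => 'cats & dogs', start => 0 );

# A short excerpt of the SOAP template; single-quoted heredoc so that
# Perl itself leaves $key, $q, and $start alone
my $template = <<'END_TEMPLATE';
<key xsi:type="xsd:string">$key</key>
<q xsi:type="xsd:string">$q</q>
<start xsi:type="xsd:int">$start</start>
END_TEMPLATE

# Replace each $name with the (escaped) value of $args{name}
(my $filled = $template) =~ s/\$(\w+)/xml_escape($args{$1})/ge;

print $filled;
```

If you want the same protection in the module, one option is to wrap $args{$1} in a helper like this inside doGoogleSearch's substitution.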

Here's a little script to show PoXML in action. It's no different, really, from any number of tips in this tutorial. The only alterations necessary to make use of PoXML instead of SOAP::Lite are the two lines that follow the commented-out SOAP::Lite originals.

#!/usr/bin/perl
# poxml_google2csv.pl
# Google Web Search Results via PoXML ("plain old xml") module
# exported to CSV suitable for import into Excel
# Usage: poxml_google2csv.pl "{query}" [> results.csv]
# Your Google API developer's key
my $google_key = 'insert key here';
use strict;
# use SOAP::Lite;
use PoXML;
$ARGV[0]
    or die qq{usage: perl poxml_google2csv.pl "{query}"\n};
# my $google_search = SOAP::Lite->service("file:$google_wdsl");
my $google_search = new PoXML;
my $results = $google_search ->
    doGoogleSearch(
        $google_key, shift @ARGV, 0, 10, "false",
        "", "false", "", "latin1", "latin1"
    );
@{$results->{'resultElements'}} or die('No results');
print qq{"title","url","snippet"\n};
foreach (@{$results->{'resultElements'}}) {
    $_->{title} =~ s!"!""!g; # double escape " marks
    $_->{snippet} =~ s!"!""!g;
    my $output = qq{"$_->{title}","$_->{URL}","$_->{snippet}"\n};
    $output =~ s!<.+?>!!g; # drop all HTML tags
    print $output;
}
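
The CSV handling in the loop above comes down to two substitutions per field: double any embedded quote marks and strip HTML tags. As a sketch, the same steps can be gathered into one helper. The name csv_field and the sample record are hypothetical, not part of the original script, and the helper strips tags before quoting rather than after.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): turn one value
# into a quoted CSV field.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s!<.+?>!!g;   # drop all HTML tags
    $value =~ s!"!""!g;     # double any embedded " marks
    return qq{"$value"};
}

# A made-up result record using the same keys as the script
my %result = (
    title   => 'The <b>"Plain Old"</b> XML page',
    URL     => 'http://www.example.com/',
    snippet => 'A <b>made-up</b> snippet',
);

print join(',', map { csv_field($result{$_}) } qw(title URL snippet)), "\n";
```

Treating undefined values as empty strings keeps a missing snippet from triggering warnings; whether that suits your data is a judgment call.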

Running the Tip

Run the script from the command line, providing a query as an argument and redirecting the output to the CSV file you wish to create. (To append additional results to an existing file, use >> in place of >.) For example, using "plain old xml" as our query and results.csv as our output:

$ perl poxml_google2csv.pl "plain old xml" > results.csv

Leaving off the > and CSV filename sends the results to the screen for your perusal.

The Results

% perl poxml_google2csv.pl "plain old xml"
"title","url","snippet"
"XML.com: Distributed XML [Sep. 06, 2000]",
"http://www.xml.com/pub/2000/09/06/distributed.html",
" ... extensible. Unlike plain old XML, there's no sense of 
constraining what the document can describe by a DTD or schema. 
This means ... "
...
"Plain Old Documentation",
"http://axkit.org/wiki/view/AxKit/PlainOldDocumentation",
" ... perlpodspec - Plain Old Documentation: format specification and notes. ... Examples: =pod This is a plain Pod paragraph. ... 
encodings in Pod parsing would be as in XML ... "

Applicability and Limitations

In the same manner, you can adapt just about any SOAP::Lite-based tip in this tutorial and those you've made up yourself to use PoXML.

  1. Place PoXML.pm in the same directory as the tip at hand.
  2. Replace use SOAP::Lite; with use PoXML;.
  3. Replace my $google_search = SOAP::Lite->service("file:$google_wdsl"); with my $google_search = new PoXML;.

There are, however, some limitations. While PoXML works nicely to extract results and aggregate results the likes of <estimatedTotalResultsCount />, it falls down on gleaning some of the more advanced result elements like <directoryCategories />, an array of categories turned up by the query.

In general, bear in mind that your mileage may vary, and don't be afraid to tweak.
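
One tweak you might consider, offered as a sketch rather than a fix the module includes: by default, XML::Simple represents an element that occurs once as a hash reference and one that repeats as an array reference, while PoXML.pm dereferences the result items as an array. A small normalizing helper sidesteps the difference. The name as_list and the sample data below are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in PoXML.pm): return a list whether given
# an array reference, a single value, or nothing at all.
sub as_list {
    my ($value) = @_;
    return ()      unless defined $value;
    return @$value if ref $value eq 'ARRAY';
    return ($value);
}

# Made-up stand-ins for parsed results with several items and with one
my $many = { item => [ { title => 'first' }, { title => 'second' } ] };
my $one  = { item => { title => 'only' } };

for my $set ( $many, $one ) {
    my @items = as_list( $set->{item} );
    print scalar(@items), " item(s): ",
        join( ', ', map { $_->{title} } @items ), "\n";
}
```

XML::Simple's ForceArray option is another route to the same end, if you prefer to handle it at parse time.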

See Also

  • NoXML [Tip #54], a regular expressions-based, XML Parser-free SOAP::Lite alternative
  • XooMLE [Tip #36], a third-party service offering an intermediary plain old XML interface to the Google Web API