XooMLe: The Google API in Plain Old XML

Getting Google results in XML using the XooMLe wrapper.

When Google released their Web APIs in April 2002, everyone agreed that it was fantastic, but some thought it could have been better. Google's API was to be driven by Simple Object Access Protocol (SOAP), which wasn't exactly what everyone was hoping for.

What's wrong with SOAP? Google made the biggest, best search engine in the world available as a true web service, so it must be a good thing, right? Sure, but a lot of people argued that by using SOAP, they had made it unnecessarily difficult to access Google's service. They argued that using simple HTTP-based technologies would have provided everything they needed, while also making it a much simpler service to use.

The irony of this was not lost on everyone - Google, being so well-known and widely used, in part because of its simplicity, was now being slammed for making their service difficult to access for developers.

The argument was out there: SOAP was bad, Google needed a REST! Representational State Transfer (REST) is a model for web services that makes use of existing protocols and technologies, such as HTTP GET requests, URIs, and XML to provide a transaction-based approach to web services. The argument was that REST provided a much simpler means of achieving the same results, given Google's limited array of functionality.

REST proponents claimed that Google should have made their API available through the simpler approach of requesting a defined URI, including query string-based parameters such as the search term and the output encoding. The response would then be a simple XML document that included results or an error of some sort.

After playing with the Google API, I had enough exposure to at least know my way around the WSDL and other bits and pieces involved in working with Google. I read a lot of the suggestions and proposals for how Google "should have done it" and set about actually doing it. The result was XooMLe (http://www.dentedreality.com.au/xoomle/).

The first step was to create a solid architecture for accessing the Google API. I was working with the popular and very powerful scripting language, PHP, so this was made very simple by grabbing a copy of Dietrich Ayala's SOAP access class called NuSOAP. Once I had that in place, it was a simple process of writing a few functions and bits and pieces to call the SOAP class, query Google, then reformat the response to something a bit "lighter."

I chose to implement a system that would accept a request for a single URL (because at this stage I wasn't too familiar with the RESTful way of doing things) containing a number of parameters, depending on which method was being called from Google. The information returned would depend on the type of request, as outlined here:

Google method and return type:

  • doGoogleSearch: XML document containing structured information about the results and the actual search process
  • doSpellingSuggestion: Plain text response containing suggested spelling correction
  • doGetCachedPage: HTML source for the page requested

All the methods would also optionally return a standardized, XML-encoded error message if something went wrong, which would allow developers to easily determine if their requests were successful.

Providing this interface required only a small amount of processing before returning the information to the user. For a call to doGoogleSearch, the results were simply mapped onto the XML template and returned. doSpellingSuggestion just had to pull out the suggestion and send that back, while doGetCachedPage had to decode the result (from base-64 encoding) and then strip off the first 5 lines of HTML, which contained a Google header. This allowed XooMLe to return just what the user requested: a clean, cached copy of a page, a simple spelling suggestion, or a set of results matching a search term. Searching was XooMLe's first hurdle: returning SOAP-encoded results from Google in clean, custom XML tags, minus the "fluff."

I chose to use an XML template rather than hardcoding the structure directly into my code. The template holds the basic structure of a result set returned from Google. It includes things like the amount of time the search took, the title of each result, their URLs, plus other information that Google tracks. This XML template is based directly on the structure outlined in the WSDL and obviously on the actual information returned from Google. It is parsed, and then sections of it are duplicated and modified as required, so that a clean XML document is populated with the results, then sent to the user. If something goes wrong, an error message is encoded in a different XML template and sent back instead.

Once searching was operational, spelling suggestions were quickly added, simply removing the suggestion from its SOAP envelope and returning it as plain text. Moving on to the cached pages proved to require a small amount of manipulation, where the returned information had to be converted back to a plain string (originally a base-64 encoded string from Google) and then the Google header, which is automatically added to the pages in their cache, had to be removed. Once that was complete, the page was streamed back to the user, so that if she printed the results of the request directly to the screen, a cached copy of the web page would be displayed directly.
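The cached-page cleanup described above can be sketched in Perl. This is only an illustration, not XooMLe's actual code (which was written in PHP): the function name clean_cached_page is hypothetical, the sample page is made up, and the default of five header lines is taken from the description earlier in this tip and may not match what Google really sends.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64 encode_base64);

# Decode a base-64 payload and drop the first $header_lines lines.
# $header_lines defaults to 5, the figure given in the text; this is an
# assumption about the header length, not a documented guarantee.
sub clean_cached_page {
    my ($encoded, $header_lines) = @_;
    $header_lines = 5 unless defined $header_lines;

    my $html  = decode_base64($encoded);
    my @lines = split /\n/, $html, -1;

    # Remove the leading header lines, keeping everything after them.
    # Pages shorter than the header are returned unchanged.
    splice @lines, 0, $header_lines if @lines > $header_lines;
    return join "\n", @lines;
}

# Usage with a made-up page: five header lines followed by the page itself
my $page = join "\n",
    (map { "header line $_" } 1 .. 5),
    '<html>', '<body>Hi</body>', '</html>';
my $cleaned = clean_cached_page(encode_base64($page));
print $cleaned, "\n";
```

Counting lines is fragile if the header ever changes length, so a real implementation might prefer to look for a marker that ends the header instead.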

After posting the results of this burst of development to the DentedReality web site, nothing much happened. No one knew about XooMLe, so no one used it. I happened to be reading Dave Winer's Scripting News, so I fired off an email to him about XooMLe, just suggesting that he might be interested in it. Five minutes later (literally) there was a link to it on Scripting News describing it as a "REST-style interface," and within 12 hours, I had received approximately 700 hits to the site! It didn't stop there; the next morning when I checked my email, I had a message from Paul Prescod with some suggestions for making it more RESTful and improving the general functionality of it as a service.

After exchanging a few emails directly with Prescod, plus receiving a few other suggestions and comments from people on the REST-discuss Yahoo! Group (which I quickly became a member of), I went ahead and made a major revision to XooMLe. This version introduced a number of changes:

  • Moved away from a single URI for all methods, introducing /search/, /cache/ and /spell/ so that there was a unique URI for each method.
  • Google's limit of 10 results per query was bypassed by making XooMLe loop through search requests, compiling the results, and sending them back in a single XML document.
  • Added a cachedVersion element to each result item, which contained a link to retrieve a cached copy of a document via XooMLe.
  • If related information was available via Google, an additional link was supplied that would retrieve those pages.
  • Added an XLink to the URL, relatedInformation and cachedVersion elements of each returned result, which could be used to automatically create a link via XML technologies.
  • Added the ability to specify an XSLT when performing a search, making it simple to use pure XML technologies to format the output in a human-readable form.

And thus a RESTful web service was born. XooMLe implements the full functionality of the Google API (and actually extends it in a few places), using a much simpler interface and output format. A XooMLe result set can be bookmarked, a spelling suggestion can be very easily obtained via a bookmarklet, results can be parsed in pretty much every coding language using simple, native functions, and cached pages are immediately usable upon retrieval.

XooMLe demonstrates that it was indeed quite feasible for Google to implement their API using the REST architecture, and provides a useful wrapper to the SOAP functionality they have chosen to expose. It is currently being used as an example of "REST done right" by a number of proponents of the model, including some Amazon/Google linked services being developed by one of the REST-discuss members.

On its own, XooMLe may not be particularly useful, but teamed with the imagination and coding prowess of the web services community, it will no doubt help create a new wave of toys, tools, and talking points.

How It Works

Basically, to use XooMLe you just need to "request" a web page, then do something with the result that gets sent back to you. Some people might call this a request-response architecture. Whatever you call it, you ask XooMLe for something, it asks Google for the same thing, then formats the results and hands them to you. From there on, you can do what you like with them. Long story short: everything you can ask the Google SOAP API, you can ask XooMLe.

Google Method: doGoogleSearch
Extra Features
  • maxResults: Also supports setting maxResults well above the Google-limit of 10, performing looped queries to gather the results and then sending them all back in one hit.
  • cachedVersion: Each result item will include an element called "cachedVersion" which is a URI to retrieve the cached version of that particular result.
  • xsl: You may specify a variable called "xsl" in addition to the others in the querystring. The value of this variable will be used as a reference to an external XSLT Stylesheet, and will be used to format the XML output of the document.
  • relatedInformation: If it is detected that there is related information available for a particular resultElement, then this element will contain a link to retrieve those related items from XooMLe.
  • xlink: There is now an xlink attribute added to the cachedVersion and relatedInformation elements of each result.
Google Method: doSpellingSuggestion
  • XooMLe URI: http://xoomle.dentedreality.com.au/spell/
  • Successful Response Format: Returns a text-only response containing the suggested correction for the phrase that you passed to Google (through XooMLe). You will get HTTP headers and stuff like that as well, but assuming you are accessing XooMLe over HTTP in the first place, the body of the response is just text.
  • Failure Response: An XML-based error message, including all arguments you sent to XooMLe. The message will change, depending on what went wrong.
Google Method: doGetCachedPage
  • XooMLe URI: http://xoomle.dentedreality.com.au/cache/
  • Successful Response Format: Returns the complete contents of the cached page requested, WITHOUT THE GOOGLE INFORMATION HEADER. The header that Google adds, which says it's a cached page, is stripped out BEFORE you are given the page, so don't expect it to be there. You should get nothing but the HTML required to render the page.
  • Failure Response: An XML-based error message, including all arguments you sent to XooMLe.
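Because /spell/ answers with plain text on success and an XML document on failure, a caller has to tell the two apart. The sketch below is one possible way to do that, under a loudly stated assumption: it treats any body that starts with "<" as an error document. The function name classify_spell_response and the sample error body are hypothetical; the real error format is not reproduced here. The same trick would not work for /cache/, where a successful response is HTML.

```perl
use strict;
use warnings;

# Classify the body of a /spell/ response.
# Assumption: an XML error document starts with '<' (after optional
# whitespace), while a successful response is a bare text suggestion.
sub classify_spell_response {
    my ($body) = @_;
    return { ok => 0, error => 'empty response' }
        unless defined $body && $body =~ /\S/;
    return { ok => 0, error => $body } if $body =~ /^\s*</;

    # Trim surrounding whitespace from the suggestion
    (my $suggestion = $body) =~ s/^\s+|\s+$//g;
    return { ok => 1, suggestion => $suggestion };
}

# The second body is a made-up stand-in for an error document
for my $body ("dented reality\n", '<error>made-up sample</error>') {
    my $r = classify_spell_response($body);
    print $r->{ok}
        ? "suggestion: $r->{suggestion}\n"
        : "error document received\n";
}
```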
Asking XooMLe Something (Forming Requests)

Asking XooMLe something is really easy; you can do it in a normal hyperlink, a bookmark, a Favorite, whatever. A request to XooMLe exists as a URL, which contains some special information. It looks something like this:

http://xoomle.dentedreality.com.au/search/?key=YourGoogleDeveloperKey&q=dented+reality

Enough generic examples! If you are talking to XooMLe, the address you need is:

http://xoomle.dentedreality.com.au/<method keyword>/

Your requests might look something like the previous example, or they might be fully fleshed out like the following:

http://xoomle.dentedreality.com.au/search/
?key=YourKey
&q=dented+reality
&maxResults=1
&start=0
&hl=en
&ie=ISO-8859-1
&filter=0
&restrict=countryAU
&safeSearch=1
&lr=en
&ie=latin1
&oe=latin1
&xsl=myxsl.xsl

Note that each option is on a different line so they're easier to read; properly formatted they would be in one long string.
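Joining the options into that one long string is easy to get wrong by hand, so here is a minimal Perl sketch of a URL builder. The helper names encode_value and xoomle_url are made up for this illustration and are not part of XooMLe; the encoder is a small hand-rolled one that assumes byte-oriented (Latin-1 or ASCII) values, standing in for what a module such as URI::Escape would normally do.

```perl
use strict;
use warnings;

my $XOOMLE_BASE = 'http://xoomle.dentedreality.com.au';

# Percent-encode a value for use in a query string (spaces become '+').
# Assumes single-byte characters.
sub encode_value {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9_.~ -])/sprintf '%%%02X', ord $1/ge;
    $value =~ tr/ /+/;
    return $value;
}

# Build a request URL from a method keyword (search, spell, cache) and
# an ordered list of name/value pairs.
sub xoomle_url {
    my ($method, @pairs) = @_;
    my @parts;
    while (my ($name, $value) = splice @pairs, 0, 2) {
        push @parts, $name . '=' . encode_value($value);
    }
    return "$XOOMLE_BASE/$method/?" . join '&', @parts;
}

print xoomle_url('search',
    key        => 'YourKey',
    q          => 'dented reality',
    maxResults => 1,
), "\n";
```

Passing the parameters as an ordered list rather than a hash keeps them in the order you wrote them, which makes the resulting URL easier to compare against the examples above.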

All the available parameters are defined in the Google documentation, but just to refresh your memory:

key is your Google Developer Key. Go get one if you don't have one already, and remember to URL-encode it when passing it in the query string.

Another thing you might like to know is that XooMLe makes use of some fancy looping to allow you to request more than the allowed 10 results in one request. If you ask XooMLe to get (for example) 300 results, it will perform multiple queries to Google and send back 300 results to you in XML format. Keep in mind that this still uses up your request limit (1,000 queries per day) in blocks of 10 though, so in this case, you'd drop 30 queries in one hit (and it would take a while to return that many results).
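The arithmetic behind that looping can be sketched as follows. This is an illustration of the batching idea only, not XooMLe's own code: the function name plan_queries is hypothetical, and the batch size of 10 is simply the per-query limit mentioned above.

```perl
use strict;
use warnings;

my $PER_QUERY = 10;   # Google's per-query result limit, as described above

# Return the list of [start, count] pairs needed to gather $wanted results
sub plan_queries {
    my ($wanted) = @_;
    my @plan;
    for (my $start = 0; $start < $wanted; $start += $PER_QUERY) {
        my $remaining = $wanted - $start;
        my $count = $remaining < $PER_QUERY ? $remaining : $PER_QUERY;
        push @plan, [ $start, $count ];
    }
    return @plan;
}

my @plan = plan_queries(300);
printf "%d results need %d queries\n", 300, scalar @plan;
```

Each pair in the plan corresponds to one query against your daily allowance, which is why a request for 300 results costs 30 queries.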

Error Messages

If you do something wrong, XooMLe will tell you in a nice little XML package. The errors all look something like this, but they have a different error message and contain that "arguments" array, which includes everything you asked it. Below are all the error messages that you can get, and why you will get them.

  • Google API key not supplied: You forgot to provide XooMLe with your Google API key. You need this so that XooMLe can communicate with Google on your behalf. Specify it like this: key=insert key here and get one from Google if you don't have one already.
  • Search string not specified: You were smart enough to specify that you wanted to do a search (using method=doGoogleSearch) but forgot to tell XooMLe what you were searching for. Fix this by adding something like q=Your+search+terms (your search phrase should be URL-encoded and is subject to the same limitations as the Google API).
  • Invalid Google API key supplied: There's something wrong with your Google API key (did you URL-encode it like I told you?).
  • Your search found no results: This one should be rather obvious.
  • Phrase not specified: If you ask for a spelling suggestion (using method=doSpellingSuggestion), you should also tell XooMLe what you are trying to correct, using phrase=stoopid+speling+here. (URL-encode it.)
  • No suggestion available: Hey, Google ain't perfect. Sometimes the attempts at spelling just don't even warrant a response (or possibly Google can't decipher your bad spelling).
  • URL not specified: You want a cached page from Google? The least you could do is ask for it using url=http://thepagehere.com.
  • Cached page not available: Something was wrong with the cached page that Google returned (or it couldn't find it in the database). Not all Google listings have cached pages available.
  • Couldn't contact Google server: There was a problem contacting the Google server, so your request could not be processed.
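Several of these errors are simply missing parameters, so they can be caught before a request is ever sent. The following Perl sketch shows one way to do that. The function name check_request is hypothetical, the pairing of required parameters with method keywords is inferred from the messages above, and the order in which XooMLe itself checks them is an assumption.

```perl
use strict;
use warnings;

# Required parameters per method keyword, paired with the error message
# the text lists for each omission. The check order is an assumption.
my %REQUIRED = (
    search => [ [ key => 'Google API key not supplied' ],
                [ q   => 'Search string not specified' ] ],
    spell  => [ [ key    => 'Google API key not supplied' ],
                [ phrase => 'Phrase not specified' ] ],
    cache  => [ [ key => 'Google API key not supplied' ],
                [ url => 'URL not specified' ] ],
);

# Return the first applicable error message, or undef if nothing is missing
sub check_request {
    my ($method, %params) = @_;
    for my $rule (@{ $REQUIRED{$method} || [] }) {
        my ($name, $message) = @$rule;
        return $message
            unless defined $params{$name} && length $params{$name};
    }
    return;
}

my $problem = check_request('spell', key => 'YourKey');
print defined $problem ? "$problem\n" : "request looks complete\n";
```

A check like this only covers the omissions; errors such as an invalid key or an unavailable cached page can still only be discovered from the response itself.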

    Putting XooMLe to Work: A SOAP::Lite Substitution Module


    XooMLe is not only a handy way to get Google results in XML, it's a handy way to replace the required SOAP::Lite module that a lot of ISPs don't support. XooMLe.pm is a little Perl module best saved into the same directory as your tips themselves.

    # XooMLe.pm
    # XooMLe is a drop-in replacement for SOAP::Lite designed to use
    # the plain old XML to Google SOAP bridge provided by the XooMLe
    # service.
    package XooMLe;

    use strict;
    use LWP::Simple;
    use XML::Simple;
    use URI::Escape;

    sub new {
      my $self = {};
      bless($self);
      return $self;
    }

    sub doGoogleSearch {
      my($self, %args);
      ($self, @args{qw/ key q start maxResults
        filter restrict safeSearch lr ie oe /}) = @_;

      my $xoomle_url = 'http://xoomle.dentedreality.com.au';
      my $xoomle_service = 'search';

      # Query Google via XooMLe, URL-encoding each value
      my $content = get(
        "$xoomle_url/$xoomle_service/?" .
        join '&', map { $_ . '=' . uri_escape($args{$_}) } keys %args
      );
      defined $content or die "Couldn't contact XooMLe\n";

      # Parse the XML
      my $results = XMLin($content);

      # Normalize
      $results->{GoogleSearchResult}->{resultElements} =
        $results->{GoogleSearchResult}->{resultElements}->{item};
      foreach (@{$results->{GoogleSearchResult}->{'resultElements'}}) {
        $_->{URL} = $_->{URL}->{content};
        ref $_->{snippet} eq 'HASH' and $_->{snippet} = '';
        ref $_->{title} eq 'HASH' and $_->{title} = '';
      }

      return $results->{GoogleSearchResult};
    }

    1;
    

    Using the XooMLe Module

    Here's a little script to show our home-brewed XooMLe module in action. It's no different, really, from any number of tips in this tutorial. The only minor alterations necessary to make use of XooMLe instead of SOAP::Lite are flagged by comments in the code.

    #!/usr/bin/perl
    # xoomle_google2csv.pl
    # Google Web Search Results via XooMLe 3rd party web service
    # exported to CSV suitable for import into Excel
    # Usage: xoomle_google2csv.pl "{query}" [> results.csv]

    # Your Google API developer's key
    my $google_key = 'insert key here';

    use strict;

    # Uses our home-brewed XooMLe Perl module
    # use SOAP::Lite
    use XooMLe;

    $ARGV[0] or die qq{usage: perl xoomle_google2csv.pl "{query}"\n};

    # Create a new XooMLe object rather than using SOAP::Lite
    # my $google_search = SOAP::Lite->service("file:$google_wdsl");
    my $google_search = new XooMLe;

    my $results = $google_search->doGoogleSearch(
      $google_key, shift @ARGV, 0, 10, "false", "",
      "false", "", "latin1", "latin1"
    );

    @{$results->{'resultElements'}} or warn 'No results';

    print qq{"title","url","snippet"\n};

    foreach (@{$results->{'resultElements'}}) {
      # double escape " marks
      $_->{title} =~ s!"!""!g;
      $_->{snippet} =~ s!"!""!g;
      my $output = qq{"$_->{title}","$_->{URL}","$_->{snippet}"\n};
      # drop all HTML tags
      $output =~ s!<.+?>!!g;
      print $output;
    }
    

    Running the Tip

    Run the script from the command line, providing a query and sending the output to a CSV file you wish to create or to which you wish to append additional results. For example, using "restful SOAP" as our query and results.csv as our output:

    $ perl xoomle_google2csv.pl "restful SOAP" > results.csv
    

    Leaving off the > and CSV filename sends the results to the screen for your perusal.

    Applicability

    In the same manner, you can adapt just about any SOAP::Lite-based tip in this tutorial and those you've made up yourself to use the XooMLe module.

    1. Place XooMLe.pm in the same directory as the tip at hand.
    2. Replace use SOAP::Lite; with use XooMLe;.
    3. Replace my $google_search = SOAP::Lite->service("file:$google_wdsl"); with my $google_search = new XooMLe;.

    In general, bear in mind that your mileage may vary and don't be afraid to tweak.

    See Also

    • PoXML [Tip #53], a plain old XML alternative to SOAP::Lite
    • NoXML [Tip #54], a regular expressions-based, XML Parser-free SOAP::Lite alternative

    - Beau Lebens and Rael Dornfest
