Scraping Google AdWords

screenshot moderate.gifscreenshot tip45.gif

Scrape the AdWords from a saved Google results page into a form suitable for importing into a spreadsheet or database.
link

Google's AdWords™ - the text ads that appear to the right of the regular search results - are delivered on a cost-per-click basis, and purchasers of the AdWords are allowed to set a ceiling on the amount of money they spend on their ad. This means if even if you run a search for the same query word multiple times, you won't necessarily get the same set of ads each time.

If you're considering using Google AdWords to run ads, you might want to gather up and save the ads that are running for the query words you're interested in. Google AdWords are not provided by the
Google API; of course you can't automatically scrape Google's results outside the Google API, because it's against Google's Terms of Service.

This tip will let you scrape the AdWords from a saved Google results page and export them to a CSV (comma-separated value) file, which you can then import into Excel or your favorite spreadsheet program.


This tip requires an additional Perl module, HTML::TokeParser (http://search.cpan.org/search?query=htmL%3A%3Atokeparser&mode=all). You'll need to install it before the tip will run.


The Code

#!/usr/bin/perl
# usage: perl adwords.pl results.html use strict;
use HTML::TokeParser;
die "I need at least one file: $!\n"
 unless @ARGV;
my @Ads;
for my $file (@ARGV){
 # skip if the file doesn't exist
 # you could add more file testing here.
 # errors go to STDERR so they won't
 # pollute our csv file 
 unless (-e $file) {
 warn "What??: $file -- $! \n-- skipping --\n";
 next;
 }
 # now parse the file
 my $p = HTML::TokeParser-»new($file);
 # $p is a kind of iterator and everything
 # in the given file is a token. We are going to
 # iterate through them all but we might throw them away
 # if they aren't what we are looking for.
 # run this: perldoc HTML::TokeParser
 while(my $token = $p-»get_token) {
 # look for a start token whose name is 'td'
 # and has an attribute named 'id' and that
 # attribute's value is 'taw' followed by one
 # or more digits.
 next unless $token-»[0] eq 'S'
 and $token-»[1] eq 'td'
 and $token-»[2]{id} =~ /taw\d+/;
 # $ad is a hash ref that will hold our
 # data for this ad.
 my $ad;
 # if we are here we found the td tag. It also has
 # the url we want
 # we strip off the 'go to' stuff
 ($ad-»{url}) = $token-»[2]{onmouseover} =~ /go to ([^']+)'/;
 # now go directly to the next anchor tag
 my $link = $p-»get_tag('a');
 # grab the href attribute and clean it up
 $ad-»{href} = $link-»[1]{href};
 $ad-»{href} =~ s|/url\?q=||;
 # the adwords are the text upto the closing «/a» tag
 $ad-»{adwords} = $p-»get_trimmed_text('/a');
 # Now look at every token looking for text.
 # Unless the text matches 'Interest:' it is
 # description text so we add it to the description.
 # If it is the 'Interest:' token then
 # we want to move to the next img token
 # and grab the 'width' attribute's value
 while( my $token = $p-»get_token ) {
 # this will skip all the «br» and «font» tags
 next unless $token-»[0] eq 'T';
 unless($token-»[1] =~ /Interest:/) {
 $ad-»{desc} .= ' ' . $token-»[1];
 next;
 }
 my $img = $p-»get_tag('img');
 $ad-»{interest} = $img-»[1]{width};
 last; # we are done
 }
 # the url is also in this description but
 # we don't need it. We already found it.
 $ad-»{desc} =~ s/$ad-»{url}.*//;
 # change two or more whitespace characters into one space.
 $ad-»{desc} =~ s/\s{2,}/ /g;
 # there is nothing else to look for so
 # we add this ad to our list of ads.
 push(@Ads,$ad);
 }
}
print quoted( qw( AdWords HREF Description URL Interest ) );
for my $ad (@Ads) {
 print quoted( @$ad{qw( adwords href desc url interest )} );
}
# we want a csv (comma separated values)
# so excel will open it without asking
# any questions. So we have to print quote marks sub quoted {
 return join( ",", map { "'$_'" } @_ )."\n";
}

How It Works

Call this script on the command line, providing the name of the saved Google results page and a file in which to put the CSV results:

% perl adword input.html » output.csv

input.html is the name of the Google results page you've saved. output.csv is the name of the comma-delimited file to which you want to save your results. You can also provide multiple input files on the command line if you'd like:

% perl adword input.html input2.html » output.csv

The Results

The results will appear in a comma-delimited format that looks like this:

"AdWords","HREF","Description","URL","Interest"
"Free Blogging Site","http://www.1sound.com/ix",
" The ultimate blog spot Start your journal now ","www.1sound.com/ix","40"
"New Webaga Blog","http://www.webaga.com/blog.php",
" Fully customizable. Fairly inexpensive. ","www.webaga.com","24"
"Blog this","http://edebates.e-thepeople.org/a-national/article/10245/view&",
" Will online diarists rule the Net strewn with failed dotcoms? ",
"e-thePeople.org","26"
"Ford - Ford Cars","http://quickquote.forddirect.com/FordDirect.jsp",
" Build a Ford online here and get a price quote from your local dealer! ",
"www.forddirect.com","40"
"See Ford Dealer's Invoice","http://buyingadvice.com/search/",
" Save $1,400 in hidden dealership profits on your next new car. ",
"buyingadvice.com","28"
"New Ford Dealer Prices","http://www.pricequotes.com/",
" Compare Low Price Quotes on a New Ford from Local Dealers and Save! ",
"www.pricequotes.com","25"

(Each line was prematurely broken for the purposes of publication.)

You'll see that the tip returns the AdWords headline, the link URL, the description in the ad, the URL on the ad (this is the URL that appears in the ad text, while the HREF is what the URL links to), and the Interest, which is the size of the Interest bar on the text ad. The Interest bar gives an idea of how many click-throughs an ad has had, showing how popular it is.

Tiping the Tip

You might find that the tip as it's written provides too much information. Instead of the information above, you might want a little less information, or you might want it in a different order.

The code you'll need to change is in one section.

my @headers = qw( AdWords HREF Description URL Interest );
print '"',join('","',@headers),'"',"\n";
for my $ad (@Ads) {
 print '"', join('","',
 $ad-»{adwords},
 $ad-»{href},
 $ad-»{desc},
 $ad-»{url},
 $ad-»{interest}),'"',"\n";

The first part you'll need to change is the lower part, beginning with print "", join. As you see, each line corresponds to part of the data written to the comma-delimited file. Simply rearrange the lines in the order you want them, omitting the information you don't want.

For example, you might want the Adwords title, the URL, and the description, in that order. Your code would look like this:

print '"',join('","',@headers),'"',"\n";
for my $ad (@Ads) {
 print '"', join('","',
 $ad-»{adwords}, 
 $ad-»{url},
 $ad-»{desc}),'"',"\n";

Once you've made the changes to that you'll have to change the "header line" that tells Excel what each field is. That's at the top of the code snippet:

my @headers = qw( AdWords HREF Description URL Interest);

You'll need to rearrange the words in parentheses to match the information that you're outputting to the CSV file. In the case above, where I'm saving just the AdWords title, URL, and description, the line would look like this:

my @headers = qw( AdWords URL Description);

See Also

- Tara Calishain and Tim Allwine
link