Scraping the Google Phonebook

Create a comma-delimited file from a list of phone numbers returned by Google.

Just because Google's API doesn't support the phonebook: [Tip #17] syntax doesn't mean that you can't make use of Google phonebook data.

This simple Perl script takes a page of Google phonebook: results and produces a comma-delimited text file suitable for import into Excel or your average database application. The script doesn't use the Google API, though, because the API doesn't yet support phonebook lookups. Instead, you'll need to run the search in your trusty web browser and save the results to your computer's hard drive as an HTML file. Point the script at the HTML file and it'll do its thing.

Which results should you save? You have two choices depending on which syntax you're using:

  • If you're using the phonebook: syntax, save the second page of results, reached by clicking the "More business listings..." or "More residential listings..." links on the initial results page.
  • If you're using the bphonebook: or rphonebook: syntax, simply save the first page of results. If your search returns several pages of results, save each page and run the program once per page.

Because this program is so simple, you might be tempted to plug this code into a program that uses LWP::Simple to automatically grab result pages from Google, automating the entire process. You should know that accessing Google with automated queries outside of the Google API is against their Terms of Service.

The Code

#!/usr/bin/perl
# phonebook2csv
# Google Phonebook results in CSV suitable for import into Excel
# Usage: perl phonebook2csv.pl < results.html > results.csv
# CSV header
print qq{"name","phone number","address"\n};
# Read all input and split it into listings at each horizontal rule
my @listings = split /<hr size=1>/, join '', <>;
# Skip the first and last chunks (page header and footer)
foreach (@listings[1..($#listings-1)]) {
 s!\n!!g; # drop spurious newlines
 s!<.+?>!!g; # drop all HTML tags
 s!"!""!g; # double escape " marks
 print '"' . join('","', (split /\s+-\s+/)[0..2]) . "\"\n";
}
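
To see what the loop does to a single listing, here is a minimal sketch that wraps the same three substitutions and the split in a subroutine and runs it on one fragment. The subroutine name listing_to_csv and the sample markup are invented for illustration; the HTML Google actually returns may well look different.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn one listing fragment (the text between two <hr size=1> separators)
# into a single CSV line, using the same substitutions as the script.
sub listing_to_csv {
    my ($listing) = @_;
    $listing =~ s!\n!!g;     # drop newlines first, so a tag broken across lines still matches
    $listing =~ s!<.+?>!!g;  # drop all HTML tags
    $listing =~ s!"!""!g;    # double any embedded quote marks, CSV-style
    # Split on " - " and keep the first three fields: name, phone, address.
    # The hyphen inside the phone number has no spaces around it, so it survives.
    return '"' . join('","', (split /\s+-\s+/, $listing)[0..2]) . '"';
}

# Hypothetical markup, made up for this example only.
my $sample = qq{<font size=-1>John Doe</font> - (555) 555-5555 -\n 1 Doe Street, Doe, OR 99999};
print listing_to_csv($sample), "\n";
```

A listing with fewer than three " - " separated fields would leave some fields undefined, so expect warnings (and empty columns) if a saved page doesn't follow the pattern the script assumes.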

Running the Tip

Run the script from the command line, specifying the phonebook results HTML filename and the name of the CSV file you wish to create or to which you wish to append additional results. For example, using results.html as our input and results.csv as our output:

$ perl phonebook2csv.pl < results.html > results.csv

Leaving off the > and CSV filename sends the results to the screen for your perusal:

$ perl phonebook2csv.pl < results.html
"name","phone number","address"
"John Doe","(555) 555-5555","Wandering, TX 98765"
"Jane Doe","(555) 555-5555","Horsing Around, MT 90909"
"John and Jane Doe","(555) 555-5555","Somewhere, CA 92929"
"John Q. Doe","(555) 555-5555","Freezing, NE 91919"
"Jane J. Doe","(555) 555-5555","1 Sunnyside Street, Tanning, FL 90210"
"John Doe, Jr.","(555) 555-5555","Beverly Hills, CA 90210"
"John Doe","(555) 555-5555","1 Lost St., Yonkers, NY 91234"
"John Doe","(555) 555-5555","1 Doe Street, Doe, OR 99999"
"John Doe","(555) 555-5555","Beverly Hills, CA 90210"

Using a double >> before the CSV filename appends the current set of results to the CSV file, creating it if it doesn't already exist. This is useful for combining more than one set of results, represented by more than one saved results page:

$ perl phonebook2csv.pl < results_1.html > results.csv
$ perl phonebook2csv.pl < results_2.html >> results.csv
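
Because the script prints the header row on every run, a CSV file built up by appending will contain that row once per saved page. If that gets in the way of your import, a small filter along these lines can keep the first header and drop the repeats. This is only a sketch: the subroutine name drop_repeated_headers is made up, and it assumes each line ends in a plain newline exactly as the script writes it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep the first CSV header line and drop any later repeats, which show up
# when several runs of the script are appended to the same file.
sub drop_repeated_headers {
    my @lines  = @_;
    my $header = qq{"name","phone number","address"\n};
    my $seen   = 0;
    # Data lines always pass; a header line passes only the first time.
    return grep { $_ ne $header || !$seen++ } @lines;
}

# Two appended result sets, each starting with its own header row.
my @combined = (
    qq{"name","phone number","address"\n},
    qq{"John Doe","(555) 555-5555","Wandering, TX 98765"\n},
    qq{"name","phone number","address"\n},
    qq{"Jane Doe","(555) 555-5555","Horsing Around, MT 90909"\n},
);
print drop_repeated_headers(@combined);
```

To run it over a real file, you could replace the sample list with the lines read from standard input and redirect the output to a new CSV file.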