SafeSearch Certifying URLs


Feed URLs to Google's SafeSearch to determine whether or not they point at questionable content.

Only three things in life are certain: death, taxes, and accidentally visiting a once family-safe web site that now contains text and images that would make a horse blush.

As you probably know if you've ever put up a web site, domain names are registered for finite lengths of time. Sometimes registrations accidentally expire; sometimes businesses fold and allow the registrations to expire. Sometimes other companies take them over.

Other companies might just want the domain name, some companies want the traffic that the defunct site generated, and in a few cases, the new owners of the domain name try to hold it "hostage," offering to sell it back to the original owners for a great deal of money. (This doesn't work as well as it used to because of the dearth of Internet companies that actually have a great deal of money.)

When a site isn't what it once was, that's no big deal. When it's not what it once was and is now rated X, that's a bigger deal. When it's not what it once was, is now rated X, and is on the link list of a site you run, that's a really big deal.

But how to keep up with all the links? You can go visit every link periodically and see if it's still okay, or you can wait for the hysterical emails from site visitors, or you can just not worry about it. Or you can put the Google API to work.

This program lets you provide a list of URLs and checks each one with a Google SafeSearch query. If a URL appears in the SafeSearch results, it's probably okay. If it doesn't appear, it's either not in Google's index or not clean enough to pass Google's filter. The program then rechecks the URLs missing from the SafeSearch results with an unfiltered search. If a URL doesn't appear in the unfiltered search either, it's labeled "unindexed." If it does appear, it's labeled "suspect."

Danger Will Robinson

While Google's SafeSearch filter is good, it's not infallible. (I have yet to see an automated filtering system that is infallible.) So if you run a list of URLs through this tip and they all show up in a SafeSearch query, don't take that as a guarantee that they're all completely inoffensive. Take it merely as a pretty good indication that they are. If you want absolute assurance, you're going to have to visit every link personally and often.


Here's a fun idea if you need an Internet-related research project. Take 500 or so domain names at random and run this program on the list once a week for several months, saving the results to a file each time. It'd be interesting to see how many domains/URLs end up being filtered out of SafeSearch over time.
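If you want to try that experiment, here's a minimal sketch of a wrapper you might run weekly (from cron, say). The filenames `domains.txt` and `safesearch-log.csv` are placeholders; the idea is simply to prepend the run date to each CSV row so you can compare ratings across runs:

```perl
#!/usr/bin/perl
# weekly-log.pl -- hypothetical wrapper around suspect.pl.
# Runs the checker over domains.txt and appends each CSV row to
# safesearch-log.csv with the run date prepended, so you can watch
# how a URL's rating changes from week to week.
use strict;
use POSIX qw(strftime);

my $stamp = strftime("%Y-%m-%d", localtime);

open my $log, '>>', 'safesearch-log.csv'
  or die "can't append to safesearch-log.csv: $!";
open my $in, '-|', 'perl suspect.pl < domains.txt'
  or die "can't run suspect.pl: $!";

while (my $row = <$in>) {
  next if $row =~ /^"url",/;     # skip the CSV header on each run
  print $log qq{"$stamp",$row};  # e.g. "2003-05-01","example.com","safe"
}
```

Sort the accumulated log by URL and you can see at a glance which entries drifted from "safe" to "suspect" over the months.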


The Code

#!/usr/local/bin/perl
# suspect.pl
# Feed URLs to a Google SafeSearch. If inurl: returns results, the
# URL probably doesn't point at questionable content. If inurl:
# returns no results, either it points at questionable content or
# isn't in the Google index at all.

use strict;
use SOAP::Lite;

# Your Google API developer's key
my $google_key = 'put your key here';

# Location of the GoogleSearch WSDL file
my $google_wdsl = "./GoogleSearch.wsdl";

$|++; # turn on autoflush so results print immediately

my $google_search = SOAP::Lite->service("file:$google_wdsl");

# CSV header
print qq{"url","safe/suspect/unindexed"\n};

while (my $url = <>) {
  chomp $url;
  $url =~ s!^\w+?://!!;  # strip the scheme (http://, ftp://, etc.)
  $url =~ s!^www\.!!;    # strip a leading "www."

  # SafeSearch (the seventh argument, safeSearch, set to "true")
  my $results = $google_search->doGoogleSearch(
    $google_key, "inurl:$url", 0, 10, "false", "", "true",
    "", "latin1", "latin1"
  );

  print qq{"$url",};
  if (grep /\Q$url\E/, map { $_->{URL} } @{$results->{resultElements}}) {
    print qq{"safe"\n};
  }
  else {
    # Unfiltered search (safeSearch set to "false")
    my $results = $google_search->doGoogleSearch(
      $google_key, "inurl:$url", 0, 10, "false", "", "false",
      "", "latin1", "latin1"
    );
    # Suspect or unindexed?
    print(
      (scalar grep /\Q$url\E/, map { $_->{URL} } @{$results->{resultElements}})
        ? qq{"suspect"\n}
        : qq{"unindexed"\n}
    );
  }
}
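A note on the two substitutions near the top of the loop: they normalize whatever form of URL you supply down to a bare host-and-path, so the `inurl:` query and the match against Google's returned URLs agree. Here's that normalization in isolation, using one of the sample URLs from this tip:

```perl
#!/usr/bin/perl
# Demonstrates the URL normalization suspect.pl applies before
# querying Google: strip the scheme, then a leading "www."
use strict;

my $url = "http://www.oracle.com/catalog/essblogging/";
$url =~ s!^\w+?://!!;   # remove "http://", "https://", "ftp://", etc.
$url =~ s!^www\.!!;     # remove a leading "www."
print "$url\n";         # oracle.com/catalog/essblogging/
```

A URL with no scheme or "www." at all, such as hipporhinostricow.com, passes through both substitutions unchanged.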

Running the Tip

To run the tip, you'll need a text file that contains the URLs you want to check, one line per URL. For example:

http://www.oracle.com/catalog/essblogging/
http://www.xxxxxxxxxx.com/preview/home.htm
hipporhinostricow.com

The program runs from the command line. Enter the name of the script, a less-than sign (to redirect the file to the script's standard input), and the name of the text file that contains the URLs you want to check. The program returns results that look like this:

% perl suspect.pl < urls.txt
"url","safe/suspect/unindexed"
"oracle.com/catalog/essblogging/","safe"
"xxxxxxxxxx.com/preview/home.htm","suspect"
"hipporhinostricow.com","unindexed"

The first item is the URL being checked. The second is its probable safety rating, as follows:

safe
    The URL appeared in a Google SafeSearch for the URL.
suspect
    The URL did not appear in a Google SafeSearch, but did in an unfiltered search.
unindexed
    The URL appeared in neither a SafeSearch nor an unfiltered search.

You can redirect output from the script to a file for import into a spreadsheet or database:

% perl suspect.pl < urls.txt > urls.csv

Tipping the Tip

You can use this tip interactively, feeding it URLs one at a time. Invoke the script with perl suspect.pl, but don't feed it a text file of URLs to check. Enter a URL and press the return key. The script will reply in the same manner as it did when fed multiple URLs. This is handy when you just need to spot-check a couple of URLs from the command line. When you're ready to quit, send an end-of-file to break out of the script: Ctrl-D under Unix, or Ctrl-Z followed by Enter on a Windows command line.

Here's a transcript of an interactive session with suspect.pl:

% perl suspect.pl
"url","safe/suspect/unindexed"
http://www.oracle.com/catalog/essblogging/
"oracle.com/catalog/essblogging/","safe"
http://www.xxxxxxxxxx.com/preview/home.htm
"xxxxxxxxxx.com/preview/home.htm","suspect"
hipporhinostricow.com
"hipporhinostricow.com","unindexed"
^D
%