Eric.London's picture

Here's a simple PHP class I wrote to crawl a URL and return a list of internal and external URLs. I've used it in the past, for development purposes only, to find 404s and repetition in URL structure. Note that it does not read robots.txt files or obey any similar rules. Just thought I'd pull it out of the archives and share it on the web.

#!/usr/bin/php

<?php
class Crawl {

  protected $regex_link;
  protected $website_url;
  protected $website_url_base;
  protected $urls_processed;
  protected $urls_external;
  protected $urls_not_processed;
  protected $urls_ignored;

  public function __construct($website_url = NULL) {

    // NOTE: the original enabled the "track_errors" ini setting here; it was
    // deprecated in PHP 7.2 and removed in PHP 8, so it is no longer set.

    // setup variables
    $this->regex_link = "/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU";
    $this->urls_processed = array();
    $this->urls_external = array();
    $this->urls_not_processed = array();
    $this->urls_ignored = array(
      '/search/apachesolr_search/',
      '/comment/reply/',
    );

    // validate argument(s)
    $result = $this->validate_arg_website_url($website_url);

    // error check (a constructor cannot return a value, so just bail out)
    if (!$result) {
      return;
    }

    // set website argument
    $this->website_url = $website_url;

    // get url base
    $url_base = $this->get_url_base($this->website_url);

    // error check
    if (!$url_base) {
      return;
    }

    // set website url base
    $this->website_url_base = $url_base;

    // add url to list of urls to process
    $this->urls_not_processed[] = $this->website_url;

    // keep crawling until the queue is empty
    while (count($this->urls_not_processed)) {
      $this->process_urls_not_processed();
    }

    // sort data
    sort($this->urls_processed);
    sort($this->urls_external);

  }
 
  protected function validate_arg_website_url($website_url = NULL) {

    // validate argument
    if (!(is_string($website_url) && (substr($website_url,0,7)=='http://' || substr($website_url,0,8)=='https://'))) {
      return FALSE;
    }

    return TRUE;

  }
 
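  protected function get_url_base($url = NULL) {

    // validate url
    if (!$url || !strlen($url)) {
      return FALSE;
    }

    $url_parts = parse_url($url);

    // validate; parse_url() can return FALSE or omit the host entirely
    if (!is_array($url_parts) || empty($url_parts['host'])) {
      return FALSE;
    }

    // explode host on '.'
    $exploded = explode('.', $url_parts['host']);

    // hosts without a dot (e.g. "localhost") are returned as-is
    if (count($exploded) < 2) {
      return $url_parts['host'];
    }

    // return host and domain extension
    $url_base = $exploded[count($exploded)-2] . '.' . $exploded[count($exploded)-1];

    return $url_base;

  }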

  protected function scan_url($url) {

    // validate url
    if (!is_string($url) || !$url || !strlen($url)) {
      return FALSE;
    }

    // ensure url has not already been processed
    if (in_array($url, $this->urls_processed)) {
      return FALSE;
    }

    // add url to processed list
    $this->urls_processed[] = $url;

    // load page contents; "@" silences the warning raised on failure
    // ($php_errormsg is no longer available in PHP 8, so the return
    // value is checked instead)
    $page_contents = @file_get_contents($url);

    // check for an error when loading the url, or an empty page
    if ($page_contents === false || !strlen($page_contents)) {
      return FALSE;
    }

    // execute regex
    preg_match_all($this->regex_link, $page_contents, $matches);

    if (is_array($matches) && isset($matches[1])) {
      return array_unique($matches[1]);
    }

    return FALSE;

  }
 
  protected function process_matches($matches = NULL) {

    // validate
    if (!$matches || !is_array($matches) || empty($matches)) {
      return FALSE;
    }

    foreach ($matches as $match) {

      // ensure match exists
      if (empty($match)) {
        continue;
      }
      // ignore anchors
      elseif (substr($match,0,1)=='#') {
        continue;
      }
      // ignore javascript
      elseif (substr($match,0,11)=='javascript:') {
        continue;
      }
      // ignore mailto
      elseif (substr($match,0,7)=='mailto:') {
        continue;
      }

      // check for internal urls that begin with '/'
      if (substr($match,0,1)=='/') {
        $match = 'http://' . $this->website_url_base . $match;
      }

      // remove trailing slash
      if (substr($match, -1)=='/') {
        $match = substr($match, 0, -1);
      }

      // ensure href starts with http or https
      // NOTE: this needs work, URL could begin with relative paths like '../', ftp://, etc.
      if (!(substr($match,0,7)=='http://' || substr($match,0,8)=='https://')) {
        $match = 'http://' . $this->website_url_base . '/' . $match;
      }

      // check if url is to be ignored
      foreach ($this->urls_ignored as $ignored) {
        if (stripos($match, $ignored) !== FALSE) {
          continue 2;
        }
      }

      // get url base
      $url_base = $this->get_url_base($match);

      // check for external url
      if ($url_base != $this->website_url_base) {
        if (!in_array($match, $this->urls_external)) {
          $this->urls_external[] = $match;
        }
        continue;
      }

      // check if url has already been processed
      if (in_array($match, $this->urls_processed)) {
        continue;
      }

      // add url to list of urls to process
      if (!in_array($match, $this->urls_not_processed)) {
        $this->urls_not_processed[] = $match;
      }

    } // end: foreach

    return TRUE;

  }
 
  protected function process_urls_not_processed() {

    if (empty($this->urls_not_processed)) {
      return FALSE;
    }

    // get unprocessed url
    $url = array_shift($this->urls_not_processed);

    // scan url
    $matches = $this->scan_url($url);

    // error check
    if (!$matches || !is_array($matches) || empty($matches)) {
      return FALSE;
    }

    $this->process_matches($matches);

  }
 
  public function output_all_urls() {

    echo "===== INTERNAL URLS =====\n";
    foreach ($this->urls_processed as $url) {
      print $url . "\n";
    }

    echo "===== EXTERNAL URLS =====\n";
    foreach ($this->urls_external as $url) {
      print $url . "\n";
    }

  }

}
?>
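
The NOTE inside process_matches() points out that relative hrefs such as '../about' are not handled properly. One possible way to resolve them is sketched below. The function name resolve_href() and its parameters are hypothetical and not part of the class above; it assumes the caller knows the URL of the page the link was found on, and it ignores edge cases such as '..' appearing inside a query string. Like the class, it drops trailing slashes.

```php
<?php
// Hypothetical helper (not part of the Crawl class): resolves an href
// against the URL of the page it was found on.
function resolve_href($page_url, $href) {

  // absolute URLs (http://, https://, ftp://, ...) are returned unchanged
  if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
    return $href;
  }

  $parts = parse_url($page_url);
  if (!is_array($parts) || empty($parts['scheme']) || empty($parts['host'])) {
    return FALSE;
  }

  // protocol-relative hrefs ("//host/path") only borrow the page's scheme
  if (substr($href, 0, 2) === '//') {
    return $parts['scheme'] . ':' . $href;
  }

  $root = $parts['scheme'] . '://' . $parts['host'];
  if (isset($parts['port'])) {
    $root .= ':' . $parts['port'];
  }

  if (substr($href, 0, 1) === '/') {
    // root-relative: start from the site root
    $path = $href;
  }
  else {
    // document-relative: start from the directory of the current page
    $page_path = isset($parts['path']) ? $parts['path'] : '/';
    $path = substr($page_path, 0, strrpos($page_path, '/') + 1) . $href;
  }

  // collapse "." and ".." segments
  $resolved = array();
  foreach (explode('/', $path) as $segment) {
    if ($segment === '..') {
      array_pop($resolved);
    }
    elseif ($segment !== '.' && $segment !== '') {
      $resolved[] = $segment;
    }
  }

  return $root . '/' . implode('/', $resolved);
}
```

Because it keeps the scheme and full host of the page being scanned, a helper along these lines would also avoid hard-coding 'http://' when rebuilding internal links.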

The class can be used like this:

<?php
$website_url = 'http://www.example.com';
$crawl = new Crawl($website_url);
$crawl->output_all_urls();
?>
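
The output lists every URL the crawler visited, but not the HTTP status each one returned. For hunting 404s, a follow-up check along these lines may help. This is only a sketch: final_status_code() and is_404() are hypothetical names, and it assumes PHP's get_headers() is allowed to make outbound requests on your machine.

```php
<?php
// Hypothetical helper: extracts the final HTTP status code from a list of
// raw response header lines, such as the array returned by get_headers().
function final_status_code(array $headers) {
  $status = NULL;
  foreach ($headers as $line) {
    // status lines look like "HTTP/1.1 404 Not Found"; when redirects are
    // followed there is one per hop, so the last one wins
    if (is_string($line) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
      $status = (int) $m[1];
    }
  }
  return $status;
}

// Hypothetical helper: returns TRUE when the URL answers with a 404.
function is_404($url) {
  // "@" silences the warning raised when the request itself fails
  $headers = @get_headers($url);
  return is_array($headers) && final_status_code($headers) === 404;
}
```

Calling is_404() on each line of the internal URL list would flag the broken pages; a URL that cannot be reached at all is reported as not-a-404 here, which may or may not be what you want.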

In the future, I will try to elaborate more on how to improve SEO in Drupal. But for right now, here are some notes on what I have done with this site.

1. Install the Google Analytics module (http://drupal.org/project/google_analytics). You'll need to create a Google account if you have not already done so. This will monitor your visitors and web traffic.

2. Configure your URLs
- Enable clean URLs. This uses Apache mod_rewrite to present Drupal's query strings (?q=node/1) as a virtual directory structure (node/1).
- Enable the path module so you can rename URLs to whatever you like. Instead of node/#, you can make them more descriptive.
- Install the Pathauto module (http://drupal.org/project/pathauto). This module can be configured to automatically create a URL path alias based on taxonomy, node title, menu structure, etc. I find it useful to configure path aliases based on menu structure and node titles. For example, here is a sample menu structure and the aliases I would use for its two nodes:

Home
>> My Hobbies
   >> Photography
      >> node/67
      >> node/68

My-Hobbies/Photography/Hiking
My-Hobbies/Photography/Ralphie-the-Cat
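
To make the aliasing rule concrete, here is a rough sketch of how a menu trail and a node title could be combined into an alias like the ones above. It is not Pathauto's actual code or API; alias_segment() and build_alias() are hypothetical helpers, and the cleanup rule (anything other than letters and digits becomes a hyphen) is an assumption.

```php
<?php
// Hypothetical helper, not Pathauto's API: turns a title into a URL segment.
function alias_segment($text) {
  // replace every run of characters other than letters and digits with a hyphen
  $segment = preg_replace('/[^A-Za-z0-9]+/', '-', $text);
  // drop hyphens left over at either end
  return trim($segment, '-');
}

// Hypothetical helper: joins the menu trail and the node title into one alias.
function build_alias(array $menu_trail, $node_title) {
  $segments = array_map('alias_segment', $menu_trail);
  $segments[] = alias_segment($node_title);
  return implode('/', $segments);
}

// prints "My-Hobbies/Photography/Ralphie-the-Cat"
echo build_alias(array('My Hobbies', 'Photography'), 'Ralphie the Cat'), "\n";
```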

3. Install the XML Sitemap module (http://drupal.org/project/xmlsitemap). This module allows you to generate an XML sitemap that can be submitted to search engines automatically. You can see mine here: (http://ericlondon.com/sitemap.xml). I set my site to submit the sitemap to each available search engine. I also recommend signing up for the Google Webmaster Tools (http://www.google.com/webmasters/tools). This will allow you to monitor and configure the way Google analyzes your XML sitemap.
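
For reference, the file such a module produces follows the sitemaps.org format. The sketch below builds a minimal version by hand from a list of URLs, for example the internal URLs printed by a crawler. build_sitemap() is a hypothetical helper and not how the XML Sitemap module works internally; it only emits the required loc element for each URL.

```php
<?php
// Hypothetical helper: builds a minimal sitemap document from a list of URLs.
function build_sitemap(array $urls) {
  $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
  $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
  foreach ($urls as $url) {
    // escape characters such as "&" that are not allowed raw in XML
    $loc = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    $xml .= '  <url><loc>' . $loc . '</loc></url>' . "\n";
  }
  $xml .= '</urlset>' . "\n";
  return $xml;
}

echo build_sitemap(array(
  'http://www.example.com',
  'http://www.example.com/My-Hobbies/Photography/Hiking',
));
```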

4. Install the Global Redirect module (http://drupal.org/project/globalredirect). This module checks whether a path alias exists and redirects the user as necessary. For instance, if a user went to node/#, this module would redirect them to the more search engine friendly URL alias.
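
The idea behind that redirect can be sketched in a few lines. This is an illustration only, not Global Redirect's code: redirect_target() is a hypothetical helper, the alias map is made-up sample data, and a permanent (301) redirect is assumed.

```php
<?php
// Hypothetical helper, not Global Redirect's code: decides where a request
// should be redirected, given a map of internal paths to aliases.
function redirect_target($requested_path, array $aliases) {
  // normalise: drop surrounding slashes so "/node/67/" matches "node/67"
  $path = trim($requested_path, '/');
  if (isset($aliases[$path]) && $aliases[$path] !== $path) {
    return $aliases[$path];
  }
  // no alias known for this path: no redirect needed
  return NULL;
}

// sample data for illustration only
$aliases = array('node/67' => 'My-Hobbies/Photography/Hiking');

$target = redirect_target('/node/67/', $aliases);
if ($target !== NULL) {
  // a real handler would send: header('Location: /' . $target, TRUE, 301);
  echo "301 -> /" . $target . "\n";
}
```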

5. Use the taxonomy module. It's a great way to categorize your content. For this site, I use free tagging, so I do not have to maintain a definitive list of terms. It lets me type in comma-separated lists of terms that relate to my content. This also allows your users to click on a taxonomy term and view other content that has been tagged with the same term.
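
As a rough illustration of what free tagging involves, the sketch below splits a comma-separated string into a clean list of terms. parse_tags() is a hypothetical helper, not Drupal's taxonomy code; in particular it assumes terms never contain commas and treats terms that differ only in case as duplicates.

```php
<?php
// Hypothetical helper: splits a free-tagging string into a clean term list.
function parse_tags($input) {
  $terms = array();
  foreach (explode(',', $input) as $term) {
    $term = trim($term);
    // skip empty entries and case-insensitive duplicates (first spelling wins)
    $key = strtolower($term);
    if ($term !== '' && !isset($terms[$key])) {
      $terms[$key] = $term;
    }
  }
  return array_values($terms);
}

// prints "Drupal | SEO | PHP"
echo implode(' | ', parse_tags('Drupal, SEO, drupal, , PHP')), "\n";
```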

Additional Notes
- If you are using automatic path aliases via pathauto, be careful when editing your nodes; your aliases may be updated, which can affect your menu structure and page links.
- I think it's important to setup pathauto immediately after installing Drupal. That way, when you're ready to start adding content to your site, you'll already be following a standard naming convention.
- I had some difficulty getting XML Sitemap and Pathauto to work together. At first, not all of my pages and taxonomy terms were showing up in the sitemap. I found a module called Module Weight (http://drupal.org/project/moduleweight) which helped alleviate some of my headache.
