Eric.London's picture

Here's a simple PHP class I wrote to crawl a URL and return a list of internal and external URLs. I've used it in the past, for development purposes only, to find 404s and repetition in URL structure. Note that it does not read robots.txt files or obey any similar rules. I just thought I'd pull it out of the archives and share it on the web.

#!/usr/bin/php
<?php
class Crawl {

  protected $regex_link;
  protected $website_url;
  protected $website_url_base;
  protected $urls_processed;
  protected $urls_external;
  protected $urls_not_processed;
  protected $urls_ignored;

  public function __construct($website_url = NULL) {

    // enable error tracking, grr.
    ini_set('track_errors', true);

    // setup variables
    $this->regex_link = "/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU";
    $this->urls_processed = array();
    $this->urls_external = array();
    $this->urls_not_processed = array();
    $this->urls_ignored = array(
      '/search/apachesolr_search/',
      '/comment/reply/',
    );

    // validate argument(s)
    $result = $this->validate_arg_website_url($website_url);

    // error check (a constructor's return value is ignored, so just bail out)
    if (!$result) {
      return;
    }

    // set website argument
    $this->website_url = $website_url;

    // get url base
    $url_base = $this->get_url_base($this->website_url);

    // error check
    if (!$url_base) {
      return;
    }

    // set website url base
    $this->website_url_base = $url_base;

    // add url to list of urls to process
    $this->urls_not_processed[] = $this->website_url;

    while (count($this->urls_not_processed)) {
      $this->process_urls_not_processed();
    }

    // sort data
    sort($this->urls_processed);
    sort($this->urls_external);

  }
 
  protected function validate_arg_website_url($website_url = NULL) {

    // validate argument
    if (!(is_string($website_url) && (substr($website_url,0,7)=='http://' || substr($website_url,0,8)=='https://'))) {
      return FALSE;
    }

    return TRUE;

  }
 
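  protected function get_url_base($url = NULL) {

    // validate url
    if (!$url || !strlen($url)) {
      return FALSE;
    }

    $url_parts = parse_url($url);

    // validate; parse_url() returns no 'host' key for relative urls
    if (!is_array($url_parts) || !isset($url_parts['host'])) {
      return FALSE;
    }

    // explode host on '.'
    $exploded = explode('.', $url_parts['host']);

    // single-label hosts (e.g. localhost) have no domain extension
    if (count($exploded) < 2) {
      return $url_parts['host'];
    }

    // return host and domain extension
    // NOTE: for multi-part extensions like example.co.uk this yields 'co.uk'
    $url_base = $exploded[count($exploded)-2] . '.' . $exploded[count($exploded)-1];

    return $url_base;

  }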

  protected function scan_url($url) {

    // validate url
    if (!is_string($url) || !$url || !strlen($url)) {
      return FALSE;
    }

    // ensure url has not already been processed
    if (in_array($url, $this->urls_processed)) {
      return FALSE;
    }

    // add url to processed list
    $this->urls_processed[] = $url;

    // remove any previously saved errors
    // NOTE: $php_errormsg depends on track_errors and was removed in PHP 8;
    // there, the === false check below is what catches a failed load
    unset($php_errormsg);

    // load page contents
    $page_contents = file_get_contents($url);

    // check for error when loading url; text starting with "file_get_contents"
    $error_text = 'file_get_contents';
    if (isset($php_errormsg) && substr($php_errormsg,0,strlen($error_text))==$error_text) {
      return FALSE;
    }
    // check for additional errors
    elseif ($page_contents === false || !strlen($page_contents)) {
      return FALSE;
    }

    // execute regex
    preg_match_all($this->regex_link, $page_contents, $matches);

    if (is_array($matches) && isset($matches[1])) {
      return array_unique($matches[1]);
    }

    return FALSE;

  }
 
  protected function process_matches($matches = NULL) {

    // validate
    if (!$matches || !is_array($matches) || empty($matches)) {
      return FALSE;
    }

    foreach ($matches as $match) {

      // ensure match exists
      if (empty($match)) {
        continue;
      }
      // ignore anchors
      elseif (substr($match,0,1)=='#') {
        continue;
      }
      // ignore javascript
      elseif (substr($match,0,11)=='javascript:') {
        continue;
      }
      // ignore mailto
      elseif (substr($match,0,7)=='mailto:') {
        continue;
      }

      // check for internal urls that begin with '/'
      if (substr($match,0,1)=='/') {
        $match = 'http://' . $this->website_url_base . $match;
      }

      // remove trailing slash
      if (substr($match, -1)=='/') {
        $match = substr($match, 0, -1);
      }

      // ensure href starts with http or https
      // NOTE: this needs work, URL could begin with relative paths like '../', ftp://, etc.
      if (!(substr($match,0,7)=='http://' || substr($match,0,8)=='https://')) {
        $match = 'http://' . $this->website_url_base . '/' . $match;
      }

      // check if url is to be ignored
      foreach ($this->urls_ignored as $ignored) {
        if (stripos($match, $ignored) !== FALSE) {
          continue 2;
        }
      }

      // get url base
      $url_base = $this->get_url_base($match);

      // check for external url
      if ($url_base != $this->website_url_base) {
        if (!in_array($match, $this->urls_external)) {
          $this->urls_external[] = $match;
        }
        continue;
      }

      // check if url has already been processed
      if (in_array($match, $this->urls_processed)) {
        continue;
      }

      // add url to list of urls to process
      if (!in_array($match, $this->urls_not_processed)) {
        $this->urls_not_processed[] = $match;
      }

    // end: foreach
    }

    return TRUE;

  }
 
  protected function process_urls_not_processed() {

    if (empty($this->urls_not_processed)) {
      return FALSE;
    }

    // get unprocessed url
    $url = array_shift($this->urls_not_processed);

    // scan url
    $matches = $this->scan_url($url);

    // error check
    if (!$matches || !is_array($matches) || empty($matches)) {
      return FALSE;
    }

    $this->process_matches($matches);

  }
 
  public function output_all_urls() {

    echo "===== INTERNAL URLS =====\n";
    foreach ($this->urls_processed as $url) {
      print $url . "\n";
    }

    echo "===== EXTERNAL URLS =====\n";
    foreach ($this->urls_external as $url) {
      print $url . "\n";
    }

  }

}
?>

It can be used like this:

<?php
$website_url = 'http://www.example.com';
$crawl = new Crawl($website_url);
$crawl->output_all_urls();
?>
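
The NOTE inside process_matches() admits that relative hrefs such as '../about' are not handled properly. Here is a minimal sketch of how such links could be resolved against the page they were found on. The function name resolve_href() is hypothetical and not part of the class above, and it only handles http and https targets.

```php
<?php
// Hypothetical helper (not part of the original Crawl class): resolve an
// href against the URL of the page it was found on.
// Returns an absolute URL without a trailing slash, or FALSE if the href
// uses an unsupported scheme or the page URL cannot be parsed.
function resolve_href($page_url, $href) {

  // absolute http(s) urls pass through unchanged
  if (preg_match('#^https?://#i', $href)) {
    return $href;
  }

  $parts = parse_url($page_url);
  if (!is_array($parts) || !isset($parts['scheme'], $parts['host'])) {
    return FALSE;
  }

  // protocol-relative urls inherit the scheme of the page
  if (substr($href, 0, 2) === '//') {
    return $parts['scheme'] . ':' . $href;
  }

  // any other scheme (ftp:, mailto:, javascript:) is not handled here
  if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
    return FALSE;
  }

  $root = $parts['scheme'] . '://' . $parts['host'];
  if (isset($parts['port'])) {
    $root .= ':' . $parts['port'];
  }

  if (substr($href, 0, 1) === '/') {
    // root-relative path
    $path = $href;
  }
  else {
    // path relative to the directory of the current page
    $base_path = isset($parts['path']) ? $parts['path'] : '/';
    $path = substr($base_path, 0, strrpos($base_path, '/') + 1) . $href;
  }

  // collapse '.' and '..' segments
  $segments = array();
  foreach (explode('/', $path) as $segment) {
    if ($segment === '..') {
      array_pop($segments);
    }
    elseif ($segment !== '.' && $segment !== '') {
      $segments[] = $segment;
    }
  }

  return $root . '/' . implode('/', $segments);
}
```

Unlike the class, this sketch keeps the full host and scheme of the page rather than rebuilding the URL from 'http://' and the two-part base domain, so a link found on https://www.example.com stays on that host.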


When CentOS came out with the php53* packages, I promptly upgraded to test them out. I did not get around to installing PECL and memcache until recently, and soon realized they were no longer available. This article shows how I installed PECL and memcache on a CentOS 5.6 system using the IUS repository.

Since I am running the php53* packages, the provided php-pecl-memcache package is not compatible. Checking what PECL packages are available:

$ yum list | grep -i ^php.*pecl
php-pecl-Fileinfo.x86_64                 1.0.4-3.el5.centos     extras         
php-pecl-fileinfo.x86_64                 1.0.4-2.el5.rf         rpmforge       
php-pecl-http.x86_64                     1.6.5-2.el5.rf         rpmforge       
php-pecl-mailparse.x86_64                2.1.5-2.el5.rf         rpmforge       
php-pecl-memcache.x86_64                 2.2.5-2.el5.rf         rpmforge       
php-pecl-session_mysql.x86_64            1.9-2.el5.rf           rpmforge       
php-pecl-ssh2.x86_64                     0.11.0-1.el5.rf        rpmforge       
php-pecl-zip.x86_64                      1.8.10-2.el5.rf        rpmforge       

I decided to try out the IUS repository, which provides a new set of php53 packages along with pecl and memcache.

# downloading packages
$ wget http://dl.iuscommunity.org/pub/ius/stable/Redhat/5.5/x86_64/ius-release-...
$ wget http://dl.iuscommunity.org/pub/ius/stable/Redhat/5.5/x86_64/epel-release...

# installing packages
$ rpm -Uvh ius-release-1.0-6.ius.el5.noarch.rpm
$ rpm -Uvh epel-release-1-1.ius.el5.noarch.rpm

The IUS packages are also not compatible with the installed php53* packages, so I removed them and installed the new php53u* packages.

# checking which are currently installed
$ yum list | grep -i ^php.*installed
php53.x86_64                            5.3.3-1.el5_6.1         installed      
php53-cli.x86_64                        5.3.3-1.el5_6.1         installed      
php53-common.x86_64                     5.3.3-1.el5_6.1         installed      
php53-devel.x86_64                      5.3.3-1.el5_6.1         installed      
php53-gd.x86_64                         5.3.3-1.el5_6.1         installed      
php53-mbstring.x86_64                   5.3.3-1.el5_6.1         installed      
php53-mysql.x86_64                      5.3.3-1.el5_6.1         installed      
php53-pdo.x86_64                        5.3.3-1.el5_6.1         installed

# removing existing:
$ yum remove php53*

# installing IUS packages:
$ yum install php53u php53u-cli php53u-common php53u-devel php53u-gd php53u-mbstring php53u-mysql php53u-pdo php53u-pear php53u-pecl-apc php53u-xml php53u-xmlrpc php53u-pecl-memcache

Next, I installed the memcached service.

# install
$ yum install memcached

# set run levels
$ chkconfig --level 2345 memcached on

# start service
$ /etc/init.d/memcached start

After installation, I restarted Apache and checked to ensure memcache was working.

$ php -i | grep -i memcache\ support
memcache support => enabled

I created a tiny script to test for memcache support:

<?php
$memcache = new Memcache;
$memcache->connect('127.0.0.1', 11211);
print_r($memcache);
?>

$ php memcachetest.php
Memcache Object
(
    [connection] => Resource id #4
)
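
The connection test above only proves that the extension can reach the server. A slightly stronger check is to store a value and read it back. The sketch below assumes the same 127.0.0.1:11211 server; cache_roundtrip() and ArrayCache are hypothetical names I made up for illustration, and ArrayCache is just an in-memory stand-in so the script still runs where the extension or the service is missing.

```php
<?php
// Hypothetical helper: store a value and read it back through any object
// that exposes set($key, $value) and get($key), as the Memcache class does.
function cache_roundtrip($cache, $key, $value) {
  if (!$cache->set($key, $value)) {
    return FALSE;
  }
  return $cache->get($key) === $value;
}

// Hypothetical in-memory stand-in, used when memcache is not available.
class ArrayCache {
  protected $data = array();

  public function set($key, $value) {
    $this->data[$key] = $value;
    return TRUE;
  }

  public function get($key) {
    return isset($this->data[$key]) ? $this->data[$key] : FALSE;
  }
}

// use the real extension when it is loaded and the server answers
$cache = new ArrayCache();
if (extension_loaded('memcache')) {
  $memcache = new Memcache;
  if (@$memcache->connect('127.0.0.1', 11211)) {
    $cache = $memcache;
  }
}

echo get_class($cache) . ': ';
echo cache_roundtrip($cache, 'roundtrip_test', 'hello') ? "roundtrip ok\n" : "roundtrip failed\n";
```

The class name printed at the start of the line shows whether the real Memcache object or the stand-in was exercised.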

Now, my system is ready to begin work with the Memcache API and Integration Drupal module :)


This morning, I encountered a PHP fatal error in my development environment. Upon further inspection, one of my third-party modules (XML Sitemap) required a later version of PHP. A fresh installation of CentOS 5.3 comes with version 5.1.6 of PHP. Here is an easy way to upgrade PHP to a later version by using the Utter Ramblings Yum repository.

I created a new yum repo file:

$ sudo emacs /etc/yum.repos.d/utterramblings.repo

# FILE CONTENTS - START
[utterramblings]
name=Jason's Utter Ramblings Repo
baseurl=http://www.jasonlitka.com/media/EL$releasever/$basearch/
enabled=1
gpgcheck=1
gpgkey=http://www.jasonlitka.com/media/RPM-GPG-KEY-jlitka
# FILE CONTENTS - END

Then I ran a yum update:

$ sudo yum update
# ...snip...
Updated:
  apr.i386 0:1.2.12-2.jason.1
  apr-util.i386 0:1.2.12-5.jason.1
  curl.i386 0:7.15.5-2.1.el5_3.5
  httpd.i386 0:2.2.8-jason.3
  ksh.i386 0:20080202-2.el5_3.1
  mod_ssl.i386 1:2.2.8-jason.3
  mysql.i386 0:5.0.58-jason.2
  mysql-server.i386 0:5.0.58-jason.2
  pcre.i386 0:7.6-jason.1
  php.i386 0:5.2.6-jason.1
  php-cli.i386 0:5.2.6-jason.1
  php-common.i386 0:5.2.6-jason.1
  php-gd.i386 0:5.2.6-jason.1
  php-mbstring.i386 0:5.2.6-jason.1
  php-mssql.i386 0:5.2.6-jason.1
  php-mysql.i386 0:5.2.6-jason.1
  php-odbc.i386 0:5.2.6-jason.1
  php-pdo.i386 0:5.2.6-jason.1
  php-pear.noarch 1:1.6.2-1.jason.1
  php-xml.i386 0:5.2.6-jason.1
  php-xmlrpc.i386 0:5.2.6-jason.1
  subversion.i386 0:1.4.4-jason.1
  tzdata.noarch 0:2009k-1.el5
Complete!

After updating all these packages, I checked out my new PHP version:

$ php -v | head -1
PHP 5.2.6 (cli) (built: May  5 2008 10:32:59)

Now, my PHP fatal error has been resolved.
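
A version requirement like this can also be checked from PHP itself before a module blows up. Here is a small sketch using version_compare(); the helper name php_version_ok() is hypothetical, and the '5.2.0' minimum is only an assumed placeholder, since I have not stated which version the module actually needs.

```php
<?php
// Hypothetical guard: TRUE when $current is at least $required.
// $current defaults to the running interpreter's version.
function php_version_ok($required, $current = PHP_VERSION) {
  return version_compare($current, $required, '>=');
}

// '5.2.0' is an assumed minimum, used for illustration only
$required = '5.2.0';

if (php_version_ok($required)) {
  echo 'PHP ' . PHP_VERSION . " is new enough\n";
}
else {
  echo 'PHP ' . PHP_VERSION . ' is older than ' . $required . "\n";
}
```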

NOTE: This blog entry is a re-post of a previous article.


I just encountered a PHP fatal error when running my cron.php:

Fatal error: Call to undefined function timezone_open() in /MYSERVERPATH/httpdocs/sites/all/modules/date/date_api.module on line 607

After a quick Google search, I found the issue documented here. The solution is to enable the Date PHP4 module. However, this issue does not happen in our production environment, so I decided to compare PHP versions:

# on the production server:
$ php -v | head -1
PHP 5.2.8 (cli) (built: Dec  9 2008 14:03:11)
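
Besides comparing version strings, the missing function can be detected directly. timezone_open() was added in PHP 5.2.0, which is why the older interpreter fails. The sketch below uses function_exists(); the helper name missing_functions() is hypothetical, and the list of functions checked is just an example.

```php
<?php
// Hypothetical helper: return the names from $names that are not defined
// in the running PHP interpreter.
function missing_functions(array $names) {
  $missing = array();
  foreach ($names as $name) {
    if (!function_exists($name)) {
      $missing[] = $name;
    }
  }
  return $missing;
}

// example list; timezone_open is the function from the fatal error above
$missing = missing_functions(array('timezone_open', 'date_create'));

if (empty($missing)) {
  echo "date functions available\n";
}
else {
  echo 'missing: ' . implode(', ', $missing) . "\n";
}
```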

It turns out that a fully updated installation of CentOS 5.2 only supplies PHP 5.1.x. So, I decided to upgrade PHP in my development environment according to this documentation.

I created a new yum repo file:

$ sudo emacs /etc/yum.repos.d/utterramblings.repo

# FILE CONTENTS - START
[utterramblings]
name=Jason's Utter Ramblings Repo
baseurl=http://www.jasonlitka.com/media/EL$releasever/$basearch/
enabled=1
gpgcheck=1
gpgkey=http://www.jasonlitka.com/media/RPM-GPG-KEY-jlitka
# FILE CONTENTS - END

Then I ran another yum update:

$ sudo yum update
# ...snip...
Resolving Dependencies
# ...snip...
Dependencies Resolved

=============================================================================
Package                 Arch       Version          Repository        Size
=============================================================================
Updating:
apr                     i386       1.2.12-2.jason.1  utterramblings    257 k
apr-util                i386       1.2.12-5.jason.1  utterramblings    159 k
httpd                   i386       2.2.8-jason.3    utterramblings    2.5 M
mod_ssl                 i386       1:2.2.8-jason.3  utterramblings    314 k
mysql                   i386       5.0.58-jason.2   utterramblings    6.4 M
mysql-server            i386       5.0.58-jason.2   utterramblings     10 M
pcre                    i386       7.6-jason.1      utterramblings    562 k
php                     i386       5.2.6-jason.1    utterramblings    3.7 M
php-cli                 i386       5.2.6-jason.1    utterramblings    2.6 M
php-common              i386       5.2.6-jason.1    utterramblings    481 k
php-devel               i386       5.2.6-jason.1    utterramblings    568 k
php-gd                  i386       5.2.6-jason.1    utterramblings    320 k
php-ldap                i386       5.2.6-jason.1    utterramblings     56 k
php-mbstring            i386       5.2.6-jason.1    utterramblings    1.3 M
php-mssql               i386       5.2.6-jason.1    utterramblings     61 k
php-mysql               i386       5.2.6-jason.1    utterramblings    258 k
php-odbc                i386       5.2.6-jason.1    utterramblings    112 k
php-pdo                 i386       5.2.6-jason.1    utterramblings    159 k
php-pear                noarch     1:1.6.2-1.jason.1  utterramblings    418 k
php-soap                i386       5.2.6-jason.1    utterramblings    342 k
php-xml                 i386       5.2.6-jason.1    utterramblings    316 k
php-xmlrpc              i386       5.2.6-jason.1    utterramblings    130 k
subversion              i386       1.4.4-jason.1    utterramblings    4.3 M

Transaction Summary
=============================================================================
Install      0 Package(s)        
Update      23 Package(s)        
Remove       0 Package(s)        

Total download size: 35 M
Is this ok [y/N]:

After updating all these packages, I checked out my new PHP version:

$ php -v | head -1
PHP 5.2.6 (cli) (built: May  5 2008 10:32:59)

Now, my PHP fatal error has been resolved :)

Sometimes I am forced to edit PHP files outside Eclipse. Here's a quick guide to making your text editor (in this case, Emacs) a little more user-friendly by enabling php-mode and syntax highlighting.

First, download php-mode and stick it in your ~/.emacs.d folder:

cd ~/.emacs.d
wget http://php-mode.svn.sourceforge.net/svnroot/php-mode/tags/php-mode-1.4.0/php-mode.el

Next, paste the following code into your ~/.emacs file. This will enable php-mode and syntax highlighting. As you can see, I also added a default file extension for .module files.

(global-font-lock-mode 1)

(require 'php-mode)
(setq auto-mode-alist
  (append '(("\\.php$" . php-mode)
            ("\\.module$" . php-mode))
              auto-mode-alist))

Now, when you open .php or .module files, your code will be syntax highlighted and Emacs will be tailored to editing PHP code.
