A Ruby class to crawl a website using Nokogiri, MongoDB, and the MongoMapper ORM
In this post I'm going to build on a previous blog post, A simple HTTP Ruby class that uses Nokogiri to crawl a URL for internal and external URLs, and incorporate a few new concepts: a MongoDB NoSQL database, the MongoMapper ORM, and a class structure that allows for scanning, resuming, and querying the data independently.
I installed MongoDB via Homebrew.
# install
$ brew install mongodb
# start mongo daemon
$ mongod --config /usr/local/etc/mongod.conf
I installed the Ruby gems:
$ gem install mongo mongo_mapper nokogiri bson bson_ext
Next I created a MongoMapper class to represent the properties of a crawled URL.
require 'mongo_mapper'

class NG_URL
  include MongoMapper::Document
  connection Mongo::Connection.new('localhost', 27017)
  set_database_name 'ng_crawl'
  timestamps!
  key :url, String
  key :host, String
  key :host_base, String
  key :scheme_host, String
  key :error_exists, Boolean
  key :content_type, String
  key :http_status_code, String
  key :content_length, Integer
  key :document, String
  key :a_hrefs_unprocessed, Array
  key :a_hrefs_processed, Array
  key :a_hrefs_external, Array
  key :a_hrefs_ignored, Array
  key :scanned_at, Time
end
The above class allows me to easily create and query URL documents, for example:
# create new object
n = NG_URL.new({:url => 'http://example.com'})
n.save!
# querying:
NG_URL.find_by_url('http://example.com')
NG_URL.where(:url => 'http://example.com').first
I then created a new crawler class that uses Nokogiri to extract <a> tags from HTML documents and store the results in the MongoDB database.
require 'open-uri'
require 'nokogiri'
class NG_Crawl
  def initialize(url)
    unless url_valid? url
      puts "Initial URL is not valid.\n"
      exit
    end
    # set initial url to instance variables
    @url_initial = url
    @ng_url_initial = url_to_object url
    # scan initial
    scan_ng_url @ng_url_initial
  end
  # method to recursively crawl unprocessed URLs
  def crawl
    while NG_URL.where(:a_hrefs_unprocessed => { :$not => { :$size => 0 }}).count > 0 do
      next_unprocessed_url
    end
  end
  # method that returns an Array of all URLs
  def all_urls
    NG_URL.all.collect {|n| n.url}.sort
  end
  # method that returns an Array of all external URLs
  def all_urls_external
    ngurls = NG_URL.all(:a_hrefs_external => { :$not => { :$size => 0 }})
    urls_external = []
    ngurls.each {|n| urls_external += n.a_hrefs_external }
    # uniq! returns nil when there are no duplicates to remove,
    # so use the non-destructive forms here
    urls_external.uniq.sort
  end
  def url_valid?(url)
    !(url =~ URI::regexp).nil?
  end
  # returns NG_URL object for URL string, creates as necessary
  def url_to_object(url)
    # load existing object
    ngurl = NG_URL.last(:url => url)
    if ngurl.nil?
      uri = URI url
      ngurl = NG_URL.new
      ngurl.url = url
      ngurl.host = uri.host
      ngurl.scheme_host = "#{uri.scheme}://#{uri.host}"
      if uri.port == 3000
        ngurl.scheme_host = ngurl.scheme_host + ':3000'
      end
      ngurl.host_base = get_url_host_base uri.host
      ngurl.a_hrefs_unprocessed = []
      ngurl.a_hrefs_processed = []
      ngurl.save!
    end
    ngurl
  end
  # returns hostname without any subdomains
  def get_url_host_base(host)
    host_split = host.split '.'
    # check for hostnames without periods, like "localhost"
    if host_split.size == 1
      return host
    end
    host_split.pop(2).join('.')
  end
  def scan_ng_url(ngurl)
    # check if url has already been scanned
    unless ngurl.scanned_at.nil?
      return true
    end
    begin
      # Kernel#open is patched by open-uri to fetch URLs
      # (on newer Rubys prefer URI.open)
      openuri = open ngurl.url
    rescue
      ngurl.error_exists = true
      ngurl.save!
      return false
    end
    ngurl.content_type = openuri.content_type
    ngurl.http_status_code = openuri.status.first
    ngurl.content_length = openuri.meta['content-length']
    ngurl.scanned_at = Time.new
    # check for text content types
    if openuri.content_type =~ /^text/
      doc = Nokogiri::HTML(openuri)
      a_hrefs = doc.css('a').collect {|a| a['href']}
      ngurl.document = doc.to_s
      ngurl.a_hrefs_unprocessed = a_hrefs
    end
    ngurl.save!
  end
  def next_unprocessed_url
    # find ng_url object with unprocessed a_hrefs
    ngurl = NG_URL.where(:a_hrefs_unprocessed => { :$not => { :$size => 0 }}).sort(:created_at.asc).first
    url = ngurl.a_hrefs_unprocessed.shift
    if url.nil?
      ngurl.save!
      return
    end
    # debug
    p url
    # check for urls to ignore
    if url =~ /^#/ || url =~ /^javascript:/ || url =~ /^mailto:/
      ngurl.a_hrefs_ignored << url
      ngurl.save!
      return
    end
    # parse after the ignore patterns above, which may not be valid URIs
    begin
      uri = URI url
    rescue URI::InvalidURIError
      ngurl.a_hrefs_ignored << url
      ngurl.save!
      return
    end
    # check scheme
    scheme = uri.scheme
    if !scheme.nil? && !['http', 'https'].include?(scheme)
      ngurl.a_hrefs_ignored << url
      ngurl.save!
      return
    end
    # check for urls starting with '/'
    if url =~ /^\//
      url = @ng_url_initial.scheme_host + url
    end
    # check for relative links beginning with '../'
    # todo: ensure this is working
    if url =~ /^\.\.\//
      url = fix_relative_parent_url(ngurl.url, url)
    end
    # check for relative links
    if not url =~ /^(http|https):\/\//
      parent_url = ngurl.url
      if parent_url[-1..-1] != '/'
        parent_url = parent_url[0..parent_url.rindex('/')]
      end
      url = "#{parent_url}#{url}"
    end
    # check if url is external
    if url_external? ngurl.host_base, url
      ngurl.a_hrefs_external << url
      ngurl.save!
      return
    end
    # remove trailing slash from url
    if url[-1..-1] == '/'
      url = url[0..-2]
    end
    # check if url object has not yet been created
    ngurl_count = NG_URL.where(:url => url).count
    if ngurl_count == 0
      # scan unprocessed url
      new_ngurl = url_to_object url
      scan_ng_url new_ngurl
    end
    # add url to processed list
    ngurl.a_hrefs_processed << url
    ngurl.save!
  end
  def url_external?(host_base, url)
    uri = URI url
    url_host_base = get_url_host_base(uri.host)
    not host_base == url_host_base
  end
  def fix_relative_parent_url(parent_url, url)
    if parent_url[-1..-1] != '/'
      parent_url = parent_url[0..parent_url.rindex('/')]
    end
    uri = URI parent_url
    uri_path_split = uri.path.split '/'
    url_split = url.split '../'
    url_remainder = ''
    url_split.each do |s|
      if s.empty?
        uri_path_split.pop
      else
        url_remainder = s
      end
    end
    new_url = "#{uri.scheme}://#{uri.host}"
    if uri.port == 3000
      new_url = new_url + ':3000'
    end
    "#{new_url}#{uri_path_split.join('/')}/#{url_remainder}"
  end
end
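The base-host logic is the part of the class that decides whether a link is internal or external, so it's worth seeing in isolation. This is a minimal standalone sketch of the same split-and-pop approach (the method names mirror the class, but this copy runs without MongoDB or MongoMapper):

```ruby
require 'uri'

# Return the hostname without any subdomains, as in the crawler class.
def get_url_host_base(host)
  parts = host.split('.')
  return host if parts.size == 1   # e.g. "localhost"
  parts.pop(2).join('.')           # keep the last two labels
end

# A link is external when its base host differs from the crawl's base host.
def url_external?(host_base, url)
  get_url_host_base(URI(url).host) != host_base
end

get_url_host_base('blog.example.com')                      # => "example.com"
url_external?('example.com', 'http://www.example.com/a')   # => false
url_external?('example.com', 'http://other.org/')          # => true
```

Note that this treats `www.example.com` and `blog.example.com` as the same site, which is exactly why the class stores `host_base` separately from `host`.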
I used my class to crawl a website like this:
# instantiate crawler class object
ngc = NG_Crawl.new 'http://example.com'
# recursively crawl unprocessed URLs
ngc.crawl
# output all scanned URLs
puts ngc.all_urls
# output all external URLs
puts ngc.all_urls_external
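One closing note: the relative-link handling that `next_unprocessed_url` does by hand (leading `/`, `../` prefixes, and bare relative paths) can also be delegated to Ruby's standard library. `URI.join` resolves a link against the URL of the page it was found on in one call; this is a possible simplification, not what the class above does:

```ruby
require 'uri'

# The page an <a> tag was found on.
page = 'http://example.com/blog/posts/'

# Root-relative, parent-relative, and plain relative hrefs all resolve:
URI.join(page, '/about').to_s        # => "http://example.com/about"
URI.join(page, '../archive').to_s    # => "http://example.com/blog/archive"
URI.join(page, 'hello-world').to_s   # => "http://example.com/blog/posts/hello-world"
```

Using `URI.join` would also make the `fix_relative_parent_url` helper (and its hard-coded port handling) unnecessary, since the resolved URI carries the scheme, host, and port of the parent page.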