Ruby Selenium script to authenticate to GitHub and download all RailsCasts Pro videos

Here’s a Ruby script I wrote that uses Selenium to sign in to railscasts.com through GitHub and crawl the episode list, and Curb to download every RailsCasts Pro video. It assumes you have a paid RailsCasts account (http://railscasts.com/pro).

#!/usr/bin/env ruby

require 'selenium-webdriver'
require 'set'
require 'curb'

# define GitHub credentials
github_email = 'YOUR GITHUB EMAIL ADDRESS'
github_password = 'YOUR GITHUB PASSWORD'

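# define download dir, and get a list of existing files
download_dir = './downloads'
# only create the directory on the first run; Dir.mkdir raises if it already exists
Dir.mkdir download_dir unless Dir.exist? download_dir
Dir.chdir download_dir
# we are now inside the download dir, so list the current directory
existing_files = Dir.entries '.'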

# create new webdriver
driver = Selenium::WebDriver.for :firefox

# log into RailsCasts via GitHub (the login page redirects to the GitHub sign-in form)
driver.navigate.to 'http://railscasts.com/login'
driver.find_element(:id, 'login_field').send_keys github_email
element = driver.find_element(:id, 'password')
element.send_keys github_password
element.submit

# setup variables to contain navigation and episode links
nav_links_unscanned = ['http://railscasts.com/?page=1&view=list']
nav_links_scanned = []
episode_links = Set.new

# get a unique list of episode links
while !nav_links_unscanned.empty?
  link = nav_links_unscanned.shift
  nav_links_scanned << link

  driver.navigate.to link
  a_tags = driver.find_elements(:tag_name, 'a')
  a_tags.each do |a|
    # check for episode link
    if a[:href] =~ /.*\/episodes\/[0-9]+.*(?<!view=comments)$/
      episode_links << a[:href]
    # check for navigation link
    elsif a[:href] =~ /page.*view=list/ && !nav_links_unscanned.include?(a[:href]) && !nav_links_scanned.include?(a[:href])
      nav_links_unscanned << a[:href]
    end
  end

end

# loop through episode links and download movies
episode_links.each do |link|
  driver.navigate.to link

  # get movie link
  e = driver.find_element(:link_text, 'mp4')

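  # download file, skipping anything already in the download dir
  file_name = e[:href].split('/').last
  if !existing_files.include?(file_name)
    existing_files << file_name
    puts "Downloading: #{e[:href]}\n"
    curld = Curl.get(e[:href])
    # write in binary mode so the video data is not altered on platforms that translate newlines
    File.open(file_name, 'wb') {|f| f.write(curld.body_str) }
  end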

end

driver.quit
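
The two regular expressions in the crawl loop do most of the work of deciding what to follow, so it can help to try them outside the browser. Here is a minimal sketch that pulls them into a plain method. The patterns are copied from the script above; the constant names, the method name classify_link and the returned symbols are made up for illustration and are not part of the original script.

```ruby
# Patterns copied from the crawl loop above.
# EPISODE_LINK: any /episodes/<number>... URL that does not end in view=comments
# NAV_LINK: any paginated list-view URL
EPISODE_LINK = /.*\/episodes\/[0-9]+.*(?<!view=comments)$/
NAV_LINK = /page.*view=list/

# Hypothetical helper: returns :episode, :navigation or :other for an href.
# Anchors without an href attribute come back as nil, so handle that first.
def classify_link(href)
  return :other if href.nil?
  return :episode if href =~ EPISODE_LINK
  return :navigation if href =~ NAV_LINK
  :other
end

p classify_link('http://railscasts.com/episodes/417-foundation')               # => :episode
p classify_link('http://railscasts.com/episodes/417-foundation?view=comments') # => :other
p classify_link('http://railscasts.com/?page=2&view=list')                     # => :navigation
```

Inside the loop, the if/elsif could then become a case on classify_link(a[:href]), with the two include? checks kept on the :navigation branch so already queued or scanned pages are not added again.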

Source code on GitHub
