Using PHP and MD5 to find duplicate images in iPhoto and view/compare the results

For this article I'll share some old school procedural PHP scripts I used to scan a directory for duplicate images and display the results for comparison. A while back, I had a hardware failure and had to write some rsync commands to manually pull my iPhoto images off of a dying Time Machine external harddrive. The basic gist of these script is simple: find all the images, create a unique MD5 hash of the image, collect some other details, write the records to a MySQL database, execute some SQL to find MD5 duplicates, and show the results side by side for comparison. Since I was executing this code on my iMac, I used MAMP to provide the Apache and MySQL services.

The first script which will be included all the rest just sets up a MySQL database connection.

Script: db.php

<?php
// define mysql credentials
$db_user = 'picture_data';
$db_pass = 'picture_data';
$db_database = 'picture_data';
$db_table = 'picture_data';
$db_host = 'localhost';

// connect to mysql database
$db = mysql_connect($db_host, $db_user, $db_pass);

// check for mysql connection
if (!$db) {
  die('Could not connect to database.');
}
?>

I then created the script to find all the images, create the md5 hash, and store the data in MySQL. I put this script outside my Apache vhost docroot and only had to execute it once.

Script: scan.php

<?php
//////////////////////////////////////////////////
// DATABASE SETUP

require_once('db.php');

setup_database();

//////////////////////////////////////////////////
// PROCESSING IMAGES

// specify path to images
$images_path = '/Users/Eric/Pictures/iPhoto Library/Originals';

// ensure directory exists
if (!is_dir($images_path)) {
  die('Directory does not exist.');
}

// change directory
chdir($images_path);

// get a list of files
$files = `find . -type f | sed 's/^\.\///'`;

// explode files list on newline
$files = explode("\n", trim($files));

// define a list of file extensions to process
$file_extensions = array(
  'jpg',
  'jpeg',
  'png',
  'bmp',
  'gif',
  'tiff',
);

// loop through files
foreach ($files as $file_path) {

  // get path info
  $path_info = pathinfo($file_path);
  $file_name = $path_info['basename'];
  $file_extension = strtolower($path_info['extension']);

  // check file extension
  if (!in_array($file_extension, $file_extensions)) {
    continue;
  }

  // get md5 hash of file
  $file_md5 = md5_file($file_path);

  // get file modified time
  $file_modified = date('Y-m-d H:i:s', filemtime($file_path));

  // create sql to insert record
  $sql = sprintf(
    "insert into `%s` (file_path, file_name, file_extension, file_md5, file_modified) values ('%s','%s','%s','%s','%s')",
    mysql_real_escape_string($db_table),
    mysql_real_escape_string($images_path . '/' . $file_path),
    mysql_real_escape_string($file_name),
    mysql_real_escape_string($file_extension),
    mysql_real_escape_string($file_md5),
    mysql_real_escape_string($file_modified)
  );

  // execute sql
  $result = mysql_query($sql, $db);

}

//////////////////////////////////////////////////
// FUNCTIONS

function setup_database() {

  global $db;
  global $db_database;
  global $db_table;

  // create database if it is does not exist
  $sql = sprintf(
    "create database if not exists `%s`",
    mysql_real_escape_string($db_database)
  );
  $result = mysql_query($sql, $db);

  // check for error
  if (!$result) {
    die(mysql_error());
  }

  // select database
  $result = mysql_select_db($db_database, $db);

  // check for error
  if (!$result) {
    die(mysql_error());
  }

  // create table if it does not exist
  $sql = sprintf("
    CREATE TABLE IF NOT EXISTS `%s` (
      `fid` int(11) NOT NULL AUTO_INCREMENT,
      `file_path` varchar(255) NOT NULL,
      `file_name` varchar(255) NOT NULL,
      `file_extension` varchar(10) NOT NULL,
      `file_md5` varchar(32) NOT NULL,
      `file_modified` datetime NOT NULL,
      PRIMARY KEY (`fid`),
      KEY `idx_file_md5` (`file_md5`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1",
    mysql_real_escape_string($db_table)
  );
  $result = mysql_query($sql, $db);

  // check for error
  if (!$result) {
    die(mysql_error());
  }

  // drop existing records from table
  $sql = sprintf(
    "truncate table `%s`",
    mysql_real_escape_string($db_table)
  );
  $result = mysql_query($sql, $db);

  // check for error
  if (!$result) {
    die(mysql_error());
  }

}
?>

I then ran the script on the command line. It took a while to go through all 25K+ images in my directory.

$ php scan.php

The next script I wrote will aid in the display of the images. I wrote this script because the absolute path of my images was outside my Apache vhost docroot. It checks for 2 $_GET variables: the md5 hash and a integer representing which duplicate image to show. The images is read and displayed, so this script can be inserted into the "scr" attribute of an img tag.

Script: view-image.php

<?php
//////////////////////////////////////////////////
// DATABASE

require_once('db.php');

// select database
$result = mysql_select_db($db_database, $db);

// check for error
if (!$result) {
  die(mysql_error());
}

//////////////////////////////////////////////////
// PROCESS REQUEST

$md5 = $_GET['md5'];
$index = intval($_GET['index']);

// fetch images with md5 index
$sql = sprintf("
  select *
  from `%s`
  where file_md5 = '%s'
  order by fid asc
  ",
  mysql_real_escape_string($db_table),
  mysql_real_escape_string($md5)
);

$result = mysql_query($sql, $db);

// check for error
if (!$result) {
  die(mysql_error());
}

// fetch results
$rows = array();
while ($row = mysql_fetch_object($result)) {
  $rows[] = $row;
}

// get image data
$file_path = $rows[$index]->file_path;
$file_extension = $rows[$index]->file_extension;

header("Content-type: image/$file_extension");
readfile($file_path);
?>

The last script ties everything together. It determines which duplicates exist and allows you to view them. For my environment, I decided to store the list of MD5 duplicates in the $_SESSION, to prevent repeat SQL.

Script: view.php

<?php
//////////////////////////////////////////////////
// DATABASE SETUP

require_once('db.php');

// select database
$result = mysql_select_db($db_database, $db);

// check for error
if (!$result) {
  die(mysql_error());
}

//////////////////////////////////////////////////
// FETCHING MD5S

// start session
session_start();

// check for session data
if (!is_array($_SESSION['md5s']) || empty($_SESSION['md5s'])) {
  fetch_md5s();
}

// determine which md5 to show
$md5_index = intval($_GET['md5_index']);

// fetch images with md5 index
$sql = sprintf("
  select *
  from `%s`
  where file_md5 = '%s'
  order by fid asc
  ",
  mysql_real_escape_string($db_table),
  mysql_real_escape_string($_SESSION['md5s'][$md5_index])
);

$result = mysql_query($sql, $db);

// check for error
if (!$result) {
  die(mysql_error());
}

// fetch results
$rows = array();
while ($row = mysql_fetch_object($result)) {
  $rows[] = $row;
}

// create image output in a table. note the image scr is calling the view-image.php script with $_GET arguments.
$output = "";
$output .= "<table><tr>";
foreach ($rows as $index => $data) {
  $output .= "<td style='width: " . (100/count($rows)) . "%'>";
  $output .= "<img style='width: 100%' src='/view-image.php?md5=" . $data->file_md5 . "&index=" . $index . "' />";
  $output .= $data->file_name . "<br/>";
  $output .= $data->file_path . "<br/>";
  $output .= "</td>";
}
$output .= "</tr></table>";

$output .= "<a href='/view.php?md5_index=" . ($md5_index+1) . "'>Next >></a>";

print $output;

//////////////////////////////////////////////////
// FUNCTIONS

function fetch_md5s() {

  global $db;
  global $db_table;

  // get a list of md5 hashes with dupes
  $sql = sprintf("
    select file_md5
    from `%s`
    group by file_md5
    having count(*) > 1
    ",
    mysql_real_escape_string($db_table)
  );

  $result = mysql_query($sql, $db);

  // check for error
  if (!$result) {
    die(mysql_error());
  }

  // fetch results
  $md5s = array();
  while ($row = mysql_fetch_object($result)) {
    $md5s[] = $row->file_md5;
  }

  // store md5s in session
  $_SESSION['md5s'] = $md5s;

}
?>

Now I went to my browser to execute the view.php script and view the results.

Picture Duplicates