Using PHP and MD5 to find duplicate images in iPhoto and view/compare the results

Created by Eric.London on 2011-07-28
Please note: the content on this page originates from ericlondon.com.
For this article, I'll share some old-school procedural PHP scripts I used to scan a directory for duplicate images and display the results for comparison. A while back, I had a hardware failure and had to write some rsync commands to manually pull my iPhoto images off of a dying Time Machine external hard drive. The basic gist of these scripts is simple: find all the images, compute an MD5 hash of each image file, collect some other details, write the records to a MySQL database, execute some SQL to find files that share an MD5 hash, and show the results side by side for comparison. Since I was executing this code on my iMac, I used MAMP to provide the Apache and MySQL services.
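
As a rough illustration of that gist, here is a minimal sketch that groups files by MD5 hash in memory, without a database. The function name find_duplicates_by_md5() is hypothetical and not part of the scripts in this article. It only shows the core idea: files with identical contents produce the same hash.

```php
<?php
// Hypothetical helper (not one of the article's scripts): group file paths
// by the MD5 hash of their contents and keep only hashes seen more than once.
function find_duplicates_by_md5(array $paths) {
  $groups = array();
  foreach ($paths as $path) {
    // skip anything that is not a regular file
    if (!is_file($path)) {
      continue;
    }
    $hash = md5_file($path);
    if ($hash === false) {
      continue; // file could not be read
    }
    $groups[$hash][] = $path;
  }
  // keep only the hashes shared by two or more files
  return array_filter($groups, function ($files) {
    return count($files) > 1;
  });
}
```

The result is an array keyed by MD5 hash, where each value lists the paths that share that hash. For a library of tens of thousands of images, storing the records in a database as described here is probably easier to query and revisit than an in-memory array.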

The first script, which is included by all the others, just sets up a MySQL database connection.

Script: db.php
<?php
// define mysql credentials
$db_user = 'picture_data';
$db_pass = 'picture_data';
$db_database = 'picture_data';
$db_table = 'picture_data';
$db_host = 'localhost';

// connect to mysql database
$db = mysql_connect($db_host, $db_user, $db_pass);

// check for mysql connection
if (!$db) {
  die('Could not connect to database.');
}
?>


Next, I created the script that finds all the images, computes the MD5 hash of each one, and stores the data in MySQL. I put this script outside my Apache vhost docroot and only had to execute it once.

Script: scan.php
<?php
//////////////////////////////////////////////////
// DATABASE SETUP

require_once('db.php');

setup_database();

//////////////////////////////////////////////////
// PROCESSING IMAGES

// specify path to images
$images_path = '/Users/Eric/Pictures/iPhoto Library/Originals';

// ensure directory exists
if (!is_dir($images_path)) {
  die('Directory does not exist.');
}

// change directory
chdir($images_path);

// get a list of files
$files = `find . -type f | sed 's/^\.\///'`;

// explode files list on newline
$files = explode("\n", trim($files));

// define a list of file extensions to process
$file_extensions = array(
  'jpg',
  'jpeg',
  'png',
  'bmp',
  'gif',
  'tiff',
);

// loop through files
foreach ($files as $file_path) {

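  // get path info
  $path_info = pathinfo($file_path);
  $file_name = $path_info['basename'];
  // pathinfo() omits the 'extension' key for files that have no extension
  $file_extension = isset($path_info['extension']) ? strtolower($path_info['extension']) : '';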
  
  // check file extension 
  if (!in_array($file_extension, $file_extensions)) {
    continue;
  }

  // get md5 hash of file
  $file_md5 = md5_file($file_path);

  // get file modified time
  $file_modified = date('Y-m-d H:i:s', filemtime($file_path));

  // create sql to insert record
  $sql = sprintf(
    "insert into `%s` (file_path, file_name, file_extension, file_md5, file_modified) values ('%s','%s','%s','%s','%s')",
    mysql_real_escape_string($db_table),
    mysql_real_escape_string($images_path . '/' . $file_path),
    mysql_real_escape_string($file_name),
    mysql_real_escape_string($file_extension),
    mysql_real_escape_string($file_md5),
    mysql_real_escape_string($file_modified)
  );

  // execute sql
  $result = mysql_query($sql, $db);

  // check for error
  if (!$result) {
    die(mysql_error());
  }

}

//////////////////////////////////////////////////
// FUNCTIONS

function setup_database() {

  global $db;
  global $db_database;
  global $db_table;

  // create database if it does not exist
  $sql = sprintf(
    "create database if not exists `%s`",
    mysql_real_escape_string($db_database)
  );
  $result = mysql_query($sql, $db);
  
  // check for error
  if (!$result) {
    die(mysql_error());
  }
  
  // select database
  $result = mysql_select_db($db_database, $db);
  
  // check for error
  if (!$result) {
    die(mysql_error());
  }
  
  // create table if it does not exist
  $sql = sprintf("
    CREATE TABLE IF NOT EXISTS `%s` (
      `fid` int(11) NOT NULL AUTO_INCREMENT,
      `file_path` varchar(255) NOT NULL,
      `file_name` varchar(255) NOT NULL,
      `file_extension` varchar(10) NOT NULL,
      `file_md5` varchar(32) NOT NULL,
      `file_modified` datetime NOT NULL,
      PRIMARY KEY (`fid`),
      KEY `idx_file_md5` (`file_md5`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1",
    mysql_real_escape_string($db_table)
  );
  $result = mysql_query($sql, $db);
  
  // check for error
  if (!$result) {
    die(mysql_error());
  }
  
  // drop existing records from table
  $sql = sprintf(
    "truncate table `%s`",
    mysql_real_escape_string($db_table)
  );
  $result = mysql_query($sql, $db);
  
  // check for error
  if (!$result) {
    die(mysql_error());
  }

}
?>
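
scan.php builds its file list by shelling out to find and sed. As an alternative, the same list can be produced in pure PHP with the SPL directory iterators. The sketch below is only an illustration: list_image_files() is a hypothetical name, and it assumes the root directory exists and is not the filesystem root.

```php
<?php
// Hypothetical helper: return the image files under $root as paths relative
// to $root, using SPL iterators instead of the `find` command.
function list_image_files($root, array $extensions) {
  $root = rtrim($root, '/');
  $files = array();
  $iterator = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
  );
  foreach ($iterator as $file_info) {
    if (!$file_info->isFile()) {
      continue;
    }
    // getExtension() returns an empty string when there is no extension
    $extension = strtolower($file_info->getExtension());
    if (in_array($extension, $extensions, true)) {
      // strip the root prefix and the slash after it
      $files[] = substr($file_info->getPathname(), strlen($root) + 1);
    }
  }
  sort($files);
  return $files;
}
```

Called as list_image_files($images_path, $file_extensions), this would take the place of the find, explode, and extension-check steps. It also avoids splitting on newlines, so file names that happen to contain a newline are not broken apart.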


I then ran the script on the command line. It took a while to go through all 25K+ images in my directory.


$ php scan.php
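
Most of that time likely goes into reading every file to hash it. One common way to cut this down, which is a different technique from what scan.php does, is to compare file sizes first and hash only the files that share a size with another file, since files of different sizes cannot be identical. The sketch below uses a hypothetical function name and works in memory rather than with MySQL. How much it saves depends on the library and has not been measured here.

```php
<?php
// Hypothetical helper: bucket files by size first, then hash only the
// buckets that contain two or more files.
function find_duplicates_size_first(array $paths) {
  // first pass: group by file size (no file contents are read)
  $by_size = array();
  foreach ($paths as $path) {
    if (is_file($path)) {
      $by_size[filesize($path)][] = $path;
    }
  }

  // second pass: hash only files whose size is shared
  $by_md5 = array();
  foreach ($by_size as $candidates) {
    if (count($candidates) < 2) {
      continue; // unique size, cannot have a duplicate
    }
    foreach ($candidates as $path) {
      $by_md5[md5_file($path)][] = $path;
    }
  }

  // keep only the hashes shared by two or more files
  return array_filter($by_md5, function ($files) {
    return count($files) > 1;
  });
}
```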


The next script helps display the images. I wrote it because the absolute path of my images was outside my Apache vhost docroot. It checks for two $_GET variables: the MD5 hash and an integer indicating which of the duplicate images to show. The image is read and sent to the browser, so this script can be used as the "src" attribute of an img tag.

Script: view-image.php
<?php
//////////////////////////////////////////////////
// DATABASE

require_once('db.php');

// select database
$result = mysql_select_db($db_database, $db);

// check for error
if (!$result) {
  die(mysql_error());
}

//////////////////////////////////////////////////
// PROCESS REQUEST

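// fall back to defaults when a parameter is missing
$md5 = isset($_GET['md5']) ? $_GET['md5'] : '';
$index = isset($_GET['index']) ? intval($_GET['index']) : 0;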

// fetch images with md5 index
$sql = sprintf("
  select *
  from `%s`
  where file_md5 = '%s'
  order by fid asc
  ",
  mysql_real_escape_string($db_table),
  mysql_real_escape_string($md5)
);

$result = mysql_query($sql, $db);

// check for error
if (!$result) {
  die(mysql_error());
}

// fetch results
$rows = array();
while ($row = mysql_fetch_object($result)) {
  $rows[] = $row;
}

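// make sure the requested duplicate exists
if (!isset($rows[$index])) {
  header('HTTP/1.0 404 Not Found');
  die('Image not found.');
}

// get image data
$file_path = $rows[$index]->file_path;
$file_extension = $rows[$index]->file_extension;

// "jpg" is not a registered image subtype; the MIME type is image/jpeg
$mime_subtype = ($file_extension == 'jpg') ? 'jpeg' : $file_extension;

header("Content-type: image/$mime_subtype");
readfile($file_path);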
?>
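
view-image.php derives the Content-type header from the stored file extension. As an alternative sketch, the type can be read from the file contents with PHP's getimagesize(), which returns the detected MIME type under the 'mime' key and returns false for files it does not recognize as images. The function name below is hypothetical.

```php
<?php
// Hypothetical helper: detect the MIME type from the file contents instead
// of trusting the extension. Returns false if the file is not a known image.
function detect_image_content_type($file_path) {
  if (!is_file($file_path)) {
    return false;
  }
  // suppress the notice PHP can raise for very short or unreadable files
  $info = @getimagesize($file_path);
  if ($info === false) {
    return false;
  }
  return $info['mime'];
}
```

In view-image.php this could be used just before the header() call, sending the image only when the function returns a type. That also guards against serving a file that has an image extension but is not actually an image.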


The last script ties everything together. It determines which duplicates exist and allows you to view them. For my environment, I decided to store the list of duplicated MD5 hashes in $_SESSION, so the duplicate-finding SQL only has to run once per session.

Script: view.php
<?php
//////////////////////////////////////////////////
// DATABASE SETUP

require_once('db.php');

// select database
$result = mysql_select_db($db_database, $db);

// check for error
if (!$result) {
  die(mysql_error());
}

//////////////////////////////////////////////////
// FETCHING MD5S

// start session
session_start();

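// check for session data
if (empty($_SESSION['md5s']) || !is_array($_SESSION['md5s'])) {
  fetch_md5s();
}

// determine which md5 to show (default to the first one)
$md5_index = isset($_GET['md5_index']) ? intval($_GET['md5_index']) : 0;

// stop if there is no duplicate at this position
if (!isset($_SESSION['md5s'][$md5_index])) {
  die('No more duplicates to show.');
}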

// fetch images with md5 index
$sql = sprintf("
  select *
  from `%s`
  where file_md5 = '%s'
  order by fid asc
  ",
  mysql_real_escape_string($db_table),
  mysql_real_escape_string($_SESSION['md5s'][$md5_index])
);

$result = mysql_query($sql, $db);

// check for error
if (!$result) {
  die(mysql_error());
}

// fetch results
$rows = array();
while ($row = mysql_fetch_object($result)) {
  $rows[] = $row;
}

// create image output in a table. note the image src is calling the view-image.php script with $_GET arguments.
$output = "";
$output .= "<table><tr>";
foreach ($rows as $index => $data) {
  $output .= "<td style='width: " . (100/count($rows)) . "%'>";
  $output .= "<img style='width: 100%' src='/view-image.php?md5=" . urlencode($data->file_md5) . "&amp;index=" . $index . "' />";
  // escape file names and paths so special characters cannot break the markup
  $output .= htmlspecialchars($data->file_name) . "<br/>";
  $output .= htmlspecialchars($data->file_path) . "<br/>";
  $output .= "</td>";
}
$output .= "</tr></table>";

$output .= "<a href='/view.php?md5_index=" . ($md5_index+1) . "'>Next &gt;&gt;</a>";

print $output;

//////////////////////////////////////////////////
// FUNCTIONS

function fetch_md5s() {

  global $db;
  global $db_table;

  // get a list of md5 hashes with dupes
  $sql = sprintf("
    select file_md5
    from `%s`
    group by file_md5
    having count(*) > 1
    ",
    mysql_real_escape_string($db_table)
  );
  
  $result = mysql_query($sql, $db);
  
  // check for error
  if (!$result) {
    die(mysql_error());
  }
  
  // fetch results
  $md5s = array();
  while ($row = mysql_fetch_object($result)) {
    $md5s[] = $row->file_md5;
  }
  
  // store md5s in session
  $_SESSION['md5s'] = $md5s;

}
?>


Finally, I opened view.php in my browser to view the results.

Picture Duplicates

Comments

 
  • Thanks!
    Created by Anonymous on 2011-07-29
    Thanks Eric! That's great. I merged my wife's iMac iPhoto library with mine a while back and I know we have hundreds of the same photos. I'm gonna try this script for sure. However, I might have to stop using iPhoto because it's unbearably slow with more than 50K images.

    I'm curious as to the next step and what you're doing about it. The merge. Perhaps a button on either image to delete or move the image to the other directory.

    I'd hope to use fileframework with Drupal someday to be a better photo manager. I'm curious if anyone has written an iPhone and camera photo importer into such a Drupal photo database.

    Thanks again for your post.
  • Many thanks
    Created by Anonymous on 2011-12-22
    I have been working on some tools for assembling a large database of photos, but with automated assembly in my tests I found about 10% duplication. I came up with the idea of using md5 and wanted to check around what other people were doing. Seeing your post gives me a sliver of hope that I am in fact not insane.

    I am curious though if you would take this same approach again?