Using PHP and MD5 to find duplicate images in iPhoto and view/compare the results
For this article I’ll share some old school procedural PHP scripts I used to scan a directory for duplicate images and display the results for comparison. A while back, I had a hardware failure and had to write some rsync commands to manually pull my iPhoto images off of a dying Time Machine external harddrive. The basic gist of these script is simple: find all the images, create a unique MD5 hash of the image, collect some other details, write the records to a MySQL database, execute some SQL to find MD5 duplicates, and show the results side by side for comparison. Since I was executing this code on my iMac, I used MAMP to provide the Apache and MySQL services.
The first script which will be included all the rest just sets up a MySQL database connection.
Script: db.php
<?php
// define mysql credentials
$db_user = 'picture_data';
$db_pass = 'picture_data';
$db_database = 'picture_data';
$db_table = 'picture_data';
$db_host = 'localhost';
// connect to mysql database
$db = mysql_connect($db_host, $db_user, $db_pass);
// check for mysql connection
if (!$db) {
die('Could not connect to database.');
}
?>
I then created the script to find all the images, create the md5 hash, and store the data in MySQL. I put this script outside my Apache vhost docroot and only had to execute it once.
Script: scan.php
<?php
//////////////////////////////////////////////////
// DATABASE SETUP
require_once('db.php');
setup_database();
//////////////////////////////////////////////////
// PROCESSING IMAGES
// specify path to images
$images_path = '/Users/Eric/Pictures/iPhoto Library/Originals';
// ensure directory exists
if (!is_dir($images_path)) {
die('Directory does not exist.');
}
// change directory
chdir($images_path);
// get a list of files
$files = `find . -type f | sed 's/^\.\///'`;
// explode files list on newline
$files = explode("\n", trim($files));
// define a list of file extensions to process
$file_extensions = array(
'jpg',
'jpeg',
'png',
'bmp',
'gif',
'tiff',
);
// loop through files
foreach ($files as $file_path) {
// get path info
$path_info = pathinfo($file_path);
$file_name = $path_info['basename'];
$file_extension = strtolower($path_info['extension']);
// check file extension
if (!in_array($file_extension, $file_extensions)) {
continue;
}
// get md5 hash of file
$file_md5 = md5_file($file_path);
// get file modified time
$file_modified = date('Y-m-d H:i:s', filemtime($file_path));
// create sql to insert record
$sql = sprintf(
"insert into `%s` (file_path, file_name, file_extension, file_md5, file_modified) values ('%s','%s','%s','%s','%s')",
mysql_real_escape_string($db_table),
mysql_real_escape_string($images_path . '/' . $file_path),
mysql_real_escape_string($file_name),
mysql_real_escape_string($file_extension),
mysql_real_escape_string($file_md5),
mysql_real_escape_string($file_modified)
);
// execute sql
$result = mysql_query($sql, $db);
}
//////////////////////////////////////////////////
// FUNCTIONS
function setup_database() {
global $db;
global $db_database;
global $db_table;
// create database if it is does not exist
$sql = sprintf(
"create database if not exists `%s`",
mysql_real_escape_string($db_database)
);
$result = mysql_query($sql, $db);
// check for error
if (!$result) {
die(mysql_error());
}
// select database
$result = mysql_select_db($db_database, $db);
// check for error
if (!$result) {
die(mysql_error());
}
// create table if it does not exist
$sql = sprintf("
CREATE TABLE IF NOT EXISTS `%s` (
`fid` int(11) NOT NULL AUTO_INCREMENT,
`file_path` varchar(255) NOT NULL,
`file_name` varchar(255) NOT NULL,
`file_extension` varchar(10) NOT NULL,
`file_md5` varchar(32) NOT NULL,
`file_modified` datetime NOT NULL,
PRIMARY KEY (`fid`),
KEY `idx_file_md5` (`file_md5`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1",
mysql_real_escape_string($db_table)
);
$result = mysql_query($sql, $db);
// check for error
if (!$result) {
die(mysql_error());
}
// drop existing records from table
$sql = sprintf(
"truncate table `%s`",
mysql_real_escape_string($db_table)
);
$result = mysql_query($sql, $db);
// check for error
if (!$result) {
die(mysql_error());
}
}
?>
I then ran the script on the command line. It took a while to go through all 25K+ images in my directory.
$ php scan.php
The next script I wrote will aid in the display of the images. I wrote this script because the absolute path of my images was outside my Apache vhost docroot. It checks for 2 $_GET variables: the md5 hash and a integer representing which duplicate image to show. The images is read and displayed, so this script can be inserted into the “scr” attribute of an img tag.
Script: view-image.php
<?php
//////////////////////////////////////////////////
// DATABASE
require_once('db.php');
// select database
$result = mysql_select_db($db_database, $db);
// check for error
if (!$result) {
die(mysql_error());
}
//////////////////////////////////////////////////
// PROCESS REQUEST
$md5 = $_GET['md5'];
$index = intval($_GET['index']);
// fetch images with md5 index
$sql = sprintf("
select *
from `%s`
where file_md5 = '%s'
order by fid asc
",
mysql_real_escape_string($db_table),
mysql_real_escape_string($md5)
);
$result = mysql_query($sql, $db);
// check for error
if (!$result) {
die(mysql_error());
}
// fetch results
$rows = array();
while ($row = mysql_fetch_object($result)) {
$rows[] = $row;
}
// get image data
$file_path = $rows[$index]->file_path;
$file_extension = $rows[$index]->file_extension;
header("Content-type: image/$file_extension");
readfile($file_path);
?>
The last script ties everything together. It determines which duplicates exist and allows you to view them. For my environment, I decided to store the list of MD5 duplicates in the $_SESSION, to prevent repeat SQL.
Script: view.php
<?php
//////////////////////////////////////////////////
// DATABASE SETUP
require_once('db.php');
// select database
$result = mysql_select_db($db_database, $db);
// check for error
if (!$result) {
die(mysql_error());
}
//////////////////////////////////////////////////
// FETCHING MD5S
// start session
session_start();
// check for session data
if (!is_array($_SESSION['md5s']) || empty($_SESSION['md5s'])) {
fetch_md5s();
}
// determine which md5 to show
$md5_index = intval($_GET['md5_index']);
// fetch images with md5 index
$sql = sprintf("
select *
from `%s`
where file_md5 = '%s'
order by fid asc
",
mysql_real_escape_string($db_table),
mysql_real_escape_string($_SESSION['md5s'][$md5_index])
);
$result = mysql_query($sql, $db);
// check for error
if (!$result) {
die(mysql_error());
}
// fetch results
$rows = array();
while ($row = mysql_fetch_object($result)) {
$rows[] = $row;
}
// create image output in a table. note the image scr is calling the view-image.php script with $_GET arguments.
$output = "";
$output .= "<table><tr>";
foreach ($rows as $index => $data) {
$output .= "<td style='width: " . (100/count($rows)) . "%'>";
$output .= "<img style='width: 100%' src='/view-image.php?md5=" . $data->file_md5 . "&index=" . $index . "' />";
$output .= $data->file_name . "<br/>";
$output .= $data->file_path . "<br/>";
$output .= "</td>";
}
$output .= "</tr></table>";
$output .= "<a href='/view.php?md5_index=" . ($md5_index+1) . "'>Next >></a>";
print $output;
//////////////////////////////////////////////////
// FUNCTIONS
function fetch_md5s() {
global $db;
global $db_table;
// get a list of md5 hashes with dupes
$sql = sprintf("
select file_md5
from `%s`
group by file_md5
having count(*) > 1
",
mysql_real_escape_string($db_table)
);
$result = mysql_query($sql, $db);
// check for error
if (!$result) {
die(mysql_error());
}
// fetch results
$md5s = array();
while ($row = mysql_fetch_object($result)) {
$md5s[] = $row->file_md5;
}
// store md5s in session
$_SESSION['md5s'] = $md5s;
}
?>
Now I went to my browser to execute the view.php script and view the results.