Comparing two files via MD5 hash on Amazon S3 using Ruby

This technique is helpful if you are using Amazon S3 as a file repository and want to detect duplicate files as they are uploaded to your application. Amazon S3 gives each file an ETag property, which is an MD5 hash of the file, although, in some cases this is not true (multipart and >5GB, so it seems). Let’s get started with a new directory, a file, and the Amazon S3 gem.

> mkdir amazon-compare && cd amazon-compare
> touch compare.rb
> sudo gem i aws-s3

The gem you will be using is straight from Amazon and connects to their S3 REST API- it comes with great documentation. Make sure you have setup an S3 bucket and have access to your API credentials. Open “compare.rb” and use the following code.

require 'digest/md5'
require 'aws/s3'

#set your AWS credentials
  :access_key_id     => 'XXX',
  :secret_access_key => 'XXX'

#get the S3 file (object)
object = AWS::S3::S3Object.find('02185773dcb5a468df6b.pdf', 'your_bucket')
#separate the etag object, and remove the extra quotations
etag = object.about['etag'].gsub('"', '')

#get the local file
f = '/Users/matt/Desktop/02185773dcb5a468df6b.pdf'
digest = Digest::MD5.hexdigest(File.read(f))

#lets see them both
puts digest + ' vs ' + etag

#a string comparison to finish it off
if digest.eql? etag
  puts 'same file!'
  puts 'different files.'

As you can see, we are just doing a simple comparison of two MD5 hashes, you can run the program using the ruby command.

> ruby compare.rb

View the Github Gist

Creating a hash (checksum) for an external file in Ruby

The Problem

We have a file on a server and want to create a hash (or checksum) of it, so we can compare it to the hash of other files down to road, to see if the files are the same.

The Solution

Ruby has a class called “Tempfile”, which allows you to read a file into a temporary location that already is assigned a unique name, can be accessed for normal file operations, and is exposed to Ruby’s native garbage collection. Since we are only concerned about storing the hash, we will write the file using the net/http library, and unlink (delete) the file when we are done. By including the digest library we are able to use an MD5 hash algorithm to produce a hash from the file, which is read in as a string. The final hash is stored, and would likely be put into a database, referencing or belonging to the external file.

Why use a hash?

A hashing algorithm is a lossy type of data compression (yes, it losses data), but is a formidable way to give a file a fingerprint. There indeed exists the possibility of two files generating the same hash, however the likelihood is astronomical. Hashes and checksums are commonly used to check the integrity of files you download, ensuring that the file a website intended to serve is the file you put on your computer.