Comparing two files via MD5 hash on Amazon S3 using Ruby

By Matt Gaidica, June 7, 2012, in Uncategorized

This technique is helpful if you are using Amazon S3 as a file repository and want to detect duplicate files as they are uploaded to your application. Amazon S3 gives each object an ETag property, which is usually an MD5 hash of the file's contents. This does not hold in every case: objects uploaded via multipart upload (which includes anything over 5 GB) get an ETag that is not a plain MD5 of the whole file. Let’s get started with a new directory, a file, and the Amazon S3 gem.
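Because the ETag is only a plain MD5 in the simple upload case, it can help to check its shape before trusting a comparison. Here is a minimal sketch, assuming multipart ETags carry a "-<part count>" suffix as S3 commonly returns them. The helper name plain_md5_etag? is hypothetical and not part of any gem.

```ruby
# Hypothetical helper: returns true when an ETag looks like a bare MD5 digest.
def plain_md5_etag?(etag)
  # S3 wraps the ETag in double quotes, so strip them first.
  cleaned = etag.to_s.delete('"')
  # A bare MD5 is exactly 32 hex characters. Multipart ETags usually look
  # like "<32 hex chars>-<number of parts>" and will not match.
  !!(cleaned =~ /\A[0-9a-f]{32}\z/i)
end

puts plain_md5_etag?('"d41d8cd98f00b204e9800998ecf8427e"')   # true
puts plain_md5_etag?('"d41d8cd98f00b204e9800998ecf8427e-3"') # false
```

If the check returns false, an MD5 mismatch says nothing about whether the files differ, so you would need another way to compare them.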

> mkdir amazon-compare && cd amazon-compare
> touch compare.rb
> sudo gem install aws-s3

The aws-s3 gem is a community-written Ruby client for Amazon's S3 REST API (it is not published by Amazon itself), and it comes with great documentation. Make sure you have set up an S3 bucket and have access to your API credentials. Open “compare.rb” and use the following code.

require 'digest/md5'
require 'aws/s3'

#set your AWS credentials
AWS::S3::Base.establish_connection!(
  :access_key_id     => 'XXX',
  :secret_access_key => 'XXX'
)

#get the S3 file (object)
object = AWS::S3::S3Object.find('02185773dcb5a468df6b.pdf', 'your_bucket')
#separate the etag object, and remove the extra quotations
etag = object.about['etag'].gsub('"', '')

#get the local file and hash it from disk (avoids loading the whole file into a string)
f = '/Users/matt/Desktop/02185773dcb5a468df6b.pdf'
digest = Digest::MD5.file(f).hexdigest

#let's see them both
puts digest + ' vs ' + etag

#a string comparison to finish it off
if digest == etag
  puts 'same file!'
else
  puts 'different files.'
end

As you can see, we are just doing a simple comparison of two MD5 hashes. You can run the program with the ruby command: ruby compare.rb.