This technique is helpful if you are using Amazon S3 as a file repository and want to detect duplicate files as they are uploaded to your application. Amazon S3 gives each file an ETag property, which is normally an MD5 hash of the file's contents; the exceptions are multipart uploads and files larger than 5 GB, where the ETag is computed differently. Let's get started with a new directory, a file, and the Amazon S3 gem.
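For the multipart case, S3 documents the ETag as the MD5 of the concatenated binary MD5 digests of each part, followed by a dash and the part count. If you know the part size used for the upload, you can reproduce it locally. A minimal sketch, assuming a hypothetical `multipart_etag` helper and that every part except the last used the same part size:

```ruby
require 'digest/md5'

# Sketch: reproduce the ETag S3 assigns to a multipart upload.
# Assumes all parts except the last were uploaded at `part_size` bytes.
def multipart_etag(path, part_size)
  part_digests = []
  File.open(path, 'rb') do |f|
    # hash each part individually, collecting the raw binary digests
    while (chunk = f.read(part_size))
      part_digests << Digest::MD5.digest(chunk)
    end
  end
  # MD5 of the concatenated part digests, plus "-<part count>"
  "#{Digest::MD5.hexdigest(part_digests.join)}-#{part_digests.size}"
end
```

This only matches when your guess at the part size is right, so it is a best-effort check rather than a guarantee.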
> mkdir amazon-compare && cd amazon-compare
> touch compare.rb
> sudo gem i aws-s3
The gem you will be using comes straight from Amazon, connects to the S3 REST API, and ships with great documentation. Make sure you have set up an S3 bucket and have access to your API credentials. Open “compare.rb” and add the following code.
require 'digest/md5'
require 'aws/s3'

# set your AWS credentials
AWS::S3::Base.establish_connection!(
  :access_key_id     => 'XXX',
  :secret_access_key => 'XXX'
)

# get the S3 file (object)
object = AWS::S3::S3Object.find('02185773dcb5a468df6b.pdf', 'your_bucket')

# extract the ETag and strip the surrounding quotation marks
etag = object.about['etag'].gsub('"', '')

# hash the local file
f = '/Users/matt/Desktop/02185773dcb5a468df6b.pdf'
digest = Digest::MD5.hexdigest(File.read(f))

# let's see them both
puts digest + ' vs ' + etag

# a string comparison to finish it off
if digest.eql? etag
  puts 'same file!'
else
  puts 'different files.'
end
As you can see, we are just doing a simple comparison of two MD5 hashes. You can run the program using the ruby command.
> ruby compare.rb
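One caveat with the script above: `File.read` loads the entire file into memory before hashing it. For large files, `Digest::MD5.file` from Ruby's standard library streams the file in chunks and produces the same hex digest. A small sketch, with `file_md5` being a hypothetical helper name:

```ruby
require 'digest/md5'

# Hash a file without loading it all into memory.
# Digest::MD5.file reads the file in fixed-size chunks internally,
# so this works for files of any size; the digest is identical to
# Digest::MD5.hexdigest(File.read(path)).
def file_md5(path)
  Digest::MD5.file(path).hexdigest
end
```

You could swap this in for the `File.read` line in compare.rb without changing the comparison logic.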