
Simple Status Ticker for API Endpoints

When you are deploying code left and right, even in a test-driven development cycle, sometimes you still want the peace of mind that your website is responding. In my case, this is specific to API endpoints and the different API environments we run our products on.

Status Ticker

This script allows you to leave a mini terminal window on your screen that refreshes the status of a website at an interval of your choice. I use two gems: the first for making the HTTP calls, and the second for making the text colors pretty. Install them like so:

> gem install httparty 
> gem install terminal-display-colors 

Below is the code, which can also be found on this Github gist.
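In sketch form (the endpoint URL and interval are placeholders, and I am assuming terminal-display-colors patches String with color methods like `.green` and `.red`):

require 'httparty'
require 'terminal-display-colors'

URL      = 'https://api.example.com/status' # hypothetical endpoint
INTERVAL = 10                               # seconds between checks

loop do
  stamp = Time.now.strftime('%H:%M:%S')
  begin
    code = HTTParty.get(URL).code
    if code == 200
      puts "[#{stamp}] #{URL} -> #{code} OK".green
    else
      puts "[#{stamp}] #{URL} -> #{code}".red
    end
  rescue StandardError => e
    # network errors should not kill the ticker
    puts "[#{stamp}] #{URL} -> #{e.class}".red
  end
  sleep INTERVAL
end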

To use the ticker, put the code into a file named `status.rb` and run the following in a terminal from the directory where the file is located:

> ruby status.rb 

To exit the ticker, use control+c on your keyboard.

Forcing SSL in a Sinatra App

When deploying on Heroku, you can piggyback on their SSL certificate, which gives you a secure connection right away without any SSL configuration of your own. I think this is a great solution for a lot of people, until you need a really pretty URL. Because this is possible, you should use it, and if you’re building an API you should also think about forcing different environments to require SSL. Here is a simple implementation in my `app.rb` file:
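In sketch form (assuming Heroku's router sets the X-Forwarded-Proto header; the whitelist contents are illustrative):

require 'sinatra'

# extensions that should not fail over plain http, e.g. .ics downloads
SSL_WHITELIST = ['.ics']

before do
  # skip enforcement for whitelisted file extensions
  next if SSL_WHITELIST.include?(File.extname(request.path_info))
  # Heroku terminates SSL at the router and passes this header along
  unless request.env['HTTP_X_FORWARDED_PROTO'] == 'https'
    redirect "https://#{request.host}#{request.fullpath}", 301
  end
end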

In the full app there is app-specific code around this as well, but the core of it is the before filter. Notice that because downloading `.ics` files shouldn’t require SSL (in other words, it shouldn’t fail if the user uses `http`), that extension is included in a whitelist array.

Multi-environment post_build_hook using tddium, Heroku, and Ruby Sinatra

We have been using tddium as a deployment tool for our Ruby Sinatra API for some time now. It has been working great and now manages deployment for 7 of our API environments. We have run into a few small issues, but there are some sharp engineers on support, so issues get fixed fast. The most recent development in our system is a post-build hook that runs migrations after the app is deployed on Heroku.

One of the straightforward adjustments I had to make to their gist was to dynamically set the app name based on the branch that was pushed to Github. In our case, each Heroku app (“environment”) has a git branch, and this is also how tddium sets up test suites, so everything jibes.

The portion that required some support was apparently due to dependency issues between my gems and the way the git repo was being pushed from tddium. The first step was to add the heroku gem to my Gemfile, and from there, to modify the tddium post_build_hook a little bit.
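Here is the shape of the full version (a sketch: the TDDIUM_CURRENT_BRANCH variable and the app-naming convention are assumptions on my part):

branch = ENV['TDDIUM_CURRENT_BRANCH'] || `git symbolic-ref --short HEAD`.strip
app = "my-api-#{branch}" # e.g. the "staging" branch deploys to the "my-api-staging" app

def run(cmd)
  puts cmd
  system(cmd) or abort "command failed: #{cmd}"
end

# push the tested commit to the matching Heroku app, then migrate and restart
run "git push git@heroku.com:#{app}.git HEAD:master"
run "heroku run rake db:migrate --app #{app}"
run "heroku restart --app #{app}"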

Lazy Levenshtein: Using Abbreviations and Spellchecked Inputs in Ruby

I have been spending a lot of time writing Ruby programs that take in data through the terminal. One of the problems is that misspelling something can cause the program to crash, and I want to be as quick as possible when doing data entry.

One of my programs asks which server environment I would like to use before I start messing with any data (development, integration, staging, production). It would be great if all of the following abbreviations or misspellings would choose the development environment, and keep the program rolling:

  • dev

  • development

  • devel

  • deevleopmnt

You get the idea: abbreviations and spellchecking from known inputs. To accomplish this I leverage the Levenshtein distance algorithm, more commonly known as “edit distance”. This algorithm compares two strings and returns an integer equal to the number of edits needed to transform the first string into the second.

Here is the Github Gist for Lazy Levenshtein, with the sample code below so we can dig through it.
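In sketch form (the method and parameter names are illustrative; the scoring uses the Amatch gem):

require 'amatch'

def lazy_match(input, matches = [], abbreviations = true)
  # with nothing to match against, fall back to the raw input
  return input if matches.empty?

  best = nil
  best_score = nil
  # reversing the array means earlier items win ties, since equal scores overwrite
  matches.reverse.each do |candidate|
    target = candidate.downcase
    # for abbreviations, score against the candidate's prefix of equal length,
    # so "dev" is 0 edits away from "development"
    target = target[0, input.length] if abbreviations && input.length < target.length
    score = Amatch::Levenshtein.new(input.downcase).match(target)
    if best_score.nil? || score <= best_score
      best_score = score
      best = candidate
    end
  end
  best
end

lazy_match('deevleopmnt', %w[development integration staging production])
# => "development"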

The three parameters are the input itself, an array of possible matches, and a boolean that tells the method whether or not you want to match abbreviations. The method sets up a Levenshtein comparison for each potential match (using the Ruby Amatch library), and scores the comparison. We are playing golf here, because the lowest score wins the game!

The method also reverses the array in the main loop, which gives priority to the first items in the array if there happens to be a tie between matches. Unlike a typical “spellcheck”, this method will never return “not found”; it will always return a match, and if the “matches” array is empty, it simply returns the provided input.

This has made my data entry much faster with smarter defaults, and given me the peace of mind that my misspellings will always turn into known, safe values.

Article Analysis: Matching People's Names to Email Addresses

*code examples are in Ruby

The problem

Suppose you are scraping web articles to build a list of contacts for a PR company. Getting email addresses is as simple as a regular expression.

string = 'John Smith can be contacted at john.smith@gmail.com'
# String#scan already returns an array of every match
emails = string.scan(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i)

However, an email address is only so powerful; it would be really great if we could match a name to that email address. The problem statement is pretty simple: match the email addresses in the article to the names in the article.

Finding names

So if you are thinking that finding names in text can’t be that hard, let’s take a stroll down that dark alley really quick. Maybe a regular expression that matches two capitalized strings in a row could do the trick.

/([A-Z]+[a-zA-Z]*)\s+([A-Z]+[a-zA-Z]*)/

Well, that’s cool, but what if there is a middle name, or even worse, an abbreviated name? So now, we add an optional third string to the regular expression, and allow for abbreviations.

/([A-Z]+[a-zA-Z.]*)\s+([A-Z]+[a-zA-Z.]*)+(\s+([A-Z]+[a-zA-Z]*))?/

This is looking great, and then you hit a name like Michael D'hunt, or De'Angelo Munez. Okay, so now we allow some apostrophes.

/([A-Z']+[a-zA-Z'.]+)\s+([A-Z]+[a-zA-Z'.]*)+(\s+([A-Z']+[a-zA-Z']*))?/

So right now, you run this against a list of 10,000 common names and a boatload of Lorem Ipsum, and you have some great accuracy. However, in the real world you get a sentence like this: “And Will Smith was not alone, he included his wife Jada on his trip to San Francisco, where they stayed at Hotel Palomar”. Your regular expression just got roasted in so many ways.

Natural Language Processing (NLP)

To parse out things like names, places, and dates, and to go as far as differentiating languages within text, we have to get a little more fancy. Natural language processing concerns itself with the study of linguistics, using a mixture of machine learning, statistics, and artificial intelligence to provide a meaningful analysis of human languages. For all intents and purposes, we can say that it breaks apart sentences so we can interpret them better using a computer.

This is the crucial link for knowing whether ‘San Francisco’ is a person’s name or a physical place. The Stanford Natural Language Processing Group has a set of core open-source tools that can take care of some of this by implementing known language patterns, and using massive libraries of common naming schemes for people, places, and things.

Finding names, the right way

By implementing the Stanford CoreNLP Toolset, we can essentially throw some text at it, and with a couple filters we can have a list of names contained within the text. So from the sentence above, we may get a result like this.  

[  
  {
    :name => "Will",
    :start => 4,
    :end => 8
  },
  {
    :name => "Smith",
    :start => 9,
    :end => 14
  },
  {
    :name => "Jada",
    :start => 51,
    :end => 55
  }
]

The “start” and “end” numbers refer to the string position of the name itself, and with a little magic it is possible to concatenate names that appear together, giving a final result as follows.

["Will Smith", "Jada"]

Names to emails

This matching problem is tricky considering the formats an article or piece of text might come in. In a perfect world, a person’s name would appear right next to their email address.

'John Smith can be contacted at john.smith@gmail.com, and Jill Ruth can be contacted at jill.ruth@gmail.com.'

If we know the position of the name and the position of the email address, this is no problem: we just write a routine to find the closest email to the person’s name. However, this breaks down pretty quickly.

'John Smith and Jill Ruth can be contacted respectively at john.smith@gmail.com, and jill.ruth@gmail.com.'

Or even worse…

'Article written by Edward Jones

... article body ...

Contact the writer at ejones@gmail.com'

The article body itself could also contain important names and email addresses as well, scattered as they please, so this calls for some more advanced parsing techniques.

The Levenshtein distance

My take on this problem is that we can throw name-position vs. email-position out the window; it is not reliable. The one thing we can rely on is that, in most professional situations, a person’s email has some reference to their name. 

This is where the Levenshtein distance algorithm comes in: it calculates the number of edits needed to transform one string into another. In our case, we are comparing a person’s name to their email. It quickly becomes apparent that the email extension and any numbers can be removed from the email address, and that normalizing the case helps before making the comparisons. Let’s look at some results for John Smith (or more specifically, in lower case, “john smith”).

john.smith@gmail.com -> john.smith : 1 edit
j.smith123@gmail.com -> j.smith : 4 edits
john.walker.smith@gmail.com -> john.walker.smith : 8 edits
jwsmith@gmail.com -> jwsmith : 4 edits
jws3319@gmail.com -> jws : 8 edits

So that is pretty neat, and the next step is pretty clear: we need to test a bunch of common email address patterns against the person’s name, and use the best score. So instead of making comparisons with just “john smith”, we can abstract the name into some common formats.

person = "John Smith"
email = "jsmith44@gmail.com"

# remove the email extension and everything besides characters
m = Levenshtein.new(email.split('@').first.downcase.scan(/[a-z]/).join(''))

# run a standard set of tests against the persons name
tests = []
tests << m.match(person.downcase.scan(/[a-z]/).join('')) 
tests << m.match("#{person.split(' ').first.downcase[0]}#{person.split(' ').last.downcase[0]}")
tests << m.match(person.split(' ').first.downcase)
tests << m.match(person.split(' ').last.downcase)
...

best_result = tests.min

If this is run for every person and every email address found in an article, it will provide the best score for each person vs. email address.

Scores into results

With any type of artificial intelligence, there is rarely a concept of “passing a test”; there are only various levels of failure. The goal is simply to minimize failure in the best way possible, and developing with any other intention can be a destructive process. Using our scores from the previous step, we attempt to award all the emails we found to the person most deserving. Consider the following sample set.

people = ["Matt Gaidica", "Brad Birdsall", "John Smith", "Grant Olidapo", "Minh Nguyen"]
emails = ["mattyg@gmail.com", "bradbirdman17@gmail.com", "grant.olidapo@gmail.com", "mn1@gmail.com"]

The scores we produced account for every name vs. every email: 20 (5x4) unique values. We look to some sort of complexity-reduction algorithm to reduce this set of 20 data points to only 4 that directly relate names to emails, leaving one of our people email-less. After about 20 lines of magic, our algorithm spits out the results.

{
  "Matt Gaidica" => "mattg@gmail.com",
  "Brad Birdsall" => "bradbirdman17@gmail.com",
  "Grant Olidapo" => "grant.olidapo@gmail.com",
  "Minh Nguyen" => "mn1@gmail.com"
}
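One simple way to do that reduction (a sketch, not the exact 20 lines) is a greedy assignment: score every pair using the tests from the previous section, then award the best-scoring pairs first until the emails run out.

require 'amatch'

# strip the extension and everything besides letters, as before
def normalize(email)
  email.split('@').first.downcase.scan(/[a-z]/).join
end

# best (lowest) score across a few common name formats
def best_score(person, email)
  m = Amatch::Levenshtein.new(normalize(email))
  [
    m.match(person.downcase.scan(/[a-z]/).join),
    m.match(person.split.map { |w| w[0].downcase }.join),
    m.match(person.split.first.downcase),
    m.match(person.split.last.downcase)
  ].min
end

def match_people_to_emails(people, emails)
  # one score per person/email pair (5x4 = 20 here)
  pairs = people.product(emails).map do |person, email|
    [best_score(person, email), person, email]
  end
  results = {}
  # award the best-scoring pairs first; each person and each email is used once
  pairs.sort_by(&:first).each do |_score, person, email|
    next if results.key?(person) || results.value?(email)
    results[person] = email
  end
  results
end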

Tip of the iceberg

I look at this as just one of the ways to accomplish this goal. This process can be heavily supplemented with machine learning techniques to produce better name recognition, and further develop common email address patterns for your specific type of article, document, or data set.

I have published a library on Github called Textract, which includes the code for this entire process. My goal is to keep the problems simple, and the solutions simpler.

A Better time_ago_in_words for Ruby on Rails

def time_ago(time, append = ' ago')
  time_ago_in_words(time).gsub(/about|less than|almost|over/, '').strip << append
end

I love the time_ago_in_words method that comes bundled with Rails, but it often produces a pretty lengthy string that doesn’t fit in narrow columns or mobile views. This is a simple fix I use, placed in “helpers/application_helper.rb”.

time_ago(Time.now)
#a minute ago

time_ago(5.minutes.from_now)
#5 minutes ago

time_ago(Time.now - 20.years, '!')
#20 years!

Making files public in Amazon S3 and the owner problem

The most common way to make files publicly accessible for an Amazon S3 bucket is to add a bucket policy (Bucket Properties -> Add Bucket Policy).

{
  "Version": "2008-10-17",
  "Statement": [{
    "Sid": "AllowPublicRead",
    "Effect": "Allow",
    "Principal": {
      "AWS": "*"
    },
    "Action": ["s3:GetObject"],
    "Resource": ["arn:aws:s3:::bucket/*"]
  }]
}

As you would expect, you replace “bucket” in the resource line with the name of your own bucket. The syntax is actually quite familiar, using an asterisk as a wildcard for all files inside the specified bucket.

This still leaves some people in trouble, though, because the bucket policy only applies to files that are owned by the bucket’s administrator. If you have an external application uploading files to your bucket, the policy does not apply, and you could be left with private, inaccessible files. There is a Stack Overflow post that explains this in a bit more depth.

Since my application is already using the Ruby library for Amazon S3, the easiest solution was to change the policy on the file itself. The solution is not very clear or elegant in the library’s documentation, so here is the best way.

#get the amazon object
amazon_object = AWS::S3::S3Object.find('golden_gate_bridge.png', 'my_photo_bucket')
#fetch the object's current access policy
policy = amazon_object.acl
#add a public_read grant to the policy
policy.grants << AWS::S3::ACL::Grant.grant(:public_read)
#write the modified policy back to the object
amazon_object.acl(policy)

The last way to get public access to a private file is to create a public URL for the object. The URL defaults to being accessible for only 5 minutes, but it can be set to a time in the future that is likely beyond the needs of your application. The documentation outlines a “doomsday” example.

doomsday = Time.mktime(2038, 1, 18).to_i
url = amazon_object.url(:expires => doomsday)

Using this method exposes three URL parameters in the public URL, “AWSAccessKeyId”, “Expires”, and “Signature”, which junks it up a bit. Check up on the docs for more, and if you are just getting started with Ruby and Amazon S3, one of my previous posts might be of some help.

Comparing two files via MD5 hash on Amazon S3 using Ruby

This technique is helpful if you are using Amazon S3 as a file repository and want to detect duplicate files as they are uploaded to your application. Amazon S3 gives each file an ETag property, which is an MD5 hash of the file, although in some cases (multipart uploads and files over 5GB, it seems) this is not true. Let’s get started with a new directory, a file, and the Amazon S3 gem.

> mkdir amazon-compare && cd amazon-compare
> touch compare.rb
> sudo gem i aws-s3

The gem connects to Amazon’s S3 REST API and comes with great documentation. Make sure you have set up an S3 bucket and have access to your API credentials. Open “compare.rb” and use the following code.

require 'digest/md5'
require 'aws/s3'

#set your AWS credentials
AWS::S3::Base.establish_connection!(
  :access_key_id     => 'XXX',
  :secret_access_key => 'XXX'
)

#get the S3 file (object)
object = AWS::S3::S3Object.find('02185773dcb5a468df6b.pdf', 'your_bucket')
#separate the etag object, and remove the extra quotations
etag = object.about['etag'].gsub('"', '')

#get the local file
f = '/Users/matt/Desktop/02185773dcb5a468df6b.pdf'
digest = Digest::MD5.hexdigest(File.read(f))

#let's see them both
puts digest + ' vs ' + etag

#a string comparison to finish it off
if digest.eql? etag
  puts 'same file!'
else
  puts 'different files.'
end

As you can see, we are just doing a simple comparison of two MD5 hashes. You can run the program using the ruby command.

> ruby compare.rb

View the Github Gist

Creating a hash (checksum) for an external file in Ruby

The Problem

We have a file on a server and want to create a hash (or checksum) of it, so we can compare it to the hashes of other files down the road to see if the files are the same.

The Solution

Ruby has a class called “Tempfile”, which allows you to read a file into a temporary location that is automatically assigned a unique name, can be used for normal file operations, and is exposed to Ruby’s native garbage collection. Since we are only concerned with storing the hash, we write the file using the net/http library, and unlink (delete) the file when we are done. By including the digest library we are able to use an MD5 hash algorithm to produce a hash from the file, which is read in as a string. The final hash is stored, and would likely be put into a database, referencing or belonging to the external file.
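A minimal sketch of that flow (the URL is a placeholder):

require 'net/http'
require 'tempfile'
require 'digest/md5'

url = URI('https://example.com/files/report.pdf')

tempfile = Tempfile.new('external_file')
begin
  # write the remote file into the temporary location
  tempfile.binmode
  tempfile.write(Net::HTTP.get(url))
  tempfile.rewind

  # read the file back as a string and produce an MD5 hash of it
  hash = Digest::MD5.hexdigest(tempfile.read)
  puts hash # this is what you would store in the database
ensure
  # unlink (delete) the tempfile when we are done
  tempfile.close
  tempfile.unlink
end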

Why use a hash?

A hashing algorithm is a lossy type of data compression (yes, it loses data), but it is a formidable way to give a file a fingerprint. There does exist the possibility of two files generating the same hash, but the likelihood is astronomically small. Hashes and checksums are commonly used to check the integrity of files you download, ensuring that the file a website intended to serve is the file that ended up on your computer.