Article Analysis: Matching People's Names to Email Addresses

*Code examples are written in Ruby

The problem

Suppose you are scraping web articles to build a list of contacts for a PR company. Getting email addresses is as simple as a regular expression.

string = 'John Smith can be contacted at john.smith@gmail.com'

# scan returns every substring matching the pattern
emails = string.scan(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i)

However, an email address on its own is only so useful; it would be far more valuable if we could match a name to that email address. The problem statement is simple: match the email addresses in an article to the names in the article.

Finding names

So if you are thinking that finding names in text can't be that hard, let's take a quick stroll down that dark alley. Maybe a regular expression that matches two capitalized strings in a row could do the trick.

/([A-Z]+[a-zA-Z]*)\s+([A-Z]+[a-zA-Z]*)/

Well, that’s cool, but what if there is a middle name, or even worse, an abbreviated name? So now, we add an optional third string to the regular expression, and allow for abbreviations.

/([A-Z]+[a-zA-Z.]*)\s+([A-Z]+[a-zA-Z.]*)+(\s+([A-Z]+[a-zA-Z]*))?/

This is looking great, until you hit a name like Michael D'hunt or De'Angelo Munez. Okay, so now we allow some apostrophes.

/([A-Z']+[a-zA-Z'.]+)\s+([A-Z]+[a-zA-Z'.]*)+(\s+([A-Z']+[a-zA-Z']*))?/
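As a quick sanity check, the pattern does pick up those apostrophized names. Here it is with the capture groups made non-capturing, so that `String#scan` returns whole matches rather than arrays of groups.

```ruby
# Same pattern as above, with (?:...) groups so scan returns full matches
name_re = /(?:[A-Z']+[a-zA-Z'.]+)\s+(?:[A-Z]+[a-zA-Z'.]*)+(?:\s+(?:[A-Z']+[a-zA-Z']*))?/

found = "Michael D'hunt met De'Angelo Munez".scan(name_re)
# => ["Michael D'hunt", "De'Angelo Munez"]
```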

So right now, you run this against a list of 10,000 common names and a boatload of Lorem Ipsum, and you get some great accuracy. However, in the real world you get a sentence like this: "And Will Smith was not alone, he included his wife Jada on his trip to San Francisco, where they stayed at Hotel Palomar". Your regular expression just got roasted in so many ways.

Natural Language Processing (NLP)

To parse out things like names, places, and dates, and to go further into things like differentiating languages within text, we have to get a little more fancy. Natural language processing concerns itself with the study of linguistics, using a mixture of machine learning, statistics, and artificial intelligence to provide a meaningful analysis of human languages. For all intents and purposes, we can say that it breaks apart sentences so we can interpret them better using a computer.

This is the crucial link between knowing whether 'San Francisco' is a person's name or a physical place. The Stanford Natural Language Processing Group has a set of core open-source tools that can take care of some of this by implementing known language patterns, and using massive libraries of common naming schemes for people, places, and things.

Finding names, the right way

By implementing the Stanford CoreNLP toolset, we can essentially throw some text at it, and with a couple of filters we can have a list of the names contained within the text. So from the sentence above, we may get a result like this.

[  
  {
    :name => "Will",
    :start => 4,
    :end => 8
  },
  {
    :name => "Smith",
    :start => 9,
    :end => 14
  },
  {
    :name => "Jada",
    :start => 51,
    :end => 55
  }
]

The 'start' and 'end' numbers refer to the string position of the name itself, and with a little magic it is possible to concatenate names that appear next to each other, giving a final result as follows.

["Will Smith", "Jada"]
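One way to do that concatenation (a sketch, not the article's actual code) is to merge a name token into the previous one whenever the gap between the previous `:end` and the next `:start` is at most one character.

```ruby
names = [
  { :name => "Will",  :start => 4,  :end => 8  },
  { :name => "Smith", :start => 9,  :end => 14 },
  { :name => "Jada",  :start => 51, :end => 55 }
]

# Merge tokens that sit directly next to each other in the source text
merged = names.each_with_object([]) do |n, acc|
  prev = acc.last
  if prev && n[:start] - prev[:end] <= 1
    prev[:name] += " #{n[:name]}"   # extend the previous name
    prev[:end] = n[:end]
  else
    acc << n.dup                    # start a new name
  end
end

full_names = merged.map { |x| x[:name] }
# => ["Will Smith", "Jada"]
```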

Names to emails

This matching step is tricky because of the variety of formats an article or piece of text might come in. In a perfect world, a person's name would appear right next to their email address.

'John Smith can be contacted at john.smith@gmail.com, and Jill Ruth can be contacted at jill.ruth@gmail.com.'

If we know the position of the name, and the position of the email address, this is no problem: we just write a routine to find the closest email to the person's name. However, this breaks down pretty quickly.

'John Smith and Jill Ruth can be contacted respectively at john.smith@gmail.com, and jill.ruth@gmail.com.'
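A minimal version of that position-based routine makes the failure concrete. This is a sketch: it pairs each name with whichever email sits closest in the string, which mispairs Jill Ruth in the sentence above.

```ruby
# Reuses the email pattern from earlier
EMAIL_RE = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

def nearest_email(text, names)
  # collect each email address along with its character position
  emails = []
  pos = 0
  while (m = text.match(EMAIL_RE, pos))
    emails << [m[0], m.begin(0)]
    pos = m.end(0)
  end

  # pair each name with the email closest to it in the string
  names.map do |name|
    name_pos = text.index(name)
    email, _ = emails.min_by { |_, epos| (epos - name_pos).abs }
    [name, email]
  end.to_h
end

text = 'John Smith and Jill Ruth can be contacted respectively at ' \
       'john.smith@gmail.com, and jill.ruth@gmail.com.'
pairs = nearest_email(text, ['John Smith', 'Jill Ruth'])
# Both names sit closer to john.smith@gmail.com, so Jill Ruth is mispaired.
```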

Or even worse…

'Article written by Edward Jones

... article body ...

Contact the writer at ejones@gmail.com'

The article body itself could also contain important names and email addresses as well, scattered as they please, so this calls for some more advanced parsing techniques.

The Levenshtein distance

My take on this problem is that we can throw name-position vs. email-position out the window; it is not reliable. The one thing we can rely on is that, in most professional situations, a person’s email has some reference to their name. 

This is where the Levenshtein distance algorithm comes in: it calculates the number of edits needed to transform one string into another. In our case, we are comparing a person's name to their email address. It quickly becomes apparent that the email's domain and any digits can be removed, and that normalizing the case helps before making comparisons. Let's look at some results for John Smith (or more specifically, in lower case, "john smith").

john.smith@gmail.com -> john.smith : 1 edit
j.smith123@gmail.com -> j.smith : 4 edits
john.walker.smith@gmail.com -> john.walker.smith : 8 edits
jwsmith@gmail.com -> jwsmith : 4 edits
jws3319@gmail.com -> jws : 8 edits
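The edit counts above can be reproduced with the classic dynamic-programming implementation of the distance. This is a sketch; a real project would likely lean on an existing gem instead.

```ruby
# Levenshtein distance via a single rolling row of the DP table.
# dp[j] holds the distance between the first i chars of a and
# the first j chars of b.
def levenshtein(a, b)
  dp = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    prev = dp[0]        # value diagonally up-left
    dp[0] = i
    b.each_char.with_index(1) do |cb, j|
      cur = dp[j]
      dp[j] = [
        dp[j] + 1,                       # deletion
        dp[j - 1] + 1,                   # insertion
        prev + (ca == cb ? 0 : 1)        # substitution (or match)
      ].min
      prev = cur
    end
  end
  dp.last
end

levenshtein("john smith", "john.smith")  # => 1
levenshtein("john smith", "j.smith")     # => 4
```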

So that is pretty neat, and the next step is clear: we need to test a bunch of common email address patterns against the person's name, and use the best score. So instead of making comparisons with just "john smith", we can abstract the name into some common formats.

person = "John Smith"
email = "jsmith44@gmail.com"

# strip the domain and everything besides letters from the email
# (Levenshtein here is a small wrapper; #match returns the edit
# distance between its argument and the string it was built with)
m = Levenshtein.new(email.split('@').first.downcase.scan(/[a-z]/).join(''))

# run a standard set of patterns against the person's name
tests = []
tests << m.match(person.downcase.scan(/[a-z]/).join(''))  # full name, letters only
tests << m.match("#{person.split(' ').first.downcase[0]}#{person.split(' ').last.downcase[0]}")  # initials
tests << m.match(person.split(' ').first.downcase)        # first name only
tests << m.match(person.split(' ').last.downcase)         # last name only
...

best_result = tests.min

If this is run for every person and every email address found in an article, it provides the best score for each person-email pairing.

Scores into results

With any type of artificial intelligence, there is rarely a concept of "passing a test"; there are only various levels of failure. The goal is simply to minimize failure in the best way possible, and developing with any other intention can be a destructive process. Using our scores from the previous step, we attempt to award each email we found to the person most deserving. Consider the following sample set.

people = ["Matt Gaidica", "Brad Birdsall", "John Smith", "Grant Olidapo", "Minh Nguyen"]
emails = ["mattyg@gmail.com", "bradbirdman17@gmail.com", "grant.olidapo@gmail.com", "mn1@gmail.com"]

The scores we produced account for every name vs. every email, or 20 (5x4) unique values. We look to some sort of complexity-reduction algorithm to reduce this set of 20 data points to only 4 that directly relate names to emails, leaving one of our people email-less. After about 20 lines of magic, our algorithm spits out the results.

{
  "Matt Gaidica" => "mattyg@gmail.com",
  "Brad Birdsall" => "bradbirdman17@gmail.com",
  "Grant Olidapo" => "grant.olidapo@gmail.com",
  "Minh Nguyen" => "mn1@gmail.com"
}
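Those "20 lines of magic" aren't shown here, but one simple reduction with the same shape is a greedy assignment: sort every (person, email) score ascending, then repeatedly take the best remaining pair whose person and email are both still unclaimed. The scores below are illustrative stand-ins for the Levenshtein results of the previous step.

```ruby
# Illustrative scores: lower means a closer name-to-email match
scores = {
  ["Matt Gaidica", "mattyg@gmail.com"]         => 4,
  ["Matt Gaidica", "mn1@gmail.com"]            => 10,
  ["Brad Birdsall", "bradbirdman17@gmail.com"] => 3,
  ["Minh Nguyen", "mn1@gmail.com"]             => 2,
  ["Minh Nguyen", "mattyg@gmail.com"]          => 8
}

result = {}
claimed = {}
# walk the pairs from best score to worst, claiming greedily
scores.sort_by { |_, s| s }.each do |(person, email), _|
  next if result.key?(person) || claimed[email]
  result[person] = email
  claimed[email] = true
end
```

Each person ends up paired with their best unclaimed email; anyone left over simply goes email-less.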

Tip of the iceberg

I look at this as just one of the ways to accomplish this goal. This process can be heavily supplemented with machine learning techniques to produce better name recognition, and further develop common email address patterns for your specific type of article, document, or data set.

I have open-sourced a library on GitHub called Textract, which includes the code for this entire process. My goal is to keep the problems simple, and the solutions simpler.