Surgical Tool for Cranial Drill Alignment in Rodents

Patent Pending #62369132

This tool enables the surgeon to drill a hole that travels longitudinally along the lateral bone structures, thereby increasing the surface area available for a bone screw to purchase on the skull. Often, these bone screws are used along with some type of epoxy or cement to attach larger implantable devices. This method of bone screw placement improves on many other methods that only drill into the cranial plates, which are very thin and preclude proper threading and purchase of a screw. Developing the surgical skills to properly traverse the cranial ridge may take a surgeon a very long time, and failures in placement can result in brain damage or improper purchase of the bone screws, leading to eventual failure of the implant and often to the premature sacrifice of the animal. Therefore, proper bone screw placement is both a technical challenge and a concern for animal well-being.

Other methods for placing bone screws along the cranial ridge involve stereotactic tools that are cumbersome and time consuming to employ. Minimizing surgical complexity and the time an animal spends anesthetized are crucial factors that contribute to operational success and animal recovery. This surgical tool has been designed to align quickly to anatomical features of the rodent cranium and to guide a drill and bit at the proper angle into the cranial ridge bone. The dimensions of the tool are such that a surgeon may place multiple screws on each side of the skull, following the cranial ridge (circled) from the anterior/frontal aspect (F) to the posterior/parietal aspect (P). The cranial ridge described is conserved in several rodent species and is thick enough, in the described areas, to drill holes that neither enter the brain cavity on the medial aspect nor protrude out of the skull on the lateral aspect. The “angle of attack” for the drill bit has been determined through consultation of anatomical drawings as well as trials in the operating room with anesthetized animals.

Finding Nearest Value in MATLAB Using min()

I'm often converting between samples or video frames and time. For example, I'll have some video of animal behavior and an electrophysiological recording that each run on their own clocks (e.g. 30 frames-per-second for video and 30,000 samples-per-second for ephys). Therefore, it makes a lot of sense to use some absolute unit (like seconds) in the software so frame or sampling rates can disappear. This does create some syncing issues when, say, you want to find the video frame that occurs at t = 2.3539 seconds based on ephys events, but your frame rate only resolves frames at t = 2.3333 and t = 2.3666. Given an array of values and a target value, the closest function used below returns the index of the nearest element within that array, along with the value at that index.

Let's use my example and create an array of thirty values between 2 and 3:

frameTimes = [2, 2.0345, 2.069, 2.1034, 2.1379, 2.1724, 2.2069, 2.2414, 2.2759, 2.3103, 2.3448, 2.3793, 2.4138, 2.4483, 2.4828, 2.5172, 2.5517, 2.5862, 2.6207, 2.6552, 2.6897, 2.7241, 2.7586, 2.7931, 2.8276, 2.8621, 2.8966, 2.931, 2.9655, 3];

And now we can use that stupid-simple function to find a nearest index and value:

[idx, val] = closest(frameTimes,2.3539);

Here's the output and a figure to show what's going on:

idx = 11
val = 2.3448

Now, I would know that whatever important thing happened at 2.3539 seconds can probably be revealed by looking at frame 11, or 2.3448 seconds into my video.
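The closest function itself is not listed in this post. The idea is simply to take the minimum of the absolute differences between each array element and the target, which is the job min() does in MATLAB. As a rough sketch of the same logic in Ruby (the language used for other examples on this blog; the method below is an illustration, not the original MATLAB file):

```ruby
# Return [index, value] for the element of `values` nearest to `target`.
# The index is 1-based to match the MATLAB output above (Ruby arrays are 0-based).
def closest(values, target)
  return nil if values.empty?

  value, index = values.each_with_index.min_by { |v, _i| (v - target).abs }
  [index + 1, value]
end

frame_times = [2, 2.0345, 2.069, 2.1034, 2.1379, 2.1724, 2.2069, 2.2414,
               2.2759, 2.3103, 2.3448, 2.3793, 2.4138, 2.4483, 2.4828,
               2.5172, 2.5517, 2.5862, 2.6207, 2.6552, 2.6897, 2.7241,
               2.7586, 2.7931, 2.8276, 2.8621, 2.8966, 2.931, 2.9655, 3]

idx, val = closest(frame_times, 2.3539)
puts "idx = #{idx}, val = #{val}" # prints "idx = 11, val = 2.3448"
```

If two elements are equally close, this sketch returns the first one; whether the original function behaves the same way is not stated.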

The [Simpler] Tetrode Spinner

We wanted a simpler, off-the-shelf tetrode twister that could be bought, built, and assembled by anyone. We took some tips from the "big brother" Twister that was already on Open Ephys, and made a few feature cuts. The simple twister is based on an Arduino microcontroller hooked to a stepper motor, and we hope that our parts list helps you build your own semi-automated twister for about $300.

  1. [$8.94] Hammond Manufacturing Clear Polycarbonate Enclosure
  2. [$4.95] Momentary Pushbutton Switch
  3. [$4.99] 10kOhm Resistors 1/4 Watt (100pcs)

  1. [$85.00] Aluminum Breadboard 6"x12"x1/2" with 1/4"-20 Taps
  2. [$26.00] Sorbothane Feet 1/4"-20 Thread
  3. [$10.82] Ø1/2"x12" Optical Post 1/4"-20 Tap
  4. [$6.77] Ø1/2"x6" Optical Post 1/4"-20 Tap
  5. [$12.64] Ø1/2"x6" Post Holder, Spring-Loaded Hex-Locking Thumbscrew
  6. [$9.76] Right-Angle Post Clamp Fixed 90° Adapter
  7. [$25.70] Slim Right-Angle Bracket with Counterbored & M6 Tapped Holes
  8. [$9.50] 1/4"-20 Stainless Steel Cap Screw, 1" Long, Pack of 25

  1. [$24.95] Arduino Uno R3
  2. [$19.95] Adafruit Motor/Stepper/Servo Shield for Arduino v2 Kit - v2.3
  3. [$14.00] Stepper Motor 200 steps/rev, 12V 350mA
  4. [$8.95] Stepper Motor Mount with Hardware


You may need a few other parts, like hookup wire, a small needle, and some standoffs to mount the Arduino, but you might already have those. Let's get started with the Arduino.

  1. Head over to Adafruit's wonderful documentation on how to install the motor shield.
  2. Install a 10kOhm resistor between IO pin 12 and ground (see photo below) so the pin reads low until the switch is pressed, then connect one side of the switch directly to pin 12 and the other side to 3.3V.
  3. Download the Arduino Software.
  4. Clone our Tetrode Arduino repository on GitHub (or download the ZIP) and compile it onto the Arduino.
  5. Test the stepper motor by pressing the switch.

If you get the Arduino to run the stepper motor, all you have left to do is mount the components and stick a bent (and blunted) needle onto the 8/32" screw of the horizontal Ø1/2" post. You'll notice that we are leaving a few things up to you, which may change based on your abilities:

  • We used the right-angle mount from Thorlabs to place the motor, but we had to drill two small holes in the steel motor mount for it to work.

  • The momentary switch shown is different from the one recommended in the parts list; it's all we had lying around, and it can be found at All Electronics.

  • You will need to decide how you want the tetrode clip to interface with the motor. You can reference the original Open Ephys Twister for 3D parts, or do what we did and scrap one together.


Pressing the momentary switch will spin the motor 80 turns in one direction and 20 turns in the other, and then stop. These parameters can be adjusted in the INO file provided in Step 4 (above). We found that the stepper motor introduces a slight vibration into the plastic "goal posts" we use to hold the tetrode clamp, although we were able to dampen this using some rubber heat-shrink tubing. It is recommended that you unplug the Arduino from the AC adapter when the spinner is not in use.

We hope this serves as a starting point for making your own spinner, and please share any innovations you have made in your quest for the perfect tetrode.

Extracting Spikes from Neural Electrophysiology in MATLAB

Neural spikes extracted using this method.

One hour of neural recordings amounts to nearly forty-three gigabytes of raw data for me. This is streamed through fiber optics onto our storage system and accounts for 128 channels, sampled at just over 24 kHz. On a good day, these files contain hundreds of thousands of spikes, so how do we extract them?

Extracting spikes is just one step in our lab's multi-step protocol to analyze animal behavior. With the flux of everyone from undergraduates to post-docs working with the code base, extracting spikes can’t be something only one person understands. This imposes a few constraints on the algorithm we implement—we want something that is reliable, but more importantly, simple.

The first step is the same no matter what extraction you use: we want to exclude the low frequency content of the signal. This will “flatten” the signal, and hopefully begin to highlight the high frequency neural spikes. I use a Butterworth bandpass filter with cutoff frequencies (Fc) of 244 Hz and 6.104 kHz. Depending on your sampling frequency (Fs), you can easily calculate your own normalized cutoff (Wn) values, where Wn = Fc/(Fs/2).

>> [b,a] = butter(4, [0.02 0.5]);
>> filteredData = filtfilt(b,a,double(data));
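As a worked check of the Wn formula, here is a small Ruby sketch. The sampling rate of 24,414.0625 Hz is an assumption, chosen only because the recordings are described as "just over 24 kHz"; substitute your own Fs.

```ruby
# Normalized cutoff: Wn = Fc / (Fs / 2), i.e. the cutoff as a fraction
# of the Nyquist frequency.
def normalized_cutoff(fc_hz, fs_hz)
  fc_hz / (fs_hz / 2.0)
end

fs = 24_414.0625 # Hz; ASSUMED sampling rate, not stated exactly in the post

low  = normalized_cutoff(244.0, fs)
high = normalized_cutoff(6104.0, fs)

puts format("Wn = [%.2f %.2f]", low, high) # prints "Wn = [0.02 0.50]"
```

Under that assumption the two cutoffs round to the [0.02 0.5] pair passed to butter() above.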

At this point, you might think about removing high amplitude artifacts. Movement-related potentials will often soar above your spike amplitude, so everything above (and below) a certain level can be removed. It makes most sense to me to just apply a zero-amplitude segment in place of the artifact. You can find my code for artifactThresh.m, and here’s what it’s doing:

  1. Identify peaks above given threshold.
  2. Move forward and backward in time and identify when the signal reaches reasonable amplitude (when it "resets").
  3. Replace the artifact spans with zeros.
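The three steps above can be sketched as follows. This is a Ruby illustration, not the contents of artifactThresh.m: the name zero_artifacts and the reset_level parameter (the amplitude at which the signal counts as "reset") are assumptions, and for simplicity it treats any sample above the threshold as part of a peak.

```ruby
# Replace high-amplitude artifacts with zeros.
#   threshold:   absolute amplitude that marks an artifact
#   reset_level: absolute amplitude at which the signal is considered back
#                to normal (ASSUMED parameter, not named in the post)
def zero_artifacts(signal, threshold, reset_level)
  cleaned = signal.dup
  i = 0
  while i < signal.length
    if signal[i].abs > threshold
      # Step 2a: walk backward until the signal is within the reset level.
      start = i
      start -= 1 while start > 0 && signal[start - 1].abs > reset_level
      # Step 2b: walk forward until the signal is within the reset level.
      stop = i
      stop += 1 while stop < signal.length - 1 && signal[stop + 1].abs > reset_level
      # Step 3: zero the whole artifact span.
      (start..stop).each { |k| cleaned[k] = 0 }
      i = stop + 1
    else
      i += 1
    end
  end
  cleaned
end

raw = [1, 2, 6, 12, 7, 2, 1, -3, -15, -4, 1]
p zero_artifacts(raw, 10, 5) # prints [1, 2, 0, 0, 0, 2, 1, -3, 0, -4, 1]
```

Using a reset level below the detection threshold removes the shoulders of an artifact as well as its peak.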

Finally, it’s time to do the detection itself. What we really want to know is, at what times (or sample numbers) are there spikes? Bestel et al. reviewed some of the detection methods in use by others [1], they include:

If you have a great signal to noise ratio, just drawing a threshold is by far the simplest method. You can even get pretty darn close just using peakseek.m by Peter O’Connor. Although we’ve done our best to eliminate any signal not within the “spiking” frequency band, unless your recordings are perfect you will still find that the threshold method catches a lot of noise. Being constrained to simple ideas, the Nonlinear Energy Operator (NEO) is the next obvious candidate, because as Mukhopadhyay & Ray show [2], you are basically squaring your signal but also subtracting the product of the neighboring samples. This ensures that the waveform is, indeed, spiky.

ψ[x(n)] = x(n)^2 - x(n+1) * x(n-1)

Where ψ is nonlinear energy, x is your data, and n is your sample number.

This serves first to make big signals bigger and small signals smaller (squaring), but it also acts as a more selective high pass filter, as signals that are broad will have large amplitudes for neighboring samples. Now, you perform a threshold on the nonlinear energy value. Choosing a threshold is still a subjective process, although I have had success using a multiple of the median of nonlinear energy values. You can find all these operations in my function, getSpikeLocations.m (with two dependencies: peakseek.m and snle.m). The next step, which I don't cover here, will be extracting the waveforms based on the timestamps and spike sorting them using either more MATLAB software, or a commercial product, like Plexon's Offline Sorter.
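To make the detection step concrete, here is a minimal Ruby sketch of the standard form of the operator, x(n)^2 - x(n+1)*x(n-1), followed by a median-based threshold. It is not a port of getSpikeLocations.m or snle.m: the method names are hypothetical, no smoothing is applied, and the multiplier is whatever you choose.

```ruby
# Nonlinear energy for every sample; the two endpoints have no neighbors
# on one side and are left at zero.
def nonlinear_energy(x)
  psi = Array.new(x.length, 0.0)
  (1...x.length - 1).each { |n| psi[n] = x[n]**2 - x[n + 1] * x[n - 1] }
  psi
end

def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Sample indices (0-based) where the nonlinear energy has a local peak
# above multiplier * median(energy).
def spike_locations(x, multiplier)
  psi = nonlinear_energy(x)
  threshold = multiplier * median(psi)
  (1...psi.length - 1).select do |n|
    psi[n] > threshold && psi[n] >= psi[n - 1] && psi[n] > psi[n + 1]
  end
end

trace = [0, 1, 0, 1, 0, 1, 10, 1, 0, 1, 0, 1, 0]
p spike_locations(trace, 5) # prints [6]
```

In this toy trace the background wiggle has an energy of about 1, while the single large sample reaches 99, so a multiplier of 5 isolates it.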

1.   Bestel, R., Daus, A. W. & Thielemann, C. A novel automated spike sorting algorithm with adaptable feature extraction. J. Neurosci. Methods 211, 168–78 (2012).

2.   Mukhopadhyay, S. & Ray, G. C. A new interpretation of nonlinear energy operator and its efficacy in spike detection. IEEE Trans. Biomed. Eng. 45, 180–187 (1998).


Mouse and Rat Brain Atlas: An Interactive Online Tool

I've created online tools for the Rat Brain Atlas and Mouse Brain Atlas based on the Paxinos et al. work, in stereotactic coordinates. All the sections have been extracted from the original PDF documents. My software accounts for the subtle differences in page formatting and finds the nearest section match to your desired coordinate. Once you plot a point, you can send or save the URL—the tool is purposefully very stateful. Finally, I have included print style sheets, so you can print out your sections, should you need them for surgery (this is why I built the tools).

Sending Emails from LabVIEW with the Mailgun API

Emails can give you play-by-play information from a program, or just act as a notifier that it is done running. Mailgun is a great web platform that can send emails via a web API. This is a LabVIEW VI that integrates with Mailgun; all you will need is a [free] account and your API information. I will not be keeping the source on Github, so here it is.
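For readers who want to see what such a request looks like outside of LabVIEW, here is a hedged Ruby sketch. It assumes Mailgun's v3 "messages" endpoint with HTTP basic auth (user name "api", your API key as the password) and the form fields from, to, subject, and text. The domain, key, and addresses below are placeholders, and the method names are my own.

```ruby
require "net/http"
require "uri"

# Build (but do not send) a Mailgun "send message" request.
def build_mailgun_request(domain:, api_key:, from:, to:, subject:, text:)
  uri = URI("https://api.mailgun.net/v3/#{domain}/messages")
  request = Net::HTTP::Post.new(uri)
  request.basic_auth("api", api_key)
  request.set_form_data("from" => from, "to" => to,
                        "subject" => subject, "text" => text)
  [uri, request]
end

# Send a request built above; returns the Net::HTTP response object.
def send_mailgun_request(uri, request)
  Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
end

uri, request = build_mailgun_request(
  domain: "mg.example.com",          # placeholder domain
  api_key: "key-placeholder",        # placeholder API key
  from: "LabVIEW <labview@mg.example.com>",
  to: "me@example.com",
  subject: "Done",
  text: "The acquisition has finished."
)
# send_mailgun_request(uri, request) # uncomment once real credentials are in place
```

Building the request separately from sending it makes it easy to inspect what would go over the wire before using real credentials.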

Use like this:


Change the defaults to reflect your Mailgun account settings/info:


This is the back panel:


Moving LabVIEW and Windows Around on Multiple Hard Drives

LabVIEW and Windows can eat up a lot of space, so you may find yourself stretching your hard drives thin and having to consider swapping in a new (and bigger) hard drive, or simply adding another hard drive to the mix. Let’s consider some options.

Starting Afresh

As much of a headache as re-installing Windows and LabVIEW sounds, it is the only way to ensure the operating system and LabVIEW play nice. You risk losing some of your configurations, but at least you can rely on documentation and know your order of operations was done correctly. Start with Windows.

  1. If you have a Windows installation disk, pop it in and start installing it onto the new drive. Skip to #6.
  2. If not, Google “Windows 7 ISO” and download an ISO file of your intended Windows installation.
  3. Download the Windows 7 USB/DVD Tool and use it with a USB key to create a bootable version of Windows 7 with the ISO file.
  4. Restart your computer. Most computers will attempt to boot from a USB drive before anything else, but to make sure, press F7 when starting your computer (or whichever key takes you to the boot menu) and select the USB drive.
  5. Proceed with the Windows installation on the new hard drive.
  6. You should still have access to the old hard drive, so at this point you can just drag over files you still need (don’t drag over programs, just re-install them).

Once Windows is installed you may choose to download LabVIEW or use installation disks. If you have disks, you should use them, and LabVIEW will provide updates once you are complete in case the disks are out of date. See below for more on this.

Merging onto a New Hard Drive

If you want to retain your Windows installation and all your files you will want to “clone” the old hard drive to the new one. There are plenty of paid-for solutions, but this is the free way.

Factory Windows installations usually partition a hard drive with a separate area (and disk letter) called “SYSTEM” which makes the drive bootable. Without this, the hard drive is just data and doesn’t represent a bootable operating system. If you choose to maintain the “SYSTEM” partition on the new hard drive, it needs to be the first partition, and you should use the Windows partition manager to format and create this partition before cloning.

  1. Download XXCLONE.
  2. If you are maintaining a boot partition (explained above), first clone that partition, using the “SYSTEM” partition as the source, and the new “SYSTEM” partition as the destination.
  3. Next, clone the primary partition of the old hard drive to the primary partition of the new hard drive (this is all the data that includes Windows, documents, etc.).
  4. If you are not using a boot partition, you will need to make sure your single partitioned hard drive is recognized as a bootable operating system. Download the free version of EasyBCD and use its “BCD Backup and Restore” tool on the new hard drive to create the proper files needed for booting. (XXCLONE offers something like this in their Tools but it didn’t work for me).
  5. Upon restarting your computer—making sure to boot from the new hard drive—you should be back in action, loading Windows as if nothing had changed.

Using Two Hard Drives

It is not ideal to split program-specific files (like .exe files, or different LabVIEW components) across two hard drives. One reason is that someday you may need to consolidate physical hardware space and merge onto one larger drive. More importantly, you will start creating references across hard drives: if the supplemental hard drive fails, or its drive letter gets changed, your programs (including LabVIEW) start breaking. If you insist on using two (or more) drives, the best suggestion is to keep all operating system related files on one drive and use the others for document and file storage.

LabVIEW 32-bit & 64-bit Order of Operations with FPGA Card and Basler Camera

The reason it can be a pain to move LabVIEW stuff around (especially .exe’s) is because LabVIEW’s NI MAX keeps a record of the installation locations for all NI components. LabVIEW uses the references inside of NI MAX to decide where to open the things it needs. This is why it can be easier to just re-install everything when shit hits the fan.

For re-installing LabVIEW with an FPGA card and Vision hardware (connected to a Basler camera) follow this order. This is specifically for users who have the installation discs.

  1. Make sure all the NI software is removed (use the Windows “uninstall a program” tool).
  2. Make sure all hardware cards are removed. If they are not, shut down the computer, unplug it, and remove the cards (set them on an anti-static bag). Plug the computer back in and start it.
  3. Insert the first LabVIEW disc and install LabVIEW (this will be the 32-bit version) and the FPGA module. You will also want to make sure that Xilinx compiling tools are installed along with this. When it prompts you for Drivers you will want to continue the installation without driver support; these will be installed when installing 64-bit LabVIEW.
  4. After the installation it will prompt you for a restart (don’t do anything with the hardware yet).
  5. Download LabVIEW 64-bit and begin the installation. When prompted for the location of the Drivers, place the driver CD into the computer and browse to the location in the NI prompt. You will want to select all of the Vision drivers in the next dialog.
  6. After the 64-bit installation you will need to restart the computer. Once you have done this, NI will check for updates. If there are any, download them (and restart the computer) before moving on.
  7. Once everything is installed, shut down the computer, unplug the power cord, and install the FPGA and Vision hardware. Plug the power back in and start the computer.

The FPGA hardware should show up in NI MAX now (also see Getting Started with the R-Series Multifunction RIO and Getting Started with the NI PCIe-1433). If you were using a RTSI line between the FPGA and Vision hardware, be sure to re-install it in NI MAX.

  1. Open NI MAX.
  2. Right-click “Devices and Interfaces” and click “Create New…”.
  3. Choose “NI-RTSI Cable”.
  4. Locate the camera under “Devices and Interfaces” now and modify the RTSI Lines in the “Camera Attributes” tab.

It is also worth ensuring the camera attributes are the same as in your original projects (for instance, we leave “Enable Serial Commands” unchecked). You now want to make sure the FPGA hardware is a usable target in a LabVIEW project.

  1. Open a blank LabVIEW project.
  2. Right-click “My Computer” within the project and under “New” make sure you see “Targets and Devices”. If it is available, click it.
  3. Now make sure your FPGA device can be added to the project.

If you fail to get through these 3 steps, you will want to review the installation procedure, and your best option may be to re-install everything with particular attention paid to the order of operations. It is worth either recompiling your FPGA VIs or at least opening the “FPGA Compile Worker” in Windows (find it by searching or under the National Instruments folder) and making sure the compiler is installed correctly.

Brandr Open Sourced

I have no reason to call this “intellectual property” anymore. I have released the code that drives this cool little demo: This project was about 2 weeks of coding, testing, and understanding—having late nights, and a lot of white-boarding to figure out the best way to extract colors from brands. Notice, I say “brands”. It’s not just colors from an image, it’s the human interpretation of a gestalt; and that’s why it is a tough problem. Click on the title above, or here for the Github repo.

Installing suPHP with Plesk 11 on Media Temple DV4

I ran into major dependency issues trying to follow this tutorial provided in the Media Temple knowledge-base. Normally it is recommended to install suPHP by compiling it from source (as mentioned in the tutorial), mainly because of the mode that suPHP is put into during the installation. The reckoning for suPHP is well-put here, with the three modes of operation being:

  • owner: Run scripts with owner UID/GID
  • force: Run scripts with UID/GID specified in Apache configuration
  • paranoid: Run scripts with owner UID/GID but also check if they match the UID/GID specified in the Apache configuration

The advantage of compiling from source is that suPHP can run in paranoid mode. However, as the previous link states: "Although suPHP states that the default mode is 'paranoid', the libapache2-mod-suphp is installed in 'owner' mode by default. When suPHP is installed in 'owner' mode, the directive suPHP_UserGroup is not recognized which is required for 'force' or 'paranoid' mode."

Running suPHP in owner mode doesn’t seem all that bad, considering it is in fact the default for some installations. However, the comment about not having access to the “suPHP_UserGroup” directive within your configuration file is true, and if you try to restart Apache with it in there (as the Media Temple tutorial suggests), it will result in an error, and possibly crash your server.

My workaround is to remove any of the lines that include “suPHP_UserGroup”, and simply use yum to install suPHP, which lets you skip steps 1-4 in the tutorial.

yum install mod_suphp

Creating Rows with ExpressionEngine and a Grid

Below is an image from my digital library that I developed to help me organize my research for an upcoming book. It uses a responsive grid, and I’ll show you how to create it using Bootstrap and methods available within ExpressionEngine (no modulo!).


Let’s take a look at how Bootstrap implements their grid (knowing other grid systems are similar), and in particular, a 4-column grid.

<div class="row">
  <div class="span3"></div>
  <div class="span3"></div>
  <div class="span3"></div>
  <div class="span3"></div>
</div>

The only complexity within ExpressionEngine you face is that you have repeating elements for the “span3” class, but the rows also need to repeat as the page continues! The switch statement comes to the rescue in a somewhat unexpected way.

<div class="row">
  {exp:channel:entries channel="books" orderby="title" sort="asc" dynamic="no"}
    <div class="span3"></div>
    {switch='|||</div><div class="row">'}
  {/exp:channel:entries}
</div>

Why is this a little tricky? Well, even though the entire EE channel loop appears to be wrapped by a row element, it’s not. The closing (and opening) of the first, and all subsequent rows, is handled in the switch statement. The last and final closing “</div>” is actually closing a row that came from the switch statement.

If you are used to hardcoding this type of stuff in PHP, your mind may instantly jump to using some type of modulo plugin to do this, but fortunately you can scratch that complexity right out. Here is what my entire page looks like, using Assets for the image.

{embed="partials/_head"}
<h2>Books</h2>
<div class="row">
  {exp:channel:entries
    channel="books" orderby="title" sort="asc" dynamic="no"
    disable="custom_fields | member_data | pagination | trackbacks"}
    {if no_results}
      <div class="span12"><p>No entries yet.</p></div>
    {/if}
    <div class="span3">
      {research_cover}
        <a href="{path='book/entry/{url_title}'}"><img class="img-polaroid img-rounded" src="{url:tall}" alt="{alt_text}"/></a>
      {/research_cover}
      <h5><a href="{path='book/entry/{url_title}'}">{title}</a></h5>
      <p class="center"><small>{if book_bought}✓ you own this{/if}</small></p>
    </div>
    {switch='|||</div><div class="row">'}
  {/exp:channel:entries}
</div>
{embed="partials/_foot"}

Timebomb is a Time-sensitive Short Link for Passwords

Sorry, but I've had to discontinue support for my password sharing application, Timebomb, or simply "x" as it came to be. My focus has been entirely shifted away from supporting SSL certificates and PHP subdirectories. I apologize if this comes as any inconvenience! If you still need a solution, I'll give you two suggestions, beyond using a standard password manager:

  1. Launch it yourself! The code has always been open source, and you can give me a shout if you run into issues. Visit the Timebomb Github Repo for the code.
  2. CloudApp has always been a great little application and they recently incorporated a timebomb-like feature called "auto destroy". You can either share a text document, or take a screen shot of the credentials you want to share, and send a private link.

I built Timebomb a long time ago and gave it a quick refresh this afternoon. I still see this problem pop up all over the place: people sending passwords in emails. Worse yet, starting the email with, “here is the password”. All it takes is access to someone's email account to retrieve all these passwords, or bank account numbers, or social security numbers.


This mobile-friendly website allows you to make confidential information available for only 1 hour, 1 day, or 1 week. When the time is up, so is the data, and it is erased forever. Everything runs under an SSL certificate (HTTPS) so everything transmitted is encrypted, and secure. Also, every link comes in a clean, sendable format (eg. “”). Give it a try, and save yourself (and your clients!) from a security catastrophe down the road.

Git Video - Supplementing Github Commits with Video!

Ever come back to a project and completely forget how all the pieces fit together? Supplement your commits with Git Video.


That half-sentence you throw into the commit message is usually no help once you have walked away from a project for more than a day. Projects go through major evolutions: architecture, ideologies, and approaches to problems are constantly changing, and documenting them becomes hard. A short video connected to each commit on Github is perfect, and lets you use gestures, inflection, and visuals to help you remember what state the project is in at that point in time.


Doing this once a day will help you battle that incessant, twenty-minute, “back into the groove” you usually have to entertain when switching between projects. It was created in a night on Rails, uses Github’s API, designed with Bootstrap, launched on Heroku, monitored with New Relic, and relies on Nimbb for the video embedding. Special thanks to the team over at Nimbb for sponsoring this project with some free video space. A company willing to go out of the way for independent developers deserves a look— so check them out.

Simple Status Ticker for API Endpoints

When you are deploying code left-and-right, even in a test-driven development cycle, sometimes you still want the peace of mind that your website is responding. In my case, this is specific to API endpoints and the different API environments we run our products on.

Status Ticker

This script allows you to leave a mini terminal window on your screen that will refresh the status of a website at an interval of your choice. I use two gems: one to make the HTTP calls and one to make the text colors pretty. Install them like so:

> gem install httparty 
> gem install terminal-display-colors 

Below is the code, which can also be found on this Github gist.
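The gist itself is not reproduced in this copy of the post. As a stand-in, here is a minimal sketch of such a ticker that uses only Ruby's standard library (net/http and ANSI color codes) instead of the two gems above; the method names and the example URL are hypothetical.

```ruby
require "net/http"
require "uri"

# Wrap text in an ANSI color code (32 = green, 31 = red).
def colorize(text, color_code)
  "\e[#{color_code}m#{text}\e[0m"
end

# One ticker line for a URL and its HTTP status code (nil if the request failed).
def status_line(url, code, time = Time.now)
  ok = !code.nil? && code.to_i.between?(200, 399)
  label = code.nil? ? "DOWN" : code.to_s
  "#{time.strftime('%H:%M:%S')} #{url} #{colorize(label, ok ? 32 : 31)}"
end

# Fetch the status code as a string, or nil on any network error.
def fetch_status(url)
  Net::HTTP.get_response(URI(url)).code
rescue StandardError
  nil
end

# Poll forever, printing one line per check.
def run_ticker(url, interval_seconds)
  loop do
    puts status_line(url, fetch_status(url))
    sleep interval_seconds
  end
end

# Only start polling when a URL is given on the command line, e.g.
#   ruby status.rb https://api.example.com/status 10
run_ticker(ARGV[0], (ARGV[1] || 10).to_i) if ARGV[0]
```

Unlike the original script, this sketch takes the URL and the interval (in seconds) as command-line arguments.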

To use the ticker, take the code and put it into a file named `status.rb` and run the following in terminal from the directory the file is located:

> ruby status.rb 

To exit the ticker, use control+c on your keyboard.

Forcing SSL in a Sinatra App

When deploying on Heroku, you can piggyback on their SSL certificate, which allows you to have a secure connection right away without any SSL configuration of your own. I think this is a great solution for a lot of people until you need a really pretty URL. Because this is possible, you should use it, and if you're building an API you should also think about forcing different environments to require SSL. Here is a simple implementation in my `app.rb` file:

I have left some of my app-specific code in there as well, but I am sure you can dig around that to see how SSL is forced. Notice that because downloading `.ics` files shouldn’t require SSL, or in other words, it shouldn’t fail if the user uses `http`, it is included in a whitelist array.
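The `app.rb` listing is not reproduced in this copy of the post. The decision it describes can be sketched in plain Ruby as below; the method name, the list of forced environments, and the whitelist pattern are assumptions, and the Sinatra wiring is shown only as a comment.

```ruby
# Decide whether a request should be redirected to HTTPS.
#   environment:         current app environment, e.g. :production
#   secure:              true when the request already came in over HTTPS
#   path:                request path, e.g. "/calendar/feed.ics"
#   forced_environments: environments in which SSL is required
#   whitelist:           path patterns that may be served over plain HTTP
def must_redirect_to_ssl?(environment, secure, path, forced_environments, whitelist)
  return false unless forced_environments.include?(environment)
  return false if secure

  whitelist.none? { |pattern| path =~ pattern }
end

# In a Sinatra app this could sit in a before filter, roughly:
#
#   before do
#     if must_redirect_to_ssl?(settings.environment, request.secure?,
#                              request.path_info, [:staging, :production],
#                              [/\.ics\z/])
#       redirect request.url.sub(/\Ahttp:/, "https:")
#     end
#   end
```

Keeping the rule in a plain method makes it easy to test without booting the app.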

Multi-environment post_build_hook using tddium, Heroku, and Ruby Sinatra

We have been using tddium as a deployment tool for our Ruby Sinatra API for some time now. It has been working great and now manages deployment for 7 of our API environments. We have run into a few small issues, but there are some sharp engineers on support, so issues get fixed fast. The most recent development in our system is a post build hook that runs migrations after the app is deployed on Heroku.

One of the straightforward adjustments I had to make to their gist was to dynamically set the app name based on the branch that was pushed to Github. In our case, each Heroku app (“environment”) has a git branch, and this is also how tddium sets up test suites, so everything jibes.
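As an illustration of that branch-to-app lookup, here is a small Ruby sketch. The branch and app names are invented, the migration task (`rake db:migrate`) is an assumption about the app, and how the hook learns the current branch from tddium is deliberately left out.

```ruby
# Map a git branch to the Heroku app ("environment") it deploys to.
# All names below are HYPOTHETICAL placeholders.
def heroku_app_for(branch)
  apps = {
    "master"  => "example-api-production",
    "staging" => "example-api-staging",
    "develop" => "example-api-development"
  }
  apps.fetch(branch) do
    raise ArgumentError, "no Heroku app configured for branch #{branch}"
  end
end

# Command a post build hook could run to migrate the matching app.
def migration_command(branch)
  "heroku run rake db:migrate --app #{heroku_app_for(branch)}"
end

puts migration_command("staging")
# prints "heroku run rake db:migrate --app example-api-staging"
```

Raising on an unknown branch keeps a typo from silently migrating the wrong environment.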

The portion that required some support was apparently due to some dependency issues with my gems and the way the git repo was being pushed from tddium. The first step was to add the heroku gem to my gemfile, and from there, it was to modify the tddium post_build_hook a little bit. Here is the full version:

Lazy Levenshtein: Using Abbreviations and Spellchecked Inputs in Ruby

I have been spending a lot of time writing Ruby programs that take in data through the terminal. One of the problems is that mis-spelling something can cause the program to crash, and I want to be as quick as possible when doing data entry.

One of my programs asks which server environment I would like to use before I start messing with any data (development, integration, staging, production). It would be great if all of the following abbreviations or misspellings would choose the development environment, and keep the program rolling:

  • dev

  • development

  • devel

  • deevleopmnt

You get the idea: abbreviations and spellchecking from known inputs. To accomplish this I leverage the Levenshtein distance algorithm, more commonly known as “edit distance”. This algorithm compares two strings and returns an integer equal to the number of single-character edits (insertions, deletions, or substitutions) needed to transform the first string into the second.

Here is the Github Gist for Lazy Levenshtein, with the sample code below so we can dig through it.

The three parameters are the input itself, an array of possible matches, and a boolean that tells the method whether or not you want to match abbreviations. The method sets up a Levenshtein comparison for each potential match (using the Ruby Amatch library), and scores the comparison. We are playing golf here, because the lowest score wins the game!

The method also reverses the array in the main loop, which gives priority to the first items in the array if there happens to be a tie between matches. Unlike typical “spellcheck”, this method will never return “not found”; it will always return a match, and if the “matches” array is empty, it simply returns the provided input.
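The gist is not reproduced in this copy of the post, so here is a self-contained sketch of the behavior described above. It is not the original code: it uses a hand-written edit distance instead of the Amatch library, and the way abbreviations are scored (comparing the input against the same-length prefix of each candidate) is my assumption.

```ruby
# Classic dynamic-programming Levenshtein distance (stand-in for Amatch).
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1,            # deletion
                  current[j - 1] + 1,         # insertion
                  previous[j - 1] + cost].min # substitution
    end
    previous = current
  end
  previous.last
end

# Return the candidate with the lowest score; never returns "not found".
def lazy_match(input, matches, abbreviations = true)
  return input if matches.empty?

  typed = input.downcase
  best = nil
  best_score = nil
  # Reversed so that, on a tie, the earlier item in `matches` wins.
  matches.reverse.each do |candidate|
    score = levenshtein(typed, candidate.downcase)
    if abbreviations
      prefix = candidate[0, input.length].downcase
      score = [score, levenshtein(typed, prefix)].min
    end
    if best_score.nil? || score <= best_score
      best = candidate
      best_score = score
    end
  end
  best
end

environments = %w[development integration staging production]
puts lazy_match("deevleopmnt", environments) # prints "development"
puts lazy_match("dev", environments)         # prints "development"
puts lazy_match("dev", environments, false)  # prints "staging"
```

The last line shows why the abbreviation flag matters: by plain edit distance, "dev" is 7 edits from "staging" but 8 from "development".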

This has helped me make inputting much faster with smarter defaults, and given me the peace of mind that my misspellings will always turn into known/safe values.

Article Analysis: Matching People's Names to Email Addresses

*Code examples are in Ruby

The problem

Suppose you are scraping web articles to build a list of contacts for a PR company. Getting email addresses is as simple as a regular expression.

string = 'John Smith can be contacted at'
emails = []
string.scan(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i).each {|x| emails << x}

However, an email address is only so powerful, it would be really great if we could match a name to that email address. The problem statement is pretty simple- match email addresses in the article to the names in the article.

Finding names

So if you are thinking that finding names in text can’t be that hard, let’s take a quick stroll down that dark alley. Maybe just a regular expression that matches two capitalized strings in a row could do the trick.
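A minimal sketch of that first attempt might look like this. The pattern and the sample sentence are illustrative assumptions, not the original code.

```ruby
# Naive first attempt: two capitalized words separated by a single space.
TWO_WORD_NAME = /\b[A-Z][a-z]+ [A-Z][a-z]+\b/

sample = "John Smith met with Jill Ruth yesterday." # hypothetical sample text
names = sample.scan(TWO_WORD_NAME)
# names => ["John Smith", "Jill Ruth"]
```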


Well, that’s cool, but what if there is a middle name, or even worse, an abbreviated name? So now, we add an optional third string to the regular expression, and allow for abbreviations.
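One way to express that, as a sketch with an assumed pattern and made-up sample text, is an optional middle group that accepts either a full capitalized word or a single initial followed by a period.

```ruby
# Optional middle part: either a capitalized word or an initial such as "R."
NAME_WITH_MIDDLE = /\b[A-Z][a-z]+(?: (?:[A-Z][a-z]+|[A-Z]\.))? [A-Z][a-z]+\b/

sample = "John Walker Smith and Jill R. Ruth spoke to Edward Jones." # hypothetical
names = sample.scan(NAME_WITH_MIDDLE)
# names => ["John Walker Smith", "Jill R. Ruth", "Edward Jones"]
```

The groups are non-capturing so that `scan` returns whole matches rather than the contents of the groups.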


This is looking great, and then you hit a name like Michael D'hunt, or De'Angelo Munez. Okay, so now we allow some apostrophes.
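A sketch of the apostrophe-tolerant version, again with assumed patterns rather than the original ones, could define what counts as a single name word once and reuse it. The apostrophe branch is listed first so that a word like De'Angelo is not cut short at the apostrophe.

```ruby
# A name word: a capital letter followed by either an apostrophe form
# (D'hunt, De'Angelo) or plain lowercase letters (Michael).
WORD = /[A-Z](?:[a-z]*'[A-Za-z][a-z]+|[a-z]+)/

# First word, optional middle word or initial, last word.
NAME = /\b#{WORD}(?: (?:#{WORD}|[A-Z]\.))? #{WORD}\b/

sample = "Michael D'hunt interviewed De'Angelo Munez and John Q. Public." # hypothetical
names = sample.scan(NAME)
# names => ["Michael D'hunt", "De'Angelo Munez", "John Q. Public"]
```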


So right now, you run this against a list of 10,000 common names and a boatload of Lorem Ipsum, and you get some great accuracy. However, in the real world you get a sentence like this: “And Will Smith was not alone, he included his wife Jada on his trip to San Francisco, where they stayed at Hotel Palomar”. Your regular expression just got roasted in so many ways.
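To see the damage, here is a sketch that runs a capitalized-words pattern (an assumed stand-in for the regular expression built up so far, without the apostrophe handling) against that sentence.

```ruby
# Assumed stand-in pattern: first word, optional middle word or initial, last word.
NAME = /\b[A-Z][a-z]+(?: (?:[A-Z][a-z]+|[A-Z]\.))? [A-Z][a-z]+\b/

sentence = "And Will Smith was not alone, he included his wife Jada on his trip " \
           "to San Francisco, where they stayed at Hotel Palomar"

matches = sentence.scan(NAME)
# matches => ["And Will Smith", "San Francisco", "Hotel Palomar"]
```

The sentence opener gets glued onto the real name, a city and a hotel are reported as people, and Jada is missed entirely because she appears without a surname.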

Natural Language Processing (NLP)

To parse out things like names, places, and dates, or to differentiate languages within text, we have to get a little more fancy. Natural language processing concerns itself with the study of linguistics, using a mixture of machine learning, statistics, and artificial intelligence to provide a meaningful analysis of human languages. For all intents and purposes, we can say that it breaks apart sentences so we can interpret them better using a computer.

This is the crucial link between knowing if ‘San Francisco’ is a person’s name, or a physical place. The Stanford Natural Language Processing Group has a set of core open-source tools that can take care of some of this by implementing known language patterns, and using massive libraries of common naming schemes for people, places, and things.

Finding names, the right way

By implementing the Stanford CoreNLP toolset, we can essentially throw some text at it, and with a couple of filters we can have a list of names contained within the text. So from the sentence above, we may get a result like this.

    [
      { :name => "Will",  :start => 4,  :end => 8 },
      { :name => "Smith", :start => 9,  :end => 14 },
      { :name => "Jada",  :start => 51, :end => 55 }
    ]

The “start” and “end” numbers refer to the string position of each name, and with a little magic it is possible to concatenate names that appear next to each other, giving a final result as follows.

["Will Smith", "Jada"]
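The “little magic” could be as small as the sketch below. It assumes the tokens arrive as hashes with `:name`, `:start`, and `:end` keys as shown above, and that two tokens belong to the same name when at most one character (the space) separates them. The helper name `merge_adjacent` and the `max_gap` parameter are hypothetical.

```ruby
tokens = [
  { :name => "Will",  :start => 4,  :end => 8 },
  { :name => "Smith", :start => 9,  :end => 14 },
  { :name => "Jada",  :start => 51, :end => 55 }
]

# Join tokens whose positions are separated by no more than max_gap characters.
def merge_adjacent(tokens, max_gap = 1)
  tokens.sort_by { |t| t[:start] }.each_with_object([]) do |token, merged|
    last = merged.last
    if last && token[:start] - last[:end] <= max_gap
      # Extend the previous entry instead of starting a new one.
      last[:name] = "#{last[:name]} #{token[:name]}"
      last[:end] = token[:end]
    else
      merged << token.dup # copy so the input array is left untouched
    end
  end
end

names = merge_adjacent(tokens).map { |t| t[:name] }
# names => ["Will Smith", "Jada"]
```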

Names to emails

This matching problem is tricky because of the many formats an article or piece of text might come in. In a perfect world, a person’s name would appear right next to their email address.

'John Smith can be contacted at, and Jill Ruth can be contacted at'

If we know the position of the name and the position of the email address, this is no problem: we just write a routine to find the closest email to the person’s name. However, this breaks down pretty quickly.
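A rough sketch of such a routine is shown here. The addresses at example.com are made up for illustration, the helper name `nearest_following_email` is hypothetical, and “closest” is interpreted as the first address that starts after the name ends, which is an assumption.

```ruby
EMAIL_PATTERN = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

# Map each name to the first email address that appears after it in the text.
def nearest_following_email(text, names)
  emails = []
  text.scan(EMAIL_PATTERN) do
    match = Regexp.last_match
    emails << { :email => match[0], :start => match.begin(0) }
  end

  names.each_with_object({}) do |name, result|
    position = text.index(name)
    next if position.nil? # name not present in the text

    name_end = position + name.length
    closest = emails.select { |e| e[:start] >= name_end }.min_by { |e| e[:start] }
    result[name] = closest && closest[:email]
  end
end

text = "John Smith can be contacted at john@example.com, " \
       "and Jill Ruth can be contacted at jill@example.com" # hypothetical addresses
pairs = nearest_following_email(text, ["John Smith", "Jill Ruth"])
# pairs => {"John Smith"=>"john@example.com", "Jill Ruth"=>"jill@example.com"}
```

With a sentence that lists both names before both addresses, this routine hands the first address to both people.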

'John Smith and Jill Ruth can be contacted respectively at, and'

Or even worse…

'Article written by Edward Jones

... article body ...

Contact the writer at'

The article body itself could also contain important names and email addresses, scattered as they please, so this calls for some more advanced parsing techniques.

The Levenshtein distance

My take on this problem is that we can throw name-position vs. email-position out the window; it is not reliable. The one thing we can rely on is that, in most professional situations, a person’s email has some reference to their name. 

This is where the Levenshtein distance algorithm comes in: it calculates the number of edits needed to transform one string into another. In our case, we are comparing a person’s name to their email. It quickly becomes apparent that the email extension (everything from the @ onward) and any numbers can be removed from the email address, and that normalizing the case is helpful before making the comparisons. Let’s look at some results for John Smith (or more specifically, in lower case, “john smith”).

  • john.smith : 1 edit

  • j.smith : 4 edits

  • john.walker.smith : 8 edits

  • jwsmith : 4 edits

  • jws : 8 edits
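These numbers can be reproduced with a small plain-Ruby edit distance. The sketch below uses a hand-written `edit_distance` helper (a hypothetical name) in place of a library, which is an assumption made only to keep the example self-contained.

```ruby
# Levenshtein distance: each insertion, deletion, or substitution costs 1.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

name = "john smith"
locals = %w[john.smith j.smith john.walker.smith jwsmith jws]
distances = locals.map { |local| [local, edit_distance(name, local)] }.to_h
# distances => {"john.smith"=>1, "j.smith"=>4, "john.walker.smith"=>8,
#               "jwsmith"=>4, "jws"=>8}
```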

So that is pretty neat, and the next step is pretty clear: we need to test a bunch of common email address patterns against the person’s name, and use the best score. So instead of making comparisons with just “john smith”, we can abstract the name into some common formats.

require 'amatch' # Levenshtein comparisons come from the Amatch gem

person = "John Smith"
email = ""

# keep only the part before the @ and strip everything except letters
m = Amatch::Levenshtein.new(email.split('@').first.downcase.scan(/[a-z]/).join(''))

# score several common renderings of the person's name against the email
first, last = person.downcase.split(' ').values_at(0, -1)
tests = []
tests << m.match(person.downcase.scan(/[a-z]/).join('')) # full name, letters only
tests << m.match("#{first[0]}#{last[0]}")                 # initials
tests << m.match(first)                                   # first name
tests << m.match(last)                                    # last name

# the lowest edit distance is the best score
best_result = tests.min

If this is run for every person and every email address found in an article, it will provide the best score for each person vs. email address.
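Running it for every pair might look like the sketch below. It replaces the Amatch calls with a plain-Ruby `edit_distance` so it runs without gems, and the helper names and the example.com addresses are assumptions for illustration.

```ruby
# Plain-Ruby Levenshtein distance standing in for the Amatch library.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Lowest distance between the email's local part and common name formats.
def best_score(person, email)
  local = email.split('@').first.to_s.downcase.scan(/[a-z]/).join
  first, last = person.downcase.split(' ').values_at(0, -1)
  formats = [
    person.downcase.scan(/[a-z]/).join, # full name, letters only
    "#{first[0]}#{last[0]}",            # initials
    first,                              # first name
    last                                # last name
  ]
  formats.map { |format| edit_distance(local, format) }.min
end

people = ["John Smith", "Jill Ruth"]
emails = ["jsmith@example.com", "jill.ruth@example.com"] # hypothetical addresses

# One score per (person, email) pair; lower means a closer match.
scores = people.each_with_object({}) do |person, table|
  emails.each { |email| table[[person, email]] = best_score(person, email) }
end
```

Here `scores[["John Smith", "jsmith@example.com"]]` comes out as 1 (the surname format is one deletion away) and `scores[["Jill Ruth", "jill.ruth@example.com"]]` as 0.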

Scores into results

With any type of artificial intelligence, there is rarely a concept of “passing a test”; there are only various levels of failure. The goal is simply to minimize failure in the best way possible, and developing with any other intention can be a destructive process. Using our scores from the previous step, we attempt to award each email we found to the person most deserving. Consider the following sample set.

people = ["Matt Gaidica", "Brad Birdsall", "John Smith", "Grant Olidapo", "Minh Nguyen"]
emails = ["", "", "", ""]

The scores we produced account for every name vs. every email, or 20 (5x4) unique values. We look to some sort of complexity reduction algorithm to reduce this set of 20 data points, to only 4, which directly relate names to emails, leaving one of our people email-less. After about 20 lines of magic, our algorithm spits out the results.

{
  "Matt Gaidica" => "",
  "Brad Birdsall" => "",
  "Grant Olidapo" => "",
  "Minh Nguyen" => ""
}

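The reduction step is not spelled out here, so the following is only one plausible sketch: a greedy pass that walks the pairs from best score to worst and awards each email to the first person still without one. The helper name `assign_emails`, the sample people and addresses, and the score values are all illustrative assumptions rather than computed results.

```ruby
# scores: { [person, email] => distance }, where lower is better.
def assign_emails(scores)
  # Sort by score, using the original position to keep ties deterministic.
  ranked = scores.to_a.sort_by.with_index { |(_pair, score), index| [score, index] }

  result = {}
  used_emails = []
  ranked.each do |(person, email), _score|
    next if result.key?(person) || used_emails.include?(email)

    result[person] = email
    used_emails << email
  end
  result
end

# Illustrative scores for three people and two addresses.
scores = {
  ["John Smith",   "jsmith@example.com"]    => 1,
  ["John Smith",   "jill.ruth@example.com"] => 7,
  ["Jill Ruth",    "jsmith@example.com"]    => 5,
  ["Jill Ruth",    "jill.ruth@example.com"] => 0,
  ["Edward Jones", "jsmith@example.com"]    => 4,
  ["Edward Jones", "jill.ruth@example.com"] => 6
}

assignments = assign_emails(scores)
# assignments => {"Jill Ruth"=>"jill.ruth@example.com", "John Smith"=>"jsmith@example.com"}
```

A greedy pass is simple but not guaranteed to minimize the total score across all pairs; an optimal assignment method could be swapped in if that matters for your data.
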
Tip of the iceberg

I look at this as just one of the ways to accomplish this goal. This process can be heavily supplemented with machine learning techniques to produce better name recognition, and further develop common email address patterns for your specific type of article, document, or data set.

I have opened a library on GitHub called Textract, which includes the code for this entire process. My goal is to keep the problems simple, and the solutions simpler.