Archive for April, 2008

Improving a SOUNDEX search

Tuesday, April 8th, 2008
We were recently working on a project for a client and needed to be able to provide a useful search of the products database. In addition, we wanted to catch some common misspellings. My first thought was that this was going to be simple - just use a quick SOUNDEX() in MySQL. I’ve done this a million times and it works great. Generally when I’ve used it, it’s been to find potential duplicates, etc. Let’s take a look at that…
 
When looking for potential duplicates, a SOUNDEX() can be a great tool. For example, say you have a database of people. You check for a SOUNDEX() of the first name and last name, and pull any results that have 2 or more matching records. So, in this case, the following two names would match:
 
Johnny Smithe
John Smith
 
Are they the same person? Maybe. Maybe not. You can check on additional data, too. For example, you can check the city to see if they seem similar, etc. Then, you can provide an administrative user with a merge/remove/ignore feature. Anyway, that’s beside the point… The point here is really just that SOUNDEX() will provide you a pretty good way to see if two things sound similar.
 
Where SOUNDEX() fails is when there is more than one word. It wants to calculate the SOUNDEX() of the entire field, not each word. Yes, there are some clever ways around this out there, but, in the end, they all just end up producing a lot of false positives.
 
We ended up using a combination of FULL TEXT indexes (using MATCH … AGAINST) and PHP’s Pspell functions. Basically, we took the search as the user typed it and used it against a full-text index. If no results were found, the search term(s) was passed to Pspell to get suggestions. The top 5 suggestions were searched until results were found. Overall, it worked pretty well, as we were still able to take advantage of the FULL TEXT index and rank the results according to their relevance.
 
We were working when a pretty small database, so building in thresholds wasn’t really necessary at this point, however, you could certainly build in a threshold on the number of search results you needed before you wanted to get suggestions and re-run the search. Also, if you’re working with a very large database and busy site, you could easily return the user any search results, as well as a ‘did you mean ____?’ type of screen. This way, you avoid unnecessary searches on the database.
Joe Koenig
Creative Anvil
A Web Design & Search Engine Marketing Firm

Share/Save/Bookmark