kiddock's blog

How fuzzy search works

Would it be easy to use YouTube if you had to search for the exact words in the video title? Can you imagine searching YouTube without making any spelling mistakes? It would be so frustrating! Thanks to mathematics research, this isn’t a problem.

Recently, I found out how fuzzy searching works. Let me explain how you might add fuzzy matching to your search box.

A basic search will take each word that I enter into the search box, convert it to lowercase, and show video titles that contain that word. Great! I can search for "puppies playing" to get a video called "Playing with my puppies" and I can search for "mountain biking" to watch "Biking on the mountain".

But if you have millions of videos, searching through them all could take a long time. So you need a table in your database with two columns: (1) word number and (2) video number. For example, Let’s say that "puppies" has a word number of 20 and "Playing with my puppies" is video number 99. When I search for "puppies", you convert it to the number 20 and then find all video numbers that have a word number of 20 in your new table, including video number 99.

This is a really fast lookup for a computer to do - even with millions of videos! Just make sure you keep this table up to date whenever a video title changes.

Similar words

But, what if I search for "puppy"? Or "mountain bike"? Similar words should show similar results.

If you can figure out the base word (or stem), you will know which words are similar. For English, there exists the Porter Stemming Algorithm. Other algorithms exist, so search online for your speaking and programming language. This stemming algorithm will reduce "puppies" to "puppy" and "biking" to "bike". Make sure you stem every word before putting it in your index and make sure you stem every one of my search words before finding matches in your index.

Now I don’t have to type the exact words. Thank you!

Spelling mistakes

But what if I make a spelling mistake? Here is what you can do for each word that I enter into the search box.

First, take the first 2 or 3 letters and find every word in your index that starts with the same letters. Then, using the Levenshtein distance, if the word is similar enough you can use it to find matching video titles in your index.

The Levenshtein distance tells you how many letters you need to change, add or remove to get from my searched word to the word in your index. For example, to get from "puppies" to "puppy" you need to remove the "s", remove the "e", and change the "i" for a "y". Because that’s only 3 edits, you could say it’s a close enough match. You might allow more or fewer edits depending on the words in your video titles.

Final thoughts

And that’s a quick explanation of how fuzzy search works.

For extra relevant search results, you could automatically correct common misspellings and only index common phrases. For example, if you have lots of dog videos you could index "border collie" and "border terrier" instead of indexing the single word "border". Or if you have lots of mountain biking videos you could index "cross-country" instead of "cross" and "country".

In the past, if I did not understand how something works I was not sure how to copy it in my own programming. But, now that I understand how fuzzy search works, I want to add it to my software in the future.