I’ve always been curious about Google’s auto-correct feature. Let’s face it, it looks ultra cool, especially when we type in some vague, disjointed letters and Google suggests exactly what we had in mind. It almost feels like Google is reading our thoughts!
With some free time on my hands, I decided to dig into how it actually works. Of course, I don’t have the monstrous server farms that Google has at its disposal to build an equivalent web-scale solution. Hence I thought of implementing a version more suited to a local/desktop setup.
A quick Google search on this topic yields tons of results. String matching and searching, as it turns out, is a very popular research area. But I wanted something simpler than most of the solutions I found. And what do you know: the classic algorithms book, ‘The Algorithm Design Manual’ by Skiena, had just the thing I was looking for.
By combining the approach described in the book with some of my own code, I was able to put together a decent auto-correct solution. Of course, it has plenty of flaws, but it’s a good first step in my opinion.
Towards the end of this post, I’ve also added a link to my GitHub repo with the full working code for two of the approaches we discuss here.
So, without further ado, here’s what it looks like:
1. Basic idea:
Approximate string matching is achieved by finding the minimum distance between two strings. Here, the distance between two strings is the number of operations (substitutions/insertions/deletions) it takes to convert one string into the other.
For example, let’s assume our input string is abiliti and we have to convert it to either ability or capability. We would only need to substitute i with y to convert abiliti to ability, but it would take much more effort to convert abiliti to capability. So the distance between abiliti and ability is much smaller than the distance between abiliti and capability.
Now if we have a dictionary (a text file) of correctly spelled words, all we need to do is compare our input string (the one that needs a correction) against each word in the dictionary, calculating the distance in each case. The word with the minimum distance should be the correct one.
Easy, isn’t it? Not so fast. It’s easier said than done. Moreover, this method does have its flaws, which I will cover below.
2. Design and Implementation
First, let’s see how to go about implementing this basic approach.
- A dictionary class stores words in a std::map<char, std::vector<std::string>>, grouping words by their first letter for efficient lookup.
- When a user inputs a word, the dictionary retrieves all words starting with the same first letter.
- For each candidate word, the system calculates the edit distance (Levenshtein distance) between the input and the candidate.
- Candidates are sorted by edit distance, and the closest matches (lowest distance) are returned as suggestions.
- The top N suggestions are displayed to the user as possible corrections.
std::string SimpleDictionary::getClosestWord(const std::string& input) const
{
    if (input.empty())
    {
        return "";
    }
    // Cast to unsigned char before tolower to avoid undefined
    // behavior on platforms where char is signed
    char first = std::tolower(static_cast<unsigned char>(input[0]));
    auto it = dict_map.find(first);
    if (it == dict_map.end() || it->second.empty())
    {
        return "No suggestion";
    }
    int min_dist = INT_MAX; // from <climits>
    std::string best_match;
    // Linear scan over every word in the first-letter bucket
    for (const auto& word : it->second)
    {
        int dist = StringUtils::editDistance(input, word);
        if (dist < min_dist)
        {
            min_dist = dist;
            best_match = word;
        }
    }
    return best_match;
}
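The post only shows getClosestWord, not how dict_map gets populated. Here’s a minimal sketch of what the loading side might look like; addWord, loadFromFile, and bucketSize are hypothetical names for illustration, not from the repo:

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <map>
#include <string>
#include <vector>

class SimpleDictionary
{
public:
    // Bucket each word under its lowercased first letter.
    void addWord(const std::string& word)
    {
        if (word.empty()) return;
        char first = std::tolower(static_cast<unsigned char>(word[0]));
        dict_map[first].push_back(word);
    }

    // Read one whitespace-separated word at a time from a text file.
    void loadFromFile(const std::string& path)
    {
        std::ifstream in(path);
        std::string word;
        while (in >> word) addWord(word);
    }

    // How many candidates share this first letter (for inspection).
    std::size_t bucketSize(char c) const
    {
        auto it = dict_map.find(
            static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
        return it == dict_map.end() ? 0 : it->second.size();
    }

private:
    std::map<char, std::vector<std::string>> dict_map;
};
```

Grouping by first letter keeps each linear scan small, which is the whole point of the map-of-vectors layout.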
What are the drawbacks of this approach?
- Performance: It uses a map of vectors, so candidate lookup is fast for the first letter, but searching for the closest match requires comparing the input to every word in the vector, which is slow for large dictionaries.
- Memory Usage: Storing all words in vectors can be inefficient, especially with many words sharing the same prefix.
- Scalability: As the dictionary grows, both lookup and suggestion become slower.
So, a better design would be to use a trie instead of a simple map.
- Efficient Prefix Search: A trie allows fast retrieval of all words sharing a prefix, making candidate lookup much faster.
- Memory Efficiency: Common prefixes are stored only once, reducing memory usage for large dictionaries.
- Scalability: A trie handles large word sets efficiently, keeping lookup and suggestion times low even as the dictionary grows.
With a trie, the implementation changes as follows:
- Trie structure: Each node (TrieNode) contains a map of child nodes (children), a flag (e.g. isWord) to indicate whether it’s the end of a word, and the word itself if it’s a terminal node.
- Insertion: Words are inserted character by character. For each character, if a child node doesn’t exist, a new one is created. The process continues until the end of the word, where isWord is set to true and the word is stored.
- Candidate Retrieval: To get candidates for auto-correct, the trie is traversed to the node matching the first character of the input. All words under this node (sharing the prefix) are collected recursively.
- Edit Distance Calculation: For each candidate word, the edit distance to the input is calculated. Candidates are then ranked by this distance.
- Suggestion: The closest matches (lowest edit distance) are returned as suggestions.
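Piecing those bullets together, the trie side might look like the sketch below. The node layout, ownership via unique_ptr, and the wordsStartingWith helper are my assumptions; the repo’s actual declarations may differ:

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode
{
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isWord = false; // marks the end of a complete word
    std::string word;    // stored only at terminal nodes
};

class OptimizedDictionary
{
public:
    OptimizedDictionary() : root(std::make_unique<TrieNode>()) {}

    // Insert a word character by character, creating nodes as needed.
    void insert(const std::string& w)
    {
        TrieNode* node = root.get();
        for (char c : w)
        {
            auto& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->isWord = true;
        node->word = w;
    }

    // Collect all words under the child matching the first character.
    std::vector<std::string> wordsStartingWith(char c) const
    {
        std::vector<std::string> out;
        auto it = root->children.find(c);
        if (it != root->children.end())
            collectWordsWithPrefix(it->second.get(), out);
        return out;
    }

private:
    // Recursively gather every terminal word stored under `node`.
    static void collectWordsWithPrefix(const TrieNode* node,
                                       std::vector<std::string>& out)
    {
        if (node->isWord) out.push_back(node->word);
        for (const auto& kv : node->children)
            collectWordsWithPrefix(kv.second.get(), out);
    }

    std::unique_ptr<TrieNode> root;
};
```

Storing the full word at the terminal node trades a little memory for not having to rebuild words from the path during collection.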
std::string OptimizedDictionary::getClosestWord(const std::string& input) const
{
if (input.empty())
{
return "";
}
char first = std::tolower(static_cast<unsigned char>(input[0]));
auto it = root->children.find(first);
std::vector<std::string> candidates;
if (it != root->children.end())
{
collectWordsWithPrefix(it->second, candidates);
}
int min_dist = INT_MAX;
std::string best_match;
for (const auto& word : candidates)
{
int dist = StringUtils::editDistance(input, word);
if (dist < min_dist)
{
min_dist = dist;
best_match = word;
}
}
return best_match;
}
What about the editDistance function, the core driver in both of the solutions above? Here’s a quick DP-based solution:
- Create a 2D table dp where dp[i][j] is the edit distance between the first i characters of string a and the first j characters of string b.
- Initialize the first row and column to represent the cost of inserting or deleting characters.
- Fill the table by comparing characters and choosing the minimum cost among insert, delete, or substitute:
- Insert: Move from dp[i][j-1] (insert a character into string a to match string b).
- Delete: Move from dp[i-1][j] (delete a character from string a).
- Substitute: Move from dp[i-1][j-1] (substitute a character in string a to match string b).
- If the current characters match (a[i-1] == b[j-1]), no operation is needed, so dp[i][j] = dp[i-1][j-1]. If they differ, the minimum cost of the three operations is chosen and incremented by 1.
- The final answer is in dp[m][n], where m and n are the lengths of the two strings.
int StringUtils::editDistance(const std::string& a, const std::string& b)
{
int m = a.size(), n = b.size();
std::vector<std::vector<int>> dp(m + 1, std::vector<int>(n + 1));
for (int i = 0; i <= m; ++i) dp[i][0] = i;
for (int j = 0; j <= n; ++j) dp[0][j] = j;
for (int i = 1; i <= m; ++i)
{
for (int j = 1; j <= n; ++j)
{
if (a[i - 1] == b[j - 1])
dp[i][j] = dp[i - 1][j - 1];
else
dp[i][j] = 1 + std::min({dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]});
}
}
return dp[m][n];
}
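As a quick sanity check, the function can be exercised on the examples from section 1 (the DP is repeated here so the snippet compiles on its own):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Same DP as above, re-declared so this snippet stands alone.
static int editDistance(const std::string& a, const std::string& b)
{
    int m = a.size(), n = b.size();
    std::vector<std::vector<int>> dp(m + 1, std::vector<int>(n + 1));
    for (int i = 0; i <= m; ++i) dp[i][0] = i; // delete all of a
    for (int j = 0; j <= n; ++j) dp[0][j] = j; // insert all of b
    for (int i = 1; i <= m; ++i)
        for (int j = 1; j <= n; ++j)
            dp[i][j] = (a[i - 1] == b[j - 1])
                ? dp[i - 1][j - 1]
                : 1 + std::min({dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]});
    return dp[m][n];
}

// editDistance("abiliti", "ability")    == 1  (substitute the final i with y)
// editDistance("abiliti", "capability") == 4  (insert c, a, p; substitute i with y)
```

This matches the intuition from earlier: abiliti is one operation away from ability but four away from capability, so ability wins.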
Link to my github repo: https://github.com/vinitvr/AutoCorrect
So is the problem solved?
No, not exactly. This doesn’t work for all scenarios, since we’re relying only on a first-letter search. But what if the word differs at the first letter itself? For example, the user might have intended to type stadium but instead typed adium. In that case, the search would be limited to words starting with ‘a’ and would return entirely wrong results.
A few approaches to solve this:
- Full Trie Traversal: Instead of limiting candidates to words starting with the same first letter, traverse the entire trie to collect all words as potential candidates.
- Prefix/Substring Matching: Consider collecting candidates based on common substrings or prefixes, not just the first character.
- Edit Distance Filtering: For efficiency, first filter candidates by a maximum allowed edit distance, or use a more advanced algorithm (like BK-tree or fuzzy search) to quickly find close matches.
- Heuristic Ranking: Rank candidates by edit distance, but also consider other heuristics (frequency, context, etc.) to improve suggestions.
I’ll leave the implementation of the above to the reader :)
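That said, here’s a small head start on the edit-distance-filtering idea. It relies on the fact that the length difference between two strings is a lower bound on their edit distance, so candidates that can’t beat the best distance found so far are skipped before running the O(m·n) DP. closestWordFullScan is a hypothetical name, and the DP is repeated so the sketch compiles standalone:

```cpp
#include <algorithm>
#include <climits>
#include <cstdlib>
#include <string>
#include <vector>

// Same DP as in the post, repeated so this sketch stands alone.
static int editDistance(const std::string& a, const std::string& b)
{
    int m = a.size(), n = b.size();
    std::vector<std::vector<int>> dp(m + 1, std::vector<int>(n + 1));
    for (int i = 0; i <= m; ++i) dp[i][0] = i;
    for (int j = 0; j <= n; ++j) dp[0][j] = j;
    for (int i = 1; i <= m; ++i)
        for (int j = 1; j <= n; ++j)
            dp[i][j] = (a[i - 1] == b[j - 1])
                ? dp[i - 1][j - 1]
                : 1 + std::min({dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]});
    return dp[m][n];
}

// Scan ALL words, not just one first-letter bucket, so "adium" can still
// reach "stadium". Skip any candidate whose length difference alone already
// rules it out, since that difference is a lower bound on the edit distance.
static std::string closestWordFullScan(const std::string& input,
                                       const std::vector<std::string>& words)
{
    int best_dist = INT_MAX;
    std::string best_match;
    for (const auto& w : words)
    {
        int len_diff = std::abs(static_cast<int>(w.size()) -
                                static_cast<int>(input.size()));
        if (len_diff >= best_dist)
            continue; // cannot possibly improve on the current best
        int d = editDistance(input, w);
        if (d < best_dist)
        {
            best_dist = d;
            best_match = w;
        }
    }
    return best_match;
}
```

The filter doesn’t change the worst case, but on a real dictionary it prunes most candidates once a reasonably close match has been found.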