Introduction
Spelling correction is an important task in Natural Language Processing. It is used in applications such as search engines, sentiment analysis, and text summarization. As the name suggests, spelling correction detects and corrects spelling errors in text. Real-world NLP data is often full of typos, and correcting them improves downstream model performance. For example, if we search for apple but type “aple,” we expect the search engine to suggest “apple” instead of returning no results.
Different approaches to spelling correction
Norvig’s approach
In this approach, we generate, by brute force, all candidate words that can be formed from the misspelled word with single edits: inserts, deletes, replaces, splits, and transposes.
For example, for the word abc, the candidates after one edit include ab, ac, bc (deletes), bac, acb (transposes), a bc, ab c (splits), aabc, abbc, acbc, adbc, aebc (inserts), and so on.
These candidates are collected into a set, and the process is repeated for words that are still not found in the dictionary (i.e., candidates two edits away). All candidates are then scored with a unigram language model: word frequencies are pre-computed from a large corpus, and the candidate with the highest frequency is chosen.
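A minimal sketch of this brute-force approach in Python, assuming word frequencies are counted from a plain-text corpus (corpus.txt is a placeholder file name):

```python
import re
from collections import Counter

# Assumed: corpus.txt is a placeholder for any large plain-text corpus.
WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt", encoding="utf-8").read().lower()))

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All candidates one edit away: deletes, transposes, replaces, inserts, splits."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    word_splits = [L + " " + R for L, R in splits if L and R]
    return set(deletes + transposes + replaces + inserts + word_splits)

def edits2(word):
    """Candidates two edits away, for words still not found after one edit."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def known(candidates):
    """Keep candidates whose every token is in the vocabulary (handles splits)."""
    return {c for c in candidates if all(t in WORDS for t in c.split())}

def correct(word):
    # Prefer the closest known candidates, then rank them by unigram frequency.
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
    return max(candidates, key=lambda c: min(WORDS[t] for t in c.split()))

print(correct("aple"))  # -> "apple", given a corpus where "apple" is frequent
```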
Adding more context
Instead of the unigram language model, we can use n-gram models with n > 1, which take the surrounding words into account and give better accuracy. At the same time, the model becomes heavier because there are far more n-grams to store and score.
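A sketch of context-aware candidate scoring with a bigram model, using a simple stupid-backoff style score; the toy WORDS and BIGRAMS counters stand in for counts built from the same large corpus as before:

```python
from collections import Counter

# Assumed counts; in practice both come from the same large corpus as before.
WORDS = Counter({"an": 50_000, "apple": 12_000, "ample": 900})
BIGRAMS = Counter({("an", "apple"): 4_000, ("an", "ample"): 10})

def bigram_score(prev_word, candidate, alpha=0.4):
    """Stupid-backoff style score: prefer candidates actually seen after prev_word."""
    if BIGRAMS[(prev_word, candidate)] > 0 and WORDS[prev_word] > 0:
        return BIGRAMS[(prev_word, candidate)] / WORDS[prev_word]
    # Back off to the discounted unigram probability when the bigram is unseen.
    return alpha * WORDS[candidate] / sum(WORDS.values())

def correct_in_context(prev_word, candidates):
    """Pick the candidate that fits best after the previous word."""
    return max(candidates, key=lambda c: bigram_score(prev_word, c))

print(correct_in_context("an", ["apple", "ample"]))  # -> "apple"
```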
Increasing speed: SymSpell Approach (Symmetric Delete Spelling Correction)
Instead of generating every possible edit each time we encounter an error, we pre-compute all delete variants of every dictionary word. At lookup time we only generate deletes of the misspelled word and match them against the pre-computed index, which is much faster but costs additional memory.
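A sketch of the symmetric-delete idea: build the delete index once over the vocabulary, then at query time generate only the deletes of the typo (the vocabulary and frequencies here are toy values):

```python
from collections import defaultdict
from itertools import combinations

MAX_EDIT = 2  # pre-compute deletes up to this edit distance

def deletes(word, max_edit=MAX_EDIT):
    """All variants of `word` with up to max_edit characters removed (including word itself)."""
    variants = {word}
    for d in range(1, min(max_edit, len(word)) + 1):
        for idx in combinations(range(len(word)), d):
            variants.add("".join(c for i, c in enumerate(word) if i not in idx))
    return variants

def build_index(vocabulary):
    """Map every delete variant back to the dictionary words that can produce it.
    Built once up front; this index is the extra memory SymSpell trades for speed."""
    index = defaultdict(set)
    for word in vocabulary:
        for v in deletes(word):
            index[v].add(word)
    return index

def lookup(index, typo, frequencies):
    """At query time we only generate deletes of the typo, never inserts or replaces.
    A full implementation would also verify the true edit distance of each candidate."""
    candidates = set()
    for v in deletes(typo):
        candidates |= index.get(v, set())
    return max(candidates, key=lambda w: frequencies.get(w, 0)) if candidates else typo

# Usage with a toy vocabulary and assumed frequencies:
freqs = {"apple": 1000, "ample": 200, "maple": 150}
index = build_index(freqs)
print(lookup(index, "aple", freqs))  # -> "apple"
```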
Reducing memory consumption
We require large datasets (at least a few GB) to obtain good accuracy, yet even n-gram models trained on small datasets of a few MB consume a lot of memory: the SymSpell index and the language model each take roughly half of the total size, because we store frequencies rather than plain text. We can compress the n-gram model with a perfect-hash scheme. A perfect hash is a hash function that produces no collisions over the known key set, so we can store only the count of each n-gram under its hash, without storing the n-gram itself. To make sure the hash of an unknown n-gram does not match that of a known one, we additionally keep a Bloom filter of known words; a Bloom filter is a space-efficient structure for testing whether an element belongs to a set. Finally, we can reduce memory further by packing the 32-bit counts into 16-bit values with nonlinear quantization.
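A rough sketch of these ideas; a plain 64-bit hash stands in for a true perfect hash, the Bloom filter guards against unknown words, and counts are log-quantized to 16 bits (all sizes and the toy n-gram counts are assumptions):

```python
import hashlib
import math

def h(item, seed=0):
    """Deterministic 64-bit hash; a stand-in for the perfect hash of a real build."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

class BloomFilter:
    """Space-efficient membership test for known words (small false-positive rate)."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        return [h(item, seed) % self.size for seed in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def quantize(count):
    """Nonlinear (log-scale) quantization: 32-bit count -> 16-bit code."""
    return min(65535, round(math.log1p(count) * 1000))

def dequantize(code):
    """Approximate inverse; small counts come back exactly, large ones roughly."""
    return round(math.expm1(code / 1000))

# Build step: store only hashed keys and 16-bit codes, never the n-gram text.
known_words = BloomFilter()
counts = {}  # hash -> 16-bit code; a packed array('H') in a real build
for ngram, freq in [("the apple", 90_000), ("an apple", 3)]:  # toy counts
    for w in ngram.split():
        known_words.add(w)
    counts[h(ngram)] = quantize(freq)

def ngram_count(ngram):
    # Reject n-grams containing unknown words so an accidental hash match
    # cannot return a bogus count for an n-gram we never stored.
    if any(w not in known_words for w in ngram.split()):
        return 0
    return dequantize(counts.get(h(ngram), 0))

print(ngram_count("the apple"), ngram_count("zzz apple"))  # ~90000 (lossy) and 0
```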
Improving accuracy
We can use machine learning to improve accuracy further. We start with a classifier that decides whether a given word contains an error, and then use a ranking model to order the candidate corrections. This mimics the role of smoothing in language models: all the n-gram statistics are fed in as features, and the model learns how much weight each one deserves. For word ranking, we can train a CatBoost (gradient-boosted decision trees) ranking model with features such as word frequency, word length, the edit distance between the source word and the candidate (i.e., the number of changes required to turn one into the other), the n-gram language model prediction, the frequencies of neighboring words, and so on.
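A sketch of such a ranker with CatBoost; the training data, the WORDS frequencies, the bigram_score stub, and the feature set are all illustrative stand-ins for the real pipeline:

```python
import numpy as np
from catboost import CatBoostRanker, Pool

# Assumed stand-ins for the unigram counts and n-gram LM score from the earlier sketches.
WORDS = {"apple": 12_000, "ample": 900, "maple": 700, "receive": 8_000, "relieve": 2_000}
def bigram_score(prev_word, candidate):
    return 0.0  # placeholder for the n-gram language model feature

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def features(source, candidate, prev_word):
    """Per-candidate feature vector; the feature set here is illustrative."""
    return [
        WORDS.get(candidate, 0),             # word frequency
        len(candidate),                      # word length
        edit_distance(source, candidate),    # edits needed to turn source into candidate
        bigram_score(prev_word, candidate),  # n-gram language model prediction
    ]

# Toy training data (assumed format): previous word, misspelled word,
# candidate corrections, and the human-chosen correction.
training_examples = [
    ("an", "aple", ["apple", "ample", "maple"], "apple"),
    ("to", "recieve", ["receive", "relieve"], "receive"),
]

X, y, group_id = [], [], []
for gid, (prev_word, source, candidates, gold) in enumerate(training_examples):
    for cand in candidates:
        X.append(features(source, cand, prev_word))
        y.append(int(cand == gold))  # 1 for the correct word, 0 otherwise
        group_id.append(gid)         # all candidates of one typo form a ranking group

train_pool = Pool(data=np.array(X, dtype=float), label=y, group_id=group_id)
ranker = CatBoostRanker(loss_function="YetiRank", iterations=100, verbose=False)
ranker.fit(train_pool)

def rank(prev_word, source, candidates):
    """Score each candidate and return the top-ranked correction."""
    scores = ranker.predict([features(source, c, prev_word) for c in candidates])
    return max(zip(candidates, scores), key=lambda cs: cs[1])[0]

print(rank("an", "aple", ["apple", "ample", "maple"]))
```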
Further thoughts
We can gather a large parallel corpus of texts with errors and their corrected versions and train a model on it to improve accuracy further.
We can also add dynamic learning, where the system keeps learning while it makes corrections. Another option is two-pass correction: we gather statistics over the text in a first pass and then make the actual corrections in a second pass.