Introduction
Language modelling is a way of determining the probability of any sequence of words. Before getting started with N-Gram modelling, we need to know something about the Markov chain. We can think of a Markov chain as a chain of states, say a, b, c, d, e, f, g, and so on. Written in a sequence, they form a chain in which we move from state a to state b, from b to c, and so on, with a probability attached to each move. As a simple example, take just two states, a and b, with the following transition probabilities.
The probability of going from state a to state a is 50%, from state a to state b is 50%, from state b to state a is 50%, and from state b to state b is 50%. Let's assume that the initial state is a. Based on these probabilities, the next state can be either a or b, because both are equally likely.
So in this way, we can form a sequence of different states. This sequence, or chain of states, is called a Markov chain.
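The two-state chain described above can be simulated in a few lines. This is only a minimal sketch: the names TRANSITIONS and simulate_chain are made up for illustration, and the fixed seed is there just to make runs repeatable.

```python
import random

# Transition probabilities from the text: every move is 50/50.
TRANSITIONS = {
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.5, "b": 0.5},
}

def simulate_chain(start, steps, seed=None):
    """Return the start state followed by `steps` randomly chosen transitions."""
    rng = random.Random(seed)
    state = start
    chain = [state]
    for _ in range(steps):
        next_states = list(TRANSITIONS[state])
        weights = [TRANSITIONS[state][s] for s in next_states]
        # The next state depends only on the current state.
        state = rng.choices(next_states, weights=weights, k=1)[0]
        chain.append(state)
    return chain

print(simulate_chain("a", steps=10, seed=0))
```

Changing the numbers in TRANSITIONS (keeping each row summing to 1) gives a chain that prefers some moves over others, which is the situation an N-Gram model ends up in.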
An N-Gram is a contiguous sequence of N items from a sample of text. These items play the role of the states that we saw in Markov chains. The items can be characters, words, or sentences, and we can widen the scope even further, for example to whole articles. When n is 2, we call it a Bigram; when n is 3, we call it a Trigram, and so on. In the case of characters, we consider the characters to be the states of the Markov chain.
Sentence = "I am a good boy", n = 2
Bigrams (character level, spaces count as characters) = 'I ', ' a', 'am', 'm ', ' a', 'a ', ' g', 'go', 'oo', 'od', etc.
Now consider the Trigrams for the sentence "The bird is flying in the blue sky".
Trigrams = 'The', 'he ', 'e b', ' bi', 'bir', ..., 'fly', etc.
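The examples above can be reproduced with a short sliding-window function. This is a sketch under the assumption that spaces count as characters; char_ngrams and word_ngrams are illustrative names, and the word-level version is included only to show that the same idea works when the items are words.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """Return every contiguous run of n whitespace-separated words as a tuple."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams("good", 2))  # ['go', 'oo', 'od']
print(char_ngrams("The bird is flying in the blue sky", 3)[:5])
print(word_ngrams("I am a good boy", 2))
```

A text shorter than n simply yields an empty list.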
Working of N-Gram model
An N-Gram language model gives the probability of a given N-Gram occurring within any sequence of words. We can estimate the probability of a word (w) given the history of previous observations (h), which contains the n-1 preceding words. We can compute the joint probability of a whole sequence by using the chain rule, that is, the conditional probability of each word given the words before it:
p(w1, w2, ..., wn) = p(w1) × p(w2 | w1) × p(w3 | w1 w2) × p(w4 | w1 w2 w3) × ... × p(wn | w1 w2 ... wn-1)
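N-Gram modelling makes a simplifying assumption: the probability of a word depends only on the n-1 words immediately before it, not on the whole history. For a Bigram model (n = 2), this gives:
p(wk | w1 w2 ... wk-1) ≈ p(wk | wk-1)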
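To make the Bigram assumption concrete, here is a minimal sketch that scores a sentence by multiplying Bigram probabilities. Several things here are assumptions and not part of the text above: the toy corpus is hypothetical, the probabilities are estimated by simple counting (count of the pair divided by count of the previous word), and a made-up "<s>" marker stands in for the start of a sentence so that p(w1) becomes p(w1 | <s>).

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
corpus = [
    "I am a good boy",
    "I am a good girl",
    "I am happy",
]

START = "<s>"  # assumed sentence-start marker

history_counts = Counter()  # how often each word appears as the previous word
bigram_counts = Counter()   # how often each (previous word, word) pair appears
for sentence in corpus:
    words = [START] + sentence.split()
    history_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """Estimate p(word | prev) as count(prev, word) / count(prev)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]

def sentence_prob(sentence):
    """Multiply the Bigram probabilities along the sentence."""
    words = [START] + sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("good", "boy"))
print(sentence_prob("I am a good boy"))
```

Any sentence containing a word pair that never appears in the corpus gets probability 0 with this plain counting estimate. Practical models usually apply some form of smoothing to avoid that, which this sketch leaves out.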