Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a technique for Topic Modeling that uses Dirichlet distributions to categorize the text in a collection of documents into topics and to describe each topic through the words it tends to use.
LDA makes two fundamental assumptions:
- Documents are made up of topics, and
- Topics are made up of tokens (or words)
The words within these topics are generated from probability distributions. In statistical terms, each document is a probability distribution over topics, and each topic is a probability distribution over words.
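For intuition, here is a tiny, purely hypothetical sketch of these two distributions in Python; the topic names, words, and probabilities below are made up for illustration only:

# Hypothetical per-document topic distribution: one document's probabilities over topics (sums to 1)
document_topics = {"movies": 0.6, "books": 0.3, "cricket": 0.1}

# Hypothetical per-topic word distributions: each topic's probabilities over words (each sums to 1)
topic_words = {
    "movies": {"movie": 0.4, "netflix": 0.3, "watch": 0.3},
    "books": {"book": 0.5, "read": 0.3, "author": 0.2},
    "cricket": {"wicket": 0.5, "india": 0.3, "test": 0.2},
}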
LDA can be viewed as an algebraic (matrix factorization) algorithm, and like any ML algorithm it has three steps:
- Representation: Define the problem in terms of mathematical entities.
- Loss: Define the loss function.
- Optimization: Minimize the loss.
Let's understand these three things in LDA.
How Does LDA Work?
First, LDA applies its two key assumptions to the corpus at hand. Assume we have a corpus containing the following five documents:
Document 1: This weekend, I'd want to see a movie.
Document 2: Yesterday, I went shopping. New Zealand defeated India at Southampton by eight wickets to win the World Test Championship.
Document 3: I am not a cricket fan. Netflix and Amazon Prime both have excellent movie selections.
Document 4: While watching movies is a fun way to unwind, I'd rather paint and read some good books this time. It's been a long time!
Document 5: I love this blueberry milkshake! Dr Joe Dispenza's books are worth reading. His work is revolutionary! His works aided in discovering a lot of information on how our beliefs affect our biology and how our brains can be rewired.
Any corpus (a collection of documents) can be represented as a document-word matrix, also known as a document-term matrix (DTM).
The first step with text data is to clean, preprocess, and tokenize the text into words. After preprocessing the documents, we get the following document-word matrix:
The five documents are D1, D2, D3, D4, and D5, and the words are represented by W1 to W8, so there are eight distinct words.
As a result, the matrix has the shape 5 × 8 (five rows and eight columns):

As a result, the corpus is now represented by the above preprocessed document-word matrix, in which each row represents a document and each column represents a token (word).
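As a rough sketch of this preprocessing step (not the exact pipeline used above, so the vocabulary and matrix shape will differ from the simplified 5 × 8 example), the document-word matrix can be built with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "This weekend, I'd want to see a movie.",
    "Yesterday, I went shopping. New Zealand defeated India at Southampton by eight wickets to win the World Test Championship.",
    "I am not a cricket fan. Netflix and Amazon Prime both have excellent movie selections.",
    "While watching movies is a fun way to unwind, I'd rather paint and read some good books this time. It's been a long time!",
    "I love this blueberry milkshake! Dr Joe Dispenza's books are worth reading.",
]

# Lowercase, tokenize, and drop English stop words to build the document-word matrix
vectorizer = CountVectorizer(stop_words="english")
document_term_matrix = vectorizer.fit_transform(documents)

print(document_term_matrix.shape)           # (number of documents, number of distinct words)
print(vectorizer.get_feature_names_out())   # the vocabulary, i.e. the columns of the matrix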
As demonstrated below, LDA converts this document-word matrix into two other matrices: a Document-Topic matrix and a Topic-Word matrix.

Below is a description of these matrices:
The Document-Topic matrix holds the possible topics (represented by K above) that each document can contain. Assuming six topics and five documents, this matrix has dimensions 5 × 6.
The Topic-Word matrix lists the words (or terms) that each topic can contain. With six topics and eight distinct tokens in the vocabulary, this matrix has the shape 6 × 8.
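A minimal NumPy sketch (with random, made-up values) shows how the two shapes fit together: multiplying the 5 × 6 Document-Topic matrix by the 6 × 8 Topic-Word matrix gives back a 5 × 8 distribution over words for each document:

import numpy as np

n_docs, n_topics, n_words = 5, 6, 8

# Hypothetical Document-Topic matrix: each row is a document's distribution over 6 topics
doc_topic = np.random.dirichlet(np.ones(n_topics), size=n_docs)    # shape (5, 6)

# Hypothetical Topic-Word matrix: each row is a topic's distribution over 8 words
topic_word = np.random.dirichlet(np.ones(n_words), size=n_topics)  # shape (6, 8)

# Their product gives, for every document, a distribution over the 8 words
doc_word = doc_topic @ topic_word                                   # shape (5, 8)
print(doc_word.shape)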
Representation of LDA

- The yellow box represents the entire corpus of documents (M is the number of documents). In our example, M = 5 because we have 5 documents.
- N, in the peach-coloured box, is the number of words in a document.
- Inside this peach box are the individual words; w, in the blue circle, is one such word.
The LDA model has two parameters that control these distributions:
- Alpha (α) controls the per-document topic distribution.
- Beta (β) controls the per-topic word distribution.
To summarize:
- M: total documents in the corpus
- N: number of words in the document
- w: Word in a document
- z: latent topic assigned to a word
- theta (θ): per-document topic distribution
- LDA model's parameters: Alpha (ɑ) and Beta (ꞵ)
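Putting these symbols together, here is a minimal sketch of LDA's generative story, assuming small made-up values for alpha, beta, the number of topics, and the vocabulary size (phi is simply the name used here for the per-topic word distributions):

import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_length = 6, 8, 10
alpha, beta = 0.1, 0.01   # assumed hyperparameter values

# theta: the document's topic distribution, drawn from Dirichlet(alpha)
theta = rng.dirichlet(alpha * np.ones(n_topics))

# phi: one word distribution per topic, each drawn from Dirichlet(beta)
phi = rng.dirichlet(beta * np.ones(vocab_size), size=n_topics)

document = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=theta)      # latent topic z for this word position
    w = rng.choice(vocab_size, p=phi[z])   # word w drawn from that topic's distribution
    document.append(w)

print(document)   # word indices generated for one synthetic document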
Optimization:
The ultimate goal of LDA is to find the most optimal representation of the Document-Topic and Topic-Word matrices, i.e., the best Document-Topic and Topic-Word distributions.
Because LDA assumes that documents are made up of a variety of topics, and topics are made up of a variety of words, it starts at the document level to determine which topics and words would have generated these documents.
Now, let's look at our corpus, which consists of five documents (d1 to d5), each with a different number of words:
d1: (w1, w2, w3, w4, w5, w6, w7, w8)
d2: (w'1, w'2, w'3, w'4, w'5, w'6, w'7, w'8, w'9, w'10)
d3: (w''1, w''2, w''3, w''4, w''5, w''6, w''7, w''8, w''9, w''10, w''11, w''12, w''13, w''14, w''15)
d4: (w'''1, w'''2, w'''3, w'''4, w'''5, w'''6, w'''7, w'''8, w'''9, w'''10, w'''11, w'''12)
d5: (w''''1, w''''2, w''''3, w''''4, w''''5, w''''6, w''''7, w''''8, w''''9, w''''10, …, w''''32, w''''33, w''''34)
After the first iteration, LDA provides initial document-topic and topic-word matrices. The goal is to improve these estimates, which LDA accomplishes by iterating over all the documents and words.
LDA assumes that every existing topic assignment is correct except for the current word. Using those topic-word assignments, LDA iterates over each document 'D' and each word 'w' and tries to find a better topic assignment for 'w'.
How would it go about doing that? It does it by computing two probabilities for each topic (k): p1 and p2.
P1: the proportion of words in document D that are currently assigned to topic k.
P2: the proportion of assignments of the word w, across all documents, that go to topic k. In other words, p2 captures how often the word w is assigned to topic k throughout the corpus.
The formulas for p1 and p2 are as follows:
p1 = proportion(topic k / document D), and p2 = proportion(word w / topic k).
Using these probabilities, LDA computes a new probability as the product p1 * p2; through this product, LDA identifies the new, most relevant topic for the current word.
The product probability of p1 * p2 is used to reassign the word 'w' from the document 'D' to a new topic 'k.'
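A rough sketch of this reassignment step, assuming we maintain simple count matrices over the current assignments (doc_topic_counts of shape documents × topics and topic_word_counts of shape topics × vocabulary, both names chosen here for illustration), might look like this:

import numpy as np

def reassign_word(d, w, doc_topic_counts, topic_word_counts, rng):
    # Simplified sketch: full collapsed Gibbs sampling would also exclude the current
    # word's own assignment from the counts and smooth them with alpha and beta.
    # p1: proportion of words in document d currently assigned to each topic k
    p1 = doc_topic_counts[d] / doc_topic_counts[d].sum()
    # p2: proportion of assignments of the word w that currently go to each topic k
    p2 = topic_word_counts[:, w] / topic_word_counts[:, w].sum()
    scores = p1 * p2
    probs = scores / scores.sum()            # normalize p1 * p2 over the topics
    return rng.choice(len(probs), p=probs)   # sample the new topic k for this word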
Now, during the step of selecting a new topic 'k', LDA is run for a large number of iterations until a steady state is reached. LDA reaches its convergence point when it produces the most optimal document-topic and topic-word matrix representations.
This completes the working of Latent Dirichlet Allocation.
Implementation
# Parameter tuning using grid search
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import LatentDirichletAllocation

# document_term_matrix is the document-word matrix built earlier
# (e.g., with CountVectorizer as in the preprocessing sketch above)
grid_params = {'n_components': list(range(5, 10))}  # candidate numbers of topics

# LDA model wrapped in a grid search over the number of topics
lda = LatentDirichletAllocation()
lda_model = GridSearchCV(lda, param_grid=grid_params)
lda_model.fit(document_term_matrix)

# Best estimator found by the grid search
lda_model1 = lda_model.best_estimator_
print("Best LDA model's params", lda_model.best_params_)
print("Best log likelihood score for the LDA model", lda_model.best_score_)
print("LDA model perplexity on train data", lda_model1.perplexity(document_term_matrix))

There are three major hyperparameters in LDA: 'alpha', the document-topic density factor; 'beta', the topic-word density factor; and 'n_components', the number of topics you wish to cluster the documents into.
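In scikit-learn's LatentDirichletAllocation these map onto the doc_topic_prior (alpha), topic_word_prior (beta), and n_components arguments; a small example of setting them explicitly (the values below are arbitrary):

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_components=6,          # number of topics
    doc_topic_prior=0.1,     # alpha: document-topic density
    topic_word_prior=0.01,   # beta: topic-word density
    random_state=42,
)
# lda.fit(document_term_matrix)   # fit on the document-word matrix built earlier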

FAQs
1. What is a good explanation of latent Dirichlet allocation?
Latent Dirichlet Allocation (LDA) is a popular form of statistical topic modelling. In LDA, documents are represented as mixtures of topics, and a topic is a group of words. These topics reside in a hidden, or latent, layer.
2. Why do we use latent Dirichlet allocation?
The aim of LDA is to find the topics a document belongs to based on the words it contains. It assumes that documents about similar topics will use similar groups of words. This lets each document be mapped to a probability distribution over latent topics, where each topic is itself a probability distribution over words.
3. What is the difference between LDA and LSA?
Both LSA and LDA take the same input: a bag-of-words matrix. LSA focuses on reducing the dimensionality of this matrix, while LDA solves the topic-modelling problem probabilistically. We won't go through the mathematical details here, as there is a lot of great material covering them.
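As a quick sketch, both models can be fit on the same bag-of-words matrix (document_term_matrix as built earlier); TruncatedSVD gives the LSA-style low-rank reduction, while LatentDirichletAllocation gives per-document topic distributions:

from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

# LSA: purely algebraic low-rank decomposition of the bag-of-words matrix
lsa = TruncatedSVD(n_components=6)
lsa_features = lsa.fit_transform(document_term_matrix)

# LDA: probabilistic topic model on the same input
lda = LatentDirichletAllocation(n_components=6, random_state=42)
doc_topic_dist = lda.fit_transform(document_term_matrix)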
4. Is Latent Dirichlet Allocation Bayesian?
Yes. Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model in which each item of a collection is modelled as a finite mixture over an underlying set of topics.
Conclusion
In a nutshell, LDA is a statistical model used for topic modelling: it discovers the abstract "topics" that occur in a collection of documents and can be used to assign the text in a document to a particular topic.
Hey Ninjas! Don't stop here; check out Coding Ninjas for Machine Learning, more unique courses, and guided paths. Also, try Coding Ninjas Studio for more exciting articles, interview experiences, and fantastic Data Structures and Algorithms problems.
Happy Learning!