Table of contents
What is Image Captioning?
How Does Image Captioning Work?
Types of Architectures
Inject Architecture
Merge Architecture
Phases of Image Captioning
Feature Extraction
Text prediction
Applications of Image Captioning
Implementation of Image Captioning
Frequently Asked Questions
What is the goal of Image captioning?
Which algorithm is used for Image captioning?
Is Image captioning supervised or unsupervised?
How does LSTM work in Image captioning?
What is deep learning used for?
Last Updated: Mar 27, 2024

Image Captioning

Author Tashmit


Deep learning has numerous applications. One of the most effective ways to understand it is to practise by taking up projects.


In this article, we will study an exciting topic: image captioning. It combines image processing and text processing to build a useful deep learning model.

What is Image Captioning?

Image captioning is the task of describing an image by generating a textual description. It is in great demand among people with visual impairments, since the generated text can be read aloud to them with the help of AI. Image captioning converts an image, which can be viewed as a sequence of pixels, into a sequence of words. It is therefore regarded as an end-to-end sequence-to-sequence problem.

This type of problem is solved with the help of neural networks. A convolutional neural network (CNN) is used to obtain feature vectors from the image, and a recurrent neural network (RNN) is used to generate the words. Long short-term memory (LSTM) units help the RNN retain context across long sentences.


How Does Image Captioning Work?

Let us say a human is told to describe the following image.


How would you describe it?

“The mountains are covered with snow and surrounded by a lake.”

“It is a cloudy day, the sun is about to set.”

So, how did you describe it?

You looked at the picture, understood what was happening, and formed a meaningful sequence of words to describe the image, right?

The image captioning method follows the same process: understanding the image is handled by a CNN, and forming the sentence is handled by an RNN.

Types of Architectures

There are two main types of architecture for image captioning.

Inject Architecture

In this method, both the image and the words are fed into the RNN, so the RNN is trained on the image and the caption together. At each step, the RNN predicts the next word.

Merge Architecture

In this method, only the words are fed into the RNN. The image and the caption prefix are therefore encoded separately, and the image vector is merged with the RNN's output in a later feed-forward layer that predicts the next word.
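The merge idea described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the weights are random rather than trained, the dimensions are arbitrary, and the names `encode_words` and `merge_predict` are hypothetical helpers, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim, image_dim = 10, 8, 16, 32

# Hypothetical, randomly initialised weights (a trained model would learn these).
W_embed = rng.normal(size=(vocab_size, embed_dim))
W_xh = rng.normal(size=(embed_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_img = rng.normal(size=(image_dim, hidden_dim)) * 0.1
W_out = rng.normal(size=(hidden_dim, vocab_size)) * 0.1


def encode_words(word_ids):
    """Run a simple RNN over the caption prefix; the image is not an input here."""
    h = np.zeros(hidden_dim)
    for idx in word_ids:
        h = np.tanh(W_embed[idx] @ W_xh + h @ W_hh)
    return h


def merge_predict(image_features, word_ids):
    """Combine the image vector with the RNN output, then predict the next word."""
    text_vec = encode_words(word_ids)
    image_vec = np.tanh(image_features @ W_img)
    merged = text_vec + image_vec  # addition; concatenation is another option
    logits = merged @ W_out
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


image_features = rng.normal(size=image_dim)  # stand-in for a CNN feature vector
probs = merge_predict(image_features, [1, 4, 2])
print(probs.shape, int(np.argmax(probs)))
```

The key point is that `encode_words` never sees the image; the two sources of information only meet in the feed-forward step inside `merge_predict`.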


The inject method itself has three variants:

  • Init-inject
  • Pre-inject
  • Par-inject
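The three inject variants differ in where the image vector enters the RNN. The sketch below is a rough NumPy illustration, assuming the usual meaning of the names: init-inject uses the image as the initial hidden state, pre-inject feeds it as the first "word", and par-inject attaches it to every word. The helper `rnn`, the weights, and the dimensions are all hypothetical.

```python
import numpy as np


def rnn(inputs, W_xh, W_hh, h0):
    """Simple RNN: returns the final hidden state after reading all inputs."""
    h = h0
    for x in inputs:
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h


rng = np.random.default_rng(0)
dim = 6
W_xh = rng.normal(size=(dim, dim)) * 0.1
W_hh = rng.normal(size=(dim, dim)) * 0.1
W_par = rng.normal(size=(2 * dim, dim)) * 0.1  # par-inject inputs are twice as wide

image_vec = rng.normal(size=dim)   # image features already projected to `dim`
words = rng.normal(size=(3, dim))  # three word embeddings

# Init-inject: the image vector is the RNN's initial hidden state.
h_init = rnn(words, W_xh, W_hh, h0=image_vec)

# Pre-inject: the image vector is fed as the first item of the input sequence.
pre_inputs = np.vstack([image_vec, words])
h_pre = rnn(pre_inputs, W_xh, W_hh, h0=np.zeros(dim))

# Par-inject: the image vector is concatenated to every word embedding.
par_inputs = np.hstack([words, np.tile(image_vec, (len(words), 1))])
h_par = rnn(par_inputs, W_par, W_hh, h0=np.zeros(dim))

print(pre_inputs.shape, par_inputs.shape)
```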

Phases of Image Captioning

Feature Extraction

The CNN creates a dense feature vector, also known as an embedding, which is used as input for the RNN model. The CNN's first task is to extract distinct features from an image based on its spatial context.


The CNN can be fed images in different formats, including PNG and JPG. The network compresses the large number of features extracted from the original image into a smaller, RNN-compatible feature vector. This is why the CNN is also referred to as the 'encoder'.
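To make the encoder idea concrete, here is a toy NumPy sketch of a single convolution layer followed by global average pooling. It is not a real CNN: the filters are random, the image is random noise, and the function names `conv2d_valid` and `encode_image` are hypothetical. In practice a pretrained network would play this role.

```python
import numpy as np


def conv2d_valid(image, kernels):
    """Naive 'valid' convolution (cross-correlation, as in most deep learning libraries).

    image: (H, W, C); kernels: (K, kh, kw, C); returns (H-kh+1, W-kw+1, K).
    """
    H, W, C = image.shape
    K, kh, kw, _ = kernels.shape
    out = np.zeros((H - kh + 1, W - kw + 1, K))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]
            # One response per filter for this spatial position.
            out[i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


def encode_image(image, kernels):
    """Compress an image into a small dense vector with one entry per filter."""
    feature_maps = np.maximum(conv2d_valid(image, kernels), 0.0)  # ReLU
    return feature_maps.mean(axis=(0, 1))  # global average pooling


rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))           # stand-in for a decoded PNG/JPG image
kernels = rng.normal(size=(8, 3, 3, 3))   # 8 random 3x3 filters
features = encode_image(image, kernels)
print(features.shape)
```

The 16 x 16 x 3 image (768 values) is reduced to 8 numbers, which is the kind of compact vector the RNN consumes.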


The second phase brings the RNN into the picture for 'decoding' the feature vectors generated by the CNN module. To generate captions, the RNN model must first be trained on a relevant dataset so that it learns to predict the next word in a sentence. However, a model cannot be trained directly on strings; the captions must first be tokenized, that is, converted into numerical values.
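A minimal sketch of this preparation step, in plain Python, might look as follows. The special tokens (`<pad>`, `<start>`, `<end>`, `<unk>`), the whitespace tokenizer, and the helper names are assumptions made for illustration; real pipelines usually rely on a library tokenizer.

```python
def build_vocab(captions):
    """Map every word seen in the captions to an integer id."""
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
    for caption in captions:
        for word in caption.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(caption, vocab, max_len):
    """Turn a caption into a fixed-length list of ids (truncating may drop <end>)."""
    ids = [vocab["<start>"]]
    ids += [vocab.get(word, vocab["<unk>"]) for word in caption.lower().split()]
    ids.append(vocab["<end>"])
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def next_word_pairs(ids, pad_id=0):
    """Build (prefix, next word) training pairs for next-word prediction."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids)) if ids[i] != pad_id]


vocab = build_vocab(["a dog runs", "a cat sits"])
ids = encode("a dog sits", vocab, max_len=8)
print(ids)                      # [1, 4, 5, 8, 2, 0, 0, 0]
print(next_word_pairs(ids)[0])  # ([1], 4)
```

Each pair asks the model to predict one word given everything before it, which is the training objective the paragraph above describes.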


Text prediction

After tokenization, the last step of the model is carried out by the LSTM. This step requires an embedding layer to transform each word into a vector, which is then passed on for decoding. With LSTM, the RNN model can remember information from the input feature vector and from the words generated so far, and use it to predict the next word. Repeating this until the caption is complete produces the final output.
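The decoding loop can be sketched with a hand-written LSTM cell in NumPy. This is an untrained toy under several assumptions: the weights are random, the image vector seeds the hidden state (one of several possible choices), the next word is picked greedily, and `lstm_step` and `greedy_caption` are hypothetical names.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,).

    Gate order in the stacked weights: input, forget, output, candidate.
    """
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c_new = f * c + i * g          # cell state carries long-range memory
    h_new = o * np.tanh(c_new)
    return h_new, c_new


def greedy_caption(image_vec, params, start_id, end_id, max_len=10):
    """Generate word ids one at a time, always taking the most likely next word."""
    E, W, U, b, W_out = params
    h = image_vec.copy()           # assumption: image features seed the hidden state
    c = np.zeros_like(h)
    word, caption = start_id, []
    for _ in range(max_len):
        h, c = lstm_step(E[word], h, c, W, U, b)   # E[word] is the embedding lookup
        word = int(np.argmax(W_out @ h))
        if word == end_id:
            break
        caption.append(word)
    return caption


rng = np.random.default_rng(0)
V, D, H = 12, 8, 16                # vocabulary size, embedding size, hidden size
params = (
    rng.normal(size=(V, D)),            # embedding matrix E
    rng.normal(size=(4 * H, D)) * 0.1,  # W
    rng.normal(size=(4 * H, H)) * 0.1,  # U
    np.zeros(4 * H),                    # b
    rng.normal(size=(V, H)) * 0.1,      # output projection W_out
)
image_vec = rng.normal(size=H)     # stand-in for the CNN feature vector
caption_ids = greedy_caption(image_vec, params, start_id=1, end_id=2)
print(caption_ids)
```

With random weights the ids are meaningless; after training, mapping them back through the vocabulary would give the caption text.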

Applications of Image Captioning

  • Recommendations in Editing Applications
  • Assistance for Visually Impaired
  • Media and Publishing Houses
  • Social Media Posts


Implementation of Image Captioning

In this article, we will work with the COCO dataset by Microsoft. You can download the dataset from here.

We will work with a pretrained image captioning model. You can find the pretrained model on the official website of the COCO dataset.

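# Importing necessary modules

import json
import pandas

# Creating the dataset from the JSON annotations file

pretrained_Directory = '../input/coco-2017-dataset/coco2017'
# The original snippet used an undefined name, Base; it is assumed here
# to be the same directory as pretrained_Directory.
with open(f'{pretrained_Directory}/annotations/captions_train2017.json', 'r') as file:
    dataset = json.load(file)
    dataset = dataset['annotations']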


In the above code, we imported the necessary modules and loaded the caption annotations from the dataset directory using the json module.

Now, we will split the dataset into images and captions, and store the values in a list.

# Splitting images and captions from the dataset

imageCaption = []

for test in dataset:
    # COCO image files are named after the image id, zero-padded to 12 digits
    imageName = '%012d.jpg' % test['image_id']
    imageCaption.append([imageName, test['caption']])

captions = pandas.DataFrame(imageCaption, columns=['image', 'caption'])
# Prefix each file name with the directory that holds the training images
captions['image'] = captions['image'].apply(
    lambda x: f'{pretrained_Directory}/train2017/{x}'
)
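Since each COCO image typically has several captions, it can be handy to group them by file name. The sketch below uses a small made-up list in the same shape as the `annotations` entries (the ids and captions are placeholders, not real dataset rows), and `captions_by_file` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up sample in the shape of COCO caption annotations (illustration only).
sample_annotations = [
    {"image_id": 139, "caption": "placeholder caption one"},
    {"image_id": 139, "caption": "placeholder caption two"},
    {"image_id": 285, "caption": "placeholder caption three"},
]


def captions_by_file(annotations):
    """Group captions under the zero-padded file name of their image."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped['%012d.jpg' % ann['image_id']].append(ann['caption'])
    return dict(grouped)


grouped = captions_by_file(sample_annotations)
print(list(grouped))  # ['000000000139.jpg', '000000000285.jpg']
```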


Finally, we will generate caption on a random image from Coco Dataset.

#Generating the caption for a random image

captions = captions.sample(70000)
captions['caption'] = captions['caption']

selectedRow = captions.sample(1).iloc[0]

img =


The output of the above code will be:



Frequently Asked Questions

What is the goal of Image captioning?

Image captioning aims to automatically generate descriptions for a given image, i.e., capture the relationship between the objects present in the picture, generate natural language expressions, and judge the quality of the generated descriptions.

Which algorithm is used for Image captioning?

Convolutional neural networks (CNN), recurrent neural networks (RNN), and long short-term memory (LSTM) networks are used for image captioning.

Is Image captioning supervised or unsupervised?

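Image captioning is usually treated as a supervised learning problem, because the model is trained on images paired with human-written captions, as in the COCO dataset. If we consider the image as the source language, it resembles machine translation.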

How does LSTM work in Image captioning?

We use a convolutional neural network to turn the image into a feature vector, which is fed into a long short-term memory (LSTM) network that generates the caption.

What is deep learning used for?

Deep learning is an essential element of data science, which also includes statistics and predictive modeling. It is especially useful to data scientists who collect, analyze, and interpret big data, because deep learning makes this process faster and easier.


In this article, we have extensively discussed image captioning and its implementation in Python. We hope that this blog has helped you enhance your knowledge regarding image captioning and if you would like to learn more, check out our other articles here.


Happy learning, Ninja!
