Table of contents
Model Architecture
Step 1: Importing Dependencies
Step 2: Setting Dataset Path and Images Configuration
Step 3: Preprocessing Captions
Step 4: Extracting Image Features
Step 5: Generate Train and Test Sets
Step 6: Tokenizing Training Labels
Step 7: Generating Tensorflow Dataset
Step 8: Models Definition
Step 9: Training Stage
Step 10: Loss Plot
Step 11: Testing the Model
Frequently Asked Questions
What is an RNN?
What is LSTM?
What is Keras?
Last Updated: Mar 27, 2024

Image Caption Generator

Author Abhinav Anand


Image captioning is the process of generating text that describes the content of an image. In this article, you will learn how to build an image caption generator using TensorFlow, Keras, and Jupyter Notebook.


TensorFlow is an open-source library used for machine learning. It allows you to create data flow graphs that describe how data flows through a sequence of processing nodes.

Keras is another open-source library commonly used in machine learning for training and building neural networks. It is also used for image classification and natural language processing.

Let’s get started.

Model Architecture

In this model, we will use a CNN to extract features from the input images and encode them. An LSTM-based decoder will then turn the encoded features into a caption.

For training our model, we will use the Flickr 8k dataset. This dataset consists of about 8,000 images, each with five different captions. You should download it from this link.

Now let’s look at the definition of a CNN.


CNN stands for Convolutional Neural Network, a type of network commonly used in machine learning for image and video recognition. CNNs are particularly well suited to images because image data has a grid-like structure.

For this project, we will use a pre-trained CNN model called Inception V3 from Keras.

Before moving forward, install Jupyter Notebook and create a new Python notebook. Make sure you store the dataset in the same directory.

Let’s get started with building the image caption generator.


Step 1: Importing Dependencies

Run the following code in a Jupyter Notebook cell to import all the required libraries.


import re
import random

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

from tensorflow import keras
from time import time

from tqdm import tqdm # progress bar
from sklearn.model_selection import train_test_split # Dividing train test


If nothing goes wrong, you won’t see any outputs.

If there are any errors related to missing dependencies, use the following command to install them, replacing dependency-name with the name of the missing package.


pip install dependency-name


For example, to install scikit-learn:

pip install scikit-learn

Step 2: Setting Dataset Path and Images Configuration

Run the following code to set the dataset paths.

dataset_path = "./dataset"
dataset_images_path = dataset_path + "/Images/"


Make sure to change the “dataset_path” if your dataset is not present in the same directory as your notebook.

Run the following code to configure the dimensions of the dataset images and the validation split.

img_height = 180
img_width = 180
validation_split = 0.2

The dataset is divided into two parts, one for training and the other for validation. The “validation_split” variable is the fraction of images from our dataset that will be used for validation. In our case, 20% of the dataset will be used for validation, while the other 80% will be used for training the model.
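To make the 80/20 split concrete, here is a minimal standard-library sketch. It is not scikit-learn's implementation, and the filenames and the seed are hypothetical; it only illustrates how a fraction such as 0.2 translates into two disjoint lists.

```python
import random

validation_split = 0.2
# Hypothetical filenames standing in for the dataset images
filenames = [f"img_{i:03d}.jpg" for i in range(10)]

# Shuffle a copy with a fixed seed so the split is reproducible
rng = random.Random(1)
shuffled = filenames[:]
rng.shuffle(shuffled)

# The first 20% go to validation, the rest to training
n_val = round(len(shuffled) * validation_split)
val_files, train_files = shuffled[:n_val], shuffled[n_val:]

print(len(train_files), len(val_files))  # 8 2
```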

Step 3: Preprocessing Captions

In this step, we will pre-process the captions by splitting them and adding the <start> and <end> tokens.

Then we will create a dictionary with image filename as the key and an array of captions as the value.

Run the code below to define the preprocessing function:

def get_preprocessed_caption(caption):    
    caption = re.sub(r'\s+', ' ', caption)
    caption = caption.strip()
    caption = "<start> " + caption + " <end>"
    return caption
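As a quick check of what this function does, the sketch below repeats the same logic and applies it to a hypothetical caption with irregular whitespace.

```python
import re

def get_preprocessed_caption(caption):
    # Collapse runs of whitespace, trim the ends, and wrap with sequence markers
    caption = re.sub(r'\s+', ' ', caption)
    caption = caption.strip()
    return "<start> " + caption + " <end>"

# Hypothetical caption with extra spaces, a tab, and a trailing newline
sample = "  A dog   runs across\tthe grass .\n"
print(get_preprocessed_caption(sample))
# <start> A dog runs across the grass . <end>
```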

Now run the code below to create the dictionary.

images_captions_dict = {}

with open(dataset_path + "/captions.txt", "r") as dataset_info:
    next(dataset_info) # Omit header: image, caption

    # Using a subset of 4,000 entries out of 40,000
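    for info_raw in list(dataset_info)[:4000]:
        # Split only on the first comma so captions that contain commas stay intact
        info = info_raw.split(",", 1)
        image_filename = info[0]
        caption = get_preprocessed_caption(info[1])

        if image_filename not in images_captions_dict.keys():
            images_captions_dict[image_filename] = [caption]
        else:
            images_captions_dict[image_filename].append(caption)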
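The grouping of captions by filename can also be sketched with Python's standard csv module, which handles quoted captions that contain commas. This is an alternative illustration, not the code used above, and the rows are hypothetical samples laid out like captions.txt.

```python
import csv
import io

# Hypothetical rows in the same "image,caption" layout as captions.txt
sample_file = io.StringIO(
    'image,caption\n'
    'img_001.jpg,A dog runs .\n'
    'img_001.jpg,"A brown dog, wet from rain, runs ."\n'
    'img_002.jpg,Two kids play .\n'
)

captions_by_image = {}
reader = csv.reader(sample_file)
next(reader)  # skip the header row
for image_filename, caption in reader:
    # Create the list on first sight of a filename, then append to it
    captions_by_image.setdefault(image_filename, []).append(caption)

print(captions_by_image)
```

Each filename maps to a list holding all of its captions, which is the structure the later steps rely on.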

Step 4: Extracting Image Features

Now, we will extract the image features of our dataset using the Inception V3 model from Keras.

The extracted features will be stored in a dictionary as values, and the image filename will act as the key.

Run the code below in a notebook cell to perform the feature extraction.

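def load_image(image_path):
    # Read the raw file, decode it, resize it, and apply Inception V3 preprocessing
    img = tf.io.read_file(dataset_images_path + image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (img_height, img_width))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path


image_captions_dict_keys = list(images_captions_dict.keys())
image_dataset = tf.data.Dataset.from_tensor_slices(image_captions_dict_keys)
# The batch size of 64 used for feature extraction is an assumption
image_dataset = image_dataset.map(load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(64)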


def get_encoder():
    image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
    new_input = image_model.input
    hidden_layer = image_model.layers[-1].output
    image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
    return image_features_extract_model

images_dict = {}
encoder = get_encoder()
for img_tensor, path_tensor in tqdm(image_dataset):
    batch_features_tensor = encoder(img_tensor)
    # Loop over batch to save each element in images_dict
    for batch_features, path in zip(batch_features_tensor, path_tensor):
        decoded_path = path.numpy().decode("utf-8")
        images_dict[decoded_path] = batch_features.numpy()
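It can help to know the shape of the features the encoder produces. The sketch below estimates the spatial size of the final Inception V3 feature map from the input size, assuming the kernel sizes, strides, and padding of the Keras implementation (only the layers that change the spatial size are listed). The function names are hypothetical helpers, not part of any library.

```python
def valid_out(size, kernel, stride):
    # Output size of a convolution or pooling layer with "valid" padding
    return (size - kernel) // stride + 1

def inception_v3_feature_grid(input_size):
    size = valid_out(input_size, 3, 2)  # stem conv, stride 2
    size = valid_out(size, 3, 1)        # stem conv
    # the next stem conv uses "same" padding, so the size is unchanged
    size = valid_out(size, 3, 2)        # max pooling
    # a 1x1 conv follows, which leaves the size unchanged
    size = valid_out(size, 3, 1)        # conv
    size = valid_out(size, 3, 2)        # max pooling
    size = valid_out(size, 3, 2)        # first grid-size reduction
    size = valid_out(size, 3, 2)        # second grid-size reduction
    return size

print(inception_v3_feature_grid(299))  # 8
print(inception_v3_feature_grid(180))  # 4
```

Under these assumptions, the 180 x 180 images used here yield a 4 x 4 grid with 2048 channels per image, so each entry in images_dict would have shape (4, 4, 2048).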




Step 5: Generate Train and Test Sets

We will now divide image filenames among training and testing sets.

First, we will define a utility function that returns the image features and the caption labels for a given list of filenames.

def get_images_labels(image_filenames):
    images = []
    labels = []
    for image_filename in image_filenames:
        image = images_dict[image_filename]
        captions = images_captions_dict[image_filename]

        # Add one instance per caption
        for caption in captions:
            images.append(image)
            labels.append(caption)
    return images, labels


Let's use the train_test_split function we imported earlier to split our dataset.

image_filenames = list(images_captions_dict.keys())
image_filenames_train, image_filenames_test = \
    train_test_split(image_filenames, test_size=validation_split, random_state=1)

X_train, y_train_raw = get_images_labels(image_filenames_train)
X_test, y_test_raw = get_images_labels(image_filenames_test)

Step 6: Tokenizing Training Labels

Tokenization is the process of converting the labels into a numerical form that can be used by machine learning algorithms.

We will use the tokenizer available in Keras.

top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')

# Build the vocabulary from the training captions
tokenizer.fit_on_texts(y_train_raw)

# Introduce padding to make the captions of the same size for the LSTM model
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

# Create the tokenized vectors
y_train = tokenizer.texts_to_sequences(y_train_raw)

y_train = tf.keras.preprocessing.sequence.pad_sequences(y_train, padding='post')

max_caption_length = max(len(t) for t in y_train)
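To illustrate what tokenizing and padding produce, here is a simplified pure-Python stand-in. It is not the Keras tokenizer and skips details such as lowercasing and filtering; the captions are hypothetical.

```python
from collections import Counter

# Hypothetical preprocessed captions
captions = [
    "<start> a dog runs <end>",
    "<start> a dog <end>",
]

# Count words; index 0 is reserved for padding, frequent words get small indices
counts = Counter(word for c in captions for word in c.split())
word_index = {"<pad>": 0}
for rank, (word, _) in enumerate(counts.most_common(), start=1):
    word_index[word] = rank

# Convert each caption into a list of integer indices
sequences = [[word_index[w] for w in c.split()] for c in captions]

# Pad at the end ("post" padding) so every sequence has the same length
max_len = max(len(s) for s in sequences)
padded = [s + [0] * (max_len - len(s)) for s in sequences]

print(padded)
```

The shorter caption ends with a 0, which is the index reserved for the padding token.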

Step 7: Generating Tensorflow Dataset

Run the code below to generate the Tensorflow dataset.

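dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))

BUFFER_SIZE = len(X_train)
BATCH_SIZE = 64
NUM_STEPS = BUFFER_SIZE // BATCH_SIZE  # batches per epoch, used later to average the epoch loss

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)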


Buffer size is the number of elements held in the shuffle buffer. Setting it to the size of the training set shuffles the whole dataset. Batch size is the number of samples that are processed together in a single iteration.

While training our model, we will use a batch size of 64.
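The effect of the batch size can be sketched without TensorFlow. The example below splits a hypothetical training set of 3,250 samples into batches of 64 and shows that the last batch is smaller when the sizes do not divide evenly.

```python
def batches(items, batch_size):
    # Yield consecutive slices of at most batch_size items
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

samples = list(range(3250))  # hypothetical number of training samples
sizes = [len(b) for b in batches(samples, 64)]

print(len(sizes))  # 51
print(sizes[-1])   # 50
```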

Step 8: Models Definition

We will now define the encoder and decoder models.

class CNN_Encoder(tf.keras.Model):
    # The image features have already been extracted in Step 4
    # This encoder passes those features through a fully connected layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # shape after flatten + fc == (batch_size, embedding_dim)
        self.flat = tf.keras.layers.Flatten()
        self.fc = tf.keras.layers.Dense(embedding_dim) #, activation='relu')

    def call(self, x):
        x = self.flat(x)
        x = self.fc(x)
        return x


class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units

        # input_dim = size of the vocabulary
        # Define the embedding layer to transform the input caption sequence
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

        # Define the Long Short Term Memory layer to predict the next words in the sequence 
        self.lstm = tf.keras.layers.LSTM(self.units, return_sequences=True, return_state=True)
        # Define a dense layer to transform the LSTM output into prediction of the best word
        self.fc = tf.keras.layers.Dense(vocab_size) #, activation='softmax')

    # Embeds the input captions, optionally prepends the image features, and passes the result to the LSTM layer
    def call(self, captions, features, omit_features = False, initial_state = None, verbose = False):
        if verbose:
            print("Before embedding")
            print(captions.shape)

        embed = self.embedding(captions) # (batch_size, seq_len, embedding_dim)

        if verbose:
            print("Embed")
            print(embed.shape)

        features = tf.expand_dims(features, 1) # (batch_size, 1, embedding_dim)
        if verbose:
            print("Features")
            print(features.shape)

        # Prepending the image features to the caption embeddings along the time axis
        # shape == (batch_size, 1 + seq_len, embedding_dim)
        lstm_input = tf.concat([features, embed], axis=-2) if (omit_features == False) else embed
        if verbose:
            print("LSTM input")
            print(lstm_input.shape)

        # Passing the concatenated vector to the LSTM
        output, memory_state, carry_state = self.lstm(lstm_input, initial_state=initial_state)

        if verbose:
            print("LSTM output")

        # Transform LSTM output units to vocab_size
        output = self.fc(output)

        return output, memory_state, carry_state

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
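The concatenation inside the decoder is easier to follow with concrete shapes. The NumPy sketch below mirrors the expand_dims and concat calls with small hypothetical dimensions; it shows that the image features become an extra first time step in front of the caption embeddings.

```python
import numpy as np

batch_size, seq_len, embedding_dim = 2, 5, 8  # hypothetical sizes

features = np.zeros((batch_size, embedding_dim))       # encoded image
embed = np.ones((batch_size, seq_len, embedding_dim))  # embedded caption tokens

# Add a time axis so the image acts as a single time step
features = np.expand_dims(features, 1)  # (2, 1, 8)

# Concatenate along the time axis (axis=-2), image first
lstm_input = np.concatenate([features, embed], axis=-2)

print(lstm_input.shape)  # (2, 6, 8)
```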

Step 9: Training Stage

Before training the model, we will initialize the encoder and decoder and define a few utility functions.

units = embedding_dim = 512 # As in the paper
vocab_size = min(top_k + 1, len(tokenizer.word_index.keys()))

# Initialize encoder and decoder
encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)

# Initialize optimizer
optimizer = tf.keras.optimizers.Adam()

# The labels are integer word indices (not one-hot vectors), and the decoder outputs logits (not probabilities)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

# Computes the per-token loss using SCCE and averages it over each caption
def loss_function(real, pred, verbose=False):
    loss_ = loss_object(real, pred)
    if verbose:
        print("Loss per token")
        print(loss_.shape)
    loss_ = tf.reduce_mean(loss_, axis = 1)
    if verbose:
        print("After Mean Axis 1")
        print(loss_.shape)

    return loss_


def train_step(img_tensor, target, verbose=False):    
    if verbose:
        print("Image tensor")


    # The input would be each set of words without the last one (<end>), to leave space for the first one that
    # would be the image embedding
    dec_input = tf.convert_to_tensor(target[:, :-1])

    # Source:
    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        if verbose:
            print("Features CNN")
        predictions, _, _ = decoder(dec_input, features, verbose=verbose)        
        if verbose:
            print("Predictions RNN")
        caption_loss = loss_function(target, predictions) # (batch_size, )

        # After tape
        total_batch_loss = tf.reduce_sum(caption_loss) # Sum (batch_size, ) => K
        mean_batch_loss = tf.reduce_mean(caption_loss) # Mean(batch_size, ) => K

    # Updated the variables
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(caption_loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))

    return total_batch_loss, mean_batch_loss
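To show what the loss function computes, here is a NumPy sketch of sparse categorical cross-entropy from logits, followed by the mean over the time axis. It is a simplified illustration rather than the Keras implementation, and the targets and logits are hypothetical.

```python
import numpy as np

def sparse_cce_from_logits(targets, logits):
    # targets: (batch, time) integer word indices
    # logits:  (batch, time, vocab) unnormalized scores
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the correct word at every time step
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return -picked.squeeze(-1)  # (batch, time)

targets = np.array([[1, 2], [0, 3]])
logits = np.zeros((2, 2, 4))  # uniform scores over a vocabulary of 4 words

per_token = sparse_cce_from_logits(targets, logits)
per_caption = per_token.mean(axis=1)  # one loss value per caption

print(per_caption.shape)  # (2,)
```

With uniform logits, every word has probability 1/4, so each per-token loss equals log(4).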


The following code snippet will create a TensorFlow checkpoint in your local path to save the decoder and encoder state while training.

checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer = optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    # restoring the latest checkpoint in checkpoint_path
    ckpt.restore(ckpt_manager.latest_checkpoint)

Now, we will start training the model.

EPOCHS = 20  # Assumed value: the number of epochs is not stated in this article
loss_plot = []
start_epoch = 0  # Train from scratch; remove this line to resume from a restored checkpoint

for epoch in range(start_epoch, EPOCHS):
    real_epoch = len(loss_plot) + 1
    start = time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        total_batch_loss, mean_batch_loss = train_step(img_tensor, target, verbose=False)
        total_loss += total_batch_loss

        if batch % 100 == 0:
            print ('Epoch {} Batch {} Batch Loss {:.4f}'.format(real_epoch, batch, mean_batch_loss.numpy()))
    print ('Total Loss {:.6f}'.format(total_loss))
    epoch_loss = total_loss / NUM_STEPS
    # storing the epoch end loss value to plot later
    loss_plot.append(epoch_loss)

    # Save a checkpoint every 5 epochs
    if epoch % 5 == 0:
        ckpt_manager.save()

    print ('Epoch {} Epoch Loss {:.6f}'.format(real_epoch, epoch_loss))
    print ('Time taken for 1 epoch {} sec\n'.format(time() - start))


During each epoch, the weights and parameters of our model will be adjusted by processing the entire training set.



Step 10: Loss Plot

The following code will plot the loss against the number of epochs.

plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()


[Figure: loss plot]

Loss is a numerical value that indicates a discrepancy between the predicted output of the model and the actual captions.

Step 11: Testing the Model

Let’s first define a utility function that will clean the captions.

def clean_caption(caption):
    return [item for item in caption if item not in ['<start>', '<end>', '<pad>']]


Now, let's randomly pick an image from our dataset and generate a caption for it with our model.

test_img_name = random.choice(image_filenames_train)

def get_caption(img):    
    # Add image to an array to simulate batch size of 1    
    features = encoder(tf.expand_dims(img, 0))
    caption = []
    dec_input = tf.expand_dims([], 0)
    # Inputs the image embedding into the trained LSTM layer and predicts the first word of the sequence.
    # The output, hidden and cell states are passed again to the LSTM to generate the next word.
    # The iteration is repeated until the caption reaches the max length.
    state = None
    for i in range(1, max_caption_length):
        predictions, memory_state, carry_state = \
            decoder(dec_input, features, omit_features=i > 1, initial_state=state)

        # Takes maximum index of predictions
        word_index = np.argmax(predictions.numpy().flatten())

        # Map the index back to its word and add it to the caption
        caption.append(tokenizer.index_word[word_index])

        dec_input = tf.expand_dims([word_index], 0)
        state = [memory_state, carry_state]
    # Filter caption
    return clean_caption(caption)

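raw_img = load_image(test_img_name)[0]
img = images_dict[test_img_name]
captions = images_captions_dict[test_img_name]

# load_image scales pixel values to [-1, 1]; map them back to [0, 1] for display
plt.imshow((raw_img + 1) / 2)

print("Real captions")
for caption in captions:
    print(caption)

print("Estimated caption")
estimated_caption = get_caption(img)
print(estimated_caption)
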


[Figure: image caption result]

As you can see, the model generated a fairly accurate caption for the input image. 
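The caption is produced by greedy decoding: at each step the most likely next word is chosen and fed back as the next input. The toy sketch below illustrates the idea with a hypothetical lookup table of scores in place of the trained decoder. Stopping early at the <end> token is an addition made in this sketch.

```python
# Hypothetical next-word scores standing in for the decoder's predictions
next_word_scores = {
    "<image>": {"<start>": 0.9, "a": 0.1},
    "<start>": {"a": 0.7, "dog": 0.3},
    "a": {"dog": 0.8, "<end>": 0.2},
    "dog": {"runs": 0.6, "<end>": 0.4},
    "runs": {"<end>": 0.9, "a": 0.1},
}

def greedy_decode(first_input, max_length):
    caption, current = [], first_input
    for _ in range(max_length):
        scores = next_word_scores[current]
        # Greedy choice: keep only the highest-scoring word
        current = max(scores, key=scores.get)
        caption.append(current)
        if current == "<end>":
            break
    # Remove the special tokens, as clean_caption does
    return [w for w in caption if w not in ("<start>", "<end>", "<pad>")]

print(greedy_decode("<image>", 10))  # ['a', 'dog', 'runs']
```

Greedy decoding is simple and fast, but because it commits to one word per step, it can miss captions that would score better overall.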

Frequently Asked Questions

What is an RNN?

RNN stands for Recurrent Neural Network, a type of network commonly used in machine learning for processing sequential data such as time series, text, and speech. Its recurrent connections let it retain information from previous steps and use it while processing the current input.

What is LSTM?

LSTM is a type of recurrent neural network architecture. It stands for Long short-term memory. It can selectively remember or forget information from previous steps by using gating mechanisms.

What is Keras?

Keras is a high-level deep learning framework written in Python that provides a user-friendly interface for building neural networks. It runs on top of deep learning backends such as TensorFlow.


In this article, you learned how to build an image caption generator using TensorFlow and Keras.


Happy Learning!
