Table of contents
What is Visual Question Answering?
Architecture of Visual Question Answering 
Visual Question Answering dataset
Importing the libraries
Implementing CNN model for image recognition
Implementing RNN model for Natural Language Processing
Combining the results from CNN and RNN to deliver the final answer
Current Approach
Approach based on Attention
Frequently Asked Questions
What is VQA used for?
What is visual grounding?
What is visual quality assurance?
Last Updated: Mar 27, 2024

Visual QA


Every day, we come across websites or applications that ask us to pass security checks, such as confirming "I am not a robot". Building a system that can answer natural language questions about any image has long been considered a very ambitious goal, even though we humans can usually answer such questions without difficulty. The task of having a machine answer a question about an image, for example whether the right image was picked, is known as Visual Question Answering (VQA) in Deep Learning.

What is Visual Question Answering?

Visual Question Answering is defined as the ability of a machine to read a picture and answer questions about it. The algorithm takes an image and a natural language question about that image as input and generates a natural language answer as output. Along with the answer, a model may expose its reasoning, for example as visual attention maps. The task combines several computer vision problems, such as object classification, detection, and localization. What sets Visual Question Answering apart is that the search and the reasoning must be performed over the content of an image.


Architecture of Visual Question Answering 

To answer visual questions, the model employs two well-known Deep Learning architectures: the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). The CNN is used for image recognition, whereas the RNN is used for natural language processing. The features from both networks are merged and fed into a fully connected multi-layer perceptron, which can be trained as a standard multi-class classifier across all answer classes. The network outputs a probability distribution over all possible answer classes.

Figure: The architecture of VQA (Source: Miro Medium)

Visual Question Answering dataset

The VQA dataset is larger than earlier datasets for this task. It contains 204,721 images from the COCO dataset along with 50,000 abstract cartoon images. It has three questions per image and ten answers per question. VQA has two answering modes: open-ended and multiple-choice. For open-ended, an answer is considered 100% correct if at least 3 workers have provided the same answer. For multiple-choice, 18 candidate answers, both correct and incorrect, are created per question:

  • Correct answer: the most common answer given by the ten annotators.
  • Plausible answers: 3 answers that annotators gave without looking at the image.
  • Popular answers: the top 10 most popular answers.
  • Random answers: correct answers randomly selected from other questions.
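The open-ended scoring rule can be sketched in a few lines of Python. This is a minimal sketch, assuming the commonly used form min(matches / 3, 1) and a simple lowercase comparison; the function name `vqa_accuracy` and the sample annotations are made up for illustration and are not taken from the dataset.

```python
def vqa_accuracy(predicted, human_answers):
    """Open-ended score: full credit once at least 3 annotators gave the answer."""
    normalized = predicted.strip().lower()
    # Count annotators whose answer matches the prediction (case-insensitive).
    matches = sum(1 for a in human_answers if a.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


# Hypothetical annotations: ten human answers for a single question.
answers = ["red"] * 4 + ["dark red"] * 2 + ["maroon"] * 3 + ["brown"]

print(vqa_accuracy("red", answers))       # 4 matching annotators: full credit
print(vqa_accuracy("dark red", answers))  # 2 matching annotators: partial credit
print(vqa_accuracy("blue", answers))      # no matching annotators: no credit
```

An answer given by only one or two annotators still earns partial credit under this rule, which softens the penalty for answers that are reasonable but less common.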


Let us implement a generic model for visual question answering in Keras.

Importing the libraries

import keras  # needed for the keras.layers.concatenate call used when merging the two branches
from keras.models import Model, Sequential
from keras.layers import Input, LSTM, Embedding, Dense
from keras.layers import Conv2D, MaxPooling2D, Flatten

Implementing CNN model for image recognition

cnn_model = Sequential()
# Two convolution blocks, each followed by max pooling
cnn_model.add(Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(224, 224, 3)))
cnn_model.add(Conv2D(64, (3, 3), activation='relu'))
cnn_model.add(MaxPooling2D((2, 2)))
cnn_model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
cnn_model.add(Conv2D(128, (3, 3), activation='relu'))
cnn_model.add(MaxPooling2D((2, 2)))
# Flatten the feature maps into a vector so they can be concatenated with the question encoding
cnn_model.add(Flatten())

#Taking image input for the CNN model
input_image = Input(shape=(224, 224, 3))
encoded_image = cnn_model(input_image)

Implementing RNN model for Natural Language Processing

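# Each question is a sequence of 100 integer word indices
input_question = Input(shape=(100,), dtype='int32')
# Map each of the 10,000 vocabulary indices to a 256-dimensional vector
embedded_question = Embedding(input_dim=10000, output_dim=256)(input_question)
# Encode the whole question as a single 256-dimensional vector
encoded_question = LSTM(256)(embedded_question)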

Combining the results from CNN and RNN to deliver the final answer

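# Concatenate the question vector and the image vector
layers_merged = keras.layers.concatenate([encoded_question, encoded_image])
# Softmax over 1000 candidate answer classes
output = Dense(1000, activation='softmax')(layers_merged)
# The inputs must be the tensors defined earlier: input_image and input_question
vqa_model = Model(inputs=[input_image, input_question], outputs=output)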

Current Approach

The approaches that are used in Visual Question Answering can be outlined as follows:

  • Extract features from the question.
  • Extract features from images.
  • To generate the answer, combine the features.

Techniques like BOW (Bag-Of-Words) or LSTM (Long Short-Term Memory) are used for text features, and CNNs pre-trained on ImageNet are used for image features. Approaches generally model answer generation as a classification task. The main difference between approaches is the way in which they combine the textual and image features. An approach can either concatenate the textual and image features and feed them to a linear classifier, or use Bayesian models to infer the underlying relationship between the feature distributions of the question, image, and answer.
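The concatenation variant can be illustrated with a small sketch in plain Python. Everything here is an assumption for illustration: the helper names `bag_of_words` and `linear_scores`, the toy vocabulary, the three-number stand-in for CNN image features, and the hand-picked classifier weights. A real system would learn the weights from data.

```python
def bag_of_words(question, vocabulary):
    """Count how often each vocabulary word occurs in the question."""
    tokens = question.lower().replace("?", "").split()
    return [float(tokens.count(word)) for word in vocabulary]


def linear_scores(features, weights, biases):
    """One score per answer class, computed as w . x + b."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


# Toy vocabulary and a stand-in for features produced by a pre-trained CNN.
vocabulary = ["what", "color", "is", "the", "ball", "how", "many"]
image_features = [0.9, 0.1, 0.0]

# Step 1 and 2: extract question features and image features.
question_features = bag_of_words("What color is the ball?", vocabulary)
# Step 3: combine the features by concatenation.
joint = question_features + image_features

# Hand-picked weights (7 word weights + 3 image weights) for two answer classes.
answers = ["red", "two"]
weights = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],  # "red" responds to "color", "ball", image feature 0
    [0, 0, 0, 0, 0, 1, 1, 0, 1, 0],  # "two" responds to "how", "many", image feature 1
]
biases = [0.0, 0.0]

scores = linear_scores(joint, weights, biases)
best = answers[scores.index(max(scores))]
print(joint)
print(scores, best)
```

Because the classifier only sees the concatenated vector, it cannot model which word relates to which part of the image, which is one motivation for the attention-based approach.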

Approach based on Attention

The aim of such an approach is to focus the algorithm on the most relevant portion of the input. If the question is, "What is the color of the ball?", then the region of the image containing the ball is more relevant than the rest of the image. In a similar way, the words "color" and "ball" are more informative than the other words.

Generally, in Visual Question Answering, spatial attention is preferred for generating region-specific features when training the CNNs. Two methods are used to obtain the spatial regions of an image. The first is to project a grid over the image; once the grid is applied, the relevance of each region is determined by the specific question. The second is to propose automatically generated bounding boxes. Given the question and the proposed regions, we can determine how relevant each region's features are and pick the ones that are important for answering the question.
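Question-guided spatial attention over grid regions can be sketched as follows. This is a minimal sketch, assuming dot-product relevance followed by a softmax, which is one common choice rather than the only one; the function names, the 2x2 grid, and all feature values are hypothetical.

```python
import math


def softmax(values):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def attend(question_vector, region_features):
    """Weight each image region by its relevance to the question."""
    # Relevance of a region = dot product between question and region features.
    relevance = [sum(q * r for q, r in zip(question_vector, region))
                 for region in region_features]
    weights = softmax(relevance)
    dim = len(region_features[0])
    # Attended feature = weighted sum of the region features.
    attended = [sum(w * region[i] for w, region in zip(weights, region_features))
                for i in range(dim)]
    return weights, attended


# Hypothetical 2x2 grid: one feature vector per cell, listed row by row.
regions = [
    [0.1, 0.0, 0.2],  # top-left
    [0.0, 0.1, 0.1],  # top-right
    [2.0, 1.5, 0.1],  # bottom-left: the cell that contains the ball
    [0.2, 0.1, 0.0],  # bottom-right
]
question = [1.0, 1.0, 0.0]  # made-up encoding of "What is the color of the ball?"

weights, attended = attend(question, regions)
print(weights)
print(attended)
```

The same scheme applies to bounding-box proposals: each proposed box contributes one feature vector in place of a grid cell.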


A common metric for evaluating the answers is WUPS. It is a measure that was proposed by Malinowski and Fritz in 2014 and is based on the WUP measure by Wu and Palmer. It estimates the semantic distance between the answer and the ground truth as a value between 0 and 1. WUPS relies on WordNet to compute the similarity, using the distance in the semantic tree between the terms contained in the answer and those in the ground truth.

WUPS is a better fit for many cases than classic accuracy. But since it is based on semantic similarity, an answer such as "red" still gets a high score when the ground truth is "black" or "green", because all of them are colors. Another major problem with WUPS is that it only works with short terms that have a WordNet meaning.
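The behaviour described above can be reproduced with a toy example. This is a sketch under stated assumptions: the small `PARENT` tree is a hypothetical stand-in for WordNet, depth is counted with the root at depth 1, and `wups_single` follows the set-based min-of-products form usually given for WUPS, scoring a single question without any thresholding.

```python
# Hypothetical taxonomy (child -> parent); the real measure uses WordNet instead.
PARENT = {
    "entity": None,
    "attribute": "entity",
    "color": "attribute",
    "red": "color",
    "black": "color",
    "green": "color",
    "object": "entity",
    "furniture": "object",
    "chair": "furniture",
}


def path_to_root(term):
    """Return the chain of nodes from the term up to the root."""
    path = [term]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(term):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(term))


def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # The first shared node on the way up from a is the deepest common ancestor.
    lcs = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def wups_single(answer_terms, truth_terms):
    """Score for one question: min of the two directed products of best matches."""
    def one_way(source, target):
        product = 1.0
        for s in source:
            product *= max(wup(s, t) for t in target)
        return product
    return min(one_way(answer_terms, truth_terms), one_way(truth_terms, answer_terms))


print(wup("red", "red"))                 # identical terms
print(wup("red", "black"))               # wrong answer, but also a color
print(wup("red", "chair"))               # unrelated terms
print(wups_single(["red"], ["black"]))
```

In this toy tree, "red" scores much higher against "black" than against "chair", even though both answers would be wrong, which is exactly the weakness noted above.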


Frequently Asked Questions

What is VQA used for?

VQA can be used to obtain information about any image on the Web or on social media. It can also be integrated into image retrieval systems.

What is visual grounding?

Based on a natural language query, visual grounding aims to locate the most relevant object or region in an image. The query can be a sentence, phrase, or multi-round dialogue.

What is visual quality assurance?

The main objective of visual quality assurance is to ensure that your team lead and developers understand an issue. It is therefore important to take the time to record all the information they might need to find and fix the problem.


In this article, we have extensively discussed visual question answering.

