Table of contents
1. Introduction
2. CNNs and how they work
3. Pooling
4. Frequently Asked Questions
5. Key Learnings
Last Updated: Mar 27, 2024

Convolutional Neural Network: What makes it so good for Image Learning?

Author: Arun Nawani

Introduction

Computer vision and deep learning have created quite a buzz in the tech industry over the past few years. They're seen as lucrative, future-ready career choices. But have you ever thought about what computer vision actually deals with? Computer vision is a subfield of Artificial Intelligence that aims to derive relevant information from visual input such as images and videos. But how exactly do we make a computer perform image learning? Unlike textual data, it isn't something we intuitively expect a machine to process. There are various techniques for this, and in this blog we'll discuss one of them: the Convolutional Neural Network (CNN).

Also read: Resnet 50 Architecture

CNNs and how they work

To understand image learning and how CNNs serve that purpose, let us first consider how humans understand images. It's simple: we look for salient features that are characteristic of a particular object, and the more of these characteristics we find, the more confident we are in the match. For example, to identify a tiger, we would look for characteristic features from our acquired knowledge of tigers. We might look for stripes, but that alone won't be enough, since there are other striped animals. Then we may look for sharp, long canines. These are some of the defining features of a tiger. What we are doing, essentially, is called feature detection.

CNNs work in a similar manner. Let us first understand how a CNN perceives an image before it begins its operations. An image can be represented by a matrix of 1s and 0s, or of 1s and -1s in the case of a strict shape search.

 

Source - link

Here we have the number 9, which we have divided into a grid filled with 1s and -1s. The 1s in the grid make up the 9, and the empty spaces are filled with -1s.

We identify three characteristic features of '9': the loop, the vertical line in the middle, and the short diagonal line that makes up the tail. So we have three filters, one for each feature. We superimpose each of these filters on an image to check whether it contains a '9'.

We take the loop filter (refer to figure (c)) identified in the number 9 and slide it over the 1/-1 grid of the image we want to check. At each position, we multiply the overlapping cells and take the average of the resulting numbers. The filter is slid across the entire grid, one position at a time. Think of the image as the parent matrix and the filter as a submatrix. Look at the image below: we multiply every cell in the filter by the corresponding cell of the submatrix in the grid image and then take the average of the resulting products.

In the case below, we do the following math:

1*(-1) + 1*1 + 1*1 + 1*(-1) + (-1)*1 + 1*(-1) + 1*(-1) + 1*1 + 1*1 = -1

Taking the average, we get -1/9 ≈ -0.11.

Source - link

 

Source - link
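
To make this sliding-and-averaging step concrete, here is a minimal NumPy sketch of the matching score described above. The filter values and the toy grid are illustrative assumptions, not the exact values from the figures.

```python
import numpy as np

def feature_map(image, kernel):
    """Slide `kernel` over `image` and record the averaged match score
    at every position; a perfect match scores 1, as described above."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(ih - kh + 1):
        for j in range(iw - kw + 1):
            window = image[i:i + kh, j:j + kw]
            # Elementwise multiply, then average, exactly as in the worked example.
            out[i, j] = np.mean(window * kernel)
    return out

# A hypothetical 3x3 loop-like filter and a toy 5x5 grid of 1s and -1s.
kernel = np.array([[ 1,  1,  1],
                   [ 1, -1,  1],
                   [ 1,  1,  1]])
grid = -np.ones((5, 5))
grid[0:3, 1:4] = kernel  # embed an exact match of the filter in the grid

print(feature_map(grid, kernel))  # the cell at the match location reads 1.0
```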

So for a submatrix that exactly matches the filter, the resultant value is 1. This indicates a feature detection, and the location of the detected feature can be read off the resultant matrix. This resultant matrix is known as a feature map. A feature map is created for every filter we have. After this, feature maps may be aggregated to form another feature map that represents all the details of its component feature maps. Look at the example given below for a better understanding.

Source - link

Suppose we train a model to detect a koala. We identify five filters, namely: eyes, nose, ears, hands, and legs. From the feature maps, we detect the presence of all five features. The eyes, nose, and ears feature maps can be combined to derive the presence of a koala head.

Similarly, legs and hands feature maps can be aggregated to derive the presence of a koala body. This results in two new feature maps, depicting the head and the body of the koala. 

These feature maps are 2D arrays, so they are converted into a single-dimensional array for the deep neural network to process. This process is known as 'Flattening'. The flattened vector is fed into a fully connected neural network, which then classifies the image based on the input it receives from the feature map aggregation.
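
Here is a small sketch of what flattening amounts to, using two hypothetical 3 x 3 feature maps (the values are made up for illustration):

```python
import numpy as np

# Two hypothetical 3x3 feature maps, e.g., the "head" and "body" maps
# from the koala example above (values are illustrative).
head_map = np.array([[0.2, 0.9, 0.1],
                     [0.0, 1.0, 0.3],
                     [0.1, 0.8, 0.0]])
body_map = np.array([[0.7, 0.1, 0.0],
                     [0.9, 0.2, 0.1],
                     [0.6, 0.0, 0.2]])

# Flattening: each 2D map becomes a 1D vector, and the vectors are
# concatenated into one input vector for the fully connected network.
flattened = np.concatenate([head_map.ravel(), body_map.ravel()])
print(flattened.shape)  # (18,) - one input neuron per value
```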

Pooling

We know CNNs can detect characteristic features in an object, but without pooling this detection is not location invariant. To make feature detection location invariant, we add pooling layers. Pooling essentially reduces the size of the original matrix. Have a look at the image given below:

 

Source - link

In the above image, we take a 2 x 2 filter with a stride of 2 to generate a smaller feature map containing the maximum value from every 2 x 2 window. We can see that 8 is the maximum value in the first 2 x 2 window, 9 in the second, 3 in the third, and 2 in the fourth. There is another pooling technique called average pooling. It differs from max pooling in that, instead of taking the maximum value in the window, it takes the average of all the values within it. Max pooling, however, is more commonly used.

This decrease in the size of the feature map significantly reduces the computational cost.
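
Below is a minimal NumPy sketch of max pooling with a 2 x 2 window and a stride of 2. The input values are assumed so that the window maxima come out to 8, 9, 3, and 2, matching the description above; the actual figure may differ.

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Max pooling: keep only the largest value in each window."""
    h, w = fmap.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i * stride:i * stride + size,
                          j * stride:j * stride + size]
            out[i, j] = window.max()  # use window.mean() for average pooling
    return out

# A hypothetical 4x4 feature map whose 2x2 window maxima are 8, 9, 3, 2.
fmap = np.array([[1, 8, 2, 9],
                 [4, 3, 5, 7],
                 [1, 3, 0, 1],
                 [2, 2, 1, 2]])
print(max_pool(fmap))  # [[8. 9.] [3. 2.]]
```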

Source - link

As we can see, there is a shift in the position of the number on the grid, yet pooling has effectively captured the loop pattern. This shows how pooling makes a CNN robust to changes in the location of features.
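
A tiny toy demonstration of this invariance: shift a feature activation by one pixel, and the pooled output stays the same as long as the activation remains inside the same pooling window. The values here are assumptions for illustration.

```python
import numpy as np

# A single strong activation at one position in a 4x4 feature map.
a = np.array([[0, 9, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
b = np.roll(a, 1, axis=0)  # shift the activation down by one row

# 2x2 max pooling with stride 2, expressed as a reshape over blocks.
pool = lambda m: m.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(np.array_equal(pool(a), pool(b)))  # True - pooling absorbed the shift
```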

So there are three main advantages of pooling - 

  • Reduces computational cost by reducing the dimensions
  • Since there are fewer features to deal with, it also reduces the risk of overfitting
  • Most importantly, it makes the model robust to changes in the position of the object. 

 

Source - link

To summarise, this is what a CNN looks like in its entirety. We start with the visual input and apply convolution followed by pooling, repeating this process several times before flattening the final pooled feature map, which is then fed into the deep neural network to make the classification. Image learning probably sounds a bit more intuitive now, right?
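
As a rough sketch, here is how that pipeline might look in Keras. The layer sizes, input shape, and the 10-class output are illustrative assumptions, not values from this article.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),          # e.g., a grayscale digit image
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # convolution: feature detection
    tf.keras.layers.MaxPooling2D(2),                   # pooling: shrink + location invariance
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # repeat the conv + pool block
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),                         # flattening: 2D maps -> 1D vector
    tf.keras.layers.Dense(64, activation="relu"),      # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),   # classify into 10 classes
])
model.summary()
```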

Frequently Asked Questions

  1. What is a convolutional neural network? 
    Ans. Convolutional Neural Networks are a type of Artificial Neural Network widely used for extracting relevant information from visual input. Computer vision is the subfield of AI that deals with image learning.
     
  2. Briefly explain feature detection in CNNs. 
    Ans. CNNs make use of filters, which are essentially feature detectors, and superimpose them on the 1/-1 grid image. If a filter exactly matches a submatrix, that location is activated in the resultant feature map.
     
  3. Why do we need a deep neural network after identifying features using filters? 
    Ans. The deep neural network makes the classification based on the input it receives. Filters only detect individual features; the fully connected layers learn how to combine those detections into a final prediction.
     
  4. Briefly explain Pooling. 
    Ans. Pooling is a dimension-reducing technique that also makes the model robust to changes in the position of features. There are two types of pooling: max pooling and average pooling.

Key Learnings

Computer vision and deep learning are emerging fields in the domain of artificial intelligence. In this blog, we learnt how CNNs are used for image learning and what makes them so efficient at it. We also learnt about the high-level architecture of CNNs. However, this is just the beginning. Check out our expert-curated courses on deep learning if you're looking to build a career in the domain.

Check out this article - Padding In Convolutional Neural Network

Happy learning
