Table of contents

  1. Introduction
  2. About MobileNet
  3. Depth-wise Separable Convolution
    3.1. Depth-wise convolution
    3.2. Point-wise convolution
  4. MobileNet Architecture
  5. Performance of MobileNet
  6. Implementation of MobileNet
  7. Frequently Asked Questions
  8. Conclusion
Last Updated: Mar 27, 2024

MobileNet

Introduction

Computer vision is an astounding field of computer science that allows computers and machines to perceive and understand the world through images. It has turned technologies once thought impossible, such as driverless cars, into reality. The most popular family of algorithms behind computer vision is the Convolutional Neural Network (CNN). Since AlexNet won the ILSVRC in 2012, deep convolutional neural networks have become popular with machine learning practitioners. Models are made deeper and more complex to reach higher accuracy, which costs a great deal of memory and training time. But in practical problems such as facial recognition or driverless cars, the computer vision task must be performed quickly on machines with limited computational power. Let us look at an efficient model known as MobileNet that solves this problem.

Deep learning models are coming to our smartphones

About MobileNet

MobileNets are lightweight, efficient convolutional neural networks designed for mobile applications. The original version of MobileNet (MobileNetV1) was introduced in 2017 by Google, followed by MobileNetV2 in 2018 and MobileNetV3 in 2019.

They are computer vision models, released by Google with TensorFlow implementations, that aim to achieve two main goals:

  • Smaller model size - fewer parameters
  • Smaller model complexity - fewer calculations 

 

MobileNets are small, low-latency networks built for mobile and embedded vision applications.


Depth-wise Separable Convolution

Depth-wise separable convolutions form the backbone of MobileNet. The main idea behind a depth-wise separable convolution is that a standard convolution can be split (separated) into two simpler operations, hence the name separable. Depth-wise separable convolutions are made of two layers: a depth-wise convolution and a point-wise convolution.

Depth-wise convolution

A typical image input is not two-dimensional: it has multiple channels and therefore a three-dimensional shape. In other words, both the image and the filter have depth. A depth-wise convolution breaks the image down into its constituent channels and applies a separate filter to each channel.

Normal Convolution network

 

For example, consider an image with three channels. A regular convolution applies each filter across all channels at once, while in depth-wise convolution a separate filter is applied to each input channel. The output of a standard convolution (with one filter) is a single feature map, whereas in depth-wise convolution the number of output channels equals the number of input channels.

Depth-wise Convolution
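
As a quick illustration, here is a minimal Keras sketch (assuming TensorFlow 2.x; the batch and image size are arbitrary example values) showing that a depth-wise convolution keeps one output channel per input channel, while a standard convolution with a single filter collapses them into one:

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, DepthwiseConv2D

# A dummy batch containing one 32x32 RGB image (3 input channels).
images = tf.random.normal((1, 32, 32, 3))

# Standard convolution with a single 3x3 filter: all channels are mixed
# into one output feature map.
standard_out = Conv2D(filters=1, kernel_size=3, padding='same')(images)
print(standard_out.shape)   # (1, 32, 32, 1)

# Depth-wise convolution: a separate 3x3 filter per input channel, so the
# number of output channels equals the number of input channels.
depthwise_out = DepthwiseConv2D(kernel_size=3, padding='same')(images)
print(depthwise_out.shape)  # (1, 32, 32, 3)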

Point-wise convolution

Depth-wise convolution is very efficient compared to normal convolution. However, its output still has separate channels, and we need to combine them for further processing. A 1×1 convolution, known as point-wise convolution, is applied to the output of the depth-wise convolution. The image below shows point-wise convolution transforming the three channels into a single layer.

 

Point-wise convolution
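
Continuing the same kind of sketch (again with hypothetical shapes and TensorFlow 2.x assumed), a 1×1 point-wise convolution with a single filter merges the per-channel outputs back into one layer:

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, DepthwiseConv2D

images = tf.random.normal((1, 32, 32, 3))
depthwise_out = DepthwiseConv2D(kernel_size=3, padding='same')(images)

# Point-wise convolution: a 1x1 convolution whose single filter mixes the
# three per-channel feature maps into one output layer.
pointwise_out = Conv2D(filters=1, kernel_size=1)(depthwise_out)
print(pointwise_out.shape)  # (1, 32, 32, 1)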

A depth-wise separable convolution is formed by combining a depth-wise convolution with a point-wise convolution. Using depth-wise separable convolutions drastically reduces the number of parameters and multiplications. According to the original paper, the computation of a depth-wise separable convolution, relative to a standard convolution, is reduced by a factor of:

1/N + 1/D_K²

where D_K is the spatial dimension of the kernel (assumed to be square) and N is the number of output channels.

Depth-wise separable convolution
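
To get a feel for these numbers, a short back-of-the-envelope calculation (the feature-map size and channel counts below are arbitrary example values, not taken from the paper) compares the multiply-accumulate cost of a standard convolution with that of a depth-wise separable convolution:

# Example values: a 3x3 kernel (D_K = 3) on a 112x112 feature map (D_F = 112)
# with M = 32 input channels and N = 64 output channels.
D_K, D_F, M, N = 3, 112, 32, 64

# Standard convolution cost: D_K * D_K * M * N * D_F * D_F multiplications.
standard_cost = D_K * D_K * M * N * D_F * D_F

# Depth-wise separable cost: depth-wise part plus point-wise part.
separable_cost = D_K * D_K * M * D_F * D_F + M * N * D_F * D_F

print(separable_cost / standard_cost)  # ~0.127, roughly an 8x reduction
print(1 / N + 1 / D_K ** 2)            # the same ratio: 1/N + 1/D_K^2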

MobileNet Architecture

MobileNet uses 3×3 depth-wise separable convolutions, which reduce computational cost by a factor of 8 to 9 compared to standard convolutions, with only a minor reduction in accuracy.

The figure below compares a standard convolution block with a MobileNet block:

A MobileNet block vs a standard convolution

  1. The standard block is a normal convolution layer followed by batch normalization and a ReLU activation function.
  2. The MobileNet block is a depth-wise separable convolution layer: a 3×3 depth-wise convolution followed by batch normalization and ReLU, then a 1×1 point-wise convolution followed by batch normalization and ReLU.

 

Complete MobileNet architecture:

Counting the depth-wise and point-wise convolutions as separate layers, MobileNet has 28 layers.
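
If you simply want to inspect or reuse the full architecture, Keras ships a pre-built MobileNet. The snippet below is a minimal sketch; the input shape and ImageNet weights are common defaults rather than requirements:

from tensorflow.keras.applications import MobileNet

# Build MobileNetV1 with pre-trained ImageNet weights; model.summary() shows
# the alternating depth-wise and point-wise convolution layers.
model = MobileNet(input_shape=(224, 224, 3), weights='imagenet')
model.summary()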


Performance of MobileNet

We can see below that MobileNet uses far fewer parameters than a standard convolutional network, with only a 1.1% reduction in accuracy.

 

Standard convolution vs MobileNet

MobileNet performs on par with other popular models despite having significantly fewer parameters:

MobileNet vs popular models

Implementation of MobileNet

To implement a MobileNet block (using the DepthwiseConv2D, Conv2D, BatchNormalization, and ReLU layers from tensorflow.keras.layers):

First, we add the 3×3 depth-wise convolution layer:

x = DepthwiseConv2D(kernel_size = 3, strides = 2, padding = 'same')(x)

 

Adding Batch Normalization and ReLU:

x = BatchNormalization()(x)
x = ReLU()(x)

 

Then, we add the point-wise convolution layer, which is simply a 1×1 convolution:

x = Conv2D(filters = 64, kernel_size = 1, strides = 1)(x)

 

Adding batch normalization and ReLU:

x = BatchNormalization()(x)
x = ReLU()(x)
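
Putting the steps together, here is one way to wrap them into a reusable block (a sketch assuming TensorFlow/Keras; the input shape, stride, and filter count are illustrative choices rather than values prescribed by the paper):

from tensorflow.keras.layers import (BatchNormalization, Conv2D,
                                     DepthwiseConv2D, Input, ReLU)
from tensorflow.keras.models import Model

def mobilenet_block(x, filters, strides=1):
    # Depth-wise separable convolution: 3x3 depth-wise convolution -> BN ->
    # ReLU, followed by a 1x1 point-wise convolution -> BN -> ReLU.
    x = DepthwiseConv2D(kernel_size=3, strides=strides, padding='same')(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    x = Conv2D(filters=filters, kernel_size=1, strides=1)(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    return x

# Illustrative usage: a 224x224 RGB input followed by a single block.
inputs = Input(shape=(224, 224, 3))
outputs = mobilenet_block(inputs, filters=64, strides=2)
model = Model(inputs, outputs)
model.summary()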

Frequently Asked Questions

  1. Are there other models that require less computational power than a standard convolutional network?
    Yes, there are other efficiency-focused models, for example SqueezeNet and ShuffleNet.

  2. How can MobileNet work on smartphones?
    MobileNet can run on smartphones through frameworks such as TensorFlow Lite, which lets mobile applications execute the model on-device.

  3. How many versions of MobileNet are there?
    There are three versions of MobileNet: MobileNetV1, MobileNetV2, and MobileNetV3.

Conclusion

This blog discussed the need for a lightweight neural network model that performs well with limited computational power, and how MobileNet caters to that need. We learned about depth-wise separable convolution, the backbone of MobileNet. We saw the architecture of MobileNet and how to implement a standard MobileNet block in code. You can learn more about other popular models such as AlexNet, ResNet, and ZFNet at Coding Ninjas. To get a complete understanding of various machine learning algorithms, check out our Machine Learning course.
