Depthwise Separable Convolution
Depthwise separable convolutions form the backbone of MobileNet. The key idea is that a standard convolution can be factored into two simpler operations (hence the name separable): a depthwise convolution followed by a pointwise convolution.
Depthwise convolution
A typical image input is not two-dimensional; it has multiple channels and therefore a three-dimensional shape. We can say that both the image and the filter have depth. A depthwise convolution breaks the image down into its constituent channels and then applies a separate filter to each channel.
Normal Convolution network
For example, suppose we have an image with three channels. A regular convolution applies each filter across all input channels at once, while a depthwise convolution applies a single filter to each input channel. The output of one standard convolution filter is a single layer, whereas in depthwise convolution the number of output layers equals the number of input channels.
Depthwise Convolution
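Assuming a TensorFlow/Keras environment, this contrast shows up directly in the layer output shapes (the input size below is arbitrary):

```python
import tensorflow as tf

# A dummy 3-channel input; height and width are arbitrary.
inp = tf.random.normal((1, 8, 8, 3))

# Depthwise: one filter per input channel, so output channels == input channels.
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding='same')
# Standard: a single filter spans all channels and produces one output layer.
standard = tf.keras.layers.Conv2D(filters=1, kernel_size=3, padding='same')

print(depthwise(inp).shape)  # (1, 8, 8, 3)
print(standard(inp).shape)   # (1, 8, 8, 1)
```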
Pointwise convolution
Depthwise convolution is very efficient compared to normal convolution. However, its output keeps the channels separate, and we need to combine them for further processing. A 1×1 convolution, known as pointwise convolution, is applied to the output of the depthwise convolution. The image below shows pointwise convolution transforming the three channels into a single layer.
Pointwise convolution
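This channel-mixing step can be sketched in Keras as follows (the input here stands in for a depthwise convolution's output, and the sizes are illustrative):

```python
import tensorflow as tf

# Three separate channel maps, as produced by a depthwise convolution.
x = tf.random.normal((1, 8, 8, 3))

# A 1x1 (pointwise) convolution mixes the channels; with a single filter
# it collapses the three maps into one output layer, as in the figure.
pointwise = tf.keras.layers.Conv2D(filters=1, kernel_size=1)
print(pointwise(x).shape)  # (1, 8, 8, 1)
```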
A depthwise separable convolution is formed by combining a depthwise convolution with a pointwise convolution. Using depthwise separable convolutions, the number of parameters and computations is reduced drastically. According to the original paper, the computational cost of a depthwise separable convolution relative to a normal convolution is:

1/N + 1/D_{K}^2

Where D_{K} is the spatial dimension of the kernel, assumed to be square, and N is the number of output channels.
Depthwise separable convolution
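To make the saving concrete, we can count the multiplications in both variants; the layer sizes below are hypothetical, chosen only to illustrate the ratio:

```python
def conv_cost(dk, m, n, df):
    """Multiplications in a standard convolution:
    dk x dk kernel, m input channels, n output channels, df x df feature map."""
    return dk * dk * m * n * df * df

def separable_cost(dk, m, n, df):
    """Multiplications in a depthwise convolution plus a pointwise convolution."""
    return dk * dk * m * df * df + m * n * df * df

# Hypothetical layer: 3x3 kernel, 64 in / 128 out channels, 56x56 feature map.
dk, m, n, df = 3, 64, 128, 56
ratio = separable_cost(dk, m, n, df) / conv_cost(dk, m, n, df)
print(round(ratio, 3))  # about 0.119, i.e. roughly an 8-9x reduction
```

Note that the ratio equals 1/N + 1/D_K^2 exactly, independent of the feature map size.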
MobileNet Architecture
MobileNet uses 3×3 depthwise separable convolutions, which cut computational cost by a factor of 8 to 9 with only a minor reduction in accuracy.
A standard MobileNet block looks like:
A MobileNet block vs a standard convolution
 A normal convolution layer with batch normalization and a ReLU activation.
 A depthwise separable convolution layer: a 3×3 depthwise convolution followed by batch normalization and ReLU, then a pointwise convolution followed by batch normalization and ReLU.
Complete MobileNet architecture:
Counting the depthwise and pointwise convolutions as separate layers, MobileNet has 28 layers.
source
Performance of MobileNet
We can see below that MobileNet uses far fewer parameters than a standard convolutional network, with only a 1.1% reduction in accuracy.
Standard convolution vs MobileNet
MobileNet performs on par with other popular models despite having significantly fewer parameters:
MobileNet vs popular models
Implementation of MobileNet
To implement a MobileNet block:
First, we add a 3×3 depthwise convolution layer (here with stride 2):
x = DepthwiseConv2D(kernel_size = 3, strides = 2, padding = 'same')(x)
Adding Batch Normalization and ReLU:
x = BatchNormalization()(x)
x = ReLU()(x)
Then, we add a pointwise convolution layer, which is simply a 1×1 convolution:
x = Conv2D(filters = 64, kernel_size = 1, strides = 1)(x)
Adding batch normalization and ReLU:
x = BatchNormalization()(x)
x = ReLU()(x)
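Putting the steps above together, a complete block can be sketched as a helper function; the filter count, stride, and input size below are placeholders, not values from the paper:

```python
from tensorflow.keras import layers, Model

def mobilenet_block(x, filters, strides=1):
    # 3x3 depthwise convolution, then batch normalization and ReLU.
    x = layers.DepthwiseConv2D(kernel_size=3, strides=strides, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # 1x1 pointwise convolution, then batch normalization and ReLU.
    x = layers.Conv2D(filters=filters, kernel_size=1, strides=1)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x

# Example: a stride-2 block halves the spatial size and sets the channel count.
inputs = layers.Input(shape=(224, 224, 3))
outputs = mobilenet_block(inputs, filters=64, strides=2)
print(Model(inputs, outputs).output_shape)  # (None, 112, 112, 64)
```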
Frequently Asked Questions

Are there more models requiring less computational power than the standard convolution network?
Yes, there are other efficiency-focused architectures, for example SqueezeNet and ShuffleNet.

How can MobileNet work on smartphones?
MobileNet can run on smartphones through frameworks such as TensorFlow Lite, which execute trained models directly on mobile devices.

How many versions of MobileNet are there?
There are three versions of MobileNet: MobileNetV1, MobileNetV2, and MobileNetV3.
Conclusion
This blog discussed the need for a lightweight neural network model that performs well with limited computational power and how MobileNet meets that need. We learned about depthwise separable convolution, the backbone of MobileNet, saw the MobileNet architecture, and implemented a standard MobileNet block in code. You can learn more about other popular models like AlexNet, ResNet, and ZFNet at Coding Ninjas. To get a complete understanding of various machine learning algorithms, check out our Machine Learning course.