Code360 powered by Coding Ninjas X
Table of contents
Max-Pooling vs Spatial Transformer
Computation of STN
Spatial Transformer Networks in action
Key Takeaways
Last Updated: Mar 27, 2024

Spatial Transformer Network

Author: Soham Medewar


Convolutional Neural Networks are an extraordinarily strong class of models, but they are still hampered by their inability to be spatially invariant to the input data in a parameter-efficient and computationally efficient manner. In this post, I'll discuss a learnable module called the Spatial Transformer, which allows explicit spatial manipulation of data within the network. This differentiable module can be embedded into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditioned on the feature map itself, without any additional training supervision or changes to the optimization process. The use of spatial transformers results in models that learn invariance to translation, scale, rotation, and more generic warping, yielding state-of-the-art performance on a variety of tasks.

To put it another way, STN assists in cropping out and scale-normalizing the relevant region, which can make the future classification work easier and result in improved classification results.

Max-Pooling vs Spatial Transformer

Is it possible for current deep learning models to achieve spatial invariance? The answer is yes, but not in a satisfying way. In max pooling, the model keeps the single most representative pixel from a window of a given size. Because a pixel from a different row or column can be chosen, a small degree of spatial invariance is achieved. The lower layers, however, cannot learn this property, and since the receptive field of a pooling window is so small, each layer can tolerate only a limited amount of spatial variation. Furthermore, pooling discards a great deal of information about where a feature is located: a pooling window will detect a feature regardless of where it occurs inside that window.
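The location-discarding behaviour described above is easy to demonstrate. The sketch below (a minimal NumPy example; the images and pooling size are illustrative, not from the original article) pools two images whose single bright pixel sits at different positions inside each 2 × 2 window, and both pool to the same output:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keeps each window's largest value,
    discarding where inside the window that value occurred."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 9]], dtype=float)
b = np.array([[0, 9, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 9, 0]], dtype=float)

# The 9s sit at different positions within their 2x2 windows,
# yet both images pool to the same output: location info is lost.
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```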

Another question is whether we can define a layer that learns spatial invariance. Again, the answer is yes, and STN is the way to go. DeepMind introduced the spatial transformer network (STN), which applies image transformations, most notably the affine transformation, to modify the image feature map.

The below image represents the structural view of STN.


The spatial transformer's job is to convert the feature map into a different vector space representation. STN is made up of three parts: a localization network, a grid generator, and a sampler.
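These three parts map naturally onto a few lines of PyTorch, where `affine_grid` plays the grid generator and `grid_sample` the sampler. The sketch below is a minimal illustration, not the article's own implementation; the localization network's layer sizes and the 28 × 28 input are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Minimal spatial transformer: localization net -> grid generator -> sampler."""

    def __init__(self, in_channels=1, height=28, width=28):
        super().__init__()
        # Localization network: regresses the 6 affine parameters theta.
        self.loc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * height * width, 32),
            nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialize to the identity transform so training starts stably.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)   # (N, 2, 3) affine matrices
        # Grid generator: one source coordinate per target pixel.
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        # Sampler: bilinear interpolation at those source coordinates.
        return F.grid_sample(x, grid, align_corners=False)

x = torch.randn(4, 1, 28, 28)
out = STN()(x)
print(out.shape)  # torch.Size([4, 1, 28, 28])
```

Because every step is differentiable, the module can be dropped in front of any classifier and trained end to end with the usual loss.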

The below image represents the spatial transformation process.


The transformation parameters are generated by the localization network, which is made up of fully connected layers or convolution layers. The grid generator is the second part. Once we have the affine transformation parameters, we can compute the corresponding source coordinates. The input of the transformation function is the coordinate of the target feature map. But why the target? Because the target coordinates are known in advance (they form a regular grid over the output), while the matching source coordinates are unknown!

Therefore the task of the grid generator is only to compute the coordinates in the source feature map for each target pixel.

The following formula helps us to calculate the affine transformation matrix.
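In the standard STN formulation, the affine mapping from each target coordinate $(x^{t}, y^{t})$ to its source coordinate $(x^{s}, y^{s})$ can be written as (notation is a sketch of the usual formulation, not taken from the original figure):

```latex
\begin{pmatrix} x^{s} \\ y^{s} \end{pmatrix}
=
\begin{bmatrix}
\theta_{11} & \theta_{12} & \theta_{13} \\
\theta_{21} & \theta_{22} & \theta_{23}
\end{bmatrix}
\begin{pmatrix} x^{t} \\ y^{t} \\ 1 \end{pmatrix}
```

The appended 1 makes the coordinate homogeneous, so the third column of the matrix acts as a translation.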

To find the intensity of each pixel in the target feature map, we use a sampler. The sampler uses bilinear interpolation to generate the pixel intensity at each target coordinate.

The following formula will help us to calculate the pixel intensities of each target coordinate by Bi-linear interpolation.
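With $U$ the source feature map, $V$ the target, and $(x_i^{s}, y_i^{s})$ the source coordinate of target pixel $i$, the bilinear sampling kernel gives (again a sketch of the standard formulation):

```latex
V_{i} = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}\,
\max\!\bigl(0,\ 1 - \lvert x_i^{s} - m \rvert\bigr)\,
\max\!\bigl(0,\ 1 - \lvert y_i^{s} - n \rvert\bigr)
```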


Computation of STN

In this section, we will see how the sampler generates the pixel intensities of the image. Let us consider a 4 × 4 grayscale image. In the image below, the left side shows the pixel intensities and the right side shows the original image.

Now we will make the pixel intensity matrix of the above 4 × 4 grayscale image.

Each grid cell may be thought of as a single pixel, with its center being the pixel coordinate. The pink dots indicate the center points of the cells, as seen in the figure below.

Assume we've determined the affine transformation's theta parameters. The transformation matrix is the leftmost matrix in the above image, and the target coordinate is in the center. We pad the target coordinate with a 1 (a homogeneous coordinate) so that the shift can be performed by matrix multiplication. According to the affine transformation computation shown below, the source coordinate is [2.5, 2.5].
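The same computation can be checked in a couple of lines. The theta values here are hypothetical (a 0.5× scale plus a translation of 2 on each axis, chosen so that [1, 1] lands on [2.5, 2.5]; the article does not state the actual theta):

```python
import numpy as np

# Hypothetical theta: 0.5x scaling plus a translation of 2 on each axis.
theta = np.array([[0.5, 0.0, 2.0],
                  [0.0, 0.5, 2.0]])

# Target coordinate [1, 1], padded with a 1 (homogeneous coordinate)
# so the translation column takes effect under matrix multiplication.
target = np.array([1.0, 1.0, 1.0])

source = theta @ target
print(source)  # [2.5 2.5]
```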

The figure below shows the computation again. Using the affine transformation, we map the [1, 1] target coordinate to the [2.5, 2.5] source coordinate. However, no pink point can be found at this location; the coordinate is fractional! So how do we calculate the intensity at a fractional coordinate? We use bilinear interpolation. The image below illustrates the transformation operation.

The bilinear interpolation formula is shown graphically in the image below. Every pixel contributes some weight to this fractional point, which differs from standard bilinear interpolation: traditional bilinear interpolation considers only the coordinate's nearest neighbours, whereas the STN formulation writes the sum over every point in the feature map (although the kernel assigns zero weight to pixels more than one unit away, so in practice only the nearest neighbours contribute).

Now let us view this image in 3D. The z-axis represents the magnitude of each point's effect on the fractional point. In simple terms, the intensity at this point is a weighted sum: each pink point's pixel intensity multiplied by a weight that decreases with its distance from the fractional point.
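The weighted sum above can be sketched directly from the kernel form. In this toy NumPy example (the 4 × 4 intensity values are illustrative, not the article's image), the coordinate (2.5, 2.5) sits midway between four pixels, so the result is simply their average:

```python
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample the intensity at fractional source coordinate (xs, ys).

    Implements V = sum_n sum_m U[n, m] * max(0, 1-|xs-m|) * max(0, 1-|ys-n|):
    the sum runs over the whole grid, but the kernel zeroes out every pixel
    more than one unit away, so only the nearest neighbours contribute.
    """
    H, W = U.shape
    V = 0.0
    for n in range(H):
        for m in range(W):
            wx = max(0.0, 1.0 - abs(xs - m))
            wy = max(0.0, 1.0 - abs(ys - n))
            V += U[n, m] * wx * wy
    return V

# Toy 4x4 image; (2.5, 2.5) lies midway between four pixels,
# so the result is the average of U[2,2], U[2,3], U[3,2], U[3,3].
U = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(U, 2.5, 2.5))  # 0.25 * (10 + 11 + 14 + 15) = 12.5
```

At an integer coordinate the kernel collapses to a single pixel, so `bilinear_sample(U, 1, 1)` just returns `U[1, 1]`.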


Spatial Transformer Networks in action

Now, let us see how STN helps improve classification performance. After applying STN to the cluttered MNIST dataset, the following results were observed.



Frequently Asked Questions

1. What is STN in deep learning?

Spatial transformer networks (STNs) teach a neural network to perform spatial transformations on an input image in order to improve the model's geometric invariance. For example, an STN can crop a region of interest, scale it, and correct the image's orientation.


2. What is affine transformation in CNN?

An affine transformation is a mapping of a Euclidean space that preserves collinearity and the ratios of distances between collinear points. Alternatively, it is a procedure or function for converting one mathematical set into another, and such a rule can be represented with matrix algebra.


3. What does pooling do in CNN?

Its purpose is to progressively reduce the spatial size of the representation in order to cut down the number of parameters and computations in the network. The pooling layer treats each feature map independently.

Key Takeaways

In this article, we have discussed the following topics:

  • Introduction to STN
  • Difference between STN and Max Pooling
  • Computation of STN

Want to learn more about Machine Learning? Here is an excellent course that can guide you in learning. 

Happy Coding!
