Introduction
Convolutional Neural Networks are an extraordinarily powerful class of models, but they are still hampered by their inability to be spatially invariant to the input data in a parameter-efficient and computationally efficient manner. In this post, I'll discuss a learnable module called the Spatial Transformer, which allows explicit spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditioned on the feature map itself, without any additional training supervision or changes to the optimization process. The use of spatial transformers results in models that learn invariance to translation, scale, rotation, and more generic warping, achieving state-of-the-art performance on a variety of tasks.
To put it another way, an STN helps crop out and scale-normalize the relevant region of the input, which simplifies the subsequent classification task and leads to better classification results.
Max-Pooling vs Spatial Transformer
Can current deep learning models already achieve spatial invariance? The answer is yes, but only to a limited degree. In max-pooling, the model keeps the single most representative pixel from a pool of a given size. If a feature shifts to a different row or column but stays within the same pooling window, the output is unchanged, which yields a small amount of spatial invariance. The lower layers, however, cannot learn this property themselves, and because the receptive field of a pooling operation is so small, each layer can tolerate only tiny spatial shifts. Furthermore, pooling discards a great deal of information about where a feature is located: a feature is detected by a pool over a small grid regardless of where inside that grid it occurs.
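To see just how limited this invariance is, here is a minimal PyTorch sketch using a toy 8×8 map with a single active pixel (this example is illustrative, not from the original paper):

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0                              # a single "feature" activation

shift_small = torch.roll(x, shifts=1, dims=3)    # shift right by 1 pixel
shift_large = torch.roll(x, shifts=4, dims=3)    # shift right by 4 pixels

print(torch.equal(F.max_pool2d(x, 2), F.max_pool2d(shift_small, 2)))
# True: the feature stays inside the same 2x2 pooling window
print(torch.equal(F.max_pool2d(x, 2), F.max_pool2d(shift_large, 2)))
# False: the feature lands in a different pooling window
```

A one-pixel shift is absorbed, but anything larger than the pooling window changes the output, which is exactly the limitation described above.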
Another question is whether we can define a layer that learns spatial invariance explicitly. Again, the answer is yes, and the STN is one way to do it. DeepMind introduced the spatial transformer network (STN), which applies an image transformation, typically an affine transformation, to modify the image feature map.
The image below shows the structure of an STN.
The spatial transformer's job is to warp the input feature map into a transformed output feature map. An STN is made up of three parts: a localization network, a grid generator, and a sampler.
The image below illustrates the spatial transformation process.
The transformation parameters are generated by the localization network, which is made up of fully connected layers or convolutional layers, followed by a regression layer that outputs the affine parameters θ.
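As a concrete, illustrative example, here is a minimal PyTorch localization network. The layer sizes assume a 28×28 single-channel input such as MNIST and are my own choice, not prescribed by the paper; the final layer is initialized to the identity transform so training starts from "no warp":

```python
import torch
import torch.nn as nn

class LocalizationNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(10 * 3 * 3, 32), nn.ReLU(),
            nn.Linear(32, 6),          # the 6 affine parameters theta
        )
        # Start at the identity transform: theta = [1 0 0; 0 1 0].
        self.fc[-1].weight.data.zero_()
        self.fc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):              # x: (N, 1, 28, 28)
        xs = self.features(x)
        theta = self.fc(xs.flatten(1))
        return theta.view(-1, 2, 3)    # one (2, 3) affine matrix per sample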
The grid generator is the second part. Once we have the affine transformation parameters, the computation runs backwards: the known coordinates are those of the target feature map, since the output is laid out on a regular grid, while the corresponding source coordinates are what we need to find. The task of the grid generator is therefore to compute, for each target pixel, the coordinates in the source feature map from which it should be sampled.
The following formula shows how the affine transformation maps each target coordinate back to a source coordinate.
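From the original STN paper, the pointwise transformation is:

$$
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_\theta(G_i)
= A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
$$

Here $(x_i^t, y_i^t)$ are the (known) coordinates of the regular grid over the target feature map, $(x_i^s, y_i^s)$ are the corresponding source coordinates, and $A_\theta$ is the affine matrix predicted by the localization network.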
To find the intensity of each pixel in the target feature map, we use a sampler. Since the computed source coordinates are generally not integers, the sampler uses bilinear interpolation over the neighboring source pixels to generate the pixel intensity at each target coordinate.
The following formula calculates the pixel intensity at each target coordinate by bilinear interpolation.
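Again from the original paper, for channel $c$:

$$
V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \,\max(0,\, 1 - |x_i^s - m|)\,\max(0,\, 1 - |y_i^s - n|)
$$

where $U_{nm}^c$ is the input value at location $(n, m)$ of the source feature map and $V_i^c$ is the output value for target pixel $i$ with computed source coordinates $(x_i^s, y_i^s)$.

In PyTorch, the grid generator and the sampler are available as `F.affine_grid` and `F.grid_sample`, so the whole transformation is a two-liner (a minimal sketch; `theta` is assumed to come from a localization network like the one above):

```python
import torch.nn.functional as F

def spatial_transform(x, theta):
    # x: (N, C, H, W) feature map; theta: (N, 2, 3) affine parameters.
    grid = F.affine_grid(theta, x.size(), align_corners=False)  # grid generator
    return F.grid_sample(x, grid, align_corners=False)          # bilinear sampler
```

Because bilinear interpolation is (sub-)differentiable with respect to both the input values and the sampling coordinates, gradients flow back through the sampler into the localization network, which is what allows the whole module to be trained end-to-end with ordinary backpropagation.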