Introduction
Pixel Recurrent Neural Networks (PixelRNNs) are generative models and belong to the family of unsupervised learning methods. A trained PixelRNN generates new images drawn from the same distribution as its training data.
The main distinction between Generative Adversarial Networks (GANs) and auto-regressive models is that the former learn an implicit data distribution, while the latter learn an explicit distribution whose form is governed by a prior imposed by the model structure. The distribution being modelled can be over anything: class labels, photographs of automobiles or cats, and so on. In layman's terms, a prior is a probability distribution over a quantity, assumed before seeing the data.
Advantages of Auto-Regressive Models over GANs
- Provides a way to calculate likelihood: unlike GANs, these models return explicit probability densities, which makes them straightforward to apply in domains such as compression, probabilistic planning, and exploration (see the sketch after this list).
- Training is more stable than for GANs: training a GAN requires finding a Nash equilibrium, and since no algorithm is guaranteed to find one, GAN training is unstable compared to training a PixelRNN or PixelCNN.
- It works for both discrete and continuous data: GANs struggle to generate discrete data, such as text.
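To make the first point concrete, here is a minimal PyTorch sketch of how a model's 256-way per-pixel outputs yield an exact log-likelihood. The shapes and the random logits are made up and stand in for a real model's predictions:

```python
import math
import torch
import torch.nn.functional as F

# Stand-in for a real model's output: 256-way logits for every pixel of
# a batch of 8 single-channel 28x28 images, shape [batch, 256, H, W].
logits = torch.randn(8, 256, 28, 28)
targets = torch.randint(0, 256, (8, 28, 28))   # ground-truth pixel values

# Exact negative log-likelihood (in nats), summed over all pixels.
nll = F.cross_entropy(logits, targets, reduction="none").sum(dim=(1, 2))

# The usual report for auto-regressive image models: bits per dimension.
bits_per_dim = nll / (28 * 28 * math.log(2))
print(bits_per_dim.mean())
```

Because the density is explicit, this quantity can be computed exactly for any image; a GAN offers no comparable score.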
PIXEL RNN
An effective way to model such a network is to use a probabilistic density model (such as a Gaussian) to characterise the pixels of an image as a product of conditional distributions. This approach turns the modelling problem into a sequence problem, in which each next pixel value is predicted from all previously generated pixel values.
To capture these non-linear, long-range dependencies between pixel values, we need an expressive sequence model such as a Recurrent Neural Network (RNN). RNNs have been shown to be particularly effective on sequence problems.
Conditional Independence
The network scans the image one row at a time, pixel by pixel, and predicts a conditional distribution over the possible values of each pixel. The distribution over image pixels is expressed as a product of conditional distributions, and the parameters of these conditionals are shared across all of the image's pixels.
The goal is to assign a probability p(x) to an image x consisting of n x n pixels. Treating the pixels x_1, ..., x_{n²} as a sequence, the chain rule expresses this probability as:

p(x) = ∏_{i=1}^{n²} p(x_i | x_1, ..., x_{i-1})

Each factor is the likelihood of the i-th pixel given all previously generated pixels, and generation proceeds row by row, pixel by pixel. Furthermore, each pixel x_i is jointly determined by its three colour channels: red, green, and blue (RGB). The conditional probability of the i-th pixel therefore factorises further as:

p(x_i | x_{<i}) = p(x_{i,R} | x_{<i}) · p(x_{i,G} | x_{<i}, x_{i,R}) · p(x_{i,B} | x_{<i}, x_{i,R}, x_{i,G})

Thus, each colour channel is conditioned both on the channels already generated for the current pixel and on all previously generated pixels.
Since each conditional distribution is over a discrete pixel value, we model it with a 256-way softmax layer: the network outputs a probability for each of the 256 possible intensities, so the generated pixel value can be anywhere between 0 and 255.
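Here is a hedged sketch of the resulting pixel-by-pixel sampling loop for a single-channel image; `model` is a placeholder for any network that maps an image to `[batch, 256, H, W]` logits, i.e. one 256-way softmax per pixel:

```python
import torch

@torch.no_grad()
def sample(model, height=28, width=28):
    """Generate one single-channel image pixel by pixel. `model` is a
    placeholder mapping a [1, 1, H, W] image with values in [0, 1]
    to [1, 256, H, W] logits -- one 256-way softmax per pixel."""
    img = torch.zeros(1, 1, height, width)
    for i in range(height):
        for j in range(width):
            logits = model(img)                         # [1, 256, H, W]
            probs = torch.softmax(logits[0, :, i, j], dim=0)
            value = torch.multinomial(probs, num_samples=1).item()
            img[0, 0, i, j] = value / 255.0             # write back, rescaled
    return img
```

Each pixel is drawn from its conditional distribution and written back into the image before the next pixel is predicted, exactly mirroring the factorisation above.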
Model Architecture
In this section, we will look at the model's building blocks: the Row LSTM, the Diagonal BiLSTM, masked convolutions, and the Multi-Scale PixelRNN.
Row LSTM
The first layer is a 7x7 convolution that uses a mask of type A. It is followed by the Row LSTM layer, which has a 3x1 input-to-state convolution with a mask of type B and an unmasked 3x1 state-to-state convolution. The feature map is then passed through two output layers, each a ReLU followed by a 1x1 convolution with a mask of type B. The final layer of the design is the 256-way softmax.
The hidden state of the Row LSTM at position (i, j) is computed, schematically, from the three hidden states directly above it and the current input:

hidden_state(i, j) = hidden_state(i-1, j-1) + hidden_state(i-1, j) + hidden_state(i-1, j+1) + input(i, j)
This method processes the image row by row, from top to bottom, computing the features of an entire row at once. The context it captures above a pixel is roughly triangular, so it cannot cover the entire available context region.
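The following is a minimal PyTorch sketch of one Row LSTM layer. For brevity it omits the masking on the input-to-state convolution described above; the 3x1 state-to-state convolution over the previous row is exactly what gives each position access to the three hidden states above it:

```python
import torch
import torch.nn as nn

class RowLSTM(nn.Module):
    """A minimal Row LSTM layer: the image is processed row by row, with
    a 3x1 input-to-state convolution over the current row and a 3x1
    state-to-state convolution over the previous row's hidden state.
    (The input-to-state convolution is left unmasked here for brevity.)"""
    def __init__(self, in_channels, hidden):
        super().__init__()
        self.hidden = hidden
        # 1D convolutions along the width; padding=1 keeps the row length.
        self.input_to_state = nn.Conv1d(in_channels, 4 * hidden, kernel_size=3, padding=1)
        self.state_to_state = nn.Conv1d(hidden, 4 * hidden, kernel_size=3, padding=1)

    def forward(self, x):                          # x: [B, C, H, W]
        B, _, H, W = x.shape
        h = x.new_zeros(B, self.hidden, W)         # hidden state of the row above
        c = x.new_zeros(B, self.hidden, W)         # cell state
        rows = []
        for r in range(H):                         # top-to-bottom sweep
            gates = self.input_to_state(x[:, :, r]) + self.state_to_state(h)
            i, f, o, g = gates.chunk(4, dim=1)     # input, forget, output, content
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            rows.append(h)
        return torch.stack(rows, dim=2)            # [B, hidden, H, W]

layer = RowLSTM(in_channels=3, hidden=16)
out = layer(torch.rand(2, 3, 8, 8))                # -> [2, 16, 8, 8]
```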
Diagonal BiLSTM
The Diagonal BiLSTM differs from the Row LSTM only in its input-to-state and state-to-state layers: it uses a 1x1 input-to-state convolution with a type B mask and an unmasked 1x2 state-to-state convolution.
Positions in the Diagonal BiLSTM are updated, schematically, from their left and upper neighbours:

hidden_state(i, j) = hidden_state(i, j-1) + hidden_state(i-1, j)
The receptive field of this layer is the entire available context. The computations sweep along diagonals in two directions: one scan starts at the top-left corner and the other at the top-right corner, each ending at the opposite bottom corner.
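The diagonal scan is usually implemented by skewing the feature map so that each diagonal of the input becomes a column, which can then be processed column by column. A small self-contained sketch of that skew operation and its inverse, under the assumption of a `[B, C, H, W]` feature map:

```python
import torch

def skew(x):
    """Shift row i of a feature map right by i positions, so that each
    column of the result corresponds to a diagonal of the input."""
    B, C, H, W = x.shape
    out = x.new_zeros(B, C, H, 2 * W - 1)
    for i in range(H):
        out[:, :, i, i:i + W] = x[:, :, i]
    return out

def unskew(x, W):
    """Inverse of skew: recover the original H x W feature map."""
    B, C, H, _ = x.shape
    return torch.stack([x[:, :, i, i:i + W] for i in range(H)], dim=2)

x = torch.arange(12.).reshape(1, 1, 3, 4)
assert torch.equal(unskew(skew(x), 4), x)   # round-trips exactly
```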
Residual connections (also known as skip connections) are also used in these networks to speed up convergence and propagate signals more directly through the network.
The figure below shows the residual block used in PixelRNNs; 'h' refers to the number of features.
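A hedged sketch of such a residual block, following the pattern of squeezing 2h features down to h around the core layer and back up to 2h before the skip addition; `inner` is a placeholder for the Row LSTM, Diagonal BiLSTM, or a masked convolution:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the PixelRNN residual block: a 1x1 convolution halves
    the 2h input features to h, a core layer (placeholder `inner`)
    operates at width h, and a second 1x1 convolution restores 2h
    features before the skip connection is added."""
    def __init__(self, h, inner):
        super().__init__()
        self.reduce = nn.Conv2d(2 * h, h, kernel_size=1)
        self.inner = inner
        self.expand = nn.Conv2d(h, 2 * h, kernel_size=1)

    def forward(self, x):                              # x: [B, 2h, H, W]
        return x + self.expand(self.inner(self.reduce(x)))

block = ResidualBlock(h=16, inner=nn.Conv2d(16, 16, kernel_size=3, padding=1))
out = block(torch.rand(1, 32, 8, 8))                   # -> [1, 32, 8, 8]
```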
Masked convolutions
In each layer, every input position is split into three parts, one per colour channel (RGB). To compute the values for the G channel, we need the value of the R channel at the current pixel, as well as the values of all previous pixels. Similarly, the B channel requires information from both the R and G channels. To force the network to obey these constraints, we apply masks to the convolutions.
Two types of masks are used:
- Type A: applied only to the first convolutional layer. It restricts connections to previous pixels and, within the current pixel, to only those colours that have already been predicted, blocking any connection to the channel being predicted.
- Type B: applied to all subsequent layers. It relaxes mask A by also allowing a connection from a colour channel to itself at the current pixel.
Figure: connectivity inside a masked convolution.
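A common way to implement these masks is to zero out the "future" weights of an ordinary convolution before each forward pass. The sketch below covers the single-channel case only, without the per-RGB-channel split described above:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """2D convolution with a PixelRNN/PixelCNN-style causality mask.
    Mask type 'A' also zeroes the centre weight, so the layer cannot
    see the pixel it is predicting; type 'B' keeps the centre."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        mask = torch.ones_like(self.weight)            # [out, in, kH, kW]
        _, _, kH, kW = self.weight.shape
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0  # right of centre
        mask[:, :, kH // 2 + 1:] = 0                            # rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask   # zero out "future" positions
        return super().forward(x)

first = MaskedConv2d("A", 1, 64, kernel_size=7, padding=3)   # first layer
later = MaskedConv2d("B", 64, 64, kernel_size=3, padding=1)  # deeper layers
```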
Multi-Scale Pixel RNN
A Multi-Scale PixelRNN is made up of an unconditional PixelRNN and one or more conditional PixelRNNs. The unconditional network first generates a smaller s x s image, modelling a subsampled version of the original image in the standard way. A conditional network then takes this s x s image as an additional input and generates the larger n x n image.
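A small sketch of the data flow, under assumed sizes n = 28 and s = 7. Here the small image is upsampled and concatenated as an extra input channel for illustration; the paper itself feeds the upsampled small image into the conditional network's layers as a bias, so treat this as a simplification:

```python
import torch
import torch.nn.functional as F

n, s = 28, 7
image = torch.rand(1, 1, n, n)            # stand-in for a training image

# The unconditional PixelRNN models the subsampled s x s image.
small = F.interpolate(image, size=(s, s), mode="nearest")

# The conditional PixelRNN generates the n x n image while seeing the
# small image, upsampled back to n x n, at every position.
context = F.interpolate(small, size=(n, n), mode="nearest")
conditioned_input = torch.cat([image, context], dim=1)
print(conditioned_input.shape)            # torch.Size([1, 2, 28, 28])
```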