Table of contents
Advantages of Autoregressive Models over GANs
Conditional Independence
Model Architecture
Diagonal BiLSTM
Masked convolutions
Multi-Scale Pixel RNN
Key Takeaways
Last Updated: Mar 27, 2024

Pixel RNN

Author soham Medewar


Pixel Recurrent Neural Networks (PixelRNNs) are generative models, which belong to unsupervised learning. A PixelRNN learns the distribution of its training images and generates new images from that same distribution.

The main distinction between Generative Adversarial Networks (GANs) and autoregressive models is that the former learn an implicit data distribution, while the latter learn an explicit distribution governed by a prior imposed by the model structure. The distribution can be over anything, such as class labels or photographs of automobiles or cats. In layman's terms, a prior is a probability distribution that expresses what is assumed about a quantity before any data is observed.

Advantages of Autoregressive Models over GANs

  1. They provide a way to calculate likelihood: Unlike GANs, these models return explicit probability densities, which makes them straightforward to use in domains like compression and probabilistic planning and exploration.
  2. Training is more stable than for GANs: Training a GAN requires finding a Nash equilibrium, and no known algorithm is guaranteed to find it. As a result, GAN training is less stable than training a PixelRNN or PixelCNN.
  3. They work for both discrete and continuous data: GANs have difficulty learning to generate discrete data such as text.



An effective way to model an image with probability density models (the Gaussian, or normal, distribution is a familiar example) is to characterize its pixels as a product of conditional distributions. This turns the modelling problem into a sequence problem, in which the distribution of the next pixel value is determined by all of the preceding pixel values.

We need an expressive sequence model such as a Recurrent Neural Network (RNN) to capture these non-linear, long-range dependencies between pixel values and their distributions. RNNs have been shown to be particularly effective on sequence problems.
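As a rough illustration of this sequence view, the sketch below sums the log of each conditional probability over the pixels in row-by-row order. The function `uniform_conditional` is a hypothetical stand-in for a trained network (it ignores its context and returns a uniform distribution), and all names here are assumptions made for the example.

```python
import numpy as np

def log_likelihood(image, conditional_probs):
    """Sum log p(x_i | x_1, ..., x_{i-1}) over pixels in row-by-row order."""
    flat = image.reshape(-1)
    total = 0.0
    for i, value in enumerate(flat):
        # Distribution over the 256 possible values of pixel i,
        # given only the pixels that come before it.
        probs = conditional_probs(flat[:i])
        total += np.log(probs[int(value)])
    return total

def uniform_conditional(previous_pixels):
    # Hypothetical placeholder for a trained model: ignores the context.
    return np.full(256, 1.0 / 256)

image = np.array([[0, 255], [17, 42]], dtype=np.uint8)
ll = log_likelihood(image, uniform_conditional)
print(ll)
```

With a real model, `conditional_probs` would be the network's prediction for the next pixel, and the same loop would yield the exact log-likelihood of the image.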

Conditional Independence


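The network scans the image one row at a time, pixel by pixel within each row. For each pixel, it predicts the conditional distribution over the possible pixel values given the pixels scanned so far. The distribution over the image pixels is written as a product of these conditional distributions, and the parameters used for the predictions are shared across all pixel positions in the image.

The goal is to assign a probability p(x) to each image x of (n x n) pixels. Writing the image as a sequence of pixels x_1, ..., x_{n^2} taken row by row, this probability is expressed as:

$$p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$

Each factor is the probability of the ith pixel given all previously generated pixels. Generation proceeds row by row and pixel by pixel. Furthermore, each pixel x_i is jointly determined by three color channels: red, green, and blue (RGB). Writing x_{<i} for the pixels x_1, ..., x_{i-1}, the conditional probability of the ith pixel becomes:

$$p(x_i \mid \mathbf{x}_{<i}) = p(x_{i,R} \mid \mathbf{x}_{<i}) \, p(x_{i,G} \mid \mathbf{x}_{<i}, x_{i,R}) \, p(x_{i,B} \mid \mathbf{x}_{<i}, x_{i,R}, x_{i,G})$$

Thus, each color channel is conditioned on the preceding channels of the same pixel as well as on all previously generated pixels.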

Each of these conditional distributions is modelled with a 256-way softmax layer. The layer outputs a probability for every integer value from 0 to 255, so each color channel of a pixel can take any of these 256 discrete values.
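A minimal sketch of this step, assuming the network has already produced 256 raw scores (logits) for one color channel of one pixel. The random logits below are placeholders, not the output of a real model.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=256)      # hypothetical network output for one channel
probs = softmax(logits)            # one probability per intensity 0..255
value = rng.choice(256, p=probs)   # sample the channel value
```

Sampling from `probs`, rather than always taking the most likely value, is what lets the model generate varied images.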

Model Architecture

In this section, we will look at the building blocks of the model: Row LSTM, Diagonal BiLSTM, Masked Convolutions, and Multi-Scale PixelRNN.


In the Row LSTM architecture, the first layer is a 7x7 convolution with a mask of type A. It is followed by the Row LSTM layers, each of which combines a 3x1 input-to-state convolution that uses a mask of type B with a 3x1 state-to-state convolution that is not masked. The feature map is then passed through two further layers, each consisting of a ReLU followed by a 1x1 convolution with a mask of type B. The final layer in the design is the 256-way softmax layer.

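Schematically, the hidden state of the Row LSTM at position (i, j) depends on the three nearest hidden states in the row above and on the input p(i, j) at the current position:

Hidden state(i, j) = Hidden state(i-1, j-1) + Hidden state(i-1, j) + Hidden state(i-1, j+1) + p(i, j)

Here the + signs indicate which quantities are combined by the LSTM; they do not denote a literal arithmetic sum.

This method processes the image row by row from top to bottom, computing the features of an entire row at once. It captures a roughly triangular region above the pixel. It is, however, unable to capture the entire available context.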
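To see the triangular region, the sketch below follows only the schematic recurrence above: each hidden state depends on the three nearest hidden states in the previous row. The function name is an assumption for this example, and the sketch ignores the width of the input-to-state convolution.

```python
import numpy as np

def row_lstm_receptive_field(n, target):
    """Mark the inputs that can influence the hidden state at `target`."""
    ti, tj = target
    field = np.zeros((n, n), dtype=bool)
    field[ti, tj] = True                 # the input p(i, j) itself
    for i in range(ti, 0, -1):           # move up one row at a time
        for j in np.flatnonzero(field[i]):
            for dj in (-1, 0, 1):        # three neighbours in the row above
                if 0 <= j + dj < n:
                    field[i - 1, j + dj] = True
    return field

field = row_lstm_receptive_field(5, (2, 2))
print(field.astype(int))
```

The printed map widens by one column on each side per row, which is why pixels far to the left or right in nearby rows fall outside the region.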

Diagonal BiLSTM

The Diagonal BiLSTM design differs from the Row LSTM design only in its input-to-state and state-to-state layers. It uses a 1x1 input-to-state convolution with a mask of type B and a 1x2 state-to-state convolution without a mask.

Schematically, the state at each position in the Diagonal BiLSTM is updated from its left neighbour and the neighbour above it:

pixel(i, j) = pixel(i, j-1) + pixel(i-1, j)

The receptive field of this layer is the entire available context. The computation proceeds along the diagonals of the image. Each of the two directions starts at one of the top corners and ends at the opposite bottom corner.
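The sketch below traces the dependencies implied by the update rule above for both scan directions. It assumes, following the PixelRNN paper, that the mirrored direction is read from one row up so that it never sees pixels that come later in row-by-row order. The function name and the `step` parameter are assumptions for this example.

```python
import numpy as np

def diagonal_receptive_field(n, target, step):
    """Positions reachable via state(i, j) <- state(i, j - step), state(i - 1, j)."""
    field = np.zeros((n, n), dtype=bool)
    stack = [target]
    while stack:
        i, j = stack.pop()
        if not (0 <= i < n and 0 <= j < n) or field[i, j]:
            continue
        field[i, j] = True
        stack.append((i, j - step))   # neighbour along the same row
        stack.append((i - 1, j))      # neighbour in the row above
    return field

n, (i, j) = 5, (2, 2)
left_to_right = diagonal_receptive_field(n, (i, j), step=1)
# Mirrored direction, read one row up (assumption noted in the lead-in).
right_to_left = diagonal_receptive_field(n, (i - 1, j), step=-1)
combined = left_to_right | right_to_left
print(combined.astype(int))
```

Together the two directions cover every row above the current pixel and everything to its left in the current row, which is the full context available under row-by-row generation.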

Residual connections (also known as skip connections) are also used in these networks to speed up convergence and to propagate signals more directly through the network.

The figure below shows the residual block for PixelRNNs. 'h' refers to the number of feature maps.

Masked convolutions

Each input position in every layer is split into three parts, one for each color channel (RGB). To compute the G channel of a pixel, we need the value of its R channel as well as the values of all previous pixels. Similarly, the B channel requires information from both the R and G channels. We apply masks to the convolutions to force the network to respect these constraints.

Two types of masks are used:

  1. Type A: It is applied only to the first convolutional layer. It restricts connections to the previous pixels and to those colours of the current pixel that have already been predicted.
  2. Type B: It is applied to the subsequent layers. It relaxes the restrictions of type A by also allowing the connection from a colour to itself in the current pixel.

Connectivity inside a masked convolution.
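The two mask types can be sketched as a binary array that is multiplied element-wise with the convolution weights. This is an illustrative construction, not the authors' code: it assumes weights of shape (out_channels, in_channels, k, k) and that the feature maps are laid out in three contiguous groups for R, G, and B.

```python
import numpy as np

def build_mask(kernel_size, in_channels, out_channels, mask_type):
    """Binary mask for a k x k convolution with R, G, B channel groups."""
    assert mask_type in ("A", "B")
    assert in_channels % 3 == 0 and out_channels % 3 == 0
    k, centre = kernel_size, kernel_size // 2
    mask = np.zeros((out_channels, in_channels, k, k), dtype=np.float32)
    mask[:, :, :centre, :] = 1.0        # all rows above the current pixel
    mask[:, :, centre, :centre] = 1.0   # pixels to the left in the same row
    # Colour group of each channel: 0 = R, 1 = G, 2 = B (assumed layout).
    in_group = np.arange(in_channels) // (in_channels // 3)
    out_group = np.arange(out_channels) // (out_channels // 3)
    if mask_type == "A":
        # Only colours that were predicted earlier in the current pixel.
        allowed = in_group[None, :] < out_group[:, None]
    else:
        # Type B also allows a colour to connect to itself.
        allowed = in_group[None, :] <= out_group[:, None]
    mask[:, :, centre, centre] = allowed.astype(np.float32)
    return mask

mask_a = build_mask(7, 3, 3, "A")
mask_b = build_mask(7, 3, 3, "B")
```

In use, the layer's weights would be multiplied by the mask before every convolution, so the masked connections stay at zero throughout training.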

Multi-Scale Pixel RNN

A Multi-Scale PixelRNN is made up of an unconditional PixelRNN and one or more conditional PixelRNNs. The unconditional network first generates, in the standard way, a smaller s x s image that is subsampled from the original image. The conditional network then takes the s x s image as an additional input and generates a larger n x n image.
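As a small illustration of the relationship between the two scales, the sketch below produces an s x s image from an n x n one by keeping every (n/s)-th pixel. Plain strided subsampling is an assumption made here; the text does not specify the exact subsampling scheme, and the function name is hypothetical.

```python
import numpy as np

def subsample(image, s):
    """Reduce a square n x n image to s x s by keeping every (n // s)-th pixel."""
    n = image.shape[0]
    assert image.shape[0] == image.shape[1] and n % s == 0
    stride = n // s
    return image[::stride, ::stride]

original = np.arange(64).reshape(8, 8)   # placeholder 8 x 8 "image"
small = subsample(original, 4)           # 4 x 4 target for the unconditional network
```

During training, the unconditional network would model images like `small`, and the conditional network would model `original` given `small`.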




1. What is Pixel RNN?

PixelRNNs are generative neural networks that sequentially predict the pixels in an image along the two spatial dimensions. They model the discrete probability of the raw pixel values and encode the complete set of dependencies in the image.


2. What is a biLSTM?

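A bidirectional LSTM, often known as a biLSTM, is a sequence processing model that consists of two LSTMs: one processes the input in the forward direction and the other processes it in the backward direction.

3. What is a mask in neural networks?

In sequence-processing layers, masking is a way to tell the layer that certain timesteps in the input are missing and should be skipped when processing the data. In PixelRNN, masks are applied to convolutions so that a prediction cannot depend on pixels or color channels that have not been generated yet.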

4. What is meant by auto regression?

Autoregression is a time series model that predicts the value at the next time step by using observations from prior time steps as input to a regression equation. It is a simple concept that can produce reliable forecasts for a variety of time series problems.
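A toy example of this idea, assuming a first-order model without an intercept, in which the next value is a constant multiple of the current one. The series and variable names are made up for illustration.

```python
import numpy as np

series = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
x, y = series[:-1], series[1:]        # (previous value, next value) pairs

# Least-squares estimate of the coefficient in y = coef * x.
coef = np.dot(x, y) / np.dot(x, x)
next_value = coef * series[-1]        # forecast for the next time step
print(coef, next_value)
```

A PixelRNN applies the same principle to images, with a neural network in place of the linear regression and pixels in place of time steps.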

Key Takeaways

In this article, we have discussed the following topics:

  • Introduction to autoregressive models
  • Pixel RNN and its model architectures
  • Row LSTM, Diagonal Bi-LSTM, Masked Convolutions, Multi-Scale Pixel RNN


Happy Coding!
