Table of contents
1. Introduction
2. Parameter Tying
   2.1. Example
3. Parameter Sharing
   3.1. Example
4. Frequently Asked Questions
5. Key Takeaways

Last Updated: Mar 27, 2024

Parameter Sharing and Tying

Author Mayank Goyal

Introduction

Deep neural networks may contain millions, even billions, of parameters or weights, making storage and computation significantly expensive and motivating a large body of work on reducing their complexity by using, e.g., sparsity-inducing regularization. Parameter sharing and parameter tying are other well-known approaches for controlling the complexity of deep neural networks by forcing certain weights to share the same value. Specific forms of weight sharing are hard-wired to express certain invariances; one typical example is the shift-invariance of convolutional layers. However, other weight groups may be tied together during the learning process to reduce network complexity further.


Parameter Tying

Parameter tying is a regularization technique. We divide the parameters or weights of a machine learning model into groups by leveraging prior knowledge, and all parameters in each group are constrained to take the same value. In simple terms, we want to express that certain parameters should be close to each other, for example by penalizing the distance between them.

Example

Two models perform the same classification task (with the same set of classes) but with different input data.

 • Model A with parameters wt(A).

 • Model B with parameters wt(B). 

The two models map the input to two different but related outputs. Because the tasks are similar, we expect wt(A) and wt(B) to be close to each other, so we can add a penalty on the distance between the two sets of weights, as in the sketch below.
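
As an illustration, here is a minimal PyTorch sketch of this kind of soft tying, which penalizes the squared L2 distance between corresponding weights of the two models. The model shapes, names (`model_a`, `model_b`), and the coefficient `lam` are illustrative assumptions, not values from any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two hypothetical classifiers for the same set of classes but different inputs.
model_a = nn.Linear(20, 5)
model_b = nn.Linear(20, 5)

def tying_penalty(m_a: nn.Module, m_b: nn.Module) -> torch.Tensor:
    """Soft parameter tying: squared L2 distance between corresponding weights."""
    penalty = torch.tensor(0.0)
    for p_a, p_b in zip(m_a.parameters(), m_b.parameters()):
        penalty = penalty + torch.sum((p_a - p_b) ** 2)
    return penalty

# The penalty is added to each model's task loss; `lam` is an illustrative coefficient.
lam = 1e-3
x_a, y_a = torch.randn(8, 20), torch.randint(0, 5, (8,))
loss = F.cross_entropy(model_a(x_a), y_a) + lam * tying_penalty(model_a, model_b)
loss.backward()
```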

Some standard regularizers, like L1 and L2, penalize model parameters for deviating from the fixed value of zero. One side effect of Lasso or group-Lasso regularization when learning a deep neural network is that many of the parameters may become zero, reducing the amount of memory required to store the model and lowering the computational cost of applying it. A significant drawback of Lasso (or group-Lasso) regularization is that, in the presence of groups of highly correlated features, it tends to select only one feature, or an arbitrary convex combination of elements, from each group. Moreover, the learning process of Lasso tends to be unstable, because the subsets of parameters that end up selected may change dramatically with minor changes in the data or algorithmic procedure. In deep neural networks, it is almost unavoidable to encounter correlated features, due to the high dimensionality of the input to each layer and because neurons tend to co-adapt, producing strongly correlated features that are passed as input to the subsequent layer.

 

The issues we face while using Lasso or group-Lasso are countered by a regularizer called the group version of the ordered weighted L1 norm, known as group-OWL (GrOWL). GrOWL promotes sparsity and simultaneously learns which parameters should share a similar value. GrOWL has been effective in linear regression for identifying and coping with strongly correlated covariates. Unlike standard sparsity-inducing regularizers (e.g., Lasso), GrOWL eliminates unimportant neurons by setting all their weights to zero and explicitly identifies strongly correlated neurons by tying the corresponding weights to a common value. This ability of GrOWL motivates the following two-stage procedure:

(i) use GrOWL regularization during training to simultaneously identify significant neurons and groups of parameters that should be tied together.

(ii) retrain the network, enforcing the structure unveiled in the previous phase, i.e., keeping only the significant neurons and implementing the learned tying structure.
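
For concreteness, the sketch below shows one way a GrOWL-style penalty can be computed for a single weight matrix in PyTorch, treating each row as a group: the row norms are sorted in decreasing order and weighted by a non-increasing sequence of lambdas. The linearly decaying lambda schedule and the matrix shape are illustrative assumptions; in practice the penalty is applied layer-wise during training and followed by the retraining phase described above.

```python
import torch

def growl_penalty(weight: torch.Tensor, lambdas: torch.Tensor) -> torch.Tensor:
    """
    Group-OWL-style penalty on a 2-D weight matrix, with each row as a group.
    `lambdas` is assumed non-negative, sorted in non-increasing order,
    with one entry per row of `weight`.
    """
    row_norms = weight.norm(dim=1)                  # L2 norm of each row (group)
    sorted_norms, _ = torch.sort(row_norms, descending=True)
    return torch.dot(lambdas, sorted_norms)         # sum_i lambda_i * ||w_[i]||_2

# Illustrative usage with a linearly decaying lambda schedule (one possible choice).
W = torch.randn(6, 4, requires_grad=True)
lambdas = torch.linspace(1.0, 0.1, steps=W.shape[0])
penalty = growl_penalty(W, lambdas)
penalty.backward()
```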

Parameter Sharing

Parameter sharing forces sets of parameters to be equal: we interpret the various models or model components as sharing a single set of parameters, so we only need to store that one set in memory.

Suppose two models, A and B, perform a classification task on similar input and output distributions. In such a case, we would expect the parameters of both models to be similar to each other. We could impose a norm penalty on the distance between the weights, but a more popular method is to force the sets of parameters to be equal. This is the essence of parameter sharing. A significant benefit is that we need to store only one copy of the shared parameters (e.g., storing them once instead of separately for both A and B), which leads to significant memory savings, as the sketch below illustrates.
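
A minimal sketch of this kind of hard parameter sharing in PyTorch, assuming a shared encoder reused by two hypothetical task-specific heads (`head_a`, `head_b`); the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# One set of parameters, stored once and reused by both "models".
shared_encoder = nn.Sequential(nn.Linear(20, 32), nn.ReLU())

# Hypothetical task-specific heads; only these differ between A and B.
head_a = nn.Linear(32, 5)
head_b = nn.Linear(32, 5)

def model_a(x):
    return head_a(shared_encoder(x))

def model_b(x):
    return head_b(shared_encoder(x))

x = torch.randn(8, 20)
out_a, out_b = model_a(x), model_b(x)

# The shared encoder's weights are stored (and updated) only once.
n_shared = sum(p.numel() for p in shared_encoder.parameters())
print(f"shared parameters stored once: {n_shared}")
```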

Example

The most extensive use of parameter sharing is in convolutional neural networks. Natural images have specific statistical properties that are robust to translation: a photo of a cat remains a photo of a cat if it is translated one pixel to the right. Convolutional neural networks exploit this property by sharing parameters across multiple image locations. Thus, we can find a cat with the same cat detector whether it appears at column i or column i+1 of the image.
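
The sketch below contrasts a small convolutional layer with a fully connected layer on a hypothetical 32x32 image to show why reusing one kernel at every spatial location saves parameters; the sizes are illustrative.

```python
import torch
import torch.nn as nn

# A single 3x3 kernel (plus bias) is reused at every spatial location,
# so the parameter count does not depend on the image size.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)

# A fully connected layer mapping a 32x32 image to a 32x32 output
# needs a separate weight for every input-output pair.
fc = nn.Linear(32 * 32, 32 * 32)

print(sum(p.numel() for p in conv.parameters()))  # 10  (3*3 weights + 1 bias)
print(sum(p.numel() for p in fc.parameters()))    # 1_049_600

# The same "cat detector" kernel is applied whether the cat sits at
# column i or column i+1 of the image.
img = torch.randn(1, 1, 32, 32)
shifted = torch.roll(img, shifts=1, dims=3)
_ = conv(img), conv(shifted)
```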

Frequently Asked Questions

  1. What are the benefits of parameter sharing in CNNs?
    Convolutional neural networks rely heavily on parameter sharing (and, more generally, parameter tying). Parameter sharing here means that all neurons in a particular feature map share the same weights. This greatly reduces the number of parameters in the whole system, making it computationally cheaper.
     
  2. Why is sharing parameters a good idea?
    Parameter sharing is used in all convolution layers in the network. It reduces the number of parameters, and therefore the number of weight updates during backpropagation, which in turn reduces training time.
     
  3. Does weight sharing occur in Recurrent Neural networks?
    Yes. RNNs (and LSTMs) also use shared parameters. The shared-weights perspective comes from thinking of an RNN as a feedforward network unrolled across time, with the same weights applied at every time step; if the weights were different at each step, it would simply be a feedforward network (see the sketch below).
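
A small sketch of this unrolled view in PyTorch, using a single `nn.RNNCell` whose weights are reused at every time step; the sequence length and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

# An RNN cell holds one set of weights that is reused at every time step.
cell = nn.RNNCell(input_size=10, hidden_size=16)

seq = torch.randn(5, 8, 10)         # (time steps, batch, features)
h = torch.zeros(8, 16)
for x_t in seq:                     # "unrolling" the network across time
    h = cell(x_t, h)                # the same cell (same weights) every step

# With different weights per step, this loop would just be a 5-layer
# feedforward network; sharing them is what makes it recurrent.
print(sum(p.numel() for p in cell.parameters()))
```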

Key Takeaways

Let us briefly summarize the article.

Firstly, we looked at parameter tying, with a detailed explanation and an example.

Then we saw some of the shortcomings of Lasso and group-Lasso regularization, and how GrOWL overcomes them. Furthermore, we covered parameter sharing and several examples of where it is used. Lastly, we saw how these two techniques reduce memory and computational cost. So, that's the end of the article.

Check out this article - Padding In Convolutional Neural Network

I hope you all like this article.

Happy Learning Ninjas!
