Table of contents
1. 📜Softmax Function📜
   1.1. 🧑‍💻Implementation🧑‍💻
2. 📜Cross-Entropy📜
   2.1. 🧑‍💻Implementation🧑‍💻
3. Frequently Asked Questions
   3.1. Why is softmax used with cross-entropy?
   3.2. Why is cross-entropy loss better than MSE?
   3.3. What does the softmax function do?
   3.4. Can cross-entropy be negative?
   3.5. What is cross-entropy?
4. Conclusion
Last Updated: Mar 27, 2024

Softmax and Cross-Entropy

Author Mayank Goyal

📜Softmax Function📜

Suppose we have a neural network that classifies its input into two categories, boy and girl, with one output neuron per class producing a raw score.

How do we know that the two classes coordinate their probabilities so that they sum up to one? The answer is that the raw outputs do not do this on their own. The reason the results have this coherence is that we apply the softmax function to them.

The softmax function, also known as softargmax or the normalized exponential function, generalizes the logistic function to multiple outputs. We use the softmax function in multinomial logistic regression and as the last activation function in a neural network to normalize the network's output into a probability distribution over the predicted output classes.


The softmax function takes an input vector of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. Before applying softmax, some vector components could be negative or greater than one and might not sum to one. After applying the softmax function, each component lies in the interval (0,1) and the components sum to one, so they can be interpreted as probabilities.

The softmax function is represented as

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \ldots, K$$

In simple words, the softmax function applies the standard exponential function to each element of the input vector z. It normalizes these values by dividing them by the sum of all these exponentials. This normalization ensures that the sum of the components of the output vector is one.

🧑‍💻Implementation🧑‍💻

import numpy as np

a = [1, 5, 6, 4, 2, 2.6, 6]
# Exponentiate each element and divide by the sum of all the exponentials
vector = np.exp(a) / np.sum(np.exp(a))
vector


👉Output:

array([0.00263032, 0.14361082, 0.39037468, 0.05283147, 0.00714996,
       0.01302808, 0.39037468])
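
In practice, exponentiating large inputs can overflow. A common refinement, shown below as a minimal sketch rather than part of the original snippet, subtracts the maximum of the vector before exponentiating; because softmax is shift-invariant, the probabilities are unchanged.

import numpy as np

def softmax(z):
    # Subtracting the max does not change the result (softmax is shift-invariant)
    # but keeps np.exp from overflowing on large inputs
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

softmax(np.array([1000.0, 1001.0, 1002.0]))  # no overflow, still sums to 1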



📜Cross-Entropy📜

If we recall the material from the artificial neural networks section, we had the mean squared error function. We use this function to assess the performance of the network, and by working to minimize this mean squared error, we are practically optimizing the network. The mean squared error function can also be used with convolutional neural networks, but an even better option is to apply the cross-entropy function after the softmax function.


We relate cross-entropy loss closely to the softmax function since it is practically only used with networks that have a softmax layer at the output. We extensively use cross-entropy loss in multi-class classification tasks, where each sample belongs to one of C classes. The label assigned to each sample is a single integer value between 0 and C − 1, which can be represented as a one-hot encoded vector of size C: the entry for the correct class is one and every other entry is zero.
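
For example, with C = 4 classes (an illustrative value), the integer label 2 corresponds to the one-hot vector below. This is a minimal sketch, and the variable names are only for illustration.

import numpy as np

C = 4        # number of classes (illustrative)
label = 2    # integer class label between 0 and C - 1

one_hot = np.zeros(C)
one_hot[label] = 1.0
print(one_hot)   # [0. 0. 1. 0.]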

Cross-entropy takes as input two discrete probability distributions (simply vectors whose elements lie between zero and one and add up to one) and outputs a single real-valued number representing how similar the two probability distributions are.

It is defined as

$$H(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

The larger the value of cross-entropy, the less similar the two probability distributions are. When cross-entropy is used as a loss function in a multi-class classification task, y is the one-hot encoded label and ŷ holds the probabilities generated by the softmax layer.

The above equation takes logarithms of the probabilities generated by the softmax layer, so we never take the logarithm of zero, as softmax never produces exactly zero values. By minimizing this loss during training, we force the predicted probabilities to gradually resemble the one-hot encoded label vectors.
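
Putting the two pieces together, the sketch below computes the multi-class cross-entropy of a softmax output against a one-hot label. It is a minimal illustration, separate from the implementation that follows (which covers the binary case with a sigmoid); the logits, the label, and the small epsilon guarding the logarithm are illustrative choices.

import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))   # shift for numerical stability
    return exps / np.sum(exps)

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    # y: one-hot encoded true label, y_hat: predicted probabilities
    return -np.sum(y * np.log(y_hat + eps))

z = np.array([2.0, 1.0, 0.1])      # raw network outputs (logits)
y = np.array([1.0, 0.0, 0.0])      # the true class is index 0
y_hat = softmax(z)
print(y_hat)
print(categorical_cross_entropy(y, y_hat))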

🧑‍💻Implementation🧑‍💻

👉Importing Libraries

import numpy as np
import matplotlib.pyplot as plt

 

👉Cross-Entropy function (binary case)

def cross_entropy_loss(yHat, y):
    # Binary cross-entropy for a single example: y is the true label (0 or 1)
    # and yHat is the predicted probability of the positive class
    if y == 1:
        return -np.log(yHat)
    else:
        return -np.log(1 - yHat)


👉Sigmoid Function

def sigmoid(z):
    # Logistic sigmoid: maps any real value to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))


👉Dataset

z = np.arange(-5, 5, 0.2)
# Calculating the probability value
h_z = sigmoid(z)


👉Loss when y=1

# Value of cost function when y = 1
cost_1 = cross_entropy_loss(h_z, 1)


👉Loss when y=0

# Value of cost function when y = 0
cost_0 = cross_entropy_loss(h_z, 0)


👉Plotting

fig, ax = plt.subplots(figsize=(8, 6))
plt.plot(h_z, cost_1, label='J(w) if y=1')
plt.plot(h_z, cost_0, label='J(w) if y=0')
plt.xlabel(r'$\phi(z)$')  # raw string avoids an invalid escape sequence warning
plt.ylabel('J(w)')
plt.legend(loc='best')
plt.tight_layout()
plt.show()


👉Output

Plot of J(w) against the predicted probability φ(z): the y=1 curve falls toward zero as φ(z) approaches 1, while the y=0 curve falls toward zero as φ(z) approaches 0.

Frequently Asked Questions

Why is softmax used with cross-entropy?

Softmax is placed at the end of a deep learning network to convert logits into classification probabilities. Cross-entropy then takes those output probabilities and measures how far they are from the true labels.
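
A further practical reason for pairing them: the gradient of the combined softmax-plus-cross-entropy layer with respect to the logits reduces to the simple difference ŷ − y, which is cheap to compute and numerically well behaved. The sketch below is only a numerical check of this fact; the logits, label, and finite-difference step size are illustrative.

import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.5, -0.3, 0.8])     # illustrative logits
y = np.array([0.0, 1.0, 0.0])      # one-hot label

analytic = softmax(z) - y          # closed-form gradient: y_hat - y
numeric = np.array([
    (loss(z + h, y) - loss(z - h, y)) / (2 * 1e-5)
    for h in 1e-5 * np.eye(3)      # central differences along each logit
])
print(analytic)
print(numeric)                     # the two should agree closely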

Why is cross-entropy loss better than MSE?

Cross-entropy is better suited than MSE for classification because what matters in a classification task is getting the decision boundary right (unlike in regression): cross-entropy penalizes confident but wrong predictions heavily, whereas MSE combined with a sigmoid or softmax output yields very small gradients for such predictions, which slows learning.
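
One way to see this concretely: with a sigmoid output, the gradient of MSE with respect to the logit contains the factor ŷ(1 − ŷ), which shrinks toward zero when the prediction is confidently wrong, while the cross-entropy gradient does not. The sketch below illustrates this under those assumptions; the label and logit values are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y, z = 1.0, -6.0                # true label is 1, but the logit is confidently wrong
y_hat = sigmoid(z)

# Gradients of each loss with respect to the logit z
grad_mse = (y_hat - y) * y_hat * (1 - y_hat)   # d/dz of 0.5 * (y_hat - y)**2
grad_ce = y_hat - y                            # d/dz of binary cross-entropy

print(grad_mse)   # tiny: learning stalls
print(grad_ce)    # close to -1: a strong corrective signal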

What does the softmax function do?

The softmax function turns a vector of K real values into a vector of K values in (0, 1) that sum to 1. Softmax is helpful because it converts raw scores into a normalized probability distribution that can be displayed to a user or fed as input to other systems.

Can cross-entropy be negative?

Cross-entropy can never be negative for probability-valued inputs, and with a one-hot label y it is zero only when the prediction ŷ matches y exactly. Note that minimizing cross-entropy is equivalent to minimizing the KL divergence from ŷ to y.
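
A quick numerical check of both properties, as a minimal sketch with illustrative values:

import numpy as np

y = np.array([0.0, 1.0, 0.0])                        # one-hot label, true class is index 1
for y_hat in ([0.2, 0.5, 0.3], [0.001, 0.998, 0.001]):
    ce = -np.sum(y * np.log(np.array(y_hat)))
    print(ce)    # always >= 0, approaching 0 as y_hat approaches y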

What is cross-entropy?

Cross-entropy measures the difference between two probability distributions for a given random variable or set of events.

Conclusion

Let us summarize the article. Firstly, we saw the softmax function and its implementation. Further, we saw cross-entropy, why we use it with softmax, certain advantages of cross-entropy over mean squared error, and finally, its implementation. Thus, the cross-entropy loss function is used as an optimization objective to estimate parameters for logistic regression models or for models with a softmax output.

You can also refer to Stochastic Gradient Descent, Feature Selection, Logistic Regression, Sigmoid Neuron, and many more to enhance your knowledge.
Happy Learning, Ninjas!
