Table of contents
1. Introduction
2. What is the Actor-Critic Model?
3. Introduction to Policy-Based Reinforcement Learning
4. Pseudo-code for the Actor-Critic Method
5. Implementation and Building the CartPole Game
5.1. What is the CartPole game?
5.2. Implementing the game
6. Frequently Asked Questions
6.1. What is the temporal difference?
6.2. What is a policy?
6.3. Define actor and critic in the actor-critic model.
6.4. What is the CartPole game about?
7. Conclusion
Last Updated: Mar 27, 2024

Introduction to the Actor-Critic Model


Introduction

Hey Ninjas! In machine learning and data mining, you may have come across the term model-free reinforcement learning. Reinforcement learning is a training method in which a system learns by being rewarded for desired behaviour and penalised for undesired behaviour. Within it, there is a class of methods called Temporal Difference (TD) learning, which deals with predicting the expected value of a variable over a sequence of states. In this article, you will learn about a temporal-difference-based approach called the actor-critic model.

Let us begin by understanding what the actor-critic model is.

What is the Actor-Critic Model?

The actor-critic model is a model in which the agent learns to map the states it visits in an environment as it takes actions and moves through it. The agent maps each state to two kinds of outputs: a recommended action and an estimated reward. Here, the agent is the entity that takes actions and moves through the environment; as it moves, it takes note of the visited states and their properties, which is what we mean by mapping the visited states.


The part of the agent responsible for recommending actions is called the actor. The part of the agent responsible for estimating future rewards is called the critic. The actor and the critic perform their respective tasks so that the recommended actions maximise the rewards.
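
As a rough sketch of this split (with toy, made-up numbers rather than a trained model; the actual network is built later in this article), you can think of the actor and the critic as two functions of the same state:

import numpy as np

def actor(state):
    # Actor: maps the visited state to a distribution over recommended actions.
    # These fixed probabilities are placeholders for what a trained actor would output.
    return np.array([0.8, 0.2])

def critic(state):
    # Critic: maps the same state to an estimate of the future rewards.
    return 13.5

state = np.array([0.02, -0.01, 0.03, 0.04])  # an example CartPole-style observation
print(actor(state), critic(state))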


Introduction to Policy-Based Reinforcement Learning

A policy is a function that returns a probability distribution over the actions the agent can take, given the current state. Another function, called the value function, gives the expected return for an agent that starts in a given state and acts according to a particular policy.

In policy-based methods, we directly learn the policy function instead of learning a value function. This means that instead of learning the expected sum of rewards for a given state and action, we directly learn the function that maps a state to an action. In other words, in policy-based reinforcement learning, we directly manipulate the policy to find the optimal policy.
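
As a minimal illustration of the difference (the numbers below are invented, not learned), a value-based agent would act greedily on learned value estimates, whereas a policy-based agent samples directly from the learned distribution:

import numpy as np

q_values = np.array([1.2, 0.3, 0.9])      # value-based view: learned Q(s, a) estimates
greedy_action = int(np.argmax(q_values))  # act by maximising the learned values

policy_probs = np.array([0.7, 0.1, 0.2])  # policy-based view: pi(a | s) learned directly
sampled_action = np.random.choice(3, p=policy_probs)  # act by sampling from the policy

print(greedy_action, sampled_action)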

Pseudo-code for the Actor-Critic Method

Follow the listed steps for the pseudo-code of the actor-critic algorithm; a small numerical sketch of one such update follows the list.

  • From the actor-critic network, get the current policy π_θ(a | s).

  • Now, using this policy, sample state-action pairs {s_t, a_t}.

  • In the actor-critic model, the advantage function is produced by the critic network, so use the critic to evaluate the advantage A_t, also called the TD (temporal difference) error:
    A_t = r_t + γ V(s_{t+1}) − V(s_t)

  • Using the equation below, evaluate the policy gradient:
    ∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) · A_t

  • Finally, update the policy parameters in the direction of this gradient:
    θ ← θ + α ∇_θ J(θ)
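
Below is a small numerical sketch of one such update, using a linear softmax policy and a linear critic with made-up states and rewards (purely illustrative; the CartPole implementation in the next section uses a neural network instead):

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(4, 2))  # actor (policy) parameters
w = rng.normal(scale=0.1, size=4)           # critic (value) parameters
alpha_actor, alpha_critic, gamma = 0.01, 0.05, 0.99

def policy(state):                          # pi(a | s): softmax over linear scores
    logits = state @ theta
    e = np.exp(logits - logits.max())
    return e / e.sum()

def value(state):                           # V(s): linear value estimate
    return state @ w

# Sample (s_t, a_t) from the current policy (the states and reward here are invented).
s_t, s_next, r_t = rng.normal(size=4), rng.normal(size=4), 1.0
probs = policy(s_t)
a_t = rng.choice(2, p=probs)

# The critic evaluates the advantage / TD error: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
A_t = r_t + gamma * value(s_next) - value(s_t)

# Gradient of log pi(a_t | s_t) for this softmax-linear policy.
grad_log_pi = np.outer(s_t, -probs)
grad_log_pi[:, a_t] += s_t

# Update the policy parameters in the direction of the gradient (and the critic too).
theta += alpha_actor * A_t * grad_log_pi
w += alpha_critic * A_t * s_t
print("TD error:", A_t)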
     

Implementation and Building the CartPole Game

Before implementing and building the CartPole game, let us first define what the CartPole game is.

What is the CartPole game?

It is a game that consists of a pole attached to a cart that moves along a frictionless track. To move the cart, the agent applies a force to it. For every timestep the pole stays upright, the agent receives a reward. So, to succeed in the game, the agent has to learn how to keep the pole from falling over.
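
To get a feel for the environment before any training, here is a minimal random-action sketch (it assumes the same older gym API used in the code below, where env.step returns four values):

import gym

env = gym.make("CartPole-v0")
state = env.reset()
total_reward, done = 0, False
while not done:
    action = env.action_space.sample()         # push the cart left or right at random
    state, reward, done, _ = env.step(action)  # +1 reward for every step the pole stays up
    total_reward += reward
print("Random agent survived for", int(total_reward), "steps")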

Implementing the game

Let’s implement the game, but first we need to set up our environment: import the required packages and configure the global settings. Then we define the model and implement the game.

import gym
import numpy as npy
import tensorflow as tsf
from tensorflow import keras
from tensorflow.keras import layers

# Configuration for the setup
seed = 42
gamma = 0.99  # Discount factor for past rewards
max_steps_per_episode = 10000
env = gym.make("CartPole-v0")  # Creation of the environment
env.seed(seed)
eps = npy.finfo(npy.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0

num_inputs = 4
num_actions = 2
num_hidden = 128

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[action, critic])

optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0

while True:  # Run until solved
    state = env.reset()
    episode_reward = 0
    with tsf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            state = tsf.convert_to_tensor(state)
            state = tsf.expand_dims(state, 0)
            # Predicting the action probabilities and estimated future reward from the environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sampling action from the action probability distribution
            action = npy.random.choice(num_actions, p=npy.squeeze(action_probs))
            action_probs_history.append(tsf.math.log(action_probs[0, action]))

            # Applying sampled action in our environment
            state, reward, done, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Updating the running reward to check the condition
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculating the expected value from the rewards at each timestep. 
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalizing
        returns = npy.array(returns)
        returns = (returns - npy.mean(returns)) / (npy.std(returns) + eps)
        returns = returns.tolist()

        # Calculating the loss values
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # The actor must be updated to predict an action which will lead to higher rewards (compared to the critic's estimate) with higher probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # We must update the critic to predict a better estimate of the future rewards.
            critic_losses.append(
                huber_loss(tsf.expand_dims(value, 0), tsf.expand_dims(ret, 0))
            )

        # Back-propagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clearing the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break
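
Once the loop above reports the task as solved, you could, as a small usage sketch reusing the names defined above, run one greedy episode with the trained model:

# Run one episode with the trained model, always picking the most probable action.
state = env.reset()
done, total_reward = False, 0
while not done:
    state_tensor = tsf.expand_dims(tsf.convert_to_tensor(state), 0)
    action_probs, _ = model(state_tensor)
    action = int(npy.argmax(action_probs[0]))   # greedy action instead of sampling
    state, reward, done, _ = env.step(action)
    total_reward += reward
print("Greedy episode reward:", total_reward)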

Frequently Asked Questions

What is the temporal difference?

Temporal difference (TD) learning is a class of model-free reinforcement learning methods that deals with predicting a variable’s expected value over a sequence of states.
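
As a point of reference, the classic TD(0) update (a standard formula, not specific to this article) adjusts the value estimate of the current state towards the observed reward plus the discounted estimate of the next state:

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]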

What is a policy?

A policy is a function that returns a probability distribution over the actions the agent can take, given the current state.

Define actor and critic in the actor-critic model.

The part of the agent responsible for recommending actions is called the actor. The part of the agent responsible for estimating future rewards is called the critic.

What is the CartPole game about?

It is a game in which a pole is attached to a cart that moves along a frictionless track. To succeed in the game, the agent has to learn how to keep the pole from falling over.

Conclusion

In this article, we discussed the actor-critic model. We started with policy-based reinforcement learning, then went through the pseudo-code and an implementation of the actor-critic model.

We hope this blog was useful and helped you enhance your knowledge of the actor-critic model. If you want to learn more, check out our articles on Introduction to Reinforcement Learning and Four Types of Learnings in Machine Learning. Do upvote our blogs if you find them useful.

Happy Coding!
