Introduction to Policy-Based Reinforcement Learning
A policy is a function that returns a probability distribution over the actions the agent can take in a given state. A second function, the value function, gives the expected return for an agent that starts in a given state and acts according to a particular policy.
In policy-based methods, we learn the policy function directly instead of learning a value function. That is, rather than learning the expected sum of rewards for a given state and action, we directly learn the function that maps states to actions. In other words, in policy-based reinforcement learning, we manipulate the policy directly to find the optimal policy.
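For intuition, here is a minimal sketch of a policy network that maps a state vector to a probability distribution over actions. The layer sizes and the 4-dimensional state are arbitrary choices for illustration only, not part of any specific algorithm.
import numpy as npy
from tensorflow import keras
from tensorflow.keras import layers

# A tiny policy network: state in, probability distribution over actions out.
# The sizes below (4 state features, 2 actions, 16 hidden units) are illustrative only.
state_dim, n_actions = 4, 2
policy_net = keras.Sequential([
    layers.Input(shape=(state_dim,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(n_actions, activation="softmax"),  # outputs sum to 1
])

dummy_state = npy.random.rand(1, state_dim).astype("float32")
action_probs = policy_net(dummy_state).numpy()[0]
print(action_probs, action_probs.sum())  # e.g. [0.48 0.52] 1.0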
Pseudo-code for the Actor-Critic Method
Follow the listed steps for the pseudo-code of the actor-critic algorithm; a small numeric sketch of a single update follows the list.
- From the actor network, get the policy π_θ(a | s).
- Now, using this policy, sample state-action pairs {s_t, a_t}.
- In the actor-critic model the advantage function is produced by the critic network, so use the critic to evaluate the advantage. Let the advantage be A_t; it is also called δ_t, the temporal difference (TD) error: A_t = r_t + γ V(s_{t+1}) - V(s_t).
- Using the equation below, evaluate the policy gradient: ∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t | s_t) A_t.
- The last step is to update the policy parameters: θ ← θ + α ∇_θ J(θ).

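To make the steps above concrete, here is a small numeric sketch of one update term, using made-up numbers (not from a real run); it mirrors the quantities used in the Keras implementation later in this article.
import numpy as npy

# One illustrative transition (hypothetical values, for intuition only)
gamma = 0.99
reward = 1.0               # r_t returned by the environment
V_s, V_s_next = 1.2, 1.5   # critic's estimates V(s_t) and V(s_{t+1})

# Step 3: the critic supplies the advantage A_t (the TD error delta_t)
advantage = reward + gamma * V_s_next - V_s

# Steps 4-5: the actor's loss term is -log(pi(a_t|s_t)) * A_t; minimising it is the
# gradient step that raises the probability of actions with positive advantage.
pi_a = 0.6                 # probability the policy assigned to the sampled action a_t
actor_loss = -npy.log(pi_a) * advantage
print(advantage, actor_loss)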
Implementing and Building the CartPole Game
Before implementing and building the CartPole game, let us first define what the CartPole game is.
What is the CartPole game?
It is a game that consists of a pole attached to a cart that moves along a frictionless track. To move the cart, the agent has to apply force to it. Whenever the pole stays upright, the agent receives a reward for that timestep. So, to be successful in the game, the agent has to learn how to keep the pole from falling over.
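Before training anything, a quick way to get a feel for the environment is to run a random agent for one episode. The short sketch below uses the same gym "CartPole-v0" API as the implementation that follows; the random agent is only for illustration.
import gym

env = gym.make("CartPole-v0")
state = env.reset()
total_reward = 0
done = False
while not done:
    action = env.action_space.sample()          # push the cart left (0) or right (1) at random
    state, reward, done, _ = env.step(action)   # +1 reward for every timestep the pole stays up
    total_reward += reward
print("Episode reward with random actions:", total_reward)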
Implementing the game
Let’s implement the game, but first, we need to set up our environment. For the environment, import the required packages and configure the global settings. Then define the model and implement the game.
import gym
import numpy as npy
import tensorflow as tsf
from tensorflow import keras
from tensorflow.keras import layers
# Configuration for the setup
seed = 42
gamma = 0.99 # Discount factor for past rewards
max_steps_per_episode = 10000
env = gym.make("CartPole-v0") # Creation of the environment
env.seed(seed)
eps = npy.finfo(npy.float32).eps.item() # Smallest number such that 1.0 + eps != 1.0
num_inputs = 4    # CartPole observation: cart position, cart velocity, pole angle, pole angular velocity
num_actions = 2   # push the cart to the left or to the right
num_hidden = 128

# One shared hidden layer feeding two heads: the actor (action probabilities) and the critic (state value)
inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)
model = keras.Model(inputs=inputs, outputs=[action, critic])

optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()

# Per-episode buffers for the actor's log-probabilities, the critic's value estimates and the rewards
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0
while True:  # Run until solved
    state = env.reset()
    episode_reward = 0
    with tsf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            state = tsf.convert_to_tensor(state)
            state = tsf.expand_dims(state, 0)

            # Predict the action probabilities and the estimated future reward
            # from the current environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample an action from the action probability distribution
            action = npy.random.choice(num_actions, p=npy.squeeze(action_probs))
            action_probs_history.append(tsf.math.log(action_probs[0, action]))

            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward
            if done:
                break

        # Update the running reward used to check the "solved" condition
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate the expected value from the rewards at each timestep.
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = npy.array(returns)
        returns = (returns - npy.mean(returns)) / (npy.std(returns) + eps)
        returns = returns.tolist()

        # Calculate the loss values
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # The actor must be updated so that actions which lead to higher rewards
            # (compared to the critic's estimate) are predicted with higher probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated to give a better estimate of the future rewards.
            critic_losses.append(
                huber_loss(tsf.expand_dims(value, 0), tsf.expand_dims(ret, 0))
            )

        # Back-propagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clear the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break
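Once the loop reports that the task is solved, one way to watch what the agent has learned is a greedy rollout like the sketch below. This evaluation snippet is an optional addition, not part of the original training loop; it simply picks the most probable action instead of sampling.
# Optional: a greedy rollout with the trained model
state = env.reset()
done = False
total_reward = 0
while not done:
    state_t = tsf.expand_dims(tsf.convert_to_tensor(state), 0)
    action_probs, _ = model(state_t)
    action = int(npy.argmax(action_probs[0]))   # take the most likely action instead of sampling
    state, reward, done, _ = env.step(action)
    total_reward += reward
print("Greedy episode reward:", total_reward)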

Frequently Asked Questions
What is the temporal difference?
Temporal difference (TD) learning is a class of model-free reinforcement learning methods that predict the expected value of a variable over a sequence of states.
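As a concrete illustration (with made-up numbers), the classic TD(0) rule nudges a state-value estimate toward the one-step bootstrapped target:
# TD(0) update of a state-value estimate (illustrative numbers only)
alpha, gamma = 0.1, 0.99
V_s, V_s_next, reward = 0.5, 0.8, 1.0
td_error = reward + gamma * V_s_next - V_s  # the temporal difference error
V_s = V_s + alpha * td_error                # move the estimate toward the TD target
print(V_s)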
What is a policy?
A policy is a function that returns a probability distribution over the actions the agent can take in a given state.
Define actor and critic in the actor-critic model.
The part of the agent responsible for recommending an action is called the actor. The part of the agent responsible for estimating future rewards is called the critic.
What is the CartPole game about?
It is a game that consists of a pole attached to a cart that moves along a frictionless track. To be successful in the game, the agent has to learn how to keep the pole from falling over.
Conclusion
In this article, we have discussed the actor-critic model. We started with policy-based reinforcement learning, then covered the pseudo-code and implementation of the actor-critic model.
We hope this blog was useful and helped you enhance your knowledge of the actor-critic model. If you want to learn more, check out our articles on Introduction to Reinforcement Learning and Four Types of Learnings in Machine Learning. Do upvote our blogs if you find them useful.
Happy Coding!