From Scratch: Deriving Policy Gradients

The Shift from Value to Policy

When I started working on Game AI at Bhoos, my initial instinct was to look at Value-based methods like DQN. In Deep Q-Networks, we try to approximate Q(s, a)—the expected return of taking action a in state s. The policy is deterministic: just pick the action with the highest Q value.

But for many complex games (and certainly for continuous control tasks in robotics), this greedy setup is limiting: the argmax is intractable over continuous actions, the value estimates themselves can be unstable to learn, and in imperfect-information games the best strategy is often not deterministic at all. We often need a stochastic policy \pi_\theta(a|s) that gives us a probability distribution over actions.

This leads us to Policy Gradients. Instead of learning the value of a state, we directly optimize the parameters \theta of the policy network to maximize the expected reward.

The Objective Function

Our goal is to maximize the expected return J(\theta):

J(\theta) = E_{\tau \sim \pi_\theta}[R(\tau)]

Where \tau is a trajectory (a sequence of states and actions) and R(\tau) is the total reward. The challenge is: how do we take the gradient of an expectation? The environment dynamics are unknown, so we can't differentiate through them.
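
One way to see where the difficulty comes from is to expand the expectation. The probability of a trajectory factorizes into a part we control (the policy) and a part we don't (the dynamics):

J(\theta) = \int p_\theta(\tau) R(\tau) \, d\tau, \qquad p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} \pi_\theta(a_t | s_t) \, P(s_{t+1} | s_t, a_t)

The transition probabilities P(s_{t+1} | s_t, a_t) are unknown, and \nabla_\theta p_\theta(\tau) R(\tau) is not itself an expectation we can estimate by sampling rollouts. The trick below fixes exactly that.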

The Log-Derivative Trick

This is the most beautiful piece of math in RL. Using the identity \nabla_\theta p_\theta = p_\theta \cdot \nabla_\theta \log p_\theta (the log-derivative trick), we can derive the gradient:

\nabla_\theta J(\theta) \approx E_{\tau} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot G_t \right]
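
Sketching the steps in between: differentiate under the integral, then apply the identity to p_\theta(\tau):

\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau) R(\tau) \, d\tau = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) \, d\tau = E_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log p_\theta(\tau) \, R(\tau) \right]

Because \log p_\theta(\tau) = \log p(s_0) + \sum_t \log \pi_\theta(a_t | s_t) + \sum_t \log P(s_{t+1} | s_t, a_t) and only the middle sum depends on \theta, the unknown dynamics drop out of the gradient entirely. Replacing the full return R(\tau) with the reward-to-go G_t is allowed because an action taken at time t cannot influence rewards collected before t; it leaves the expectation unchanged and reduces variance.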

The Intuition

This equation tells a simple story:

  1. \nabla \log \pi(a|s) — the direction in parameter space that makes action a more likely.
  2. G_t (Return) — the “weight” of that update.

If the return G_t is positive, we push the parameters in the direction that makes that action more probable.
If the return is negative, we push in the opposite direction, making it less probable.

We are simply reinforcing actions that led to good outcomes.

From REINFORCE to PPO

The vanilla implementation of the equation above is called REINFORCE. However, it suffers from high variance. A single "lucky" trajectory can skew the updates.
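
To make this concrete, here is a minimal sketch of a REINFORCE update for a single trajectory, written in PyTorch. The policy_net, the discount factor, and the tensor shapes are illustrative assumptions, not the actual Bhoos implementation:

import torch
import torch.nn.functional as F

# One REINFORCE step for a single episode. `policy_net` is assumed to be a small
# torch.nn.Module mapping a batch of states to action logits (hypothetical names).
def reinforce_update(policy_net, optimizer, states, actions, rewards, gamma=0.99):
    # Reward-to-go G_t for every step, computed backwards through the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Log-probabilities of the actions that were actually taken.
    logits = policy_net(torch.stack(states))        # shape: (T, num_actions)
    log_probs = F.log_softmax(logits, dim=-1)
    idx = torch.arange(len(actions))
    taken = log_probs[idx, torch.tensor(actions)]   # log pi(a_t | s_t)

    # Gradient ascent on sum_t log pi(a_t | s_t) * G_t, i.e. descent on its negative.
    loss = -(taken * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Every trajectory produces one noisy gradient estimate, which is exactly where the variance problem shows up.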

In my work implementing agents for card games, I moved to A3C (Asynchronous Advantage Actor-Critic) and eventually PPO (Proximal Policy Optimization).

  1. Baseline Subtraction (Advantage):
    Instead of the raw return G_t, we use the advantage A_t = G_t - V(s_t).
    This asks: How much better was this action than the average action in this state?
    This significantly reduces variance.

  2. Clipping (PPO):
    PPO clips the probability ratio between the new policy and the old one so that a single update can't move the policy too far, preventing the “collapsing” training runs that are common in RL (see the sketch after this list).
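
Here is a minimal sketch of how the two ideas combine into a PPO-style loss, again in PyTorch. The clip range of 0.2 and the value/entropy coefficients are common defaults used for illustration, not the exact values from my agents:

import torch
import torch.nn.functional as F

# PPO-style clipped loss for one batch of transitions (illustrative sketch).
# new_log_probs, old_log_probs: log pi(a|s) under the current and data-collecting policies.
# values: V(s) from the critic head; returns: empirical returns G_t; entropy: per-state policy entropy.
def ppo_loss(new_log_probs, old_log_probs, values, returns, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    # Advantage: how much better each action was than the state's baseline value.
    advantages = returns - values.detach()

    # Probability ratio pi_new(a|s) / pi_old(a|s) for the actions that were taken.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Clipped surrogate: the objective gains nothing from pushing the ratio outside
    # [1 - clip_eps, 1 + clip_eps], which caps how far one update can move the policy.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value loss keeps the baseline accurate; the entropy bonus keeps exploration alive.
    value_loss = F.mse_loss(values, returns)
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()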

Implementing these from first principles gave me an appreciation for the delicate balance between exploration (entropy) and exploitation (value estimation) required to train robust agents.