Double Deep Q Network

Best attempt so far using our DDQN (after 29,155 games)

In this article, we’ll explore the Double Deep Q-Network (DDQN) algorithm, a powerful reinforcement learning (RL) method that extends standard Deep Q-Networks (DQN) to stabilize training and improve performance on complex tasks like playing Mario Bros.

1. Why Double DQN? The Overestimation Bias

In standard DQN (introduced by Mnih et al. in 2015), the target value $y$ for updating the network is computed using the maximum expected future return:

\[y = r + \gamma \cdot \max_{a'} Q_{\theta^{-}}(o', a')\]

Where:

$Q_{\theta^{-}}$ represents the Target Network parameters.
$o’$ is the next observation.
$a’$ represents possible next actions.

The Problem: Overestimation Bias

Because standard DQN uses the same network parameters $\theta^{-}$ to both select the best action and evaluate its value, it suffers from a positive maximization bias. Due to noise, approximation errors, and environmental stochasticity, some Q-values will be randomly overestimated. Taking the maximum over these estimates propagates this positive bias back through the network, leading to sub-optimal policies and unstable training.

The Solution: Decoupled Selection & Evaluation

Double DQN (introduced by van Hasselt et al. in 2015) solves this by decoupling the action selection from the action evaluation:

Selection: Use the Online Network ($Q_{\theta}$) to choose the best action $a^*$ in the next observation $o’$: $a^*$ = \arg\max_{a'} Q_{\theta}(o', a')$
Evaluation: Use the Target Network ($Q_{\theta^{-}}$) to evaluate the value of that chosen action: $y^{\text{DoubleDQN}} = r + \gamma \cdot Q_{\theta^{-}}(o', a^*)$

This prevents the agent from greedily overestimating values, leading to much more stable convergence.

2. Visualizing the DDQN Training Loop

Replay Buffer

Sampled Transition:
$(o, a, r, o', \text{done})$

Online Network $Q_{\theta}$

Online Network

• Predicts $Q_{\theta}(o, a)$
• Selects action $a^* = \arg\max_{a'} Q_{\theta}(o', a')$

Target Network $Q_{\theta^{-}}$

Target Network

• Evaluates value of selected action:
$Q_{\theta^{-}}(o', a^*)$

MSE Loss
$\mathcal{L}(\theta) = \frac{1}{2}\big(Q_{\theta}(o, a) - y\big)^2$

Sample Transitions. A random transition $(o, a, r, o', \text{done})$ is sampled from the Replay Buffer. The current observation and action are sent to the Online Network, while the next observation $o'$ is sent to both networks.

Action Selection. The Online Network selects the best action $a^* = \arg\max_{a'} Q_{\theta}(o', a')$ for the next observation, and shares it with the Target Network.

Action Evaluation. The Target Network evaluates the value of the selected action, $Q_{\theta^{-}}(o', a^*)$, to calculate the TD Target: $y = r + \gamma(1-\text{done})Q_{\theta^{-}}(o', a^*)$.

Loss & Update. We calculate the Mean Squared Error (MSE) loss against the target $y$ and backpropagate the gradients to update the Online Network parameters $\theta$.

3. Core Components under the Hood

3.1. The Replay Buffer

A Replay Buffer stores past experiences, forming a memory bank for the agent. This is essential for two key reasons:

Breaking Temporal Correlation: During online play, sequential observations are highly correlated. Random sampling from the buffer breaks this correlation, satisfying the I.I.D. assumption required for stable gradient descent.
Data Efficiency: Experiences are reused multiple times for training before being discarded.

Each stored transition tuple contains:

$o$: The observation (e.g., a stack of 4 consecutive grayscale game frames).
$a$: The action chosen by the agent.
$r$: The reward received.
$o’$: The resulting next observation.
$\text{done}$: A boolean flag indicating whether the episode ended (e.g., Mario died or cleared the level).

3.2. Exploration vs. Exploitation ($\epsilon$-Greedy)

To find optimal strategies, the agent must balance:

Exploration: Trying new, random actions to discover higher-reward observations.
Exploitation: Choosing actions it already knows will yield high rewards.

We navigate this using an $\epsilon$-greedy policy:

The agent chooses a random action with probability $\epsilon$ and the best action ($\arg\max_a Q(o, a)$) with probability $1 - \epsilon$.
$\epsilon$ starts at $1.0$ (pure exploration) and decays exponentially to a small minimum (e.g., $0.01$), allowing the agent to rely primarily on its learned policy over time.

4. Network Architecture

The neural network is composed of two primary modules:

Convolutional Neural Network (CNN): Extracts visual features from the raw game frames (e.g., detecting obstacles, enemies, and Mario’s position).
Fully Connected Network (MLP): Takes the extracted visual features and maps them to Q-value predictions for each possible action: $\text{Output} = [Q(o, a_1), \dots, Q(o, a_5)]$

5. Next Steps

In the upcoming articles, we will probably explore Proximal Policy Optimization (PPO) and its mechanisms to see if policy gradient methods perform better on this platform.

Next Article: The Gym Setup
Source Code: GitHub Repository