Best attempt so far using our DDQN (after 29155 games)

This article, along with its series, is a work in progress and may still undergo significant revisions. Please revisit it later for an updated version.

Double Deep Q Network (DDQN)

We’ll explore the Double Deep Q Network (DDQN), a Reinforcement Learning (RL) algorithm.

DDQN is based on two neural networks. The first one, called the online network, estimates the action-value function: the function that computes the expected reward of a given state-action pair. The second one, the target network, is used to compute the target values against which the online network’s prediction error is measured. This network won’t be trained directly: we’ll copy the weights from the online network to the target network at a given frequency.
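As a minimal sketch of this two-network setup (assuming PyTorch; the stand-in architecture and the sync frequency are placeholders, not the values used in this project):

```python
import copy
import torch.nn as nn

# Stand-in architecture; the network actually used is described in the
# "Network architecture" section below.
online_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 5))
target_net = copy.deepcopy(online_net)

# The target network is never trained directly.
for param in target_net.parameters():
    param.requires_grad = False

SYNC_EVERY = 10_000  # placeholder copy frequency, in environment steps

def maybe_sync(step: int) -> None:
    # Periodically copy the online weights into the target network.
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())
```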

Understanding the Q-Value and the Replay Buffer

Neural networks are great tools for approximating functions from examples. See the first article of the series.

The Action-Value function:

DDQN relies on approximating the action-value function given by the Bellman equation. This formula is a cornerstone in understanding how DDQN updates its predictions based on new experiences:

\[new Q(s,a) = \colorbox{Lavender}{Q(s, a)} + \colorbox{SkyBlue}{$\alpha$} [r + \colorbox{Apricot}{$\gamma$} \cdot \colorbox{GreenYellow}{$\max\limits_{a'} Q'(s', a')$} - \colorbox{Lavender}{Q(s, a)}]\]

With:

  • $s$ and $a$: respectively the state and the action, taken from a tuple in the replay buffer.
  • $\colorbox{Lavender}{Q(s, a)}$: the predicted Q-Value of the state-action pair.
  • $r$: the reward for the action $a$ given the state $s$.
  • $\colorbox{GreenYellow}{$\max\limits_{a'} Q'(s', a')$}$: the maximum expected future reward, as estimated by the target network.
  • $\colorbox{SkyBlue}{$\alpha$}$: the learning rate.
  • $\colorbox{Apricot}{$\gamma$}$: the discount rate.
  • $r + \colorbox{Apricot}{$\gamma$} \cdot \colorbox{GreenYellow}{$\max\limits_{a'} Q'(s', a')$}$: the target value, used to compute the loss.

The value of taking an action $a$ in state $s$ is equal to the immediate reward, plus the value of taking the best action $a'$ in the next state $s'$ multiplied by a discount factor. This function is recursive: the value of $Q'(s', a')$ itself depends on the value of the best subsequent state-action pair.

The essence of this equation is to balance immediate rewards with the anticipated future rewards, thus guiding the agent towards optimal long-term strategies.

The discount factor makes rewards from future states a bit less valuable. Its value lies between 0 (ignoring future rewards entirely) and 1 (taking them into account completely). A typical value is 0.99.
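As a quick sanity check with made-up numbers: take $\alpha = 0.1$, $\gamma = 0.99$, a current estimate $Q(s,a) = 2.0$, a reward $r = 1$ and $\max_{a'} Q'(s',a') = 3.0$. The target is $1 + 0.99 \cdot 3.0 = 3.97$, so the updated value becomes $2.0 + 0.1 \cdot (3.97 - 2.0) = 2.197$. The same computation in plain Python:

```python
def q_update(q_sa: float, reward: float, max_next_q: float,
             alpha: float = 0.1, gamma: float = 0.99) -> float:
    # new Q(s, a) = Q(s, a) + alpha * [r + gamma * max_a' Q'(s', a') - Q(s, a)]
    target = reward + gamma * max_next_q
    return q_sa + alpha * (target - q_sa)

print(q_update(2.0, 1.0, 3.0))  # ≈ 2.197
```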

Replay Buffer

A replay buffer stores past experiences, forming a memory bank for the agent to learn from. This is essential for two reasons:

  • It mitigates the correlation between sequential experiences by allowing random sampling.
  • It enables reusing data for more efficient learning.

Each stored experience is a tuple containing:

  • $s$: the state, a frame of the game,
  • $a$: the action taken by the agent in reaction to the state,
  • $r$: the reward received given this action and state,
  • $s'$: the next state, i.e. the state resulting from the action,
  • $done$: the value of flag_get, a boolean indicating whether the agent has finished the level.
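Here is a minimal sketch of such a buffer in Python (the capacity and batch size are arbitrary placeholder values, not the ones used in this project):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        # Oldest experiences are discarded automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one (s, a, r, s', done) tuple.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        # Uniform random sampling breaks the correlation between
        # consecutive frames of the same episode.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```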

The Exploration-Exploitation dilemma

The exploitation vs. exploration dilemma refers to a trade-off in the agent’s behavior. In exploitation, the agent takes actions it already knows will yield rewards. In exploration, it tries new actions to discover new strategies.

DDQN navigates this dilemma using $\epsilon$ (‘epsilon’), a probability determining how often the agent takes a random action rather than the best predicted one. This lets the agent explore the action space instead of getting stuck in a locally optimal strategy.

  • At the beginning of the learning process, epsilon is set to 1, ensuring maximum exploration.
  • It gradually decreases to a small value near 0, allowing the agent to rely more on its learned strategies while maintaining some level of exploration.
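A common way to implement this is an epsilon-greedy policy with a decaying epsilon. The sketch below assumes PyTorch, 5 possible actions, and an exponential decay schedule whose constants are placeholders:

```python
import random
import torch
import torch.nn as nn

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.02, 0.999_99  # assumed schedule
epsilon = EPS_START

def select_action(net: nn.Module, state: torch.Tensor, n_actions: int = 5) -> int:
    """Epsilon-greedy action selection with a decaying epsilon."""
    global epsilon
    epsilon = max(EPS_END, epsilon * EPS_DECAY)
    if random.random() < epsilon:
        # Exploration: pick a random action.
        return random.randrange(n_actions)
    # Exploitation: pick the action with the highest predicted Q-value.
    with torch.no_grad():
        q_values = net(state.unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```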

Network architecture

The neural network will be composed of two modules:

  1. A convolutional neural network (CNN) dedicated to extracting information from the input images (the states). CNNs are designed to extract hierarchical features from images: each layer learns and applies filters (or convolutions) to the output of the previous one.

  2. A fully connected neural network that will be in charge of “making the decisions” based on the previously extracted visual features.

Here, the input size is the dimension of the state, and the number of neurons in the output layer is 5, the number of possible actions. Each neuron’s activation represents the predicted Q-value of the corresponding state-action pair.
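A possible PyTorch sketch of this architecture is shown below; the layer sizes and the assumption that a state is a stack of 4 grayscale 84x84 frames are illustrative choices, not necessarily the ones used in this project:

```python
import torch
import torch.nn as nn

class DDQNNetwork(nn.Module):
    """CNN feature extractor followed by a fully connected 'decision' head."""

    def __init__(self, in_channels: int = 4, n_actions: int = 5):
        super().__init__()
        # 1. Convolutional module: extracts visual features from the frames.
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # 2. Fully connected module: maps the features to one Q-value per action.
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# Example: a batch of one 4x84x84 state yields 5 Q-values.
# q = DDQNNetwork()(torch.zeros(1, 4, 84, 84))  # shape: (1, 5)
```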

How to train the online network

  1. Sample a random experience from the replay buffer, i.e. a (state, action, reward, next state, done) tuple.
  2. Pass the state to the online network to get the predicted Q-Value for the action of the tuple.
    • Predicted Q-Value: $Q(s, a)$.
  3. To see how accurate it is, we compute the target value:
    • Pass the next state through the target network to get the target Q-Values for each possible action.
    • We take the highest target Q-Value: $\max\limits_{a'} Q'(s', a')$.
    • We use the reward to compute the final target value: $r + \gamma \cdot \max\limits_{a'} Q'(s', a')$. Here $\gamma$ is the discount factor.
  4. Compute the loss as the difference between the predicted Q-Value and the target value.
  5. Do the error backpropagation on the online network.

After each learning episode, the online network improves a bit. Every few episodes, we update the target network with the online network’s weights.
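Putting the steps together, a training step could look like the sketch below (PyTorch assumed). It follows the steps above literally, except that it samples a batch rather than a single experience and multiplies the future term by $(1 - done)$ so that terminal states contribute no future reward; the function and variable names are illustrative:

```python
import torch

GAMMA = 0.99  # discount factor

def train_step(online_net, target_net, optimizer, batch):
    # batch: list of (state, action, reward, next_state, done) tuples,
    # where the states are already tensors.
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.stack(states)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.stack(next_states)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Step 2: predicted Q-value of the action actually taken, Q(s, a).
    predicted = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Step 3: target value r + gamma * max_a' Q'(s', a'), computed with the
    # frozen target network; (1 - done) removes the future term at episode end.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
    target = rewards + GAMMA * max_next_q * (1.0 - dones)

    # Steps 4 and 5: loss (with the 1/2 factor of the MSE section below)
    # and backpropagation on the online network only.
    loss = (0.5 * (predicted - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```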

The Error Metric: Mean Square Error (MSE)

DDQN uses MSE to measure the accuracy of predictions:

  • $MSE = \frac{1}{2} (y - \hat{y})^2$

with:

  • $y = r + \gamma \cdot \max\limits_{a'} Q'(s', a')$: the target value.
  • $\hat{y} = Q(s, a)$: the predicted Q-value.
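For a single transition this is just a few lines; the $\frac{1}{2}$ factor matches the formula above (PyTorch’s built-in `MSELoss` omits it, which only rescales the gradient). The numbers are invented for the example:

```python
import torch

y = torch.tensor(3.97)      # target value: r + gamma * max_a' Q'(s', a')
y_hat = torch.tensor(2.00)  # predicted Q-value: Q(s, a)

loss = 0.5 * (y - y_hat) ** 2
print(loss.item())  # ≈ 1.9404
```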

Next steps

The upcoming articles will be about Proximal Policy Optimization (PPO) and possibly NeuroEvolution of Augmenting Topologies (NEAT).

In the next article we will talk about PPO and its mechanisms.

Thanks and see you next time!

Sources:

  • Sourish Kundu’s video: https://www.youtube.com/watch?v=_gmQZToTMac