2. Double Deep Q Network
Best attempt so far using our DDQN (after 29155 games)
This article, along with the rest of the series, is a work in progress and may undergo significant revisions. Please revisit it later for an updated version.
Double Deep Q Network (DDQN)
We'll explore Double Deep Q Network (DDQN), a Reinforcement Learning (RL) algorithm.
DDQN is based on two neural networks.
The first one, called the online network, estimates the action-value function: the function that computes the expected reward for a given state-action pair.
The second one, the target network, is used to compute the target values against which the online network's prediction error is measured.
This network won't be trained directly. We'll copy the weights from the online network to the target network at a given frequency.
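As a minimal sketch (assuming a PyTorch implementation; the placeholder layers below are not the real architecture, which is described further down), the two networks and the periodic weight copy could look like this:

```python
import copy

import torch.nn as nn

# Placeholder architecture; the actual CNN + fully connected network is described below.
online_network = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 5))
target_network = copy.deepcopy(online_network)

# The target network is never trained directly, so it does not need gradients.
for param in target_network.parameters():
    param.requires_grad = False

# At a given frequency, copy the online weights into the target network.
target_network.load_state_dict(online_network.state_dict())
```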
Understanding the Q-Value and the Replay Buffer
Neural networks are a great tool to approximate functions from examples; see the first article of the series.
The Action-Value function:
DDQN relies on approximating the action-value function given by the Bellman equation. This formula is a cornerstone in understanding how DDQN updates its predictions based on new experiences:
\[\text{new } Q(s,a) = \colorbox{Lavender}{$Q(s, a)$} + \colorbox{SkyBlue}{$\alpha$} [r + \colorbox{Apricot}{$\gamma$} \cdot \colorbox{GreenYellow}{$\max\limits_{a'} Q'(s', a')$} - \colorbox{Lavender}{$Q(s, a)$}]\]
With:
- $s$ and $a$: respectively the state and the action, from a tuple in the replay buffer.
- $\colorbox{Lavender}{$Q(s, a)$}$: the predicted Q-Value of the state-action pair.
- $r$: the reward for the action $a$ given the state $s$.
- $\colorbox{GreenYellow}{$\max\limits_{a'} Q'(s', a')$}$: the maximum expected future reward, as estimated by the target network.
- $\colorbox{SkyBlue}{$\alpha$}$: the learning rate.
- $\colorbox{Apricot}{$\gamma$}$: the discount rate.
- $r + \colorbox{Apricot}{$\gamma$} \cdot \colorbox{GreenYellow}{$\max\limits_{a'} Q'(s', a')$}$: the target value, used to compute the loss.
The value of taking an action $a$ in state $s$ is equal to the immediate reward, plus the value of taking the best action $a'$ in the next state $s'$, multiplied by a discount factor.
This function is recursive: the value of $Q'(s', a')$ itself depends on the value of the best subsequent state-action pair.
The essence of this equation is to balance immediate rewards with anticipated future rewards, thus guiding the agent towards optimal long-term strategies.
The discount factor makes rewards from future states a bit less valuable. Its value lies between 0 (future rewards are ignored) and 1 (future rewards count fully). A typical value is 0.99.
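To make the update concrete, here is a tiny numeric sketch with purely illustrative values (in the actual DDQN, the learning rate is handled by the optimizer rather than applied explicitly):

```python
# One transition sampled from the replay buffer (purely illustrative numbers).
reward = 1.0                               # r
gamma = 0.99                               # discount rate
alpha = 0.1                                # learning rate
q_sa = 0.5                                 # Q(s, a): current prediction for the chosen action
next_q_values = [0.2, 0.8, 0.4, 0.1, 0.3]  # Q'(s', a') for each of the 5 possible actions

# Target value: r + gamma * max_a' Q'(s', a')
target = reward + gamma * max(next_q_values)   # 1.792

# Bellman-style update of the prediction towards the target.
new_q_sa = q_sa + alpha * (target - q_sa)      # 0.6292
```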
Replay Buffer
A replay buffer stores past experiences, forming a memory bank for the agent to learn from. This is essential for two reasons:
- It mitigates the correlation between sequential experiences by allowing random sampling.
- It enables reusing data for more efficient learning.
Each stored experience is a tuple containing (a minimal buffer sketch follows the list below):
- $s$: the state, a frame of the game,
- $a$: the action taken by the agent in reaction to the state,
- $r$: the reward received for this action and state,
- $s'$: the next state, i.e. the state resulting from the action,
- $done$: the value of flag_get, a boolean indicating whether the agent has finished the level.
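A minimal sketch of such a buffer, assuming a fixed capacity and uniform random sampling (the exact implementation used for this project may differ):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between sequential frames.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```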
The Exploration-Exploitation dilemma
The exploration vs. exploitation dilemma refers to a trade-off in the agent's behavior. In exploitation, the agent takes actions it already knows will yield rewards. In exploration, it tries new actions to discover new strategies.
DDQN navigates this dilemma using $\epsilon$ ('epsilon'), a probability that determines how often the agent takes a random action instead of the best predicted one. This lets the agent explore the state space rather than getting stuck in a locally optimal strategy (a small sketch of the policy follows the list below).
- At the beginning of the learning process, epsilon is set to 1, ensuring maximum exploration.
- It gradually decreases to a small value near 0, allowing the agent to rely more on its learned strategies while maintaining some level of exploration.
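A sketch of this epsilon-greedy behaviour; the decay schedule and the network call are assumptions for illustration:

```python
import random

import torch

def select_action(online_network, state, epsilon, n_actions=5):
    """Epsilon-greedy action selection."""
    if random.random() < epsilon:
        return random.randrange(n_actions)           # exploration: random action
    with torch.no_grad():
        q_values = online_network(state.unsqueeze(0))
    return int(q_values.argmax(dim=1).item())        # exploitation: best predicted action

# Hypothetical decay schedule: start fully random, slowly approach a small floor.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.02, 0.99999
epsilon = max(epsilon_min, epsilon * epsilon_decay)  # applied after every step
```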
Network architecture
The neural network will be composed of two modules:
- A convolutional neural network (CNN) dedicated to extracting information from the input images (the states). CNNs are designed to extract hierarchical features from images: each layer learns and computes filters (or convolutions) on the output of the previous one.
- A fully connected neural network, in charge of "taking the decisions" based on the previously extracted visual features.
Here, the input size is the dimension of the state, and the output layer has 5 neurons, one per possible action. Each neuron activation represents the predicted Q-value for the corresponding state-action pair.
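As an illustration only, such an architecture could be written in PyTorch as follows; the input preprocessing (4 stacked 84x84 grayscale frames) and the layer sizes are assumptions, not necessarily the ones used in this project:

```python
import torch
import torch.nn as nn

class DDQNNetwork(nn.Module):
    """CNN feature extractor followed by a fully connected 'decision' head."""

    def __init__(self, in_channels=4, n_actions=5):
        super().__init__()
        # Convolutional module: extracts visual features from the stacked game frames.
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Fully connected module: maps features to one Q-value per action.
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.head(self.features(x))

# One Q-value per action for a batch of one 84x84 state made of 4 stacked frames.
q_values = DDQNNetwork()(torch.zeros(1, 4, 84, 84))  # shape: (1, 5)
```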
How to train the online network
- Sample a random experience from the replay buffer to get a (state, action, reward, next state, done) tuple.
- Pass the state to the online network to get the predicted Q-Value for the action of the tuple. Predicted Q-Value: $Q(s, a)$.
- To see how accurate it is, compute the target value:
  - Pass the next state to the target network to get the target Q-Values for each possible action.
  - Take the highest target Q-Value: $\max\limits_{a'} Q'(s', a')$.
  - Use the reward to compute the final target value: $r + \gamma \cdot \max\limits_{a'} Q'(s', a')$, where $\gamma$ is the discount factor.
- Compute the loss as the difference between the predicted Q-Value and the target value.
- Do the error backpropagation on the online network.
After each learning episode, the online network improves a bit. After a few episodes, we update the target network with the online network's weights.
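Putting these steps together, here is a sketch of one learning step, assuming the network and buffer sketches above plus a standard PyTorch optimizer (batch handling is reduced to a single transition for clarity):

```python
import torch
import torch.nn.functional as F

def learning_step(online_network, target_network, optimizer, transition, gamma=0.99):
    """One DDQN update from a single (state, action, reward, next_state, done) tuple."""
    state, action, reward, next_state, done = transition

    # Predicted Q-value of the action actually taken: Q(s, a).
    predicted_q = online_network(state.unsqueeze(0))[0, action]

    # Target value: r + gamma * max_a' Q'(s', a'); no future reward past a terminal state.
    with torch.no_grad():
        next_q = target_network(next_state.unsqueeze(0)).max(dim=1).values[0]
        target = reward + gamma * next_q * (1.0 - float(done))

    # MSE loss between prediction and target, backpropagated on the online network only.
    loss = F.mse_loss(predicted_q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice a whole batch of transitions would be processed at once, and the target network would be refreshed with `target_network.load_state_dict(online_network.state_dict())` at the chosen frequency.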
The Error Metric: Mean Square Error (MSE)
DDQN uses MSE to measure the accuracy of predictions:
- $MSE = \frac{1}{2} (y - \hat{y})^2$
with:
- $y = r + \gamma \cdot \max\limits_{a'} Q'(s', a')$: the target value.
- $\hat{y} = Q(s, a)$: the predicted Q-value.
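For example, with the illustrative numbers from the earlier sketch, $y = 1.792$ and $\hat{y} = 0.5$ give an error of $\frac{1}{2}(1.792 - 0.5)^2 \approx 0.83$.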
Next steps
The upcoming articles will cover Proximal Policy Optimization (PPO) and possibly NeuroEvolution of Augmenting Topologies (NEAT). The next one will focus on PPO and its mechanisms.
Thanks and see you next time!
Sources:
- Sourish Kundu's video: https://www.youtube.com/watch?v=_gmQZToTMac