Learning to play Mario Bros with reinforcement!

Let’s get a computer to play Mario Bros using reinforcement learning. In this experiment, we’ll attempt to teach a computer to master ‘Mario Bros’ using neural networks. Ready to level up?

Best attempt so far using a Double Deep Q Network (DDQN) after 29155 games

This article is an introduction to machine learning, neural networks, and reinforcement learning, and how they interact. If you’re already familiar with these concepts, feel free to jump ahead to the next articles:

Part 1: The Gym Setup
Part 2: Double Deep Q Network

Project introduction

Learning to play a game, such as Mario Bros, using neural networks, is a project I have wanted to start for a long time. The field of machine learning has always fascinated me. My PhD research focused on applying neural networks to graphs, but the field of machine learning is remarkably vast. For example, I’ve never explored the world of reinforcement learning (RL).

In machine learning, there are three main ways to teach an algorithm:

Supervised learning: this is the most intuitive method. It’s similar to teaching a child to recognize letters. For each letter they might encounter, you would say this is an 'A', this is a 'B', and so on. Technically you need to collect a labeled dataset: in this case, many occurrences of letters. For each letter, you have to give it a label, such as A, B, etc.
Unsupervised learning: here, algorithms discover patterns on their own. Imagine children playing with shape-sorter toys. They figure out how to match shapes with the correct holes by focusing on key features like the shape, while ignoring irrelevant ones like color or material. Technically you let the algorithm create groups with data following the same patterns. Then you label each group afterward.
Reinforcement learning: in this approach, an algorithm learns to perform tasks on its own by receiving rewards from the environment. This method seems to fit ideally in video games: each in-game action will result in a modification of the score, providing a reward, which guides the learning process to explore and exploit the game’s environment.

We’ve just talked about machine learning and neural networks, but what is the difference?

Neural networks and machine learning

Machine learning is a broad concept: it includes any algorithm that learns to solve a given task. In this context, ‘learning’ means adjusting the algorithm’s internal parameters.

Neural networks are a type of learning algorithm inspired by biological neural networks¹. They are usually structured in layers.

The first layer is the input: it’s where the data enters. For example, in a network designed to recognize letters from images, there’s a neuron for each pixel value.
The last layer is the output of the network. In the letter recognition example, there would be 26 output neurons, each representing a different letter.
All the intermediate, or hidden, layers do the network’s computation.
Each neuron is connected with a certain strength, or weight, to some neurons in the next layer. The strength is usually encoded between 0 (no connection) to 1 (full connection). See below where the weights are indicated on each connection.

Layer 1 (Input Neurons)

Input 1
1.0

Input 2
0.5

Layer 2 (Output Neuron)

Output
0.6

Target Label

Target
1.0

Connection Weights. The network starts with initial weights assigned to each connection: $w_1 = 0.2$ and $w_2 = 0.8$.

Forward Propagation. Input activations ($1.0$ and $0.5$) are multiplied by their weights and summed to compute the output neuron: $\text{Output} = (1.0 \times 0.2) + (0.5 \times 0.8) = 0.6$.

Loss / Error Calculation. The generated output ($0.6$) is compared with the target label ($1.0$) to calculate the loss: $\text{Error} = \text{Target} - \text{Output} = 1.0 - 0.6 = 0.4$.

Backpropagation. The computed error is sent backward through the network to update the weights using a learning rate $\eta = 0.1$. The weights are adjusted: $\Delta w_1 = \eta \times \text{Error} \times \text{Input}_1 = 0.04$ (new $w_1 = 0.24$) and $\Delta w_2 = \eta \times \text{Error} \times \text{Input}_2 = 0.02$ (new $w_2 = 0.82$).

Main steps of a neural network:

Forward Propagation: Compute the neural network output: the prediction².
Error computation: Compare the prediction with the target label.
Backpropagation: Uses a process called error backpropagation. Since the network’s output is computed step-by-step through a chain of operations (inputs $\to$ weights $\to$ output $\to$ error), we use The Chain Rule to backtrace the error.

Think of it like a chain of gears: if turning the first gear turns the second, which turns the third, the overall rotation rate is found by multiplying the gear ratios along the chain:
\[\frac{\partial \text{Gear}_3}{\partial \text{Gear}_1} = \frac{\partial \text{Gear}_3}{\partial \text{Gear}_2} \times \frac{\partial \text{Gear}_2}{\partial \text{Gear}_1}\]
In our neural network, to find how a tiny change in a connection weight $w$ affects the final error $E$, we multiply the rates of change backwards:
\[\frac{\partial E}{\partial w} = \frac{\partial E}{\partial \text{Output}} \times \frac{\partial \text{Output}}{\partial w}\]
By tracing this chain of multiplications backward, we calculate the exact gradient (sensitivity) for every weight, showing us how much to adjust each connection to reduce the overall error.

For instance, in a letter recognition network with 26 output neurons, if we present an A and the network outputs 0 everywhere except for B, the resulting error is propagated backwards. The Chain Rule traces the blame back through the connections, allowing us to update the weights so the network predicts A correctly next time.
On a side note, we like the network’s answers to be values between 0 and 1 for each neuron. That’s why we use activation functions at the end of layers. In our example, on the output layer, we’d like the output to be as close as possible to 0 or 1. But what if the values are less close, like 0.3 or 0.8? Some activation functions, like softmax, tend to ‘push’ mean values to 0 and 1. So we likely use such a function at the end of our output layer. Designing neural networks can be challenging.

After this overview of neural networks, you can understand how they work: information is fed to the input layer, transformed by each layer’s neurons and connections, and a result is given as the activation of the last layer. After an error, the backpropagation tries to correct the connections.

Besides being composed of very simple units (neurons and connections), neural networks are powerful: they have the ability to approximate any function³. I like to see them as compressed look-up tables storing experiences, making them ideal to create estimators or mapping functions in large spaces.

But how can neural networks be used in reinforcement learning?

Reinforcement Learning

Reinforcement Learning (RL) is a method of machine learning where an agent learns decision-making in an environment to achieve a goal. Unlike traditional methods used for classification or prediction, RL focuses on learning a strategy, or policy, to take actions based on environmental feedback.

Concept	Definition	Chess / Game Analogy
Agent	The algorithm that learns, interacts, and makes decisions.	The player (or AI) playing one of the two sides.
Environment	The external world/system the agent interacts with.	The chess framework, board, and rules of the game.
State	A complete snapshot of the environment at a specific moment in time.	The exact coordinates and arrangement of all pieces on the board.
Observation	What the agent actually perceives (which can be a partial view of the global State).	The full board in chess (fully observable), or just your own cards in poker (partially observable).
Action	The input, choice, or move the agent makes in the environment.	Moving a pawn to B3 or rotating a block in Tetris.
Reward	Feedback indicating the immediate value or positivity of an action.	Gaining points (e.g. +5 for capturing a rook) or increasing game score.
Value Function	Estimates the expected long-term reward from a given state or state-action pair.	Assessing how secure or strategically advantageous a board position is.
Policy	The decision-making strategy/mapping from states to actions.	A strategy like: “when in check, move the king to safety.”

In RL problems, states and action space can be complex and high-dimensional (like in the chess game). It is fairly impossible for a human to write what action an agent has to take given the board state, like a look-up table, that store the ideal action given any state.

In an ideal world, we’d like this look-up table to be compact and discovered by the agent instead of written by humans. That’s where neural networks enter in the game. We’ve seen they are universal approximators that act like a compressed look-up table. In fact we can use neural network at different points in RL:

As a policy: a neural network computes or stores directly the action to take given the state. It acts as a state-action look-up table.
As a value function: as an estimator that computes the reward for a given state.
As a compressor: when states or action spaces are very large, we might want a neural network to compress them by computing a representation (another interesting property of neural networks) into a smaller space. This can be combined with the two previous strategies.

Why this this challenging?

The core challenge in RL is how to effectively assign rewards to a sequence of actions leading to a delayed outcome. In many scenarios, the ultimate reward is only known at the end of a sequence of actions, such as winning or losing a chess game. The challenge involves determining how to propagate the final reward back to earlier actions, optimizing the agent’s policy (its decision-making strategy) to favor actions that are more likely to lead to successful outcomes in the long run.

The video shown at the top of the page is a good episode, but it doesn’t mean the model has converged to a state where it plays only good episodes.

My goal with this project is to deepen my understanding of reinforcement learning and its algorithms, and to share what I have learned. I’d like to see if a computer can truly master ‘Mario Bros’ through these methods.

Next steps

Thank you for reading this long and technical article. I tried to explain all our challenges and the tools we can use.

In the upcoming articles, we’ll talk about how to set up the environment and experiment with algorithms like Double Deep Q Networks (DDQN). In the future, it would also be really cool to test other options like Proximal Policy Optimization (PPO, which is the algorithm behind RL from Human Feedback (RLHF) used to train GPT models), and NeuroEvolution of Augmenting Topologies (NEAT, which doesn’t use error backpropagation but is a highly efficient evolutionary approach).

Stay tuned for the next article where we’ll discuss the project setup.

Thanks again for your time, and see you!

Original model from McCulloch and Pitts: Warren S. MCCULLOCH and Walter PITTS. «A logical calculus of the ideas immanent in nervous activity». In: The bulletin of mathematical biophysics 5.4 (1943), p. 115-133. ISSN: 1522-9602. DOI: 10.1007/BF02478259. ↩
More formally, the output $y$ of a neuron is given by $y = \sigma(b + \sum_{i \in I} w_i . x_i)$, where $I$ is the set of the input neurons, $x_i$ is the activation of the input neuron $i$, $w_i$ is the weight of the connection between $i$ and the current neuron, $b$ is the bias (a value representing prior knowledge on the task) of the current neuron and $\sigma$ is an activation function. ↩
They are referred to as universal approximators: Kurt HORNIK. «Approximation capabilities of multilayer feedforward networks». In: Neural Networks 4.2 (1991), p. 251-257. ISSN: 08936080. DOI: 10.1016/0893-6080(91)90009-T. ↩