1. The Gym setup!
Best attempt so far using a DDQN (after 29155 games)
Mario Bros as a Reinforcement Learning Challenge
In this project to have an RL algorithm play ‘Mario Bros’, let’s first explore the essential tools and techniques that make this possible. This article focuses on the OpenAI Gym platform, which offers simulations and an API to communicate with them. We’ll review how to interact with the emulator to receive rewards and how these rewards are calculated.
The Gym platform and the nes-py emulator
The testing platform for this experiment is based on the OpenAI Gym API and its successor Gymnasium. They offer a Super Mario Bros environment based on the nes-py emulator.
This Mario Bros environment is an ideal testing ground due to its balance between simplicity and complexity. Its traps and enemies provide clear but challenging objectives for algorithms. The key complexities of playing ‘Mario Bros’ are:
- Game Physics: The game’s internal physics adds complexity to movement and interaction.
- Complex Reward: Rewards depend on factors like level completion, remaining time, etc.
- In-Game Challenges: The presence of traps, such as holes, enemies, and pipes, creates a challenging environment for algorithms.
Plus, using raw images and gamepad inputs directly can be inefficient and overly complex. In many frames, numerous pixels don’t carry useful information. To make the learning process more efficient, the Gym platform provides simplifications to interact with the game:
- Emulator: the environment provides us with an emulator (nes-py) to run our simulation. It also wraps the emulator API in a simple way to retrieve frames and send controller inputs.
- Simplified inputs and outputs: We can choose fine-grained inputs and outputs for the interactions between the environment and the agent.
- Rewards system: The environment will provide rewards aligned with the game’s objective. We’ll explain that in the next section.
- Environment abstraction: Removes loading screens and cut-scenes to focus solely on gameplay.
- Data reduction: Use a lower-resolution version of the game to reduce input complexity. We’ll use the SuperMarioBros-v3 simplified version.
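As a quick sanity check (a minimal sketch, assuming gym-super-mario-bros and nes-py are installed), the base -v3 environment can be created directly to inspect the raw frames it returns:
import gym_super_mario_bros

# Create the simplified -v3 version of the first level
env = gym_super_mario_bros.make('SuperMarioBros-1-1-v3')
state = env.reset()
# The raw observation is an RGB frame of shape (240, 256, 3)
print(state.shape)
env.close()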
We aim to build a neural network that processes raw game images as input and translates them into gamepad events as output:
- Input: Initially, the input requires a large number of neurons, one for each piece of pixel information in a game frame, typically (240 x 256 x 3) or 184,320 input neurons. By downsampling the frame resolution and converting it to grayscale, we’ll simplify this to (30 x 32 x 1), or 960 neurons.
- Output: The number of output neurons corresponds to the possible actions. Each neuron represents a different action, like ‘push the move right button’, ‘push the jump button’, etc.
We’ll use a simplified output mode called RIGHT_ONLY, which has 5 possibilities: [['NOOP'], ['right'], ['right', 'A'], ['right', 'B'], ['right', 'A', 'B']], where NOOP means no input. These simplifications significantly reduce the complexity of the input and output spaces.
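To make this concrete, here is a small hedged sketch that wraps the environment with JoypadSpace and prints the resulting simplified action space (the imports reflect the usual nes-py and gym-super-mario-bros layout):
import gym_super_mario_bros
from gym_super_mario_bros.actions import RIGHT_ONLY
from nes_py.wrappers import JoypadSpace

env = gym_super_mario_bros.make('SuperMarioBros-1-1-v3')
# Restrict the controller to the 5 RIGHT_ONLY combinations
env = JoypadSpace(env, RIGHT_ONLY)
print(env.action_space)  # Discrete(5): one integer id per button combination
print(RIGHT_ONLY)        # [['NOOP'], ['right'], ['right', 'A'], ['right', 'B'], ['right', 'A', 'B']]
env.close()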
In the Gym framework, the concept is to use a base environment that provides the simulation, then add wrappers on top of it. Each wrapper modifies the environment and adds some features:
# Imports (an assumption: the observation wrappers come from gym.wrappers;
# SkipFrame is a custom wrapper, sketched further below)
import gym_super_mario_bros
from gym_super_mario_bros.actions import RIGHT_ONLY
from nes_py.wrappers import JoypadSpace
from gym.wrappers import ResizeObservation, GrayScaleObservation, FrameStack

# Create the base environment
env = gym_super_mario_bros.make('SuperMarioBros-1-1-v3')
# Set up the joypad space to the simplified output mode
env = JoypadSpace(env, RIGHT_ONLY)
# The SkipFrame wrapper skips some frames and accumulates the reward during these frames,
# so the agent takes a decision every four frames
env = SkipFrame(env, skip=4)
# ResizeObservation resizes the frames given to the agent, keeping the ratio and setting the width to 30px
env = ResizeObservation(env, shape=30)
# Convert the frames to grayscale
env = GrayScaleObservation(env, keep_dim=False)
# Stack the last four frames to be given as input to the agent
env = FrameStack(env, num_stack=4, lz4_compress=True)
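The SkipFrame wrapper above is not one of the standard Gym wrappers; it is a small custom wrapper. Here is a minimal sketch of what it could look like (an assumption, the project’s actual implementation may differ slightly):
import gym

class SkipFrame(gym.Wrapper):
    """Repeat the chosen action for `skip` frames and sum the rewards."""

    def __init__(self, env, skip=4):
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        total_reward = 0.0
        done = False
        for _ in range(self._skip):
            # Replay the same action and accumulate the reward
            state, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return state, total_reward, done, info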
Here is the main loop, which lets us interact with the game simulation through the Gym platform:
# Initial state
done = False
state = env.reset()

# Main loop
while not done:
    # Here we pick a random action.
    # In practice the agent would select an action based on the current state,
    # for instance: action = model.predict_action(state)
    action = env.action_space.sample()
    # The environment executes the agent's action and returns the new state,
    # the reward and whether the game is over
    new_state, reward, done, info = env.step(action)
    # Render the new state
    env.render()
    # Update the current state to the new state
    state = new_state
In the described setup, game frames are used as states in the environment: state is the previous frame, and new_state is the frame obtained after the agent’s action.
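Note that with the FrameStack wrapper in place, each state is actually the stack of the last four grayscale frames rather than a single image. Here is a hedged sketch of how an agent might turn it into a plain array (np.asarray is an assumption about how the model consumes the observation):
import numpy as np

state = env.reset()
# FrameStack returns a LazyFrames object; convert it to a NumPy array
frames = np.asarray(state)
print(frames.shape)  # e.g. (4, 30, 32): the last four downscaled grayscale frames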
Additionally, info is a dictionary that holds various pieces of current game information, useful for creating custom reward functions or for tracking the agent’s progress (see the sketch after this list):
- flag_get: indicates whether the player has completed the level.
- The count of collected coins.
- The number of remaining lives.
- The in-game score so far.
- Mario’s status: small, tall, fireball, etc., indicating his current power-up state.
- The remaining time on the clock.
- x_pos and y_pos: Mario’s position on the stage, providing spatial context within the game.
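As a rough illustration (a sketch, not the project’s actual code), these fields can be read inside the main loop to follow how far the agent gets and whether it finishes the level:
best_x = 0
done = False
state = env.reset()
while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    # Track the farthest x position reached so far
    best_x = max(best_x, info['x_pos'])
    if info['flag_get']:
        print('Level completed! score:', info['score'], 'time left:', info['time'])
print('Farthest x position reached:', best_x)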
Understanding this information can be crucial for developing a custom reward strategy. For now, it’s important to first understand how the current reward is computed within the environment.
How is the reward computed?
The reward is a critical part of the learning process, so let’s review how it is computed. The following information comes from the Gym Super Mario Bros page:
The reward function assumes the objective of the game is to move as far right as possible (increase the agent's x value), as fast as possible, without dying.
The reward in this Mario Bros environment is structured as the sum of three components, one positive (the velocity) and two negative (the clock penalty and the death penalty):
- The velocity, noted $v$, is the instantaneous velocity for the given step. It is computed as the difference in the agent’s x position between states: $v = x_1 - x_0$, where:
  - $x_0$ is the x position before the step
  - $x_1$ is the x position after the step
  - This implies that when moving right $v > 0$, when moving left $v < 0$, and when not moving $v = 0$.
- The clock, noted $c$, is a penalty that prevents the agent from standing still: when there is no clock tick $c = 0$, and when the clock ticks $c < 0$. Since the in-game clock counts down, this is the difference in the game clock between frames: $c = c_1 - c_0$, where:
  - $c_0$ is the clock reading before the step
  - $c_1$ is the clock reading after the step
- The death part, noted $d$, penalizes dying, which encourages the agent to avoid death. We simply want $d = 0$ while the agent is alive and $d = -15$ when it loses a life.
The total reward, $r$, is simply the sum of the three terms at each frame: $r = v + c + d$. The reward is also clipped to the range $(-15, 15)$. This clipping is typically employed to maintain stability in the learning process: by constraining the reward to a fixed range, it prevents the learning algorithm from being influenced by extreme reward values.
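To make the formula concrete, here is a tiny hedged sketch that mimics the computation described above (the environment does this internally; the numbers below are made up for illustration):
def mario_reward(x0, x1, c0, c1, died):
    v = x1 - x0              # velocity: positive when moving right
    c = c1 - c0              # clock: negative when the in-game clock ticks down
    d = -15 if died else 0   # death penalty
    # Clip the total reward into the range (-15, 15)
    return max(-15, min(15, v + c + d))

print(mario_reward(x0=100, x1=103, c0=400, c1=400, died=False))  # 3: moved right, no clock tick
print(mario_reward(x0=103, x1=103, c0=400, c1=399, died=False))  # -1: stood still while the clock ticked
print(mario_reward(x0=103, x1=103, c0=399, c1=399, died=True))   # -15: died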
Next steps
In upcoming articles we’ll talk about Double Deep Q Networks (DDQN), Proximal Policy Optimization (PPO) and NeuroEvolution of Augmenting Topologies (NEAT).
In the next article, we will explore DDQN and how it works.
Thanks and see you next time!