Superhuman Algorithm: MuZero Explained

The algorithm that wins without knowing the rules

Jul 6, 2021

Preface

MuZero is a reinforcement learning algorithm developed by Google DeepMind. It is the successor to the famous AlphaZero algorithm, and its performance is nothing short of superhuman. The algorithm matches or exceeds the earlier reinforcement learning algorithms developed by DeepMind, and it does so without being told the rules of the game! In the original 2019 paper released by DeepMind, MuZero matched AlphaZero’s superhuman performance in chess and shogi, slightly exceeded it in Go, and set a new state of the art across a suite of 57 Atari games. The rise of artificial intelligence, and the many theories surrounding it in both fiction and non-fiction, has arguably made it one of the most discussed areas of computing, especially when Google releases new machine learning algorithms.

Evaluation of MuZero throughout training in chess, shogi, Go and Atari. The x-axis shows millions of training steps. For chess, shogi and Go, the y-axis shows Elo rating, established by playing games against AlphaZero using 800 simulations per move for both players. MuZero’s Elo is indicated by the blue line, AlphaZero’s Elo by the horizontal orange line. For Atari, mean (full line) and median (dashed line) human normalized scores across all 57 games are shown on the y-axis, and the orange line shows the performance of R2D2.

Introduction

MuZero is described as a reinforcement learning planning algorithm. Planning algorithms traditionally rely on knowledge of the environment’s dynamics, such as the rules of the game or a simulator, which prevents them from being used in real-world scenarios where no simulator is available to accurately predict the next state. MuZero, a model-based reinforcement learning algorithm, aims to address this issue by learning its own model of the environment and planning with it. It uses this approach to plan and solve problems in visually complex environments. MuZero follows many of the same approaches as its predecessor AlphaZero; the key difference is that it plans with a learned model rather than with the true rules. It achieves this ability to plan and predict the future with the help of three neural networks (for representation, dynamics and prediction), unlike AlphaZero, which makes use of only one.

The MuZero algorithm aims to accurately predict the characteristics of the future that it deems important for planning. The algorithm initially receives an input, for instance an image of a chess board, which is translated into a hidden state. The hidden state is then updated iteratively, each time based on the previous hidden state and a proposed next action. Each time the hidden state is updated, the model predicts three quantities: the policy, the value function and the immediate reward. The policy indicates which move to play next, the value function is the predicted outcome (for example, the predicted winner), and the immediate reward measures the benefit of the move just played (whether it improves the player’s position). The model is then trained to predict these three quantities accurately. The pseudocode provided with the research paper is used in this section to highlight points of interest and to explain some of the code in detail; any mention of functions and classes refers to that pseudocode. The pseudocode can be found here.
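
To make the flow from input to hidden state to predictions concrete, here is a minimal sketch in Python. It is not DeepMind's implementation: the three functions are hand-written toy stand-ins for the neural networks, and every name in it (Prediction, representation, dynamics, prediction, unroll, NUM_ACTIONS) is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

NUM_ACTIONS = 3  # hypothetical size of the action space

Hidden = Tuple[float, ...]


@dataclass
class Prediction:
    policy: List[float]  # probability assigned to each action
    value: float         # predicted final outcome from this state
    reward: float        # predicted immediate reward for the last action


def representation(observation: Sequence[float]) -> Hidden:
    """Toy stand-in for the network that encodes an observation."""
    return tuple(observation)


def dynamics(hidden: Hidden, action: int) -> Tuple[Hidden, float]:
    """Toy stand-in: returns the next hidden state and a predicted reward."""
    next_hidden = tuple(h + 0.1 * (action + 1) for h in hidden)
    return next_hidden, 0.0


def prediction(hidden: Hidden) -> Tuple[List[float], float]:
    """Toy stand-in: returns a uniform policy and a simple value estimate."""
    policy = [1.0 / NUM_ACTIONS] * NUM_ACTIONS
    value = sum(hidden) / len(hidden)
    return policy, value


def unroll(observation: Sequence[float], actions: Sequence[int]) -> List[Prediction]:
    """Encode the observation once, then step the hidden state per action."""
    hidden = representation(observation)
    policy, value = prediction(hidden)
    outputs = [Prediction(policy, value, 0.0)]  # no reward before any move
    for action in actions:
        hidden, reward = dynamics(hidden, action)
        policy, value = prediction(hidden)
        outputs.append(Prediction(policy, value, reward))
    return outputs
```

Note that the real environment is consulted only once, to produce the first observation. Every later step works purely on the hidden state, which is what lets the algorithm plan without a simulator.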

The MuZero algorithm does not receive the rules of the game. In a chess game, for instance, its model is never told which moves are legal or what constitutes a win, a draw or a loss. It has no built-in concept of these rules, so it aims to create a hidden representation of the game. This can be thought of as MuZero creating mini-games of the game it is actually playing. Sticking to the example of chess, MuZero might learn that developing its pieces towards the centre gives it more control over the board, and it would then aim to achieve this result using its hidden state representation. Another crucial point regarding the hidden representation is that, in a board game, MuZero only receives the real reward for its actions when the game terminates in a win, a draw or a loss. During its learning, however, MuZero receives rewards periodically based on how well it is doing in its own mini-games. In essence, MuZero is able to estimate its performance in a game from its own representation of that game, which can be seen as MuZero creating rules for itself, seeing as it was never given the rules.

MuZero generates game data by playing against itself many times and recording each game so that it can be used in training. This allows the algorithm to improve its neural networks, since they have access to the game data, the hidden representation of the game and the outcome, among other things. Training the neural networks allows MuZero to play more accurately and to produce better game data. This feedback loop runs continuously, allowing MuZero and its neural networks to improve over time.

The two parts of MuZero

When examining the MuZero algorithm from a high-level perspective, it can be argued that there are two key components to it: self-play and training, which are used to generate game data and to train the neural networks respectively. The two work together to improve the performance of the algorithm. Both components have access to the replayBuffer and sharedStorage objects. The purpose of the sharedStorage object is to store different versions of the neural network and to make the latest version retrievable. The main purpose of the replayBuffer object, on the other hand, is to store data from previous games. It is therefore crucial to the improvement of self-play and training that both of these objects are accessible.
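
The two shared objects can be sketched as follows. This is a simplified illustration rather than the paper's pseudocode: the method names save_network, latest_network, save_game and sample_game, and the window_size parameter, are assumptions made for this example.

```python
import random
from typing import Any, Dict, List


class SharedStorage:
    """Keeps every saved network version, keyed by training step."""

    def __init__(self) -> None:
        self._networks: Dict[int, Any] = {}

    def save_network(self, step: int, network: Any) -> None:
        self._networks[step] = network

    def latest_network(self) -> Any:
        # Return the network saved at the highest training step, if any.
        if not self._networks:
            return None
        return self._networks[max(self._networks)]


class ReplayBuffer:
    """Keeps the most recent games, discarding the oldest when full."""

    def __init__(self, window_size: int) -> None:
        self.window_size = window_size
        self.buffer: List[Any] = []

    def save_game(self, game: Any) -> None:
        if len(self.buffer) >= self.window_size:
            self.buffer.pop(0)  # drop the oldest game
        self.buffer.append(game)

    def sample_game(self) -> Any:
        # Uniform sampling, which is a simplification chosen for this sketch.
        return random.choice(self.buffer)
```

Self-play workers would call latest_network and save_game, while the trainer would call sample_game and save_network, which is how the two components communicate without calling each other directly.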

Self-play and the play_game function

The self-play function is responsible for playing a specified game, for example chess, against itself and saving the generated game data to the replayBuffer object. It is important to point out that MuZero runs several instances of self-play at once. Each instance runs independently using the latest neural network and saves its game data to the replayBuffer.

The play_game function starts by initialising the game to its initial state, for example a blank 3x3 grid for tic-tac-toe. It then searches for the best next move using Monte Carlo Tree Search and applies that move. This process is repeated until a terminal condition is met, for instance checkmate in chess, or until the maximum number of moves allowed has been reached.
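
The loop described above can be sketched as follows. The CountdownGame class is a hypothetical toy game invented for this example, and choose_action is a placeholder for the Monte Carlo Tree Search step; only the shape of the loop mirrors the description.

```python
from typing import Callable, List


class CountdownGame:
    """Hypothetical toy game: it ends once the counter reaches zero."""

    def __init__(self, start: int = 5) -> None:
        self.counter = start
        self.history: List[int] = []  # actions played so far

    def terminal(self) -> bool:
        return self.counter <= 0

    def apply(self, action: int) -> None:
        self.counter -= action
        self.history.append(action)


def play_game(
    game: CountdownGame,
    choose_action: Callable[[CountdownGame], int],
    max_moves: int,
) -> CountdownGame:
    """Play until the game is over or the move limit is reached."""
    while not game.terminal() and len(game.history) < max_moves:
        action = choose_action(game)  # stand-in for the tree search
        game.apply(action)
    return game
```

For example, play_game(CountdownGame(5), lambda g: 1, max_moves=3) stops after three moves because of the move limit, even though the game itself has not ended.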

Monte Carlo Tree Search in MuZero

The Monte Carlo Tree Search begins with the root node. Each node stores various information, such as whose turn it is to play, the number of times the node has been visited, the children of the node and the predicted reward for a candidate move. The play_game function obtains the current observation of the game, and the process of expanding the root then begins. Exploration noise is added to the expanded root to ensure that more options are considered, not just the ones currently preferred. Since MuZero knows neither the rules of the game environment nor the range of rewards it might receive while learning and searching for good moves, it makes use of an object called MinMaxStats. The MinMaxStats object stores the current maximum and minimum values encountered, which allows MuZero to rescale its value estimates (Schrittwieser et al, 2020). The Monte Carlo Tree Search process decides on an action by running N simulations. Each simulation starts at the root node and traverses down the tree in accordance with the UCB formula until a node with no children is reached.
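
A minimal sketch of such an object is shown below. It follows the idea described above, tracking the extremes seen so far and rescaling values into the range 0 to 1; the method names update and normalize are assumptions for this example.

```python
class MinMaxStats:
    """Tracks the smallest and largest values seen during the search."""

    def __init__(self) -> None:
        self.maximum = float("-inf")
        self.minimum = float("inf")

    def update(self, value: float) -> None:
        self.maximum = max(self.maximum, value)
        self.minimum = min(self.minimum, value)

    def normalize(self, value: float) -> float:
        # Rescale only once two distinct values have been observed;
        # otherwise there is no meaningful range to divide by.
        if self.maximum > self.minimum:
            return (value - self.minimum) / (self.maximum - self.minimum)
        return value
```

Normalising in this way means the search does not need to know in advance whether rewards lie between -1 and 1, as in chess, or run into the thousands, as in some Atari games.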

The UCB score formula used by MuZero normalises the estimated value of an action and combines it with an exploration bonus that depends on the prior probability of the action and on the number of times the action has previously been selected (Schrittwieser et al, 2020). Once a simulation reaches a leaf of the tree, the predicted value must be backpropagated up the tree. After all simulations have run, the action with the most visits is treated as the best move (during training, the pseudocode samples moves in proportion to their visit counts so that play remains varied). The reasoning is that the search concentrates its simulations on promising moves, so an action that is visited repeatedly is likely to be the strongest one found for the given position.
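
For reference, the selection rule as it appears in the MuZero paper can be written as below. This is reconstructed from the paper rather than from this article, so it is worth checking against the original. Here Q(s, a) is the (normalised) value estimate, P(s, a) is the prior from the policy, N(s, a) is the visit count, and c_1 and c_2 are constants that the paper sets to 1.25 and 19652.

```latex
a^{k} = \arg\max_{a} \left[ Q(s, a) + P(s, a) \cdot
        \frac{\sqrt{\sum_{b} N(s, b)}}{1 + N(s, a)}
        \left( c_{1} + \log \frac{\sum_{b} N(s, b) + c_{2} + 1}{c_{2}} \right) \right]
```

The first term favours actions that have looked good so far, while the second term favours actions with a high prior that have not yet been visited often.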

Training and Loss function

The training of MuZero is the other key aspect of what makes MuZero unique. The function train_network repeatedly trains the neural network by making use of the replayBuffer. This function also sets up a gradient descent optimiser to update the weights, and it periodically saves the updated network so that self-play works with a recent version. The train_network function loops over the total number of training steps, which is set to one million by default. At every step it samples a batch and uses that data to update the neural network. More specifically, the training batch is created by the sample_batch function in the replayBuffer class, and a batch is a list of tuples. Each tuple contains the current state of the game, a list of actions taken from that position and the targets used to train the neural networks.

The targets used to train the neural networks are calculated using Temporal Difference (TD) learning. The key concept behind TD learning is to update the prediction for a state using estimates made later in the same game. In essence, a later estimate is used to update an earlier one; this is known as bootstrapping. The loss function of MuZero, given in the paper, determines how the weights of the neural network are updated.
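
The bootstrapping idea can be illustrated with an n-step value target: sum the discounted rewards over the next few moves, then add the discounted value estimate from the position reached at the end. The sketch below assumes that rewards[i] is the reward received after the move played at position i and that root_values[i] is the search value at position i; the function name and its parameters are chosen for this example.

```python
from typing import Sequence


def n_step_value_target(
    rewards: Sequence[float],
    root_values: Sequence[float],
    index: int,
    td_steps: int,
    discount: float,
) -> float:
    """Discounted rewards for td_steps moves plus a bootstrapped value."""
    bootstrap_index = index + td_steps
    if bootstrap_index < len(root_values):
        # Bootstrapping: reuse a later value estimate as part of the target.
        value = root_values[bootstrap_index] * discount ** td_steps
    else:
        # The game ended before the bootstrap position was reached.
        value = 0.0
    for i, reward in enumerate(rewards[index:bootstrap_index]):
        value += reward * discount ** i
    return value
```

With rewards [1.0, 0.0, 2.0, 0.0], root values [0.5, 0.4, 0.3, 0.2], two TD steps and a discount of 0.5, the target for position 0 is 1.0 + 0.5 * 0.0 + 0.25 * 0.3 = 1.075.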

In MuZero’s loss function, k denotes the number of steps taken after a given state, and l^r, l^v and l^p are the loss functions for reward, value and policy respectively. There are three main purposes of the loss function:

To minimise the discrepancy between predicted reward and the actual reward received.
To optimise the predicted value to the target value derived from TD learning.
To reduce the difference between the predicted policy and target policy.
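
The three purposes listed above correspond to the three terms inside the sum of the loss as it is given in the MuZero paper. The formula below is reconstructed from the paper, so it should be checked against the original. Here u, z and π are the target reward, value and policy; r, v and p are the corresponding predictions after k unrolled steps; K is the number of unrolled steps; and the final term is an L2 regularisation penalty on the weights θ with coefficient c.

```latex
l_{t}(\theta) = \sum_{k=0}^{K} \left[
      l^{r}\!\left(u_{t+k},\, r_{t}^{k}\right)
    + l^{v}\!\left(z_{t+k},\, v_{t}^{k}\right)
    + l^{p}\!\left(\pi_{t+k},\, p_{t}^{k}\right)
  \right] + c \,\lVert \theta \rVert^{2}
```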

To update its weights, MuZero makes use of initial_inference and recurrent_inference. The role of initial_inference is to turn the observation of a given state into a hidden state and to produce the first predictions of policy and value for it. MuZero then makes use of recurrent_inference, which takes a hidden state and an action, to predict the next hidden state along with its policy, value and reward. These predictions are compared with the target values of policy, value and reward to calculate the loss.
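
How the predictions and targets combine into a single number can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes squared error for value and reward and cross-entropy for the policy, and it leaves out regularisation and any loss scaling. The Step tuple layout and the function names are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

# One unrolled step: (value, reward, policy), for predictions and targets.
Step = Tuple[float, float, List[float]]


def squared_error(predicted: float, target: float) -> float:
    return (predicted - target) ** 2


def cross_entropy(predicted: Sequence[float], target: Sequence[float]) -> float:
    # Clamp probabilities to avoid taking the logarithm of zero.
    return -sum(
        t * math.log(max(p, 1e-12))
        for p, t in zip(predicted, target)
        if t > 0
    )


def unrolled_loss(predictions: Sequence[Step], targets: Sequence[Step]) -> float:
    """Sum the reward, value and policy losses over all unrolled steps."""
    total = 0.0
    for (value, reward, policy), (t_value, t_reward, t_policy) in zip(
        predictions, targets
    ):
        total += squared_error(reward, t_reward)
        total += squared_error(value, t_value)
        total += cross_entropy(policy, t_policy)
    return total
```

The first element of predictions would come from initial_inference and the remaining elements from successive calls to recurrent_inference, one per action in the sampled sequence.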

Summary of the algorithm

The MuZero algorithm has a lot of moving parts, and understanding them requires knowledge of several different areas within Artificial Intelligence. The performance of the algorithm matches or exceeds that of its predecessor AlphaZero in spite of receiving less information about the true environment, such as the rules of the game or what the legal moves are. It is able to achieve these results by making use of its two key components, self-play and training. Self-play makes use of three neural networks to create a hidden representation of the game and uses this new environment to learn and to find strong moves with Monte Carlo Tree Search. Training uses the replayBuffer, which stores the data of previous games played, to train the neural networks of MuZero, together with TD learning and its unique loss function, so that the weights are updated continually.

While the MuZero algorithm was able to match or outperform AlphaZero in board games, the behaviour of the algorithm is not fully understood. MuZero’s ability to build a hidden state representation and use it to find better moves is undoubtedly an advantage for the performance of the algorithm; however, it makes the algorithm tough to predict and to analyse in order to optimise its performance further. The behaviour of the algorithm can be thought of as a black box in which the hidden states and the computations of the deep neural network are unknown. This makes analysis of the algorithm’s metrics and performance, such as the optimal number of iterations needed to solve a problem or the optimal number of neural networks, extremely difficult. MuZero is also a very computationally heavy algorithm, so conducting experiments on it is expensive. Together, these factors make research and analysis into the algorithm quite difficult.

This article is for informational purposes only and the views expressed are mine alone.

Geek Culture