Chapter 1 - 7: Monte Carlo Methods

This content is licensed under CC 4.0 BY-SA.

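This article takes a close look at Monte Carlo methods in reinforcement learning, covering the basic idea, the prediction problem, and the control problem. Starting from a gridworld example, it explains how Monte Carlo methods are used for policy evaluation and policy improvement, and ultimately for learning the optimal policy.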

1.7.1 Review

In order to rigorously define a reinforcement learning task, we generally use a Markov Decision Process (MDP) to model the environment. The MDP specifies the rules that the environment uses to respond to the agent’s actions, including how much reward to give to the agent in response to its behavior. The agent’s goal is to learn how to play by the rules of the environment, in order to maximize reward.
In particular, the optimal policy $\pi_*$ specifies, for each environment state, how the agent should select an action towards its goal of maximizing reward. The agent could structure its search for an optimal policy by first estimating the optimal action-value function $q_*$; then, once $q_*$ is known, $\pi_*$ is quickly obtained.

1.7.2 Gridworld Example

[Figure: gridworld example]

1.7.3 Monte Carlo Methods

When an agent knows nothing about its environment, the most sensible thing it can do at the beginning is to behave randomly and see what happens. When the agent selects actions in this way, with each action having an equal chance of being selected, we say that it is following the equiprobable random policy.

The information collected this way is valuable! The agent can consolidate this experience in a way that allows it to improve upon its currently very random strategy.
Remember that the agent is searching for the optimal policy $\pi_*$, which tells us, for each state, which action or actions are most useful towards the goal of maximizing return, that is, getting as much cumulative reward as possible over all time steps.
To truly understand the environment, the agent needs more episodes:

Reason 1: The agent hasn’t attempted each action from each state.
Reason 2: The environment’s dynamics are stochastic!

Monte Carlo Method
When the agent has a policy in mind, it follows the policy to collect a lot of episodes. Then, for each state, to figure out which action is best, the agent can look for the action that tended to result in the most cumulative reward.

1.7.4 MC Prediction - Part 1

The agent has collected two episodes, and the question now is how exactly it should consolidate this information towards its goal of obtaining the optimal policy. It makes sense to look at each state separately.
Keep track of a table with one row for each non-terminal state and one column for each action.

1.7.5 MC Prediction - Part 2

If the agent follows a policy for many episodes, we can use the results to directly estimate the action-value function corresponding to the same policy.
The Q-Table is used to estimate the action-value function.

1.7.6 MC Prediction - Part 3

Estimating the action-value function with a Q-table is an important intermediate step. We also refer to this as the prediction problem.
Prediction Problem: Given a policy, how might the agent estimate the value function for that policy?
MC prediction: the process by which the agent, following a given policy, estimates the value function of that policy.

In the algorithm for MC prediction, we begin by collecting many episodes with the policy. Then, we note that each entry in the Q-table corresponds to a particular state and action. To populate an entry, we use the return that followed when the agent was in that state, and chose the action. In the event that the agent has selected the same action many times from the same state, we need only average the returns.
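As a small illustration of the return that "followed" a visit, the sketch below computes the discounted return $G_t = R_{t+1} + \gamma G_{t+1}$ for every time step of one episode. The function name `discounted_returns` and the reward list are hypothetical, chosen only for this example.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return [G_0, ..., G_{T-1}], where G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards so that each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# Rewards R_1, R_2, R_3 of a hypothetical three-step episode.
episode_rewards = [0, 0, 1]
print(discounted_returns(episode_rewards, gamma=1.0))  # [1.0, 1.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [0.25, 0.5, 1.0]
```

Averaging these per-visit returns over many episodes is what fills in each Q-table entry.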
Before we dig into the pseudocode, we note that there are two different versions of MC prediction, depending on how you decide to treat the special case where - in a single episode - the same action is selected from the same state many times.

As discussed in the video, we define every occurrence of a state-action pair in an episode as a visit to that state-action pair. In the event that a state-action pair is visited more than once in an episode, we have two options.

Option 1: Every-visit MC Prediction
Average the returns following all visits to each state-action pair, in all episodes.
Option 2: First-visit MC Prediction
For each episode, we only consider the first visit to the state-action pair. The pseudocode for this option can be found below.

There are three relevant tables:

$Q$ - Q-table, with a row for each state and a column for each action. The entry corresponding to state $s$ and action $a$ is denoted $Q (s, a)$ .
$N$ - table that keeps track of the number of first visits we have made to each state-action pair.
$returns\_sum$ - table that keeps track of the sum of the returns obtained after first visits to each state-action pair.

In the algorithm, the number of episodes the agent collects is equal to $num\_episodes$. After each episode, $N$ and $returns\_sum$ are updated to store the information contained in the episode. Then, after all of the episodes have been collected and the values in $N$ and $returns\_sum$ have been finalized, we quickly obtain the final estimate for $Q$.
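The description above can be sketched in Python roughly as follows. This is a minimal illustration rather than the course's reference pseudocode: it assumes the episodes have already been collected as lists of (state, action, reward) tuples, and the function name `first_visit_mc_prediction` and the two toy episodes are hypothetical.

```python
from collections import defaultdict


def first_visit_mc_prediction(episodes, n_actions, gamma=1.0):
    """Estimate Q from episodes given as lists of (state, action, reward)."""
    returns_sum = defaultdict(lambda: [0.0] * n_actions)
    N = defaultdict(lambda: [0] * n_actions)
    for episode in episodes:
        # Return following every time step, computed backwards.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            g = episode[t][2] + gamma * g
            returns[t] = g
        seen = set()
        for t, (state, action, _) in enumerate(episode):
            if (state, action) in seen:
                continue  # not the first visit in this episode
            seen.add((state, action))
            N[state][action] += 1
            returns_sum[state][action] += returns[t]
    # Final estimate: average return over first visits (0.0 if never visited).
    Q = {}
    for state in N:
        Q[state] = [
            returns_sum[state][a] / N[state][a] if N[state][a] > 0 else 0.0
            for a in range(n_actions)
        ]
    return Q


# Two hypothetical episodes; ("s1", 0) is visited twice in the first one.
episodes = [
    [("s1", 0, -1), ("s1", 0, -1), ("s2", 1, 10)],
    [("s1", 0, -1), ("s2", 1, 4)],
]
print(first_visit_mc_prediction(episodes, n_actions=2))
# {'s1': [5.5, 0.0], 's2': [0.0, 7.0]}
```

In the first episode only the return from the first visit to ("s1", 0), which is 8, is counted; the second visit (return 9) is skipped. Dropping the `seen` check would turn this into the every-visit variant.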

Both the first-visit and every-visit methods are guaranteed to converge to the true action-value function as the number of visits to each state-action pair approaches infinity. In the case of first-visit MC, convergence follows from the Law of Large Numbers.

Every-visit MC is biased, whereas first-visit MC is unbiased (see Theorems 6 and 7).
Initially, every-visit MC has lower mean squared error (MSE), but as more episodes are collected, first-visit MC attains better MSE.

1.7.7 OpenAI Gym: BlackjackEnv

    """Simple blackjack environment
    Blackjack is a card game where the goal is to obtain cards that sum to as
    near as possible to 21 without going over.  They're playing against a fixed
    dealer.
    Face cards (Jack, Queen, King) have point value 10.
    Aces can either count as 11 or 1, and it's called 'usable' at 11.
    This game is placed with an infinite deck (or with replacement).
    The game starts with dealer having one face up and one face down card, while
    player having two face up cards. (Virtually for all Blackjack games today).
    The player can request additional cards (hit=1) until they decide to stop
    (stick=0) or exceed 21 (bust).
    After the player sticks, the dealer reveals their facedown card, and draws
    until their sum is 17 or greater.  If the dealer goes bust the player wins.
    If neither player nor dealer busts, the outcome (win, lose, draw) is
    decided by whose sum is closer to 21.  The reward for winning is +1,
    drawing is 0, and losing is -1.
    The observation of a 3-tuple of: the players current sum,
    the dealer's one showing card (1-10 where 1 is ace),
    and whether or not the player holds a usable ace (0 or 1).
    This environment corresponds to the version of the blackjack problem
    described in Example 5.1 in Reinforcement Learning: An Introduction
    by Sutton and Barto.
    http://incompleteideas.net/book/the-book-2nd.html
    """

1.7.8 Workspace - Introduction

1. Each state is a 3-tuple of:
- the player’s current sum $\in \{0, 1, \ldots, 31\}$,
- the dealer’s face-up card $\in \{1, \ldots, 10\}$, and
- whether or not the player has a usable ace (no $= 0$, yes $= 1$).

2. The agent has two potential actions:
- STICK = 0
- HIT = 1

We will begin by investigating a policy where the player almost always sticks if the sum of her cards exceeds 18. In particular, she selects action STICK with 80% probability if the sum is greater than 18; and, if the sum is 18 or below, she selects action HIT with 80% probability.

1.7.10 Workspace

Part 0: Explore BlackjackEnv

import sys
import gym
import numpy as np
from collections import defaultdict

from plot_utils import plot_blackjack_values, plot_policy
# Use the code cell below to create an instance of the Blackjack environment.
env = gym.make('Blackjack-v0')
print(env.observation_space)
print(env.action_space)

# Execute the code cell below to play Blackjack with a random policy.
for i_episode in range(30):
    state = env.reset()
    while True:
        print(state)
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        if done:
            print('End game! Reward: ', reward)
            print('You won :)\n') if reward > 0 else print('You lost :(\n')
            break

Part 1: MC Prediction
Implementation of MC prediction (for estimating the action-value function).
We will begin by investigating a policy where the player almost always sticks if the sum of her cards exceeds 18. In particular, she selects action STICK with 80% probability if the sum is greater than 18; and, if the sum is 18 or below, she selects action HIT with 80% probability. The function generate_episode_from_limit_stochastic samples an episode using this policy.

The function accepts as input:

bj_env: This is an instance of OpenAI Gym’s Blackjack environment.

It returns as output:

episode: This is a list of (state, action, reward) tuples (of tuples) and corresponds to $(S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_{T})$, where $T$ is the final time step. In particular, episode[i] returns $(S_i, A_i, R_{i+1})$, and episode[i][0], episode[i][1], and episode[i][2] return $S_i$, $A_i$, and $R_{i+1}$, respectively.

def generate_episode_from_limit_stochastic(bj_env):
    episode = []
    state = bj_env.reset()
    while True:
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        action = np.random.choice(np.arange(2), p=probs)
        next_state, reward, done, info = bj_env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode

Now, you are ready to write your own implementation of MC prediction. Feel free to implement either first-visit or every-visit MC prediction; in the case of the Blackjack environment, the techniques are equivalent.

Your algorithm has four arguments:

env: This is an instance of an OpenAI Gym environment.
num_episodes: This is the number of episodes that are generated through agent-environment interaction.
generate_episode: This is a function that returns an episode of interaction.
gamma: This is the discount rate. It must be a value between 0 and 1, inclusive (default value: 1).

The algorithm returns as output:

Q: This is a dictionary (of one-dimensional arrays) where Q[s][a] is the estimated action value corresponding to state s and action a.

def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):
    # initialize empty dictionaries of arrays
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n)) # defaultdict whose entries start as all-zero arrays
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()
        
        # generate an episode by following the policy passed in as generate_episode
        episode = generate_episode(env)
        states, actions, rewards = zip(*episode)
        # discounts[k] = gamma**k; the extra entry keeps the slice below valid when j = 0
        discounts = np.array([gamma ** i for i in range(len(rewards)+1)])
        # every-visit update: add the discounted return that followed each (state, action)
        for j in range(len(episode)):
            N[states[j]][actions[j]] += 1
            returns_sum[states[j]][actions[j]] += sum(rewards[j:] * discounts[:-(1+j)])
            Q[states[j]][actions[j]] = returns_sum[states[j]][actions[j]] / N[states[j]][actions[j]]

    return Q
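
Once Q has been estimated, the corresponding state-value estimate follows from $v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s, a)$. The sketch below applies this to the 80/20 threshold policy used above; the helper name `state_values_from_q` and the two Q entries are made up for illustration and are not output of the algorithm.

```python
import numpy as np


def state_values_from_q(Q):
    """v(s) = sum_a pi(a|s) * Q[s][a] for the 80/20 threshold policy."""
    V = {}
    for state, action_values in Q.items():
        # [P(STICK), P(HIT)], matching generate_episode_from_limit_stochastic.
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        V[state] = float(np.dot(probs, action_values))
    return V


# Hypothetical Q entries keyed by (player sum, dealer card, usable ace).
Q_example = {
    (20, 10, False): np.array([0.5, -0.9]),
    (13, 2, False): np.array([-0.3, -0.2]),
}
print(state_values_from_q(Q_example))
```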

1.7.11 Greedy Policies

When we take a Q-table and use the action that maximizes each row to come up with the policy, we say that we are constructing the policy that’s greedy with respect to the Q-table.

Greedy policy: given a Q-table, select in each state the action with the largest estimated value in that state.
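
A greedy policy can be read off a Q-table with a row-wise argmax. The sketch below assumes the Q-table is a dictionary of NumPy arrays, as in the workspace code; the function name `greedy_policy` and the example values are hypothetical.

```python
import numpy as np


def greedy_policy(Q):
    """Map each state to the action with the largest estimated value."""
    return {state: int(np.argmax(values)) for state, values in Q.items()}


Q_example = {"s1": np.array([1.0, 2.5]), "s2": np.array([0.3, -0.4])}
print(greedy_policy(Q_example))  # {'s1': 1, 's2': 0}
```

Note that `np.argmax` returns the first maximizing index, so ties are broken in favor of the lower-numbered action.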

1.7.12 Epsilon-Greedy Policies

The larger $\epsilon$ is, the more likely you are to pick one of the non-greedy actions.
In order to construct a policy $\pi$ that is $\epsilon$-greedy with respect to the current action-value function estimate $Q$, we will set

$\pi(a|s) \longleftarrow \begin{cases} 1-\epsilon+\epsilon/|\mathcal{A}(s)| & \textrm{if } a \textrm{ maximizes } Q(s,a) \\ \epsilon/|\mathcal{A}(s)| & \textrm{else} \end{cases} \quad \textrm{for each } s\in\mathcal{S} \textrm{ and } a\in\mathcal{A}(s)$
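
The assignment above translates almost directly into code. The following is a minimal sketch for a single state, assuming the row of the Q-table for that state is a one-dimensional array; the names `epsilon_greedy_probs` and `sample_action` are hypothetical.

```python
import numpy as np


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities pi(.|s) for one state, given its row of the Q-table."""
    n_actions = len(q_values)
    # Every action receives epsilon / |A(s)| ...
    probs = np.full(n_actions, epsilon / n_actions)
    # ... and the greedy action receives the remaining 1 - epsilon on top.
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs


def sample_action(q_values, epsilon, rng):
    """Draw one action from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return int(rng.choice(len(q_values), p=probs))


q_row = np.array([1.0, 2.5])
print(epsilon_greedy_probs(q_row, epsilon=0.2))  # greedy action 1 gets 0.9
print(epsilon_greedy_probs(q_row, epsilon=1.0))  # equiprobable random policy
print(epsilon_greedy_probs(q_row, epsilon=0.0))  # purely greedy policy
print(sample_action(q_row, epsilon=0.2, rng=np.random.default_rng(0)))
```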

1.7.13 MC Control

Prediction Problem: Given a policy, how might the agent estimate the value function for that policy?
Control Problem: Estimate the optimal policy.

The agent can take a policy $\pi$, use it to interact with the environment for many episodes, and then use the results to estimate the action-value function $q_\pi$ with a Q-table.
Then, once the Q-table closely approximates the action-value function $q_\pi$, the agent can construct the policy $\pi'$ that is $\epsilon$-greedy with respect to the Q-table, which will yield a policy that is better than the original policy $\pi$.

Furthermore, if the agent alternates between these two steps, with:

Step 1 (Policy evaluation): using the policy $\pi$ to construct the Q-table, which estimates the action-value function of the policy, and
Step 2 (Policy improvement): improving the policy by changing it to be $\epsilon$-greedy with respect to the Q-table ($\pi' \leftarrow \epsilon\text{-greedy}(Q)$, $\pi \leftarrow \pi'$),
we will eventually obtain the optimal policy $\pi_*$.

1.7.14 Exploration vs. Exploitation

Exploration-Exploitation Dilemma
Recall that the environment’s dynamics are initially unknown to the agent. Towards maximizing return, the agent must learn about the environment through interaction.

At every time step, when the agent selects an action, it bases its decision on past experience with the environment. And, towards minimizing the number of episodes needed to solve environments in OpenAI Gym, our first instinct could be to devise a strategy where the agent always selects the action that it believes (based on its past experience) will maximize return. With this in mind, the agent could follow the policy that is greedy with respect to the action-value function estimate. We examined this approach in a previous video and saw that it can easily lead to convergence to a sub-optimal policy.

To see why this is the case, note that in early episodes, the agent’s knowledge is quite limited (and potentially flawed). So, it is highly likely that actions estimated to be non-greedy by the agent are in fact better than the estimated greedy action.

With this in mind, a successful RL agent cannot act greedily at every time step (that is, it cannot always exploit its knowledge); instead, in order to discover the optimal policy, it has to continue to refine the estimated return for all state-action pairs (in other words, it has to continue to explore the range of possibilities by visiting every state-action pair). That said, the agent should always act somewhat greedily, towards its goal of maximizing return as quickly as possible. This motivated the idea of an $\epsilon$-greedy policy.

We refer to the need to balance these two competing requirements as the Exploration-Exploitation Dilemma. One potential solution to this dilemma is implemented by gradually modifying the value of $\epsilon$ when constructing $\epsilon$-greedy policies.

Setting the Value of $\epsilon$, in Theory
It makes sense for the agent to begin its interaction with the environment by favoring exploration over exploitation. After all, when the agent knows relatively little about the environment’s dynamics, it should distrust its limited knowledge and explore, or try out various strategies for maximizing return. With this in mind, the best starting policy is the equiprobable random policy, as it is equally likely to explore all possible actions from each state. You discovered in the previous quiz that setting $\epsilon = 1$ yields an $\epsilon$-greedy policy that is equivalent to the equiprobable random policy.

At later time steps, it makes sense to favor exploitation over exploration, where the policy gradually becomes more greedy with respect to the action-value function estimate. After all, the more the agent interacts with the environment, the more it can trust its estimated action-value function. You discovered in the previous quiz that setting $\epsilon = 0$ yields the greedy policy (or, the policy that most favors exploitation over exploration).

Thankfully, this strategy (of initially favoring exploration over exploitation, and then gradually preferring exploitation over exploration) can be demonstrated to be optimal.
Greedy in the Limit with Infinite Exploration (GLIE)
In order to guarantee that MC control converges to the optimal policy $\pi_*$, we need to ensure that two conditions are met. We refer to these conditions as Greedy in the Limit with Infinite Exploration (GLIE). In particular, if:

every state-action pair $s$, $a$ (for all $s\in\mathcal{S}$ and $a\in\mathcal{A}(s)$) is visited infinitely many times, and
the policy converges to a policy that is greedy with respect to the action-value function estimate $Q$ (that is, when the policy is derived from the Q-table, it selects in each state the action with the largest estimated value),
then MC control is guaranteed to converge to the optimal policy (in the limit as the algorithm is run for infinitely many episodes). These conditions ensure that:
the agent continues to explore for all time steps, and
the agent gradually exploits more (and explores less).
One way to satisfy these conditions is to modify the value of $\epsilon$ when specifying an $\epsilon$-greedy policy. In particular, let $\epsilon_i$ correspond to the $i$-th time step. Then, both of these conditions are met if:
$\epsilon_i > 0$ for all time steps $i$,
and $\epsilon_i$ decays to zero in the limit as the time step $i$ approaches infinity (that is, $\lim_{i\to\infty} \epsilon_i = 0$).
For example, to ensure convergence to the optimal policy, we could set $\epsilon_i = \frac{1}{i}$.

One way of setting $\epsilon$ that is used in practice:
The behavior policy during training was epsilon-greedy with epsilon annealed linearly from 1.0 to 0.1 over the first million frames, and fixed at 0.1 thereafter.
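
The two schedules mentioned above can be written as small functions. This is only a sketch: the names `epsilon_inverse` and `epsilon_linear` and the parameter names are hypothetical, and the default values of the linear schedule simply mirror the numbers in the quoted sentence.

```python
def epsilon_inverse(i):
    """epsilon_i = 1 / i for i = 1, 2, ...; positive for all i and decaying to zero."""
    return 1.0 / i


def epsilon_linear(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linear decay from start to end over anneal_steps, constant at end afterwards."""
    fraction = min(step / anneal_steps, 1.0)
    return end + (1.0 - fraction) * (start - end)


for i in (1, 2, 10, 100):
    print(i, epsilon_inverse(i))

for step in (0, 500_000, 1_000_000, 5_000_000):
    print(step, epsilon_linear(step))
```

The linear schedule stays at 0.1 rather than decaying to zero, so it keeps exploring forever but, strictly speaking, does not satisfy the second GLIE condition.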

1.7.15 Incremental Mean

It would be more efficient to update the Q-table after every episode. Then, the updated Q-table could be used to improve the policy. That new policy could then be used to generate the next episode, and so on.

Performing the updates after each episode will allow us to use the Q-table to update the policy after each episode, which will make our algorithm much more efficient.
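
The running average that makes this possible is the incremental mean: after the $N$-th sampled return $G$, the estimate is updated as $Q \leftarrow Q + \frac{1}{N}(G - Q)$, which gives the same result as averaging all returns at the end. A minimal sketch, with a hypothetical function name and made-up returns:

```python
def incremental_mean(returns):
    """Average a stream of returns without storing their sum."""
    q, n = 0.0, 0
    for g in returns:
        n += 1
        q += (g - q) / n  # Q <- Q + (G - Q) / N
    return q


sampled_returns = [2.0, 8.0, 5.0]
print(incremental_mean(sampled_returns))             # 5.0
print(sum(sampled_returns) / len(sampled_returns))   # 5.0, the batch mean
```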

The algorithm proceeds by looping over the following steps:

Step 1: The policy $\pi$ is improved to be $\epsilon$-greedy with respect to $Q$, and the agent uses $\pi$ to collect an episode.
Step 2: $N$ is updated to count the total number of first visits to each state-action pair.
Step 3: The estimates in $Q$ are updated to take into account the most recent information.
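
The three steps above can be sketched as a single loop. The code below is an illustration under several assumptions: `ChainEnv` is a tiny hypothetical environment written only for this example (it mimics the `reset()`/`step()` shape used in this chapter but is not part of OpenAI Gym), $\epsilon_i = 1/i$ is decayed per episode, and the function name `mc_control_glie` is made up.

```python
import numpy as np
from collections import defaultdict


class ChainEnv:
    """Hypothetical two-step task: action 1 in state 0 moves to state 1,
    action 1 in state 1 ends the episode with reward 1, and action 0
    always ends the episode with reward 0."""
    n_actions = 2

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 0:
            return self.state, 0.0, True, {}
        if self.state == 0:
            self.state = 1
            return self.state, 0.0, False, {}
        return self.state, 1.0, True, {}


def mc_control_glie(env, num_episodes, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.n_actions))
    N = defaultdict(lambda: np.zeros(env.n_actions))
    for i in range(1, num_episodes + 1):
        epsilon = 1.0 / i
        # Step 1: collect one episode with the epsilon-greedy policy w.r.t. Q.
        episode, state, done = [], env.reset(), False
        while not done:
            probs = np.full(env.n_actions, epsilon / env.n_actions)
            probs[int(np.argmax(Q[state]))] += 1.0 - epsilon
            action = int(rng.choice(env.n_actions, p=probs))
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Walk backwards through the episode; earlier visits overwrite later
        # ones, so each pair ends up with the return of its first visit.
        g, first_visit_return = 0.0, {}
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[(state, action)] = g
        for (state, action), g_first in first_visit_return.items():
            N[state][action] += 1                                        # Step 2
            Q[state][action] += (g_first - Q[state][action]) / N[state][action]  # Step 3
    return Q


Q = mc_control_glie(ChainEnv(), num_episodes=500)
for s in sorted(Q):
    print(s, Q[s])
```

With $\epsilon_i = 1/i$ exploration fades quickly, so how soon the rewarding action is discovered depends on the random seed; slower decay schedules trade more exploration for slower convergence to a greedy policy.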

1.7.16 Constant-alpha

You should always set the value for $\alpha$ to a number greater than zero and less than (or equal to) one.

If $\alpha = 0$, then the action-value function estimate is never updated by the agent.
If $\alpha = 1$, then the final value estimate for each state-action pair is always equal to the last return that was experienced by the agent (after visiting the pair).

Smaller values for $\alpha$ encourage the agent to consider a longer history of returns when calculating the action-value function estimate. Increasing the value of $\alpha$ ensures that the agent focuses more on the most recently sampled returns.
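
The constant-$\alpha$ update replaces the $1/N$ step size of the incremental mean with a fixed one: $Q \leftarrow Q + \alpha (G - Q)$. The sketch below, using hypothetical function names and made-up returns, shows the two boundary cases described above along with an intermediate value.

```python
def constant_alpha_update(q, g, alpha):
    """One update Q <- Q + alpha * (G - Q)."""
    return q + alpha * (g - q)


def run_updates(returns, alpha, q0=0.0):
    """Apply the update once for each sampled return, in order."""
    q = q0
    for g in returns:
        q = constant_alpha_update(q, g, alpha)
    return q


sampled_returns = [2.0, 8.0, 5.0]
print(run_updates(sampled_returns, alpha=0.0))  # 0.0, the estimate never changes
print(run_updates(sampled_returns, alpha=1.0))  # 5.0, equal to the last return
print(run_updates(sampled_returns, alpha=0.5))  # 4.75, recent returns weigh more
```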