Summary of Chapter 10, "On-policy Control with Approximation", of "Reinforcement Learning: An Introduction"

Original post · latest recommended article published on 2026-05-10 15:58:37 · 1.1k reads


This content is licensed under CC 4.0 BY-SA.

Tags

#reinforcement learning #sutton RL #reinforcement learni #an introduction


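This post discusses the semi-gradient Sarsa algorithm and its application to continuing tasks, introduces a way of evaluating policies without a discount factor, and explains the value functions and the differential return used in the average-reward setting.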

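A new student has joined our group and I need to help him get started with RL, so we chose to begin with Silver's course.

For myself, I am adding the requirement of carefully reading "Reinforcement Learning: An Introduction".

I did not read it very carefully before, so this time I hope to be more thorough and to write a brief summary of the corresponding key points.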

The present chapter features the semi-gradient Sarsa algorithm (i.e., on-policy control with approximation), the natural extension of semi-gradient TD(0) (last chapter) to action values and to on-policy control.

In the episodic case, the extension is straightforward.
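To make the episodic extension concrete, here is a minimal sketch of a single one-step semi-gradient Sarsa update. It assumes a linear approximator, so that the gradient of q̂ with respect to the weights is simply the feature vector; the function names and the two-dimensional one-hot features are hypothetical, chosen only for illustration. The update is w ← w + α[R + γ q̂(S', A', w) − q̂(S, A, w)]∇q̂(S, A, w), with the bootstrap term dropped when S' is terminal.

```python
def q_hat(w, x):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def semi_gradient_sarsa_update(w, x, r, x_next, alpha, gamma, terminal):
    """One episodic semi-gradient Sarsa step for a linear q_hat.

    x and x_next are the feature vectors of (S, A) and (S', A').
    For a linear q_hat the gradient with respect to w is just x.
    """
    # The target is treated as a constant (hence "semi-gradient").
    target = r if terminal else r + gamma * q_hat(w, x_next)
    delta = target - q_hat(w, x)  # TD error
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]


# Illustrative values only (assumed, not taken from the book).
w = [0.0, 0.0]
w = semi_gradient_sarsa_update(
    w, x=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0],
    alpha=0.5, gamma=0.9, terminal=False,
)
print(w)
```

A full control loop would wrap this update in an ε-greedy action selection over q̂, which is omitted here to keep the sketch short.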

n-step semi-gradient Sarsa:
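The n-step variant replaces the one-step target with the n-step return G = R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n q̂(S_{t+n}, A_{t+n}, w) and then applies the same kind of weight update. The following is a rough sketch under the assumption of a linear q̂; the helper names are hypothetical and the numbers are illustrative only.

```python
def n_step_return(rewards, gamma, bootstrap_value):
    """n discounted rewards plus the discounted bootstrap estimate.

    rewards holds R_{t+1}, ..., R_{t+n}; bootstrap_value is
    q_hat(S_{t+n}, A_{t+n}, w), or 0.0 if the episode ended earlier.
    """
    n = len(rewards)
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    return g + gamma ** n * bootstrap_value


def n_step_sarsa_update(w, x_t, g, alpha):
    """Move q_hat(S_t, A_t, w) toward the n-step return g (linear q_hat)."""
    q = sum(wi * xi for wi, xi in zip(w, x_t))
    return [wi + alpha * (g - q) * xi for wi, xi in zip(w, x_t)]


# Two rewards of 1.0, gamma = 0.5, bootstrap estimate 4.0 (assumed values).
g = n_step_return([1.0, 1.0], gamma=0.5, bootstrap_value=4.0)
w = n_step_sarsa_update([0.0, 0.0], x_t=[1.0, 0.0], g=g, alpha=0.5)
print(g, w)
```

With a single reward in the list, the update reduces to the one-step semi-gradient Sarsa step.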

In the continuing case, we have to give up discounting and switch to a new "average-reward" formulation of the control problem, with new value functions.

The Futility of Discounting in Continuing Problems (p. 257): In fact, for policy π, the average of the discounted returns is always η(π) / (1 - γ), that is, it is essentially the average reward, η(π). In particular, the ordering of all policies in the average discounted return setting would be exactly the same as in the average-reward setting. The discount rate γ thus has no effect on the problem formulation.

(So there is no need to consider discounting; we can study the average reward directly. Of course, one could equally study the discounted setting directly and skip the average reward, but the choice is probably historical: the average-reward setting is one of the major settings considered in the classical theory of dynamic programming and, though less often, in reinforcement learning.)
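The η(π) / (1 - γ) identity can be checked numerically. Writing μ for the steady-state distribution, P for the state-transition matrix under π, and r for the expected one-step reward in each state, the discounted values satisfy v = r + γPv, so μᵀv = μᵀr + γμᵀPv = η(π) + γμᵀv. The sketch below verifies this on a made-up two-state chain; the transition matrix and rewards are assumptions chosen only for illustration.

```python
# Hypothetical two-state Markov reward process induced by some fixed policy.
P = [[0.9, 0.1], [0.5, 0.5]]   # P[i][j]: probability of moving from i to j
r = [1.0, 0.0]                 # expected one-step reward in each state
gamma = 0.9


def stationary(P, sweeps=2000):
    """Steady-state distribution via repeated application of mu <- mu P."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(sweeps):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def discounted_values(P, r, gamma, sweeps=2000):
    """Discounted state values via iteration of v <- r + gamma P v."""
    n = len(P)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [r[i] + gamma * sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v


mu = stationary(P)
eta = sum(m * ri for m, ri in zip(mu, r))            # average reward
v = discounted_values(P, r, gamma)
avg_discounted = sum(m * vi for m, vi in zip(mu, v))  # average of discounted values

print(eta, avg_discounted, eta / (1 - gamma))
```

The last two printed numbers should agree up to numerical error, whatever value of γ in [0, 1) is chosen.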

The average-reward setting applies to continuing problems, that is, problems for which the interaction between agent and environment goes on and on forever without termination or start states.

In the average-reward setting, the quality of a policy π is defined as the average rate of reward while following that policy, which we denote as η(π):
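For reference, the definition can be written out as below. This is a transcription under the usual ergodicity assumption, namely that a steady-state distribution μ_π(s) exists and does not depend on the start state S_0; the symbol μ_π is introduced here for that distribution.

```latex
\eta(\pi) \doteq \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T}
             \mathbb{E}\left[ R_t \mid S_0, A_{0:t-1} \sim \pi \right]
          = \lim_{t\to\infty} \mathbb{E}\left[ R_t \mid S_0, A_{0:t-1} \sim \pi \right]
          = \sum_{s} \mu_\pi(s) \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r
```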

In the average-reward setting, returns are defined in terms of differences between rewards and the average reward:
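Written out with the η(π) notation used in this post, the return takes the following form:

```latex
G_t \doteq R_{t+1} - \eta(\pi) + R_{t+2} - \eta(\pi) + R_{t+3} - \eta(\pi) + \cdots
```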

This is known as the differential return, and the corresponding value functions are known as differential value functions. Differential value functions also have Bellman equations, just slightly different from those we have seen earlier. We simply remove all γs and replace all rewards by the difference between the reward and the true average reward. There is also a differential form of the two TD errors.
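As a sketch of what "remove all γs and subtract the average reward" looks like, the differential Bellman equations and the two differential TD errors can be written as follows. Here \bar{R}_t denotes an estimate at time t of the average reward η(π), and w_t the weight vector of the approximators; both symbols are assumptions about notation, not something shown in this post.

```latex
v_\pi(s)   = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
             \left[ r - \eta(\pi) + v_\pi(s') \right]

q_\pi(s,a) = \sum_{s', r} p(s', r \mid s, a)
             \left[ r - \eta(\pi) + \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \right]

\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)

\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)
```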

n-step differential semi-gradient Sarsa:
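A minimal sketch of one n-step differential update is given below, assuming a linear q̂. The error sums the n rewards, each reduced by the current average-reward estimate, adds the bootstrap estimate at step τ+n, and subtracts the estimate at step τ; the same error then updates both the average-reward estimate (step size β) and the weights (step size α). The function name, argument names, and numbers are hypothetical.

```python
def q_hat(w, x):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def differential_n_step_update(w, avg_reward, rewards, x_tau, x_tau_n, alpha, beta):
    """One n-step differential semi-gradient Sarsa update (linear q_hat).

    rewards : R_{tau+1}, ..., R_{tau+n}
    x_tau   : features of (S_tau, A_tau)
    x_tau_n : features of (S_{tau+n}, A_{tau+n})
    Returns the new weights, the new average-reward estimate, and the error.
    """
    # No discounting: every reward is measured relative to the average reward.
    delta = (sum(r - avg_reward for r in rewards)
             + q_hat(w, x_tau_n) - q_hat(w, x_tau))
    new_avg = avg_reward + beta * delta
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, x_tau)]
    return new_w, new_avg, delta


# Illustrative numbers only (assumed).
new_w, new_avg, delta = differential_n_step_update(
    w=[0.0, 0.0], avg_reward=0.5, rewards=[1.0, 1.0],
    x_tau=[1.0, 0.0], x_tau_n=[0.0, 1.0], alpha=0.5, beta=0.25,
)
print(new_w, new_avg, delta)
```

Passing a single reward gives the one-step differential semi-gradient Sarsa update as a special case.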