Summary of Chapter 10, "On-policy Control with Approximation", of Reinforcement Learning: An Introduction

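This post discusses the semi-gradient Sarsa algorithm and its application to continuing tasks, introduces a way of evaluating policies without a discount factor, and explains the value functions and the differential return used in the average-reward setting.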

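A new student has joined our group and I need to help him get started with RL; we chose to begin with Silver's course.

For myself, I am adding the requirement of reading Reinforcement Learning: An Introduction carefully.

I did not read it very carefully before, so this time I hope to be more thorough and to write a short summary of the key points as I go.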




The present chapter (On-policy Control with Approximation) features the semi-gradient Sarsa algorithm, the natural extension of semi-gradient TD(0) (last chapter) to action values and to on-policy control.




In the episodic case, the extension is straightforward.
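For reference, the one-step episodic update can be written as below. This is a reconstruction from the standard statement of semi-gradient Sarsa, not a formula shown in this post; α is the step size and q̂ is the parameterized action-value function with weight vector w.

```latex
\mathbf{w}_{t+1} \doteq \mathbf{w}_t
  + \alpha \Big[ R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)
  - \hat{q}(S_t, A_t, \mathbf{w}_t) \Big] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)
```

It is the semi-gradient TD(0) update with state values replaced by action values.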


n-step Semi-gradient SARSA:
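The pseudocode figure is missing here. As a stand-in, here is a minimal sketch of the core n-step update, assuming a linear q̂ (so the gradient is simply the feature vector). The function names, the list-based layout of `xs` and `rewards`, and the usage values are hypothetical choices for illustration; plain Python is used to avoid dependencies.

```python
def q_hat(w, x):
    """Linear action-value estimate: q_hat(s, a, w) = w . x(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def n_step_sarsa_update(w, xs, rewards, tau, n, T, gamma, alpha):
    """Return the weights after one n-step semi-gradient Sarsa update at time tau.

    xs[t]      : feature vector x(S_t, A_t) (hypothetical layout)
    rewards[t] : reward R_t; rewards[0] is unused
    T          : terminal time step of the episode
    """
    end = min(tau + n, T)
    # n-step return: discounted rewards R_{tau+1} .. R_{min(tau+n, T)}
    G = sum(gamma ** (i - tau - 1) * rewards[i] for i in range(tau + 1, end + 1))
    # Bootstrap from q_hat only if the episode does not end within n steps
    if tau + n < T:
        G += gamma ** n * q_hat(w, xs[tau + n])
    # Semi-gradient step: for linear q_hat the gradient is x(S_tau, A_tau)
    error = G - q_hat(w, xs[tau])
    return [wi + alpha * error * xi for wi, xi in zip(w, xs[tau])]


# Usage with made-up data: a 2-step episode, so no bootstrapping occurs
xs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
rewards = [0.0, 1.0, 2.0]
w = n_step_sarsa_update([0.0, 0.0], xs, rewards, tau=0, n=2, T=2, gamma=0.5, alpha=0.1)
print(w)
```

A full control loop would also choose actions ε-greedily with respect to q̂ and store the last n+1 states, actions and rewards; that part is omitted here.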








In the continuing case, we have to give up discounting and switch to a new "average-reward" formulation of the control problem, with new value functions.

The Futility of Discounting in Continuing Problems (p. 257): In fact, for policy π, the average of the discounted returns is always η(π)/(1 − γ), that is, it is essentially the average reward, η(π). In particular, the ordering of all policies in the average discounted return setting would be exactly the same as in the average-reward setting. The discount rate γ thus has no effect on the problem formulation.

(So there is no need to consider discounting; we can study the average reward directly. Of course, one could equally study discounting and skip the average-reward setting, but my guess is that the choice is historical: the average-reward setting is one of the major settings considered in the classical theory of dynamic programming and, though less often, in reinforcement learning.)
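The claim that the average of the discounted returns equals η(π)/(1 − γ) can be checked numerically. The sketch below assumes a hypothetical 3-state Markov chain induced by some fixed policy; the transition matrix `P`, the rewards `r` and γ = 0.9 are made-up values, not taken from the book. It weights the discounted state values by the stationary distribution and compares the result with η/(1 − γ).

```python
def mat_vec(P, v):
    """Compute P v for a row-major matrix P."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def vec_mat(d, P):
    """Compute d^T P, i.e. push a state distribution one step forward."""
    n = len(d)
    return [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical ergodic chain under a fixed policy (illustrative numbers only)
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.2, 0.0, 0.8]]
r = [1.0, 0.0, 2.0]  # expected one-step reward when leaving each state
gamma = 0.9

# Stationary distribution d by power iteration: d <- d^T P
d = [1.0 / 3.0] * 3
for _ in range(2000):
    d = vec_mat(d, P)

# Discounted state values by iterating v <- r + gamma * P v
v = [0.0] * 3
for _ in range(2000):
    Pv = mat_vec(P, v)
    v = [r[s] + gamma * Pv[s] for s in range(3)]

eta = sum(d[s] * r[s] for s in range(3))             # average reward
avg_discounted = sum(d[s] * v[s] for s in range(3))  # average of discounted returns
print(avg_discounted, eta / (1 - gamma))
```

The two printed numbers should agree up to floating-point error. The reason is that d is unchanged by P, so every term γ^k d·P^k·r in the discounted sum equals γ^k η.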


Average reward setting applies to continuing problems, problems for which the interaction between agent and environment goes on and on forever without termination or start states.

In the average-reward setting, the quality of a policy π is defined as the average rate of reward while following that policy, which we denote as η(π).
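The defining equation is missing from this post. Reconstructed from the standard definition, in the η notation used here, it reads as below. The last two forms assume the process is ergodic, so that the steady-state distribution d_π exists and does not depend on the starting state.

```latex
\eta(\pi) \doteq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T}
      \mathbb{E}\left[ R_t \mid A_{0:t-1} \sim \pi \right]
  = \lim_{t \to \infty} \mathbb{E}\left[ R_t \mid A_{0:t-1} \sim \pi \right]
  = \sum_{s} d_\pi(s) \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r
```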



In the average-reward setting, returns are defined in terms of differences between rewards and the average reward.
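Written out in the same notation, this return is an undiscounted sum of reward deviations from the average reward (a reconstruction of the missing equation):

```latex
G_t \doteq R_{t+1} - \eta(\pi) + R_{t+2} - \eta(\pi) + R_{t+3} - \eta(\pi) + \cdots
```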



This is known as the differential return, and the corresponding value functions are known as differential value functions. Differential value functions also have Bellman equations, just slightly different from those we have seen earlier. We simply remove all γs and replace all rewards by the difference between the reward and the true average reward. There is also a differential form of the two TD errors.
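Applying that recipe (drop every γ, subtract the average reward from every reward) to the usual Bellman equations and TD errors gives the following forms. They are reconstructed from the description above rather than copied from the post. The symbol \bar{R}_t is an assumed name for the running estimate of η(π), which the TD errors must use because the true average reward is unknown during learning.

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
           \left[ r - \eta(\pi) + v_\pi(s') \right]

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)
           \Big[ r - \eta(\pi) + \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Big]

\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)

\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)
```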



n-step Differential Semi-gradient SARSA:
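The pseudocode figure is missing here too. The sketch below shows only the core update, following the form commonly given for the differential n-step algorithm: compute an n-step differential TD error, then move both the average-reward estimate and the weights in its direction. It assumes a linear q̂; the function names, the data layout and the step sizes `alpha` and `beta` are hypothetical, and the edition of the book used for this post may differ in detail.

```python
def q_hat(w, x):
    """Linear action-value estimate: q_hat(s, a, w) = w . x(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def differential_n_step_update(w, avg_reward, xs, rewards, tau, n, alpha, beta):
    """Return (new_w, new_avg_reward, delta) for one differential n-step update.

    xs[t]      : feature vector x(S_t, A_t) (hypothetical layout)
    rewards[t] : reward R_t; rewards[0] is unused
    avg_reward : current estimate of the average reward eta(pi)
    """
    # n-step differential TD error: no discounting, rewards measured
    # relative to the current average-reward estimate
    delta = sum(rewards[i] - avg_reward for i in range(tau + 1, tau + n + 1))
    delta += q_hat(w, xs[tau + n]) - q_hat(w, xs[tau])
    # Update the average-reward estimate with its own step size beta
    new_avg = avg_reward + beta * delta
    # Semi-gradient step: for linear q_hat the gradient is x(S_tau, A_tau)
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, xs[tau])]
    return new_w, new_avg, delta


# Usage with made-up data
xs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
rewards = [0.0, 1.0, 2.0]
w, avg_reward, delta = differential_n_step_update(
    [0.0, 1.0], 0.5, xs, rewards, tau=0, n=2, alpha=0.1, beta=0.01)
print(w, avg_reward, delta)
```

Because the task is continuing, there is no terminal time T to check, so the update always bootstraps from q̂ at time τ + n.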







