
Why does the discount rate appear twice in the REINFORCE algorithm?

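I am reading "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto (draft, November 5, 2017). Page 271 gives the pseudocode for the episodic Monte Carlo policy-gradient method. Looking at this pseudocode, I cannot understand why the discount rate seems to appear twice: once in the update rule and a second time inside the return. [See the figure below.] It seems that the returns for the steps after the first are just truncations of the first step's return. Also, one page earlier in the book there is an equation with only one discount rate (the one inside the return). Why does the pseudocode seem to be different? My guess is that I am misunderstanding something:

$$
{\mathbf{\theta}}_{t+1} ~\dot{=}~ \mathbf{\theta}_t + \alpha G_t \frac{{\nabla}_{\mathbf{\theta}} \pi \left(A_t \middle| S_t, \mathbf{\theta}_{t} \right)}{\pi \left(A_t \middle| S_t, \mathbf{\theta}_{t} \right)}. \tag{13.6}
$$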
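To make the two occurrences concrete, here is a minimal Python sketch of the scalar factors involved. It assumes the pseudocode's update has the form θ ← θ + α γ^t G_t ∇ln π(A_t|S_t, θ), with G_t the discounted return from step t; that form is my reading of the pseudocode, not a quote from it. The function names and the reward values are hypothetical, and the gradient term is left out because it is the same in both versions.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}], where G_t = sum_k gamma^(k-t-1) * R_k for k > t.

    rewards[t] is assumed to hold R_{t+1}, the reward received after step t.
    This is the first place the discount rate appears: inside the return.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def update_scales(rewards, gamma, use_gamma_t=True):
    """Scalar that multiplies alpha * grad ln pi(A_t | S_t, theta) at each step t.

    With use_gamma_t=True the extra factor gamma^t is applied (assumed to match
    the pseudocode); with False only G_t is used (as in equation 13.6).
    """
    returns = discounted_returns(rewards, gamma)
    return [(gamma ** t if use_gamma_t else 1.0) * g for t, g in enumerate(returns)]


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]  # hypothetical three-step episode
    gamma = 0.5
    print(discounted_returns(rewards, gamma))          # [1.75, 1.5, 1.0]
    print(update_scales(rewards, gamma, True))         # [1.75, 0.75, 0.25]
    print(update_scales(rewards, gamma, False))        # [1.75, 1.5, 1.0]
```

In this toy episode the two versions agree at t = 0 and differ at every later step, where the pseudocode-style update is shrunk by the additional γ^t.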

