Questions tagged «policy-iteration»

2
Why does the policy iteration algorithm converge to the optimal policy and value function?
I am reading Andrew Ng's lecture notes on reinforcement learning and trying to understand why policy iteration converges to the optimal value function $V^*$ and the optimal policy $\pi^*$. Recall that policy iteration is:

$$
\begin{array}{l}
\text{Initialize } \pi \text{ randomly} \\
\text{Repeat } \{ \\
\quad \text{Let } V := V^{\pi} \quad \text{(for the current policy, solve Bellman's equations and set that to the current } V\text{)} \\
\quad \text{Let } \pi(s) := \arg\max_{a \in A} \sum_{s'} P_{sa}(s') V(s') \\
\}
\end{array}
$$
…
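The loop described in the question can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code from the lecture notes: the names `P`, `R`, `gamma`, `evaluate_policy` and `policy_iteration` are hypothetical, the reward is assumed to depend on the state only, a discount factor `gamma` is assumed, the Bellman equations are approximated by repeated sweeps rather than solved exactly, and the policy starts at action 0 everywhere instead of randomly so the run is reproducible.

```python
def evaluate_policy(P, R, policy, gamma, tol=1e-10):
    """Approximate V^pi for a fixed policy by iterating
    V(s) <- R(s) + gamma * sum_s' P[s][policy[s]][s'] * V(s')
    until successive sweeps differ by less than tol."""
    n = len(R)
    V = [0.0] * n
    while True:
        new_V = [
            R[s] + gamma * sum(P[s][policy[s]][t] * V[t] for t in range(n))
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy is stable.
    P[s][a][t] is the probability of moving from state s to state t under action a."""
    n = len(R)
    n_actions = len(P[0])
    policy = [0] * n  # assumption: fixed start instead of a random one
    while True:
        V = evaluate_policy(P, R, policy, gamma)
        new_policy = list(policy)
        for s in range(n):
            # Expected next-state value of each action: sum_s' P_sa(s') V(s')
            q = [sum(P[s][a][t] * V[t] for t in range(n)) for a in range(n_actions)]
            best = max(range(n_actions), key=q.__getitem__)
            # Switch only on a strict improvement, so ties cannot cause cycling
            if q[best] > q[policy[s]] + 1e-9:
                new_policy[s] = best
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Hypothetical 2-state MDP: action 0 stays put, action 1 moves to the other state.
# Only state 1 pays a reward.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1
]
R = [0.0, 1.0]

policy, V = policy_iteration(P, R, gamma=0.9)
print(policy)  # [1, 0]: move out of state 0, stay in state 1
```

In this toy example the first evaluation gives state 1 a much higher value than state 0, the improvement step switches state 0 to the "move" action, and the next round changes nothing, so the loop stops. The strict-improvement check is a design choice of this sketch; it keeps the current action on ties, which avoids oscillating between equally good policies.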
Licensed under cc by-sa 3.0 with attribution required.