PPO
Proximal Policy Optimization (Schulman et al., 2017)
A method that approximates TRPO, making the objective easier to optimize.
maximize
$ \mathcal{L}^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta)\hat{A}_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)] $
$ r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)} $
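As a sketch, the clipped surrogate objective above could be computed as follows. This is an illustrative NumPy implementation, not code from the paper; the function name and signature are assumptions.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Clipped surrogate objective L^CLIP (illustrative sketch)."""
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old,
    # computed from log-probabilities for numerical stability.
    ratio = np.exp(np.asarray(log_prob_new) - np.asarray(log_prob_old))
    advantages = np.asarray(advantages)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] removes the incentive
    # to move the policy far from pi_theta_old in a single update.
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Element-wise min gives a pessimistic lower bound; the mean over
    # timesteps approximates the empirical expectation E_t.
    return np.mean(np.minimum(unclipped, clipped))
```

When the ratio is 1 (new policy equals the old one), the objective reduces to the mean advantage; when the ratio drifts outside $[1-\epsilon, 1+\epsilon]$, the clipped term caps the contribution of that timestep.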