Self-Imitation
Self-Imitation Learning (SIL)
$\mathcal{L}^{sil} = \mathbb{E}_{s,a,R \in D}\left[\mathcal{L}^{sil}_{policy} + \beta^{sil}\mathcal{L}^{sil}_{value}\right]$
$\mathcal{L}^{sil}_{policy} = -\log\pi_\theta(a|s)\,(R - V_\theta(s))_+$
$\mathcal{L}^{sil}_{value} = \frac{1}{2}\left\|(R - V_\theta(s))_+\right\|^2$
$(\cdot)_+ = \max(\cdot, 0)$
$\mathcal{L}^{sil}_{value}$: the return prediction error of $V_\theta$. Since $\mathcal{L}^{sil} > 0$ only when $R - V_\theta(s) > 0$, the agent imitates only past transitions whose return exceeded the current value estimate.
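As a concrete illustration, here is a minimal PyTorch sketch of this loss, not the authors' reference implementation; the names `sil_loss`, `log_prob`, `returns`, `values`, and `beta_sil` are all illustrative assumptions. The advantage weight is detached in the policy term, which is the common practice for this loss, so that $(R - V_\theta(s))_+$ acts as a fixed weight on $-\log\pi_\theta(a|s)$ while gradients reach $V_\theta$ only through $\mathcal{L}^{sil}_{value}$.
```python
import torch

def sil_loss(log_prob: torch.Tensor,
             returns: torch.Tensor,
             values: torch.Tensor,
             beta_sil: float = 0.01) -> torch.Tensor:
    """Sketch of L^sil = E[L^sil_policy + beta^sil * L^sil_value].

    log_prob: log pi_theta(a|s) for (s, a) sampled from the replay buffer D
    returns:  discounted returns R stored in D
    values:   current value estimates V_theta(s)
    beta_sil: value-loss coefficient beta^sil (illustrative default)
    """
    # (R - V_theta(s))_+ = max(R - V_theta(s), 0)
    advantage = torch.clamp(returns - values, min=0.0)
    # L^sil_policy: detach so the weight does not backprop into V_theta
    policy_loss = -log_prob * advantage.detach()
    # L^sil_value: squared positive return prediction error
    value_loss = 0.5 * advantage.pow(2)
    return (policy_loss + beta_sil * value_loss).mean()
```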
Other
Achieves good results on hard-exploration Atari games. [Paper] Self-Imitation Learning (SIL, 2018)