DPO
https://arxiv.org/pdf/2305.18290
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
An alternative to RLHF