DPO
https://arxiv.org/pdf/2305.18290 Direct Preference Optimization: Your Language Model is Secretly a Reward Model
An alternative to RLHF
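
A minimal sketch of the per-example loss from the paper, to show why no separate reward model or RL loop is needed: the policy is trained directly on preference pairs with a logistic loss over log-probability ratios against a frozen reference model. The function name, argument names, and the default beta=0.1 are assumptions for illustration; the inputs are taken to be summed sequence log-probabilities of the chosen and rejected responses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)).

    All four inputs are log-probabilities of a full response given the prompt.
    beta (assumed default 0.1) scales how strongly the policy is pulled away
    from the reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's probability relative to the rejected one. In practice this would be computed over batches with an autodiff framework rather than on plain floats.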