DPO - work4ai

DPO

https://arxiv.org/pdf/2305.18290 Direct Preference Optimization: Your Language Model is Secretly a Reward Model

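An alternative to RLHF: the policy is trained directly on preference pairs, with no separately trained reward model and no RL loop.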
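
As a rough illustration of what DPO optimizes in place of the RLHF pipeline, the per-pair loss from the paper can be sketched in plain Python. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this sketch, not taken from this note. The inputs are the summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    beta=0.1 is only an illustrative default (assumption).
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Policy shifted toward the chosen response: the loss drops below log(2).
improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

In an actual training setup the log-probabilities would come from a differentiable framework and the loss would be averaged over a batch; this scalar version only shows the shape of the objective.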