MiniLLM: Knowledge Distillation of Large Language Models
Abstract
In this work, we propose a knowledge distillation (KD) approach that distills large language models (LLMs) into smaller language models.
We first replace the forward Kullback-Leibler divergence (KLD) objective used in standard KD approaches with reverse KLD, which is more suitable for KD on generative language models because it prevents the student model from overestimating the low-probability regions of the teacher distribution.
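For reference, writing $p$ for the teacher distribution and $q_\theta$ for the student (this notation is assumed here, with prompt $x$ and response $y$), the two objectives differ only in which distribution the expectation is taken under:

$$\mathrm{KL}(p \,\|\, q_\theta) = \mathbb{E}_{y \sim p(\cdot \mid x)}\!\left[\log \frac{p(y \mid x)}{q_\theta(y \mid x)}\right], \qquad \mathrm{KL}(q_\theta \,\|\, p) = \mathbb{E}_{y \sim q_\theta(\cdot \mid x)}\!\left[\log \frac{q_\theta(y \mid x)}{p(y \mid x)}\right].$$

Minimizing the reverse form over $\theta$ is mode-seeking: the student is penalized for placing probability mass where the teacher assigns little, which is exactly the overestimation of low-probability regions noted above, whereas the forward form pushes the student to cover the entire teacher distribution.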
Then, we derive an effective optimization approach to learn this objective.
The student models are named MiniLLM.
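As a rough, token-level sketch (not the sequence-level optimization approach derived in the paper), the reverse KLD between teacher and student output distributions can be computed from their logits as follows; the function name, tensor shapes, and masking convention are illustrative assumptions:

import torch
import torch.nn.functional as F

def reverse_kld_loss(student_logits, teacher_logits, mask):
    # Token-level reverse KLD, KL(q_student || p_teacher), averaged over
    # unmasked positions. Shapes: logits are (batch, seq_len, vocab_size),
    # mask is (batch, seq_len) with 1 for response tokens and 0 elsewhere.
    log_q = F.log_softmax(student_logits, dim=-1)      # student log-probs
    log_p = F.log_softmax(teacher_logits, dim=-1)      # teacher log-probs
    kl = (log_q.exp() * (log_q - log_p)).sum(dim=-1)   # per-token KL(q || p)
    return (kl * mask).sum() / mask.sum().clamp(min=1)

In practice, the expectation in reverse KLD is taken over responses sampled from the student, which is why the paper derives a dedicated optimization approach; the exact per-token sum above is shown only to make the direction of the KL concrete.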