Calibrate Before Use: Improving Few-shot Performance of Language Models

We demonstrate that this instability arises from the bias of language models towards predicting certain answers.

Figure 2

There is high variance in GPT-3’s accuracy as we change the prompt’s training examples, as well as the permutation of the examples.

The paper points out that although "Language Models are Few-Shot Learners", they are biased.
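
The paper's title refers to correcting this bias before using the model. Below is a minimal sketch of one such correction. It assumes the bias can be estimated from the label probabilities the model assigns to a content-free input such as "N/A". The function name `calibrate` and all probabilities are hypothetical illustrations, not results from the paper.

```python
def calibrate(probs, content_free_probs):
    """Divide each label probability by its content-free estimate, then renormalize."""
    scaled = [p / q for p, q in zip(probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]


labels = ["Positive", "Negative"]

# Hypothetical: probabilities the model assigns to each label for a
# content-free input. An unbiased model would give 0.5 / 0.5 here.
content_free = [0.7, 0.3]

# Hypothetical: probabilities for a real test input under the same prompt.
test_probs = [0.6, 0.4]

calibrated = calibrate(test_probs, content_free)

raw_prediction = labels[test_probs.index(max(test_probs))]
calibrated_prediction = labels[calibrated.index(max(calibrated))]
print(raw_prediction, calibrated_prediction)
```

With these made-up numbers the uncalibrated prediction follows the label the prompt already favors, while the calibrated prediction does not. The point is only that dividing out the content-free estimate removes a prompt-dependent preference for one answer.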