Calibrate Before Use: Improving Few-shot Performance of Language Models
We demonstrate that this instability arises from the bias of language models towards predicting certain answers.
Figure 2: There is high variance in GPT-3’s accuracy as we change the prompt’s training examples, as well as the permutation of the examples.
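One way to make the permutation sensitivity described in the caption concrete is to score every ordering of a fixed set of prompt examples and look at the spread. The sketch below is illustrative only: `toy_predict` is a hypothetical stand-in for a language model (it is not GPT-3 and makes no API call), deliberately built to favour the label of the last prompt example so that ordering matters. The example sentences, labels, and function names are likewise assumptions made for the illustration, not data from the paper.

```python
import itertools
import statistics


def toy_predict(prompt_examples, query):
    """Hypothetical stand-in for a language model (assumption, not GPT-3).

    It answers from an obvious cue word when one is present; otherwise it
    copies the label of the last prompt example, mimicking a biased model.
    """
    if "great" in query:
        return "Positive"
    if "awful" in query:
        return "Negative"
    return prompt_examples[-1][1]


def accuracy(prompt_examples, test_set):
    """Fraction of test queries the toy model labels correctly."""
    correct = sum(toy_predict(prompt_examples, q) == y for q, y in test_set)
    return correct / len(test_set)


def accuracy_over_permutations(train_examples, test_set):
    """Accuracy for every ordering of the same prompt examples."""
    return [
        accuracy(list(order), test_set)
        for order in itertools.permutations(train_examples)
    ]


# Made-up sentiment data, used only to exercise the functions above.
train_examples = [
    ("a great film", "Positive"),
    ("an awful plot", "Negative"),
    ("great acting", "Positive"),
]
test_set = [
    ("great fun", "Positive"),
    ("awful pacing", "Negative"),
    ("I loved it", "Positive"),
    ("would watch again", "Positive"),
    ("I fell asleep", "Negative"),
]

scores = accuracy_over_permutations(train_examples, test_set)
print("min:", min(scores), "max:", max(scores))
print("std:", statistics.pstdev(scores))
```

The same loop extends to the other axis in the caption: draw several different sets of training examples, call `accuracy_over_permutations` on each, and compare the resulting distributions. With a real model, `toy_predict` would be replaced by a call that scores each candidate label given the prompt.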