UniMax
_akhaliq UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining. The authors release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.
https://gyazo.com/3942a84b90b58f564f847e5ce8742410
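
The note above names UniMax sampling but does not say how it works. As a rough sketch of the idea, assuming the procedure is to spread a fixed character budget as uniformly as possible across languages while capping each language at a fixed number of epochs over its own data: the function name `unimax_distribution`, the toy character counts, the budget, and the epoch cap below are all illustrative assumptions, not values from the release.

```python
def unimax_distribution(char_counts, budget, max_epochs):
    """Sketch of a UniMax-style sampling distribution (illustrative only).

    char_counts: {language: number of characters available}
    budget:      total number of characters to be seen during training
    max_epochs:  cap on how many times any language's data may be repeated
    Returns {language: sampling probability}.
    """
    # Visit languages from the smallest corpus to the largest, so that
    # budget left unused by capped low-resource languages flows to the rest.
    langs = sorted(char_counts, key=char_counts.get)
    remaining = float(budget)
    alloc = {}
    for i, lang in enumerate(langs):
        # Uniform share of whatever budget is still unallocated.
        share = remaining / (len(langs) - i)
        # Never allocate more than max_epochs passes over this language.
        cap = char_counts[lang] * max_epochs
        alloc[lang] = min(share, cap)
        remaining -= alloc[lang]
    total = sum(alloc.values())
    return {lang: a / total for lang, a in alloc.items()}


# Toy numbers, assumed purely for illustration.
counts = {"en": 1000, "fr": 300, "sw": 10}
probs = unimax_distribution(counts, budget=600, max_epochs=4)
for lang, p in sorted(probs.items()):
    print(f"{lang}: {p:.3f}")
```

In this toy run the smallest language hits its epoch cap, and the two larger languages split the leftover budget equally, which is the "uniform, but capped" behaviour the name suggests.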