UniMax
_akhaliq UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining
release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling
abs: https://arxiv.org/abs/2304.09151
github: https://github.com/google-research/t5x/blob/main/docs/models.md
https://gyazo.com/3942a84b90b58f564f847e5ce8742410
#Google