UniMax - work4ai

_akhaliq UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining

Released alongside the paper: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.
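A rough sketch of how UniMax sampling appears to work, based on a reading of the paper's algorithm rather than on the released code. Languages are visited from smallest to largest corpus, each one gets a uniform share of the remaining character budget, and that share is capped at a fixed number of epochs over the language's data. The function name, the parameter names, and the toy character counts below are all hypothetical and are not taken from t5x or from the paper's experiments.

```python
def unimax_distribution(char_counts, total_budget, max_epochs):
    """Return a sampling probability per language (illustrative sketch only).

    char_counts:  dict mapping language -> character count in the corpus
    total_budget: total number of characters the training run will consume
    max_epochs:   cap on how many passes may be made over any one language
    """
    # Visit languages from the smallest corpus to the largest.
    ordered = sorted(char_counts, key=char_counts.get)
    remaining = float(total_budget)
    allocation = {}
    for i, lang in enumerate(ordered):
        # Split the remaining budget uniformly over languages not yet handled.
        share = remaining / (len(ordered) - i)
        # Cap the share at max_epochs passes over this language's data.
        allocation[lang] = min(share, char_counts[lang] * max_epochs)
        remaining -= allocation[lang]
    # Normalize the allocated budgets into a probability distribution.
    total = sum(allocation.values())
    return {lang: used / total for lang, used in allocation.items()}


# Made-up counts, chosen only to show the cap taking effect.
toy_counts = {"en": 1000, "is": 50, "sw": 10}
probs = unimax_distribution(toy_counts, total_budget=600, max_epochs=4)
```

With these toy numbers the two small languages hit the epoch cap (40 and 200 characters of budget), and the budget they could not use flows to the largest language, which ends up with the remaining 360.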

abs: https://arxiv.org/abs/2304.09151

github: https://github.com/google-research/t5x/blob/main/docs/models.md

https://gyazo.com/3942a84b90b58f564f847e5ce8742410

#Google