TVLT: Textless Vision-Language Transformer
https://arxiv.org/abs/2209.14156v1
AudioLM: a Language Modeling Approach to Audio Generation