H3
Attention is all you need... but how much of it do you need?
Announcing H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! Accepted as a *spotlight* at #ICLR2023! 📣 w/ @tri_dao // Podcast #2: Hungry Hungry Hippos (H3)
Stanford researchers just released a new architecture that:
- Beats Transformers at ~1B param scale
- Admits *much* longer context than Transformers
Is H3 the Transformer-killer? More below!
https://gyazo.com/155fd3e1ffcf0f23398da93493668975
https://gyazo.com/321936b2896e7002a1826924ece5f497
https://gyazo.com/ebc7e45c83665711c41b60e011c61aaa
https://gyazo.com/b08332cd5d578780625a2c67a22c4e42
Hungry Hungry Hippos, aka "H3", functions like a linear RNN, or a long convolution.
The key idea: due to the fast Fourier transform, an H3 layer:
- can be computed in n*log(n) time, with n the context length
- unlike Transformers, which require n^2!
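A minimal NumPy sketch of the linear-RNN vs. long-convolution equivalence above (not the actual H3 layer; the scalar state and the parameter names a, b, c are illustrative assumptions): the same linear state-space map computed once as a step-by-step recurrence and once as a long convolution with its unrolled impulse response, evaluated via the FFT in O(n log n).
```python
import numpy as np

def ssm_recurrence(u, a, b, c):
    # Linear-RNN view: x_t = a*x_{t-1} + b*u_t, y_t = c*x_t  (n sequential steps)
    x, ys = 0.0, []
    for u_t in u:
        x = a * x + b * u_t
        ys.append(c * x)
    return np.array(ys)

def ssm_fft_conv(u, a, b, c):
    # Long-convolution view: y = k * u with kernel k_j = c * a**j * b,
    # computed via the FFT in O(n log n) instead of O(n^2) direct convolution.
    n = len(u)
    k = c * (a ** np.arange(n)) * b        # unrolled impulse response
    K = np.fft.rfft(k, 2 * n)              # zero-pad to avoid circular wrap-around
    U = np.fft.rfft(u, 2 * n)
    return np.fft.irfft(K * U, 2 * n)[:n]  # keep the causal part

u = np.random.randn(8)
print(np.allclose(ssm_recurrence(u, 0.9, 1.0, 0.5),
                  ssm_fft_conv(u, 0.9, 1.0, 0.5)))  # True: both views give the same output
```
(In the paper, an H3 layer stacks a shift SSM and a diagonal SSM with multiplicative gating so it can do attention-like recall and comparison; the n log n cost comes from exactly this FFT trick.)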
Whereas the Transformer's compute cost is $O(n^2)$, H3 cuts it down to $O(n \log_2 n)$.
In other words, for a 1,000-token input, a Transformer's compute grows to the order of a million operations, while H3 needs only on the order of ten thousand. That is a massive reduction. ChatGPT can only take about 4,000 input tokens, but an H3-based model might be able to handle tens or even hundreds of thousands of tokens. (うみゆき@AI研究)
Just as the Transformer turned out to be a revolutionary technology, this could become a hugely important foundational line of research.wogikaze.icon
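A quick back-of-the-envelope check of those orders of magnitude, just plugging numbers into the two cost models above (constants ignored):
```python
import math

# Assumed cost models from the complexity claims above: n**2 vs. n*log2(n).
for n in (1_000, 4_000, 100_000):
    print(f"n = {n:>7}:  attention ~ {n**2:.1e}   FFT-based H3 layer ~ {n * math.log2(n):.1e}")
```
At n = 1,000 this gives roughly 10^6 versus 10^4, which is where the "million vs. ten thousand" figures come from.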
Or rather, I hope it does.