FlashAttention

https://arxiv.org/pdf/2205.14135.pdf

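Flash Attention was proposed to make it possible to train Transformers on long sequences. It tries to solve the problems that conventional attention runs into there, chiefly that its time and memory cost grow quadratically with sequence length.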

"Pytorch2.0でFlash Attentionを使ってみた話" (Trying out Flash Attention in PyTorch 2.0)

https://github.com/Dao-AILab/flash-attention

This repository provides the official implementation of FlashAttention and FlashAttention-2 from the following papers.

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

Paper: https://arxiv.org/abs/2205.14135
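
The word "exact" in the title can be illustrated with a small sketch. The code below is not the official implementation (that is a fused GPU kernel). It is a pure-Python toy that only shows the online-softmax idea: keys and values are processed block by block, keeping a running max, a running sum, and an accumulator per query, so the full score row is never stored at once. The function names and `block_size` are hypothetical names chosen for this note.

```python
import math


def naive_attention(q, k, v):
    """Reference version: build the full score row per query, then softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Blockwise version: visit K/V in chunks with an online softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf  # largest score seen so far
        running_sum = 0.0        # softmax denominator relative to running_max
        acc = [0.0] * dv         # unnormalized weighted sum of values
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the old max.
            correction = math.exp(running_max - new_max)
            running_sum *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_blk):
                w = math.exp(s - new_max)
                running_sum += w
                for c in range(dv):
                    acc[c] += w * vj[c]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions should agree up to floating-point rounding, which is the sense in which the blockwise computation is exact rather than an approximation. The real kernels also tile the queries and keep blocks in on-chip memory, which this sketch does not model.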

https://x.com/tri_dao/status/1811453622070444071

FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but has yet to take advantage of modern GPUs. We’re releasing FlashAttention-3: 1.5-2x faster on FP16, up to 740 TFLOPS on H100 (75% util), and FP8 gets close to 1.2 PFLOPS!
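
A quick sanity check on the numbers in that post: if 740 TFLOPS corresponds to 75% utilization, the implied peak is a little under 1 PFLOPS. The variable names below are made up for this note, and the only inputs are the two figures quoted above.

```python
# Figures quoted in the post above.
achieved_tflops = 740.0
utilization = 0.75

# Peak throughput implied by those two numbers.
implied_peak_tflops = achieved_tflops / utilization
print(f"implied peak: {implied_peak_tflops:.0f} TFLOPS")
```

That lands near the dense FP16 tensor-core peak usually cited for the H100 SXM (roughly 989 TFLOPS, taken here as an assumption rather than something stated in the post).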

https://github.com/mjun0812/flash-attention-prebuild-wheels?tab=readme-ov-file#windows-x86_64 mjun0812/flash-attention-prebuild-wheels

A repository that hosts prebuilt binaries for Windows and Linux. With luck, there is a binary that matches your environment. (morisoba65536)

https://github.com/sunsetcoder/flash-attention-windows sunsetcoder/flash-attention-windows

Flash Attention Windows Wheels (Python 3.10), covering versions up to 2.7.0. There may be a few other repos like this if you look around.
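
Whether a prebuilt wheel matches usually comes down to the Python version, the OS and CPU architecture, and the installed PyTorch and CUDA versions. A minimal sketch for collecting those values before browsing the repos above follows. `describe_environment` is a hypothetical helper name, and the torch lookup is optional so the script still runs where PyTorch is not installed.

```python
import platform
import sys


def describe_environment():
    """Collect the values that typically decide which wheel can be installed."""
    info = {
        # CPython tag as it usually appears in wheel names, e.g. cp310.
        "python_tag": f"cp{sys.version_info.major}{sys.version_info.minor}",
        "system": platform.system(),    # e.g. Windows or Linux
        "machine": platform.machine(),  # e.g. AMD64 or x86_64
        "torch": None,
        "cuda": None,
    }
    try:
        import torch  # optional; stays None if PyTorch is missing

        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        pass
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Compare the printed values against the wheel list in each repository. The exact file naming scheme is up to each repo, so check its README rather than relying on this sketch.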