OpenSourceWeek
Honored to share FlashMLA - our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production.
✅ BF16 support
✅ Paged KV cache (block size 64)
⚡ 3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800
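
For readers new to the idea, here is a minimal sketch of how a paged KV cache with block size 64 can map variable-length sequences onto fixed-size blocks. This is not FlashMLA's API or implementation: `BlockAllocator`, `blocks_needed`, and `locate_token` are hypothetical names, and the only detail taken from the post is the block size.

```python
from typing import List, Tuple

BLOCK_SIZE = 64  # block size stated in the post; everything else is illustrative


def blocks_needed(seq_len: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of fixed-size blocks required to hold seq_len cached tokens."""
    return (seq_len + block_size - 1) // block_size  # ceiling division


def locate_token(block_table: List[int], token_idx: int,
                 block_size: int = BLOCK_SIZE) -> Tuple[int, int]:
    """Translate a logical token index into (physical block id, offset in block)."""
    logical_block, offset = divmod(token_idx, block_size)
    return block_table[logical_block], offset


class BlockAllocator:
    """Hypothetical free-list allocator over a pool of physical cache blocks."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self, seq_len: int) -> List[int]:
        """Reserve enough blocks for seq_len tokens and return the block table."""
        n = blocks_needed(seq_len)
        if n > len(self.free):
            raise MemoryError("not enough free KV cache blocks")
        taken, self.free = self.free[:n], self.free[n:]
        return taken


# Two sequences of different lengths share one pool without padding
# either of them to a common maximum length.
allocator = BlockAllocator(num_blocks=8)
table_a = allocator.allocate(100)  # needs 2 blocks
table_b = allocator.allocate(65)   # needs 2 blocks; the second holds one token
print(table_a, table_b, locate_token(table_b, 64))
```

Because each sequence only holds the blocks it actually needs, memory waste is bounded by at most one partially filled block per sequence, which is why paging suits variable-length decoding.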