kv-cache
key value cache
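Caches the K and V parts of QKV (query, key, value)
Stores the Key and Value of past tokens and reuses them when generating the next token, instead of recomputing them at every step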
https://cdn-uploads.huggingface.co/production/uploads/6527e89a8808d80ccff88b7a/DbL2RbXFRoMWA5CrOaGB8.png
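1. new input → compute and store the initial K/V
2. first generation → generate the first token
3. next step → reuse the stored K/V and append the K/V of the newest token
4. compute attention and generate the next token
5. append the generated token to the input, go back to 3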
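The loop above can be sketched as a toy single-head attention in plain Python. This is only an illustration under stated assumptions, not how any particular library implements it: the dimension `D`, the random weights `W_q`/`W_k`/`W_v`, the `KVCache` class, and the `generate` helper are all hypothetical names, and the attention output is fed back directly as the next token's embedding (a real model would go through an LM head and sampling).

```python
import math
import random

D = 4  # hypothetical head dimension, chosen only for this toy example

def make_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

rng = random.Random(0)
W_q, W_k, W_v = (make_matrix(rng, D, D) for _ in range(3))

class KVCache:
    """Keeps the K and V vectors of every token seen so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        # Only the newest token is projected; older K/V are reused as-is.
        self.keys.append(matvec(W_k, x))
        self.values.append(matvec(W_v, x))
        return attend(matvec(W_q, x), self.keys, self.values)

def generate(prompt, n_new, use_cache):
    """Return (generated vectors, number of K/V projections computed)."""
    tokens = list(prompt)
    cache = KVCache()
    projections = 0
    if use_cache:
        # Step 1: compute and store K/V for the prompt (all but the last
        # token, which is handled by the first generation step below).
        for x in tokens[:-1]:
            cache.step(x)
            projections += 1
    outputs = []
    for _ in range(n_new):
        if use_cache:
            out = cache.step(tokens[-1])  # steps 3-4: reuse + append
            projections += 1
        else:
            # No cache: K and V are recomputed for every token, every step.
            keys = [matvec(W_k, x) for x in tokens]
            values = [matvec(W_v, x) for x in tokens]
            out = attend(matvec(W_q, tokens[-1]), keys, values)
            projections += len(tokens)
        outputs.append(out)
        tokens.append(out)  # step 5: the generated token becomes input
    return outputs, projections

prompt = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(3)]
with_cache, n_cached = generate(prompt, 4, use_cache=True)
without_cache, n_plain = generate(prompt, 4, use_cache=False)
print("K/V projections with cache:", n_cached)
print("K/V projections without cache:", n_plain)
```

Both runs produce the same outputs; the only difference is how many K/V projections are computed. Without the cache that count grows with the sequence length at every step, while with the cache each token is projected once.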
without kv caching
https://cdn-uploads.huggingface.co/production/uploads/6527e89a8808d80ccff88b7a/PWI-EwqizVLInztmiI7Eo.mp4
with kv caching
https://cdn-uploads.huggingface.co/production/uploads/6527e89a8808d80ccff88b7a/HnzDhoJdAbJhSassYjzEy.mp4
Key value cache explained
FYI
https://jp.micron.com/about/blog/company/insights/from-buzzword-to-bottom-line-understanding-the-why-behind-kv-cache-in-ai
https://arpable.com/artificial-intelligence/llm-inference-optimization-kv-cache/
https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
https://huggingface.co/blog/not-lain/kv-caching
https://medium.com/@joaolages/kv-caching-explained-276520203249
https://neptune.ai/blog/transformers-key-value-caching
https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
https://huggingface.co/docs/transformers/main/en/generation_strategies#kv-caching