LLM Compression and Acceleration
From "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression"
Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
A Simple and Effective Pruning Approach for Large Language Models
Trying vLLM on Google Colab
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
Trying "Groq" for high-speed AI inference
Accelerating Deep Models