LLM Compression and Acceleration
Acceleration
NPU
Neural Architecture Search (NAS)
From "SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression"
https://note.com/daichi_mu/n/n32b6bd0ab28d
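A rough sketch of the SpQR idea the article covers: keep a small fraction of outlier weights in fp16 (stored sparsely) and quantize the rest to a few bits. The per-tensor quantization and threshold choice below are simplifications for illustration, not the paper's grouped scheme.

```python
# Conceptual sketch of an SpQR-like decomposition: sparse fp16 outliers + low-bit base.
# Illustration only; not the paper's implementation (which uses grouped quantization).
import torch

def spqr_like_decompose(w: torch.Tensor, outlier_frac: float = 0.01, bits: int = 3):
    # pick the largest-magnitude weights as outliers
    k = max(1, int(w.numel() * outlier_frac))
    thresh = w.abs().flatten().kthvalue(w.numel() - k).values
    outlier_mask = w.abs() > thresh

    # quantize the non-outlier part with simple symmetric uniform quantization
    base = torch.where(outlier_mask, torch.zeros_like(w), w)
    qmax = 2 ** (bits - 1) - 1
    scale = base.abs().max() / qmax
    q = torch.clamp((base / scale).round(), -qmax, qmax)      # low-bit integers
    dequant = q * scale + torch.where(outlier_mask, w, torch.zeros_like(w))
    return dequant, outlier_mask

w = torch.randn(256, 256)
w_hat, mask = spqr_like_decompose(w)
print("relative reconstruction error:", ((w - w_hat).norm() / w.norm()).item())
```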
Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
https://arxiv.org/abs/2301.12017
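The paper studies INT4 weight quantization for Transformer inference. As a loose illustration only (not the paper's INT4 GEMM kernels), 4-bit weight loading can be done through bitsandbytes in Transformers; the model name below is just a placeholder.

```python
# Minimal 4-bit weight loading via bitsandbytes / Transformers.
# Not the paper's INT4 kernel; model name and settings are example assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # example model, swap for your own

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bit
    bnb_4bit_quant_type="nf4",             # NF4 (or "fp4") quantization format
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Quantization reduces memory because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```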
A Simple and Effective Pruning Approach for Large Language Models
https://arxiv.org/pdf/2306.11695.pdf
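The paper (Wanda) scores each weight by |W_ij| * ||X_j||_2, the weight magnitude times the L2 norm of the corresponding input activation, and prunes the lowest-scoring weights within each output row. A minimal PyTorch sketch of that metric, with illustrative shapes rather than the authors' code:

```python
# Sketch of a Wanda-style pruning metric: score_ij = |W_ij| * ||X_j||_2,
# pruned row-wise to a target sparsity. Shapes and names are illustrative.
import torch

def wanda_prune(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """weight: (out_features, in_features); act_norm: (in_features,) L2 norm of inputs."""
    score = weight.abs() * act_norm.unsqueeze(0)            # importance per weight
    k = int(weight.shape[1] * sparsity)                     # weights to drop per output row
    _, prune_idx = torch.topk(score, k, dim=1, largest=False)  # lowest-score weights per row
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)                      # zero out low-importance weights
    return weight * mask

W = torch.randn(8, 16)
x_norm = torch.randn(64, 16).norm(dim=0)                    # per-input-feature activation norm
W_pruned = wanda_prune(W, x_norm, sparsity=0.5)
print((W_pruned == 0).float().mean())                       # ~0.5 sparsity
```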
Trying vLLM on Google Colab
https://note.com/npaka/n/ne6fe8ae8aca0?sub_rt=share_h
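A minimal vLLM offline-inference snippet along the lines of what such a Colab walkthrough runs; the model name is an example chosen to fit a small GPU.

```python
# Minimal vLLM offline inference. Model name is an example that fits a T4-class GPU.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What makes LLM inference slow?"], params)
for out in outputs:
    print(out.outputs[0].text)
```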
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
https://arxiv.org/abs/2312.15234
Trying Groq, a fast AI inference service
https://zenn.dev/kun432/scraps/0f70f232a5f27b
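A minimal call to Groq's OpenAI-compatible chat API via the groq Python SDK; the model id below is an example and an API key is required.

```python
# Minimal Groq chat completion. Requires GROQ_API_KEY in the environment;
# the model id is an example and may change over time.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment
resp = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # example model id
    messages=[{"role": "user", "content": "Why is Groq fast for LLM inference?"}],
)
print(resp.choices[0].message.content)
```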
Speeding Up Deep Models
https://speakerdeck.com/joisino/shen-ceng-moderunogao-su-hua
LLM Inference Acceleration Techniques with vLLM
https://acro-engineer.hatenablog.com/entry/2024/12/24/120000
Running batch inference against vLLM's OpenAI API interface server
https://note.com/kan_hatakeyama/n/n172c65744e33
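A sketch of the batching pattern the article describes: fire many requests at vLLM's OpenAI-compatible server concurrently and let the engine's continuous batching handle them. The endpoint, port, and model name below are assumptions.

```python
# Send many concurrent requests to vLLM's OpenAI-compatible server; vLLM batches them internally.
# Start the server first, e.g.:  vllm serve facebook/opt-125m --port 8000
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

async def ask(prompt: str) -> str:
    resp = await client.completions.create(
        model="facebook/opt-125m", prompt=prompt, max_tokens=32
    )
    return resp.choices[0].text

async def main():
    prompts = [f"Question {i}: what is a KV cache?" for i in range(16)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))  # fired concurrently
    for a in answers:
        print(a.strip())

asyncio.run(main())
```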
vLLM V1: A Major Upgrade to vLLM's Core Architecture
https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
blueqat
https://blueqat.com/yuichiro_minato2/140c0fa2-6c57-4fc0-aad3-8578ecf51bdf
NVFP4 quantization and inference of LLMs on an RTX 5090 with TensorRT-LLM
https://zenn.dev/aratako_lm/articles/a573b2ccc4d795