LLaMA 2 runs on a Raspberry Pi 4 Model B 8GB
Specs
 Raspberry Pi 4 Model B
 Memory: 8GB / 4GB (both variants tested below)
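The post jumps straight into ~/llama.cpp, so the clone and build are assumed to have already happened. None of these commands appear in the original; this is just a sketch of the usual prep on Raspberry Pi OS (at the time of this post, llama.cpp built with plain make):
code:bash
# check which board / RAM variant this is
cat /proc/device-tree/model
free -h
# one-time llama.cpp setup (assumed done before the steps below)
sudo apt install -y git build-essential
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp && make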
code:bash
cd ~/llama.cpp/models
# download llama-2-7b-chat.ggmlv3.q4_K_M.bin into this directory
# (the download command itself was not captured in the original)
code:bash
cd ~/llama.cpp
code:bash
./main -m ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -n 16 -p "In a nutshell, The capital of Japan is"
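For reference: -m is the model path, -n 16 caps generation at 16 tokens, and -p is the prompt. main also takes -t for the thread count (the log below shows n_threads = 4 / 4); an equivalent run with the threads made explicit, shown as an untested variant:
code:bash
# same run with the thread count pinned to the Pi's four cores
./main -m ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin -t 4 -n 16 -p "In a nutshell, The capital of Japan is"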
8GB
It totally runs!!
https://gyazo.com/5880a272e3a13fec3157eb77e379bdf5
CPU usage pegged at max, lol
Surprisingly, memory usage isn't that high
Inference finished in about 30 seconds
https://gyazo.com/8cc05051778aaf5834e5f598cfb70c3f
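The modest memory use is consistent with llama.cpp mmap-ing the model file by default, so the weights mostly sit in the page cache rather than in the process's own RSS. The screenshots look like htop; to watch the load yourself (assuming htop isn't preinstalled on your image):
code:bash
# in a second terminal while ./main is running
sudo apt install -y htop
htop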
code:result.txt
main: build = 806 (061f5f8)
main: seed = 1690087001
llama.cpp: loading model from ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 5683.32 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 256.00 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 16, n_keep = 0
In a nutshell, The capital of Japan is Tokyo. The city is located on the eastern coast of Honshu, the
llama_print_timings: load time = 1849.24 ms
llama_print_timings: sample time = 33.23 ms / 16 runs ( 2.08 ms per token, 481.49 tokens per second)
llama_print_timings: prompt eval time = 7446.25 ms / 11 tokens ( 676.93 ms per token, 1.48 tokens per second)
llama_print_timings: eval time = 18073.52 ms / 15 runs ( 1204.90 ms per token, 0.83 tokens per second)
llama_print_timings: total time = 25557.22 ms
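A quick sanity check against the log above: eval took 18.07 s for 15 tokens, i.e. about 0.83 tokens/s, matching llama_print_timings:
code:bash
# back-of-the-envelope from the 8GB log: 15 tokens / 18.073 s
awk 'BEGIN { printf "%.2f tokens/s\n", 15 / 18.073 }'   # prints 0.83 tokens/s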
4GB
1st try
Painfully slow
https://gyazo.com/e8d4e01c98814917db88c9dd9113e06b
Swap is in a bad way
Doesn't look feasible without more swap, so let's increase it (one way to do this is sketched below)
Bumped swap to 1GB
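The original doesn't show how the swap was resized; on Raspberry Pi OS the usual route is dphys-swapfile, roughly like this (a sketch, assuming dphys-swapfile manages swap on this image):
code:bash
# resize swap to 1GB via dphys-swapfile
sudo dphys-swapfile swapoff
sudo sed -i 's/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=1024/' /etc/dphys-swapfile
sudo dphys-swapfile setup
sudo dphys-swapfile swapon
free -h   # verify the new swap size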
It worked
https://gyazo.com/38c328ba8b420d15026526ae525cdee1
Nearly a full minute per generated token. Brutal
Going by the timings, that's roughly 1/30 to 1/40 of the 8GB speed (0.83 vs 0.02 tokens/s on eval, 25.6 s vs 772 s total; see the comparison after the log below). This is rough
Raising swap to 2GB changes nothing……
So it really does come down to main RAM: the log reports mem required = 4193.32 MB, which doesn't fit in 4GB alongside the OS, so weights get paged off the SD card on every token, and SD random-read speed, not swap size, becomes the bottleneck (a way to confirm this is sketched below)
(Video omitted; it wouldn't fit in 3 minutes)
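One way to confirm the box is thrashing rather than computing: watch the si/so (swap-in/swap-out) columns; values that stay non-zero mean pages are being shuttled to and from the SD card throughout the run:
code:bash
# run alongside ./main; si/so report KiB swapped in/out per second
vmstat 1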
code:result.txt
main: build = 880 (b9b7d94)
main: seed = 1690087083
llama.cpp: loading model from ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 15 (mostly Q4_K - Medium)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 4193.32 MB (+ 256.00 MB per state)
llama_new_context_with_model: kv self size = 256.00 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 16, n_keep = 0
In a nutshell, The capital of Japan is Tokyo. However, here are some other interesting facts about the city:
llama_print_timings: load time = 43312.50 ms
llama_print_timings: sample time = 98.87 ms / 16 runs ( 6.18 ms per token, 161.83 tokens per second)
llama_print_timings: prompt eval time = 59635.79 ms / 11 tokens ( 5421.44 ms per token, 0.18 tokens per second)
llama_print_timings: eval time = 711930.04 ms / 15 runs (47462.00 ms per token, 0.02 tokens per second)
llama_print_timings: total time = 771858.36 ms
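Putting the two logs side by side, per-token eval time goes from 1204.90 ms (8GB) to 47462.00 ms (4GB):
code:bash
# ratio of per-token eval times between the 4GB and 8GB runs
awk 'BEGIN { printf "%.0fx slower\n", 47462.00 / 1204.90 }'   # prints "39x slower"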
Conclusion
With 8GB it runs fast enough (~0.8 tokens/s here) that you can feel real potential
With 4GB it's not that it won't run at all, but it's rough going
Who knew the RAM difference alone would change things this much…………
I'm going to buy a few more 8GB Raspberry Pis……