llama3-cake
The goal of the project is to run big (70B+) models by repurposing consumer hardware into a heterogeneous cluster of iOS, macOS, Linux, and Windows devices.
Experimental code for clustering iOS and macOS devices on a LAN to run a 70B model!
https://scrapbox.io/files/66952b39c69dae001de1a3c7.png
The cake author also seems to have drawn inspiration from exolabs.
After the hype around @mo_baioumy/@exolabs_ running a distributed LLM on Apple devices using @__tinygrad__, I couldn't wait for the code to be released, so I developed llama3-cake, a 100% Rust implementation based on the Candle framework that lets you run inference of big LLMs by distributing their transformer blocks across multiple machines. It already supports CUDA and Apple Metal acceleration and is not limited to MLX models. In the picture I'm sharding an 8B model between a Linux server and an M1 Mac. Release soon, including support for iOS / Android / ARM!
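The core idea of distributing transformer blocks can be sketched in plain Rust. This is a hypothetical illustration, not the actual llama3-cake API: `partition_blocks` and the worker names are made up, and a real system would also stream hidden states between workers over the network.

```rust
use std::ops::Range;

// Hypothetical sketch of a cake-style sharding plan: assign each worker
// a contiguous range of transformer-block indices, spreading any
// remainder over the first workers.
fn partition_blocks(num_blocks: usize, workers: &[&str]) -> Vec<(String, Range<usize>)> {
    let n = workers.len();
    let base = num_blocks / n;
    let extra = num_blocks % n;
    let mut start = 0;
    workers
        .iter()
        .enumerate()
        .map(|(i, w)| {
            // The first `extra` workers take one additional block.
            let len = base + if i < extra { 1 } else { 0 };
            let range = start..start + len;
            start += len;
            (w.to_string(), range)
        })
        .collect()
}

fn main() {
    // A Llama 3 8B model has 32 transformer blocks; shard them between
    // a Linux server and an M1 Mac, as in the author's demo.
    for (worker, range) in partition_blocks(32, &["linux-server", "m1-mac"]) {
        println!("{worker}: blocks {}..{}", range.start, range.end);
    }
}
```

During inference, each worker would run only its assigned blocks and forward the activations to the worker holding the next range.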