Tinker
Tinker - Thinking Machines Lab
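A service that lets you hand off LLM training infrastructure entirely
Write only forward_backward, optim_step, sample, and save_state, and you get a LoRA adapter
Looks easy to use, as long as you can live with LoRA only and a limited set of supported models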
Models & Pricing - Tinker Documentation
One concern: supported models do get deprecated over time
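The shape of the loop built from those four calls can be sketched roughly as below. Note that `MockTrainingClient` is a hypothetical stand-in written for this note, not the real Tinker client: only the four method names come from the docs, while the arguments, return values, and the `train` helper are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MockTrainingClient:
    """Hypothetical stand-in for a remote training client (NOT the real Tinker API)."""
    step: int = 0
    pending_examples: int = 0
    saved: dict = field(default_factory=dict)

    def forward_backward(self, batch):
        # Pretend to compute a loss and accumulate gradients for this batch.
        self.pending_examples += len(batch)
        return 1.0 / (self.step + 1)  # fake loss that shrinks with each step

    def optim_step(self):
        # Pretend to apply the accumulated gradients, then clear them.
        self.pending_examples = 0
        self.step += 1

    def sample(self, prompt):
        # Pretend to generate text from the current weights.
        return f"{prompt} [sample at step {self.step}]"

    def save_state(self, name):
        # Record a checkpoint (a real service would store weights and optimizer state).
        self.saved[name] = self.step
        return name


def train(client, batches, save_every=2):
    """Run one pass over batches, checkpointing every save_every steps."""
    losses = []
    for i, batch in enumerate(batches, start=1):
        losses.append(client.forward_backward(batch))
        client.optim_step()
        if i % save_every == 0:
            client.save_state(f"step-{i}")
    return losses


client = MockTrainingClient()
losses = train(client, [["a", "b"], ["c"], ["d", "e"], ["f"]])
print(losses, client.saved, client.sample("hello"))
```

The point is only that the user-side code stays a plain Python loop while the heavy lifting happens elsewhere.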
Usage
Get Started - Tinker Documentation
Cookbook: Get Started - Tinker Documentation
thinking-machines-lab/tinker-cookbook: Post-training with Tinker
Build LoRA Adapter - Tinker Documentation
Calling tinker_cookbook.weights.build_lora_adapter converts the adapter into a form that runs with Hugging Face transformers / peft
build_hf_model - Tinker Documentation
tinker-cookbook/tinker_cookbook/weights/README.md at 839c27d21b66d3027c6866238e6e3f02406c90ed · thinking-machines-lab/tinker-cookbook
In its raw form on Tinker, can the adapter only be run on Tinker's infrastructure?
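Once an adapter has been converted to PEFT format, loading it locally would presumably look like the sketch below. It uses only the standard `transformers` and `peft` loading calls. The function name `load_lora_for_inference` is made up for this note, and it assumes `adapter_dir` is a directory that `PeftModel.from_pretrained` can read. The call to `build_lora_adapter` itself is not shown, since its arguments are not covered in these notes.

```python
def load_lora_for_inference(base_model_name: str, adapter_dir: str):
    """Load a base model and attach a PEFT-format LoRA adapter found in adapter_dir."""
    # Imported lazily so that defining this function does not require the libraries.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
    # Wrap the base model with the LoRA weights stored in adapter_dir.
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()
    return model, tokenizer
```

Both arguments are placeholders: the base model has to be the same one the adapter was trained against.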
Cost
Training
Models & Pricing - Tinker Documentation
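Is billing based on the total number of tokens passed through forward_backward?
For models that fit on a single H100 or similar, Runpod is cheaper, but the price seems worth it for the effort saved; it is clearly easier
Because billing is per token, Tinker has an incentive to process tokens quickly, and it is in fact fast, which is nice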
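A back-of-the-envelope estimate, assuming billing really is proportional to the total tokens processed in training (which is only a guess above). The function name and the per-million-token price are placeholders, not figures from the pricing page.

```python
def training_cost_usd(tokens_per_epoch: int, epochs: int, usd_per_million_tokens: float) -> float:
    """Estimate training cost under simple per-token billing."""
    total_tokens = tokens_per_epoch * epochs
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder numbers: 2M tokens per epoch, 3 epochs, a hypothetical $1.00 per 1M tokens.
cost = training_cost_usd(tokens_per_epoch=2_000_000, epochs=3, usd_per_million_tokens=1.0)
print(f"${cost:.2f}")
```

Plug in the actual price for the chosen model from Models & Pricing to compare against an hourly GPU rental.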
Storage
LoRA adapters kept in Checkpoints cost $0.10/GB/mo
A TTL can be set, so it is probably best to always set one
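A quick sketch of why a TTL matters, using the $0.10/GB/mo rate above. The checkpoint size and retention periods are made-up examples, and billing is assumed to scale linearly with size and time.

```python
STORAGE_USD_PER_GB_MONTH = 0.10  # rate quoted in the notes above


def storage_cost_usd(size_gb: float, months: float) -> float:
    """Storage cost, assuming linear billing in both size and time."""
    return size_gb * months * STORAGE_USD_PER_GB_MONTH


# Hypothetical example: 20 GB of accumulated checkpoints.
with_ttl = storage_cost_usd(size_gb=20, months=1)      # expired after one month
forgotten = storage_cost_usd(size_gb=20, months=12)    # left around for a year
print(f"with TTL: ${with_ttl:.2f}, forgotten: ${forgotten:.2f}")
```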
Pause / resume
Training can be paused and resumed, and a different training run can be started from an intermediate checkpoint
Weights Management - Tinker Documentation
save_weights_for_sampler: saves the weights only
save_state: saves the weights and the optimizer state
Thinking Machines seems to have strong opinions on LoRA
LoRA Without Regret - Thinking Machines Lab
LoRA Primer - Tinker Documentation
Well organized, and it is nice that it also explains the downsides
For supervised fine-tuning on small-to-medium-sized instruction-tuning and reasoning datasets, LoRA performs the same as full fine-tuning.
For datasets that exceed LoRA capacity, LoRA underperforms FullFT. Rather than the loss reaching a distinct floor that it can’t go below, LoRA results in worse training efficiency that depends on the relationship between model capacity to dataset size.
In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning — it pays a larger penalty in loss as batch size increases beyond some point. This penalty is not mitigated by increasing the LoRA rank; it is a property of the product-of-matrices parametrization, which has different training dynamics than optimizing the original weight matrix.
Even in small data settings, LoRA performs better when applied to all weight matrices, especially MLP and MoE layers. Attention-only LoRA underperforms even when we match the number of trainable parameters by using higher rank for attention-only LoRA.
LoRA performs equivalently to FullFT for reinforcement learning even with small ranks. We find that RL requires very low capacity, a result we anticipated based on information-theoretical arguments.
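To get a feel for the "match the number of trainable parameters" comparison in the attention-only point, here is a small parameter count. LoRA adds rank × (d_in + d_out) parameters per adapted matrix. The layer dimensions are illustrative assumptions (square q/k/v/o projections and a three-matrix MLP), not taken from any specific Tinker-supported model.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one d_in x d_out matrix (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative dimensions only (assumptions for this sketch).
HIDDEN, MLP_DIM, RANK = 4096, 11008, 32

# Per transformer layer: q, k, v, o projections, assumed HIDDEN x HIDDEN.
attention = 4 * lora_params(HIDDEN, HIDDEN, RANK)
# Per transformer layer: gate and up (HIDDEN -> MLP_DIM), down (MLP_DIM -> HIDDEN).
mlp = 2 * lora_params(HIDDEN, MLP_DIM, RANK) + lora_params(MLP_DIM, HIDDEN, RANK)

# Rank an attention-only adapter would need to match the all-matrices count.
matching_rank = (attention + mlp) / (4 * (HIDDEN + HIDDEN))

print(attention, mlp, matching_rank)
```

Under these assumptions the MLP matrices carry more than half of the adapter's parameters, and attention-only LoRA needs more than double the rank to reach the same count.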
Learning-rate multiplier calculator
LoRA requires a much larger LR than full fine-tuning---typically 20-100x larger, depending on model size.
thinking-machines-lab/tinker-cookbook@main - tinker_cookbook/hyperparam_utils.py#L76
So it is just hard-coded to 10. Apparently, for now, simply using 10x the full fine-tuning LR works best empirically, regardless of rank or alpha
code:param.py
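def lora_lr(hidden: int, exp: float) -> float:
    base_lr = 5e-5  # baseline LR for full fine-tuning
    lora_multiplier = 10.0  # fixed; does not depend on rank or alpha
    lr = base_lr * lora_multiplier
    # Correct for the hidden dimension; the lookup table for this is in the cookbook code
    return lr * (2000 / hidden) ** exp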
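Written out as a formula, the snippet above computes the following, where d_hidden is the model's hidden size and e is the exponent looked up in the cookbook code (the symbol names are mine):

```latex
\mathrm{lr}
  = 5 \times 10^{-5} \cdot 10 \cdot \left( \frac{2000}{d_{\text{hidden}}} \right)^{e}
  = 5 \times 10^{-4} \left( \frac{2000}{d_{\text{hidden}}} \right)^{e}
```

At d_hidden = 2000 the correction factor is 1, giving lr = 5e-4; for a positive exponent, larger hidden sizes get a smaller learning rate.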
#LLM