Tinker - pokutuna

Tinker

Tinker - Thinking Machines Lab

A service that lets you fully outsource LLM training infrastructure

You can produce a LoRA adapter by writing just forward_backward, optim_step, sample, and save_state
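As a rough picture of what a loop built from those four primitives looks like, here is a toy sketch. `ToyTrainingClient` is a hypothetical local stand-in, not the Tinker client: its constructor, argument names, and return values are all assumptions made for illustration, and it fits a single scalar weight instead of a LoRA adapter. On Tinker the primitives would run on the remote infrastructure and only the loop would be yours.

```python
# Hypothetical stand-in, NOT the real Tinker client: it only mimics the
# shape of a loop built from the four primitives named above.
class ToyTrainingClient:
    """Fits y = w * x with one scalar weight to illustrate the call order."""

    def __init__(self):
        self.w = 0.0      # the only trainable "weight"
        self.grad = 0.0   # gradient accumulated by forward_backward
        self.steps = 0    # stands in for optimizer state

    def forward_backward(self, batch):
        # Mean squared error over the batch; accumulates d(loss)/dw.
        n = len(batch)
        loss = sum((self.w * x - y) ** 2 for x, y in batch) / n
        self.grad += sum(2 * (self.w * x - y) * x for x, y in batch) / n
        return loss

    def optim_step(self, learning_rate):
        # Plain SGD update, then clear the accumulated gradient.
        self.w -= learning_rate * self.grad
        self.grad = 0.0
        self.steps += 1

    def sample(self, x):
        # "Inference" with the current weight.
        return self.w * x

    def save_state(self):
        # Weights plus optimizer state, enough to resume later.
        return {"w": self.w, "steps": self.steps}


client = ToyTrainingClient()
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # targets follow y = 2x
for _ in range(50):
    loss = client.forward_backward(batch)
    client.optim_step(learning_rate=0.05)

checkpoint = client.save_state()
prediction = client.sample(4.0)
```

The point is only the division of labor: the user code is the `for` loop and the data, while everything inside the four methods is what the service takes over.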

Looks easy to use, as long as you are fine with it being LoRA-only and restricted to a fixed set of models

Models & Pricing - Tinker Documentation

One concern, though, is that supported models get deprecated over time

Usage

Get Started - Tinker Documentation

Cookbook: Get Started - Tinker Documentation

thinking-machines-lab/tinker-cookbook: Post-training with Tinker

Build LoRA Adapter - Tinker Documentation

Calling tinker_cookbook.weights.build_lora_adapter converts the adapter into a form that runs with Hugging Face transformers / peft

build_hf_model - Tinker Documentation

tinker-cookbook/tinker_cookbook/weights/README.md at 839c27d21b66d3027c6866238e6e3f02406c90ed · thinking-machines-lab/tinker-cookbook

In its raw form on Tinker, can the adapter only be run on Tinker's infrastructure?

Cost

Training

Models & Pricing - Tinker Documentation

Billed on the total number of tokens that go through forward/backward?

For models that fit on a single GPU such as an H100, Runpod is cheaper, but the price feels justified by the effort it saves; it is clearly easier

Since billing is per token, Tinker has an incentive to process things quickly, and it is nice that it actually is fast
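A back-of-the-envelope way to compare the two pricing models, assuming billing really is on the tokens passed through forward/backward (the note above is unsure about that). Every number here is a made-up placeholder, not a real Tinker or Runpod price; substitute the values from the pricing pages.

```python
def token_billed_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost when the provider charges per training token."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


def gpu_rental_cost(hours: float, usd_per_hour: float) -> float:
    """Cost when you rent a GPU by the hour, including idle and setup time."""
    return hours * usd_per_hour


# Placeholder workload: 20M tokens per epoch, 3 epochs.
total_tokens = 20_000_000 * 3

# Placeholder prices, chosen only to make the arithmetic easy to follow.
tinker_cost = token_billed_cost(total_tokens, usd_per_million_tokens=0.5)
rental_cost = gpu_rental_cost(hours=6, usd_per_hour=2.5)

print(f"token-billed: ${tinker_cost:.2f}, GPU rental: ${rental_cost:.2f}")
```

With hourly rental, the hours include debugging, environment setup, and idle time, which is where the saved effort shows up in the comparison.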

Storage

LoRA adapters kept in Checkpoints cost $0.10/GB/mo

A TTL can be set, so it seems wise to always set one
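To see why a TTL matters, here is a small estimate using the $0.10/GB/mo rate from the note above. The adapter size, checkpoint count, and durations are placeholders, and the sketch assumes storage is pro-rated linearly over time, which is an assumption rather than something the notes confirm.

```python
USD_PER_GB_MONTH = 0.10  # storage rate quoted in the note above


def storage_cost(adapter_gb: float, num_checkpoints: int, months: float) -> float:
    """Cost of keeping `num_checkpoints` adapters of `adapter_gb` each."""
    return USD_PER_GB_MONTH * adapter_gb * num_checkpoints * months


# Placeholder scenario: 20 checkpoints of 0.5 GB each.
forgotten = storage_cost(adapter_gb=0.5, num_checkpoints=20, months=6)
with_ttl = storage_cost(adapter_gb=0.5, num_checkpoints=20, months=0.25)

print(f"left for 6 months: ${forgotten:.2f}, expired after ~1 week: ${with_ttl:.2f}")
```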

Pause / Resume

Training can be paused and resumed, and you can branch off a different training run from an intermediate checkpoint

Weights Management - Tinker Documentation

save_weights_for_sampler: weights

save_state: weights & optimizer

They seem to have strong opinions about LoRA

LoRA Without Regret - Thinking Machines Lab

LoRA Primer - Tinker Documentation

It is well organized, and it is nice that it also explains the downsides

For supervised fine-tuning on small-to-medium-sized instruction-tuning and reasoning datasets, LoRA performs the same as full fine-tuning.

For datasets that exceed LoRA capacity, LoRA underperforms FullFT. Rather than the loss reaching a distinct floor that it can’t go below, LoRA results in worse training efficiency that depends on the relationship between model capacity to dataset size.

In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning — it pays a larger penalty in loss as batch size increases beyond some point. This penalty is not mitigated by increasing the LoRA rank; it is a property of the product-of-matrices parametrization, which has different training dynamics than optimizing the original weight matrix.

Even in small data settings, LoRA performs better when applied to all weight matrices, especially MLP and MoE layers. Attention-only LoRA underperforms even when we match the number of trainable parameters by using higher rank for attention-only LoRA.

LoRA performs equivalently to FullFT for reinforcement learning even with small ranks. We find that RL requires very low capacity, a result we anticipated based on information-theoretical arguments.

Learning rate multiplier calculator

LoRA requires a much larger LR than full fine-tuning---typically 20-100x larger, depending on model size.

thinking-machines-lab/tinker-cookbook@main - tinker_cookbook/hyperparam_utils.py#L76

It is just fixed at 10; apparently these days simply using 10x the full fine-tuning LR works well empirically, regardless of rank or alpha

code:param.py

base_lr = 5e-5
lora_multiplier = 10.0
lr = base_lr * lora_multiplier
# Scale by the hidden dimension size (`hidden`); the lookup table for this
# correction lives in the cookbook code
lr = lr * (2000 / hidden) ** exp
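The snippet above leaves `hidden` and `exp` undefined, so here is a runnable version of the same arithmetic. The function name and signature are hypothetical, not the cookbook's. The exponent is taken as an argument because the notes only say that a table exists in the code, and the `0.5` used below is a placeholder, not a value from that table.

```python
def lora_learning_rate(
    hidden_size: int,
    exponent: float,
    base_lr: float = 5e-5,
    lora_multiplier: float = 10.0,
) -> float:
    """Full fine-tuning LR times a fixed LoRA multiplier, scaled by hidden size."""
    lr = base_lr * lora_multiplier
    # A model with hidden size 2000 gets no correction; wider models get a
    # smaller LR whenever the exponent is positive.
    return lr * (2000 / hidden_size) ** exponent


# Placeholder exponent; look up the real one in the cookbook code.
lr_reference = lora_learning_rate(hidden_size=2000, exponent=0.5)
lr_wide = lora_learning_rate(hidden_size=8000, exponent=0.5)

print(f"hidden=2000: {lr_reference:.2e}, hidden=8000: {lr_wide:.2e}")
```

Note that rank and alpha do not appear anywhere in the formula, which matches the observation that the multiplier is a fixed 10.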