Supervised Fine-tuning Trainer
Supervised fine-tuning (or SFT for short) is a crucial step in RLHF
(the docs also link to a complete example script)
Quickstart
code:これだけ.py
from datasets import load_dataset
from trl import SFTTrainer
dataset = load_dataset("imdb", split="train")
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
)
trainer.train()
Load the imdb dataset and specify its text field.
SFTTrainer can also take a model ID (a string) instead of a model object.
It apparently calls AutoModelForCausalLM.from_pretrained internally.
Make sure to pass a correct value for max_seq_length as the default value will be set to min(tokenizer.model_max_length, 1024).
IMO: don't we need to pass the tokenizer in from outside?
Advanced usage
Train on completions only
You can use the DataCollatorForCompletionOnlyLM to train your model on the completions only: tokens before (and including) the response template are masked out of the loss.
Set packing=False (this collator only works without packing).
Create the data collator by passing in the tokenizer.
To instantiate that collator for instruction data, pass a response template and the tokenizer.
IMO: is this what instruction tuning means?
Make sure the pad_token_id is different from the eos_token_id; if they are the same, the model may not properly predict EOS (End of Sentence) tokens during generation.
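A self-contained sketch of how I think the pieces fit together (the model name, the "### Answer:" template, and the toy dataset are all placeholders, not from the docs verbatim):
code:completion_only_sketch.py
from datasets import Dataset
from transformers import AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Tiny in-memory dataset whose "text" column already follows a prompt/response layout
dataset = Dataset.from_dict({
    "text": [
        "### Question\nWhat is 2 + 2?\n### Answer:\n4",
        "### Question\nName a prime number.\n### Answer:\n7",
    ]
})

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Tokens up to and including the response template are masked from the loss,
# so the model is only trained to produce the answer part
collator = DataCollatorForCompletionOnlyLM("### Answer:", tokenizer=tokenizer)

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    data_collator=collator,
    packing=False,
    max_seq_length=512,
)
trainer.train()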
Using token_ids directly for response_template
Watch out: the same string ("### Assistant:") can be tokenized differently depending on the surrounding context.
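The docs work around this by encoding the template together with a bit of context and slicing off the context tokens; a sketch of that trick (the exact slice offset is tokenizer-specific, so treat [2:] as an assumption):
code:response_template_ids.py
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Encode the template with leading context so it is tokenized the same way as it
# appears inside the full training texts, then drop the context tokens.
response_template_with_context = "\n### Assistant:"
response_template_ids = tokenizer.encode(
    response_template_with_context, add_special_tokens=False
)[2:]  # how many tokens to drop depends on the tokenizer

collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)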
Add Special Tokens for Chat Format
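If I read the docs right, this is handled by TRL's setup_chat_format helper, which adds the chat special tokens, sets a chat template, and resizes the embeddings; a minimal sketch (model choice is just an example):
code:chat_format.py
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Adds chat special tokens to the tokenizer and resizes the model's token embeddings
model, tokenizer = setup_chat_format(model, tokenizer)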
Dataset format support
conversational format
has a messages field
each message has a role and a content
instruction format
has prompt and completion fields
If your dataset uses one of the above formats, you can directly pass it to the trainer without pre-processing.
apply_chat_template?
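Hypothetical records to illustrate the two formats (field values are made up); presumably the trainer applies the tokenizer's chat template to the conversational form internally:
code:dataset_formats.py
# Conversational format: a "messages" column, each message with role and content
conversational_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ]
}

# Instruction format: separate "prompt" and "completion" columns
instruction_example = {
    "prompt": "What is 2 + 2?",
    "completion": "4",
}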
Format your input prompts
About the formatting_func argument
You can pass a function.
Its argument seems to be a batch of rows rather than a single example (the doc's sample code indexes the i-th element).
code:formatting_func output format
### Question
{question}
### Answer:
{answer}
IMO: does this assume instruction tuning?
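A sketch of a formatting function in that shape, assuming the dataset has question and answer columns (the column names and toy data are illustrative):
code:formatting_func_sketch.py
from datasets import Dataset
from trl import SFTTrainer

dataset = Dataset.from_dict({
    "question": ["What is 2 + 2?", "Name a prime number."],
    "answer": ["4", "7"],
})

def formatting_prompts_func(example):
    # example is a batch (dict of lists), so build one formatted string per row
    output_texts = []
    for i in range(len(example["question"])):
        output_texts.append(
            f"### Question\n{example['question'][i]}\n### Answer:\n{example['answer'][i]}"
        )
    return output_texts

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    max_seq_length=512,
)
trainer.train()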
Packing dataset ( ConstantLengthDataset )
I initially read this as few-shot support.
packing=True
multiple short examples are packed in the same input sequence to increase training efficiency.
(Ah, so it means packing examples together!)
There is also an eval_packing argument.
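Enabling packing looks roughly like this, as I understand it; short examples get concatenated into constant-length sequences via ConstantLengthDataset:
code:packing.py
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    packing=True,  # multiple short examples are packed into one input sequence
)
trainer.train()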
Control over the pretrained model
The model_init_kwargs argument
You can directly pass the kwargs of the from_pretrained() method to the SFTTrainer
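For example (torch_dtype here is just one of the kwargs that from_pretrained accepts; the dataset is the quickstart one):
code:model_init_kwargs.py
import torch
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    # Forwarded to AutoModelForCausalLM.from_pretrained under the hood
    model_init_kwargs={"torch_dtype": torch.bfloat16},
)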
Training adapters
Using Flash Attention and Flash Attention 2
Flash-Attention 1
Flash Attention-2
Best practices
By default, SFTTrainer always pads sequences to the max_seq_length argument of the SFTTrainer.
For training adapters in 8bit, you might need to tweak the arguments of the prepare_model_for_kbit_training method from PEFT; hence we advise users to use the prepare_in_int8_kwargs field, or to create the PeftModel outside the SFTTrainer and pass it in.
For more memory-efficient training with adapters, you can load the base model in 8bit: simply add the load_in_8bit argument when creating the SFTTrainer, or create the base model in 8bit outside the trainer and pass it in (see the sketch after this list).
If you create a model outside the trainer, make sure not to pass the trainer any additional keyword arguments that relate to the from_pretrained() method.
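A sketch of the "create the 8bit base model outside and pass it in" route, combined with a LoRA adapter; the model name and LoRA settings are illustrative assumptions, not values from the docs:
code:adapters_8bit_sketch.py
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

# Create the base model in 8bit outside the trainer (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    load_in_8bit=True,
    device_map="auto",
)

# Illustrative LoRA settings; tune r / alpha / dropout for your setup
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Since the model is created outside, no from_pretrained kwargs are passed here
trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    peft_config=peft_config,
)
trainer.train()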
Datasets
The SFTTrainer also supports datasets.IterableDataset in addition to regular map-style datasets.
This is useful if you are using large corpora that you do not want to save all to disk.
The data will be tokenized and processed on the fly, even when packing is enabled.
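A sketch of streaming a large corpus; with an IterableDataset the trainer can't infer the epoch length, so max_steps has to be set (the output_dir and step count are placeholders):
code:streaming_sketch.py
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# streaming=True yields a datasets.IterableDataset; nothing is saved to disk up front
dataset = load_dataset("imdb", split="train", streaming=True)

trainer = SFTTrainer(
    "facebook/opt-350m",
    args=TrainingArguments(output_dir="sft-imdb", max_steps=500),
    train_dataset=dataset,
    dataset_text_field="text",
    packing=True,  # tokenization and packing happen on the fly
)
trainer.train()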
In the SFTTrainer, we support pre-tokenized datasets if they are datasets.Dataset or datasets.IterableDataset.
In other words, if such a dataset has a column of input_ids, no further processing (tokenization or packing) will be done, and the dataset will be used as-is.
This can be useful if you have pretokenized your dataset outside of this script and want to re-use it directly.
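So something like this should go through unchanged (the token ids below are arbitrary placeholders):
code:pretokenized_sketch.py
from datasets import Dataset
from trl import SFTTrainer

# Because the dataset already has an input_ids column, SFTTrainer should skip
# tokenization and packing and use it as-is
pretokenized = Dataset.from_dict({
    "input_ids": [
        [2, 100, 101, 102, 103],
        [2, 200, 201, 202],
    ]
})

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=pretokenized,
)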