SFTTrainer implementation (v0.8)

Apparently this changed in v0.9.0 (TODO: the v0.8 code did not run)

max_seq_length (Optional[int]):

The maximum sequence length to use for the ConstantLengthDataset and for automatically creating the Dataset. Defaults to 512.

Inherits from transformers.Trainer

The trainer takes care of properly initializing the PeftModel in case a user passes a PeftConfig object.

train() is transformers.Trainer's train()

Following the warnings may make the flow easier to grasp

You passed a model_id to the SFTTrainer. This will automatically create an AutoModelForCausalLM or a PeftModel (if you passed a peft_config) for you.

(When the model object itself is not passed in from outside) it is loaded with AutoModelForCausalLM

Converted to a PeftModel (when a peft_config is passed)

If packing=False and no data collator was passed in, DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) is used (the CausalLM setting!)
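
To make the mlm=False behavior concrete, here is a minimal pure-Python sketch of what such a collator produces: the batch is padded, and labels are a copy of input_ids with pad positions set to -100. The function name collate_causal_lm and the right-side padding are assumptions for illustration; the real collator works on tensors and pads through the tokenizer.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def collate_causal_lm(batch: List[List[int]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Pad a batch to its longest sequence and build causal-LM labels (hypothetical helper)."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad  # assumption: pad on the right
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels are a copy of the inputs; every token equal to the pad id is masked.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])
```

Masking is by token id, so if the pad token is the same as the EOS token, genuine EOS tokens are masked out of the loss as well. The shift by one position for next-token prediction happens inside the model, not in the collator.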

The tokenizer is also initialized to match the model (when none is passed in)

The dataset is prepared by the _prepare_dataset method

packing=False (the default in __init__)

max_seq_length is passed to the tokenizer

return_overflowing_tokens

The dataset with that applied is returned (a tokenize function is defined and applied with map)
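
The "define a tokenize function, apply it with map" pattern can be sketched without any library. Everything here is a stand-in: the whitespace tokenizer, the "text" field name, and map_dataset are hypothetical, and only the shape of the flow (truncate to max_seq_length, return input_ids and attention_mask) follows the notes above.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]
Tokenized = Dict[str, List[int]]


def make_tokenize(vocab: Dict[str, int], max_seq_length: int) -> Callable[[Example], Tokenized]:
    """Build a toy tokenize function that truncates to max_seq_length."""

    def tokenize(example: Example) -> Tokenized:
        # Toy tokenizer: one id per whitespace-separated word, assigned on first sight.
        ids = [vocab.setdefault(word, len(vocab)) for word in example["text"].split()]
        ids = ids[:max_seq_length]  # truncation to the maximum sequence length
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}

    return tokenize


def map_dataset(dataset: List[Example], fn: Callable[[Example], Tokenized]) -> List[Tokenized]:
    """Stand-in for a dataset map call; the original text column is dropped."""
    return [fn(example) for example in dataset]


tokenize = make_tokenize(vocab={}, max_seq_length=3)
tokenized = map_dataset([{"text": "a b c d e"}, {"text": "a b"}], tokenize)
print(tokenized)
```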

packing=True
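
For packing, the ConstantLengthDataset mentioned in the max_seq_length docstring comes into play. As far as I understand it, the idea is to concatenate tokenized examples with a separator token and cut the stream into fixed-length chunks. The sketch below shows only that idea; pack_constant_length and its parameters are hypothetical names, and dropping the incomplete tail chunk is my assumption about the behavior.

```python
from typing import Iterable, Iterator, List


def pack_constant_length(
    tokenized: Iterable[List[int]], seq_length: int, concat_token_id: int
) -> Iterator[List[int]]:
    """Concatenate examples with a separator and yield chunks of exactly seq_length tokens."""
    buffer: List[int] = []
    for ids in tokenized:
        # Append a separator (typically EOS) after every example.
        buffer.extend(ids + [concat_token_id])
    for start in range(0, len(buffer), seq_length):
        chunk = buffer[start:start + seq_length]
        if len(chunk) == seq_length:  # assumption: the incomplete tail is dropped
            yield chunk


chunks = list(pack_constant_length([[1, 2, 3], [4, 5], [6]], seq_length=4, concat_token_id=0))
print(chunks)
```

With packing, no padding is needed because every chunk has the same length, but one chunk can contain pieces of several examples.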

IMO: so to customize this dataset preparation, you tokenize beforehand and pass the result in! (If the dataset already has input_ids, nothing is done to it.)
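
The pass-through described in this note amounts to a check on the column names. A minimal sketch, assuming a list of dicts in place of a real dataset (prepare_dataset here is a hypothetical stand-in, not the library's method):

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def prepare_dataset(dataset: List[Row], tokenize: Callable[[Row], Row]) -> List[Row]:
    """Return pre-tokenized data untouched; otherwise tokenize every row."""
    if dataset and "input_ids" in dataset[0]:
        return dataset  # already tokenized by the caller: nothing to do
    return [tokenize(row) for row in dataset]


def custom_tokenize(row: Row) -> Row:
    # Placeholder for whatever custom tokenization you want full control over.
    return {"input_ids": [ord(ch) for ch in str(row["text"])]}


pretokenized = [custom_tokenize({"text": "hi"})]
same = prepare_dataset(pretokenized, tokenize=custom_tokenize)
print(same is pretokenized)
```

Tokenizing up front this way keeps truncation, special tokens, and chunking decisions in your own code rather than in the trainer's defaults.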