Splits you can do with train_test_split
To make a split reproducible, pass the random_state argument
code:reproduce1.py
>>> from sklearn.model_selection import train_test_split
>>> train_test_split(range(10))
[[5, 9, 7, 2, 8, 4, 1], [6, 0, 3]]
>>> train_test_split(range(10))  # not reproducible
[[7, 5, 3, 0, 9, 4, 2], [6, 1, 8]]
>>> train_test_split(range(10), random_state=42)
[[0, 7, 2, 9, 4, 3, 6], [8, 1, 5]]
>>> train_test_split(range(10), random_state=42)  # reproducible
[[0, 7, 2, 9, 4, 3, 6], [8, 1, 5]]
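The same check can be written as a script with assertions, which is handy in a unit test. This is a minimal sketch; the helper name split_with_seed is made up for illustration and is not part of scikit-learn.

```python
from sklearn.model_selection import train_test_split


def split_with_seed(seed):
    # Hypothetical helper: split the integers 0..9 using a fixed seed.
    return train_test_split(list(range(10)), random_state=seed)


train_a, test_a = split_with_seed(42)
train_b, test_b = split_with_seed(42)

# The same seed gives the same split on every call.
assert train_a == train_b
assert test_a == test_b

# With the default test_size (0.25), 3 of the 10 items go to the test set.
print(len(train_a), len(test_a))  # 7 3
```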
Passing y (the class labels) to the stratify argument preserves the class proportions in both the training and test sets
code:reproduce2.py
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, shuffle=True, stratify=iris.target, random_state=42)
>>> from collections import Counter
>>> Counter(y_test)
Counter({0: 10, 2: 10, 1: 10})
>>> Counter(y_train)
Counter({0: 40, 2: 40, 1: 40})
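To see what stratify changes, it may help to split the same labels with and without it. A minimal sketch follows; the seed 0 is an arbitrary choice, and the unstratified counts depend on the shuffle, so no expected output is given for them.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Without stratify: the class counts in the test set depend on the shuffle.
_, y_test_plain = train_test_split(iris.target, test_size=0.2, random_state=0)

# With stratify: each of the three classes (50 samples each) contributes 20%.
_, y_test_strat = train_test_split(
    iris.target, test_size=0.2, stratify=iris.target, random_state=0
)

plain_counts = Counter(y_test_plain.tolist())
strat_counts = Counter(y_test_strat.tolist())

print("plain:", sorted(plain_counts.items()))
print("stratified:", sorted(strat_counts.items()))  # [(0, 10), (1, 10), (2, 10)]
```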
You can also pass y to stratify only, without splitting y itself
Useful when you only want the indices back
(I used this as a custom cv splitter)
code:reproduce3.py
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> from sklearn.model_selection import train_test_split
>>> idx_train1, idx_test1 = train_test_split(range(len(iris.data)), test_size=0.2, shuffle=True, stratify=iris.target, random_state=42)
>>> from collections import Counter
>>> Counter(iris.target[idx_test1])
Counter({0: 10, 2: 10, 1: 10})
>>> # Run the same split again to check that it reproduces
>>> idx_train2, idx_test2 = train_test_split(range(len(iris.data)), test_size=0.2, shuffle=True, stratify=iris.target, random_state=42)
>>> Counter(iris.target[idx_test2])
Counter({0: 10, 2: 10, 1: 10})
>>> import numpy as np
>>> np.testing.assert_array_equal(idx_test1, idx_test2)
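One way to use the index-only split as a custom cv splitter is to wrap the pair in a list and pass it as cv, since scikit-learn accepts an iterable of (train indices, test indices) pairs there. A minimal sketch, assuming cross_val_score and LogisticRegression as stand-ins; the original note does not say which estimator or search routine was used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

iris = load_iris()

# Split the row indices only; y is used just for stratification.
idx_train, idx_test = train_test_split(
    np.arange(len(iris.data)),
    test_size=0.2,
    stratify=iris.target,
    random_state=42,
)

# A one-element list of (train, test) index pairs acts as a single-fold cv.
cv = [(idx_train, idx_test)]

# The estimator choice here is an assumption made for illustration.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), iris.data, iris.target, cv=cv
)
print(scores.shape)  # (1,)
```

Because the split is fixed by random_state, every model evaluated with this cv sees the same train and test rows.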