Splits you can do with train_test_split

To make a split reproducible, pass the random_state argument.

code:reproduce1.py

>>> from sklearn.model_selection import train_test_split
>>> train_test_split(range(10))
[[5, 9, 7, 2, 8, 4, 1], [6, 0, 3]]
>>> train_test_split(range(10))  # not reproducible: a different result on each call
[[7, 5, 3, 0, 9, 4, 2], [6, 1, 8]]
>>> train_test_split(range(10), random_state=42)
[[0, 7, 2, 9, 4, 3, 6], [8, 1, 5]]
>>> train_test_split(range(10), random_state=42)  # reproducible: same result as above
[[0, 7, 2, 9, 4, 3, 6], [8, 1, 5]]
>>> train_test_split(range(10), random_state=42)
[[0, 7, 2, 9, 4, 3, 6], [8, 1, 5]]
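The same check can be written as assertions instead of comparing REPL output by eye. This is a minimal sketch; the variable names (data, train_a, and so on) are made up for illustration.

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# Same integer seed on both calls, so the shuffle is identical each time.
train_a, test_a = train_test_split(data, random_state=42)
train_b, test_b = train_test_split(data, random_state=42)

assert train_a == train_b
assert test_a == test_b

# With the default test_size (0.25), 10 samples split into 7 train and 3 test.
print(len(train_a), len(test_a))
```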

Passing y (the class labels) as the stratify argument keeps the class proportions the same in the train and test sets.

code:reproduce2.py

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, shuffle=True, stratify=iris.target, random_state=42)
>>> from collections import Counter
>>> Counter(y_test)
Counter({0: 10, 2: 10, 1: 10})
>>> Counter(y_train)
Counter({0: 40, 2: 40, 1: 40})
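To see what stratify changes, it may help to run the same split with and without it and compare the class counts in the test set. This is a rough sketch; the names y_test_plain, y_test_strat, plain_counts and strat_counts are introduced here only for the comparison.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Without stratify: class counts in the test set depend on the shuffle.
_, y_test_plain = train_test_split(iris.target, test_size=0.2, random_state=42)

# With stratify: each of the three classes keeps its 1/3 share.
_, y_test_strat = train_test_split(
    iris.target, test_size=0.2, stratify=iris.target, random_state=42
)

plain_counts = Counter(y_test_plain.tolist())
strat_counts = Counter(y_test_strat.tolist())
print("plain:     ", sorted(plain_counts.items()))
print("stratified:", sorted(strat_counts.items()))
```

Only the stratified counts are guaranteed to be balanced; the unstratified counts vary with the seed.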

You can also pass y only for stratification, without splitting y itself.

This is useful when you only want the indices back.

(I used this as a custom cv splitter.)

code:reproduce3.py

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> from sklearn.model_selection import train_test_split
>>> idx_train1, idx_test1 = train_test_split(range(len(iris.data)), test_size=0.2, shuffle=True, stratify=iris.target, random_state=42)
>>> from collections import Counter
>>> Counter(iris.target[idx_test1])
Counter({0: 10, 2: 10, 1: 10})
>>> # reproduce the same split
>>> idx_train2, idx_test2 = train_test_split(range(len(iris.data)), test_size=0.2, shuffle=True, stratify=iris.target, random_state=42)
>>> Counter(iris.target[idx_test2])
Counter({0: 10, 2: 10, 1: 10})

>>> import numpy as np
>>> np.testing.assert_array_equal(idx_test1, np.array([0,1,2]))
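One way the index-only split can serve as a custom cv splitter is sketched below. The note does not say how it was actually wired up, so this is an assumption: it relies on the cv parameter of cross_val_score accepting an iterable of (train indices, test indices) pairs, and LogisticRegression is only a placeholder model chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

iris = load_iris()
indices = np.arange(len(iris.data))

# One stratified split of the row indices (y is passed only to stratify).
idx_train, idx_test = train_test_split(
    indices, test_size=0.2, stratify=iris.target, random_state=42
)

# cv accepts an iterable of (train_indices, test_indices) pairs.
custom_cv = [(idx_train, idx_test)]

# LogisticRegression is only a placeholder model for this sketch.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), iris.data, iris.target, cv=custom_cv
)
print(scores)  # one score, because custom_cv holds a single split
```

Adding more (train, test) pairs to custom_cv, for example from different random_state values, would give one score per pair.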