Diagnosing algorithms with learning curves and validation curves

Diagnosing bias and variance problems with learning curves

Coding - Varying the sample size

code: Python

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)

X = df.loc[:, 2:].values

y = df.loc[:, 1].values

# Convert the categorical labels 'M' and 'B' in y to integers

le = LabelEncoder()

y = le.fit_transform(y)

le.transform(['B', 'M'])  # check the mapping: B (benign) -> 0, M (malignant) -> 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=1)

%matplotlib inline

import matplotlib.pyplot as plt

import numpy as np

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import make_pipeline

from sklearn.model_selection import learning_curve

pipe_lr = make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', random_state=1))

# Compute cross-validated accuracy with the learning_curve function

train_sizes, train_scores, test_scores = learning_curve(estimator=pipe_lr, X=X_train, y=y_train, train_sizes=np.linspace(0.1, 1.0, 10), cv=10, n_jobs=1)

train_mean = np.mean(train_scores, axis=1)

train_std = np.std(train_scores, axis=1)

test_mean = np.mean(test_scores, axis=1)

test_std = np.std(test_scores, axis=1)

# Plot the training accuracy

plt.plot(train_sizes, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')

# Shade the band of mean ± one standard deviation

plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')

# Plot the validation accuracy (cross-validation folds, not the held-out test set)

plt.plot(train_sizes, test_mean, color='green', linestyle='--', marker='s', markersize=5, label='validation accuracy')

# Shade the band of mean ± one standard deviation

plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')

plt.grid()

plt.xlabel('Number of training samples')

plt.ylabel('Accuracy')

plt.legend(loc='lower right')

plt.ylim(0.8, 1.05)

plt.tight_layout()

plt.show()

https://gyazo.com/ea8b4223710feb0441927558f955d7d2

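When the model is trained on more than 250 samples, it performs very well on both the training dataset and the validation dataset. With fewer samples than that, the model overfits: the training accuracy stays well above the validation accuracy.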
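The reading of the learning curve above can be turned into a rough numeric check. The sketch below is only an illustration: the helper `diagnose_learning_curve`, its thresholds `target` and `gap_tol`, and the accuracy values are all assumptions made up for this example, not part of scikit-learn and not results from the WDBC data. It labels each training-set size by the gap between training and validation accuracy.

```python
import numpy as np

def diagnose_learning_curve(train_mean, val_mean, target=0.95, gap_tol=0.02):
    """Label each training-set size as 'high variance', 'high bias', or 'ok'.

    target and gap_tol are illustrative thresholds chosen for this sketch.
    """
    train_mean = np.asarray(train_mean, dtype=float)
    val_mean = np.asarray(val_mean, dtype=float)
    gap = train_mean - val_mean
    labels = []
    for tr, g in zip(train_mean, gap):
        if g > gap_tol:
            # Training accuracy well above validation accuracy: overfitting
            labels.append('high variance')
        elif tr < target:
            # Small gap but low accuracy on both sets: underfitting
            labels.append('high bias')
        else:
            labels.append('ok')
    return gap, labels

# Made-up accuracies for illustration only (not measured on the WDBC data)
sizes = np.array([50, 150, 250, 400])
train_acc = np.array([1.00, 0.99, 0.985, 0.98])
val_acc = np.array([0.93, 0.96, 0.975, 0.975])

gap, labels = diagnose_learning_curve(train_acc, val_acc)
for n, g, lab in zip(sizes, gap, labels):
    print(f"n={n:3d}  gap={g:.3f}  {lab}")
```

In practice, `train_mean` and `test_mean` computed from `learning_curve` could be passed in place of the made-up arrays, with thresholds adjusted to the problem at hand.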

Revealing overfitting and underfitting with validation curves

Coding - Varying the inverse regularization parameter C of logistic regression

code: Python

from sklearn.model_selection import validation_curve

param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Vary the model parameter with validation_curve and compute cross-validated accuracy

train_scores, test_scores = validation_curve(estimator=pipe_lr, X=X_train, y=y_train, param_name='logisticregression__C', param_range=param_range, cv=10)

train_mean = np.mean(train_scores, axis=1)

train_std = np.std(train_scores, axis=1)

test_mean = np.mean(test_scores, axis=1)

test_std = np.std(test_scores, axis=1)

plt.plot(param_range, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')

plt.fill_between(param_range, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')

plt.plot(param_range, test_mean, color='green', linestyle='--', marker='s', markersize=5, label='validation accuracy')

plt.fill_between(param_range, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')

plt.grid()

plt.xscale('log')

plt.xlabel('Parameter C')

plt.ylabel('Accuracy')

plt.legend(loc='lower right')

plt.ylim(0.8, 1.05)

plt.tight_layout()

plt.show()

https://gyazo.com/c24c31ec695501f41ce2286dbd4949a3

Decreasing C strengthens the regularization, and the model slightly underfits. Conversely, increasing C weakens the regularization, and the model slightly overfits.
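One way to act on this observation is to pick the value of C with the highest mean validation accuracy and to look at how the training/validation gap changes across the range. The sketch below assumes score arrays shaped like the output of `validation_curve` (one row per parameter value, one column per fold). The helper name `summarize_validation_curve` is hypothetical, and the scores are made-up numbers for illustration, not results from the WDBC data.

```python
import numpy as np

def summarize_validation_curve(param_range, train_scores, val_scores):
    """Return the parameter with the best mean validation score and the
    per-parameter gap between mean training and mean validation scores."""
    train_mean = np.mean(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    best_idx = int(np.argmax(val_mean))
    return param_range[best_idx], train_mean - val_mean

param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Made-up scores, shape (n_params, n_folds), for illustration only
train_scores = np.array([[0.92, 0.93], [0.95, 0.96], [0.98, 0.98],
                         [0.99, 0.99], [0.99, 1.00], [1.00, 1.00]])
val_scores = np.array([[0.91, 0.93], [0.94, 0.96], [0.97, 0.99],
                       [0.97, 0.98], [0.96, 0.97], [0.95, 0.96]])

best_C, gap = summarize_validation_curve(param_range, train_scores, val_scores)
print('C with highest mean validation accuracy:', best_C)
print('training minus validation accuracy per C:', np.round(gap, 3))
```

A gap that grows as C increases matches the overfitting pattern described above, while low scores on both sets at small C match underfitting.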