General Pipeline Interface

Coding

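The make_pipeline function provides a shorter way to build a pipeline. It is useful when the steps do not need user-specified names, because each step is named automatically.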

code: Python

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Standard syntax: a list of (name, estimator) tuples
pipe_long = Pipeline([('scaler', MinMaxScaler()), ('svm', SVC(C=100))])

# Abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

# By default, each step is named after its class, in lowercase
print('Pipeline steps:\n{}'.format(pipe_short.steps))

--------------------------------------------------------------------------

Pipeline steps:

[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('svc', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,

decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',

max_iter=-1, probability=False, random_state=None, shrinking=True,

tol=0.001, verbose=False))]

--------------------------------------------------------------------------
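When the same class appears more than once, the lowercase class name alone would not be unique. The following is a minimal sketch of how make_pipeline handles that case; the particular steps (StandardScaler, PCA, StandardScaler) are chosen only for illustration and are not taken from the example above.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two steps share the class StandardScaler, so a numeric suffix is appended to each
pipe_dup = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())

# Collect only the step names from the (name, estimator) tuples
step_names = [name for name, _ in pipe_dup.steps]
print('Pipeline step names: {}'.format(step_names))
```

The duplicated steps come out as standardscaler-1 and standardscaler-2, while pca keeps its plain name. These suffixed names are also the prefixes to use in a parameter grid.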

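One common reason to use a pipeline is grid search, and it is often necessary to access one of the pipeline's steps from inside the grid search. Here we run a grid search on the cancer dataset with a pipeline that rescales the data with StandardScaler and then fits a LogisticRegression classifier.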

code: Python

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Because we used make_pipeline, the LogisticRegression step is named 'logisticregression',
# so the parameter grid tunes C through the key 'logisticregression__C'
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=4)
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# The best model found by GridSearchCV, refit on all the training data, is stored in grid.best_estimator_
print('Best estimator:\n{}'.format(grid.best_estimator_))

# Use the pipeline's named_steps attribute to access the logisticregression step
print('Logistic regression step:\n{}'.format(grid.best_estimator_.named_steps['logisticregression']))

# The coefficients (weights) associated with each input feature are then accessible
print('Logistic regression coefficients:\n{}'.format(grid.best_estimator_.named_steps['logisticregression'].coef_))

--------------------------------------------------------------------------

Best estimator:

Pipeline(memory=None,

steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,

intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,

penalty='l2', random_state=None, solver='liblinear', tol=0.0001,

verbose=0, warm_start=False))])

Logistic regression step:

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,

intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,

penalty='l2', random_state=None, solver='liblinear', tol=0.0001,

verbose=0, warm_start=False)

Logistic regression coefficients:

[[-0.38856355 -0.37529972 -0.37624793 -0.39649439 -0.11519359 0.01709608

-0.3550729 -0.38995414 -0.05780518 0.20879795 -0.49487753 -0.0036321

-0.37122718 -0.38337777 -0.04488715 0.19752816 0.00424822 -0.04857196

0.21023226 0.22444999 -0.54669761 -0.52542026 -0.49881157 -0.51451071

-0.39256847 -0.12293451 -0.38827425 -0.4169485 -0.32533663 -0.13926972]]

--------------------------------------------------------------------------
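The coefficient array above is printed in feature order, which makes it hard to tell which weight belongs to which feature. The sketch below pairs each coefficient with its feature name. For brevity it fits a single pipeline on the full dataset with C=0.1 instead of running the grid search, and max_iter=1000 is an assumption added only to avoid convergence warnings; neither choice comes from the example above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.1, max_iter=1000))
pipe.fit(cancer.data, cancer.target)

# Two ways to reach the fitted classifier: by step name, or as the last (name, estimator) tuple
by_name = pipe.named_steps['logisticregression']
by_index = pipe.steps[-1][1]

# Pair each coefficient with its feature name, largest magnitude first
ranked = sorted(zip(cancer.feature_names, by_name.coef_[0]),
                key=lambda pair: abs(pair[1]), reverse=True)
for feature, weight in ranked[:5]:
    print('{:<25s} {: .3f}'.format(feature, weight))
```

Because the features were standardized before fitting, the magnitudes of the weights are on a comparable scale, which is what makes this ranking meaningful.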

Grid search can also tune the parameters of the preprocessing steps and the parameters of the model at the same time.
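Every key in such a grid has the form step name, two underscores, parameter name. One way to find the valid keys is to ask the pipeline itself. This is a small sketch, using a pipeline of StandardScaler, PolynomialFeatures and Ridge; the variable name tunable is only illustrative.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())

# Keys containing '__' address a parameter of one step: <step name>__<parameter>
tunable = sorted(key for key in pipe.get_params() if '__' in key)
print('\n'.join(tunable))
```

Any of the printed keys, for example polynomialfeatures__degree or ridge__alpha, can be used as a key in the parameter grid passed to GridSearchCV.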

code: Python

# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release
import mglearn
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)

pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

# Plot the mean cross-validation score: one row per degree, one column per alpha
mglearn.tools.heatmap(grid.cv_results_['mean_test_score'].reshape(3, -1),
                      xlabel='ridge__alpha', ylabel='polynomialfeatures__degree',
                      xticklabels=param_grid['ridge__alpha'],
                      yticklabels=param_grid['polynomialfeatures__degree'], vmin=0)

https://gyazo.com/927cfd2920c5fd69aea87e5ffbf73a64
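The reshape(3, -1) call in the heatmap code relies on the order in which the candidates are stored in cv_results_. The sketch below reproduces that order in plain Python, under the assumption that candidates are enumerated with the parameter names sorted alphabetically and the last name varying fastest. It is an illustration of the layout, not scikit-learn's own code.

```python
from itertools import product

param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Assumed enumeration order: keys sorted by name, last key varying fastest
keys = sorted(param_grid)
candidates = [dict(zip(keys, values))
              for values in product(*(param_grid[k] for k in keys))]

# Cut the flat candidate list into one row per degree, as reshape(3, -1) does
n_alphas = len(param_grid['ridge__alpha'])
n_degrees = len(param_grid['polynomialfeatures__degree'])
rows = [candidates[r * n_alphas:(r + 1) * n_alphas] for r in range(n_degrees)]

for row in rows:
    degree = row[0]['polynomialfeatures__degree']
    alphas = [c['ridge__alpha'] for c in row]
    print('degree={}: alphas={}'.format(degree, alphas))

# Each candidate is fit once per cross-validation fold (cv=5 in the example)
n_fits = len(candidates) * 5
print('candidates: {}, cross-validation fits: {}'.format(len(candidates), n_fits))
```

Under this assumption each row of the heatmap holds one degree and each column one alpha, which matches the axis labels passed to the plotting call.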

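It is also possible to search over the steps that the pipeline actually performs, for example whether to use StandardScaler or MinMaxScaler. Here we look at an example that compares RandomForestClassifier and SVC on the cancer dataset.

Random forests are known to need no preprocessing. For SVC, we want to compare the results with and without preprocessing.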

code: Python

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])

# One grid per classifier: SVC with or without scaling, random forest without preprocessing
param_grid = [
    {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
     'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
     'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
    {'classifier': [RandomForestClassifier(n_estimators=100)],
     'preprocessing': [None], 'classifier__max_features': [1, 2, 3]}]

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

print('Best params:\n{}\n'.format(grid.best_params_))
print('Best cross-validation score: {:.2f}'.format(grid.best_score_))
print('Test-set score: {:.2f}'.format(grid.score(X_test, y_test)))

--------------------------------------------------------------------------

Best params:

{'classifier': SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,

decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',

max_iter=-1, probability=False, random_state=None, shrinking=True,

tol=0.001, verbose=False), 'classifier__C': 10, 'classifier__gamma': 0.01, 'preprocessing': StandardScaler(copy=True, with_mean=True, with_std=True)}

Best cross-validation score: 0.99

Test-set score: 0.98

--------------------------------------------------------------------------

The best result comes from SVC with StandardScaler preprocessing, C=10, and gamma=0.01.
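When the grid is a list of dictionaries, each dictionary is expanded on its own, so parameters that belong only to SVC are never combined with the random forest. As a rough check of how much work this search involves, the sketch below counts the candidates with ParameterGrid from sklearn.model_selection; the variable names sizes and total are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = [
    {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
     'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
     'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
    {'classifier': [RandomForestClassifier(n_estimators=100)],
     'preprocessing': [None], 'classifier__max_features': [1, 2, 3]}]

# Size of each sub-grid: the product of the lengths of its value lists
sizes = [len(ParameterGrid(sub_grid)) for sub_grid in param_grid]

# The full grid is the sub-grids placed one after another, not their cross product
total = len(ParameterGrid(param_grid))
print('candidates per sub-grid: {}, total: {}'.format(sizes, total))
```

The SVC dictionary contributes 72 candidates and the random forest dictionary 3, for 75 in total. With cv=5 that is 375 model fits, plus one final refit of the best candidate.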