Kaggle Book, Chapter 6 - kaggle-friends

Chapter 6: Model Tuning

6.1 Parameter Tuning (P.306)

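Addendum: parameter tuning is not where you should spend your time in a competition; treat it as a bonus at most. wakame.icon

For reference, Curry-chan's example from the earthquake competition

3rd place on the private leaderboard. In this competition I was taking risks elsewhere in the modeling, so the variance was large, and I think parameter tuning made little difference whether I did it or not. (reference, reference)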

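Tuning with Optuna currypurin.icon

First, tune the parameters I consider high-impact, such as max_depth, min_data_in_leaf, subsample, and colsample_bytree, for about 30 iterations.

After that, tune the parameters I consider low-impact, such as reg_alpha and reg_lambda.

(If I remember correctly.)

I did this the day before the submission deadline. currypurin.icon

I had worked hard up to that point and did not want to lose because of parameters, so I did it just in case. currypurin.icon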
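The two-stage order described above (high-impact parameters first, regularization second) can be sketched as follows. This is only an illustration: plain random search stands in for Optuna's sampler, and `toy_cv_score`, the search ranges, and the default values are all hypothetical placeholders for a real cross-validation run.

```python
import random

def toy_cv_score(params):
    """Hypothetical stand-in for a cross-validation score (higher is better)."""
    return -(
        (params["max_depth"] - 6) ** 2
        + 10 * (params["subsample"] - 0.8) ** 2
        + 10 * (params["colsample_bytree"] - 0.7) ** 2
        + 0.1 * (params["reg_alpha"] - 0.5) ** 2
        + 0.1 * (params["reg_lambda"] - 2.0) ** 2
    )

def random_search(base, space, n_trials, rng, score_fn=toy_cv_score):
    """Sample only the parameters in `space`; keep everything else from `base`."""
    best, best_score = dict(base), score_fn(base)
    for _ in range(n_trials):
        cand = dict(base)
        cand.update({name: sampler(rng) for name, sampler in space.items()})
        score = score_fn(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

rng = random.Random(0)
defaults = {"max_depth": 5, "subsample": 1.0, "colsample_bytree": 1.0,
            "reg_alpha": 0.0, "reg_lambda": 1.0}

# Stage 1: parameters assumed to matter most, about 30 trials.
stage1_space = {
    "max_depth": lambda r: r.randint(3, 9),
    "subsample": lambda r: r.uniform(0.5, 1.0),
    "colsample_bytree": lambda r: r.uniform(0.5, 1.0),
}
# Stage 2: regularization terms, tuned with the stage-1 result held fixed.
stage2_space = {
    "reg_alpha": lambda r: r.uniform(0.0, 5.0),
    "reg_lambda": lambda r: r.uniform(0.0, 5.0),
}

stage1_best, stage1_score = random_search(defaults, stage1_space, 30, rng)
stage2_best, stage2_score = random_search(stage1_best, stage2_space, 30, rng)
print(stage1_score, stage2_score)
```

Because each stage starts from the best result so far, the second stage can only keep or improve the score reached by the first.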

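Especially when the model is a GBDT, adding good features often does more for accuracy than parameter tuning. Some parameter tuning is worthwhile because it makes features easier to evaluate, but it is better not to invest too heavily in it early on. (P.311, beginning)

6.1.4 Parameter Search with Bayesian Optimization (P.311)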

hyperopt

https://neptune.ml/blog/optuna-vs-hyperopt

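An article comparing the features of optuna and hyperopt

Addendum: the question was which of the two finishes a parameter search more time-efficiently

There is a section called Speed and Parallelization, but it only compares whether each library supports parallel execution

Addendum: in the AutoML track of KDDCup, the 1st through 3rd place solutions all used hyperopt. wakame.icon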

https://speakerdeck.com/yohrn/8th-place-solution-of-autowsl-2019?slide=15

Was it because the tutorial used hyperopt, or is optuna simply not well known?

optuna

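To save computation time, you can check accuracy on just one fold of the cross-validation rather than on all folds. Conversely, when the results vary a lot from run to run, you can use the average over several runs with different fold splits. (P.310, beginning)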
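A minimal sketch of the two options in the quote, under stated assumptions: the "model" is a toy that predicts the training mean, the score is negative MSE, and the function names are made up for illustration.

```python
import random
import statistics

def kfold_indices(n, k, seed):
    """Shuffle 0..n-1 with `seed` and split into k validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fold_score(y, valid_idx):
    """Toy 'model': predict the training mean, return negative MSE on the fold."""
    valid = set(valid_idx)
    train = [v for i, v in enumerate(y) if i not in valid]
    pred = statistics.fmean(train)
    return -statistics.fmean([(y[i] - pred) ** 2 for i in valid_idx])

def quick_score(y, k=5, seed=0):
    """Cheap estimate: evaluate only the first fold."""
    return fold_score(y, kfold_indices(len(y), k, seed)[0])

def repeated_cv_score(y, k=5, seeds=(0, 1, 2)):
    """More stable estimate: average over all folds of several different splits."""
    scores = [fold_score(y, fold)
              for seed in seeds
              for fold in kfold_indices(len(y), k, seed)]
    return statistics.fmean(scores)

rng = random.Random(42)
y = [rng.gauss(0.0, 1.0) for _ in range(200)]
print(quick_score(y), repeated_cv_score(y))
```

`quick_score` costs one model fit per trial, while `repeated_cv_score` costs `k * len(seeds)` fits, so the choice is a trade-off between search speed and noise in the objective.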

Addendum: optuna has a pruning feature that cuts off unpromising trials partway through, so the search is more time-efficient. wakame.icon

https://www.slideshare.net/pfi/pydatatokyo-meetup-21-optuna

Slide P.31
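To make the pruning idea concrete, here is a simplified median-based rule in plain Python. It is a conceptual sketch, not Optuna's implementation; in Optuna itself the same idea is exposed through `trial.report`, `trial.should_prune`, and pruners such as `MedianPruner`. The loss curves below are made-up numbers.

```python
import statistics

def should_prune(history, step, value):
    """Prune when `value` at `step` is worse (higher loss) than the median
    of what earlier trials reported at the same step."""
    earlier = [curve[step] for curve in history if len(curve) > step]
    return bool(earlier) and value > statistics.median(earlier)

def run_trials(curves):
    """`curves` are hypothetical per-trial validation losses, one value per step."""
    history, steps_used = [], 0
    for curve in curves:
        reported = []
        for step, value in enumerate(curve):
            reported.append(value)
            steps_used += 1
            if should_prune(history, step, value):
                break  # stop this trial early
        history.append(reported)
    return history, steps_used

curves = [
    [1.0, 0.8, 0.6, 0.5],
    [0.9, 0.7, 0.5, 0.4],
    [1.5, 1.4, 1.3, 1.2],   # clearly worse: stopped after its first step
    [0.8, 0.6, 0.45, 0.35],
]
history, steps_used = run_trials(curves)
print(steps_used, "steps instead of", sum(len(c) for c in curves))
```

In this toy run the third trial is stopped after one step, so 13 training steps are spent instead of 16.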

6.1.5 GBDT Parameters and Their Tuning (P.315)

I collected articles and Kagglers' concrete approaches to see how parameter tuning is actually done. wakame.icon

XGBoost-param_tuning (official documentation)

LightGBM-Parameters-Tuning (official documentation)

Complete Guide to Parameter Tuning in XGBoost(Analytics Vidhya)

PARAMETERS (Laurae++)

Parameter-name correspondence between lightgbm and xgboost

Concrete parameters and tuning methods

LightGBMTuner example

Examples from top teams in KDDCupAutoML5

https://github.com/pfnet-research/KDD-Cup-AutoML-5/blob/master/optable_submission/optable_package/optable/learning/optuna_hyper_params_searcher.py#L103-L121

https://github.com/DeepBlueAI/AutoSmart/blob/master/auto_smart/auto_smart/automl/auto_lgb.py

Column (P.318)

table:threecourse's parameter tuning (Bayesian optimization with hyperopt)

xgboost parameter (lightgbm) | Baseline value | Search range
eta (learning_rate) | 0.1 | Fixed during the parameter search
num_round (num_iterations) | - | Set large enough and let early stopping determine the optimal number of trees
max_depth (max_depth) | 5 | 3 to 9, uniform, step 1
min_child_weight (min_sum_hessian_in_leaf) | 1.0 | 0.1 to 10.0, log-uniform
gamma (min_gain_to_split) | 0.0 | 1e-8 to 1.0, log-uniform
colsample_bytree (feature_fraction) | 0.8 | 0.6 to 0.95, uniform, step 0.05
subsample (bagging_fraction) | 0.8 | 0.6 to 0.95, uniform, step 0.05
alpha (lambda_l1) | 0.0 | Leave at the default; tune if time permits
lambda (lambda_l2) | 1.0 | Leave at the default; tune if time permits

eta is set to 0.1 during tuning and lowered when building the model for submission
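The search ranges in threecourse's table mix three kinds of distribution: integer uniform, stepped uniform, and log-uniform. A rough stdlib sketch of sampling from that space follows; the helper names are made up, and hyperopt provides comparable building blocks such as `hp.quniform` and `hp.loguniform`.

```python
import math
import random

def quantized_uniform(rng, low, high, step):
    """Uniform over the grid low, low + step, ..., high."""
    n_steps = round((high - low) / step)
    return round(low + step * rng.randint(0, n_steps), 10)

def log_uniform(rng, low, high):
    """Value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_params(rng):
    return {
        "eta": 0.1,  # fixed during the search
        "max_depth": rng.randint(3, 9),
        "min_child_weight": log_uniform(rng, 0.1, 10.0),
        "gamma": log_uniform(rng, 1e-8, 1.0),
        "colsample_bytree": quantized_uniform(rng, 0.6, 0.95, 0.05),
        "subsample": quantized_uniform(rng, 0.6, 0.95, 0.05),
        "alpha": 0.0,   # left at the default
        "lambda": 1.0,  # left at the default
    }

rng = random.Random(0)
samples = [sample_params(rng) for _ in range(200)]
print(samples[0])
```

Log-uniform sampling spends as many trials between 0.1 and 1.0 as between 1.0 and 10.0, which suits parameters such as min_child_weight and gamma whose effect scales with order of magnitude.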

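table:Jack's manual parameter tuning

xgboost parameter (lightgbm) | Baseline value | Search range
eta (learning_rate) | 0.1 or 0.5 (depends on the amount of data) | Fixed during the parameter search
max_depth (max_depth) | Not set, because it is tuned first | ① Try 5 to 8; widen the range if shallower or deeper looks likely to help
min_child_weight (min_sum_hessian_in_leaf) | 1.0 | ③ Try 1, 2, 4, 8, 16, 32, ..., doubling each time
gamma (min_gain_to_split) | 0.0 | -
colsample_bytree (feature_fraction) | 1.0 | -
colsample_bylevel (-) | 0.3 | ② 0.5 to 0.1 in steps of 0.1
subsample (bagging_fraction) | 0.9 | -
alpha (lambda_l1) | 0.0 | ④ The two balance each other, so try various values
lambda (lambda_l2) | 1.0 | ④ The two balance each other, so try various values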
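Jack's procedure tunes one parameter at a time in the order ①, ②, ③. A minimal sketch of that order follows, assuming a hypothetical `toy_cv_score` in place of a real CV run. Step ④ (alpha and lambda) and the "widen the range" rule for max_depth are left out for brevity.

```python
def toy_cv_score(params):
    """Hypothetical stand-in for a CV score (higher is better)."""
    return -(
        (params["max_depth"] - 7) ** 2
        + 20 * (params["colsample_bylevel"] - 0.4) ** 2
        + 0.01 * (params["min_child_weight"] - 8) ** 2
    )

def tune_one(params, name, candidates, score_fn):
    """Try each candidate for one parameter while holding the others fixed."""
    best = max(candidates, key=lambda v: score_fn({**params, name: v}))
    return {**params, name: best}

# Baseline values from the table; max_depth starts at a placeholder.
params = {"eta": 0.1, "max_depth": 5, "min_child_weight": 1.0,
          "gamma": 0.0, "colsample_bytree": 1.0, "colsample_bylevel": 0.3,
          "subsample": 0.9, "alpha": 0.0, "lambda": 1.0}

steps = [
    ("max_depth", [5, 6, 7, 8]),                        # step 1
    ("colsample_bylevel", [0.5, 0.4, 0.3, 0.2, 0.1]),   # step 2
    ("min_child_weight", [1, 2, 4, 8, 16, 32]),         # step 3
]
for name, candidates in steps:
    params = tune_one(params, name, candidates, toy_cv_score)

print(params)
```

Tuning in a fixed order needs only 4 + 5 + 6 = 15 evaluations here, compared with 120 for the full grid, at the cost of ignoring interactions between parameters.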

6.2 Feature Selection and Feature Importance (P.328)

https://www.kaggle.com/c/elo-merchant-category-recommendation/discussion/73937

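So xgboost lets you choose between gain, cover, and frequency. currypurin.icon

lightgbm offers gain and split (frequency)

Frequency is confusingly called weight in xgboost's argument; in lightgbm it is split. wakame.icon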

https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance
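The difference between frequency-type and gain-type importance is easy to see on a toy list of splits. The sketch below uses made-up feature names and gains; it sums gain per feature, which corresponds to lightgbm's `gain` and xgboost's `total_gain` (xgboost's `gain` is the average per split).

```python
from collections import Counter, defaultdict

# Hypothetical splits collected from all trees: (feature name, loss reduction).
splits = [
    ("age", 12.0), ("age", 0.5), ("age", 0.4), ("age", 0.3),
    ("income", 30.0), ("income", 8.0),
    ("city", 1.0),
]

def split_importance(splits):
    """Number of times each feature is used to split (xgboost: weight, lightgbm: split)."""
    return dict(Counter(feature for feature, _ in splits))

def gain_importance(splits):
    """Total loss reduction contributed by each feature's splits."""
    total = defaultdict(float)
    for feature, gain in splits:
        total[feature] += gain
    return dict(total)

by_split = split_importance(splits)
by_gain = gain_importance(splits)
print(by_split)
print(by_gain)
```

Here `age` ranks first by split count because it is used in many small splits, while `income` ranks first by gain, so the two importance types can order features differently.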

What about catboost? currypurin.icon

I collected the URLs but have not read them all. wakame.icon

https://catboost.ai/docs/features/feature-importances-calculation.html

https://catboost.ai/docs/concepts/shap-values.html

https://catboost.ai/docs/concepts/feature-interaction.html#feature-interaction__feature-interaction-strength

PredictionValuesChange(for non-ranking metrics)

For each feature, PredictionValuesChange shows how much on average the prediction changes if the feature value changes. The bigger the value of the importance the bigger on average is the change to the prediction value, if this feature is changed.

For each feature, the amount by which the prediction changes when that feature's value changes

LossFunctionChange(for ranking metrics (the value is determined automatically))

For each feature the value represents the difference between the loss value of the model with this feature and without it. The model without this feature is equivalent to the one that would have been trained if this feature was excluded from the dataset. Since it is computationally expensive to retrain the model without one of the features, this model is built approximately using the original model with this feature removed from all the trees in the ensemble. The calculation of this feature importance requires a dataset and, therefore, the calculated value is dataset-dependent.

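Takes the difference in loss between the model that uses the feature and the model with it removed

Addendum: my guess is that this is roughly what xgboost/lightgbm call gain. wakame.icon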

InternalFeatureImportance

The importance values both for each of the input features and for their combinations (if any).

ShapValues

A vector with contributions of each feature to the prediction for every input object and the expected value of the model prediction for the object (average prediction given no knowledge about the object).

The calculation depends on the shap package

Interaction

The value of the feature interaction strength for each pair of features.

InternalInteraction

The value of the feature interaction strength for each pair of features that are used in the model. Internally the model uses feature combinations as separate features. All feature combinations that are used in the model are listed separately. For example, if the model contains a feature named F1 and a combination of features {F2, F3}, the interaction between F1 and the combination of features {F2, F3} is listed in the output file.

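Lately I either do no feature selection or cut by importance. currypurin.icon

Adversarial validation is easy to use, and I also use it to prevent accidents. currypurin.icon

As a premise, think carefully about whether the data is something you should fit to the training set or not. With time series and the like, you often should not fit to train. currypurin.icon

Note that there are also cases where you should fit.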
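Adversarial validation normally trains a classifier (often a GBDT) on all features to tell train rows from test rows and looks at its AUC. The sketch below is a deliberately simplified per-feature variant that uses each raw feature as the "classifier score". The data are synthetic and the function names are made up, so treat it as an illustration of the idea only.

```python
import random

def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def adversarial_auc_per_feature(train_rows, test_rows):
    """Label train as 0 and test as 1, then score each feature on its own."""
    labels = [0] * len(train_rows) + [1] * len(test_rows)
    rows = train_rows + test_rows
    n_features = len(rows[0])
    return [auc([row[j] for row in rows], labels) for j in range(n_features)]

rng = random.Random(0)
# Feature 0 has the same distribution in both sets; feature 1 is shifted in test.
train = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
test = [[rng.gauss(0, 1), rng.gauss(2, 1)] for _ in range(300)]

aucs = adversarial_auc_per_feature(train, test)
print(aucs)
```

An AUC near 0.5 means the feature does not separate train from test; a value far from 0.5 flags a feature whose distribution differs between the two sets.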

I would like to try permutation importance and null importance
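For reference, permutation importance can be sketched in a few lines: shuffle one column, re-score a model that is already fitted, and record how much the error grows. The "model" below is a hypothetical fixed function and the data are synthetic. Null importance, which shuffles the target and retrains, is not covered here.

```python
import random

def mse(model, X, y):
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_features, seed=0):
    """Increase in MSE after shuffling one column at a time."""
    rng = random.Random(seed)
    base = mse(model, X, y)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(mse(model, X_perm, y) - base)
    return importances

# Hypothetical fitted model: relies heavily on feature 0, weakly on feature 1,
# and ignores feature 2.
def model(row):
    return 3.0 * row[0] + 0.5 * row[1]

rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
y = [model(row) for row in X]

imp = permutation_importance(model, X, y, n_features=3)
print(imp)
```

In practice the scoring should be done on validation data rather than on the data the model was trained on.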

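Addendum: I mostly agree with Curry-chan. I do not generate so many features that selection becomes necessary, and I select by looking at feature importance. I do not use the feature selection methods described in the book in the first place. wakame.icon

takuoko's work was brought up as an example of generating a large number of features

Feature generation example from petfinder

Creates a large number of features from combinations of GroupBy keys, columns, and statistics (mean/max, etc.)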

https://github.com/okotaku/pet_finder/tree/master/code/fe

https://github.com/okotaku/pet_finder/blob/03d6a2cf8f4757de8ab59e53b88abc3f00a017d2/code/all_tools.py#L475
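The "group key x column x statistic" pattern can be sketched without any library as below. The rows and column names are hypothetical and are not taken from the linked repository; with pandas the same thing is usually written with `groupby(...)[col].transform(...)`.

```python
from collections import defaultdict
from itertools import product
from statistics import fmean

# Hypothetical rows; the column names are illustrative only.
rows = [
    {"breed": "a", "state": "x", "age": 2, "fee": 10},
    {"breed": "a", "state": "y", "age": 4, "fee": 30},
    {"breed": "b", "state": "x", "age": 6, "fee": 50},
]
group_keys = ["breed", "state"]
value_cols = ["age", "fee"]
stats = {"mean": fmean, "max": max}

def add_groupby_features(rows, group_keys, value_cols, stats):
    """Add one feature per (group key, value column, statistic) combination."""
    out = [dict(r) for r in rows]
    for key, col in product(group_keys, value_cols):
        groups = defaultdict(list)
        for r in rows:
            groups[r[key]].append(r[col])
        for name, fn in stats.items():
            agg = {g: fn(vals) for g, vals in groups.items()}
            for r in out:
                r[f"{col}_{name}_by_{key}"] = agg[r[key]]
    return out

features = add_groupby_features(rows, group_keys, value_cols, stats)
print(sorted(features[0]))
```

The number of generated columns is the product of the three list lengths (2 x 2 x 2 = 8 here), which is why this approach produces features in bulk.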

Column: Concrete Parameter Tuning Methods for Multilayer Perceptrons (P.322)

Below is the URL that the book referenced for multilayer perceptron parameter tuning

https://github.com/ChenglongChen/Kaggle_HomeDepot/blob/master/Code/Chenglong/model_param_space.py

6.3 When the Class Distribution Is Imbalanced (P.341)

Undersampling (P.341)

Example: Kaggle - Porto Seguro’s Safe Driver Prediction wakame.icon

https://copypaste-ds.hatenablog.com/entry/2019/02/08/170518

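The data was imbalanced, but as quoted below, perhaps thanks to the evaluation metric, missing the positive predictions was not penalized much. This is an example where undersampling and similar treatment was not needed.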

https://employment.en-japan.com/engineerhub/entry/2018/08/24/110000#Porto-Seguros-Safe-Driver-Predictionとは

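When the evaluation metric is accuracy, dealing with imbalanced data becomes very troublesome, but in "Porto Seguro’s Safe Driver Prediction" the metric was the Gini coefficient, so the handling specific to imbalanced data was not really necessary.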
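The point in the quote can be checked on a toy example. The normalized Gini coefficient equals 2 * AUC - 1 and depends only on how the predictions are ranked, while accuracy depends on a threshold. The labels and scores below are made up for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def normalized_gini(scores, labels):
    return 2 * auc(scores, labels) - 1

def accuracy(scores, labels, threshold=0.5):
    hits = sum((s >= threshold) == bool(l) for s, l in zip(scores, labels))
    return hits / len(labels)

labels = [1] * 5 + [0] * 95            # 5% positives
constant = [0.05] * 100                # no ranking information at all
ranked = [0.30] * 5 + [0.03] * 95      # small probabilities, positives ranked first

for name, scores in [("constant", constant), ("ranked", ranked)]:
    print(name, accuracy(scores, labels), normalized_gini(scores, labels))
```

Both sets of predictions have the same accuracy (every row falls below the 0.5 threshold), yet the Gini coefficient separates them clearly, so a rank-based metric rewards a model that orders the rare positives well without any resampling.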

Oversampling (P.342)

Reference information for image data rather than tabular data wakame.icon

https://www.slideshare.net/yuyasoneoka/a-systematic-study-of-the-class-imbalance-problem-in-convolutional-neural-networks

https://github.com/arXivTimes/arXivTimes/issues/1395

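A study of how label imbalance affects DNNs (CNNs) and which countermeasures are effective. The effects of imbalance occur in CNNs without exception, and the conclusion is that oversampling is the best countermeasure (overfitting is reportedly unlikely for DNNs). When accuracy matters, imbalance correction (thresholding) should also be applied.

The paper compares a baseline against oversampling and undersampling; undersampling scores lower than the baseline (the number of trials is small, so treat this as indicative only)

Addendum: I think it is fair to picture this as something like image augmentation. wakame.icon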
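For the tabular case, random oversampling and undersampling both come down to resampling row indices. The sketch below assumes binary 0/1 labels, and the function names are made up for illustration.

```python
import random

def _split_classes(labels):
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    return (pos, neg) if len(pos) < len(neg) else (neg, pos)  # (minority, majority)

def random_oversample(rows, labels, rng):
    """Duplicate minority-class rows (with replacement) until the classes are balanced."""
    minority, majority = _split_classes(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]

def random_undersample(rows, labels, rng):
    """Drop majority-class rows at random until the classes are balanced."""
    minority, majority = _split_classes(labels)
    idx = minority + rng.sample(majority, len(minority))
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]

rows = list(range(100))
labels = [1] * 10 + [0] * 90

over_rows, over_labels = random_oversample(rows, labels, random.Random(0))
under_rows, under_labels = random_undersample(rows, labels, random.Random(0))
print(len(over_labels), sum(over_labels), len(under_labels), sum(under_labels))
```

Resampling is normally applied to the training folds only, so that validation scores still reflect the original class ratio.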

Column: Bayesian Optimization and the TPE Algorithm (P.343)

Honestly, I have not read this part. wakame.icon