Uncertainty: top solutions

1st place

validation: purged GroupKFold

using years as folds while excluding the months around the holidays.
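As a rough illustration of year-based folds with a purge, here is a minimal sketch in plain Python. It assumes that "excluding the months around the holidays" means dropping those rows from both the training and validation sides, and the set of purged months (`PURGED_MONTHS`) is a hypothetical choice, since the write-up does not list them. Holding out one year at a time should match what GroupKFold produces when the year is the group and the number of splits equals the number of years.

```python
from datetime import date, timedelta

# Hypothetical choice: the source does not say which months are purged.
PURGED_MONTHS = {11, 12, 1}


def purged_year_folds(dates, purged_months=PURGED_MONTHS):
    """Yield (year, train_idx, valid_idx), one fold per calendar year.

    Rows whose month is in `purged_months` are left out of both sides.
    """
    kept = [i for i, d in enumerate(dates) if d.month not in purged_months]
    years = sorted({dates[i].year for i in kept})
    for year in years:
        valid_idx = [i for i in kept if dates[i].year == year]
        train_idx = [i for i in kept if dates[i].year != year]
        yield year, train_idx, valid_idx


# Daily dates covering 2013 through 2015.
start = date(2013, 1, 1)
dates = [start + timedelta(days=k) for k in range(3 * 365)]
folds = list(purged_year_folds(dates))
```

Each element of `folds` can then be passed to whatever model-fitting routine is in use; the index lists refer to positions in `dates`.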

All cross-validation was done intra-fold also using GroupKFold.

What does "intra-fold" mean here?

A separate RandomizedSearchCV was run within each fold.

It seems a parameter search was run separately in each fold.
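One possible reading of "intra-fold" is nested cross-validation: for each outer fold, the hyperparameter search only sees that fold's training rows and scores candidates on inner group splits of them. The sketch below follows that reading in plain Python. The helper names, the toy search space, and `toy_score` (a stand-in for fitting a model and returning a validation loss) are all hypothetical and are not taken from the write-up.

```python
import random


def inner_group_splits(train_idx, groups):
    """Leave-one-group-out splits restricted to the outer training rows."""
    inner = sorted({groups[i] for i in train_idx})
    for g in inner:
        fit_idx = [i for i in train_idx if groups[i] != g]
        held_idx = [i for i in train_idx if groups[i] == g]
        yield fit_idx, held_idx


def search_within_fold(param_space, train_idx, groups, score_fn, n_iter, rng):
    """Random search whose candidates are scored only on inner splits."""
    best_loss, best_params = None, None
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        losses = [score_fn(params, fit_idx, held_idx)
                  for fit_idx, held_idx in inner_group_splits(train_idx, groups)]
        mean_loss = sum(losses) / len(losses)
        if best_loss is None or mean_loss < best_loss:
            best_loss, best_params = mean_loss, params
    return best_loss, best_params


def toy_score(params, fit_idx, held_idx):
    # Stand-in for "fit a model on fit_idx, return its loss on held_idx".
    return abs(params["num_leaves"] - 30) + params["learning_rate"]


groups = [2013] * 4 + [2014] * 4 + [2015] * 4  # one year label per row
param_space = {"num_leaves": [20, 30, 50], "learning_rate": [0.02, 0.05]}
rng = random.Random(0)

# One independent search per outer fold (held-out year).
best_by_year = {}
for held_out_year in sorted(set(groups)):
    train_idx = [i for i, g in enumerate(groups) if g != held_out_year]
    best_by_year[held_out_year] = search_within_fold(
        param_space, train_idx, groups, toy_score, n_iter=200, rng=rng)
```

Because the held-out year never enters the search, the parameters chosen for a fold cannot be tuned on the data used to evaluate that fold.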

feature

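Tried adding features one at a time at level 12, judging each by MSE.

There was also a feature that divides sales quantity by the store's average sales quantity.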
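The store-normalized sales feature can be sketched as follows. The row layout (`(store_id, sales)` pairs) and the function name are hypothetical, and the guard for a zero store mean is an added assumption rather than something the write-up describes.

```python
from collections import defaultdict


def store_normalized_sales(rows):
    """Divide each row's sales by the mean sales of its store.

    `rows` is a list of (store_id, sales) pairs; the layout is hypothetical.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for store_id, sales in rows:
        totals[store_id] += sales
        counts[store_id] += 1
    means = {s: totals[s] / counts[s] for s in totals}
    # Guard against a store whose mean is zero to avoid division by zero.
    return [sales / means[s] if means[s] else 0.0 for s, sales in rows]


rows = [("CA_1", 2), ("CA_1", 6), ("TX_1", 1), ("TX_1", 3)]
features = store_normalized_sales(rows)
```

In practice the store means would probably be computed on the training period only, so that validation rows do not leak into the feature.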

Price data, external data, and item_id were not used.

subsampling

Subsampling kept each run to 5-10 minutes.

Instead of using early_stopping, RandomizedSearchCV was run over the LightGBM parameters that might be used.

Lgb_quantile_param

code:python
 # Search space for RandomizedSearchCV; a value repeated in a list is sampled more often.
 lgb_quantile_params = {
     'max_depth': [10, 20],
     'n_estimators': [200, 300, 350, 400],
     'min_split_gain': [0, 0, 0, 0, 1e-4, 1e-3, 1e-2, 0.1],
     'min_child_samples': [2, 4, 7, 10, 14, 20, 30, 40, 60, 80, 100, 130,
                           170, 200, 300, 500, 700, 1000],
     'min_child_weight': [0, 0, 0, 0, 1e-4, 1e-3, 1e-3, 5e-3, 2e-2, 0.1],
     'num_leaves': [20, 30, 30, 30, 50, 70, 90],
     'learning_rate': [0.02, 0.03, 0.04, 0.04, 0.05, 0.05, 0.07],
     'colsample_bytree': [0.3, 0.5, 0.7, 0.8, 0.9, 0.9, 0.9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
     'colsample_bynode': [0.1, 0.15, 0.2, 0.2, 0.2, 0.25, 0.3, 0.5, 0.65, 0.8, 0.9, 1],
     'reg_lambda': [0, 0, 0, 0, 1e-5, 1e-5, 1e-5, 1e-5, 3e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100],
     'reg_alpha': [0, 1e-5, 3e-5, 1e-4, 1e-4, 1e-3, 3e-3, 1e-2, 0.1, 1, 1, 10, 10, 100, 1000],
     'subsample': [0.9, 1],
     'cat_smooth': [0.1, 0.2, 0.5, 1, 2, 5, 7, 10],
 }
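Many of the lists above repeat the same value several times. Since a randomized search draws uniformly from a list-valued parameter, the repeats presumably act as sampling weights. The short sketch below makes the implied probabilities explicit for two of the lists; the helper name is hypothetical.

```python
from collections import Counter

# Two of the lists from the search space above; the repeats are intentional.
min_split_gain = [0, 0, 0, 0, 1e-4, 1e-3, 1e-2, 0.1]
num_leaves = [20, 30, 30, 30, 50, 70, 90]


def implied_probabilities(values):
    """Probability of each distinct value under a uniform draw from the list."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


gain_probs = implied_probabilities(min_split_gain)
leaves_probs = implied_probabilities(num_leaves)
```

Under this reading, `min_split_gain` is 0 in half of the draws, and `num_leaves` is 30 in three draws out of seven.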

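For levels 10-12, 10 or fewer iterations were enough.

The other levels took up to 50; the best model was only a few percent better than the next-best model.

At levels 10-12, accuracy was about the same as LightGBM.

At the other levels, accuracy was not good.

Used the existing features, WaveNets, and embedding layers.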

Range-Blended Gradient Boosting

This is a technique that is essential for low-n gradient-boosted time series forecasting.

The idea is simple: move a target and any scaled features around to blend them across the histograms. This prevents overfitting and ensures the model generalizes well even in the face of drift or non-stationarity.

This technique was also used in the COVID-19 competitions.

https://www.kaggle.com/david1013/covid-19-daily-counts-a6-bag-1

https://www.kaggle.com/david1013/covid-19-daily-counts-a3-bag-1

3rd place

21st place in Accuracy

Uncertainty

From the level-12 Accuracy predictions, built the medians for levels 1 through 12.
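Building the higher-level medians from level-12 predictions can be sketched as a bottom-up sum. This assumes the aggregation is a plain sum over the hierarchy and that the point forecast is treated as the median; the ids and values are toy data, and the level numbers in the comments follow the M5 hierarchy as I understand it.

```python
from collections import defaultdict


def aggregate(level12_preds, key_fn):
    """Sum level-12 (item, store) predictions into a coarser level."""
    out = defaultdict(float)
    for (item_id, store_id), value in level12_preds.items():
        out[key_fn(item_id, store_id)] += value
    return dict(out)


# Toy level-12 point forecasts for one day, keyed by (item_id, store_id).
level12 = {
    ("FOODS_1_001", "CA_1"): 2.0,
    ("FOODS_1_001", "TX_1"): 1.0,
    ("HOBBIES_1_001", "CA_1"): 0.5,
}

total = aggregate(level12, lambda item, store: "Total")  # level 1
by_store = aggregate(level12, lambda item, store: store)  # level 3
by_item = aggregate(level12, lambda item, store: item)    # level 10
```

The remaining levels follow the same pattern with a different `key_fn`, for example one that maps a store to its state or an item to its category.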

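From these level 1-12 medians, predicted coefficients (how many times the median each quantile should be).

Levels 1-9 used a normal distribution; levels 10-12 used a skew-normal (SN) distribution.

At every level, the last of the nine quantiles was additionally multiplied by a coefficient of 1.02 to 1.03.

Only level 12 was hard to estimate accurately, so it was tuned individually.

Late submissions showed that the Uncertainty score correlated very well with the Accuracy score.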
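The "median times coefficient" idea can be sketched as below, using only the standard library. Everything about how the coefficients are derived is an assumption here: `rel_spread` (spread relative to the median) and `alpha` (skew-normal shape) are hypothetical parameters, the skew-normal quantiles are re-centered so the median coefficient stays at 1, and 1.02 is one value picked from the 1.02-1.03 range. The skew-normal CDF is obtained by numerical integration of its density, which is a simple stand-in for a proper statistics library.

```python
from statistics import NormalDist

QUANTILES = [0.005, 0.025, 0.165, 0.25, 0.5, 0.75, 0.835, 0.975, 0.995]
STD_NORMAL = NormalDist()


def skew_normal_cdf(x, alpha, lo=-10.0, steps=400):
    """CDF of the standard skew-normal via trapezoidal integration of its pdf."""
    if x <= lo:
        return 0.0

    def pdf(t):
        return 2.0 * STD_NORMAL.pdf(t) * STD_NORMAL.cdf(alpha * t)

    h = (x - lo) / steps
    total = 0.5 * (pdf(lo) + pdf(x))
    total += sum(pdf(lo + k * h) for k in range(1, steps))
    return total * h


def skew_normal_ppf(q, alpha):
    """Invert the CDF by bisection on [-10, 10]."""
    lo, hi = -10.0, 10.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if skew_normal_cdf(mid, alpha) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def quantile_coefficients(rel_spread, alpha=0.0, last_boost=1.02):
    """Multipliers to apply to the median, one per quantile (hypothetical form)."""
    if alpha == 0.0:
        z = [STD_NORMAL.inv_cdf(q) for q in QUANTILES]  # normal case
    else:
        z_median = skew_normal_ppf(0.5, alpha)
        z = [skew_normal_ppf(q, alpha) - z_median for q in QUANTILES]
    coefs = [max(0.0, 1.0 + rel_spread * zi) for zi in z]
    coefs[-1] *= last_boost  # extra widening of the last quantile
    return coefs


median = 100.0
normal_q = [median * c for c in quantile_coefficients(rel_spread=0.1)]
skewed_q = [median * c for c in quantile_coefficients(rel_spread=0.5, alpha=3.0)]
```

With a positive `alpha` the upper quantiles sit further from the median than the lower ones, which is the kind of asymmetry one would expect for sparse item-level sales.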