Handling Missing Data
Coding
code: Python
import pandas as pd
from io import StringIO
# Create sample data
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
# Load the sample data
df = pd.read_csv(StringIO(csv_data))
df
--------------------------------------------------------------------------
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
--------------------------------------------------------------------------
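Before dropping or imputing anything, it can help to check how many values are missing in each column. The following is a minimal sketch that rebuilds the same sample DataFrame so that it runs on its own; the name missing_per_column is introduced here for illustration only.

```python
import pandas as pd
from io import StringIO

# Same sample data as above, rebuilt so this block is self-contained
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# isnull() marks each NaN cell as True; sum() then counts them per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

In this sample, columns C and D each contain one missing value, and A and B contain none.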
Removing missing values
code: Python
# Drop rows that contain missing values
df.dropna()
--------------------------------------------------------------------------
     A    B    C    D
0  1.0  2.0  3.0  4.0
--------------------------------------------------------------------------
code: Python
# Drop columns that contain missing values
df.dropna(axis=1)
--------------------------------------------------------------------------
      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0
--------------------------------------------------------------------------
code: Python
# Drop only rows in which every column is NaN
df.dropna(how='all')
--------------------------------------------------------------------------
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
--------------------------------------------------------------------------
code: Python
# Drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)
--------------------------------------------------------------------------
     A    B    C    D
0  1.0  2.0  3.0  4.0
--------------------------------------------------------------------------
code: Python
# Drop only rows where a specific column (here, 'C') contains NaN
df.dropna(subset=['C'])
--------------------------------------------------------------------------
      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN
--------------------------------------------------------------------------
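Note that each dropna call above returns a new DataFrame and leaves df itself unchanged, which is why every example starts again from the full three rows. A small sketch of this behavior follows; df_clean is a name chosen here for illustration.

```python
import pandas as pd
from io import StringIO

# Same sample data as above, rebuilt so this block is self-contained
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# dropna() returns a filtered copy; assign it to keep the result
df_clean = df.dropna()

print(df.shape)        # the original still has all rows
print(df_clean.shape)  # only the complete row remains
```

To keep the cleaned data, assign the result to a variable as done here.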
Imputing missing values
code: Python
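import numpy as np
from sklearn.impute import SimpleImputer
# Create the imputer instance (fill with the column mean).
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# SimpleImputer replaces it and always imputes column-wise (no axis argument).
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit the imputer to the data
imr = imr.fit(df.values)
# Run the imputation
imputed_data = imr.transform(df.values)
imputed_data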
--------------------------------------------------------------------------
array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
--------------------------------------------------------------------------
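Mean imputation can also be sketched with pandas alone, without scikit-learn. This is an alternative to the approach above, not a replacement for it: fillna accepts the Series returned by df.mean() and fills each column's NaN with that column's mean. The name df_filled is introduced here for illustration.

```python
import pandas as pd
from io import StringIO

# Same sample data as above, rebuilt so this block is self-contained
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# df.mean() computes per-column means, skipping NaN by default;
# fillna() then fills each column's NaN with the matching mean
df_filled = df.fillna(df.mean())
print(df_filled)
```

Here the missing value in C becomes 7.5 (the mean of 3.0 and 12.0) and the one in D becomes 6.0 (the mean of 4.0 and 8.0). The result stays a DataFrame with its column labels, whereas the scikit-learn transform returns a NumPy array.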