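Handling Missing Values
Dropping missing values
To drop missing values, the dropna method is convenient. Called on a Series, dropna returns a new Series containing only the non-missing data and their index labels.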
code: Python
import numpy as np
import pandas as pd
code: Python
from numpy import nan as NA

# Sample Series with missing values at index 1 and 3
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
--------------------------------------------------------------------------
0 1.0
2 3.5
4 7.0
dtype: float64
--------------------------------------------------------------------------
code: Python
# Equivalent to dropna: keep only the entries that are not missing
data[data.notna()]
--------------------------------------------------------------------------
0 1.0
2 3.5
4 7.0
dtype: float64
--------------------------------------------------------------------------
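The equivalence above comes from boolean indexing: notna produces a True/False mask, and indexing with that mask keeps only the True positions. The following is a minimal sketch of that idea; the Series `s` and its values are illustrative and are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Illustrative Series (hypothetical values) with two missing entries
s = pd.Series([1, np.nan, 3.5, np.nan, 7])

mask = s.notna()        # True where a value is present, False where it is missing
by_mask = s[mask]       # boolean indexing keeps only the rows where mask is True
by_dropna = s.dropna()  # the convenience method

# Both approaches keep the same values and the same index labels
print(by_mask.equals(by_dropna))
```

Note that the original index labels are preserved, so the result is not renumbered from zero.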
code: Python
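# Sample DataFrame containing missing values
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data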
--------------------------------------------------------------------------
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
--------------------------------------------------------------------------
code: Python
# Drop every row that contains at least one missing value
cleaned = data.dropna()
cleaned
--------------------------------------------------------------------------
0 1 2
0 1.0 6.5 3.0
--------------------------------------------------------------------------
code: Python
# Drop only the rows in which every value is missing
data.dropna(how='all')
--------------------------------------------------------------------------
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
--------------------------------------------------------------------------
code: Python
# Add a column labeled 4 that consists entirely of missing values
data[4] = NA
data
--------------------------------------------------------------------------
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
--------------------------------------------------------------------------
code: Python
# Pass axis=1 to drop columns instead of rows (here, columns that are entirely missing)
data.dropna(axis=1, how='all')
--------------------------------------------------------------------------
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
--------------------------------------------------------------------------
code: Python
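df = pd.DataFrame(np.random.randn(7, 3))
# Introduce missing values: the first 4 rows of column 1 and the first 2 rows of column 2
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df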
--------------------------------------------------------------------------
0 1 2
0 -0.204708 NaN NaN
1 -0.555730 NaN NaN
2 0.092908 NaN 0.769023
3 1.246435 NaN -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
6 1.669025 -0.438570 -0.539741
--------------------------------------------------------------------------
code: Python
df.dropna()
--------------------------------------------------------------------------
0 1 2
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
6 1.669025 -0.438570 -0.539741
--------------------------------------------------------------------------
code: Python
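# thresh=2 keeps only the rows with at least 2 non-missing values
# (with 3 columns, this drops the rows that have 2 or more missing values)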
df.dropna(thresh=2)
--------------------------------------------------------------------------
0 1 2
2 0.092908 NaN 0.769023
3 1.246435 NaN -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
6 1.669025 -0.438570 -0.539741
--------------------------------------------------------------------------
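Because thresh counts non-missing values rather than missing ones, it can help to compare it with an explicit count. The sketch below assumes a small hypothetical DataFrame `frame` (not from the original notebook) and checks that `dropna(thresh=2)` keeps the same rows as filtering on the per-row count of non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows have 2, 0, 2, and 3 non-missing values respectively
frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 2.0, 5.0],
    "c": [3.0, np.nan, 6.0, 7.0],
})

# Count the non-missing values in each row
non_missing = frame.notna().sum(axis=1)

kept_by_thresh = frame.dropna(thresh=2)     # keep rows with >= 2 non-missing values
kept_by_count = frame[non_missing >= 2]     # the same rule, written out by hand

print(kept_by_thresh.equals(kept_by_count))
```

Only row 1, which has no valid values at all, falls below the threshold in this example.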
Filling in missing values
Rather than dropping missing values, the fillna method fills the "holes" they leave. Calling fillna with a value as its argument replaces every missing value with that value.
code: Python
df.fillna(0)
--------------------------------------------------------------------------
0 1 2
0 -0.204708 0.000000 0.000000
1 -0.555730 0.000000 0.000000
2 0.092908 0.000000 0.769023
3 1.246435 0.000000 -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
6 1.669025 -0.438570 -0.539741
--------------------------------------------------------------------------
code: Python
# Passing a dict fills each column with a different value (the keys are column labels)
df.fillna({1: 0.5, 2: 0})
--------------------------------------------------------------------------
0 1 2
0 -0.204708 0.500000 0.000000
1 -0.555730 0.500000 0.000000
2 0.092908 0.500000 0.769023
3 1.246435 0.500000 -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
6 1.669025 -0.438570 -0.539741
--------------------------------------------------------------------------
code: Python
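# inplace=True modifies the calling object directly instead of returning a filled copy;
# the return value is discarded here by assigning it to _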
_ = df.fillna(0, inplace=True)
df
--------------------------------------------------------------------------
0 1 2
0 -0.204708 0.000000 0.000000
1 -0.555730 0.000000 0.000000
2 0.092908 0.000000 0.769023
3 1.246435 0.000000 -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
6 1.669025 -0.438570 -0.539741
--------------------------------------------------------------------------
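The difference between the default behavior and inplace=True can be checked directly. This is a minimal sketch with a hypothetical two-column DataFrame `frame`; it shows that a plain fillna call leaves the original untouched, while the in-place call changes it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
frame = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 2.0]})

# Default: fillna returns a new, filled DataFrame
filled = frame.fillna(0)

# The original still contains its missing values at this point
still_missing = int(frame.isna().sum().sum())

# In-place: the original object itself is modified
frame.fillna(0, inplace=True)

print(still_missing)         # number of missing values before the in-place call
print(frame.equals(filled))  # after the in-place call, both frames match
```

Assigning the result back (`frame = frame.fillna(0)`) is the more common style and avoids surprises when other variables refer to the same object.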
code: Python
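df = pd.DataFrame(np.random.randn(6, 3))
# Introduce missing values: column 1 from row 2 onward, column 2 from row 4 onward
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df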
--------------------------------------------------------------------------
0 1 2
0 0.476985 3.248944 -1.021228
1 -0.577087 0.124121 0.302614
2 0.523772 NaN 1.343810
3 -0.713544 NaN -2.370232
4 -1.860761 NaN NaN
5 -1.265934 NaN NaN
--------------------------------------------------------------------------
code: Python
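# Forward fill: propagate the last valid value in each column downward
# (df.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions)
df.ffill()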
--------------------------------------------------------------------------
0 1 2
0 0.476985 3.248944 -1.021228
1 -0.577087 0.124121 0.302614
2 0.523772 0.124121 1.343810
3 -0.713544 0.124121 -2.370232
4 -1.860761 0.124121 -2.370232
5 -1.265934 0.124121 -2.370232
--------------------------------------------------------------------------
code: Python
# Forward fill at most 2 consecutive missing values in each column
df.ffill(limit=2)
--------------------------------------------------------------------------
0 1 2
0 0.476985 3.248944 -1.021228
1 -0.577087 0.124121 0.302614
2 0.523772 0.124121 1.343810
3 -0.713544 0.124121 -2.370232
4 -1.860761 NaN -2.370232
5 -1.265934 NaN -2.370232
--------------------------------------------------------------------------
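The limit argument applies to each run of consecutive missing values, not to the column as a whole. The following sketch uses a hypothetical Series `s` with a gap of three missing values followed by a gap of one, to illustrate that only the part of a gap beyond the limit stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: a gap of three missing values, then a gap of one
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

full = s.ffill()            # every gap is filled with the last valid value
capped = s.ffill(limit=2)   # at most 2 consecutive missing values are filled per gap

print(full.tolist())
print(capped.isna().tolist())
```

In `capped`, the third value of the first gap (index 3) remains missing, while the single-value gap at the end is filled completely.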
code: Python
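# Fill with the mean of the non-missing values
# (data is redefined as the Series from the beginning of this section)
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())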
--------------------------------------------------------------------------
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
--------------------------------------------------------------------------
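The same idea extends to a DataFrame: passing the Series of column means to fillna fills each column with its own mean. This is a sketch under the assumption of a small hypothetical DataFrame `frame`; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# mean() skips missing values by default and returns one mean per column
col_means = frame.mean()

# fillna matches the labels of col_means to the column labels of frame
filled = frame.fillna(col_means)

print(filled)
```

Each missing entry is replaced by the mean of its own column, so column "a" is filled with 2.0 and column "b" with 6.0.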