Data Transformation
Removing Duplicates
code: Python
import numpy as np
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
--------------------------------------------------------------------------
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
--------------------------------------------------------------------------
code: Python
data.duplicated()
--------------------------------------------------------------------------
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
--------------------------------------------------------------------------
code: Python
data.drop_duplicates()
--------------------------------------------------------------------------
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
--------------------------------------------------------------------------
code: Python
data['v1'] = range(7)
data
--------------------------------------------------------------------------
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
--------------------------------------------------------------------------
code: Python
# For duplicated keys, keep only the first occurrence (the default)
data.drop_duplicates(['k1'])
--------------------------------------------------------------------------
k1 k2 v1
0 one 1 0
1 two 1 1
--------------------------------------------------------------------------
code: Python
# For duplicated keys, keep only the last occurrence
data.drop_duplicates(['k1', 'k2'], keep='last')
--------------------------------------------------------------------------
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6
--------------------------------------------------------------------------
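The examples above keep one row from each group of duplicates. As a small sketch (not part of the original notes), `keep=False` is a third option that treats every member of a duplicated group the same way: `duplicated` flags all of them and `drop_duplicates` removes all of them.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False flags every row that belongs to a duplicated group,
# not only the later repeats
all_dupes = data.duplicated(keep=False)
print(data[all_dupes])

# With keep=False, drop_duplicates discards all of those rows
unique_only = data.drop_duplicates(keep=False)
print(unique_only)
```

Here rows 5 and 6 are identical, so both are flagged and both are dropped.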
Transforming Data Using a Function or Mapping
code: Python
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
--------------------------------------------------------------------------
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
--------------------------------------------------------------------------
code: Python
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
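# Normalize case first, because the dict keys are all lowercase
lowercased = data['food'].str.lower()
data['animal'] = lowercased.map(meat_to_animal)
data
# The same mapping can be written in one line:
# data['food'].map(lambda x: meat_to_animal[x.lower()])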
--------------------------------------------------------------------------
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
--------------------------------------------------------------------------
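The dict-based `map` and the one-line lambda version behave differently when a value has no entry in the mapping. The sketch below illustrates this with a shortened dict and a made-up value (`'tofu'`) that does not appear in the original data.

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
# 'tofu' is a hypothetical value with no entry in the mapping
food = pd.Series(['Bacon', 'pastrami', 'tofu'])

# Dict-based map: values missing from the dict become NaN
mapped = food.str.lower().map(meat_to_animal)
print(mapped)

# Lambda-based lookup: a missing key raises KeyError instead
try:
    food.map(lambda x: meat_to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True
print(raised)
```

The dict form is the more forgiving choice when the data may contain values you have not listed.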
Replacing Values
code: Python
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
--------------------------------------------------------------------------
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
--------------------------------------------------------------------------
code: Python
data.replace(-999, np.nan)
--------------------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
--------------------------------------------------------------------------
code: Python
data.replace([-999, -1000], np.nan)
--------------------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
--------------------------------------------------------------------------
code: Python
data.replace([-999, -1000], [np.nan, 0])
--------------------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
--------------------------------------------------------------------------
code: Python
data.replace({-999: np.nan, -1000: 0})
--------------------------------------------------------------------------
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
--------------------------------------------------------------------------
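Every `replace` call above returns a new Series and leaves `data` untouched, which is why the same `data` can be reused from cell to cell. A minimal sketch that checks this (the name `cleaned` is introduced here for illustration):

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# replace returns a new Series; assign it to keep the result
cleaned = data.replace({-999: np.nan, -1000: 0})

print((data == -999).sum())   # sentinel values are still in the original
print(cleaned.isna().sum())   # and have become NaN in the result
```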
Renaming Axis Indexes
code: Python
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
--------------------------------------------------------------------------
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
--------------------------------------------------------------------------
code: Python
# Take the first four characters and upper-case them (matches the output below)
transform = lambda x: x[:4].upper()
data.index = data.index.map(transform)
data
--------------------------------------------------------------------------
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
--------------------------------------------------------------------------
code: Python
data.rename(index=str.title, columns=str.upper)
--------------------------------------------------------------------------
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
--------------------------------------------------------------------------
code: Python
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
--------------------------------------------------------------------------
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
--------------------------------------------------------------------------
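Assigning to `data.index` changes the DataFrame itself, whereas `rename` returns a modified copy. The following sketch, which starts again from the original labels, shows that the source DataFrame keeps its index after `rename`:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# rename accepts a dict for some labels and a function for others
renamed = data.rename(index={'Ohio': 'Indiana'}, columns=str.upper)

print(list(renamed.index))    # new labels on the copy
print(list(data.index))       # original labels are unchanged
```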
Discretization and Binning
code: Python
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# Split into the bins 18-25, 26-35, 36-60, and 61-100
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
--------------------------------------------------------------------------
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
--------------------------------------------------------------------------
code: Python
cats.codes
--------------------------------------------------------------------------
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
--------------------------------------------------------------------------
code: Python
cats.categories
--------------------------------------------------------------------------
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')
--------------------------------------------------------------------------
code: Python
pd.value_counts(cats)
--------------------------------------------------------------------------
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
--------------------------------------------------------------------------
code: Python
# Make the left edge of each bin closed (inclusive) and the right edge open
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
--------------------------------------------------------------------------
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
--------------------------------------------------------------------------
code: Python
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
# The labels argument assigns a name to each bin
pd.cut(ages, bins, labels=group_names)
--------------------------------------------------------------------------
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
--------------------------------------------------------------------------
code: Python
data = np.random.rand(20)
# Split data into 4 equal-width bins
# precision=2 limits the bin edges to two decimal places
pd.cut(data, 4, precision=2)
--------------------------------------------------------------------------
[(0.34, 0.55], (0.34, 0.55], (0.76, 0.97], (0.76, 0.97], (0.34, 0.55], ..., (0.34, 0.55], (0.34, 0.55], (0.55, 0.76], (0.34, 0.55], (0.12, 0.34]]
Length: 20
Categories (4, interval[float64]): [(0.12, 0.34] < (0.34, 0.55] < (0.55, 0.76] < (0.76, 0.97]]
--------------------------------------------------------------------------
code: Python
data = np.random.randn(1000)
# Split into 4 quartile bins (with cut, the bins would not hold equal numbers of points)
cats = pd.qcut(data, 4)
cats
--------------------------------------------------------------------------
[(-0.0265, 0.62], (0.62, 3.928], (-0.68, -0.0265], (0.62, 3.928], (-0.0265, 0.62], ..., (-0.68, -0.0265], (-0.68, -0.0265], (-2.9499999999999997, -0.68], (0.62, 3.928], (-0.68, -0.0265]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.68] < (-0.68, -0.0265] < (-0.0265, 0.62] < (0.62, 3.928]]
--------------------------------------------------------------------------
code: Python
pd.value_counts(cats)
--------------------------------------------------------------------------
(0.62, 3.928] 250
(-0.0265, 0.62] 250
(-0.68, -0.0265] 250
(-2.9499999999999997, -0.68] 250
dtype: int64
--------------------------------------------------------------------------
code: Python
# Custom quantile edges (values between 0 and 1, inclusive)
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
--------------------------------------------------------------------------
[(-0.0265, 1.286], (-0.0265, 1.286], (-1.187, -0.0265], (-0.0265, 1.286], (-0.0265, 1.286], ..., (-1.187, -0.0265], (-1.187, -0.0265], (-2.9499999999999997, -1.187], (-0.0265, 1.286], (-1.187, -0.0265]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -1.187] < (-1.187, -0.0265] < (-0.0265, 1.286] < (1.286, 3.928]]
--------------------------------------------------------------------------
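Which side of a bin is closed matters most for values that sit exactly on an edge. The sketch below uses a few hand-picked boundary ages (not from the original data) to compare the default right-closed bins, `right=False`, and `include_lowest=True`. A code of -1 means the value fell outside every bin and became NaN.

```python
import pandas as pd

# Hand-picked boundary values, chosen only for illustration
ages = [18, 25, 26, 100, 101]
bins = [18, 25, 35, 60, 100]

# Default: bins are (18, 25], (25, 35], ... so 18 itself is excluded
right_closed = pd.cut(ages, bins)
print(right_closed.codes)

# right=False: bins are [18, 25), [25, 35), ... so 100 is excluded
left_closed = pd.cut(ages, bins, right=False)
print(left_closed.codes)

# include_lowest=True keeps right-closed bins but also admits the lowest edge
with_lowest = pd.cut(ages, bins, include_lowest=True)
print(with_lowest.codes)
```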
Detecting and Filtering Outliers
code: Python
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
--------------------------------------------------------------------------
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.008212 -0.002558 -0.059165 -0.048681
std 1.041077 1.019451 0.980169 1.003074
min -3.183867 -3.481593 -3.194414 -3.108915
25% -0.746147 -0.699027 -0.753809 -0.736424
50% 0.005656 0.039159 -0.055737 -0.037073
75% 0.696967 0.705667 0.616784 0.603749
max 3.189940 2.961194 3.023720 2.916153
--------------------------------------------------------------------------
code: Python
# Treat values whose absolute value exceeds 3 as outliers
col = data[2]
col[np.abs(col) > 3]
--------------------------------------------------------------------------
5 3.248944
102 3.176873
324 3.260383
499 -3.056990
586 -3.184377
Name: 2, dtype: float64
--------------------------------------------------------------------------
code: Python
# Select every row that has at least one value with absolute value above 3
data[(np.abs(data) > 3).any(axis=1)]
--------------------------------------------------------------------------
0 1 2 3
25 0.336788 -3.333767 -1.240685 -0.650855
107 -3.018842 -0.298748 0.406954 0.183282
131 0.781753 -0.555434 -0.048478 -3.108915
262 -3.183867 1.050471 -1.042736 1.680374
309 -3.140963 -1.509976 -0.389818 -0.273253
474 1.090038 -0.848098 -3.194414 0.077839
504 0.003349 -0.011807 3.023720 -1.105312
533 0.452649 -3.481593 0.789944 1.737746
702 3.082067 -0.516982 0.251909 -0.029354
730 3.189940 0.070978 0.516982 -0.805171
894 -0.436479 0.901529 -3.044612 -1.193980
928 -1.148738 -3.170292 -1.017073 -1.147658
--------------------------------------------------------------------------
code: Python
# np.sign() returns 1 for positive values, -1 for negative values, and 0 for zero
np.sign(data).head()
--------------------------------------------------------------------------
0 1 2 3
0 -1.0 1.0 -1.0 1.0
1 -1.0 1.0 -1.0 -1.0
2 -1.0 -1.0 -1.0 -1.0
3 1.0 1.0 1.0 1.0
4 -1.0 -1.0 1.0 -1.0
--------------------------------------------------------------------------
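One common use of `np.sign`, sketched here as an assumption about where these notes are heading, is to cap outliers at the threshold rather than drop them: multiplying the sign by 3 yields +3 or -3, which replaces any value outside the interval [-3, 3]. The seed and the name `capped` are chosen for this example only.

```python
import numpy as np
import pandas as pd

np.random.seed(12345)  # arbitrary seed so the example is repeatable
data = pd.DataFrame(np.random.randn(1000, 4))

# Work on a copy so the raw data stays available
capped = data.copy()
# Wherever |value| > 3, substitute +3 or -3 according to the sign
capped[np.abs(capped) > 3] = np.sign(capped) * 3

print(capped.describe().loc[['min', 'max']])
```

For this particular threshold, `data.clip(-3, 3)` should give the same result in a single call.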
Permutation and Random Sampling
code: Python
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df
--------------------------------------------------------------------------
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
--------------------------------------------------------------------------
code: Python
# Generate a random ordering of the integers 0 through 4
sampler = np.random.permutation(5)
sampler
--------------------------------------------------------------------------
array([0, 3, 2, 1, 4])
--------------------------------------------------------------------------
code: Python
df.take(sampler)
--------------------------------------------------------------------------
0 1 2 3
0 0 1 2 3
3 12 13 14 15
2 8 9 10 11
1 4 5 6 7
4 16 17 18 19
--------------------------------------------------------------------------
code: Python
# Draw a random subset without replacement (a row drawn once cannot be drawn again)
df.sample(n=3)
--------------------------------------------------------------------------
0 1 2 3
1 4 5 6 7
4 16 17 18 19
2 8 9 10 11
--------------------------------------------------------------------------
code: Python
choices = pd.Series([5, 7, -1, 6, 4])
# Sampling with replacement (the same element may be drawn more than once)
draws = choices.sample(n=10, replace=True)
draws
--------------------------------------------------------------------------
4 4
2 -1
2 -1
2 -1
0 5
4 4
0 5
3 6
0 5
2 -1
dtype: int64
--------------------------------------------------------------------------
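The samples above change on every run. As a small sketch, passing `random_state` to `sample` makes a draw repeatable, and `frac=1` gives a full shuffle similar to `take` with a permutation. The seed value 0 is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# The same random_state yields the same rows on every call
first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)
print(first.equals(second))

# frac=1 samples all rows without replacement, i.e. a shuffle
shuffled = df.sample(frac=1, random_state=0)
print(shuffled)
```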
Computing Indicator/Dummy Variables
code: Python
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
df
--------------------------------------------------------------------------
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b
--------------------------------------------------------------------------
code: Python
pd.get_dummies(df['key'])
--------------------------------------------------------------------------
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
--------------------------------------------------------------------------
code: Python
dummies = pd.get_dummies(df['key'], prefix='key')
# Double brackets keep data1 as a DataFrame so that join is available
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
--------------------------------------------------------------------------
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
--------------------------------------------------------------------------
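The 0/1 output shown above comes from an older pandas release. In pandas 2.0 and later, `get_dummies` returns boolean columns by default, so a sketch like the following, which passes `dtype=int`, may be needed to reproduce the integer table:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# dtype=int requests 0/1 integers instead of True/False
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)
```

Each row has exactly one indicator set to 1, matching the single category in `key`.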