How GroupBy works
code: Python
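import numpy as np
import pandas as pd
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df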
--------------------------------------------------------------------------
data1 data2 key1 key2
0 -0.204708 1.393406 a one
1 0.478943 0.092908 a two
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
4 1.965781 1.246435 a one
--------------------------------------------------------------------------
code: Python
# Group data1 by key1
grouped = df['data1'].groupby(df['key1'])
grouped.mean()
--------------------------------------------------------------------------
key1
a 0.746672
b -0.537585
Name: data1, dtype: float64
--------------------------------------------------------------------------
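Conceptually, the call above splits data1 by the values of key1, applies mean to each piece, and combines the results into one Series. Below is a minimal sketch of the same idea done by hand. The frame `df_demo` and its values are made up for illustration and are not the random data shown above.

```python
import pandas as pd

# Hypothetical fixed data so the result is reproducible
df_demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                        'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

# Split: select the rows for each distinct key.
# Apply: take the mean of data1 within each piece.
# Combine: collect the per-key results into a Series indexed by key.
manual = pd.Series({key: df_demo.loc[df_demo['key1'] == key, 'data1'].mean()
                    for key in sorted(df_demo['key1'].unique())})

# The same computation expressed with groupby
via_groupby = df_demo['data1'].groupby(df_demo['key1']).mean()

print(manual)
print(via_groupby)
```

Both Series hold the same per-key means; groupby simply does the splitting and combining for you.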
code: Python
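# Group data1 by key1 and key2, then take the mean of each group
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means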
--------------------------------------------------------------------------
key1  key2
a     one     0.880536
      two     0.478943
b     one    -0.519439
      two    -0.555730
Name: data1, dtype: float64
--------------------------------------------------------------------------
code: Python
# Number of rows in each (key1, key2) group
df.groupby(['key1', 'key2']).size()
--------------------------------------------------------------------------
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
--------------------------------------------------------------------------
Iterating over GroupBy objects
code: Python
# Iterating over a GroupBy object yields tuples of the group name (name)
# and the chunk of data belonging to that name (group)
for name, group in df.groupby('key1'):
    print(name)
    print(group)
--------------------------------------------------------------------------
a
data1 data2 key1 key2
0 -0.204708 1.393406 a one
1 0.478943 0.092908 a two
4 1.965781 1.246435 a one
b
data1 data2 key1 key2
2 -0.519439 0.281746 b one
3 -0.555730 0.769023 b two
--------------------------------------------------------------------------
code: Python
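# With multiple keys, the first element of each tuple is itself a tuple of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)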
--------------------------------------------------------------------------
('a', 'one')
data1 data2 key1 key2
0 -0.204708 1.393406 a one
4 1.965781 1.246435 a one
('a', 'two')
data1 data2 key1 key2
1 0.478943 0.092908 a two
('b', 'one')
data1 data2 key1 key2
2 -0.519439 0.281746 b one
('b', 'two')
data1 data2 key1 key2
3 -0.55573 0.769023 b two
--------------------------------------------------------------------------
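Because iteration yields (name, group) pairs, the pieces can be collected into a dict for later lookup by key. This is a minimal sketch on a small fixed frame; `df_demo`, `pieces`, and `pair_pieces` are illustrative names, not part of the notes above.

```python
import pandas as pd

# Hypothetical fixed data with the same key layout as df
df_demo = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                        'key2': ['one', 'two', 'one', 'two', 'one'],
                        'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

# Single key: dict keys are the group names
pieces = {name: group for name, group in df_demo.groupby('key1')}
print(pieces['b'])

# Multiple keys: dict keys are tuples of key values
pair_pieces = {keys: group
               for keys, group in df_demo.groupby(['key1', 'key2'])}
print(pair_pieces[('a', 'one')])
```

Each value in the dict is an ordinary DataFrame that keeps the original row labels.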
Grouping with dictionaries and Series
code: Python
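people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan  # set Wes's 'b' and 'c' values to NaN
people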
--------------------------------------------------------------------------
a b c d e
Joe 1.007189 -1.296221 0.274992 0.228913 1.352917
Steve 0.886429 -2.001637 -0.371843 1.669025 -0.438570
Wes -0.539741 NaN NaN -1.021228 -0.577087
Jim 0.124121 0.302614 0.523772 0.000940 1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
--------------------------------------------------------------------------
code: Python
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
'd': 'blue', 'e': 'red', 'f' : 'orange'}
by_column = people.groupby(mapping, axis=1)
by_column.sum()
--------------------------------------------------------------------------
blue red
Joe 0.503905 1.063885
Steve 1.297183 -1.553778
Wes -1.021228 -1.116829
Jim 0.524712 1.770545
Travis -4.230992 -2.405455
--------------------------------------------------------------------------
code: Python
map_series = pd.Series(mapping)
map_series
--------------------------------------------------------------------------
a red
b red
c blue
d blue
e red
f orange
dtype: object
--------------------------------------------------------------------------
code: Python
people.groupby(map_series, axis=1).count()
--------------------------------------------------------------------------
blue red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3
--------------------------------------------------------------------------
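The examples above group columns by passing axis=1. To my knowledge, pandas 2.1 and later deprecate that argument, and the suggested replacement is to transpose, group the rows, and transpose back. Here is a minimal sketch of that alternative, assuming a small fixed frame (`people_demo` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data; Wes has NaN in columns 'b' and 'c'
people_demo = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0],
                            [10.0, np.nan, np.nan, 40.0, 50.0]],
                           index=['Joe', 'Wes'],
                           columns=['a', 'b', 'c', 'd', 'e'])

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}  # 'f' is unused, which is fine

# Transpose so the column labels become row labels, group by the mapping,
# aggregate, then transpose back to the original orientation.
sums = people_demo.T.groupby(mapping).sum().T
counts = people_demo.T.groupby(mapping).count().T

print(sums)
print(counts)
```

As in the outputs above, sum skips NaN values and count reports only the non-NaN entries in each group.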
Grouping with functions
code: Python
# The function is called once per index label; here the names are grouped by their length
people.groupby(len).sum()
--------------------------------------------------------------------------
a b c d e
3 0.591569 -0.993608 0.798764 -0.791374 2.119639
5 0.886429 -2.001637 -0.371843 1.669025 -0.438570
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
--------------------------------------------------------------------------
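Passing a function as the key is roughly equivalent to computing the function on each index label first and grouping by the resulting values. This is a minimal sketch of that equivalence, assuming a small fixed frame (`people_demo` and the column `x` are made up for illustration).

```python
import pandas as pd

# Hypothetical fixed data indexed by the same names as people
people_demo = pd.DataFrame({'x': [1, 2, 3, 4, 5]},
                           index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# Group by a function applied to each index label
by_func = people_demo.groupby(len).sum()

# Same grouping with the key values computed explicitly
name_lengths = people_demo.index.map(len)
by_labels = people_demo.groupby(name_lengths).sum()

print(by_func)
print(by_labels)
```

Joe, Wes, and Jim all have three-letter names, so their rows fall into the same group.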
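code: Python
# Functions can be mixed with lists, dicts, or Series as group keys
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()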
--------------------------------------------------------------------------
              a         b         c         d         e
3 one -0.539741 -1.296221  0.274992 -1.021228 -0.577087
  two  0.124121  0.302614  0.523772  0.000940  1.343810
5 one  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
--------------------------------------------------------------------------
Grouping by index levels
code: Python
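columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)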
hier_df
--------------------------------------------------------------------------
cty          US                            JP
tenor         1         3         5         1         3
0      0.560145 -1.265934  0.119827 -1.063512  0.332883
1     -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
2      0.286350  0.377984 -0.753887  0.331286  1.349742
3      0.069877  0.246674 -0.011862  1.004812  1.327195
--------------------------------------------------------------------------
code: Python
hier_df.groupby(level='cty', axis=1).count()
--------------------------------------------------------------------------
cty JP US
0 2 3
1 2 3
2 2 3
3 2 3
--------------------------------------------------------------------------
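The level keyword also works when the hierarchical index sits on the rows, so the column-wise grouping above can be written without axis=1 by transposing first. This is a minimal sketch under that assumption, using fixed values instead of random ones (`hier_demo` is a made-up name for illustration).

```python
import numpy as np
import pandas as pd

# Same column hierarchy as hier_df, with reproducible values 0..19
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_demo = pd.DataFrame(np.arange(20).reshape(4, 5), columns=columns)

# Transpose so 'cty' and 'tenor' become row index levels,
# group by the 'cty' level, aggregate, then transpose back.
counts = hier_demo.T.groupby(level='cty').count().T
sums = hier_demo.T.groupby(level='cty').sum().T

print(counts)
print(sums)
```

Each row has two JP columns and three US columns, which matches the counts shown above.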