How GroupBy Works - Note for Machine Learning

How GroupBy Works

code: Python

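import numpy as np
import pandas as pd

# Sample data: two key columns and two columns of random values
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})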

--------------------------------------------------------------------------

      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
1  0.478943  0.092908    a  two
2 -0.519439  0.281746    b  one
3 -0.555730  0.769023    b  two
4  1.965781  1.246435    a  one

--------------------------------------------------------------------------

code: Python

# Group data1 by key1
grouped = df['data1'].groupby(df['key1'])
grouped.mean()

--------------------------------------------------------------------------

key1
a    0.746672
b   -0.537585
Name: data1, dtype: float64

--------------------------------------------------------------------------
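The grouped mean above follows a split-apply-combine pattern. As a rough illustration of that idea only (not how pandas implements it internally), the same steps can be sketched in plain Python. The helper name `group_mean` and the fixed sample values are assumptions made for this example.

```python
from collections import defaultdict


def group_mean(keys, values):
    """Hypothetical helper: average the values that share the same key."""
    groups = defaultdict(list)
    # Split: collect each value under its key
    for key, value in zip(keys, values):
        groups[key].append(value)
    # Apply the mean to each group, then combine into one dict sorted by key
    return {key: sum(vals) / len(vals) for key, vals in sorted(groups.items())}


key1 = ['a', 'a', 'b', 'b', 'a']
data1 = [1.0, 2.0, 3.0, 5.0, 6.0]  # fixed values instead of random ones
result = group_mean(key1, data1)
print(result)  # {'a': 3.0, 'b': 4.0}
```

Fixed values are used instead of `np.random.randn` so the result can be checked by hand.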

code: Python

# Passing a list of key Series produces a result with a hierarchical index
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

--------------------------------------------------------------------------

key1  key2
a     one     0.880536
      two     0.478943
b     one    -0.519439
      two    -0.555730
Name: data1, dtype: float64

--------------------------------------------------------------------------
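The result of grouping by two keys is a Series with a two-level index. The sketch below, which uses fixed values in place of the random data so the numbers can be verified, shows one way to read such a result: looking up a single (key1, key2) pair and pivoting the inner level into columns with `unstack()`. The variable names `a_one` and `wide` are chosen only for this example.

```python
import pandas as pd

# Fixed values stand in for the random data1 column
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

means = df['data1'].groupby([df['key1'], df['key2']]).mean()

# The index has one level per grouping key
print(means.index.names)

# Look up a single group by its (key1, key2) pair
a_one = means.loc[('a', 'one')]  # mean of 1.0 and 5.0

# Move the inner level (key2) into columns
wide = means.unstack()
print(wide)
```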

code: Python

# Count the rows in each (key1, key2) group
df.groupby(['key1', 'key2']).size()

--------------------------------------------------------------------------

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

--------------------------------------------------------------------------
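`size()` counts every row in a group, while `count()`, which appears later in these notes, counts only non-null values. The small example below uses made-up data with one missing value to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up data: group 'a' contains one missing value
df = pd.DataFrame({'key1': ['a', 'a', 'b'],
                   'data1': [1.0, np.nan, 3.0]})

sizes = df.groupby('key1').size()             # rows per group, NaN included
counts = df.groupby('key1')['data1'].count()  # non-null values per group

print(sizes)
print(counts)
```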

Iterating over a GroupBy object

code: Python

# Iterating over a GroupBy object yields 2-tuples of the group name (name)
# and the chunk of data belonging to that group (group)
for name, group in df.groupby('key1'):
    print(name)
    print(group)

--------------------------------------------------------------------------

a
      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
1  0.478943  0.092908    a  two
4  1.965781  1.246435    a  one
b
      data1     data2 key1 key2
2 -0.519439  0.281746    b  one
3 -0.555730  0.769023    b  two

--------------------------------------------------------------------------

code: Python

# With multiple keys, the first element of each tuple is a tuple of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

--------------------------------------------------------------------------

('a', 'one')
      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
4  1.965781  1.246435    a  one
('a', 'two')
      data1     data2 key1 key2
1  0.478943  0.092908    a  two
('b', 'one')
      data1     data2 key1 key2
2 -0.519439  0.281746    b  one
('b', 'two')
     data1     data2 key1 key2
3 -0.55573  0.769023    b  two

--------------------------------------------------------------------------
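Because iteration yields (name, group) pairs, the pairs can be collected into a dict when the pieces are needed by name. This is a minimal sketch with fixed values; the variable name `pieces` is chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Each (name, group) pair becomes one key/value entry of the dict
pieces = dict(list(df.groupby('key1')))

# Each value is a DataFrame holding the rows of that group
print(pieces['b'])
```

Each value keeps the original row labels, so `pieces['b']` holds rows 2 and 3 of `df`.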

Grouping with dicts and Series

code: Python

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan  # Set columns b and c to NaN in the 'Wes' row
people

--------------------------------------------------------------------------

               a         b         c         d         e
Joe     1.007189 -1.296221  0.274992  0.228913  1.352917
Steve   0.886429 -2.001637 -0.371843  1.669025 -0.438570
Wes    -0.539741       NaN       NaN -1.021228 -0.577087
Jim     0.124121  0.302614  0.523772  0.000940  1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757

--------------------------------------------------------------------------

code: Python

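# Map each column name to a group; the unused key 'f' is simply ignored
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}
# axis=1 groups the columns; pandas 2.1+ deprecates it in favor of
# people.T.groupby(mapping)
by_column = people.groupby(mapping, axis=1)
by_column.sum()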

--------------------------------------------------------------------------

            blue       red
Joe     0.503905  1.063885
Steve   1.297183 -1.553778
Wes    -1.021228 -1.116829
Jim     0.524712  1.770545
Travis -4.230992 -2.405455

--------------------------------------------------------------------------
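Recent pandas releases (2.1 and later) deprecate `axis=1` in `groupby`. One way to get the same column-wise grouping without it is to transpose, group the rows, and transpose back. The sketch below uses a small frame with fixed values so the sums can be checked by hand; the name `by_color` is chosen only for this example.

```python
import numpy as np
import pandas as pd

# Two rows of fixed values: Joe = 0..4, Steve = 5..9
people = pd.DataFrame(np.arange(10, dtype=float).reshape(2, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve'])

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the column labels become the index, group, then transpose back
by_color = people.T.groupby(mapping).sum().T
print(by_color)
```

As in the `axis=1` version, the key `'f'` matches no column and produces no group.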

code: Python

map_series = pd.Series(mapping)

map_series

--------------------------------------------------------------------------

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

--------------------------------------------------------------------------

code: Python

people.groupby(map_series, axis=1).count()

--------------------------------------------------------------------------

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3

--------------------------------------------------------------------------

Grouping with functions

code: Python

# The function is called once per index label; here names are grouped by length
people.groupby(len).sum()

--------------------------------------------------------------------------

          a         b         c         d         e
3  0.591569 -0.993608  0.798764 -0.791374  2.119639
5  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757

--------------------------------------------------------------------------
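When a function is passed as the key, it is applied to each index label and the return values become the group names. The sketch below makes that explicit by computing the keys up front with `Index.map` and comparing the two results. The single column of fixed values is made up for this example.

```python
import pandas as pd

people = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0]},
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

# Group by the length of each name
by_len = people.groupby(len).sum()

# Same grouping with the keys computed explicitly from the index labels
name_lengths = people.index.map(len)
explicit = people.groupby(name_lengths).sum()

print(by_len)
```

Joe, Wes, and Jim all have three letters, so their rows are summed together under the key 3.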

code: Python

# Functions can be mixed with arrays, dicts, or Series in a list of keys
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

--------------------------------------------------------------------------

              a         b         c         d         e
3 one -0.539741 -1.296221  0.274992 -1.021228 -0.577087
  two  0.124121  0.302614  0.523772  0.000940  1.343810
5 one  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757

--------------------------------------------------------------------------

Grouping by index levels

code: Python

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

--------------------------------------------------------------------------

cty          US                            JP
tenor         1         3         5         1         3
0      0.560145 -1.265934  0.119827 -1.063512  0.332883
1     -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
2      0.286350  0.377984 -0.753887  0.331286  1.349742
3      0.069877  0.246674 -0.011862  1.004812  1.327195

--------------------------------------------------------------------------

code: Python

hier_df.groupby(level='cty', axis=1).count()

--------------------------------------------------------------------------

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3

--------------------------------------------------------------------------
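The level-based count above relies on `axis=1`, which pandas 2.1 and later deprecate. A possible equivalent is to transpose the frame so the column MultiIndex becomes the row index, group by the level name, and transpose back. The frame below is filled with ones rather than random numbers; since `count()` only looks at non-null cells, the values themselves do not matter.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.ones((4, 5)), columns=columns)

# After transposing, 'cty' is a level of the row index and can be grouped on
counts = hier_df.T.groupby(level='cty').count().T
print(counts)
```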