Essential pandas functionality - Note for Machine Learning

Reindexing

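One of the most important methods on pandas objects is reindex. It creates a new object whose data is rearranged to conform to a new index; any label that was not present in the original index gets a missing value (NaN).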

code: Python

import numpy as np  # used by the later examples (np.arange, np.abs, np.random)
import pandas as pd

code: Python

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

obj

--------------------------------------------------------------------------

d 4.5

b 7.2

a -5.3

c 3.6

dtype: float64

--------------------------------------------------------------------------

code: Python

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

obj2

--------------------------------------------------------------------------

a -5.3

b 7.2

c 3.6

d 4.5

e NaN

dtype: float64

--------------------------------------------------------------------------
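As a small check of the behavior described above, the following sketch confirms that reindex leaves the original Series untouched and illustrates the fill_value argument, which substitutes a chosen value for labels that are new. The name `filled` and the choice of 0.0 are illustrative assumptions, not part of the original notes.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# reindex returns a new Series; 'e' is not in the original index, so it becomes NaN
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

# fill_value replaces the NaN that would otherwise appear for new labels
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

print(obj.index.tolist())  # the original object keeps its own index order
print(obj2.isna().sum())   # number of missing values introduced by reindexing
print(filled['e'])
```

Whether to fill with a constant or leave NaN depends on the downstream computation; NaN is usually the safer default because it keeps missing data visible.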

code: Python

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

obj3

--------------------------------------------------------------------------

0 blue

2 purple

4 yellow

dtype: object

--------------------------------------------------------------------------

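reindex accepts a method option that controls how values are filled in for newly introduced labels. ffill fills forward: each new label takes the value of the closest preceding label.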

code: Python

obj3.reindex(range(6), method='ffill')

--------------------------------------------------------------------------

0 blue

1 blue

2 purple

3 purple

4 yellow

5 yellow

dtype: object

--------------------------------------------------------------------------

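bfill fills backward: each new label takes the value of the closest following label, so a label after the last original one stays NaN.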

code: Python

obj3.reindex(range(6), method='bfill')

--------------------------------------------------------------------------

0 blue

1 purple

2 purple

3 yellow

4 yellow

5 NaN

dtype: object

--------------------------------------------------------------------------

code: Python

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

frame

--------------------------------------------------------------------------

Ohio Texas California

a 0 1 2

c 3 4 5

d 6 7 8

--------------------------------------------------------------------------

code: Python

states = ['Texas', 'Utah', 'California']

frame.reindex(columns=states)

--------------------------------------------------------------------------

Texas Utah California

a 1 NaN 2

c 4 NaN 5

d 7 NaN 8

--------------------------------------------------------------------------
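The column example above can be extended to both axes at once. The sketch below assumes the same `frame` as in the notes and passes `index` and `columns` together; the name `both` and the extra row label 'b' are illustrative assumptions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex rows and columns in a single call.
# Row 'b' and column 'Utah' do not exist in frame, so they are filled with NaN.
both = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Utah', 'California'])
print(both)
```

Because NaN cannot be stored in an integer column, the affected columns are converted to a floating-point dtype.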

Dropping entries from an axis

The drop method returns a new object with the specified entries removed from an axis.

code: Python

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

obj

--------------------------------------------------------------------------

a 0.0

b 1.0

c 2.0

d 3.0

e 4.0

dtype: float64

--------------------------------------------------------------------------

code: Python

new_obj = obj.drop('c')

new_obj

--------------------------------------------------------------------------

a 0.0

b 1.0

d 3.0

e 4.0

dtype: float64

--------------------------------------------------------------------------

code: Python

obj.drop(['d', 'c'])

--------------------------------------------------------------------------

a 0.0

b 1.0

e 4.0

dtype: float64

--------------------------------------------------------------------------

code: Python

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

data

--------------------------------------------------------------------------

one two three four

Ohio 0 1 2 3

Colorado 4 5 6 7

Utah 8 9 10 11

New York 12 13 14 15

--------------------------------------------------------------------------

code: Python

data.drop(['Colorado', 'Ohio'])

--------------------------------------------------------------------------

one two three four

Utah 8 9 10 11

New York 12 13 14 15

--------------------------------------------------------------------------

code: Python

# data.drop('two', axis=1)

data.drop(['two', 'four'], axis='columns')

--------------------------------------------------------------------------

one three

Ohio 0 2

Colorado 4 6

Utah 8 10

New York 12 14

--------------------------------------------------------------------------

To delete in place, modifying the object itself instead of creating a new one, pass inplace=True.

code: Python

obj.drop('c', inplace=True)

obj

--------------------------------------------------------------------------

a 0.0

b 1.0

d 3.0

e 4.0

dtype: float64

--------------------------------------------------------------------------
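Two details of drop are easy to miss, and the following sketch illustrates them under the assumption of the same `obj` as above: with inplace=True the call returns None, and dropping a label that does not exist raises KeyError unless errors='ignore' is passed. The label 'z' and the names `result`, `raised`, and `kept` are hypothetical.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# With inplace=True, obj itself is modified and the call returns None
result = obj.drop('c', inplace=True)

# Dropping a label that is not in the index raises KeyError by default
try:
    obj.drop('z')
    raised = False
except KeyError:
    raised = True

# errors='ignore' silently skips labels that are not present
kept = obj.drop('z', errors='ignore')

print(result, raised, kept.index.tolist())
```

Since the in-place form returns None, writing `obj = obj.drop('c', inplace=True)` would discard the data; assigning the result of the default (non in-place) call is usually less error-prone.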

Selecting data with loc and iloc

The loc and iloc indexers let you select a subset of the rows and columns of a DataFrame using NumPy-like notation. Use loc to select by axis labels and iloc to select by integer position.

code: Python

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

data

--------------------------------------------------------------------------

one two three four

Ohio 0 1 2 3

Colorado 4 5 6 7

Utah 8 9 10 11

New York 12 13 14 15

--------------------------------------------------------------------------

code: Python

data.loc['Colorado', ['two', 'three']]

--------------------------------------------------------------------------

two 5

three 6

Name: Colorado, dtype: int32

--------------------------------------------------------------------------

code: Python

data.iloc[2, [3, 0, 1]]

--------------------------------------------------------------------------

four 11

one 8

two 9

Name: Utah, dtype: int32

--------------------------------------------------------------------------

code: Python

data.loc[:'Utah', 'two']

--------------------------------------------------------------------------

Ohio 1

Colorado 5

Utah 9

Name: two, dtype: int32

--------------------------------------------------------------------------

code: Python

data.iloc[:, :3][data.three > 5]

--------------------------------------------------------------------------

one two three

Colorado 4 5 6

Utah 8 9 10

New York 12 13 14

--------------------------------------------------------------------------

code: Python

data.iloc[2]

--------------------------------------------------------------------------

one 8

two 9

three 10

four 11

Name: Utah, dtype: int32

--------------------------------------------------------------------------

code: Python

data.iloc[[1, 2], [3, 0, 1]]

--------------------------------------------------------------------------

four one two

Colorado 7 4 5

Utah 11 8 9

--------------------------------------------------------------------------
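One difference between the two indexers is worth seeing side by side: a label slice with loc includes its end point, while a position slice with iloc excludes it, as in ordinary Python slicing. The sketch below assumes the same `data` as above; `by_label` and `by_position` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label-based slice: both 'Utah' and 'two' are included in the result
by_label = data.loc['Ohio':'Utah', 'one':'two']

# Position-based slice: the stop positions (2) are excluded
by_position = data.iloc[0:2, 0:2]

print(by_label.shape)     # 3 rows, 2 columns
print(by_position.shape)  # 2 rows, 2 columns
```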

Arithmetic methods and fill values

The arithmetic methods include add (radd), sub (rsub), div (rdiv), floordiv (rfloordiv), mul (rmul), and pow (rpow). The r-prefixed variants perform the same operation with the operands swapped.

code: Python

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),

columns=list('abcd'))

df1

--------------------------------------------------------------------------

a b c d

0 0.0 1.0 2.0 3.0

1 4.0 5.0 6.0 7.0

2 8.0 9.0 10.0 11.0

--------------------------------------------------------------------------

code: Python

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),

columns=list('abcde'))

df2

--------------------------------------------------------------------------

a b c d e

0 0.0 1.0 2.0 3.0 4.0

1 5.0 6.0 7.0 8.0 9.0

2 10.0 11.0 12.0 13.0 14.0

3 15.0 16.0 17.0 18.0 19.0

--------------------------------------------------------------------------

code: Python

df1.add(df2, fill_value=0)

--------------------------------------------------------------------------

a b c d e

0 0.0 2.0 4.0 6.0 4.0

1 9.0 11.0 13.0 15.0 9.0

2 18.0 20.0 22.0 24.0 14.0

3 15.0 16.0 17.0 18.0 19.0

--------------------------------------------------------------------------
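To see what fill_value changes, it may help to compare against the plain + operator, which yields NaN wherever the two frames do not overlap. The sketch below assumes the same `df1` and `df2` as above and also checks one reversed method; the names `plain` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Plain addition aligns on both axes; non-overlapping cells become NaN
plain = df1 + df2

# fill_value=0 treats a value missing from one side as 0 before adding
filled = df1.add(df2, fill_value=0)

print(int(plain.isna().sum().sum()))   # cells outside the 3x4 overlap
print(int(filled.isna().sum().sum()))

# rsub swaps the operands: df1.rsub(1) computes 1 - df1
print(df1.rsub(1).equals(1 - df1))
```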

Operations between DataFrame and Series

code: Python

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame

--------------------------------------------------------------------------

b d e

Utah 0.0 1.0 2.0

Ohio 3.0 4.0 5.0

Texas 6.0 7.0 8.0

Oregon 9.0 10.0 11.0

--------------------------------------------------------------------------

code: Python

series = frame.iloc[0]

series

--------------------------------------------------------------------------

b 0.0

d 1.0

e 2.0

Name: Utah, dtype: float64

--------------------------------------------------------------------------

code: Python

frame - series

--------------------------------------------------------------------------

b d e

Utah 0.0 0.0 0.0

Ohio 3.0 3.0 3.0

Texas 6.0 6.0 6.0

Oregon 9.0 9.0 9.0

--------------------------------------------------------------------------
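The subtraction above matches the Series index against the DataFrame columns and broadcasts down the rows. To broadcast across the columns instead, matching on the row index, one option is the sub method with an explicit axis. This is a minimal sketch assuming the same `frame`; `col` and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# A column is a Series indexed by the row labels
col = frame['d']

# axis='index' matches col against the row index and subtracts it from every column
result = frame.sub(col, axis='index')
print(result)
```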

Function application and mapping

NumPy ufuncs (functions applied element by element to an array) also work on pandas objects.

code: Python

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

frame

--------------------------------------------------------------------------

b d e

Utah -0.204708 0.478943 -0.519439

Ohio -0.555730 1.965781 1.393406

Texas 0.092908 0.281746 0.769023

Oregon 1.246435 1.007189 -1.296221

--------------------------------------------------------------------------

code: Python

np.abs(frame)

--------------------------------------------------------------------------

b d e

Utah 0.204708 0.478943 0.519439

Ohio 0.555730 1.965781 1.393406

Texas 0.092908 0.281746 0.769023

Oregon 1.246435 1.007189 1.296221

--------------------------------------------------------------------------

code: Python

f = lambda x: x.max() - x.min()

frame.apply(f)

--------------------------------------------------------------------------

b 1.802165

d 1.684034

e 2.689627

dtype: float64

--------------------------------------------------------------------------

code: Python

frame.apply(f, axis='columns')

--------------------------------------------------------------------------

Utah 0.998382

Ohio 2.521511

Texas 0.676115

Oregon 2.542656

dtype: float64

--------------------------------------------------------------------------

code: Python

def f(x):
    # Return several values per column as a Series; apply assembles them into a DataFrame
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f)

--------------------------------------------------------------------------

b d e

min -0.555730 0.281746 -1.296221

max 1.246435 1.965781 1.393406

--------------------------------------------------------------------------

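To apply a function element by element, use the applymap method. (pandas 2.1 and later deprecate applymap in favor of DataFrame.map.)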

code: Python

fmt = lambda x: '%.2f' % x  # named fmt to avoid shadowing the built-in format

frame.applymap(fmt)

--------------------------------------------------------------------------

b d e

Utah -0.20 0.48 -0.52

Ohio -0.56 1.97 1.39

Texas 0.09 0.28 0.77

Oregon 1.25 1.01 -1.30

--------------------------------------------------------------------------
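For a single column, the element-wise counterpart is Series.map. The sketch below uses two rows of the values printed above so that the result is reproducible, and picks DataFrame.map when the installed pandas provides it, falling back to applymap otherwise. The names `fmt`, `elementwise`, `col`, and `whole` are illustrative assumptions.

```python
import pandas as pd

# Two rows copied from the frame output above, so the example is deterministic
frame = pd.DataFrame([[-0.204708, 0.478943, -0.519439],
                      [-0.555730, 1.965781, 1.393406]],
                     columns=list('bde'), index=['Utah', 'Ohio'])

fmt = lambda x: '%.2f' % x

# Series.map applies the function to each element of one column
col = frame['e'].map(fmt)

# DataFrame.map exists in newer pandas; older versions only have applymap
elementwise = frame.map if hasattr(pd.DataFrame, 'map') else frame.applymap
whole = elementwise(fmt)

print(col.tolist())
print(whole)
```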

Sorting and ranking

To sort the row or column index lexicographically, use the sort_index method. It returns a new, sorted object.

code: Python

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

obj.sort_index()

--------------------------------------------------------------------------

a 1

b 2

c 3

d 0

dtype: int64

--------------------------------------------------------------------------

code: Python

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

frame.sort_index()

--------------------------------------------------------------------------

d a b c

one 4 5 6 7

three 0 1 2 3

--------------------------------------------------------------------------

code: Python

frame.sort_index(axis=1)

--------------------------------------------------------------------------

a b c d

three 1 2 3 0

one 5 6 7 4

--------------------------------------------------------------------------

code: Python

frame.sort_index(axis=1, ascending=False)

--------------------------------------------------------------------------

d c b a

three 0 3 2 1

one 4 7 6 5

--------------------------------------------------------------------------

To sort by the values of a specific column, use sort_values with the by argument.

code: Python

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

frame.sort_values(by='b')

--------------------------------------------------------------------------

b a

2 -3 0

3 2 1

0 4 0

1 7 1

--------------------------------------------------------------------------
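sort_values also accepts a list of columns and an ascending flag. The sketch below, assuming the same `frame` as above, sorts first by 'a' and then by 'b', and separately sorts 'b' in descending order; `by_both` and `desc` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' first; ties in 'a' are broken by 'b'
by_both = frame.sort_values(by=['a', 'b'])

# Sort by 'b' from largest to smallest
desc = frame.sort_values(by='b', ascending=False)

print(by_both.index.tolist())
print(desc.index.tolist())
```

The original row labels travel with the rows, so the index of the result records where each row came from.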

code: Python

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

obj

--------------------------------------------------------------------------

0 7

1 -5

2 7

3 4

4 2

5 0

6 4

dtype: int64

--------------------------------------------------------------------------

code: Python

obj.rank()

--------------------------------------------------------------------------

0 6.5

1 1.0

2 6.5

3 4.5

4 3.0

5 2.0

6 4.5

dtype: float64

--------------------------------------------------------------------------

code: Python

obj.rank(method='first')

--------------------------------------------------------------------------

0 6.0

1 1.0

2 7.0

3 4.0

4 3.0

5 2.0

6 5.0

dtype: float64

--------------------------------------------------------------------------
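The two outputs above differ only in how ties are broken: the default assigns tied values the average of their ranks, while method='first' ranks them in order of appearance. The sketch below, assuming the same `obj`, shows a few other tie-breaking options; the variable names are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min' gives every member of a tied group the lowest rank in that group
rank_min = obj.rank(method='min')

# 'dense' is like 'min', but ranks increase by 1 between groups with no gaps
rank_dense = obj.rank(method='dense')

# ascending=False ranks the largest value first (ties still averaged by default)
rank_desc = obj.rank(ascending=False)

print(rank_min.tolist())
print(rank_dense.tolist())
print(rank_desc.tolist())
```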