Aggregating and Computing Summary Statistics - Note for Machine Learning

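Aggregation and summary statistics

pandas objects come with a set of common mathematical and statistical methods. Most of them fall into the category of reductions or summary statistics: they take a Series, or each row or column of a DataFrame, and compute a single value from it, such as the sum or the mean. By default these methods skip missing values (NaN).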

code: Python

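import numpy as np
import pandas as pd

# Four rows, two columns, with some missing values
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df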

--------------------------------------------------------------------------

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

--------------------------------------------------------------------------

code: Python

# Sum of each column; NaN values are skipped by default
df.sum()

--------------------------------------------------------------------------

one 9.25

two -5.80

dtype: float64

--------------------------------------------------------------------------

code: Python

# Sum across the columns, giving one value per row
df.sum(axis='columns')

--------------------------------------------------------------------------

a 1.40

b 2.60

c 0.00

d -0.55

dtype: float64

--------------------------------------------------------------------------

code: Python

# With skipna=False, any NaN in a row makes that row's mean NaN
df.mean(axis='columns', skipna=False)

--------------------------------------------------------------------------

a NaN

b 1.300

c NaN

d -0.275

dtype: float64

--------------------------------------------------------------------------
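The effect of `axis` and `skipna` is easier to check on a frame small enough to verify by hand. The following is a minimal sketch; the frame `toy` and its labels are made up for illustration and are not part of the data used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, small enough to verify by hand
toy = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                    'y': [10.0, 20.0, np.nan]},
                   index=['r1', 'r2', 'r3'])

# Default: reduce down the rows of each column, skipping NaN
col_sum = toy.sum()                      # x: 4.0, y: 30.0

# skipna=False: a single NaN makes the whole column result NaN
col_sum_strict = toy.sum(skipna=False)   # x: NaN, y: NaN

# axis='columns': reduce across the columns, one value per row label
row_mean = toy.mean(axis='columns')      # r1: 5.5, r2: 20.0, r3: 3.0
```

Note that `row_mean` for `r2` and `r3` is computed from the single non-missing value in each row, which is why `skipna=False` can be the safer choice when a partial row should not count.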

code: Python

# Index label at which each column reaches its maximum
df.idxmax()

--------------------------------------------------------------------------

one b

two d

dtype: object

--------------------------------------------------------------------------

code: Python

# Cumulative sum down each column (NaN positions stay NaN)
df.cumsum()

--------------------------------------------------------------------------

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

--------------------------------------------------------------------------

code: Python

# Several summary statistics at once
df.describe()

--------------------------------------------------------------------------

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000

--------------------------------------------------------------------------
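Each row of the `describe()` output corresponds to a standalone reduction method. As a rough sketch, the following rebuilds the same `df` and compares the summary for column `one` with the individual calls; the variable names (`summary`, `count_one`, and so on) are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

summary = df.describe()

# Standalone reductions on column 'one'; each one skips NaN by default
count_one = df['one'].count()           # number of non-NaN observations
mean_one = df['one'].mean()
std_one = df['one'].std()               # sample standard deviation (ddof=1)
q25_one = df['one'].quantile(0.25)      # linear interpolation by default
median_one = df['one'].median()         # same value as the '50%' row
```

The `count` row explains why the two columns are summarized over different numbers of observations: three values for `one` and two for `two`.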

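Correlation and covariance

Some summary statistics, such as correlation and covariance, are computed from pairs of variables.

The examples below use DataFrames of stock prices and trading volumes obtained from Yahoo! Finance.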

code: Python

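# Load the pickled price and volume DataFrames
price = pd.read_pickle('yahoo_price.pkl')
volume = pd.read_pickle('yahoo_volume.pkl')

# Returns: percent change of each price from the previous row
returns = price.pct_change()
returns.tail()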

--------------------------------------------------------------------------

                AAPL      GOOG       IBM      MSFT
Date
2016-10-17 -0.000680  0.001837  0.002072 -0.003483
2016-10-18 -0.000681  0.019616 -0.026168  0.007690
2016-10-19 -0.002979  0.007846  0.003583 -0.002255
2016-10-20 -0.000512 -0.005652  0.001719 -0.004867
2016-10-21 -0.003930  0.003011 -0.012474  0.042096

--------------------------------------------------------------------------

Compute the correlation between two Series.

code: Python

returns['MSFT'].corr(returns['IBM'])

# Equivalent, using attribute access: returns.MSFT.corr(returns.IBM)

--------------------------------------------------------------------------

0.4997636114415114

--------------------------------------------------------------------------

Compute the covariance between two Series.

code: Python

returns['MSFT'].cov(returns['IBM'])

--------------------------------------------------------------------------

8.870655479703546e-05

--------------------------------------------------------------------------
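The two quantities are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on synthetic data, since the pickled price files are not reproduced here; the series `x` and `y` and the seed are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, reproducible data (not the Yahoo! Finance returns)
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = 0.5 * x + pd.Series(rng.normal(size=200))

cov_xy = x.cov(y)      # sample covariance (ddof=1)
corr_xy = x.corr(y)    # Pearson correlation by default

# Correlation rebuilt from the covariance and the sample standard deviations
corr_manual = cov_xy / (x.std() * y.std())
```

Because the correlation is normalized this way, it always lies between -1 and 1, whereas the covariance depends on the scale of the data. That is why the covariance of daily returns is such a small number.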

code: Python

# Pairwise correlation matrix of all columns
returns.corr()

--------------------------------------------------------------------------

          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000

--------------------------------------------------------------------------

code: Python

# Pairwise covariance matrix of all columns
returns.cov()

--------------------------------------------------------------------------

          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215

--------------------------------------------------------------------------

code: Python

# Correlation of every column with a single Series
returns.corrwith(returns.IBM)

--------------------------------------------------------------------------

AAPL 0.386817

GOOG 0.405099

IBM 1.000000

MSFT 0.499764

dtype: float64

--------------------------------------------------------------------------

code: Python

# Passing a DataFrame correlates the columns that share the same name
returns.corrwith(volume)

--------------------------------------------------------------------------

AAPL -0.075565

GOOG -0.007067

IBM -0.204849

MSFT -0.092950

dtype: float64

--------------------------------------------------------------------------
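When `corrwith` receives a DataFrame, columns are paired by label rather than by position. A small sketch with synthetic frames makes this visible; `ret` and `vol` are hypothetical stand-ins for the returns and volume data, built so that the expected correlations are known in advance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range('2020-01-01', periods=50, freq='D')

# Hypothetical "returns" with two columns
ret = pd.DataFrame(rng.normal(size=(50, 2)), index=dates, columns=['A', 'B'])

# Hypothetical "volume", with the columns deliberately in a different order:
# 'A' is an increasing linear function of ret['A'], 'B' is ret['B'] negated
vol = pd.DataFrame({'B': -ret['B'], 'A': 2 * ret['A'] + 1}, index=dates)

# Columns are matched by name, so A pairs with A and B with B
pairwise = ret.corrwith(vol)   # A: 1.0, B: -1.0 (up to rounding)
```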

Unique values, value counts, and membership

Another group of related methods extracts information about the values contained in a one-dimensional Series.

code: Python

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

obj

--------------------------------------------------------------------------

0 c

1 a

2 d

3 a

4 a

5 b

6 b

7 c

8 c

dtype: object

--------------------------------------------------------------------------

code: Python

# Array of the distinct values, in order of first appearance
uniques = obj.unique()

uniques

--------------------------------------------------------------------------

array(['c', 'a', 'd', 'b'], dtype=object)

--------------------------------------------------------------------------

code: Python

# Frequency of each value, sorted by count in descending order
obj.value_counts()

--------------------------------------------------------------------------

c 3

a 3

b 2

d 1

dtype: int64

--------------------------------------------------------------------------
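To get relative frequencies instead of raw counts, `value_counts` accepts `normalize=True`. A minimal sketch using the same values as `obj`; the names `counts` and `freq` are chosen here for illustration.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# Raw counts, sorted in descending order by default
counts = obj.value_counts()

# Relative frequencies: each count divided by the number of non-missing values
freq = obj.value_counts(normalize=True)   # a: 3/9, c: 3/9, b: 2/9, d: 1/9
```

The frequencies sum to 1, which makes this form convenient for comparing Series of different lengths.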

code: Python

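# Top-level function form, which also accepts arrays; sort=False skips sorting by count
# Note: pd.value_counts is deprecated in recent pandas releases; obj.value_counts(sort=False) is the method form
pd.value_counts(obj.values, sort=False)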

--------------------------------------------------------------------------

a 3

b 2

d 1

c 3

dtype: int64

--------------------------------------------------------------------------

code: Python

obj

--------------------------------------------------------------------------

0 c

1 a

2 d

3 a

4 a

5 b

6 b

7 c

8 c

dtype: object

--------------------------------------------------------------------------

code: Python

# Boolean mask: True where the value is 'b' or 'c'
mask = obj.isin(['b', 'c'])

mask

--------------------------------------------------------------------------

0 True

1 False

2 False

3 False

4 False

5 True

6 True

7 True

8 True

dtype: bool

--------------------------------------------------------------------------

code: Python

# Keep only the elements where the mask is True
obj[mask]

--------------------------------------------------------------------------

0 c

5 b

6 b

7 c

8 c

dtype: object

--------------------------------------------------------------------------

code: Python

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])

# For each value in to_match, its integer position within unique_vals
pd.Index(unique_vals).get_indexer(to_match)

--------------------------------------------------------------------------

array([0, 2, 1, 1, 0, 2], dtype=int64)

--------------------------------------------------------------------------
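One detail worth knowing about `Index.get_indexer` is that values not present in the index are reported as -1 rather than raising an error. The sketch below adds a value `'z'`, made up for this example, that does not appear among the unique values.

```python
import pandas as pd

unique_vals = pd.Index(['c', 'b', 'a'])

# 'z' is a hypothetical value that is absent from unique_vals
to_match = pd.Series(['c', 'a', 'z', 'b'])

# Positions within unique_vals; -1 marks values that were not found
codes = unique_vals.get_indexer(to_match)   # array([0, 2, -1, 1])

# Keep only the values that were found
found = to_match[codes >= 0]
```

Checking for -1 before using the result as positions avoids silently selecting the last element, since -1 is also a valid positional index.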

code: Python

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

data

--------------------------------------------------------------------------

   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4

--------------------------------------------------------------------------

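Count how often each value occurs in each column. The row labels of the result are the distinct values found across all columns.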

code: Python

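# Apply value_counts to each column; a value absent from a column gives NaN, which fillna(0) replaces with 0
# Note: pd.value_counts is deprecated in recent pandas releases; pd.Series.value_counts can be passed instead
result = data.apply(pd.value_counts).fillna(0)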

result

--------------------------------------------------------------------------

   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0

--------------------------------------------------------------------------
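The counts above are shown as floats because the intermediate result contains NaN before `fillna(0)` is applied. As one possible variant, the sketch below passes the method `pd.Series.value_counts` to `apply` and converts the result back to integers; the name `counts` is chosen here for illustration.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count values per column, fill the gaps with 0, then restore an integer dtype
counts = data.apply(pd.Series.value_counts).fillna(0).astype(int)
```

Each column of `counts` sums to 5, the number of rows in `data`, which is a quick sanity check that no observation was lost.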