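Summarizing and Computing Descriptive Statistics
Reductions and summary statistics
pandas objects come with a set of common mathematical and statistical methods. Most of them are reductions or summary statistics: they take a Series, or the Series that make up the rows or columns of a DataFrame, and compute a single value such as the sum or the mean. Missing values (NaN) are skipped by default; pass skipna=False to propagate them instead.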
code: Python
df
--------------------------------------------------------------------------
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
--------------------------------------------------------------------------
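The notes display df without showing how it was built. A minimal sketch of one construction that reproduces the values printed above (the exact construction used by the author is not shown, so this is an assumption):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; np.nan marks the missing observations.
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)
print(df)
```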
code: Python
df.sum()
--------------------------------------------------------------------------
one 9.25
two -5.80
dtype: float64
--------------------------------------------------------------------------
code: Python
df.sum(axis='columns')
--------------------------------------------------------------------------
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
--------------------------------------------------------------------------
code: Python
df.mean(axis='columns', skipna=False)
--------------------------------------------------------------------------
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
--------------------------------------------------------------------------
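The outputs above differ in how NaN is treated: sum skips NaN (so the all-NaN row c sums to 0.00), while skipna=False makes any NaN in a row propagate. The sketch below rebuilds the same frame (an assumed construction) and also shows min_count, which may be useful when an all-NaN row should yield NaN rather than 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Default behaviour: NaN is skipped, so the all-NaN row 'c' sums to 0.0.
row_sums = df.sum(axis="columns")

# min_count=1 requires at least one valid value per row; 'c' becomes NaN.
row_sums_strict = df.sum(axis="columns", min_count=1)

# skipna=False: a single NaN in a row makes the whole result NaN.
row_means_strict = df.mean(axis="columns", skipna=False)

print(row_sums, row_sums_strict, row_means_strict, sep="\n\n")
```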
code: Python
df.idxmax()
--------------------------------------------------------------------------
one b
two d
dtype: object
--------------------------------------------------------------------------
code: Python
# Cumulative sum down each column
df.cumsum()
--------------------------------------------------------------------------
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
--------------------------------------------------------------------------
code: Python
df.describe()
--------------------------------------------------------------------------
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
--------------------------------------------------------------------------
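describe reports a different set of statistics for non-numeric data. A small sketch with a made-up Series of strings (illustrative data, not from the notes):

```python
import pandas as pd

# Hypothetical string data: 'a' appears 8 times, 'b' and 'c' 4 times each.
labels = pd.Series(["a", "a", "b", "c"] * 4)

# For non-numeric data, describe() reports count, unique, top, and freq.
summary = labels.describe()
print(summary)
```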
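Correlation and Covariance
Some summary statistics, such as correlation and covariance, are computed from pairs of variables.
The DataFrames below hold stock prices and trading volumes obtained from Yahoo! Finance.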
code: Python
price = pd.read_pickle('yahoo_price.pkl')
volume = pd.read_pickle('yahoo_volume.pkl')
returns = price.pct_change()
returns.tail()
--------------------------------------------------------------------------
AAPL GOOG IBM MSFT
Date
2016-10-17 -0.000680 0.001837 0.002072 -0.003483
2016-10-18 -0.000681 0.019616 -0.026168 0.007690
2016-10-19 -0.002979 0.007846 0.003583 -0.002255
2016-10-20 -0.000512 -0.005652 0.001719 -0.004867
2016-10-21 -0.003930 0.003011 -0.012474 0.042096
--------------------------------------------------------------------------
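The pickle files are not included with these notes, so here is a sketch of what pct_change computes, using made-up prices (the tickers AAA and BBB and all values are hypothetical):

```python
import pandas as pd

# Hypothetical closing prices, not the Yahoo! Finance data.
price = pd.DataFrame(
    {"AAA": [100.0, 102.0, 101.0], "BBB": [50.0, 49.0, 49.98]},
    index=pd.date_range("2016-10-19", periods=3, freq="D"),
)

# Each entry is price[t] / price[t-1] - 1; the first row has no
# predecessor and is therefore NaN.
returns = price.pct_change()
print(returns)
```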
Compute the correlation between the MSFT and IBM returns.
code: Python
returns.MSFT.corr(returns.IBM)
--------------------------------------------------------------------------
0.4997636114415114
--------------------------------------------------------------------------
Compute the covariance between the MSFT and IBM returns.
code: Python
returns.MSFT.cov(returns.IBM)
--------------------------------------------------------------------------
8.870655479703546e-05
--------------------------------------------------------------------------
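The two quantities are related: the Pearson correlation is the covariance divided by the product of the two standard deviations. A sketch that checks this on synthetic data (the series x and y are random and purely illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic, partially correlated series (illustrative only).
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = 0.5 * x + pd.Series(rng.normal(size=200))

cov_xy = x.cov(y)    # sample covariance (ddof=1)
corr_xy = x.corr(y)  # Pearson correlation

# Series.std also uses ddof=1, so the normalisation is consistent.
manual_corr = cov_xy / (x.std() * y.std())
print(corr_xy, manual_corr)
```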
code: Python
returns.corr()
--------------------------------------------------------------------------
AAPL GOOG IBM MSFT
AAPL 1.000000 0.407919 0.386817 0.389695
GOOG 0.407919 1.000000 0.405099 0.465919
IBM 0.386817 0.405099 1.000000 0.499764
MSFT 0.389695 0.465919 0.499764 1.000000
--------------------------------------------------------------------------
code: Python
returns.cov()
--------------------------------------------------------------------------
AAPL GOOG IBM MSFT
AAPL 0.000277 0.000107 0.000078 0.000095
GOOG 0.000107 0.000251 0.000078 0.000108
IBM 0.000078 0.000078 0.000146 0.000089
MSFT 0.000095 0.000108 0.000089 0.000215
--------------------------------------------------------------------------
code: Python
returns.corrwith(returns.IBM)
--------------------------------------------------------------------------
AAPL 0.386817
GOOG 0.405099
IBM 1.000000
MSFT 0.499764
dtype: float64
--------------------------------------------------------------------------
code: Python
returns.corrwith(volume)
--------------------------------------------------------------------------
AAPL -0.075565
GOOG -0.007067
IBM -0.204849
MSFT -0.092950
dtype: float64
--------------------------------------------------------------------------
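When corrwith is given a DataFrame, columns are matched by name rather than by position. A sketch with synthetic data (column names and values are hypothetical) in which the second frame lists its columns in a different order:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2016-01-01", periods=50, freq="D")

returns = pd.DataFrame(rng.normal(size=(50, 2)), index=idx, columns=["AAA", "BBB"])
# Same labels, deliberately in a different column order.
volume = pd.DataFrame(rng.normal(size=(50, 2)), index=idx, columns=["BBB", "AAA"])

# Correlates returns['AAA'] with volume['AAA'], and likewise for 'BBB'.
by_name = returns.corrwith(volume)

# Pairwise check for one column.
expected_aaa = returns["AAA"].corr(volume["AAA"])
print(by_name)
```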
Unique Values, Value Counts, and Membership
Another group of related methods extracts information about the values contained in a one-dimensional Series.
code: Python
obj
--------------------------------------------------------------------------
0 c
1 a
2 d
3 a
4 a
5 b
6 b
7 c
8 c
dtype: object
--------------------------------------------------------------------------
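The Series obj is shown without its definition. One construction that reproduces the printed values (an assumption, since the original cell is not included):

```python
import pandas as pd

# Rebuild the example Series with the default integer index 0..8.
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
print(obj)
```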
code: Python
uniques = obj.unique()
uniques
--------------------------------------------------------------------------
array(['c', 'a', 'd', 'b'], dtype=object)
--------------------------------------------------------------------------
code: Python
obj.value_counts()
--------------------------------------------------------------------------
c 3
a 3
b 2
d 1
dtype: int64
--------------------------------------------------------------------------
code: Python
pd.value_counts(obj.values, sort=False)
--------------------------------------------------------------------------
a 3
b 2
d 1
c 3
dtype: int64
--------------------------------------------------------------------------
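The top-level pd.value_counts function has, to my knowledge, been deprecated in recent pandas releases, so the call above may warn or fail depending on the installed version. A sketch of the method-based equivalent, assuming the same obj as above (the row order of the unsorted result can differ between versions, so it is not relied on here):

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Method form; sort=False skips sorting by count.
unsorted_counts = obj.value_counts(sort=False)

# Default: sorted by count, descending.
sorted_counts = obj.value_counts()
print(unsorted_counts, sorted_counts, sep="\n\n")
```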
code: Python
obj
--------------------------------------------------------------------------
0 c
1 a
2 d
3 a
4 a
5 b
6 b
7 c
8 c
dtype: object
--------------------------------------------------------------------------
code: Python
mask
--------------------------------------------------------------------------
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
--------------------------------------------------------------------------
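The definition of mask is not shown. The printed values are True exactly where obj is 'b' or 'c', so a membership test with isin is one definition consistent with that output (an assumption about the missing cell):

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Vectorised membership test: True where the value is 'b' or 'c'.
mask = obj.isin(["b", "c"])
print(mask)
```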
code: Python
obj[mask]
--------------------------------------------------------------------------
0 c
5 b
6 b
7 c
8 c
dtype: object
--------------------------------------------------------------------------
code: Python
pd.Index(unique_vals).get_indexer(to_match)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
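Neither to_match nor unique_vals is defined in these notes, and the output is missing. As an illustration only, the sketch below uses made-up values for both to show what Index.get_indexer returns: for each element of to_match, the position of that value in the index, or -1 if it is absent.

```python
import pandas as pd

# Hypothetical inputs; the original values are not shown in the notes.
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])

# Position of each to_match value within unique_vals.
positions = pd.Index(unique_vals).get_indexer(to_match)

# Values not present in the index map to -1.
missing = pd.Index(unique_vals).get_indexer(["z"])
print(positions, missing)
```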
code: Python
data
--------------------------------------------------------------------------
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
--------------------------------------------------------------------------
Count how often each value occurs in each column.
code: Python
result = data.apply(pd.value_counts).fillna(0)
result
--------------------------------------------------------------------------
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0
--------------------------------------------------------------------------
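A self-contained sketch of the same computation. The frame is rebuilt from the values printed above (an assumed construction), and the Series.value_counts method is used in place of pd.value_counts, which newer pandas versions may no longer accept:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "Qu1": [1, 3, 4, 3, 4],
        "Qu2": [2, 3, 1, 2, 3],
        "Qu3": [1, 5, 2, 4, 4],
    }
)

# Count values column by column; the row labels of the result are the
# distinct values found in any column. A value that never occurs in a
# column yields NaN there, which fillna(0) turns into 0.
result = data.apply(lambda col: col.value_counts()).fillna(0)
print(result)
```

Here the row labels are the distinct values across all columns, and each cell is the number of times that value occurs in the corresponding column.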