pandas Data Structures
Series
A Series is a one-dimensional array-like object. It holds a sequence of values (with types similar to NumPy data types) and an associated array of data labels, called its index.
code: Python
import pandas as pd
import numpy as np
code: Python
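# Build a Series from a list; a default integer index (0..N-1) is created
obj = pd.Series([4, 7, -5, 3])
obj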
--------------------------------------------------------------------------
0 4
1 7
2 -5
3 3
dtype: int64
--------------------------------------------------------------------------
code: Python
obj.values
--------------------------------------------------------------------------
array([ 4,  7, -5,  3])
--------------------------------------------------------------------------
code: Python
obj.index
--------------------------------------------------------------------------
RangeIndex(start=0, stop=4, step=1)
--------------------------------------------------------------------------
code: Python
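# Pass index= to label each value explicitly
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2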
--------------------------------------------------------------------------
d 4
b 7
a -5
c 3
dtype: int64
--------------------------------------------------------------------------
code: Python
obj2['d'] = 6  # assign a new value to label 'd'
obj2[['c', 'a', 'd']]  # select several values with a list of labels
--------------------------------------------------------------------------
c 3
a -5
d 6
dtype: int64
--------------------------------------------------------------------------
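Because a Series is array-like, NumPy-style operations can be applied to it without losing the link between each value and its label. The following is a minimal sketch of that behavior; the variable names (`s`, `positive`, `doubled`, `roots`) are illustrative and do not appear in the original text.

```python
import numpy as np
import pandas as pd

# Illustrative data; the names below are assumptions made for this sketch
s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]          # boolean filtering keeps the matching labels
doubled = s * 2              # scalar arithmetic is applied element-wise
roots = np.sqrt(positive)    # a NumPy ufunc returns a Series with the same index
```

In each case the result is still a Series, so the labels 'd', 'b', 'a', 'c' stay attached to the values they describe.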
code: Python
'e' in obj2
--------------------------------------------------------------------------
False
--------------------------------------------------------------------------
code: Python
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
--------------------------------------------------------------------------
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
--------------------------------------------------------------------------
code: Python
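# Keys are matched against the requested labels; 'California' has no entry in sdata
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)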
obj4
--------------------------------------------------------------------------
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
--------------------------------------------------------------------------
code: Python
pd.isnull(obj4)
# pd.notnull(obj4)
# obj4.isnull()
--------------------------------------------------------------------------
California True
Ohio False
Oregon False
Texas False
dtype: bool
--------------------------------------------------------------------------
code: Python
obj3
obj4
obj3 + obj4
--------------------------------------------------------------------------
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
--------------------------------------------------------------------------
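Arithmetic between two Series aligns the operands by index label rather than by position, which is why labels present in only one operand produce NaN. A small sketch of this, using made-up values and names (`left`, `right`, `total`, `filled`) that are not part of the original text:

```python
import pandas as pd

# Illustrative data; names and values are assumptions made for this sketch
left = pd.Series({'Ohio': 1, 'Texas': 2, 'Utah': 3})
right = pd.Series({'California': 10, 'Ohio': 20, 'Texas': 30})

# Labels are matched first; a label missing from either side yields NaN
total = left + right

# Series.add with fill_value treats a value missing on one side as 0
filled = left.add(right, fill_value=0)
```

Here `total` has NaN for 'California' and 'Utah', while `filled` keeps the single available value for those labels.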
code: Python
obj4.name = 'population'
obj4.index.name = 'state'
obj4
--------------------------------------------------------------------------
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
--------------------------------------------------------------------------
DataFrame
A DataFrame is a tabular data structure with an ordered collection of columns. It has an index for both rows and columns, and can be thought of as a dictionary whose values are Series.
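The "dictionary of Series" view can be made concrete with a short sketch. The data and the names `year`, `pop`, `df`, and `col` are assumptions made for illustration only.

```python
import pandas as pd

# Two illustrative Series sharing the same row labels
year = pd.Series([2000, 2001, 2002], index=['a', 'b', 'c'])
pop = pd.Series([1.5, 1.7, 3.6], index=['a', 'b', 'c'])

# The dict keys become column names, the Series become the columns
df = pd.DataFrame({'year': year, 'pop': pop})

col = df['pop']                        # dict-style lookup returns a Series
shared = col.index.equals(df.index)    # the column carries the row index
```

Looking up a column by name hands back a Series whose index is the DataFrame's row index, much like fetching a value from a dictionary by key.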
code: Python
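# A dict of equal-length lists: each key becomes a column
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)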
frame
--------------------------------------------------------------------------
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
5 3.2 Nevada 2003
--------------------------------------------------------------------------
code: Python
frame.head()
--------------------------------------------------------------------------
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
--------------------------------------------------------------------------
code: Python
# Pass columns= to control the column order
pd.DataFrame(data, columns=['year', 'state', 'pop'])
--------------------------------------------------------------------------
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
--------------------------------------------------------------------------
code: Python
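# A requested column that is not in data ('debt') is filled with NaN
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2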
--------------------------------------------------------------------------
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
--------------------------------------------------------------------------
Referencing rows
code: Python
# Select a row by its index label
frame2.loc['three']
--------------------------------------------------------------------------
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
--------------------------------------------------------------------------
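Rows can be selected either by label with `loc` or by integer position with `iloc`. A minimal sketch comparing the two, using a small made-up frame (`df`, `by_label`, and `by_position` are illustrative names, not from the original text):

```python
import pandas as pd

# Illustrative frame with string row labels
df = pd.DataFrame({'year': [2000, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6]},
                  index=['one', 'two', 'three'])

by_label = df.loc['three']    # row whose index label is 'three'
by_position = df.iloc[2]      # third row, counting from 0
```

Both expressions return the same row as a Series whose index is the column names and whose `name` is the row label.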
code: Python
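# Assign an array to a column; its length must match the number of rows
frame2['debt'] = np.arange(6.)
frame2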
--------------------------------------------------------------------------
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
six 2003 Nevada 3.2 5.0
--------------------------------------------------------------------------
code: Python
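# Assigning a Series aligns on the index; labels it lacks become NaN
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2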
--------------------------------------------------------------------------
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN
--------------------------------------------------------------------------
code: Python
frame2.columns
--------------------------------------------------------------------------
Index(['year', 'state', 'pop', 'debt'], dtype='object')
--------------------------------------------------------------------------
Naming the index and columns
code: Python
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3
--------------------------------------------------------------------------
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
--------------------------------------------------------------------------
Index Objects
pandas Index objects hold the axis labels and other metadata, such as an axis's name or names attribute. Any array or other sequence of labels passed when constructing a Series or DataFrame is converted internally to an Index object.
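The conversion described above can be checked directly. The sketch below also shows one property the text does not mention: an Index cannot be modified in place, so assigning to one of its elements raises a TypeError. The names `s`, `idx`, `is_index`, and `mutated` are illustrative.

```python
import pandas as pd

# A plain Python list of labels is passed as the index
s = pd.Series([1.5, -2.5, 0.0], index=['a', 'b', 'c'])

idx = s.index
is_index = isinstance(idx, pd.Index)   # the list was converted to an Index

# Index objects are immutable: item assignment raises TypeError
try:
    idx[1] = 'd'
    mutated = True
except TypeError:
    mutated = False
```

Immutability is what makes it safe to share one Index object among several Series or DataFrames.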
code: Python
index = obj.index
index
--------------------------------------------------------------------------
RangeIndex(start=0, stop=4, step=1)
--------------------------------------------------------------------------
code: Python
labels = pd.Index(np.arange(3))
labels
--------------------------------------------------------------------------
Int64Index([0, 1, 2], dtype='int64')
--------------------------------------------------------------------------
code: Python
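# Reuse the existing Index object as this Series' index
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2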
--------------------------------------------------------------------------
0 1.5
1 -2.5
2 0.0
dtype: float64
--------------------------------------------------------------------------
code: Python
obj2.index is labels
--------------------------------------------------------------------------
True
--------------------------------------------------------------------------