Data Transformation - Note for Machine Learning

Data Transformation

Removing Duplicates

code: Python

import numpy as np

import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],

                     'k2': [1, 1, 2, 3, 3, 4, 4]})

data

--------------------------------------------------------------------------

k1 k2

0 one 1

1 two 1

2 one 2

3 two 3

4 one 3

5 two 4

6 two 4

--------------------------------------------------------------------------

code: Python

data.duplicated()

--------------------------------------------------------------------------

0 False

1 False

2 False

3 False

4 False

5 False

6 True

dtype: bool

--------------------------------------------------------------------------

code: Python

data.drop_duplicates()

--------------------------------------------------------------------------

k1 k2

0 one 1

1 two 1

2 one 2

3 two 3

4 one 3

5 two 4

--------------------------------------------------------------------------

code: Python

data['v1'] = range(7)

data

--------------------------------------------------------------------------

k1 k2 v1

0 one 1 0

1 two 1 1

2 one 2 2

3 two 3 3

4 one 3 4

5 two 4 5

6 two 4 6

--------------------------------------------------------------------------

code: Python

# Among rows duplicated on 'k1', keep only the first occurrence

data.drop_duplicates(['k1'])

--------------------------------------------------------------------------

k1 k2 v1

0 one 1 0

1 two 1 1

--------------------------------------------------------------------------

code: Python

# Among rows duplicated on 'k1' and 'k2', keep only the last occurrence

data.drop_duplicates(['k1', 'k2'], keep='last')

--------------------------------------------------------------------------

k1 k2 v1

0 one 1 0

1 two 1 1

2 one 2 2

3 two 3 3

4 one 3 4

6 two 4 6

--------------------------------------------------------------------------
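The `keep` argument used above also accepts `False`, which flags or drops every member of a duplicated group rather than sparing one. The following is a minimal sketch on the same sample data; the names `all_dupes` and `unique_only` are introduced here for illustration only.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False marks every row that belongs to a duplicated group,
# not only the second and later occurrences
all_dupes = data.duplicated(keep=False)

# Dropping with keep=False leaves only rows that occur exactly once
unique_only = data.drop_duplicates(keep=False)

print(all_dupes)
print(unique_only)
```

This can be handy when duplicated records are suspect as a whole and none of the copies should be trusted.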

Transforming Data Using a Function or Mapping

code: Python

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',

'Pastrami', 'corned beef', 'Bacon',

'pastrami', 'honey ham', 'nova lox'],

                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data

--------------------------------------------------------------------------

food ounces

0 bacon 4.0

1 pulled pork 3.0

2 bacon 12.0

3 Pastrami 6.0

4 corned beef 7.5

5 Bacon 8.0

6 pastrami 3.0

7 honey ham 5.0

8 nova lox 6.0

--------------------------------------------------------------------------

code: Python

meat_to_animal = {

'bacon': 'pig',

'pulled pork': 'pig',

'pastrami': 'cow',

'corned beef': 'cow',

'honey ham': 'pig',

'nova lox': 'salmon'

}

lowercased = data['food'].str.lower()

data['animal'] = lowercased.map(meat_to_animal)

data

# This can also be written in one line:

# data['food'].map(lambda x: meat_to_animal[x.lower()])

--------------------------------------------------------------------------

food ounces animal

0 bacon 4.0 pig

1 pulled pork 3.0 pig

2 bacon 12.0 pig

3 Pastrami 6.0 cow

4 corned beef 7.5 cow

5 Bacon 8.0 pig

6 pastrami 3.0 cow

7 honey ham 5.0 pig

8 nova lox 6.0 salmon

--------------------------------------------------------------------------
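The dict-based `map` and the one-line lambda form differ when a value has no entry in the mapping. The sketch below illustrates this under the assumption of a shortened mapping and a made-up `'tofu'` entry that is not in the original data.

```python
import pandas as pd

# 'tofu' is a hypothetical value with no entry in the mapping
food = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

# Passing the dict to map: unmapped values become NaN
mapped = food.str.lower().map(meat_to_animal)

# Indexing the dict inside a lambda: unmapped values raise KeyError
try:
    food.map(lambda x: meat_to_animal[x.lower()])
    lambda_failed = False
except KeyError:
    lambda_failed = True

print(mapped)
print(lambda_failed)
```

Which behavior is preferable depends on whether missing categories should be tolerated or caught early.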

Replacing Values

code: Python

data = pd.Series([1., -999., 2., -999., -1000., 3.])

data

--------------------------------------------------------------------------

0 1.0

1 -999.0

2 2.0

3 -999.0

4 -1000.0

5 3.0

dtype: float64

--------------------------------------------------------------------------

code: Python

data.replace(-999, np.nan)

--------------------------------------------------------------------------

0 1.0

1 NaN

2 2.0

3 NaN

4 -1000.0

5 3.0

dtype: float64

--------------------------------------------------------------------------

code: Python

data.replace([-999, -1000], np.nan)

--------------------------------------------------------------------------

0 1.0

1 NaN

2 2.0

3 NaN

4 NaN

5 3.0

dtype: float64

--------------------------------------------------------------------------

code: Python

data.replace([-999, -1000], [np.nan, 0])

--------------------------------------------------------------------------

0 1.0

1 NaN

2 2.0

3 NaN

4 0.0

5 3.0

dtype: float64

--------------------------------------------------------------------------

code: Python

data.replace({-999: np.nan, -1000: 0})

--------------------------------------------------------------------------

0 1.0

1 NaN

2 2.0

3 NaN

4 0.0

5 3.0

dtype: float64

--------------------------------------------------------------------------
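Each `replace` call above returns a new Series and leaves `data` unchanged, so it chains naturally with other cleaning steps. A minimal sketch, assuming the sentinel values should be discarded entirely (the name `cleaned` is illustrative):

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -999., -1000., 3.])

# Turn both sentinel values into NaN, then drop the missing entries
cleaned = data.replace([-999, -1000], np.nan).dropna()

print(cleaned)
print(data)  # the original Series still holds the sentinels
```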

Renaming Axis Indexes

code: Python

data = pd.DataFrame(np.arange(12).reshape((3, 4)),

                    index=['Ohio', 'Colorado', 'New York'],

                    columns=['one', 'two', 'three', 'four'])

data

--------------------------------------------------------------------------

one two three four

Ohio 0 1 2 3

Colorado 4 5 6 7

New York 8 9 10 11

--------------------------------------------------------------------------

code: Python

transform = lambda x: x[:4].upper()

data.index = data.index.map(transform)

data

--------------------------------------------------------------------------

one two three four

OHIO 0 1 2 3

COLO 4 5 6 7

NEW 8 9 10 11

--------------------------------------------------------------------------

code: Python

data.rename(index=str.title, columns=str.upper)

--------------------------------------------------------------------------

ONE TWO THREE FOUR

Ohio 0 1 2 3

Colo 4 5 6 7

New 8 9 10 11

--------------------------------------------------------------------------

code: Python

data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})

--------------------------------------------------------------------------

one two peekaboo four

INDIANA 0 1 2 3

COLO 4 5 6 7

NEW 8 9 10 11

--------------------------------------------------------------------------
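Unlike assigning to `data.index`, `rename` builds a new object and does not modify the original. The sketch below mixes a function for the index with a dict for the columns; the name `renamed` is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# A function is applied to every label; a dict renames only the listed labels
renamed = data.rename(index=str.upper, columns={'three': 'peekaboo'})

print(renamed)
print(data.index.tolist())  # the original labels are unchanged
```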

Discretization and Binning

code: Python

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# Split into the ranges 18-25, 26-35, 36-60, and 61-100

bins = [18, 25, 35, 60, 100]

cats = pd.cut(ages, bins)

cats

--------------------------------------------------------------------------

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]

Length: 12

Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

--------------------------------------------------------------------------

code: Python

cats.codes

--------------------------------------------------------------------------

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

--------------------------------------------------------------------------

code: Python

cats.categories

--------------------------------------------------------------------------

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]

              closed='right',

              dtype='interval[int64]')

--------------------------------------------------------------------------

code: Python

pd.value_counts(cats)

--------------------------------------------------------------------------

(18, 25] 5

(35, 60] 3

(25, 35] 3

(60, 100] 1

dtype: int64

--------------------------------------------------------------------------

code: Python

# Make the left edge of each bin closed (inclusive) and the right edge open

pd.cut(ages, [18, 26, 36, 61, 100], right=False)

--------------------------------------------------------------------------

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]

Length: 12

Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

--------------------------------------------------------------------------

code: Python

group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

# The labels argument assigns a name to each bin

pd.cut(ages, bins, labels=group_names)

--------------------------------------------------------------------------

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]

Length: 12

Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

--------------------------------------------------------------------------

code: Python

data = np.random.rand(20)

# Split data into 4 equal-width bins

# precision=2 limits the bin edges to two decimal digits

pd.cut(data, 4, precision=2)

--------------------------------------------------------------------------

[(0.34, 0.55], (0.34, 0.55], (0.76, 0.97], (0.76, 0.97], (0.34, 0.55], ..., (0.34, 0.55], (0.34, 0.55], (0.55, 0.76], (0.34, 0.55], (0.12, 0.34]]

Length: 20

Categories (4, interval[float64]): [(0.12, 0.34] < (0.34, 0.55] < (0.55, 0.76] < (0.76, 0.97]]

--------------------------------------------------------------------------

code: Python

data = np.random.randn(1000)

# Split into 4 quartile bins (with cut, the bins would not contain equal numbers of data points)

cats = pd.qcut(data, 4)

cats

--------------------------------------------------------------------------

[(-0.0265, 0.62], (0.62, 3.928], (-0.68, -0.0265], (0.62, 3.928], (-0.0265, 0.62], ..., (-0.68, -0.0265], (-0.68, -0.0265], (-2.9499999999999997, -0.68], (0.62, 3.928], (-0.68, -0.0265]]

Length: 1000

Categories (4, interval[float64]): [(-2.9499999999999997, -0.68] < (-0.68, -0.0265] < (-0.0265, 0.62] < (0.62, 3.928]]

--------------------------------------------------------------------------

code: Python

pd.value_counts(cats)

--------------------------------------------------------------------------

(0.62, 3.928] 250

(-0.0265, 0.62] 250

(-0.68, -0.0265] 250

(-2.9499999999999997, -0.68] 250

dtype: int64

--------------------------------------------------------------------------

code: Python

pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])

--------------------------------------------------------------------------

[(-0.0265, 1.286], (-0.0265, 1.286], (-1.187, -0.0265], (-0.0265, 1.286], (-0.0265, 1.286], ..., (-1.187, -0.0265], (-1.187, -0.0265], (-2.9499999999999997, -1.187], (-0.0265, 1.286], (-1.187, -0.0265]]

Length: 1000

Categories (4, interval[float64]): [(-2.9499999999999997, -1.187] < (-1.187, -0.0265] < (-0.0265, 1.286] < (1.286, 3.928]]

--------------------------------------------------------------------------
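The difference between `cut` and `qcut` noted above is easiest to see by counting the members of each bin on the same data. This is a minimal sketch; the seeded generator and the names `values`, `equal_width`, and `equal_count` are assumptions made here so that the comparison is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)      # fixed seed, chosen arbitrarily
values = rng.standard_normal(1000)

# cut: 4 bins of equal width, so the counts follow the shape of the data
equal_width = pd.Series(pd.cut(values, 4)).value_counts(sort=False)

# qcut: 4 bins delimited by sample quantiles, so the counts are equal
equal_count = pd.Series(pd.qcut(values, 4)).value_counts(sort=False)

print(equal_width)
print(equal_count)
```

For normally distributed data the equal-width bins at the tails hold few points, while each quantile bin holds a quarter of the sample.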

Detecting and Removing Outliers

code: Python

data = pd.DataFrame(np.random.randn(1000, 4))

data.describe()

--------------------------------------------------------------------------

0 1 2 3

count 1000.000000 1000.000000 1000.000000 1000.000000

mean -0.008212 -0.002558 -0.059165 -0.048681

std 1.041077 1.019451 0.980169 1.003074

min -3.183867 -3.481593 -3.194414 -3.108915

25% -0.746147 -0.699027 -0.753809 -0.736424

50% 0.005656 0.039159 -0.055737 -0.037073

75% 0.696967 0.705667 0.616784 0.603749

max 3.189940 2.961194 3.023720 2.916153

--------------------------------------------------------------------------

code: Python

# Treat values whose absolute value exceeds 3 as outliers

col = data[2]

col[np.abs(col) > 3]

--------------------------------------------------------------------------

5 3.248944

102 3.176873

324 3.260383

499 -3.056990

586 -3.184377

Name: 2, dtype: float64

--------------------------------------------------------------------------

code: Python

# Select the rows that contain at least one value whose absolute value exceeds 3

data[(np.abs(data) > 3).any(axis=1)]

--------------------------------------------------------------------------

0 1 2 3

25 0.336788 -3.333767 -1.240685 -0.650855

107 -3.018842 -0.298748 0.406954 0.183282

131 0.781753 -0.555434 -0.048478 -3.108915

262 -3.183867 1.050471 -1.042736 1.680374

309 -3.140963 -1.509976 -0.389818 -0.273253

474 1.090038 -0.848098 -3.194414 0.077839

504 0.003349 -0.011807 3.023720 -1.105312

533 0.452649 -3.481593 0.789944 1.737746

702 3.082067 -0.516982 0.251909 -0.029354

730 3.189940 0.070978 0.516982 -0.805171

894 -0.436479 0.901529 -3.044612 -1.193980

928 -1.148738 -3.170292 -1.017073 -1.147658

--------------------------------------------------------------------------

code: Python

# np.sign() returns 1 for positive values and -1 for negative values

np.sign(data).head()

--------------------------------------------------------------------------

0 1 2 3

0 -1.0 1.0 -1.0 1.0

1 -1.0 1.0 -1.0 -1.0

2 -1.0 -1.0 -1.0 -1.0

3 1.0 1.0 1.0 1.0

4 -1.0 -1.0 1.0 -1.0

--------------------------------------------------------------------------
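One common use of `np.sign` together with the boolean mask is to cap outliers rather than just list them. The following is a minimal sketch, assuming the goal is to limit every value to the interval [-3, 3]; the seeded generator and the name `capped` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)      # fixed seed, chosen arbitrarily
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Where |value| > 3, substitute +3 or -3 according to the value's sign;
# all other values are left as they are
capped = data.mask(np.abs(data) > 3, np.sign(data) * 3)

print(capped.describe())
```

For this symmetric threshold, `data.clip(-3, 3)` should give the same result; the sign-based form generalizes to cases where the replacement depends on more than a fixed bound.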

Permutation (Random Reordering) and Random Sampling

code: Python

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

--------------------------------------------------------------------------

0 1 2 3

0 0 1 2 3

1 4 5 6 7

2 8 9 10 11

3 12 13 14 15

4 16 17 18 19

--------------------------------------------------------------------------

code: Python

# Generate a random permutation of the row positions 0-4

sampler = np.random.permutation(5)

sampler

--------------------------------------------------------------------------

array([0, 3, 2, 1, 4])

--------------------------------------------------------------------------

code: Python

df.take(sampler)

--------------------------------------------------------------------------

0 1 2 3

0 0 1 2 3

3 12 13 14 15

2 8 9 10 11

1 4 5 6 7

4 16 17 18 19

--------------------------------------------------------------------------

code: Python

# Randomly sample a subset without replacement (a row that has been drawn cannot be drawn again)

df.sample(n=3)

--------------------------------------------------------------------------

0 1 2 3

1 4 5 6 7

4 16 17 18 19

2 8 9 10 11

--------------------------------------------------------------------------

code: Python

choices = pd.Series([5, 7, -1, 6, 4])

# Sample with replacement

draws = choices.sample(n=10, replace=True)

draws

--------------------------------------------------------------------------

4 4

2 -1

0 5

4 4

0 5

3 6

0 5

2 -1

dtype: int64

--------------------------------------------------------------------------
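The outputs above change on every run. When repeatable results are wanted, both the permutation and the sampling can be seeded. A minimal sketch, assuming arbitrary seed values; the names `rng`, `shuffled`, `first`, and `second` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# A seeded generator yields the same permutation on every run
rng = np.random.default_rng(42)
shuffled = df.take(rng.permutation(len(df)))

# random_state makes sample() repeatable as well
first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)

print(shuffled)
print(first)
```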

Computing Indicator and Dummy Variables

code: Python

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],

'data1': range(6)})

--------------------------------------------------------------------------

data1 key

0 0 b

1 1 b

2 2 a

3 3 c

4 4 a

5 5 b

--------------------------------------------------------------------------

code: Python

pd.get_dummies(df['key'])

--------------------------------------------------------------------------

a b c

0 0 1 0

1 0 1 0

2 1 0 0

3 0 0 1

4 1 0 0

5 0 1 0

--------------------------------------------------------------------------

code: Python

dummies = pd.get_dummies(df['key'], prefix='key')

df_with_dummy = df[['data1']].join(dummies)

df_with_dummy

--------------------------------------------------------------------------

data1 key_a key_b key_c

0 0 0 1 0

1 1 0 1 0

2 2 1 0 0

3 3 0 0 1

4 4 1 0 0

5 5 0 1 0

--------------------------------------------------------------------------
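Depending on the installed pandas version, `get_dummies` may return boolean columns (True/False) instead of the 0/1 integers shown above. A minimal sketch that requests integers explicitly through the `dtype` argument, on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# dtype=int asks for 0/1 columns regardless of the library default
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Double brackets keep 'data1' as a DataFrame so that join() is available
df_with_dummy = df[['data1']].join(dummies)

print(df_with_dummy)
```

Each row has exactly one indicator set to 1, corresponding to its original `key` value.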