Unsupervised dimensionality reduction with principal component analysis

Overview

https://gyazo.com/b5267d5879783ac1d0d696f43165f299

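Main steps of principal component analysis

1. Extract the principal components

Standardize the data.

Construct the covariance matrix.

Compute the eigenvalues and eigenvectors (principal components) of the covariance matrix.

Identify the eigenvectors (principal components) that capture most of the information (variance) in the data.

2. Feature transformation

Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k <= d).

Construct the projection matrix W from the top k eigenvectors.

Transform the d-dimensional input dataset X with the projection matrix W to obtain the new k-dimensional feature subspace.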
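In matrix notation, the steps above can be summarized roughly as follows. This is only a sketch: it assumes X is the standardized n x d data matrix with one sample per row, and the symbols \Sigma, \lambda_j, v_j and X' are notation introduced here for illustration.

```latex
\begin{aligned}
\Sigma &= \frac{1}{n-1}\, X^{\top} X \in \mathbb{R}^{d \times d}
  && \text{(covariance matrix of the standardized data)} \\
\Sigma\, v_j &= \lambda_j\, v_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0
  && \text{(eigenvalues and eigenvectors)} \\
\text{explained variance ratio of } v_j &= \frac{\lambda_j}{\sum_{i=1}^{d} \lambda_i} \\
W &= [\, v_1 \;\; v_2 \;\; \cdots \;\; v_k \,] \in \mathbb{R}^{d \times k}, \qquad k \le d
  && \text{(projection matrix)} \\
X' &= X W \in \mathbb{R}^{n \times k}
  && \text{(transformed data)}
\end{aligned}
```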

Coding (No library)

1. Extract the principal components

Standardize the data

code: Python

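import pandas as pd
# In current scikit-learn versions train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module is no longer available)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)

# Store the columns from the second one onward in X and the first column (class label) in y
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

# Split into training data and test data (70% / 30%, stratified by class label)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Standardize using the mean and standard deviation estimated on the training data
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)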
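What the standardization step does can be checked by hand with NumPy alone. The following is a minimal sketch on hypothetical toy data (not the Wine dataset). It assumes the population standard deviation (ddof=0), which is what StandardScaler uses, and it applies the training statistics unchanged to the test data.

```python
import numpy as np

# Hypothetical toy data: 6 training samples and 2 test samples with 2 features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                    [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
X_test = np.array([[3.5, 35.0], [7.0, 70.0]])

# Estimate the statistics on the training data only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same shift and scale to both sets
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

print(X_train_std.mean(axis=0))  # per-feature mean of the standardized training data
print(X_train_std.std(axis=0))   # per-feature standard deviation
print(X_test_std)
```

The test data are deliberately not re-fitted: their mean and variance after scaling are generally not exactly 0 and 1.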

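Construct the covariance matrix

code: Python

import numpy as np

# Construct the covariance matrix (np.cov expects one variable per row, hence the transpose)
cov_mat = np.cov(X_train_std.T)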
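To see what np.cov computes here, it can be compared with the textbook formula. The sketch below uses hypothetical random data; the names X_centered, cov_manual and cov_numpy are introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples, 3 features
X = rng.normal(size=(50, 3))

# Center each feature, then apply the sample covariance formula X^T X / (n - 1)
X_centered = X - X.mean(axis=0)
n_samples = X_centered.shape[0]
cov_manual = X_centered.T @ X_centered / (n_samples - 1)

# np.cov treats each row as a variable by default, so pass the transpose
cov_numpy = np.cov(X.T)

print(cov_manual.shape)
print(np.allclose(cov_manual, cov_numpy))
```

Passing rowvar=False to np.cov instead of transposing gives the same matrix.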

Compute the eigenvalues and eigenvectors (principal components) of the covariance matrix

code: Python

# Compute the eigenvalues and eigenvectors (principal components)
# The eigenvalues are not guaranteed to come back in sorted order
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
print('\nEigenvalues \n{}'.format(eigen_vals))

------------------------------------------------------------------------------

Eigenvalues
[4.84274532 2.41602459 1.54845825 0.96120438 0.84166161 0.6620634
 0.51828472 0.34650377 0.3131368  0.10754642 0.21357215 0.15362835
 0.1808613 ]

------------------------------------------------------------------------------
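Because a covariance matrix is symmetric, np.linalg.eigh is an alternative to np.linalg.eig worth considering: it always returns real eigenvalues, in ascending order. The sketch below uses hypothetical random data rather than the Wine dataset and reorders the result so that the largest eigenvalue comes first.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 4 features
X = rng.normal(size=(100, 4))
cov_mat = np.cov(X.T)

# eigh is meant for symmetric (Hermitian) matrices; eigenvalues come back ascending
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

# Reorder to descending eigenvalues; column i of eigen_vecs belongs to eigen_vals[i]
order = np.argsort(eigen_vals)[::-1]
eigen_vals = eigen_vals[order]
eigen_vecs = eigen_vecs[:, order]

# Each pair should satisfy cov_mat @ v = lambda * v
residual = cov_mat @ eigen_vecs - eigen_vecs * eigen_vals
print(eigen_vals)
print(np.abs(residual).max())
```

The sum of the eigenvalues equals the trace of the covariance matrix, that is, the total variance.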

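Identify the eigenvectors (principal components) that capture most of the information (variance) in the data

code: Python

# Compute the explained variance ratio of each eigenvalue
# (the eigenvalue divided by the sum of all eigenvalues)

# Sum of the eigenvalues
tot = sum(eigen_vals)

# Explained variance ratios, from largest to smallest
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]

# Cumulative sum of the explained variance ratios
cum_var_exp = np.cumsum(var_exp)

# Jupyter notebook magic; omit this line when running as a plain script
%matplotlib inline
import matplotlib.pyplot as plt

# Bar chart of the individual explained variance ratios
plt.bar(range(1, 14), var_exp, alpha=0.5, align='center', label='individual explained variance')

# Step plot of the cumulative explained variance ratio
plt.step(range(1, 14), cum_var_exp, where='mid', label='cumulative explained variance')

plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()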

https://gyazo.com/d846f1f84a5e0143a3160092b223f845

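The plot above shows that the first principal component alone accounts for nearly 40% of the variance (about 37% according to the eigenvalues printed earlier). The first two principal components together account for roughly 55%.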
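The same quantities can also be used to choose k numerically instead of reading it off the plot. The sketch below reuses the eigenvalues printed earlier; the 80% threshold is a hypothetical target chosen only for illustration.

```python
import numpy as np

# Eigenvalues copied from the output printed earlier in this page
eigen_vals = np.array([4.84274532, 2.41602459, 1.54845825, 0.96120438,
                       0.84166161, 0.6620634, 0.51828472, 0.34650377,
                       0.3131368, 0.10754642, 0.21357215, 0.15362835,
                       0.1808613])

# Explained variance ratios (largest first) and their cumulative sum
var_exp = np.sort(eigen_vals)[::-1] / eigen_vals.sum()
cum_var_exp = np.cumsum(var_exp)

# Smallest k whose cumulative explained variance reaches the threshold
threshold = 0.80  # hypothetical target, not taken from the original notes
k = int(np.argmax(cum_var_exp >= threshold)) + 1

print(var_exp[:2])
print(cum_var_exp[:2])
print(k)
```

With these eigenvalues, two components stay well below the 80% target, so a threshold-based choice gives a larger k than the k = 2 used for plotting.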

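2. Feature transformation

Select the k eigenvectors that correspond to the k largest eigenvalues (k is the dimensionality of the new feature subspace)

code: Python

# Create a list of (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i]) for i in range(len(eigen_vals))]

# Sort the (eigenvalue, eigenvector) tuples from the largest eigenvalue to the smallest
eigen_pairs.sort(key=lambda k: k[0], reverse=True)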

Construct the projection matrix W from the top k eigenvectors

code: Python

# Based on the explained variance plot, use k = 2
# Stack the top two eigenvectors as columns to form a 13 x 2 matrix
w = np.hstack((eigen_pairs[0][1][:, np.newaxis], eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w)

------------------------------------------------------------------------------

Matrix W:
 [[-0.13724218  0.50303478]
 [ 0.24724326  0.16487119]
 [-0.02545159  0.24456476]
 [ 0.20694508 -0.11352904]
 [-0.15436582  0.28974518]
 [-0.39376952  0.05080104]
 [-0.41735106 -0.02287338]
 [ 0.30572896  0.09048885]
 [-0.30668347  0.00835233]
 [ 0.07554066  0.54977581]
 [-0.32613263 -0.20716433]
 [-0.36861022 -0.24902536]
 [-0.29669651  0.38022942]]

------------------------------------------------------------------------------
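A quick sanity check on a projection matrix built this way is that its columns are orthonormal, that is, W^T W is the k x k identity. The sketch below uses hypothetical random data and np.argsort with column indexing in place of the tuple list. It is an equivalent way to pick the top k eigenvectors, not the code used above.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 200 samples, 5 features, standardized per feature
X = rng.normal(size=(200, 5))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov_mat = np.cov(X_std.T)
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

# Indices of the eigenvalues from largest to smallest
order = np.argsort(eigen_vals)[::-1]

# Take the eigenvectors of the k largest eigenvalues as columns of W
k = 2
w = eigen_vecs[:, order[:k]]

# Columns of W should be unit length and mutually orthogonal
gram = w.T @ w
print(w.shape)
print(np.allclose(gram, np.eye(k)))
```

Eigenvectors are only determined up to sign, so the signs of the columns of W can differ between NumPy versions or solvers without changing the subspace.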

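Transform the d-dimensional input dataset X with the projection matrix W to obtain the new k-dimensional feature subspace

code: Python

print('Original number of samples: {}, features: {}'.format(*X_train_std.shape))

# Apply the linear transformation X' = XW
X_train_pca = X_train_std.dot(w)
print('Number of samples after transformation: {}, features: {}'.format(*X_train_pca.shape))

------------------------------------------------------------------------------

Original number of samples: 124, features: 13
Number of samples after transformation: 124, features: 2

------------------------------------------------------------------------------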

code: Python

colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']

# Plot each class with its own combination of class label, point color and marker
for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train == l, 0], X_train_pca[y_train == l, 1], c=c, label=l, marker=m)

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()

https://gyazo.com/a6702f66439b1b785e9da1a96cc68d3b

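The data are spread more widely along principal component 1 than along principal component 2, which is consistent with the explained variance ratios.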
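This relationship can be checked numerically: the sample variance of the data along each projected axis equals the corresponding eigenvalue. The sketch below uses hypothetical correlated random data (not the Wine dataset); the names proj_var and top_vals are introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical correlated data: 300 samples, 4 features
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the covariance matrix, largest eigenvalue first
cov_mat = np.cov(X_std.T)
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)
order = np.argsort(eigen_vals)[::-1]
top_vals = eigen_vals[order[:2]]
w = eigen_vecs[:, order[:2]]

# Project onto the top two principal components
X_pca = X_std @ w

# Sample variance (ddof=1) along each projected axis
proj_var = X_pca.var(axis=0, ddof=1)
print(proj_var)
print(top_vals)
```

The two projected axes are also uncorrelated, so the covariance matrix of X_pca is diagonal with these variances on its diagonal.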

Coding (scikit-learn)

code: Python

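from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# Create a PCA instance, specifying the number of principal components
pca = PCA(n_components=2)

# Create a logistic regression instance
lr = LogisticRegression()

# Fit PCA on the training data, then apply the same transformation to the test data
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Fit logistic regression on the transformed training data
lr.fit(X_train_pca, y_train)

# Plot the decision regions
# Note: plot_decision_regions is a helper that is not defined on this page;
# it has to be defined or imported before this call
plot_decision_regions(X_train_pca, y_train, classifier=lr)

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()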

https://gyazo.com/eef6441b6146b5988861ae5f838ad5d6
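The scikit-learn result can be cross-checked against the manual eigendecomposition from the first half of this page. The sketch below uses hypothetical random data rather than the Wine dataset. It relies on the fitted PCA attributes explained_variance_, explained_variance_ratio_ and components_, and compares the components up to sign because eigenvectors are only defined up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical correlated data: 150 samples, 5 features, standardized per feature
X = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# scikit-learn PCA with two components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

# Manual route: eigendecomposition of the covariance matrix, largest eigenvalue first
eigen_vals, eigen_vecs = np.linalg.eigh(np.cov(X_std.T))
order = np.argsort(eigen_vals)[::-1]
top_vals = eigen_vals[order[:2]]
top_vecs = eigen_vecs[:, order[:2]]

# Eigenvalues correspond to explained_variance_
print(np.allclose(pca.explained_variance_, top_vals))
# Explained variance ratios are the eigenvalues divided by the total variance
print(np.allclose(pca.explained_variance_ratio_, top_vals / eigen_vals.sum()))
# Rows of components_ are the principal axes; compare with the eigenvectors up to sign
print(np.allclose(np.abs(pca.components_), np.abs(top_vecs.T)))
```

Note that components_ has shape (n_components, n_features), the transpose of the matrix W built by hand.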