More than 1 year has passed since last update.

主成分分析と因子分析でのbiplotの作り方

Posted at 2024-07-21

R言語ではbiplotが関数としてありますがPythonではコーディングしないとできません。
そこでR言語と同じようなbiplotの作り方を紹介します。

ライブラリのインポート

共通のライブラリ

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

次元圧縮ライブラリ

from sklearn.decomposition import PCA #主成分分析
from sklearn.decomposition import FactorAnalysis as FA #因子分析

データの読み込み

df = pd.read_csv("wine.csv")

説明変数と目的変数を分離

教師ありの場合などは圧縮後に色を付けると構造が分かりやすくなります。

y = df["Wine"]
x = df.drop("Wine", axis=1)

データの標準化

基本的に単位尺度を揃えないと大きな値や外れ値に影響されやすくなり本来の分布と異なるため標準化を行います。

ss = StandardScaler()
ss.fit(x)
sx = ss.transform(x)

主成分分析

ここからは実際に主成分分析で次元を圧縮します。

PCAで圧縮

model = PCA()
model.fit(sx)
tx = model.transform(sx)

寄与率と固有ベクトルから因子負荷量を計算

evr = model.explained_variance_ratio_
com = model.components_
fac = []
for i in range(len(evr)):
    fac.append(np.sqrt(evr[i])*com[i])

両軸グラフを作る

片方のX軸Y軸は圧縮後のデータ用に、もう片方のX軸Y軸は因子負荷量にします。

fig, ax = plt.subplots()
ax1 = ax.twinx()
ax2 = ax.twiny()

圧縮したデータをプロット

ax1.scatter(tx[:, 0], tx[:, 1], c=y, cmap="brg")

プロット範囲のリストを作る

xlim = [abs(min(fac[0])), abs(max(fac[0]))]
ylim = [abs(min(fac[1])), abs(max(fac[1]))]

プロット範囲を指定する

ax2.set_xlim(-max(xlim), max(xlim))
ax2.set_ylim(-max(ylim), max(ylim))

因子負荷量をベクトル化する

for i in range(len(x.columns)):
    ax2.annotate("", xytext=[0, 0], xy=[fac[0][i], fac[1][i]],
                 arrowprops=dict(shrink=0, width=1, headwidth=6, 
                                headlength=10, connectionstyle='arc3',
                                facecolor='red', edgecolor='red'))
    ax2.text(fac[0][i], fac[1][i], x.columns[i])
plt.show()

完成しました。

因子分析

因子分析では「components_」が因子負荷量に相当します。

次元圧縮

model = FA()
model.fit(sx)
tx = model.transform(sx)

因子負荷量の計算

com = model.components_

両軸グラフを作る

基本的な流れは主成分分析と変わりません

fig, ax = plt.subplots()
ax1 = ax.twinx()
ax2 = ax.twiny()

圧縮したデータをプロット

ax1.scatter(tx[:, 0], tx[:, 1], c=y, cmap="brg")

プロット範囲のリストを作成

xlim = [abs(min(com[0])), abs(max(com[0]))]
ylim = [abs(min(com[1])), abs(max(com[1]))]

プロット範囲を指定

ax2.set_xlim(-max(xlim), max(xlim))
ax2.set_ylim(-max(ylim), max(ylim))

因子負荷量のベクトルを図示

for i in range(len(x.columns)):
    ax2.annotate("", xytext=[0, 0], xy=[com[0][i], com[1][i]],
                 arrowprops=dict(shrink=0, width=1, headwidth=6, 
                                headlength=10, connectionstyle='arc3',
                                facecolor='red', edgecolor='red'))
    ax2.text(com[0][i], com[1][i], x.columns[i])
plt.show()

完成しました。

全体コード

主成分分析

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("wine.csv")
y = df["Wine"]
x = df.drop("Wine", axis=1)
ss = StandardScaler()
ss.fit(x)
sx = ss.transform(x)
model = PCA()
model.fit(sx)
tx = model.transform(sx)
evr = model.explained_variance_ratio_
com = model.components_
fac = []
for i in range(len(evr)):
    fac.append(np.sqrt(evr[i])*com[i])
fig, ax = plt.subplots()
ax1 = ax.twinx()
ax2 = ax.twiny()
ax1.scatter(tx[:, 0], tx[:, 1], c=y, cmap="brg")
xlim = [abs(min(fac[0])), abs(max(fac[0]))]
ylim = [abs(min(fac[1])), abs(max(fac[1]))]
ax2.set_xlim(-max(xlim), max(xlim))
ax2.set_ylim(-max(ylim), max(ylim))
for i in range(len(x.columns)):
    ax2.annotate("", xytext=[0, 0], xy=[fac[0][i], fac[1][i]],
                 arrowprops=dict(shrink=0, width=1, headwidth=6, 
                                headlength=10, connectionstyle='arc3',
                                facecolor='red', edgecolor='red'))
    ax2.text(fac[0][i], fac[1][i], x.columns[i])
plt.show()

因子分析

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import FactorAnalysis as FA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("wine.csv")
y = df["Wine"]
x = df.drop("Wine", axis=1)
ss = StandardScaler()
ss.fit(x)
sx = ss.transform(x)
model = FA()
model.fit(sx)
tx = model.transform(sx)
com = model.components_
fig, ax = plt.subplots()
ax1 = ax.twinx()
ax2 = ax.twiny()
ax1.scatter(tx[:, 0], tx[:, 1], c=y, cmap="brg")
xlim = [abs(min(com[0])), abs(max(com[0]))]
ylim = [abs(min(com[1])), abs(max(com[1]))]
ax2.set_xlim(-max(xlim), max(xlim))
ax2.set_ylim(-max(ylim), max(ylim))
for i in range(len(x.columns)):
    ax2.annotate("", xytext=[0, 0], xy=[com[0][i], com[1][i]],
                 arrowprops=dict(shrink=0, width=1, headwidth=6, 
                                headlength=10, connectionstyle='arc3',
                                facecolor='red', edgecolor='red'))
    ax2.text(com[0][i], com[1][i], x.columns[i])
plt.show()

参考文献

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up