
Linear Regression Glossary

Last updated at 2021-08-15, posted at 2021-08-11

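Introduction

These are my notes on what I learned from Chapter 4, "Training Models", of O'Reilly Japan's book "scikit-learn、Keras、TensorFlowによる実践機械学習" (Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow).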

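Linear regression

Normal equation

The normal equation is the closed-form solution
$$\theta = (X^TX)^{-1}X^Ty$$
It follows from the gradient of the MSE cost function, which is proportional to
$$X^T(X\theta - y) = X^TX\theta - X^Ty$$
where $X\theta$ is the hypothesis function.
Setting this gradient to 0 and solving for $\theta$ gives the normal equation.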

import numpy as np
import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
plt.scatter(X, y);

# Add the bias (intercept) column
X_b = np.c_[np.ones((100, 1)), X]
# np.linalg.inv returns the inverse matrix
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
display(theta_best)

# Output
array([[3.9500543 ],
       [2.87523365]])

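X_new = np.array([[0], [2]])  # endpoints of the line to draw
y_pred = np.c_[np.ones((2, 1)), X_new] @ theta_best  # predictions from the fitted parameters
plt.plot(X_new, y_pred, 'r')
plt.scatter(X, y);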

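Even without gradient descent, the normal equation gave us the fitted line.
It requires a matrix inverse; when $X^TX$ has no inverse, the problem can still be solved with the pseudoinverse.
The pseudoinverse is computed with singular value decomposition.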
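
The pseudoinverse route can be sketched as follows. This is a minimal example on synthetic data shaped like the data above; the fixed seed and the `rng`-based generation are assumptions added for reproducibility, not part of the original notes. It compares the explicit inverse with `np.linalg.pinv` and `np.linalg.lstsq`, both of which rely on SVD internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption for reproducibility
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # add the bias column

# Normal equation with an explicit inverse
theta_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Moore-Penrose pseudoinverse; still defined when X_b.T @ X_b is singular
theta_pinv = np.linalg.pinv(X_b) @ y

# Least-squares solver; returns the solution plus diagnostic values
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_inv.ravel(), theta_pinv.ravel(), theta_lstsq.ravel())
```

On well-conditioned data like this, all three approaches should agree up to floating-point error.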

Eigendecomposition

$$A\vec{v} = \lambda\vec{v}$$

The goal is to find a $\vec{v}$ and $\lambda$ that satisfy this. A is a square matrix and $\lambda$ is a scalar.

$$
\begin{pmatrix}
2 & 5 \\
3 & 4
\end{pmatrix}
\begin{pmatrix}
1 \\
1
\end{pmatrix}
=
7
\begin{pmatrix}
1 \\
1
\end{pmatrix}
$$

This is what we want to achieve.
Because there is more than one pair of $\lambda, \vec{v}$, they are collected into matrices and written as
$$AV = V\Lambda$$

$$
\begin{pmatrix}
2 & 5 \\
3 & 4
\end{pmatrix}
\begin{pmatrix}
5 & 1 \\
-3 & 1
\end{pmatrix}
=
\begin{pmatrix}
5 & 1 \\
-3 & 1
\end{pmatrix}
\begin{pmatrix}
-1 & 0 \\
0 & 7
\end{pmatrix}
$$

This is then rearranged into the form
$$A = V\Lambda V^{-1}$$

$$
\begin{pmatrix}
2 & 5 \\
3 & 4
\end{pmatrix}
=
\begin{pmatrix}
5 & 1 \\
-3 & 1
\end{pmatrix}
\begin{pmatrix}
-1 & 0 \\
0 & 7
\end{pmatrix}
\begin{pmatrix}
5 & 1 \\
-3 & 1
\end{pmatrix}^{-1}
$$

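A rewritten in this form is its eigendecomposition.
It makes operations such as repeated multiplication easier, because powers of A reduce to powers of the diagonal matrix $\Lambda$.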
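
The worked example can be checked numerically. The sketch below uses `np.linalg.eig` on the same matrix. Note that NumPy returns eigenvectors scaled to unit length and in no guaranteed order, so the columns of `V` are scalar multiples of (5, -3) and (1, 1) rather than those exact vectors.

```python
import numpy as np

A = np.array([[2.0, 5.0],
              [3.0, 4.0]])

# eigvals holds the eigenvalues; the columns of V are the matching eigenvectors
eigvals, V = np.linalg.eig(A)
Lambda = np.diag(eigvals)

# Rebuild A from its eigendecomposition: A = V Lambda V^{-1}
A_rebuilt = V @ Lambda @ np.linalg.inv(V)

# Powers only need powers of the diagonal entries: A^3 = V Lambda^3 V^{-1}
A_cubed = V @ np.diag(eigvals ** 3) @ np.linalg.inv(V)

print(np.sort(eigvals))
```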

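Singular value decomposition

Eigendecomposition is limited to square matrices and cannot be applied to rectangular matrices, which have no inverse.
In that case, use singular value decomposition (SVD).

This video explains it in great detail:
https://www.youtube.com/watch?v=CUtT2Pi3ITQ

SVD is a way to carry out an eigendecomposition-like factorization on matrices that are not square (rectangular matrices).
The columns of U are called the left singular vectors, the columns of V the right singular vectors, and the diagonal entries of $\Sigma$ the singular values.
It is useful for dimensionality reduction and for finding the rank of a matrix.
$$A = U\Sigma V^T$$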
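
A minimal sketch of the factorization on a non-square matrix, using `np.linalg.svd`. The 3x2 matrix is arbitrary illustration data, and the `1e-10` threshold used to count non-zero singular values is an assumption chosen for this example.

```python
import numpy as np

# A 3x2 (non-square) matrix; the values are arbitrary illustration data
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# full_matrices=False gives the compact form: U is 3x2, s has 2 entries, Vt is 2x2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt

# Rank = number of singular values that are not numerically zero
rank = int(np.sum(s > 1e-10))

# Dimensionality reduction idea: keep only the largest singular value
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0])

print(s, rank)
```

Dropping the smaller singular values, as in `A_rank1`, is the step that dimensionality reduction builds on.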

Stochastic gradient descent

n_epochs = 50  # how many rounds of m iterations to run
m = 100  # number of samples

# Learning schedule: the learning rate shrinks gradually with each iteration
t0, t1 = 5, 50
def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):  # one iteration per sample
        random_index = np.random.randint(m)  # pick one of the m samples at random
        xi = X_b[random_index : random_index + 1]
        yi = y[random_index : random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)  # t = total number of iterations so far
        theta = theta - eta * gradients

display(theta)

# Output
array([[4.01986671],
       [2.85402954]])


from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.1)  # tol is the stopping tolerance on the loss improvement
sgd_reg.fit(X, y.ravel())  # flatten y to 1-D
display(sgd_reg.intercept_, sgd_reg.coef_)

# Output
(array([3.96178765]), array([2.88206974]))

Notes

test = np.arange(12).reshape(6, 2)
display(test)
# Output
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

display(test[1])
display(test[1].shape)
# Output
array([2, 3])
(2,)

display(test[1:2])
display(test[1:2].shape)
# Output
array([[2, 3]])
(1, 2)

Integer indexing drops a dimension, while slicing keeps the array 2-D; choose between them depending on the situation.

Polynomial regression

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1) 
plt.scatter(X, y, c='r')
plt.xlabel('X')
plt.ylabel('y');

Fitting a straight regression line to this data gives the following.

from sklearn.linear_model import LinearRegression
sk_model = LinearRegression()
sk_model.fit(X, y)
plt.plot(X, sk_model.intercept_ + sk_model.coef_ * X)
plt.scatter(X, y, color='r');

Add the square of the feature to the explanatory variables as a new feature.

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2, include_bias=False)
X_poly = poly.fit_transform(X)
sk_model_3 = LinearRegression()
sk_model_3.fit(X_poly, y)
X_sorted = np.sort(X, axis=0)  # sort so the curve is drawn from left to right
plt.plot(X_sorted, sk_model_3.predict(poly.transform(X_sorted)))
plt.scatter(X, y, color='r');

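Learning curves

A plot with the number of iterations (in the code below, the training set size) on the horizontal axis and an evaluation metric on the vertical axis.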

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

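# Module-level split; the early-stopping code later in this article reuses it
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

def plot_learning_curves(model, X, y):
    # Split the passed-in data so that the X and y arguments are actually used
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_error, val_error = [], []
    for m in range(1, len(X_train) + 1):  # train on the first m samples
        model.fit(X_train[:m], y_train[:m])
        y_train_pred = model.predict(X_train[:m])
        y_val_pred = model.predict(X_val)
        train_error.append(mean_squared_error(y_train[:m], y_train_pred))
        val_error.append(mean_squared_error(y_val, y_val_pred))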
    plt.plot(train_error, label='train')
    plt.plot(val_error, label='val')
    plt.xlabel('train set size')
    plt.ylabel('MSE')
    plt.ylim(0, 7)
    plt.legend()

Underfitting

The MSE is high overall, for both the training and validation sets.

lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)

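Overfitting

The validation error does not come down to the level of the training error.
As the training set grows, the two curves come closer together.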

from sklearn.pipeline import Pipeline

poly = Pipeline([
    ('poly', PolynomialFeatures(degree=10, include_bias=False)),
    ('lin_reg', LinearRegression())
])
plot_learning_curves(poly, X, y)

Ridge

$$J(\theta)=MSE(\theta)+\alpha \frac{1}{2}\sum_{i=1}^n\theta_i^2$$

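$$\sum_{i=1}^n\theta_i^2 = \theta_1^2 + \theta_2^2 + \theta_3^2 + \cdots$$
This $J(\theta)$ is used as the cost function; the sum starts at $i=1$, so the bias term $\theta_0$ is not regularized.
Because the weights are squared, the penalty is measured by the Euclidean length (L2 norm) of the weight vector.
Training moves in the direction that reduces the L2 norm.
Since the penalty acts on the length of the whole vector, no individual weight becomes exactly 0.
The gradient gets gentler as the weights approach their final values, so they shrink toward zero without ever reaching it.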

Lasso

$$J(\theta)=MSE(\theta)+\alpha \frac{1}{2}\sum_{i=1}^n|\theta_i|$$
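This is used as the cost function.
Training moves in the direction that reduces the L1 norm.
Because the penalty is a sum of absolute values rather than a Euclidean length, the weights are reduced at the same rate, so the smaller weights reach 0 first.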

Elastic Net

$$J(\theta)=MSE(\theta)+r\alpha \frac{1}{2}\sum_{i=1}^n|\theta_i| + \frac{1-r}{2}\alpha\sum_{i=1}^n\theta_i^2$$
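With r=0 this is Ridge; with r=1 it is Lasso.
Ridge is the usual default (because Lasso discards features by setting their weights to 0).
However, when some features look unnecessary or you want to reduce dimensionality, Elastic Net is generally preferred over Lasso.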
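
The difference between the three penalties can be seen on a small synthetic problem. This is only a sketch: the data (`X_demo`, `y_demo`), the seed, and the `alpha` values are assumptions picked for illustration, and scikit-learn scales the terms of its objectives somewhat differently from the formulas above, so `alpha` is not directly comparable. `l1_ratio` plays the role of r.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(42)  # fixed seed: an assumption for reproducibility
X_demo = rng.standard_normal((200, 5))
# Only the first of the five features actually influences the target
y_demo = 3.0 * X_demo[:, 0] + 0.5 * rng.standard_normal(200)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
elastic = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_demo, y_demo)

# Ridge shrinks every weight but keeps them non-zero;
# Lasso sets the weights of the irrelevant features to exactly 0
print("ridge  :", ridge.coef_)
print("lasso  :", lasso.coef_)
print("elastic:", elastic.coef_)
```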

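Early stopping

A method for curbing overfitting: stop training at the point where the loss on the validation data no longer decreases as the number of training iterations grows.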

from copy import deepcopy
from sklearn.preprocessing import StandardScaler

# Prepare the data
poly_scaler = Pipeline([
    ('poly', PolynomialFeatures(degree=90, include_bias=False)),
    ('std_scaler', StandardScaler())
])

X_train_poly_scaler = poly_scaler.fit_transform(X_train)
# Reuse the statistics fitted on the training set; do not refit on the validation set
X_val_poly_scaler = poly_scaler.transform(X_val)

from sklearn.linear_model import SGDRegressor

# warm_start=True makes fit() continue from where the previous call left off, hence max_iter=1
# With learning_rate='constant', eta stays equal to eta0
# (the default is a learning schedule in which eta decreases over the iterations)

sgd_reg = SGDRegressor(max_iter=1, tol=np.inf, warm_start=True, penalty=None, learning_rate='constant', eta0=0.0005)
minimum_val_error = float('inf')
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaler, y_train.ravel())
    y_val_pred = sgd_reg.predict(X_val_poly_scaler)
    val_error = mean_squared_error(y_val, y_val_pred)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)  # clone() would return an unfitted copy

display(minimum_val_error, best_epoch, best_model)

# Output
(13.079674530111143,
 0,
 SGDRegressor(eta0=0.0005, learning_rate='constant', max_iter=1, penalty=None,
              tol=inf, warm_start=True))

Rather than actually stopping early, this is more like keeping the best model seen so far.

Logistic regression

from sklearn import datasets
iris = datasets.load_iris()
X = iris['data'][:, 3:]
y = (iris['target'] == 2).astype(int)

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X, y)

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new, y_proba[:, 1], c='g', label='iris virginica')
plt.plot(X_new, y_proba[:, 0], c='b', label='not iris virginica')
plt.legend();

log_reg.predict([[1.7], [1.5]])
# The decision boundary is at roughly 1.6

# Output
array([1, 0])

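Softmax function

An activation function used for classification into two or more classes.
Think of it as adding more lines to the plot above; the probabilities still sum to 1.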
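
As a rough illustration of the "sums to 1" property, here is a minimal NumPy sketch of the softmax function. The helper name `softmax` and the example scores are hypothetical; this is not how scikit-learn implements it internally.

```python
import numpy as np

def softmax(scores):
    """Turn a 1-D array of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

scores = np.array([1.0, 2.0, 3.0])  # hypothetical scores for three classes
probs = softmax(scores)
print(probs, probs.sum())
```

The class with the largest score gets the largest probability, and adding more classes only adds more entries that share the same total of 1.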

X = iris['data'][:, (2, 3)]
y = iris['target']
softmax_leg = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=10)
softmax_leg.fit(X, y)
display(softmax_leg.predict([[5, 2]]))

# Output
array([2])

display(softmax_leg.predict_proba([[5, 2]]))

# Output
array([[6.38014896e-07, 5.74929995e-02, 9.42506362e-01]])

The model predicts class 2 with about 94% probability and class 1 with about 5.7%.

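Multicollinearity

In multiple regression analysis, the situation in which the explanatory variables are correlated with each other.
One variable then adds little information (residual) beyond what the other already provides, which makes it hard to predict the target variable reliably.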

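VIF (variance inflation factor)

One of the indicators used to judge whether multicollinearity is present.
$$\mathrm{VIF}_i = \frac{1}{1-R_i^2}$$
$R_i^2$ is the coefficient of determination obtained when the $i$-th explanatory variable is regressed on all the other explanatory variables.
It is similar to a correlation matrix, but a correlation matrix only shows the correlation between A and B, not multivariate relationships such as A and B together against C, so VIF is often used instead.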
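
The formula can be applied directly by regressing each feature on the others. The sketch below does this on the iris features with scikit-learn; the helper name `vif_by_definition` and the variable `X_iris` are hypothetical. `LinearRegression` fits an intercept here, so the numbers may differ from values computed on a design matrix that has no constant column.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

X_iris = load_iris().data  # shape (150, 4)

def vif_by_definition(X, i):
    """VIF of column i: regress it on the other columns and return 1 / (1 - R^2)."""
    others = np.delete(X, i, axis=1)
    target = X[:, i]
    r2 = LinearRegression().fit(others, target).score(others, target)
    return 1.0 / (1.0 - r2)

vifs = [vif_by_definition(X_iris, i) for i in range(X_iris.shape[1])]
print(vifs)
```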

import pandas as pd  # must be imported before the DataFrame is built
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris['target']
display(X)

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2
5	5.4	3.9	1.7	0.4

150 rows × 4 columns

from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['features'] = X.columns
display(vif)

	VIF Factor	features
0	262.96934824146770	sepal length (cm)
1	96.35329172369063	sepal width (cm)
2	172.96096155387588	petal length (cm)
3	55.50205979323787	petal width (cm)

plt.bar(vif['features'], vif["VIF Factor"], color='red')
plt.ylabel('VIF values', fontsize = 16)
plt.xticks(rotation=45);

A VIF of 10 or more suggests multicollinearity.
In this case, multicollinearity appears to have a considerable effect.

Conclusion

These are my own personal notes.
I will add more as I learn more.
