以前スチューデントのt分布での最尤推定を実装したときに、この分布を線形回帰にも適用してみるということをいっていたので今回やってみました。

線形回帰

$\{x_n,t_n\}_{n=1}^N$を学習用データとすると、線形回帰では、${\bf \phi}(\cdot)$を特徴ベクトルとして、二乗誤差が最小になるようなパラメータ${\bf w}^*$を推定します。

{\bf w}^* = \arg \min_{\bf w}\sum_{n=1}^N(t_n - {\bf w}^\top{\bf \phi}(x_n))^2

これは確率分布を用いて解釈しなおすと、次のガウス分布を用いた尤度関数を最大化するようなパラメータを推定することと同じです。

\prod_{n=1}^N\mathcal{N}(t_n|{\bf w}^\top{\bf\phi}(x_n), \sigma^2)

こういった意味でもガウス分布はよく用いられるのですが、ノイズに対して比較的頑健でないなどの欠点もあります。
今回は、尤度関数においてスチューデントのt分布を用いてパラメータを推定します。

\prod_{n=1}^N{\rm St}(t_n|{\bf w}^\top{\bf\phi}(x_n),\lambda,\nu)

スチューデントのt分布の概形は下の図(PRML図2.15より)のようになっています。$\nu\to\infty$(緑)のときはガウス分布と同じ形状になります。緑の分布と比べて、青、赤の分布は裾が広いので外れ値に対しても頑健になっています。

コード

students_t_regression.py

import functools
import itertools
import matplotlib.pyplot as plt
import numpy as np
import scipy.special as sp


class PolynomialFeatures(object):

    def __init__(self, degree=2):
        assert isinstance(degree, int)
        self.degree = degree

    def transform(self, x):
        if x.ndim == 1:
            x = x[:, None]
        x_t = x.transpose()
        features = [np.ones(len(x))]
        for degree in range(1, self.degree + 1):
            for items in itertools.combinations_with_replacement(x_t, degree):
                features.append(functools.reduce(lambda x, y: x * y, items))
        return np.asarray(features).transpose()


class LinearRegressor(object):

    def fit(self, X, t):
        self.w = np.linalg.pinv(X) @ t

    def predict(self, X):
        return X @ self.w


class RobustRegressor(LinearRegressor):

    def __init__(self, precision=1., dof=1.):
        self.precision = precision
        self.dof = dof

    def fit(self, X, t, learning_rate=0.01):
        super().fit(X, t)
        params = np.hstack((self.w.ravel(), self.precision, self.dof))
        while True:
            E_precision, E_ln_precision = self._expect(X, t)
            self._maximize(X, t, E_precision, E_ln_precision, learning_rate)
            params_new = np.hstack((self.w.ravel(), self.precision, self.dof))
            if np.allclose(params, params_new):
                break
            params = np.copy(params_new)

    def _expect(self, X, t):
        sq_diff = (t - X @ self.w) ** 2
        E_eta = (self.dof + 1) / (self.dof + self.precision * sq_diff)
        E_ln_eta = (
            sp.digamma(0.5 * (self.dof + 1))
            - np.log(0.5 * (self.dof + self.precision * sq_diff)))
        return E_eta, E_ln_eta

    def _maximize(self, X, t, E_eta, E_ln_eta, learning_rate):
        sq_diff = (t - X @ self.w) ** 2
        self.w = np.linalg.inv(X.T @ (E_eta[:, None] * X)) @ X.T @ (E_eta * t)
        self.precision = 1 / np.mean(E_eta * sq_diff)
        N = len(t)
        self.dof += learning_rate * (
            N * np.log(0.5 * self.dof) + N
            - N * sp.digamma(0.5 * self.dof)
            + np.sum(E_ln_eta - E_eta))


def create_toy_data(n):
    x = np.linspace(-1, 1, n)
    y = np.random.randint(-5, 6) * x + np.random.normal(scale=0.1, size=n)
    y[np.random.randint(len(y))] += np.random.uniform(-10, 10)
    return x, y


def main():
    x_train, y_train = create_toy_data(10)

    feature = PolynomialFeatures(degree=1)
    X_train = feature.transform(x_train)

    linear_regressor = LinearRegressor()
    robust_regressor = RobustRegressor()
    linear_regressor.fit(X_train, y_train)
    robust_regressor.fit(X_train, y_train)

    x = np.linspace(-1, 1, 100)
    X = feature.transform(x)
    y_linear = linear_regressor.predict(X)
    y_robust = robust_regressor.predict(X)

    plt.scatter(
        x_train, y_train,
        facecolors="none", edgecolors="g", label="training data")
    plt.plot(x, y_linear, c="b", label="Gaussian")
    plt.plot(x, y_robust, c="r", label="Student's t")
    plt.legend(loc="best")
    plt.show()


if __name__ == '__main__':
    main()

結果

尤度関数にガウス分布を用いた青線のモデルでは外れ値に引っ張られていますが、スチューデントのt分布の方では外れ値があっても頑健に線形回帰しています。

スチューデントのt分布による線形回帰

線形回帰

コード

結果