More than 3 years have passed since last update.

GLM パッケージによる定数項を含まない重回帰分析

Posted at 2021-10-18

GLM パッケージによる定数項を含まない重回帰分析の結果に一部不備がある

以下のデータを用いて検証する。

data = [25.43  78.11  34.35   98.88  53.28
        18.56  76.34  35.76  100.48  56.91
        28.66  79.05  44.00  103.21  54.52
        13.52  72.24  35.26   97.97  50.13
        30.52  77.86  41.97   97.32  59.89
        30.16  77.88  50.14   98.35  53.43
        29.30  79.26  35.15   98.55  57.35
        31.28  79.38  49.57   97.74  55.13
        31.46  79.14  46.80  101.98  53.71
        29.21  77.63  37.07  102.35  36.10
        22.70  76.80  37.95   88.83  47.18
        34.47  78.88  53.90  100.21  59.90
        37.38  78.90  54.16  104.02  67.05
        13.38  73.97  42.15   92.01  42.44
        16.98  75.59  30.25   92.59  32.98];
using DataFrames
df = DataFrame(data, [:x1, :x2, :x3, :x4, :y]);
first(df, 2)

2 rows × 5 columns

	x1	x2	x3	x4	y
	Float64	Float64	Float64	Float64	Float64
1	25.43	78.11	34.35	98.88	53.28
2	18.56	76.34	35.76	100.48	56.91

Julia の GLM.lm( ) では，定数項を含まない場合の決定係数，調整済み決定係数が普通の定義（たとえば R での結果）とは異なる結果となる。

Julia での分析結果

using GLM
result2 = lm(@formula(y ~ 0 + x1 + x2 + x3 + x4), df);
println(result2)

StatsModels.DataFrameRegressionModel{LinearModel{LmResp{Vector{Float64}}, DensePredChol{Float64, Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}

Formula: y ~ x1 + x2 + x3 + x4

Coefficients:
─────────────────────────────────────────────
     Estimate  Std.Error    t value  Pr(>|t|)
─────────────────────────────────────────────
x1   0.338163   0.377074   0.896807    0.3890
x2  -0.216267   0.687945  -0.314368    0.7591
x3   0.412138   0.367005   1.12298     0.2854
x4   0.433544   0.54367    0.797439    0.4421
─────────────────────────────────────────────

println("R2      = $(r2(result2))")
println("R2Adj.  = $(adjr2(result2))")

R2      = 0.4549168173881398
R2Adj.  = 0.3062577675849053

R での実行結果

using RCall
R"""
result2 = lm(y ~ 0 + x1 + x2 + x3 + x4, $df)
summary2 = summary(result2)
print(summary2, digits=6)
""";

R"""
options(digits=16)
cat(sprintf("R2     = %.15g\n", summary2$r.squared))
cat(sprintf("R2Adj. = %.15g\n", summary2$adj.r.squared))
""";

Call:
lm(formula = y ~ 0 + x1 + x2 + x3 + x4, data = `#JL`$df)

Residuals:
       Min         1Q     Median         3Q        Max 
-16.640026  -3.271851  -0.356926   4.361203   8.843050 

Coefficients:
    Estimate Std. Error  t value Pr(>|t|)
x1  0.338163   0.377074  0.89681  0.38903
x2 -0.216267   0.687945 -0.31437  0.75912
x3  0.412138   0.367005  1.12298  0.28536
x4  0.433544   0.543670  0.79744  0.44207

Residual standard error: 7.57951 on 11 degrees of freedom
Multiple R-squared:  0.984853,	Adjusted R-squared:  0.979344 
F-statistic:   178.8 on 4 and 11 DF,  p-value: 6.29747e-10

R2     = 0.984852626318541
R2Adj. = 0.979344490434375

R での結果において，決定係数は 0.9848526263185413，自由度調整済み決定係数は 0.9793444904343745 である。

Julia の GLM.ml() ではそれぞれ，0.4549168173881398，0.3062577675849053 と大きく異なる。

Julia での R2, R2adj の計算法

結果が違う原因は，Julia の GLM.ml() での決定係数と自由度調整済み決定係数は定数項を含む場合も含まない場合も同じ計算法（平均値を中心とする平方和に基づいている）をとっていることである。
¹

$$
R^2 = 1 - \frac{S_e}{S_t}, \ \ \ \
R^{2}_{adj} = 1 - \frac{S_e / (n - p)}{S_t / (n - 1)}
$$

ここで，$n$ はサンプルサイズ，$p$ は独立変数の個数，$S_t$ は従属変数 $y$ の平方和，$S_e$ は残差平方和である。
$$
S_t = \displaystyle \sum_{i=1}^n \left (y_i-\bar{y}\right )^2, \ \ \ \
S_e = \displaystyle \sum_{i=1}^n \left (y_i-\hat{y} \right )^2
$$

計算をしてみると確かに GLM.lm( ) と同じ結果になる。

n, p = size(df)
p -= 1
println("n = $n, p = $p")
St = var(df.y) * (n - 1)
Se = sum(residuals(result2) .^ 2)
R² = 1 - Se / St
R²adj = 1 - (Se / (n - p)) / (St / (n - 1))
println("R²     = $R²")
println("R²adj  = $R²adj")

n = 15, p = 4
R²     = 0.4549168173881398
R²adj  = 0.3062577675849052

R などでの R2, R2adj の計算法

原点を中心とした平方和に基づく。

$$
R^2 = \frac{S_{predict}}{S_{observe}}, \ \ \ \
R^{2}_{adj} = 1 - \left (1 - R^2 \right )\frac{n}{n - p}
$$

$$
S_{observe} = \sum_{i=1}^n y_i^2, \ \ \ \
S_{predict} = \sum_{i=1}^n \hat{y}_i^2
$$

Spredict = sum(predict(result2) .^ 2)
Sobserve = sum(df.y .^ 2)
R2 = Spredict / Sobserve
R2adj = 1 - (1 - R2) * (n / (n - p))
println("R2    = $R2")
println("R2adj = $R2adj")

R2    = 0.9848526263185415
R2adj = 0.9793444904343748

定数項を使う場合は $(n - p)$ の項は $(n - p - 1)$ になる。 ↩

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up