More than 3 years have passed since last update.

一元配置分散分析

Posted at 2022-05-13

一元配置分散分析

帰無仮説 $H_0$：「各群の母平均値はすべて等しい」
対立仮説 $H_1$：「各群の母平均値はすべて等しいわけではない」
有意水準 $\alpha$ で検定を行う
独立2標本の場合以外では，片側検定は定義できない

独立二標本の平均値の差の検定で Welch の方法を取り上げていても，独立 $k$ 標本($k \geqq 3$) の場合の Welch の方法を取り上げていないサイトや教科書も数多い。
検定の多重性の問題も考えれば，「各群の分散が等しい」と仮定する方法を採用する意義はほとんどない。

更にひどい話だが，Julia の OneWayANOVATest() は，各群のサンプルサイズが等しくないときには誤った答えを出すというバグがあり，このバグは 2021/08/02 に報告されているにも関わらず，2022/05/13 現在でも未だに修正されていない。

1. 各群の分散が等しいと仮定する場合

群の数を $k$，全ケース数を $n$，各群のケース数を $n_j$，全体の平均値を $\bar{X}$，第 $j$ 群における平均値を$\bar{X}_j$ とする（$j= 1, 2, \dots, n;\ n = \sum n_j$）

平方和 $S_t$，$S_b$，$S_w$ を求める。

$\hspace{2cm}S_t = \displaystyle \sum_{j=1}^k \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}\right)^2$

$\hspace{2cm}S_b = \displaystyle \sum_{j=1}^k n_j\left(\bar{X}_j - \bar{X}\right)^2$

$\hspace{2cm}S_w = \displaystyle \sum_{j=1}^k \sum_{i=1}^{n_j} \left(X_{ij} - \bar{X}_j\right)^2$

なお，$S_t = S_b + S_w$ の関係があるので，求めやすい $S_t$, $S_w$ を先に求め， $S_b = S_t - S_w$ とするとよい。

以下のような分散分析表を構成する¹。

表 2.　一元配置分散分析表
変動要因	$平方和$	$自由度$	$平均平方$	$F\ 値$	$p\ 値$
$群間$	$S_{b}$	$df_{b} = k - 1$	$V_{b} = S_{b}\ /\ df_{b}$	$F_{0} = V_{b}\ /\ V_{w}$	$p = \Pr\{F \geqq F_0\}$
$群内$	$S_{w}$	$df_{w} = n - k$	$V_{w} = S_{w}\ /\ df_{w}$
$全体$	$S_{t} = S_{b} + S_{w}$	$df_{t} = n - 1$

例題

3 群のデータが，以下のように定義された。平均値に差があるといえるか？

using Distributions
using Random; Random.seed!(123)
x(μ, σ, n, digits=1) = round.(rand(Normal(μ, σ), n), digits=digits)
x1 = x(50, 10, 5)
x2 = x(52, 9, 9)
x3 = x(54, 8, 12);

Julia の OneWayANOVATest() にはバグがある。

using HypothesisTests
result = OneWayANOVATest(x1, x2, x3)

One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
    parameter of interest:   Means
    value under h_0:         "all equal"
    point estimate:          NaN

Test summary:
    outcome with 95% confidence: reject h_0
    p-value:                     0.0298

Details:
    number of observations: [5, 9, 12]
    F statistic:            4.10778
    degrees of freedom:     (2, 23)

簡単なので関数を書く。今後使う予定もないので，長ったらしい関数名にする。

using Printf
function EqVarOnewayANOVA(x...; verbose=true, returns=false)
    X = vcat(x...)
    n, grandmean = length(X), mean(X)
    St = var(X) * (n -1)
    k = length(x)
    ns = zeros(Int, k)
    means = zeros(k)
    for (i, d) in enumerate(x)
        ns[i] = length(d)
        means[i] = mean(d)
    end
    Sb = sum(ni * (meani - grandmean)^2 for (ni, meani) in zip(ns, means))
    Sw = St - Sb
    dfb, dfw, dft = k - 1, n - k, n - 1
    Vb, Vw = Sb / dfb, Sw / dfw
    F0 = Vb / Vw
    p = ccdf(FDist(dfb, dfw), F0)
    if verbose
        @printf("One-way analysis of variance (ANOVA) test\n")
        @printf("F statistic: %.5f,  degrees of freedom: (%d, %d),  p-value: %.5g\n", F0, dfb, dfw, p)
    end
    returns && Dict(:Sb=>Sb, :Sw=>Sw, :St=>St, :dfb=>dfb, :dfw=>dfw, :dft=>dft, :Vb=>Vb, :Vw=>Vw, :F0=>F0, :p=>p) 
end

function ANOVATable(a)
    println("\nANOVA Table")
    @printf("%-18s %10s %7s %10s %10s %10s\n", "Source of variance", "SS", "df", "MS", "F", "p")
    @printf("%-18s %10.5g %7d %10.5g %10.5g %10.5g\n", "Between groups", a[:Sb], a[:dfb], a[:Vb], a[:F0], a[:p])
    @printf("%-18s %10.5g %7d %10.5g\n", "Within group", a[:Sw], a[:dfw], a[:Vw])
    @printf("%-18s %10.5g %7d\n", "Total", a[:St], a[:dft])
end；

ANOVATable (generic function with 1 method)

分析してみる。

a = EqVarOnewayANOVA(x1, x2, x3; verbose=true, returns=true)
ANOVATable(a)

One-way analysis of variance (ANOVA) test
F statistic: 3.64120,  degrees of freedom: (2, 23),  p-value: 0.042285

ANOVA Table
Source of variance         SS      df         MS          F          p
Between groups         378.56       2     189.28     3.6412   0.042285
Within group           1195.6      23     51.983
Total                  1574.2      25

$p \lt 0.05$ ゆえ，有意水準 5% で帰無仮説は棄却される（$p = 0.042285$）。すなわち，すべての平均値が等しいわけではない。

R の結果と同じになることを確かめておこう。

using RCall
R"""
df = data.frame(x = c($x1, $x2, $x3), g = rep(1:3, c(5, 9, 12)))
result = oneway.test(x ~ g, data=df, var.equal=TRUE)
print(result)
""";

	One-way analysis of means

data:  x and g
F = 3.6412, num df = 2, denom df = 23, p-value = 0.04228

2. 各群の分散が等しいと仮定しない場合

先に「，独立 $k$ 標本($k \geqq 3$) の場合の Welch の方法を取り上げていないサイトや教科書も数多い。」と書いたが，Julia でも，Welch の方法に対する公式プログラムは用意されていない。

簡単なので関数を書く。

using Printf
function OnewayANOVA(x...; verbose=false, returns=false)
    k = length(x)
    n, m, u, w = zeros(Int, k), zeros(k), zeros(k), zeros(k)
    for (i, d) in enumerate(x)
        n[i], m[i], u[i] = length(d), mean(d), var(d)
        w[i] = n[i] / u[i]
    end
    sw = sum(w)
    wm = sum(w .* m) / sum(w)
    a = sum((1 .- w ./ sw).^2 ./ (n .- 1)) / (k^2 - 1)
    F0 = sum(w .* (m .- wm).^2) / (k - 1) / (1 +2a*(k - 2))
    df1 = k - 1
    df2 = 1 / 3a
    p = ccdf(FDist(df1, df2), F0)
    if verbose
        @printf("One-way analysis of variance (ANOVA) test -- Welch's method\n")
        @printf("F statistic: %.5f,  degrees of freedom: (%d, %.5g),  p-value: %.5g\n", F0, df1, df2, p)
    end
    if returns
        Dict(:w=>w, :sw=>sw, :wm=>wm, :a=>a, :F0=>F0, :df1=>df1, :df2=>df2, :p=>p)
    end
end；

分析してみる。

OnewayANOVA(x1, x2, x3; verbose=true, returns=false)

One-way analysis of variance (ANOVA) test -- Welch's method
F statistic: 2.52874,  degrees of freedom: (2, 9.4895),  p-value: 0.13174

R と同じ結果になることを確かめておく。

using RCall
R"""
df = data.frame(x = c($x1, $x2, $x3), g = rep(1:3, c(5, 9, 12)))
result = oneway.test(x ~ g, data=df, var.equal=FALSE)
print(result)
""";

	One-way analysis of means (not assuming equal variances)

data:  x and g
F = 2.5287, num df = 2.0000, denom df = 9.4895, p-value = 0.1317

3. 結論

わずか 1 例ではあるが，等分散を仮定した場合には $p$ 値は 0.05 よりほんの少し小さく，帰無仮説を棄却するという結論を下すことになったが，等分散を仮定しない場合には帰無仮説は棄却できない（$p$ 値は有意水準よりかなり大きい）ということになった。

ちなみに，そのデータで等分散性の検定を行うと，帰無仮説は棄却できないのであるが。

色々考えると，どんな場合にも，即決で等分散性を仮定しない Welch の方法で検定すればよい。

なお，独立 2 標本の場合も，一元配置分散分析を行えばよい。別々のプログラムが用意されているかもしれないが，$F \equiv t^2$, $F$ の第1自由度 $\equiv t$ の自由度で，どちらの $p$ 値も同じになる。

今までは，等分散を仮定する/仮定しない $t$ 検定，等分散を仮定する/仮定しない $F$ 検定の 4 通りのどれを使おうか迷っていたかもしれないが，これからは等分散を仮定しない $F$ 検定だけを使えばよい（どうせ，両側検定でしょ）。

これからのあなたは，今までのあなたとは違う。

3.1. 上で定義した，独立 $k$ 標本の Welch の方法

using Distributions
using Random; Random.seed!(123)
x(μ, σ, n, digits=1) = round.(rand(Normal(μ, σ), n), digits=digits)
x1 = x(50, 10, 15)
x2 = x(52, 9, 19)
a = OnewayANOVA(x1, x2, verbose=true, returns=true)
println("\nF0 = $(a[:F0]),  df2 = $(a[:df2]),  p = $(a[:p])")

One-way analysis of variance (ANOVA) test -- Welch's method
F statistic: 1.13201,  degrees of freedom: (1, 24.97),  p-value: 0.29753

F0 = 1.1320134138353004,  df2 = 24.969789488264624,  p = 0.2975276878435646

3.2. R の独立 2 標本の Welch の方法

using RCall
R"""
result = t.test($x1, $x2)
print(result)
options(digits=17)
cat("t^2 =", result$statistic^2, "  df =", result$parameter, "  p =", result$p.value)
""";

	Welch Two Sample t-test

data:  `#JL`$x1 and `#JL`$x2
t = -1.064, df = 24.97, p-value = 0.2975
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9.059917  2.887988
sample estimates:
mean of x mean of y 
 48.66667  51.75263 

t^2 = 1.1320134138353   df = 24.969789488264624   p = 0.29752768784356465

3.3. R の独立 k 標本の Welch の方法

using RCall
R"""
df = data.frame(x=c($x1, $x2), g=rep(1:2, c(length($x1), length($x2))))
result = oneway.test(x ~ g, data=df)
print(result)
options(digits=17)
cat("F =", result$statistic, "  df =", result$parameter[2], "  p =", result$p.value)
""";

	One-way analysis of means (not assuming equal variances)

data:  x and g
F = 1.1320134138353, num df = 1.0000000000000, denom df =
24.9697894882646, p-value = 0.29752768784356

F = 1.1320134138353002   df = 24.96978948826462   p = 0.29752768784356465

大仰な表であるが，必要なのは $p$ 値だけである。 ↩

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up