More than 5 years have passed since last update.

RとPythonでWilcoxonの符号順位検定

Posted at 2018-02-22

始めに

Wilcoxonの符号順位検定(Wilcoxon signed-rank test)と順位和検定(Wilcoxon rank-sum test)とは別物ですが、Rでは同一の関数でパラメータで切り分けしています。Scipyでは別関数です。
やはりデフォルトの挙動が微妙に違います。

とりあえず適当なデータを作る。

>>> import numpy as np
>>> import pandas as pd
>>> n = 50
>>> a0 = np.random.randint(10, 100, n)
>>> a1 = a0 * np.random.normal(1, 0.1, n)
>>> a2 = a0 * np.random.normal(1, 0.1, n)
>>> a = pd.DataFrame(dict(a1=a1, a2=a2))
>>> a.head()
          a1         a2
0  20.320022  21.060679
1  17.190202  17.669290
2  37.014636  25.695973
3  86.272272  94.483048
4  40.095449  45.042668
>>> a.to_csv('hoge.tsv', index=False, sep='\t')

R

> a <- read.table('hoge.tsv', header=T)
> head(a)
        a1       a2
1 20.32002 21.06068
2 17.19020 17.66929
3 37.01464 25.69597
4 86.27227 94.48305
5 40.09545 45.04267
6 18.10515 18.57657
> wilcox.test(a$a1, a$a2, paired=T, exact=F)

        Wilcoxon signed rank test with continuity correction

data:  a$a1 and a$a2
V = 768, p-value = 0.2095
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(a$a1, a$a2, paired=T, exact=F, correct=F)

        Wilcoxon signed rank test

data:  a$a1 and a$a2
V = 768, p-value = 0.2078
alternative hypothesis: true location shift is not equal to 0

Python

>>> import pandas as pd
>>> from scipy import stats
>>> a = pd.read_table('hoge.tsv')
>>> stats.wilcoxon(a['a1'], a['a2'])
WilcoxonResult(statistic=507.0, pvalue=0.20775852744976409)
>>> stats.wilcoxon(a['a1'], a['a2'], correction=True)
WilcoxonResult(statistic=507.0, pvalue=0.20950554185822134)

Rはオプションとして、paired=Tを付ける必要があるほか、Scipyと同じ結果にするには、順位和検定と同じくexact=Fを付ける必要があります。
さらに、デフォルトでContinuity CorrectionがScipyでは無効、Rでは有効です。なので、同じ結果にするには、Pythonのオプションcorrection=Trueをつけるか、Rのオプションcorrect=Fを付けなくてはいけません。

# Python
    correction : bool, optional
        If True, apply continuity correction by adjusting the Wilcoxon rank
        statistic by 0.5 towards the mean value when computing the
        z-statistic.  Default is False.
# R
 correct: a logical indicating whether to apply continuity correction
          in the normal approximation for the p-value.

これでp valueは同じになりますが、統計量は異なることがあります。
ただこれも、この統計量の定義から、データサイズ$n$に対して、

V_1+V_2=\sum_{i=1}^ni

という関係にある統計量$V_1, V_2$のp valueは同じになるので、どちらを使うか、の違いしかありません。上記のケースで言えば、下記の等式が成り立つので問題なしです。

768+507=\frac{50 \times \left(50+1\right)}{2}

Scipyは常に小さいほうの統計量、Rは引数の順序に依存するようです。そういう意味では、Rの方が少しinformativeです。

なお、R、Python共に、2個のベクトルの差のみの単一引数の形式にも対応しています。単一引数の場合は、Rのpaired=Tオプションは不要。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up