
The Mann–Whitney–Wilcoxon test in R and Python

Posted at 2018-02-20

Introduction

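This test is known both as the Mann-Whitney U test and as the Wilcoxon rank-sum test. The Wilcoxon signed-rank test is a different test. That alone is confusing enough, but on top of it the R function is named wilcox.test() while the Python (scipy) function is named mannwhitneyu(), and, at the time of writing, R defaults to a two-sided test while Python defaults to a one-sided one. It is a test that torments anyone who moves back and forth between the two environments.
To avoid getting lost again, I summarize the differences here.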

First, create some arbitrary data.

>>> import numpy as np
>>> import pandas as pd
>>> n = 50
>>> val = np.random.randn(n)
>>> cls = np.random.choice(['A', 'B'], n)
>>> a = pd.DataFrame(dict(cls=cls, val=val))
>>> a.head()
  cls       val
0   A  0.717399
1   B  0.556095
2   B -0.644795
3   A  0.615479
4   B  0.352685
>>> a.to_csv('hoge.tsv', index=False, sep='\t')

R

> a <- read.table('hoge.tsv', header=T)
> head(a)
   cls        val
1   A  0.7173995
2   B  0.5560949
3   B -0.6447947
4   A  0.6154794
5   B  0.3526846
6   B -0.1734241
> wilcox.test(a$val[a$cls == "A"], a$val[a$cls == "B"])         

        Wilcoxon rank sum test

data:  a$val[a$cls == "A"] and a$val[a$cls == "B"]
W = 267, p-value = 0.3908 # Slightly different from the Python result.
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(a$val[a$cls == "A"], a$val[a$cls == "B"], exact=F)

        Wilcoxon rank sum test with continuity correction

data:  a$val[a$cls == "A"] and a$val[a$cls == "B"]
W = 267, p-value = 0.3875
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(a$val[a$cls == "A"], a$val[a$cls == "B"], exact=F, alternative='less')

        Wilcoxon rank sum test with continuity correction

data:  a$val[a$cls == "A"] and a$val[a$cls == "B"]
W = 267, p-value = 0.1938
alternative hypothesis: true location shift is less than 0

Python

>>> import pandas as pd
>>> from scipy import stats
>>> a = pd.read_table('hoge.tsv')
>>> stats.mannwhitneyu(a['val'][a['cls'] == 'A'], a['val'][a['cls'] == 'B'])
MannwhitneyuResult(statistic=267.0, pvalue=0.19376142700269727)
>>> stats.mannwhitneyu(a['val'][a['cls'] == 'A'], a['val'][a['cls'] == 'B'], alternative='two-sided')
MannwhitneyuResult(statistic=267.0, pvalue=0.38752285400539455)

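Even after accounting for the two-sided versus one-sided difference, the results differ slightly out of the box, but adding the exact=F option on the R side gives the same result.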

   exact: a logical indicating whether an exact p-value should be
          computed.

Does that mean the Python result is not "exact"...?
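As a rough way to see what "exact" refers to, the following sketch computes a two-sided p-value twice: once from the exact null distribution of U and once from the normal approximation, with or without the continuity correction. The function names (u_statistic, count_u, exact_two_sided_p, normal_approx_p) and the toy samples are made up for illustration, and the sketch assumes there are no ties. It is not the code that R or SciPy actually run.

```python
import math
from functools import lru_cache


def u_statistic(x, y):
    """Count the (xi, yj) pairs with xi > yj (assumes no ties)."""
    return sum(1 for xi in x for yj in y if xi > yj)


@lru_cache(maxsize=None)
def count_u(u, m, n):
    """Number of rank arrangements that give statistic u for sizes m and n."""
    if u < 0 or u > m * n:
        return 0
    if m == 0 or n == 0:
        return 1  # only u == 0 can reach this point
    # The largest value is either in x (adds n to U) or in y (adds 0).
    return count_u(u - n, m - 1, n) + count_u(u, m, n - 1)


def exact_two_sided_p(u, m, n):
    """Two-sided p-value from the exact null distribution of U."""
    u_low = min(u, m * n - u)  # the null distribution is symmetric
    tail = sum(count_u(k, m, n) for k in range(u_low + 1))
    return min(1.0, 2 * tail / math.comb(m + n, m))


def normal_approx_p(u, m, n, correct=True):
    """Two-sided p-value from the normal approximation (no tie correction)."""
    mu = m * n / 2
    sd = math.sqrt(m * n * (m + n + 1) / 12)
    z = (abs(u - mu) - (0.5 if correct else 0.0)) / sd
    return min(1.0, math.erfc(z / math.sqrt(2)))  # equals 2 * (1 - Phi(z))


# Hypothetical toy samples, not the article's data.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0, 7.0]
u = u_statistic(x, y)
print(u, exact_two_sided_p(u, len(x), len(y)), normal_approx_p(u, len(x), len(y)))
```

For the data in this article, the same functions would be called with W = 267 and the two group sizes, which are not printed in the sessions above.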

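R's correct=T/F and Python's use_continuity=True/False options appear to work in exactly the same way. When working with values that come from a discrete distribution, setting them to true seems to be the better choice.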

# Python
use_continuity : bool, optional
        Whether a continuity correction (1/2.) should be taken into
        account. Default is True.
# R
 correct: a logical indicating whether to apply continuity correction
          in the normal approximation for the p-value.

   exact: a logical indicating whether an exact p-value should be
          computed.
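To check the effect of the continuity correction on the Python side, a small sketch like the following can be used. The sample values are made up for illustration. Both samples are kept larger than 8 observations on the assumption that recent SciPy releases then use the normal approximation, which is the case where use_continuity matters. The alternative is passed explicitly so that the result does not depend on the default.

```python
from scipy import stats

# Hypothetical, tie-free samples (not the article's data).
x = [float(i) for i in range(1, 13)]  # 1.0, 2.0, ..., 12.0
y = [i + 0.5 for i in range(6, 20)]   # 6.5, 7.5, ..., 19.5

# Same test twice, differing only in the continuity correction.
with_correction = stats.mannwhitneyu(
    x, y, alternative='two-sided', use_continuity=True)
without_correction = stats.mannwhitneyu(
    x, y, alternative='two-sided', use_continuity=False)

print(with_correction.pvalue, without_correction.pvalue)
```

The corresponding R calls would be wilcox.test(x, y, exact=F, correct=T) and wilcox.test(x, y, exact=F, correct=F).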
