
More than 5 years have passed since last update.

Mann-Whitney-Wilcoxon test in R and Python

Introduction

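This test goes by two names: the Mann-Whitney U test and the Wilcoxon rank-sum test. The Wilcoxon signed-rank test is a different test. That alone is confusing enough, but on top of it the R function is wilcox.test() while the Python (scipy) function is mannwhitneyu(), and R defaults to a two-sided test whereas Python defaulted to a one-sided test when this was written (since SciPy 1.7 the default is alternative='two-sided'). It is a test that torments anyone who moves back and forth between the two environments.
To avoid getting confused again, I summarize the differences here.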
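The two names describe the same quantity seen from two angles, which a small standard-library sketch can make concrete. The helper name `rank_sum_statistics` and the toy samples are made up for illustration, and the sketch assumes there are no tied values. R's `W` corresponds to `u1` here, because R subtracts n1(n1+1)/2 from the rank sum of the first sample.

```python
def rank_sum_statistics(x, y):
    """Return (R1, U1, U2) for samples x and y, assuming no tied values."""
    pooled = sorted(list(x) + list(y))
    # Rank 1 is the smallest pooled value; ties are not handled in this sketch.
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n1, n2 = len(x), len(y)
    r1 = sum(rank[v] for v in x)      # Wilcoxon rank sum of the first sample
    u1 = r1 - n1 * (n1 + 1) // 2      # Mann-Whitney U of the first sample
    u2 = n1 * n2 - u1                 # Mann-Whitney U of the second sample
    return r1, u1, u2


# Hypothetical toy data, not taken from the article.
x = [1.1, 2.4, 0.3]
y = [3.0, 0.9, 4.2, 5.5]
r1, u1, u2 = rank_sum_statistics(x, y)
print(r1, u1, u2)
```

`u1` also equals the number of pairs (xi, yj) with xi > yj, and `u1 + u2` is always n1 * n2, so either U can be reported without losing information.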

First, create some arbitrary data.

>>> import numpy as np
>>> import pandas as pd
>>> n = 50
>>> val = np.random.randn(n)
>>> cls = np.random.choice(['A', 'B'], n)
>>> a = pd.DataFrame(dict(cls=cls, val=val))
>>> a.head()
  cls       val
0   A  0.717399
1   B  0.556095
2   B -0.644795
3   A  0.615479
4   B  0.352685
>>> a.to_csv('hoge.tsv', index=False, sep='\t')

R

> a <- read.table('hoge.tsv', header=T)
> head(a)
   cls        val
1   A  0.7173995
2   B  0.5560949
3   B -0.6447947
4   A  0.6154794
5   B  0.3526846
6   B -0.1734241
> wilcox.test(a$val[a$cls == "A"], a$val[a$cls == "B"])         

        Wilcoxon rank sum test

data:  a$val[a$cls == "A"] and a$val[a$cls == "B"]
W = 267, p-value = 0.3908 # Slightly different from the Python result.
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(a$val[a$cls == "A"], a$val[a$cls == "B"], exact=F)

        Wilcoxon rank sum test with continuity correction

data:  a$val[a$cls == "A"] and a$val[a$cls == "B"]
W = 267, p-value = 0.3875
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(a$val[a$cls == "A"], a$val[a$cls == "B"], exact=F, alternative='less')

        Wilcoxon rank sum test with continuity correction

data:  a$val[a$cls == "A"] and a$val[a$cls == "B"]
W = 267, p-value = 0.1938
alternative hypothesis: true location shift is less than 0

Python

>>> import pandas as pd
>>> from scipy import stats
>>> a = pd.read_table('hoge.tsv')
>>> stats.mannwhitneyu(a['val'][a['cls'] == 'A'], a['val'][a['cls'] == 'B'])
MannwhitneyuResult(statistic=267.0, pvalue=0.19376142700269727)
>>> stats.mannwhitneyu(a['val'][a['cls'] == 'A'], a['val'][a['cls'] == 'B'], alternative='two-sided')
MannwhitneyuResult(statistic=267.0, pvalue=0.38752285400539455)

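Even after accounting for the two-sided versus one-sided difference, the results still differ slightly out of the box, but adding the exact=F option on the R side gives the same result.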

   exact: a logical indicating whether an exact p-value should be
          computed.

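Does this mean the Python result is not "exact"? It appears so: the mannwhitneyu() used here always computes the p-value from a normal approximation. SciPy 1.7 and later add a method argument, and method='exact' requests an exact p-value.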
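To illustrate what an "exact" p-value means, here is a minimal sketch that assumes no ties. Under the null hypothesis every choice of n1 ranks out of n1 + n2 is equally likely, so the p-value can be obtained by enumerating all of them. The function name `exact_two_sided_p` and the small sample sizes are hypothetical. R and SciPy use more efficient algorithms; this brute-force version is only practical for tiny samples.

```python
from itertools import combinations


def exact_two_sided_p(u_observed, n1, n2):
    """Exact two-sided p-value for U with no ties, by enumerating rank subsets."""
    ranks = range(1, n1 + n2 + 1)
    center = n1 * n2 / 2              # mean of U under the null hypothesis
    total = 0
    extreme = 0
    for subset in combinations(ranks, n1):
        u = sum(subset) - n1 * (n1 + 1) // 2
        total += 1
        # Count arrangements at least as far from the centre as the observed U.
        if abs(u - center) >= abs(u_observed - center):
            extreme += 1
    return extreme / total


# Hypothetical small case: U = 2 with group sizes 3 and 4.
p = exact_two_sided_p(u_observed=2, n1=3, n2=4)
print(p)
```

According to its documentation, R computes the exact p-value by default when both samples have fewer than 50 finite values and there are no ties. That is why the first wilcox.test() call above differs from the normal-approximation result.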

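R's correct=T/F and Python's use_continuity=True/False options appear to behave in exactly the same way. When handling values that come from a discrete distribution, setting it to true seems to be the better choice.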

# Python
use_continuity : bool, optional
        Whether a continuity correction (1/2.) should be taken into
        account. Default is True.
# R
 correct: a logical indicating whether to apply continuity correction
          in the normal approximation for the p-value.

   exact: a logical indicating whether an exact p-value should be
          computed.
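The following is a rough standard-library sketch of the normal approximation with continuity correction that both options refer to, assuming no ties. The function name `normal_approx_p` is made up for illustration. U = 267 comes from the outputs above, but the group sizes 24 and 26 are my assumption: the article does not state them, and they were chosen because they are consistent with the reported p-values.

```python
import math


def normal_approx_p(u, n1, n2, alternative="two-sided", use_continuity=True):
    """Normal-approximation p-value for a Mann-Whitney U statistic (no ties)."""
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    diff = u - mean
    # The correction moves U half a step toward the mean before standardizing.
    if alternative == "less":
        correction = -0.5
    elif alternative == "greater":
        correction = 0.5
    else:
        correction = math.copysign(0.5, diff) if diff != 0 else 0.0
    if not use_continuity:
        correction = 0.0
    z = (diff - correction) / sd
    lower_tail = 0.5 * math.erfc(-z / math.sqrt(2))   # P(Z <= z)
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))    # P(Z >= z)
    if alternative == "less":
        return lower_tail
    if alternative == "greater":
        return upper_tail
    return min(1.0, 2 * min(lower_tail, upper_tail))


# U = 267 is from the article; n1 = 24 and n2 = 26 are assumed group sizes.
p_less = normal_approx_p(267, 24, 26, alternative="less")
p_two = normal_approx_p(267, 24, 26, alternative="two-sided")
p_two_raw = normal_approx_p(267, 24, 26, use_continuity=False)
print(round(p_less, 4), round(p_two, 4), round(p_two_raw, 4))
```

With those assumed group sizes the corrected values come out at about 0.1938 (one-sided) and 0.3875 (two-sided), in line with the outputs shown earlier. Switching the correction off gives a slightly smaller p-value.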