Getting started with classical hypothesis testing in Python

Posted at 2018-05-05

Introduction

This post excerpts material from the following link.
Chapter 9 Hypothesis testing

Purpose

The goal of classical hypothesis testing is to answer the question, “Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” Here’s how we answer that question:

Here we toss a coin and, from the observed result (140 heads and 110 tails), test the hypothesis that the coin is biased toward heads.

Steps of the test

The first step is to quantify the size of the apparent effect by choosing a test statistic. In the NSFG example, the apparent effect is a difference in pregnancy length between first babies and others, so a natural choice for the test statistic is the difference in means between the two groups.

First, choose a test statistic. In the NSFG example it is the difference in means between two groups: the pregnancy lengths of first babies and those of all other babies.
※ NSFG: National Survey of Family Growth, a survey conducted by the National Center for Health Statistics
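
As a minimal sketch of this first step, the snippet below computes a difference in means for two groups. The pregnancy lengths are made-up illustrative values, not NSFG data, and `diff_in_means` is a hypothetical helper name.

```python
from statistics import mean


def diff_in_means(group1, group2):
    """Test statistic: absolute difference between the two group means."""
    return abs(mean(group1) - mean(group2))


# Made-up pregnancy lengths in weeks (illustrative only, not NSFG data).
first_babies = [39, 40, 41, 38, 40]
other_babies = [39, 38, 40, 39, 39]

actual = diff_in_means(first_babies, other_babies)
print(actual)
```

Taking the absolute value makes the statistic two-sided: it treats "first babies arrive later" and "first babies arrive earlier" as equally interesting effects.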

The second step is to define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is not real. In the NSFG example the null hypothesis is that there is no difference between first babies and others; that is, that pregnancy lengths for both groups have the same distribution.

In other words, the hypothesis is not tested directly. We assume that its opposite (the null hypothesis) is true and check whether the observed data are inconsistent with that assumption; such an inconsistency is what counts as evidence for the original hypothesis.
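
One way to model a null hypothesis of "no difference between the groups" is to pool both groups and reshuffle them, so that any difference in the simulated data arises by chance alone. The sketch below assumes this permutation approach; the data and the name `run_null_model` are hypothetical.

```python
import random


def run_null_model(group1, group2, rng):
    """Simulates the null hypothesis: pool the data, shuffle, and re-split."""
    pooled = list(group1) + list(group2)
    rng.shuffle(pooled)
    n = len(group1)
    return pooled[:n], pooled[n:]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
first_babies = [39, 40, 41, 38, 40]  # made-up values, not NSFG data
other_babies = [39, 38, 40, 39, 39]

sim_first, sim_others = run_null_model(first_babies, other_babies, rng)
print(sim_first, sim_others)
```

Each call yields one simulated dataset with the same group sizes as the original, in which group membership carries no information.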

The third step is to compute a p-value, which is the probability of seeing the apparent effect if the null hypothesis is true. In the NSFG example, we would compute the actual difference in means, then compute the probability of seeing a difference as big, or bigger, under the null hypothesis.

Concretely, in the NSFG example we compute the probability (the p-value) that, under the null hypothesis, the difference in means between the two groups is at least as large as the difference actually observed.
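
Given test statistics simulated under the null hypothesis, the p-value can be estimated as the fraction of them that are at least as large as the observed statistic. A minimal sketch, using made-up numbers and a hypothetical helper named `p_value`:

```python
def p_value(simulated_stats, actual):
    """Fraction of simulated statistics at least as large as the observed one."""
    count = sum(1 for x in simulated_stats if x >= actual)
    return count / len(simulated_stats)


# Made-up statistics from ten simulated runs (illustrative only).
simulated = [0.1, 0.4, 0.2, 0.7, 0.3, 0.0, 0.5, 0.2, 0.6, 0.1]
actual = 0.6

print(p_value(simulated, actual))  # 2 of the 10 values are >= 0.6, so 0.2
```

With only ten runs the estimate is very coarse; a real test would use many more simulated runs.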

The last step is to interpret the result. If the p-value is low, the effect is said to be statistically significant, which means that it is unlikely to have occurred by chance. In that case we infer that the effect is more likely to appear in the larger population.

If the p-value is small (for example, 0.05 or less), the observed effect would rarely arise by chance under the null hypothesis, and yet it was observed. We therefore reject the null hypothesis.

Example

As a simple example, suppose we toss a coin 250 times and see 140 heads and 110 tails. Based on this result, we might suspect that the coin is biased; that is, more likely to land heads. To test this hypothesis, we compute the probability of seeing such a difference if the coin is actually fair:
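
For a fair coin this probability can also be computed exactly from the binomial distribution, which gives a useful cross-check on a simulated estimate. The sketch below is an addition, not part of the quoted material; `exact_p_value` is a hypothetical helper, and it measures the effect as the absolute difference between heads and tails.

```python
from math import comb


def exact_p_value(heads, tails):
    """P(|H - T| >= observed |heads - tails|) for a fair coin, computed exactly."""
    n = heads + tails
    observed = abs(heads - tails)
    # k heads out of n gives a difference of |k - (n - k)| = |2k - n|.
    favorable = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= observed)
    return favorable / 2 ** n


p = exact_p_value(140, 110)
print(round(p, 3))
```

A simulation with a finite number of iterations should scatter around this exact value.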

Sample code

Abstract parent class and test class

The test class is designed for the specific test being run.

import random

import pandas as pd


class HypothesisTest:
    """Represents a hypothesis test."""

    def __init__(self, data):
        """Initializes.

        data: data in whatever form is relevant
        """
        self.data = data
        self.actual = self.TestStatistic(data)
        self.test_stats = None

    def PValue(self, iters=1000):
        """Computes the distribution of the test statistic and p-value.

        iters: number of iterations

        returns: float p-value
        """
        # Simulate the null hypothesis and record the statistic of each run.
        self.test_stats = [self.TestStatistic(self.RunModel()) for _ in range(iters)]
        # Fraction of simulated statistics at least as large as the observed one.
        count = sum(1 for x in self.test_stats if x >= self.actual)
        return count / iters

    def TestStatistic(self, data):
        """Computes the test statistic.

        data: data in whatever form is relevant
        """
        raise NotImplementedError

    def RunModel(self):
        """Run the model of the null hypothesis.

        returns: simulated data
        """
        raise NotImplementedError


class CoinTest(HypothesisTest):
    """Tests whether a coin is fair, given observed (heads, tails) counts."""

    def TestStatistic(self, data):
        heads, tails = data
        test_stat = abs(heads - tails)
        return test_stat

    def RunModel(self):
        """Simulates n tosses of a fair coin and returns (heads, tails)."""
        heads, tails = self.data
        n = heads + tails
        sample = [random.choice('HT') for _ in range(n)]
        hist = PMF(pd.Series(sample), model='hist')
        # hist.y holds counts indexed by label; a face that never appears counts as 0.
        data = hist.y.get('H', 0), hist.y.get('T', 0)
        return data

PMF (probability mass function) class

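class Base:
    def __init__(self, ser):
        # Keep the observations sorted.
        self.x = ser.sort_values()


class PMF(Base):
    def __init__(self, ser, model='pmf'):
        """model='hist' keeps raw counts; any other value normalizes to probabilities.

        How `model` behaved is not shown in the original, so this is an assumption.
        """
        Base.__init__(self, ser)
        self.model = model
        self.y = self.get_prob_mass(self.x)

    def get_prob_mass(self, ser):
        # Count each distinct value, ordered by value ('H' before 'T').
        ys = ser.value_counts().sort_index()
        if self.model != 'hist':
            ys = ys / ys.sum()
        return ys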

Running the code

ct = CoinTest((140, 110))
pvalue = ct.PValue()
pvalue

# Output (varies slightly from run to run because the simulation is random)
0.073

Interpreting the result

The result is about 0.07, which means that if the coin is fair, we expect to see a difference as big as 30 about 7% of the time.

How should we interpret this result? By convention, 5% is the threshold of statistical significance. If the p-value is less than 5%, the effect is considered significant; otherwise it is not.

But the choice of 5% is arbitrary, and (as we will see later) the p-value depends on the choice of the test statistics and the model of the null hypothesis. So p-values should not be considered precise measurements.
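
To illustrate how the p-value depends on the test statistic, the sketch below simulates the same coin experiment with two statistics: the absolute difference |heads - tails| (two-sided) and the signed difference heads - tails (one-sided, which only counts an excess of heads). The function name, seed, and iteration count are assumptions made for this example.

```python
import random


def simulate_p_values(heads, tails, iters=1000, seed=1):
    """Estimates two-sided and one-sided p-values for a fair-coin null model."""
    rng = random.Random(seed)
    n = heads + tails
    observed = heads - tails
    two_sided = one_sided = 0
    for _ in range(iters):
        # Simulate n fair tosses and count the heads.
        sim_heads = sum(rng.random() < 0.5 for _ in range(n))
        diff = sim_heads - (n - sim_heads)
        if abs(diff) >= abs(observed):
            two_sided += 1
        if diff >= observed:
            one_sided += 1
    return two_sided / iters, one_sided / iters


p_two, p_one = simulate_p_values(140, 110)
print(p_two, p_one)
```

Because the one-sided statistic ignores runs with an excess of tails, its p-value comes out at roughly half the two-sided one, so the same data can fall on either side of a 5% threshold depending on the statistic chosen.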

I recommend interpreting p-values according to their order of magnitude: if the p-value is less than 1%, the effect is unlikely to be due to chance; if it is greater than 10%, the effect can plausibly be explained by chance. P-values between 1% and 10% should be considered borderline. So in this example I conclude that the data do not provide strong evidence that the coin is biased or not.
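
The rule of thumb in this passage can be written down as a small helper. This is only a sketch of the quoted guideline; the function name and the returned strings are made up for illustration.

```python
def interpret_p_value(p):
    """Maps a p-value to the rough reading suggested in the quoted passage."""
    if p < 0.01:
        return "unlikely to be due to chance"
    if p > 0.10:
        return "plausibly explained by chance"
    return "borderline"


print(interpret_p_value(0.073))  # borderline
```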

I think the key points are as follows.

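p = 0.073 is above the conventional 5% threshold, so the effect cannot be called statistically significant
Even when the p-value is 5% or less, the effect is not necessarily real, because a fair coin can still produce such a result by chance (a false positive)
- See [Type I and type II errors](https://ja.wikipedia.org/wiki/%E7%AC%AC%E4%B8%80%E7%A8%AE%E9%81%8E%E8%AA%A4%E3%81%A8%E7%AC%AC%E4%BA%8C%E7%A8%AE%E9%81%8E%E8%AA%A4)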
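
The point about false positives can be made concrete. Assuming a fair coin, 250 tosses, and a rule that rejects the null hypothesis whenever the exact two-sided p-value is 5% or less, the sketch below adds up the probability of every outcome that would be declared significant. It is an added illustration, not part of the original material, and the helper names are hypothetical.

```python
from math import comb


def pmf_fair(n, k):
    """Probability of exactly k heads in n tosses of a fair coin."""
    return comb(n, k) / 2 ** n


def two_sided_p(n, k):
    """Exact probability of a |heads - tails| at least as large as for k heads."""
    observed = abs(2 * k - n)
    return sum(pmf_fair(n, j) for j in range(n + 1) if abs(2 * j - n) >= observed)


n = 250
alpha = 0.05
# Type I error rate: chance that a fair coin still yields p <= alpha.
false_positive_rate = sum(
    pmf_fair(n, k) for k in range(n + 1) if two_sided_p(n, k) <= alpha
)
print(round(false_positive_rate, 3))
```

The rate is at most 5% by construction, but it is not zero: even with a perfectly fair coin, a few percent of experiments will look significant.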

Summary

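Implemented a hypothesis test in Python, albeit for a single example
Whether an effect is statistically significant depends on the test method and the model, which makes test results hard to interpret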
