More than 5 years have passed since last update.

正規分布に基づく乱数100万要素から信頼区間を調べてみる。

Python

Last updated at 2019-05-03Posted at 2018-12-14

プログラミング学習者です。
間違えている箇所もあると思いますので、申し訳ございませんがご指摘していただきますと大変助かります。

母集団から無作為抽出したサンプルの個数によって信頼区間の範囲が決まりますが、それが本当にそうなるのかいくつかのサンプル群をもとに調べてみました。手法は以下のとおりです。

Numpyで100万要素分の正規分布乱数母集団を作成
Numpyで10,120,240,10,000要素と、母集団からサンプル要素群を無作為抽出する
プログラムにて各サンプル群に応じた信頼区間を算出する。（tは分布表を参照しました）
ヒストグラムにてサンプル群の信頼区間の幅を簡単に比べられるようにする

※途中、ばらつきが確認したくなったため、ヒストグラムを各サンプル群ごとに母集団と比べて推測した密度で排出しました。また、アルゴリズム体操の練習がてらなるべく既存のメソッドを使わなかったので冗長になっています。

コード


import numpy as np
import matplotlib.pyplot as plt
import math


def average(array):
    return sum(array) / len(array)


pass


def dispersion(array):
    avers = average(array)
    array_sum = []
    num = 0
    # 空配列に分散値要素を足していきます。
    for i in array:
        array_sum.append((i - avers) ** 2)
        num += 1
    return sum(array_sum) / len(array)


pass


# 不偏分散
def unbiased_variance(array):
    avers = average(array)
    array_sum = []
    num = 0
    # 空配列に分散値要素を足していきます。
    for i in array:
        array_sum.append((i - avers) ** 2)
        num += 1
    #     minus 1
    return sum(array_sum) / (len(array) - 1)


pass


def normalise_dispersion(array):
    # 平方根に直しているだけです。（sqrt使いました）
    return math.sqrt(dispersion(array))


pass


def sample_standard_deviation(array):
    # 平方根に直しているだけです。（sqrt使いました）
    return math.sqrt(unbiased_variance(array))


pass


# 信頼区間を求める式「標本平均 ± t × 標本標準偏差 ÷ √標本の数」
# このt値はt分布表を元に変える
def confidence_interval(array, t):
    aver = average(array)
    k = aver - t * sample_standard_deviation(array) / math.sqrt(len(array))
    k2 = aver + t * sample_standard_deviation(array) / math.sqrt(len(array))

    return k, k2


pass

mother_group = np.random.normal(50, 10, 1_000_000)
Sample_10 = np.random.choice(mother_group, 10, replace=False)
Sample_120 = np.random.choice(mother_group, 120, replace=False)
Sample_240 = np.random.choice(mother_group, 240, replace=False)
Sample_10_000 = np.random.choice(mother_group, 10000, replace=False)

print("mother_groupの平均は　　", average(mother_group))
print("mother_groupの分散は　　", dispersion(mother_group))
print("mother_groupの標準偏差は", normalise_dispersion(mother_group))
print()
print("Sample_10の平均は　　        ", average(Sample_10))
print("Sample_10の不偏分散は        ", unbiased_variance(Sample_10))
print("Sample_10の標本標準偏差は     ", sample_standard_deviation(Sample_10))
print("Sample_10の95%信頼区間は     ", confidence_interval(Sample_10, 2.262), "の間")
print()
print("Sample_120の平均は　　        ", average(Sample_120))
print("Sample_120の不偏分散は        ", unbiased_variance(Sample_120))
print("Sample_120の標本標準偏差は     ", sample_standard_deviation(Sample_120))
print("Sample_120の信頼区間は        ", confidence_interval(Sample_120, 1.9799), "の間")

print()
print("Sample_240の平均は　　        ", average(Sample_240))
print("Sample_240の不偏分散は        ", unbiased_variance(Sample_240))
print("Sample_240の標本標準偏差は     ", sample_standard_deviation(Sample_240))
print("Sample_240の信頼区間は        ", confidence_interval(Sample_240, 1.9600), "の間")

print()
print("Sample_10,000の平均は　　        ", average(Sample_10_000))
print("Sample_10,000の不偏分散は        ", unbiased_variance(Sample_10_000))
print("Sample_10,000の標本標準偏差は     ", sample_standard_deviation(Sample_10_000))
print("Sample_10,000の信頼区間は        ", confidence_interval(Sample_10_000, 1.9600), "の間")

fig, axe = plt.subplots(1, 5, constrained_layout=True, figsize=(15, 10))

axe[0].hist(Sample_10, bins=100, alpha=1, density=True, range=[0, 100])
axe[0].hist(mother_group, histtype='step', density=True, bins=100, alpha=0.9, range=[0, 100])

axe[1].hist(Sample_120, bins=100, alpha=1, density=True, range=[0, 100])
axe[1].hist(mother_group, histtype='step', density=True, bins=100, alpha=0.9, range=[0, 100])

axe[2].hist(Sample_240, bins=100, density=True, alpha=1, range=[0, 100])
axe[2].hist(mother_group, histtype='step', density=True, bins=100, alpha=0.9, range=[0, 100])

axe[3].hist(Sample_10_000, density=True, bins=100, alpha=1, range=[0, 100])
axe[3].hist(mother_group, histtype='step', density=True, bins=100, alpha=0.9, range=[0, 100])

axe[4].hist(mother_group, bins=100, density=True, alpha=1, range=[0, 100])
axe[4].hist(mother_group, histtype='step', bins=100, density=True, alpha=0.9, range=[0, 100])

plt.savefig('figure.png')
axe[0].set_title("sample10")
axe[1].set_title("sample120")
axe[2].set_title("sample240")
axe[3].set_title("sample10,000")
axe[4].set_title("mother_group1_000_000")

"""
実行結果
mother_groupの平均は　　 49.99372952128521
mother_groupの分散は　　 99.94983791809062
mother_groupの標準偏差は 9.99749158129631

Sample_10の平均は　　         47.003904439850054
Sample_10の不偏分散は         29.14083176833298
Sample_10の標本標準偏差は      5.398224871967912
Sample_10の95%信頼区間は      (43.142515285381805, 50.8652935943183) の間

Sample_120の平均は　　         51.17657704087571
Sample_120の不偏分散は         103.87245518969463
Sample_120の標本標準偏差は      10.191783709915288
Sample_120の信頼区間は         (49.33452103179205, 53.01863304995938) の間

Sample_240の平均は　　         50.92947232062534
Sample_240の不偏分散は         92.77697401097272
Sample_240の標本標準偏差は      9.632080461196985
Sample_240の信頼区間は         (49.71084600491751, 52.14809863633317) の間

Sample_10,000の平均は　　         49.94405919674324
Sample_10,000の不偏分散は         98.95574400036168
Sample_10,000の標本標準偏差は      9.947650174808203
Sample_10,000の信頼区間は         (49.749085253317, 50.13903314016948) の間


"""

算出された信頼区間を見てみると、やはりサンプル数が多いほうが狭まっている＝精度が高いことがわかります。これをヒストグラムで描画します。

コード


fig, axe = plt.subplots(figsize=(15, 10))
axe.hist(mother_group, ec='black', bins=100, alpha=0.5, label=["mother_group"], range=[0, 100])
axe.plot(confidence_interval(Sample_10, 2.262), (1000, 1000),  lw=10,label="Sample_10_95%confidence_interval")
axe.plot(confidence_interval(Sample_120, 1.9799), (2000, 2000), lw=10,label="Sample_120_95%confidence_interval")
axe.plot(confidence_interval(Sample_240, 1.9600), (3000, 3000), lw=10,label="Sample_240_95%confidence_interval")
axe.plot(confidence_interval(Sample_10_000, 1.9600), (4000, 4000), lw=10,label="Sample_10,000_95%confidence_interval")
plt.legend()
plt.savefig('figure2.png')
plt.show()

あまりスマートな方法ではありませんが、苦肉の策で範囲を示すため信頼区間を太い折れ線グラフで描画して重ねました。
ヒストグラム下部の横に伸びているバーが各サンプル群の信頼区間となります。
きちんとサンプル数の多いほうが範囲が狭まっているのがわかります。
しかし、こうしてみると10から120にかけては劇的に狭まっているのに、それ以上はサンプル数が増えてもそれほどの恩威は受けられていないようです。t分布表もだいたい200程度までしか無いのもこのためですね。
統計って面白いですね。(^^)

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up