Python Code for a Deeper Understanding of the Central Limit Theorem in Statistics

Posted at 2024-08-31

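Overview

The central limit theorem (CLT) is a fundamental result in statistics and probability theory. It states that, for independent draws from almost any population distribution with finite variance, the distribution of the sample mean approaches a normal distribution as the sample size grows. This article presents Python code for understanding the central limit theorem visually. It combines several different distributions into one source distribution and compares the distribution of sample means drawn from it with a normal curve. The text and code were written with ChatGPT4o.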
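For reference, the classical (Lindeberg-Lévy) form of the theorem can be written as below. The symbols are introduced here for illustration and do not appear in the article's code: X_1, ..., X_n are assumed to be independent, identically distributed draws with finite mean \mu and finite variance \sigma^2.

```latex
\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr) \xrightarrow{\;d\;} \mathcal{N}(0, \sigma^2),
\qquad
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i
```

Informally, for large n the sample mean is approximately normal with mean \mu and standard deviation \sigma / \sqrt{n}.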

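Full code

Below is Python code for visualizing the central limit theorem. It combines several random distributions into one source distribution and checks how closely the distribution of sample means follows a normal distribution.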

import numpy as np
import matplotlib.pyplot as plt

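# Build the source distribution by combining several distributions
np.random.seed(0)  # seed for reproducibility
n_samples = 10000  # number of samples
sample_size = 50   # size of each sample

# Draw from five different distributions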
distribution1 = np.random.exponential(scale=2.0, size=(n_samples, sample_size))
distribution2 = np.random.gamma(shape=2.0, scale=2.0, size=(n_samples, sample_size))
distribution3 = np.random.triangular(left=0.0, mode=2.0, right=4.0, size=(n_samples, sample_size))
distribution4 = np.random.normal(loc=3.0, scale=1.5, size=(n_samples, sample_size))
distribution5 = np.random.beta(a=2.0, b=5.0, size=(n_samples, sample_size)) * 6  # rescale from [0, 1] to [0, 6]

# Combine the five arrays element-wise with equal weights (a weighted average of the draws)
distribution = 0.2 * distribution1 + 0.2 * distribution2 + 0.2 * distribution3 + 0.2 * distribution4 + 0.2 * distribution5

# Numbers of sample means to draw
mean_iterations = [100, 500, 1000, 5000]

# Figure setup
plt.figure(figsize=(15, 12))

# Histogram of the source distribution
plt.subplot(3, 2, 1)
plt.hist(distribution.flatten(), bins=50, density=True, alpha=0.6, color='blue')
plt.title('Original Complex Mixed Distribution')
plt.xlabel('Value')
plt.ylabel('Density')

# Draw one histogram of sample means for each count in mean_iterations
for i, n_means in enumerate(mean_iterations):
    sample_means = np.mean(np.random.choice(distribution.flatten(), size=(n_means, sample_size)), axis=1)
    
    plt.subplot(3, 2, i+2)
    plt.hist(sample_means, bins=50, density=True, alpha=0.6, color='g')
    
    # Overlay a normal curve fitted to the sample means (mu and sigma are estimated from them)
    mu = np.mean(sample_means)
    sigma = np.std(sample_means)
    x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
    plt.plot(x, (1/(sigma * np.sqrt(2 * np.pi))) * np.exp( - (x - mu)**2 / (2 * sigma**2) ), color='red')
    
    plt.title(f'Number of Means = {n_means}')
    plt.xlabel('Sample Mean')
    plt.ylabel('Density')

plt.tight_layout()
plt.show()

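Output

[Figure: histogram of the source distribution, followed by histograms of the sample means for 100, 500, 1000, and 5000 means, each with a normal curve overlaid in red]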

Step-by-step explanation of the code

This section explains each step of the code above in detail.

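1. Importing libraries and basic settings:

numpy generates the random draws, and matplotlib visualizes the results.
np.random.seed(0) fixes the seed of the random number generator so that the results are reproducible.
n_samples is the number of samples, and sample_size is the number of values in each sample.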
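The role of the two sizes and of the seed can be checked on a tiny array. The following is a minimal sketch, not the article's code: the small sizes are hypothetical, and it uses NumPy's newer `np.random.default_rng` generator in place of the `np.random.seed` call used in the article.

```python
import numpy as np

# Hypothetical small sizes for illustration (the article uses 10000 and 50)
n_samples = 4
sample_size = 3

rng = np.random.default_rng(0)  # seeded generator for reproducibility
draws = rng.exponential(scale=2.0, size=(n_samples, sample_size))

# Each row is one sample; averaging along axis=1 gives one mean per sample
row_means = draws.mean(axis=1)

# Re-creating the generator with the same seed reproduces the same draws
draws_again = np.random.default_rng(0).exponential(
    scale=2.0, size=(n_samples, sample_size)
)

print(draws.shape)                         # (4, 3)
print(row_means.shape)                     # (4,)
print(np.array_equal(draws, draws_again))  # True
```

Each row of the array is one sample, so averaging across columns yields n_samples means.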

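2. Combining several distributions into one source distribution:

distribution1: exponential distribution.
distribution2: gamma distribution.
distribution3: triangular distribution.
distribution4: normal distribution.
distribution5: beta distribution, rescaled by a factor of 6.
These five arrays are combined element-wise with equal weights of 0.2. Strictly speaking, this produces a weighted average of independent draws rather than a mixture distribution in the probabilistic sense.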
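The article's code averages the five arrays element-wise. A mixture distribution in the strict sense would instead take each value from one randomly chosen component. The sketch below contrasts the two under stated assumptions: the number of draws and all variable names are hypothetical, and `np.random.default_rng` is used instead of the article's `np.random.seed`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_values = 100_000  # hypothetical number of draws

# One column per component, using the same parameters as the article
components = np.column_stack([
    rng.exponential(scale=2.0, size=n_values),
    rng.gamma(shape=2.0, scale=2.0, size=n_values),
    rng.triangular(left=0.0, mode=2.0, right=4.0, size=n_values),
    rng.normal(loc=3.0, scale=1.5, size=n_values),
    rng.beta(a=2.0, b=5.0, size=n_values) * 6,
])

# True mixture: pick one component per draw with equal probability (0.2 each)
labels = rng.integers(0, components.shape[1], size=n_values)
mixture = components[np.arange(n_values), labels]

# Element-wise weighted average with equal weights, as in the article's code
weighted_average = components.mean(axis=1)

print("means:", mixture.mean(), weighted_average.mean())
print("stds: ", mixture.std(), weighted_average.std())
```

Both versions share the same mean, but the true mixture is considerably more spread out, because averaging five independent draws already cancels part of their variation.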

3. Histogram of the source distribution:

plt.subplot(3, 2, 1) selects the first panel, which shows the source distribution before any sample means are taken.

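4. Distribution of the sample means:

For each count in mean_iterations (100, 500, 1000, and 5000), the code draws that many samples of size sample_size (50) from the source distribution.
np.mean() computes one mean per sample, and the distribution of those means is shown as a histogram.
Each histogram is overlaid with a normal curve whose mean and standard deviation are estimated from the sample means themselves. Because the sample size is fixed at 50, increasing the number of means does not make the distribution more normal; it makes the histogram a smoother estimate of a distribution that is already close to normal.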
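The article's code keeps sample_size fixed at 50, so the convergence described by the theorem is easier to see in a complementary experiment that varies the sample size instead. The following is a minimal sketch under assumptions: it uses a single exponential distribution as the source, the helper `skewness` and the chosen sizes are hypothetical, and `np.random.default_rng` replaces the article's `np.random.seed`.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric distribution."""
    centered = values - values.mean()
    return np.mean(centered**3) / values.std()**3

n_means = 10_000  # hypothetical number of sample means per setting
scale = 2.0       # exponential scale: population mean 2.0, population std 2.0

results = {}
for sample_size in [1, 5, 50, 500]:
    draws = rng.exponential(scale=scale, size=(n_means, sample_size))
    sample_means = draws.mean(axis=1)
    observed_std = sample_means.std()
    predicted_std = scale / np.sqrt(sample_size)  # sigma / sqrt(n)
    results[sample_size] = (observed_std, predicted_std, skewness(sample_means))
    print(sample_size, results[sample_size])
```

As the sample size grows, the observed standard deviation of the means should track sigma / sqrt(n), and the skewness should shrink toward 0, which is what approaching a normal distribution means in practice.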

5. Layout and display:

plt.tight_layout() adjusts the layout so that the subplots do not overlap.
Finally, plt.show() displays all of the plots.

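Conclusion

This code should make the central limit theorem more intuitive. Drawing samples from a non-normal distribution and seeing that their means follow an approximately normal distribution gives an understanding that is hard to get from theory alone.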
