
Calculating Histogram Bin Counts with Sturges' Formula and Adjusting Layout in Python

Last updated at 2023-05-16. Posted at 2023-05-14.

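This article explains how to calculate the number of bins for a graph created in Python using Sturges' formula, and how to adjust the layout of multiple graphs.

1. Sturges' formula and the number of histogram bins

Sturges' formula estimates the number of bins for a histogram, that is, how many intervals the data should be divided into.

Specifically, for n data points, the number of bins k is estimated by the following formula.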

k = 1 + log2(n)

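Here, log2 denotes the base-2 logarithm. Under this formula the number of bins grows with the number of data points, but only logarithmically: doubling n adds one bin. Because k must be an integer, the result is rounded in practice (the code in section 2 truncates it with int()).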
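As a quick worked example of the formula, the sketch below evaluates it for a few sample sizes. The helper name `sturges_bins` is hypothetical, and rounding up with `ceil` is an assumption; truncating with `int()` instead gives one bin fewer whenever n is not a power of 2.

```python
import math

def sturges_bins(n: int) -> int:
    """Bin count from Sturges' formula, with log2(n) rounded up (an assumption)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 + math.ceil(math.log2(n))

# The bin count grows slowly: 1000x more data adds only about 10 bins
for n in (10, 100, 1000, 10000):
    print(n, sturges_bins(n))
```

For n = 100, log2(100) is about 6.64, which rounds up to 7 and gives 8 bins.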

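Why Sturges' formula is needed

A histogram is a very convenient tool for visually understanding how data is distributed. However, its shape changes considerably depending on the number of bins. With too few bins, fine features of the data are lost. With too many bins, noise dominates and the essential features of the data can be misread.

Sturges' formula provides one criterion for choosing the number of bins. Because it derives the bin count automatically from the number of data points, it gives a reasonable starting point for capturing the features of the data.

However, Sturges' formula does not always give the best number of bins. Depending on the nature of the data, other methods (for example, the square-root rule or the Rice rule) may work better. The bin count should therefore be chosen with a method suited to the data at hand.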
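To make the difference between these rules concrete, the sketch below compares the bin counts they suggest for the same sample size. It assumes the commonly cited forms of the rules (k = sqrt(n) for the square-root rule and k = 2 * n^(1/3) for the Rice rule) with rounding up; the function name `bin_counts` is hypothetical.

```python
import math

def bin_counts(n: int) -> dict:
    """Suggested bin counts for n data points under three common rules."""
    return {
        "sturges": 1 + math.ceil(math.log2(n)),  # 1 + log2(n)
        "sqrt": math.ceil(math.sqrt(n)),         # sqrt(n)
        "rice": math.ceil(2 * n ** (1 / 3)),     # 2 * cube root of n
    }

for n in (100, 10000):
    print(n, bin_counts(n))
```

For large n, the square-root and Rice rules suggest noticeably more bins than Sturges' formula, which is one reason the choice depends on the data.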

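2. Calculating the number of bins with Sturges' formula

The following code uses Plotly to draw a histogram whose bin count comes from Sturges' formula, overlays a KDE curve, and sets a white background with grid lines.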

import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from scipy.stats import gaussian_kde

def display_value_counts_bar(df, column, xaxis_title, yaxis_title, color='goldenrod'):
    value_counts = df[column].value_counts()

    fig = px.bar(x=value_counts.index, y=value_counts.values, color_discrete_sequence=[color])
    fig.update_layout(xaxis_title=xaxis_title, yaxis_title=yaxis_title)
    fig.show()

def display_histogram(df, column, color='goldenrod'):
    # Drop missing values: gaussian_kde cannot handle NaN
    values = df[column].dropna()

    # Calculate the bin number using Sturges' formula
    bin_number = int(1 + np.log2(len(values)))

    # Plot densities instead of counts so the histogram and the KDE share one y-axis scale
    fig = px.histogram(df, x=column, nbins=bin_number, histnorm='probability density',
                       color_discrete_sequence=[color])

    # Add KDE plot using scipy
    kde = gaussian_kde(values)
    x_range = np.linspace(values.min(), values.max(), 1000)
    kde_values = kde(x_range)

    fig.add_trace(go.Scatter(x=x_range, y=kde_values, mode='lines', name='KDE', line=dict(color='blue')))

    # Set background to white grid
    fig.update_layout(
        plot_bgcolor="rgba(255, 255, 255, 1)",
        xaxis_title=column,
        yaxis_title="Density",
        xaxis=dict(showgrid=True, gridcolor="rgba(200, 200, 200, 0.5)"),
        yaxis=dict(showgrid=True, gridcolor="rgba(200, 200, 200, 0.5)"),
        title="Distribution of " + str(column)
    )

    fig.show()
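One caveat: as far as I understand Plotly's documentation, nbins is treated as a maximum number of desired bins, so the rendered histogram may not have exactly the Sturges count. If the exact count matters, one option is to compute the bins explicitly with NumPy and plot the counts yourself. The sketch below shows only the binning step; the function name `sturges_histogram` and the random sample data are hypothetical.

```python
import numpy as np

def sturges_histogram(values):
    """Return (counts, edges) using exactly k equal-width bins from Sturges' formula."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]                # drop missing values
    k = int(1 + np.log2(x.size))       # same truncation as display_histogram
    edges = np.linspace(x.min(), x.max(), k + 1)  # k bins need k + 1 edges
    counts, _ = np.histogram(x, bins=edges)
    return counts, edges

# Stand-in data for illustration only
rng = np.random.default_rng(0)
counts, edges = sturges_histogram(rng.normal(size=500))
print(len(counts), counts.sum())
```

The resulting counts and edges could then be passed to a bar chart, at the cost of handling bin widths and labels manually.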

3. Adjusting the layout between graphs

The following code uses plt.subplots_adjust() to adjust the spacing between subplots so that text does not overlap.

import seaborn as sns
import matplotlib.pyplot as plt

features = ["UPDRSIII_On", "UPDRSIII_Off", "Age"]
labels = ["UPDRSIII_On", "UPDRSIII_Off", "Age"]

# Set the style before creating the figure so that it applies to the new axes
sns.set_style('darkgrid')

fig, axs = plt.subplots(nrows=3, ncols=1, figsize=(14, 13))

axs = axs.flatten()

for x, feature in enumerate(features):
    ax = axs[x]
    _ = sns.histplot(data=subjects, x
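
The layout adjustment described in this section can also be sketched in a self-contained form. This is a minimal illustration, not the article's original code: the data is randomly generated stand-in data (the subjects DataFrame is not available here), plain Matplotlib histograms replace sns.histplot, and hspace=0.5 is an arbitrary illustrative value.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in data with made-up column names
data = {
    "A": rng.normal(size=200),
    "B": rng.normal(loc=3.0, size=200),
    "C": rng.uniform(size=200),
}

fig, axs = plt.subplots(nrows=3, ncols=1, figsize=(14, 13))
for ax, (name, values) in zip(axs, data.items()):
    # Bin count from Sturges' formula, truncated to an integer
    ax.hist(values, bins=int(1 + np.log2(len(values))))
    ax.set_title(f"Distribution of {name}")
    ax.set_xlabel(name)

# hspace is the vertical gap between rows, as a fraction of the average axes height
plt.subplots_adjust(hspace=0.5)
```

Increasing hspace leaves more room for each subplot's title and x-axis label; wspace plays the same role for columns.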
