Posted at 2024-10-09

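1. Descriptive and Inferential Statistics

Descriptive Statistics


  • Mean: A measure of the central tendency of the data; the sum of all values divided by the number of values.
  • Median: The middle value when the data are sorted in ascending or descending order (for an even number of values, the average of the two middle values).
  • Variance: A measure of how spread out the data are; the mean of the squared deviations of each data point from the mean.
  • Standard Deviation: The square root of the variance; it expresses the spread in the same units as the original data.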
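As a quick check of the definitions above, the following sketch computes the four measures with Python's standard `statistics` module. The data values are made up for illustration. Note that `pvariance` divides by n (the definition given above), whereas `variance` divides by n − 1, which is also what pandas' `var()` does by default.

```python
import statistics

# Illustrative data (assumed values, not from a real dataset)
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)            # sum of values / number of values
median = statistics.median(values)        # middle value of the sorted data
pop_var = statistics.pvariance(values)    # divides by n (population variance)
sample_var = statistics.variance(values)  # divides by n - 1 (sample variance)
pop_std = statistics.pstdev(values)       # square root of the population variance

print(f"mean={mean}, median={median}")
print(f"population variance={pop_var}, sample variance={sample_var:.4f}")
print(f"population standard deviation={pop_std}")
```

The two variances differ noticeably for small samples, so it is worth checking which one a library reports before comparing numbers.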


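Inferential Statistics


  • Interval Estimation (Confidence Interval): Estimates a population parameter such as the population mean or variance from a sample, and gives a range that is expected to contain the true value at a specified confidence level (for example, 95%).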
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Creating a sample dataset
data = {'X': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Y': [2, 4, 5, 4, 5, 7, 8, 9, 10, 12]}
df = pd.DataFrame(data)

# 1. Descriptive Statistics Calculation
mean_X = df['X'].mean()         # Mean
median_X = df['X'].median()     # Median
variance_X = df['X'].var()      # Variance
std_dev_X = df['X'].std()       # Standard Deviation

mean_Y = df['Y'].mean()
median_Y = df['Y'].median()
variance_Y = df['Y'].var()
std_dev_Y = df['Y'].std()

# Display descriptive statistics
print("Descriptive Statistics for X:")
print(f"Mean: {mean_X}")
print(f"Median: {median_X}")
print(f"Variance: {variance_X}")
print(f"Standard Deviation: {std_dev_X}")

print("\nDescriptive Statistics for Y:")
print(f"Mean: {mean_Y}")
print(f"Median: {median_Y}")
print(f"Variance: {variance_Y}")
print(f"Standard Deviation: {std_dev_Y}")

# 2. Linear Regression Model Creation
X = df[['X']]  # Explanatory variable
Y = df['Y']    # Target variable

# Creating and training the model
model = LinearRegression()
model.fit(X, Y)

# Getting the slope and intercept
slope = model.coef_[0]
intercept = model.intercept_
print("\nLinear Regression Model:")
print(f"Coefficient (Slope): {slope}")
print(f"Intercept: {intercept}")

# 3. Calculating the Coefficient of Determination (R²)
r_squared = model.score(X, Y)
print(f"R² (Coefficient of Determination): {r_squared}")

# 4. Plotting the Results
plt.figure(figsize=(8, 6))
plt.scatter(X, Y, color='blue', label='Data Points')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.title('Linear Regression Model')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()
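
To make the interval-estimation item concrete, here is a minimal sketch of a 95% confidence interval for a population mean based on the t distribution. The sample reuses the `Y` values from the dataset above; the 95% level and the assumption of roughly normal data are choices made for this example.

```python
import numpy as np
from scipy import stats

# Sample values (the Y column of the small dataset used earlier)
sample = np.array([2, 4, 5, 4, 5, 7, 8, 9, 10, 12], dtype=float)

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)      # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)      # two-sided 95% critical value

ci_low = mean - t_crit * sem
ci_high = mean + t_crit * sem
print(f"Sample mean: {mean:.3f}")
print(f"95% CI for the population mean: ({ci_low:.3f}, {ci_high:.3f})")
```

The interval widens when the sample is small or noisy, because both the standard error and the critical value grow.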

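  • Hypothesis Testing: Sets up a null hypothesis (H₀) and an alternative hypothesis (H₁), then uses sample data to judge whether there is enough evidence to reject H₀.
    • Examples: t-test, chi-squared test, ANOVA (analysis of variance).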
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Setting style for seaborn
sns.set(style="whitegrid")

# 1. Creating sample datasets for each hypothesis test
# For t-test: two independent samples
group1 = [20, 22, 19, 24, 21, 23, 22, 21]
group2 = [30, 31, 29, 32, 35, 33, 30, 34]

# For chi-squared test: observed and expected frequencies
observed = np.array([50, 30, 20])  # Observed frequencies in categories
expected = np.array([40, 40, 20])  # Expected frequencies

# For ANOVA: multiple groups with different means
group_A = [23, 24, 22, 25, 26, 23, 24]
group_B = [30, 32, 29, 28, 31, 30, 29]
group_C = [40, 42, 41, 39, 40, 41, 43]

# 2. Performing Hypothesis Tests

# 2.1 t-test (Independent two-sample t-test)
t_stat, p_value_ttest = stats.ttest_ind(group1, group2)
print("t-test Results:")
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value_ttest:.4f}")
print("Reject H₀" if p_value_ttest < 0.05 else "Fail to Reject H₀")

# 2.2 Chi-squared test
chi2_stat, p_value_chi2 = stats.chisquare(observed, f_exp=expected)
print("\nChi-squared Test Results:")
print(f"Chi-squared statistic: {chi2_stat:.4f}, p-value: {p_value_chi2:.4f}")
print("Reject H₀" if p_value_chi2 < 0.05 else "Fail to Reject H₀")

# 2.3 ANOVA (One-way ANOVA test)
f_stat, p_value_anova = stats.f_oneway(group_A, group_B, group_C)
print("\nANOVA Results:")
print(f"F-statistic: {f_stat:.4f}, p-value: {p_value_anova:.4f}")
print("Reject H₀" if p_value_anova < 0.05 else "Fail to Reject H₀")

# 3. Visualization

# 3.1 Visualization of t-test results using boxplots
plt.figure(figsize=(10, 5))
plt.subplot(1, 3, 1)
sns.boxplot(data=[group1, group2], palette="Set3")
plt.title(f"t-test\np-value: {p_value_ttest:.4f}")
plt.xticks([0, 1], ['Group 1', 'Group 2'])

# 3.2 Visualization of Chi-squared test results using bar plot
plt.subplot(1, 3, 2)
categories = ['Category 1', 'Category 2', 'Category 3']
x = np.arange(len(categories))  # Label locations
width = 0.35  # Width of bars

# Bar plot of observed and expected values
plt.bar(x - width/2, observed, width, label='Observed', color='skyblue')
plt.bar(x + width/2, expected, width, label='Expected', color='orange')
plt.title(f"Chi-squared Test\np-value: {p_value_chi2:.4f}")
plt.xticks(x, categories)
plt.legend()

# 3.3 Visualization of ANOVA results using boxplot
plt.subplot(1, 3, 3)
sns.boxplot(data=[group_A, group_B, group_C], palette="Set2")
plt.title(f"ANOVA\np-value: {p_value_anova:.4f}")
plt.xticks([0, 1, 2], ['Group A', 'Group B', 'Group C'])

# Show all plots
plt.tight_layout()
plt.show()

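  • p-Value: In a hypothesis test, the probability of obtaining a result at least as extreme as the observed data, assuming the null hypothesis is true. The smaller the p-value, the stronger the evidence for rejecting the null hypothesis (in favor of the alternative hypothesis).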
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# 1. Create sample datasets for each hypothesis test
# For t-test: two independent samples
group1 = [20, 22, 19, 24, 21, 23, 22, 21]
group2 = [30, 31, 29, 32, 35, 33, 30, 34]

# For chi-squared test: observed and expected frequencies
observed = np.array([50, 30, 20])  # Observed frequencies in categories
expected = np.array([40, 40, 20])  # Expected frequencies

# For ANOVA: multiple groups with different means
group_A = [23, 24, 22, 25, 26, 23, 24]
group_B = [30, 32, 29, 28, 31, 30, 29]
group_C = [40, 42, 41, 39, 40, 41, 43]

# 2. Perform Hypothesis Tests

# 2.1 t-test (Independent two-sample t-test)
t_stat, p_value_ttest = stats.ttest_ind(group1, group2)
print("t-test Results:")
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value_ttest:.4f}")
if p_value_ttest < 0.05:
    print("Reject H₀ (Significant difference between group means)\n")
else:
    print("Fail to Reject H₀ (No significant difference between group means)\n")

# 2.2 Chi-squared test
chi2_stat, p_value_chi2 = stats.chisquare(observed, f_exp=expected)
print("Chi-squared Test Results:")
print(f"Chi-squared statistic: {chi2_stat:.4f}, p-value: {p_value_chi2:.4f}")
if p_value_chi2 < 0.05:
    print("Reject H₀ (Observed frequencies significantly differ from expected)\n")
else:
    print("Fail to Reject H₀ (Observed frequencies do not significantly differ from expected)\n")

# 2.3 ANOVA (One-way ANOVA test)
f_stat, p_value_anova = stats.f_oneway(group_A, group_B, group_C)
print("ANOVA Results:")
print(f"F-statistic: {f_stat:.4f}, p-value: {p_value_anova:.4f}")
if p_value_anova < 0.05:
    print("Reject H₀ (At least one group mean is significantly different)\n")
else:
    print("Fail to Reject H₀ (No significant difference between group means)\n")

# 3. Visualization

# 3.1 Visualization of t-test results using boxplots
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
sns.boxplot(data=[group1, group2], palette="Set3")
plt.title(f"t-test\np-value: {p_value_ttest:.4f}")
plt.xticks([0, 1], ['Group 1', 'Group 2'])

# 3.2 Visualization of Chi-squared test results using bar plot
plt.subplot(1, 3, 2)
categories = ['Category 1', 'Category 2', 'Category 3']
x = np.arange(len(categories))  # Label locations
width = 0.35  # Width of bars

# Bar plot of observed and expected values
plt.bar(x - width/2, observed, width, label='Observed', color='skyblue')
plt.bar(x + width/2, expected, width, label='Expected', color='orange')
plt.title(f"Chi-squared Test\np-value: {p_value_chi2:.4f}")
plt.xticks(x, categories)
plt.legend()

# 3.3 Visualization of ANOVA results using boxplot
plt.subplot(1, 3, 3)
sns.boxplot(data=[group_A, group_B, group_C], palette="Set2")
plt.title(f"ANOVA\np-value: {p_value_anova:.4f}")
plt.xticks([0, 1, 2], ['Group A', 'Group B', 'Group C'])

# Show all plots
plt.tight_layout()
plt.show()
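
The definition of the p-value can be checked by hand. The sketch below, which assumes the default equal-variance form of `stats.ttest_ind`, recomputes the two-sided p-value directly from the t distribution as the probability of seeing a statistic at least as extreme as the observed one under H₀.

```python
import math
from scipy import stats

group1 = [20, 22, 19, 24, 21, 23, 22, 21]
group2 = [30, 31, 29, 32, 35, 33, 30, 34]

# Equal-variance two-sample t-test (the scipy default)
t_stat, p_scipy = stats.ttest_ind(group1, group2)

# Under H0 the statistic follows a t distribution with n1 + n2 - 2 degrees of freedom
dof = len(group1) + len(group2) - 2

# Two-sided p-value: probability that |T| >= |t_stat| when H0 is true
p_manual = 2 * stats.t.sf(abs(t_stat), dof)

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value from ttest_ind: {p_scipy:.3e}")
print(f"p-value computed by hand: {p_manual:.3e}")
print("Values agree:", math.isclose(p_scipy, p_manual, rel_tol=1e-6))
```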

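2. Regression Analysis


Linear Regression

  • Simple linear regression: Models the effect of one independent variable on one dependent variable.

    • Example: \( y = \beta_0 + \beta_1 x + \epsilon \)
      • \( y \): dependent variable, \( x \): independent variable, \( \beta_0 \): intercept, \( \beta_1 \): regression coefficient, \( \epsilon \): error term.
  • Multiple linear regression: Models the effect of several independent variables on one dependent variable.

    • Example: \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon \)
      • \( x_1, x_2, \ldots, x_n \) are the independent variables.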
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Create sample data for multiple linear regression
# Independent variables (features)
np.random.seed(0)  # For reproducibility
X1 = np.random.rand(100) * 10  # Feature 1
X2 = np.random.rand(100) * 50  # Feature 2
X3 = np.random.rand(100) * 30  # Feature 3

# Dependent variable (target)
# Generating a target variable based on a known linear relationship with some noise
y = 3 + 2 * X1 + 0.5 * X2 - 1.5 * X3 + np.random.randn(100) * 5

# Create a DataFrame for better visualization
data = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'y': y})

# 2. Build the multiple linear regression model
# Independent variables matrix
X = data[['X1', 'X2', 'X3']]
# Dependent variable
y = data['y']

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Get the model parameters
intercept = model.intercept_
coefficients = model.coef_

# Display the model parameters
print("Multiple Linear Regression Model:")
print(f"Intercept: {intercept:.4f}")
for idx, col in enumerate(X.columns):
    print(f"  {col}: {coefficients[idx]:.4f}")

# Calculate the R-squared value to evaluate model performance
r_squared = model.score(X, y)
print(f"\nR² (Coefficient of Determination): {r_squared:.4f}")

# 3. Visualization of the regression results

# Pairplot to visualize the relationships between variables
sns.pairplot(data)
plt.suptitle("Pairplot of Features and Target Variable", y=1.02)
plt.show()

# Heatmap to show correlation matrix
plt.figure(figsize=(6, 4))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()

# 4. Predictions
# Making predictions using the model
y_pred = model.predict(X)

# Plotting actual vs predicted values
plt.figure(figsize=(8, 6))
plt.scatter(y, y_pred, alpha=0.7, color='blue')
plt.xlabel("Actual y values")
plt.ylabel("Predicted y values")
plt.title("Actual vs Predicted y values")
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')  # Diagonal line for reference
plt.grid()
plt.show()
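
The coefficients of a multiple regression model can also be estimated without scikit-learn by solving the least-squares problem directly. This is a minimal sketch with two hypothetical features and assumed true coefficients (3, 2, −1.5); it is meant to show what fitting does, not to replace the library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 10, size=(n, 2))   # two hypothetical features
# Assumed true relationship with a little Gaussian noise
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, n)

# Design matrix with a leading column of ones for the intercept beta_0
A = np.column_stack([np.ones(n), X])

# Least squares: choose beta to minimize ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print("Estimated [beta_0, beta_1, beta_2]:", np.round(beta, 3))
```

The estimates should land close to the values used to generate the data, with the gap shrinking as the noise decreases or the sample grows.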

Nonlinear Regression

  • Models nonlinear relationships using functions such as exponential, logarithmic, or polynomial functions.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# 1. Creating sample data
# Generate random data points for independent variable X
np.random.seed(42)  # For reproducibility
X = np.random.rand(100) * 10  # Random values between 0 and 10
# Generate dependent variable y using a nonlinear function with noise
y = 2 * np.sin(1.5 * X) + np.random.normal(size=X.shape) * 0.5  # Sine function with noise

# Convert X to a 2D array for sklearn
X = X.reshape(-1, 1)

# 2. Polynomial Regression (Degree 3)
# Create polynomial features (1, x, x^2, x^3); the constant column is included by default
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

# Fit the polynomial regression model
poly_model = LinearRegression()
poly_model.fit(X_poly, y)

# Predict using the polynomial model
y_poly_pred = poly_model.predict(X_poly)

# 3. Exponential Regression
# Define the exponential function
def exp_func(x, a, b, c):
    return a * np.exp(b * x) + c

# Use curve_fit to find the best fit parameters for the exponential function
popt, _ = curve_fit(exp_func, X.ravel(), y, p0=(1, 0.1, 1))  # Initial guesses for a, b, c
y_exp_pred = exp_func(X.ravel(), *popt)

# 4. Visualization of Nonlinear Regression Results
plt.figure(figsize=(12, 6))

# Scatter plot of the original data
plt.scatter(X, y, color='blue', label='Original Data', alpha=0.6)

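# Sort by X so each fitted curve is drawn as a smooth line rather than a zigzag
order = np.argsort(X.ravel())

# Plot for Polynomial Regression
plt.plot(X[order], y_poly_pred[order], color='red', label='Polynomial Regression (Degree 3)', linewidth=2)

# Plot for Exponential Regression
plt.plot(X[order], y_exp_pred[order], color='green', linestyle='--', label='Exponential Regression', linewidth=2)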

# Plot settings
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid()
plt.show()

# 5. Display model coefficients and evaluation

# Polynomial Regression Coefficients
print("Polynomial Regression (Degree 3) Coefficients:")
print(f"Intercept: {poly_model.intercept_}")
print(f"Coefficients: {poly_model.coef_}")

# Exponential Regression Parameters
print("\nExponential Regression Parameters:")
print(f"a: {popt[0]:.4f}, b: {popt[1]:.4f}, c: {popt[2]:.4f}")

# 6. Predictions for specific values (example)
x_new = np.array([[2], [4], [6]])  # Example values for prediction

# Predict using polynomial model
x_poly_new = poly.transform(x_new)
y_poly_new = poly_model.predict(x_poly_new)
print(y_poly_new)

# Predict using exponential model
y_exp_new = exp_func(x_new.ravel(), *popt)
print(y_exp_new)

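Logistic Regression

  • A regression model for explaining a binary outcome (success/failure, pass/fail, and so on). It models the probability with a logistic curve, so the output can be interpreted as a probability.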
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Generate sample data
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=42)

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate the results
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

# Visualize the data points and the decision boundary
def plot_decision_boundary(X, y, model):
    h = .02  # Mesh step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o', s=100)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Logistic Regression Decision Boundary')
    plt.show()

plot_decision_boundary(X_test, y_test, model)
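
The logistic curve itself is short enough to write out. The sketch below uses hypothetical coefficients `b0` and `b1` for a one-feature model and shows how a linear score is turned into a probability and then into a class label with a 0.5 threshold.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature model: p = sigmoid(b0 + b1 * x)
b0, b1 = -4.0, 1.0
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

p = sigmoid(b0 + b1 * x)            # predicted probability of class 1
labels = (p >= 0.5).astype(int)     # class 1 when the probability is at least 0.5

for xi, pi, li in zip(x, p, labels):
    print(f"x={xi:.1f}  p={pi:.3f}  label={li}")
```

The point where `b0 + b1 * x` equals zero is where the probability crosses 0.5, which is what the decision boundary in the plot above represents.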

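3. Design of Experiments (DOE)


Factorial Design

  • A method for evaluating simultaneously how multiple independent variables (factors) affect a dependent variable. Every combination of factor levels is run, which makes it possible to examine interaction effects.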
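
A full factorial plan is simply the Cartesian product of the factor levels. The sketch below enumerates the runs for three hypothetical two-level factors; the factor names and levels are invented for illustration.

```python
from itertools import product

# Hypothetical factors and their levels
factors = {
    "temperature": [150, 180],
    "pressure": [1.0, 2.0],
    "catalyst": ["A", "B"],
}

# Full factorial design: every combination of every level (2 x 2 x 2 runs)
runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for i, run in enumerate(runs, start=1):
    print(i, run)
```

The number of runs grows multiplicatively with each added factor or level, which is the main practical cost of a full factorial design.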

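Orthogonal Array

  • An orthogonal array is used to plan a balanced subset of factor-level combinations, so that the effect of every factor can be investigated efficiently with fewer runs. Orthogonal arrays are central to the Taguchi Method.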
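
As an illustration, the L4(2^3) array covers three two-level factors in 4 runs instead of the 8 a full factorial would need. The helper `is_orthogonal` below is a hypothetical function written for this example; it checks the defining property that every pair of columns contains each combination of levels equally often.

```python
from collections import Counter
from itertools import combinations

# L4(2^3) orthogonal array: 4 runs, 3 two-level factors (levels coded 0 and 1)
L4 = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]

def is_orthogonal(array):
    """Return True if every pair of columns contains each level pair equally often."""
    n_cols = len(array[0])
    for i, j in combinations(range(n_cols), 2):
        counts = Counter((row[i], row[j]) for row in array)
        n_pairs = len({row[i] for row in array}) * len({row[j] for row in array})
        if len(counts) != n_pairs:
            return False            # some level combination never appears
        if len(set(counts.values())) != 1:
            return False            # combinations appear unequally often
    return True

print("L4 is orthogonal:", is_orthogonal(L4))
```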

Response Surface Methodology

  • A method that uses second- or third-order polynomial models to optimize experimental results. It is useful for finding the optimal conditions of a process.
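
For a single factor, the idea reduces to fitting a parabola and locating its stationary point. The sketch below uses simulated responses with an assumed true optimum at x = 4.2; real response-surface studies usually involve several factors and a designed set of runs.

```python
import numpy as np

# Hypothetical experiment: response measured at seven settings of one factor
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
rng = np.random.default_rng(1)
# Simulated response with an assumed true optimum at x = 4.2
y = 50 - 2.0 * (x - 4.2) ** 2 + rng.normal(0, 0.3, x.size)

# Fit the second-order model y = b2*x^2 + b1*x + b0 (polyfit returns the highest degree first)
b2, b1, b0 = np.polyfit(x, y, deg=2)

# Stationary point of the fitted parabola; it is a maximum when b2 < 0
x_opt = -b1 / (2 * b2)
y_opt = b0 + b1 * x_opt + b2 * x_opt ** 2

print(f"Fitted model: y = {b0:.2f} + {b1:.2f} x + {b2:.2f} x^2")
print(f"Estimated optimum: x = {x_opt:.2f}, predicted response = {y_opt:.2f}")
```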

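4. Quality Control Techniques


Control Chart

  • Monitors statistics such as the mean and range (R) of time-series data to evaluate the stability of a process.
    • Examples: X̄-R chart, X̄-S chart, p chart (monitoring the fraction defective).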
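
The control limits of an X̄-R chart can be computed from subgroup means and ranges. The sketch below uses made-up measurements in subgroups of size 5 and the standard tabulated constants for that subgroup size (A2 = 0.577, D3 = 0, D4 = 2.114); a real chart would normally use 20 or more subgroups.

```python
# Hypothetical subgroups of size 5 (for example, five parts measured per hour)
subgroups = [
    [10.1, 9.9, 10.0, 10.2, 9.8],
    [10.0, 10.1, 9.9, 10.0, 10.1],
    [9.8, 10.0, 10.2, 9.9, 10.0],
    [10.1, 10.0, 9.9, 10.1, 10.0],
]

# Standard control-chart constants for subgroup size n = 5
A2, D3, D4 = 0.577, 0.0, 2.114

xbars = [sum(g) / len(g) for g in subgroups]     # subgroup means
ranges = [max(g) - min(g) for g in subgroups]    # subgroup ranges
xbarbar = sum(xbars) / len(xbars)                # grand mean: center line of the X-bar chart
rbar = sum(ranges) / len(ranges)                 # mean range: center line of the R chart

xbar_ucl, xbar_lcl = xbarbar + A2 * rbar, xbarbar - A2 * rbar
r_ucl, r_lcl = D4 * rbar, D3 * rbar

print(f"X-bar chart: LCL={xbar_lcl:.3f}, CL={xbarbar:.3f}, UCL={xbar_ucl:.3f}")
print(f"R chart:     LCL={r_lcl:.3f}, CL={rbar:.3f}, UCL={r_ucl:.3f}")
```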

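Process Capability Index

  • Uses indices such as Cp and Cpk to evaluate how well a process conforms to its specifications. A higher Cp means the process spread is narrow relative to the specification width; Cpk additionally accounts for how well the process is centered.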
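
Both indices follow directly from the specification limits and the process mean and standard deviation: Cp = (USL − LSL) / 6σ and Cpk = min(USL − μ, μ − LSL) / 3σ. The measurements and limits below are hypothetical.

```python
import statistics

# Hypothetical measurements and specification limits
measurements = [10.02, 9.98, 10.05, 9.97, 10.01, 10.03, 9.99, 10.00, 10.04, 9.96]
LSL, USL = 9.85, 10.15

mu = statistics.mean(measurements)
sigma = statistics.stdev(measurements)        # sample standard deviation

cp = (USL - LSL) / (6 * sigma)                # specification width vs. process spread
cpk = min(USL - mu, mu - LSL) / (3 * sigma)   # also penalizes an off-center mean

print(f"mean={mu:.4f}, sigma={sigma:.4f}")
print(f"Cp={cp:.2f}, Cpk={cpk:.2f}")
```

Cpk can never exceed Cp; the two are equal only when the process mean sits exactly midway between the limits.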

5. Reliability Engineering



  • Models the failure rate as a function of time to predict product lifetime.
  • Weibull Distribution: A distribution well suited to lifetime data; adjusting its shape parameter lets it represent a variety of failure mechanisms.
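
A Weibull model can be fitted to lifetime data with `scipy.stats.weibull_min`. The sketch below simulates lifetimes from assumed parameters (shape 2.0, scale 1000 hours), fits the distribution with the location fixed at zero, and reads off the reliability at 500 hours. The data and parameter values are illustrative only.

```python
from scipy import stats

# Simulated lifetimes in hours, drawn from a Weibull distribution with assumed parameters
true_shape, true_scale = 2.0, 1000.0
lifetimes = stats.weibull_min.rvs(true_shape, scale=true_scale, size=500, random_state=0)

# Fit shape and scale, fixing the location at 0 because lifetimes start at t = 0
shape, loc, scale = stats.weibull_min.fit(lifetimes, floc=0)

# Reliability (probability of surviving past t) at 500 hours under the fitted model
reliability_500 = stats.weibull_min.sf(500, shape, loc=loc, scale=scale)

print(f"Fitted shape: {shape:.3f}, fitted scale: {scale:.1f}")
print(f"Estimated reliability at 500 h: {reliability_500:.3f}")
```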

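Bathtub Curve

  • A curve describing how the failure rate changes over time, divided into three phases: early failure, random failure, and wear-out failure.
    • Early failure period: The failure rate is high because of initial defects and design errors, and decreases over time.
    • Random failure period: The failure rate is stable (roughly constant).
    • Wear-out failure period: The failure rate rises as parts deteriorate.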
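
One common way to describe the three phases, assumed here for illustration, is a Weibull hazard function whose shape parameter is below 1, equal to 1, and above 1 respectively. The scale value and time points in this sketch are arbitrary.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard (instantaneous failure rate): h(t) = (k / s) * (t / s) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

times = [100, 500, 1000, 2000]   # hours
scale = 1000.0                   # assumed characteristic life

phases = [
    ("early failure (shape 0.5)", 0.5),   # hazard decreases with time
    ("random failure (shape 1.0)", 1.0),  # hazard is constant
    ("wear-out (shape 3.0)", 3.0),        # hazard increases with time
]

for label, shape in phases:
    rates = [weibull_hazard(t, shape, scale) for t in times]
    print(label, [f"{r:.5f}" for r in rates])
```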

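6. Multivariate Analysis


Principal Component Analysis (PCA)

  • A method that transforms high-dimensional data into a lower-dimensional representation to simplify its structure. The directions along which the data vary most are extracted as the principal components.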
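
The "direction of greatest variance" can be found from the eigenvectors of the covariance matrix. This sketch builds hypothetical 2-D data stretched along the diagonal and recovers that direction with NumPy alone; in practice a library class such as scikit-learn's `PCA` is the usual choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data stretched along the direction (1, 1)
t = rng.normal(0, 3.0, 300)
X = np.column_stack([t, t]) + rng.normal(0, 0.5, (300, 2))

Xc = X - X.mean(axis=0)                  # center each variable
cov = np.cov(Xc, rowvar=False)           # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]                      # first principal component (unit vector)
scores = Xc @ pc1                        # data projected onto one dimension
explained = eigvals / eigvals.sum()      # share of total variance per component

print("First principal component:", np.round(pc1, 3))
print("Explained variance ratio:", np.round(explained, 3))
```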

Factor Analysis

  • Summarizes multiple observed variables in terms of a small number of latent factors to analyze the underlying structure of the data.
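
A minimal sketch using scikit-learn's `FactorAnalysis` follows. The data are simulated from two hypothetical latent factors with an assumed loading pattern, so the setup is illustrative rather than a recommended workflow. Estimated loadings are only determined up to rotation and sign, so they need not match the generating pattern exactly.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
# Two hypothetical latent factors drive six observed variables
latent = rng.normal(size=(n, 2))
loadings = np.array([
    [0.9, 0.0], [0.8, 0.0], [0.7, 0.0],   # variables 1-3 load on factor 1
    [0.0, 0.9], [0.0, 0.8], [0.0, 0.7],   # variables 4-6 load on factor 2
])
X = latent @ loadings.T + rng.normal(scale=0.3, size=(n, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)              # estimated factor scores, shape (n, 2)

print("Estimated loadings (factors x variables):")
print(np.round(fa.components_, 2))
```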

Cluster Analysis

  • A method for grouping data by similarity. Common techniques include K-means and hierarchical clustering.
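
The sketch below runs K-means on three hypothetical, well-separated groups of 2-D points. The number of clusters is assumed to be known in advance, which is rarely the case with real data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three hypothetical, well-separated groups in 2-D, 50 points each
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# n_clusters is assumed known here; n_init restarts guard against poor initializations
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster centers:")
print(np.round(kmeans.cluster_centers_, 1))
print("Points per cluster:", np.bincount(labels))
```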

7. Statistical Process Control (SPC)


  • Control Chart: Sets an upper control limit (UCL) and a lower control limit (LCL) and checks whether the data stay within those limits.
  • Histogram: Displays the distribution of the data visually and helps check for outliers.
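
The control-limit check can be sketched in a few lines. This example sets the limits at the baseline mean ± 3 standard deviations, a common convention that is assumed here, and flags new observations that fall outside them. The data are invented, and the sample standard deviation is used as a simplification.

```python
import statistics

# Hypothetical in-control baseline used to set the limits
baseline = [50.1, 49.8, 50.3, 49.9, 50.0, 50.2, 49.7, 50.1, 49.9, 50.0]
center = statistics.mean(baseline)
sigma = statistics.stdev(baseline)

# 3-sigma control limits (assumed convention)
ucl = center + 3 * sigma
lcl = center - 3 * sigma

# New observations to monitor
new_points = [50.2, 49.9, 51.5, 50.0]
out_of_control = [(i, x) for i, x in enumerate(new_points) if not lcl <= x <= ucl]

print(f"LCL={lcl:.3f}, center={center:.3f}, UCL={ucl:.3f}")
print("Out-of-control points (index, value):", out_of_control)
```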


