More than 1 year has passed since last update.

Irisデータセットとstatsmodelsによるロジスティック回帰の実装

Posted at 2023-09-22

この記事では，インターネット上に公開されている Iris plants dataset (以下Iris データセット) を例にロジスティック回帰分析をstatsmodels (Python) で実装する方法を説明します．

概要

Iris (アヤメ) データセットは，セトサ (Iris-Setosa)，バーシクル (Iris-Versicolor)，バージニカ (Iris-virginica) の3種類のアヤメの花の「がく片 (sepal)」と「花弁 (petal)」の「幅」と「長さ」が収められたデータセットです．

このデータを用いて，がく片の長さ ($x_1$)，がく片の幅 ($x_2$)，花弁の長さ ($x_3$)，花弁の幅 ($x_4$) からアヤメの花がバージニカ種かそれ以外か ($y$) を予測するためのモデルを多重ロジスティック回帰分析を用いて構築します．
ここで

$y$ は反応変数で，バージニカ種のときは$y = 1$，そうでないときは$y = 0$です．
$x_1, x_2, x_3, x_4$ は説明変数です．
$x$ の前の数値は係数です．

この例で，アヤメの花がバージニカ種である確率を$p$とすると，ロジスティック回帰モデルでは$p$は次の式で表されます．
$$
p = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 x_1 + \cdots + \beta_4 x_4)]}
$$
ここで$\beta_0$は切片です．この式は次のように書き換えることもできます．
$$
\log(p/(1-p)) = \beta_0 + \beta_1 x_1 + \cdots + \beta_4 x_4
$$

セットアップ

はじめに，本説明で用いるモジュールをあらかじめインポートしておきます．

import pandas as pd
import statsmodels.api as sm

データをロードする

scikit-learnのload_iris()関数を用いてデータをロードします．pandasのDataFrameまたはSeries型でロードするために，as_frameをTrueに設定しています．

from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
iris.keys()

出力:

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

iris.dataに観察したアヤメの花のデータが、iris.targetにアヤメの花の分類データが格納されています．アヤメの花がセトサのときは0，バーシクルのときは1，バージニカのときは2で表現されています (iris.target_namesのインデックスと対応しています)．

iris_data = iris.data
iris_target = iris.target

モデルを構築するためにデータを整形する

反応変数をendog, 説明変数をexogとします．iris_targetには3種それぞれに数値が与えられているので，これを「バージニカ種のときは1，それ以外は0」となるように変換します．statsmodels の統計モデルにデータを渡すとき，デフォルトでは切片は含まれていないので，add_constant() 関数で切片項を追加する必要があります．

endog = (iris_target == 2)
exog = sm.add_constant(iris_data)

単ロジスティック回帰分析

上記で定義したlogit_resに対してsummary()メソッドを呼び出すことで，各説明変数に対する係数，標準偏差，Wald統計検定量，P値，95%信頼区間などの統計報告を得ることができます．

詳細は，statsmodels.discrete.discrete_model.LogitResults.summary - statsmodels 0.14.0をご覧ください．

まずはそれぞれの特徴量について単ロジスティック回帰分析を行い、$P \leq 0.1$である有意な特徴量について探索します。

from io import StringIO

print("有意な特徴量:\n")
# const以外のcolumnをそれぞれattrとして反復する
for attr in exog.columns[1:]:
    logit_mod = sm.Logit(endog, exog[["const", attr]])
    logit_res = logit_mod.fit()
    summary = logit_res.summary()

    table = pd.read_csv(StringIO(summary.tables[1].as_csv()), index_col=0)
    p = table.at[attr, "P>|z| "]
    if p <= 0.1:
        print(f"{attr} (P = {p})")

出力:

有意な特徴量:

Optimization terminated successfully.
         Current function value: 0.391152
         Iterations 7
sepal length (cm) (P = 0.0)
Optimization terminated successfully.
         Current function value: 0.627128
         Iterations 5
sepal width (cm) (P = 0.1)
Optimization terminated successfully.
         Current function value: 0.111440
         Iterations 11
petal length (cm) (P = 0.0)
Optimization terminated successfully.
         Current function value: 0.111403
         Iterations 10
petal width (cm) (P = 0.0)

重ロジスティック回帰分析

単ロジスティック回帰分析の結果をもとに、重ロジスティック回帰モデルを作成します。

logit_mod = sm.Logit(endog, exog)
logit_res = logit_mod.fit()
summary = logit_res.summary()
print(summary)

出力:

Optimization terminated successfully.
         Current function value: 0.039662
         Iterations 14
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 target   No. Observations:                  150
Model:                          Logit   Df Residuals:                      145
Method:                           MLE   Df Model:                            4
Date:                Fri, 22 Sep 2023   Pseudo R-squ.:                  0.9377
Time:                        22:50:38   Log-Likelihood:                -5.9493
converged:                       True   LL-Null:                       -95.477
Covariance Type:            nonrobust   LLR p-value:                 1.189e-37
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const               -42.6378     25.708     -1.659      0.097     -93.024       7.748
sepal length (cm)    -2.4652      2.394     -1.030      0.303      -7.158       2.228
sepal width (cm)     -6.6809      4.480     -1.491      0.136     -15.461       2.099
petal length (cm)     9.4294      4.737      1.990      0.047       0.145      18.714
petal width (cm)     18.2861      9.743      1.877      0.061      -0.809      37.381
=====================================================================================

Possibly complete quasi-separation: A fraction 0.73 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

参考

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up