More than 3 years have passed since last update.

[Python]カウントデータのモデル比較

Last updated at 2021-10-24Posted at 2021-10-24

学術系のデータ分析をPythonで行い、非負のカウントデータを回帰分析しました。

非負のカウントデータの回帰モデルは、ポアソン回帰モデル、負の二項回帰モデル、ゼロ過剰ポアソン回帰モデル、ゼロ過剰負の二項回帰モデルなどど、いろいろあります。

状況に応じて最適なモデルを選択するのですが、回帰モデルごとの結果を簡単に表示できる関数を作成しました。

投稿内容は個人の見解であり、所属する組織の公式見解ではありません。

記事の対象者

今回の記事は、次の悩みを持つ人を対象に執筆しています。

データ分析をPythonで実施したい
カウントデータの回帰モデルごとの結果を効率よく比較したい

この記事を読むうえで必要な知識

Pythonによるデータ分析の基礎知識を習得している
統計学の基礎を習得している

ざっくりとした要件定義

###Input

説明変数のリスト
被説明変数の列名
対象のDataFrame

###Output

回帰結果
相関行列
AIC値 etc.

アウトプットのイメージ

##コード
scikit-learnのWine recognition datasetを用いて、コードを解説します。

このデータセットにおいて、インプットは次の通りです。

説明変数のリスト

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium']
被説明変数の列名

'class'
対象のDataFrame

df

###ライブラリのインポートとデータの読み込み

#ライブラリをインポート
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_wine

#データの読み込み
dataset = load_wine()
df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
df['class'] = dataset.target

###回帰モデルを選択し、回帰結果などを表示

#回帰モデルの一覧
model_list = ['Poisson', 'Logistic', 'NB', 'ZIP', 'ZINB']

def show_y_overview(y_col,df):
    total = df.shape[0]
    y_count = df[df[y_col]>0].shape[0]
    
    print('トータル：{}'.format(total))
    print('1以上の被説明変数：{}'.format(y_count))


def regression(x_list, y_col, df, model='Poisson'):
    df_reg=df.copy()
    
    y=df_reg[y_col].copy()
    X=df_reg[x_list].copy()
    
    # ポアソン回帰を実施
    if model == 'Poisson':
        model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson())
        result = model.fit()

    # ロジスティック回帰を実施
    elif model == 'Logistic':
        # ロジスティック回帰のために、購入回数が1回以上なら1、そうでないなら0に設定
        y=y.apply(lambda x: 1 if x>=1 else 0)

        model = sm.Logit(y, sm.add_constant(X))
        result = model.fit(disp=0)

    # zero-inflated Poissonを実施
    elif model == 'ZIP':
        model = sm.ZeroInflatedPoisson(endog=y, exog=sm.add_constant(X), exog_infl=sm.add_constant(X), inflation='logit')
        result = model.fit_regularized()

    # 負の二項分布
    elif model == 'NB':
        model = sm.GLM(y, sm.add_constant(X), family=sm.families.NegativeBinomial())
        result = model.fit()

    # zero-inflated NBPを実施
    elif model == 'ZINB':
        model = sm.ZeroInflatedNegativeBinomialP(
            endog=y, exog=sm.add_constant(X), exog_infl=sm.add_constant(X), inflation='logit')
        result = model.fit_regularized()

    else:
        sys.exit("正しいmodel名を入力してください。")

    print('AIC:{}'.format(result.aic.round()))
    display(result.summary())
    
def show_corr(col_list,df):
    df_temp=df.copy()
    df_temp=df_temp[col_list]
    return df_temp.corr()

def show_results(x_list, y_col, df, model='Poisson'):
    show_y_overview(y_col,df)
        
    regression(x_list,y_col,df,model)
    
    total_list = [y_col] + x_list    
    display(show_corr(total_list,df))
    return

x_list = ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium']
y_col = 'class'

show_results(x_list, y_col, df, model='NB')

先ほどもお見せしましたが、結果は次のように表示されます。

一例として算出しただけなので、結果の解釈はスキップします。

上の関数は回帰分析手法の一例になります。
回帰分析のパラメーターを変更すると、回帰結果が変わるので注意してください。

##おわりに
この記事では、回帰モデルごとの回帰結果を簡単に表示できる関数を作成しました。

お困りの場合は、このコードを使ってみてください。

##参考文献
7.1. Toy datasets — scikit-learn 1.0 documentation
Generalized Linear Models — statsmodels
Zero-inflated Negative Binomial Regression | Stata Data Analysis Examples

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up