More than 3 years have passed since last update.

[Python]説明変数群ごとのAICを算出する

Last updated at 2021-10-24Posted at 2021-10-24

学術系のデータ分析をPythonで行い、
**「複数の説明変数群を作成し、どの説明変数群の組み合わせが最適かAICで確認する」**というプロセスがありました。
なおAICとは、'Akaike's Information Criterion'の略で、回帰モデルが最適かどうか判断するための指標の一つです。(詳しくはこちら)

今回の記事では、このプロセスを効率化するためのコードを解説します。

投稿内容は個人の見解であり、所属する組織の公式見解ではありません。

記事の対象者

今回の記事は、次の悩みを持つ人を対象に執筆しています。

データ分析をPythonで実施したい
AICの比較を効率化したい

この記事を読むうえで必要な知識

Pythonによるデータ分析の基礎知識を習得している
統計学の基礎を習得している

ざっくりとした要件定義

Input

必ず含める説明変数群
含めるか検討中の説明変数群

Output

含めるか検討中の説明変数群を全探索し、AICを算出

アウトプットのイメージ
('ash','other_ingredients', 'color'の3つが、含めるか検討中の説明変数群の場合、)

# output
[]
AIC:462.0
['ash']
AIC:451.0
['other_ingredients']
AIC:437.0
['ash', 'other_ingredients']
AIC:434.0
['color']
AIC:437.0
['ash', 'color']
AIC:430.0
['other_ingredients', 'color']
AIC:439.0
['ash', 'other_ingredients', 'color']
AIC:436.0

コード

scikit-learnのWine recognition datasetを用いて、コードを解説します。

今回の具体例では、インプットは次の通りです。

必ず含める説明変数

'alcohol','malic_acid'
検討中の説明変数群の名称

'ash','other_ingredients','color'

ライブラリのインポートとデータの読み込み

# ライブラリをインポート
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_wine
import warnings

# データの読み込み
dataset = load_wine()
df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
df['class'] = dataset.target

説明変数群ごとのAICを比較

def show_AIC(x_list, y_col, df):
    '''
    対象の説明変数群、被説明変数、dfでのAICを算出する関数
    '''
    warnings.simplefilter('ignore')
    # nanを削除
    df_reg=df.copy()
    df_reg.dropna(axis=0,how="any",inplace=True)
    df_reg.reset_index(inplace=True)
    # yとXをコピー
    y=df_reg[y_col].copy()
    X=df_reg[x_list].copy()
    # 負の二項回帰モデルを使用
    model = sm.GLM(y, sm.add_constant(X), family=sm.families.NegativeBinomial())
    result = model.fit()
    print('AIC:{}'.format(result.aic.round()))

# グループ分けした辞書を作成する
x_dict = {'alc':['alcohol'],
          'acid':['malic_acid'],
          'ash':['ash','alcalinity_of_ash'],
         'other_ingredients':
         ['magnesium', 'total_phenols',  'flavanoids', 'nonflavanoid_phenols','proanthocyanins'],
         'color':['color_intensity','hue','od280/od315_of_diluted_wines']}

# 必ず含める説明変数の名称を指定する
fixed_name = ['alc', 'acid']

# 「必ず含める説明変数のリスト」と「含めるか検討する説明変数のdict」を作成する
fixed_list=[]
AIC_dict = x_dict.copy()

for val in fixed_name:
    temp = AIC_dict.pop(val) #popすると、AIC_dictから削除される
    fixed_list+=temp

n=len(AIC_dict)
for i in range(2 ** n):
    temp_list = []
    temp_title = []
    for j, key in enumerate(AIC_dict):  # このループが一番のポイント
        if ((i >> j) & 1):  # 順に右にシフトさせ最下位bitのチェックを行う
            #最下位bitが1の場合、対象を加える
            temp_title.append(key)
            temp_list += x_dict[key]
    print(temp_title)
    temp_list = fixed_list + temp_list
    
    show_AIC(temp_list,'class',df)

先ほどもお見せしましたが、結果は次のように表示されます。

[]
AIC:462.0
['ash']
AIC:451.0
['other_ingredients']
AIC:437.0
['ash', 'other_ingredients']
AIC:434.0
['color']
AIC:437.0
['ash', 'color']
AIC:430.0
['other_ingredients', 'color']
AIC:439.0
['ash', 'other_ingredients', 'color']
AIC:436.0

一例としてAICを算出しただけなので、結果の解釈はスキップします。

おわりに

この記事では、説明変数群を全探索しAICを算出しました。

お困りの場合は、このコードを使ってみてください。

参考文献

赤池情報量規準
 7.1. Toy datasets — scikit-learn 1.0 documentation
Python de アルゴリズム（bit全探索）
Generalized Linear Models — statsmodels
statsmodels.discrete.discrete_model.NegativeBinomial — statsmodels

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up