
Missing Data Imputation with feature-engine: Notes

Posted at 2022-11-06

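Overview

These are notes taken from a course by the author of feature-engine, a fairly well-known package for data preprocessing. They cover imputation of missing data with feature-engine, extracting the key points comprehensively so that they can be reused in any situation. feature-engine is a package that was created for this course.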

The data is the same as last time, as listed below.

Period: November 2022
Environment: Colab
Package: feature-engine

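1. Overview and processing flow

This is exactly the same as last time, so it is omitted.

The package used is the one below.
Its arguments largely follow scikit-learn's conventions; see the official documentation for details.

Two installation methods are provided, depending on your environment.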

# with pip
pip install feature-engine

# with conda
conda install -c conda-forge feature_engine

2. Up to EDA

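The selection of target features is the same as last time, so it is omitted.
If no variables are specified, each feature-engine imputer decides by itself which columns of the DataFrame are numerical or categorical and imputes every column of the type it supports.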

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer
)

Collect the names of the features to be imputed.

data = pd.read_csv('houseprice.csv')
print(data.shape)
data.head()

# Split the columns by dtype: object columns are categorical, the rest numerical
features_categorical = [c for c in data.columns if data[c].dtype == 'O']
features_numerical = [c for c in data.columns if data[c].dtype != 'O']
features_numerical.remove('SalePrice')  # the target is not a feature

print(len(features_categorical), features_categorical)
print(len(features_numerical), features_numerical)

# Keep only the columns that contain at least one missing value
features_categorical_w_na = list(data[features_categorical].columns[data[features_categorical].isnull().mean()>0])
features_numerical_w_na = list(data[features_numerical].columns[data[features_numerical].isnull().mean()>0])
print(len(features_categorical_w_na), features_categorical_w_na)
print(len(features_numerical_w_na), features_numerical_w_na)

# Columns set aside as almost_na: they only get a missing indicator in the
# pipeline, so drop them from the list of columns to impute
almost_na = ['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']
for i in almost_na:
    features_categorical_w_na.remove(i)
print(len(features_categorical_w_na), features_categorical_w_na)

# Numerical columns to impute with the mean and with the median, respectively
numeric_features_mean = ['LotFrontage']
numeric_features_median = ['MasVnrArea', 'GarageYrBlt']

3. Imputation

First, split the data into train and test sets.

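# Assumption: clmns is not defined in this article. Taking every column except
# the target is consistent with the (1022, 80) shape printed below.
clmns = [c for c in data.columns if c != 'SalePrice']

X_train, X_test, y_train, y_test = train_test_split(
    data[clmns],  # just the features
    data['SalePrice'],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0, # for reproducibility
)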

X_train.shape, X_test.shape

((1022, 80), (438, 80))

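From here on, the procedure differs from the previous scikit-learn version.
scikit-learn returned a numpy array, whereas feature-engine takes and returns a DataFrame, so its biggest advantage is that the error-prone numpy-to-DataFrame conversion is no longer needed.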

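The pipeline flow follows the code of preprocessor below.
The steps are in the same order as last time, but because the column order of the input DataFrame is preserved and each step here targets its own set of columns, the steps can be arranged in any order.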

missing_ind
Equivalent to sklearn.impute.MissingIndicator(), except that it automatically appends the indicators as new columns at the end of the transformed DataFrame.
numeric_mean_imputer
Equivalent to sklearn.impute.SimpleImputer(strategy='mean')
numeric_median_imputer
Equivalent to sklearn.impute.SimpleImputer(strategy='median')
categoric_constant_imputer
Equivalent to sklearn.impute.SimpleImputer(strategy='constant')
categoric_frequent_imputer
Equivalent to sklearn.impute.SimpleImputer(strategy='most_frequent')

preprocessor = Pipeline([
    ('missing_ind', AddMissingIndicator(variables=almost_na)),
    ('numeric_mean_imputer', MeanMedianImputer(imputation_method='mean',
                                         variables=numeric_features_mean)),
    ('numeric_median_imputer', MeanMedianImputer(imputation_method='median',
                                         variables=numeric_features_median)),
    ('categoric_constant_imputer', CategoricalImputer(imputation_method='missing',
                                         variables=features_categorical_w_na[:6])),
    ('categoric_frequent_imputer', CategoricalImputer(imputation_method='frequent', 
                                         variables=features_categorical_w_na[6:])),
])

preprocessor.fit(X_train)

Pipeline(steps=[('missing_ind',
                 AddMissingIndicator(variables=['Alley', 'FireplaceQu',
                                                'PoolQC', 'Fence',
                                                'MiscFeature'])),
                ('numeric_mean_imputer',
                 MeanMedianImputer(imputation_method='mean',
                                   variables=['LotFrontage'])),
                ('numeric_median_imputer',
                 MeanMedianImputer(variables=['MasVnrArea', 'GarageYrBlt'])),
                ('categoric_constant_imputer',
                 CategoricalImputer(variables=['MasVnrType', 'BsmtQual',
                                               'BsmtCond', 'BsmtExposure',
                                               'BsmtFinType1',
                                               'BsmtFinType2'])),
                ('categoric_frequent_imputer',
                 CategoricalImputer(imputation_method='frequent',
                                    variables=['Electrical', 'GarageType',
                                               'GarageFinish', 'GarageQual',
                                               'GarageCond']))])

Apply it to X_train and X_test.

X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

Check just the column names.

print(data.columns)
print(X_train.columns)

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'Alley_na', 'FireplaceQu_na', 'PoolQC_na', 'Fence_na',
       'MiscFeature_na'],
      dtype='object')

Apart from 'Alley_na', ..., 'MiscFeature_na' being appended to the end of the original DataFrame, the column order is unchanged.

That's all.
