Encoding Categorical Variables

Posted at 2019-05-28

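Introduction

Articles such as "カテゴリカル変数のEncoding手法のまとめ" (a summary of encoding methods for categorical variables) introduce encoding techniques for categorical variables, so I tried implementing them in Python. I know this has been done many times before.

Most of them can be done with sklearn.preprocessing, so it may be worth reading the official documentation carefully once. It looked like it has other tools that are handy for preprocessing, too.

Data preparation

We use the Titanic dataset.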



from sklearn.preprocessing import LabelBinarizer
from sklearn import preprocessing
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('train.csv')

This time we use the categorical variable Embarked. To keep things simple, rows where it is null are dropped.

# Check for nulls -> 2 rows
df['Embarked'].isnull().sum()

# Drop the null rows this time
df = df.dropna(subset=['Embarked'])

Label encoding

# Label encoding (LabelEncoder)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(df['Embarked'].values)
decoded = le.inverse_transform(encoded)
df['encoded'] = encoded

print('Classes found: ', le.classes_)
print('Mapping: C, Q, S ->', le.transform(['C', 'Q', 'S']))
print('Encoded: ', encoded)
print('Decoded: ', decoded)
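
As a quick cross-check of what label encoding produces, the same integer codes can be obtained with pandas alone. This is only a sketch on made-up toy values (the `ports` Series is an assumption standing in for df['Embarked']); it relies on pandas sorting the inferred categories, which matches the alphabetical order of le.classes_.

```python
import pandas as pd

# Toy stand-in for df['Embarked'] (made-up values, not the real data)
ports = pd.Series(['S', 'C', 'Q', 'S'])

# Inferred categories are sorted, so C -> 0, Q -> 1, S -> 2
as_category = ports.astype('category')
codes = as_category.cat.codes
categories = as_category.cat.categories

print(codes.tolist())
print(list(categories))
```

Unlike LabelEncoder, this keeps everything inside pandas, but there is no fitted object to reuse on new data, so it is probably best suited to quick exploration.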

To convert several variables at once, use OrdinalEncoder.

# Label encoding (OrdinalEncoder)
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
encoded = oe.fit_transform(df[['Embarked', 'Sex']].values)
decoded = oe.inverse_transform(encoded)

print('Encoded: ', encoded)
print('Decoded: ', decoded)
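
OrdinalEncoder returns a plain NumPy array, so the column names and the row index are lost. The following sketch shows one way to put the codes back next to the original columns. The toy DataFrame, its non-contiguous index, and the `_code` suffix are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data with a gap in the index, similar to what dropna leaves behind
toy = pd.DataFrame(
    {'Embarked': ['S', 'C', 'Q', 'S'], 'Sex': ['male', 'female', 'female', 'male']},
    index=[10, 11, 13, 14],
)
cols = ['Embarked', 'Sex']

oe = OrdinalEncoder()
codes = oe.fit_transform(toy[cols])

# Reuse the original index so that join aligns row by row
encoded_df = pd.DataFrame(
    codes, columns=[c + '_code' for c in cols], index=toy.index
).astype(int)
result = toy.join(encoded_df)

print(result)
```

Passing index=toy.index is the important part: without it the new frame gets a fresh 0..n-1 index and the join would no longer line up with the original rows.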

Count encoding

# Count encoding: replace each category with its number of occurrences
import collections

counter = collections.Counter(df['Embarked'].values)
count_dict = dict(counter.most_common())
encoded = df['Embarked'].map(count_dict).values
df['encoded'] = encoded

print('Encoded: ', encoded)
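
The same count encoding can also be written with pandas' value_counts instead of collections.Counter. A minimal sketch on made-up toy values (the `ports` Series is an assumption, not the Titanic data):

```python
import pandas as pd

# Toy stand-in for df['Embarked'] (made-up values)
ports = pd.Series(['S', 'C', 'S', 'Q', 'S', 'C'])

# value_counts returns a Series indexed by category: S -> 3, C -> 2, Q -> 1
counts = ports.value_counts()

# Mapping with a Series looks each value up in its index
encoded = ports.map(counts)

print(encoded.tolist())
```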

Label-count encoding

# Label-count encoding: replace each category with its frequency rank (1 = most frequent)
import collections

counter = collections.Counter(df['Embarked'].values)
# most_common() returns the categories sorted by descending count
count_dict = dict(counter.most_common())
label_count_dict = {key: i for i, key in enumerate(count_dict.keys(), start=1)}
encoded = df['Embarked'].map(label_count_dict).values
df['encoded'] = encoded

print('Encoded: ', encoded)
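
For comparison, here is a pandas-only sketch of the same frequency-rank idea on made-up toy values (the `ports` Series is an assumption). It relies on value_counts returning categories in descending order of count; how ties are ordered is left to pandas, so the toy data deliberately has none.

```python
import pandas as pd

# Toy stand-in for df['Embarked'] (made-up values, no tied counts)
ports = pd.Series(['S', 'C', 'S', 'Q', 'S', 'C'])

# Categories in descending order of frequency: S, C, Q
counts = ports.value_counts()

# Rank 1 for the most frequent category, 2 for the next, and so on
ranks = pd.Series(range(1, len(counts) + 1), index=counts.index)
encoded = ports.map(ranks)

print(encoded.tolist())
```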

Target encoding

scikit-learn also provides one (sklearn.preprocessing.TargetEncoder, available from version 1.3), but using pandas makes it easier to shape the result as a DataFrame.

# Target encoding: replace each category with the mean of the target within it
target_dict = df.groupby('Embarked')['Survived'].mean().to_dict()
encoded = df['Embarked'].map(target_dict).values
df['encoded'] = encoded

print('Encoded: ', encoded)
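
The dictionary step can be skipped with groupby(...).transform, which returns the per-category mean already aligned to the original rows. This is a sketch on a made-up toy frame (the values below are assumptions, not the Titanic data):

```python
import pandas as pd

# Toy stand-in for the Titanic columns (made-up values)
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', 'Q', 'C', 'S'],
    'Survived': [1, 1, 0, 0, 0, 0],
})

# transform('mean') broadcasts each group's mean back onto its rows
toy['encoded'] = toy.groupby('Embarked')['Survived'].transform('mean')

print(toy)
```

Note that each row's own target value contributes to its category mean here, so when the encoded column feeds a model it is usually safer to compute the means on training data only.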

One-hot encoding

# One-hot encoding
# Set drop_first=True to avoid multicollinearity
df_ = pd.get_dummies(df, drop_first=True, columns=['Embarked'])

df_.head()
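
To see what drop_first actually removes, here is a small sketch on a made-up three-row frame (the `toy` frame and its values are assumptions). With categories C, Q and S, the first one in sorted order (C) is dropped, so a row of all zeros stands for C.

```python
import pandas as pd

# Toy frame with one row per port (made-up values)
toy = pd.DataFrame({'PassengerId': [1, 2, 3], 'Embarked': ['S', 'C', 'Q']})

# Only Embarked is expanded; other columns pass through unchanged
dummies = pd.get_dummies(toy, columns=['Embarked'], drop_first=True)

print(dummies.columns.tolist())
print(dummies)
```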

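OneHotEncoder can also convert several variables at once, but its drop option is awkward to use. In the scikit-learn version I used, even with drop='first' the column labels returned by ohe.get_feature_names still included the dropped categories, so getting the labels for a DataFrame took an extra step. The code below uses get_feature_names_out, which replaced get_feature_names in later releases.

# One-hot encoding (OneHotEncoder)
from sklearn.preprocessing import OneHotEncoder

# Written for scikit-learn >= 1.2; older releases use sparse=False and get_feature_names
ohe = OneHotEncoder(sparse_output=False)
# ohe = OneHotEncoder(sparse_output=False, drop='first')
encoded = ohe.fit_transform(df[['Embarked', 'Sex']].values)

print('Categories: ', ohe.categories_)
print('Feature names: ', ohe.get_feature_names_out(['Embarked', 'Sex']))

# Get the column names
label = ohe.get_feature_names_out(['Embarked', 'Sex'])

# Build a DataFrame, reusing df's index so the rows still line up after dropna
df_ = pd.DataFrame(encoded.astype(np.int8), columns=label, index=df.index)

# Concatenate the DataFrames
pd.concat([df, df_], axis=1)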

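LabelBinarizer is a transformer for binary values: a two-class variable becomes a single 0/1 column.

# Convert two-class data to binary values (LabelBinarizer)
lb = LabelBinarizer()
encoded = lb.fit_transform(df['Sex'].values)

print('Encoded: ', encoded)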

Hash encoding

# Hash encoding
# http://contrib.scikit-learn.org/categorical-encoding/index.html#
import category_encoders as ce

# Fit and transform the same columns that are listed in cols
encoder = ce.HashingEncoder(cols=['Embarked', 'Sex'], n_components=4)
encoder.fit(df[['Embarked', 'Sex']], df['Survived'])
encoded = encoder.transform(df[['Embarked', 'Sex']])
encoded
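
To make the idea behind hash encoding concrete, here is a rough standard-library sketch of the hashing trick for a single column. The helper `hash_encode` is hypothetical and is not claimed to reproduce category_encoders' exact hashing scheme; the choice of MD5 and the toy values are assumptions.

```python
import hashlib

def hash_encode(value, n_components=4):
    """Map one category to a one-hot vector of length n_components (illustrative helper)."""
    # A stable hash of the category string, reduced to a bucket index
    digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % n_components
    vec = [0] * n_components
    vec[bucket] = 1
    return vec

# Toy stand-in for df['Embarked'] (made-up values)
rows = [hash_encode(v) for v in ['S', 'C', 'Q', 'S']]
for row in rows:
    print(row)
```

The output width is fixed by n_components no matter how many categories exist, at the cost that different categories can collide in the same bucket.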

Best regards
