Suppose we have the following data.
import pandas as pd
train = pd.DataFrame({'Categorical': ['A', 'B', 'C', 'E', 'G', 'I', 'J', 'K'],
                      'Numerical': [1, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({'Categorical': ['A', 'B', 'D', 'F', 'H', 'J', 'K'],
                     'Numerical': [1, 2, 3, 4, 5, 6, 7]})
print(train)
print(test)
  Categorical  Numerical
0           A          1
1           B          2
2           C          3
3           E          4
4           G          5
5           I          6
6           J          7
7           K          8
  Categorical  Numerical
0           A          1
1           B          2
2           D          3
3           F          4
4           H          5
5           J          6
6           K          7
Convert train and test to dummy variables, each on its own.
train = pd.get_dummies(train)
test = pd.get_dummies(test)
print(train)
print(test)
   Numerical  Categorical_A  Categorical_B  Categorical_C  Categorical_E  \
0          1              1              0              0              0
1          2              0              1              0              0
2          3              0              0              1              0
3          4              0              0              0              1
4          5              0              0              0              0
5          6              0              0              0              0
6          7              0              0              0              0
7          8              0              0              0              0

   Categorical_G  Categorical_I  Categorical_J  Categorical_K
0              0              0              0              0
1              0              0              0              0
2              0              0              0              0
3              0              0              0              0
4              1              0              0              0
5              0              1              0              0
6              0              0              1              0
7              0              0              0              1
   Numerical  Categorical_A  Categorical_B  Categorical_D  Categorical_F  \
0          1              1              0              0              0
1          2              0              1              0              0
2          3              0              0              1              0
3          4              0              0              0              1
4          5              0              0              0              0
5          6              0              0              0              0
6          7              0              0              0              0

   Categorical_H  Categorical_J  Categorical_K
0              0              0              0
1              0              0              0
2              0              0              0
3              0              0              0
4              1              0              0
5              0              1              0
6              0              0              1
Neither the number of columns nor the column names match.
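To see exactly which columns differ, a set difference on the column names is enough. The following is a minimal sketch that rebuilds the same data from scratch; the names `train_dummies`, `test_dummies`, `only_in_train`, and `only_in_test` are introduced here only for illustration.

```python
import pandas as pd

train = pd.DataFrame({'Categorical': ['A', 'B', 'C', 'E', 'G', 'I', 'J', 'K'],
                      'Numerical': [1, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({'Categorical': ['A', 'B', 'D', 'F', 'H', 'J', 'K'],
                     'Numerical': [1, 2, 3, 4, 5, 6, 7]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Columns that exist in only one of the two frames (sorted for readability).
only_in_train = sorted(set(train_dummies.columns) - set(test_dummies.columns))
only_in_test = sorted(set(test_dummies.columns) - set(train_dummies.columns))

print(only_in_train)  # ['Categorical_C', 'Categorical_E', 'Categorical_G', 'Categorical_I']
print(only_in_test)   # ['Categorical_D', 'Categorical_F', 'Categorical_H']
```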
So we define a function that adds every missing column to each DataFrame, filled entirely with 0.
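def fill_missing_columns(df_a, df_b):
    # Add to df_b, filled with 0, every column that exists only in df_a.
    columns_for_b = set(df_a.columns) - set(df_b.columns)
    for column in columns_for_b:
        df_b[column] = 0
    # Add to df_a, filled with 0, every column that exists only in df_b.
    columns_for_a = set(df_b.columns) - set(df_a.columns)
    for column in columns_for_a:
        df_a[column] = 0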
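Run the function. It modifies both DataFrames in place, so afterwards we sort the columns by name to give the two frames the same column order.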
fill_missing_columns(train, test)
train.sort_index(axis=1, inplace=True)
test.sort_index(axis=1, inplace=True)
print('train')
print(train.shape)
print(train)
print('test')
print(test.shape)
print(test)
train
(8, 12)
   Categorical_A  Categorical_B  Categorical_C  Categorical_D  Categorical_E  \
0              1              0              0              0              0
1              0              1              0              0              0
2              0              0              1              0              0
3              0              0              0              0              1
4              0              0              0              0              0
5              0              0              0              0              0
6              0              0              0              0              0
7              0              0              0              0              0

   Categorical_F  Categorical_G  Categorical_H  Categorical_I  Categorical_J  \
0              0              0              0              0              0
1              0              0              0              0              0
2              0              0              0              0              0
3              0              0              0              0              0
4              0              1              0              0              0
5              0              0              0              1              0
6              0              0              0              0              1
7              0              0              0              0              0

   Categorical_K  Numerical
0              0          1
1              0          2
2              0          3
3              0          4
4              0          5
5              0          6
6              0          7
7              1          8
test
(7, 12)
   Categorical_A  Categorical_B  Categorical_C  Categorical_D  Categorical_E  \
0              1              0              0              0              0
1              0              1              0              0              0
2              0              0              0              1              0
3              0              0              0              0              0
4              0              0              0              0              0
5              0              0              0              0              0
6              0              0              0              0              0

   Categorical_F  Categorical_G  Categorical_H  Categorical_I  Categorical_J  \
0              0              0              0              0              0
1              0              0              0              0              0
2              0              0              0              0              0
3              1              0              0              0              0
4              0              0              1              0              0
5              0              0              0              0              1
6              0              0              0              0              0

   Categorical_K  Numerical
0              0          1
1              0          2
2              0          3
3              0          4
4              0          5
5              0          6
6              1          7
The column names and the number of columns now match.
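As an alternative sketch (not the approach used above), pandas' `DataFrame.align` with `join='outer'` and `axis=1` can return both frames with the union of their columns in a single call, leaving the inputs unmodified. Here `dtype=int` is passed to `get_dummies` so that the dummy columns hold 0/1 integers; from pandas 2.0 the default dummy dtype is bool, so the printed values would otherwise be True/False. The variable names are illustrative.

```python
import pandas as pd

train = pd.DataFrame({'Categorical': ['A', 'B', 'C', 'E', 'G', 'I', 'J', 'K'],
                      'Numerical': [1, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({'Categorical': ['A', 'B', 'D', 'F', 'H', 'J', 'K'],
                     'Numerical': [1, 2, 3, 4, 5, 6, 7]})

# dtype=int keeps the dummy columns as 0/1 integers on any recent pandas version.
train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# join='outer' keeps the union of the columns; axis=1 aligns columns only,
# so the differing row counts are untouched. Newly created cells are set to 0.
train_aligned, test_aligned = train_dummies.align(
    test_dummies, join='outer', axis=1, fill_value=0)

print(train_aligned.shape)  # (8, 12)
print(test_aligned.shape)   # (7, 12)
print(list(train_aligned.columns) == list(test_aligned.columns))  # True
```

Because `align` returns new DataFrames, `train_dummies` and `test_dummies` keep their original columns, which can be convenient when the unaligned frames are still needed.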
Addendum
@nkay kindly showed me the smarter approach below in the comments. Thank you.
import pandas as pd
train = pd.DataFrame({'Categorical': ['A', 'B', 'C', 'E', 'G', 'I', 'J', 'K'],
                      'Numerical': [1, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({'Categorical': ['A', 'B', 'D', 'F', 'H', 'J', 'K'],
                     'Numerical': [1, 2, 3, 4, 5, 6, 7]})
cat_list = {*train.Categorical, *test.Categorical}
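# Method 1: cast the column to a CategoricalDtype that lists every category
train.Categorical = train.Categorical.astype(pd.CategoricalDtype(cat_list))
# Method 2: build a Categorical directly with the full category list
test.Categorical = pd.Categorical(test.Categorical, cat_list)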
train = pd.get_dummies(train)
test = pd.get_dummies(test)
print(train)
print(test)
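This works because `get_dummies` emits one column for every category declared in a categorical dtype, including categories that never occur in that particular frame. The sketch below checks this on the same data. It uses a sorted list instead of a set so that the column order is reproducible, since a set has no guaranteed iteration order; the sorting and the names `categories`, `cat_dtype`, `train_dummies`, and `test_dummies` are additions for this sketch, not part of the code from the comment.

```python
import pandas as pd

train = pd.DataFrame({'Categorical': ['A', 'B', 'C', 'E', 'G', 'I', 'J', 'K'],
                      'Numerical': [1, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({'Categorical': ['A', 'B', 'D', 'F', 'H', 'J', 'K'],
                     'Numerical': [1, 2, 3, 4, 5, 6, 7]})

# Sorted union of the categories seen in either frame (a list, so the order is fixed).
categories = sorted(set(train['Categorical']) | set(test['Categorical']))
cat_dtype = pd.CategoricalDtype(categories)

# Both frames share one dtype, so get_dummies creates the same columns for each.
train_dummies = pd.get_dummies(train.astype({'Categorical': cat_dtype}))
test_dummies = pd.get_dummies(test.astype({'Categorical': cat_dtype}))

print(list(train_dummies.columns) == list(test_dummies.columns))  # True
print(train_dummies.shape, test_dummies.shape)  # (8, 12) (7, 12)
```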
References:
https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.api.types.CategoricalDtype.html
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Categorical.html