More than 5 years have passed since last update.

Fixing the column-count mismatch when training and test data are converted to dummy variables

Posted at 2019-07-16

Suppose we have the following data.

import pandas as pd

train = pd.DataFrame({'Categorical':['A','B','C','E','G','I','J','K'],
                     'Numerical':[1, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({'Categorical':['A','B','D','F','H','J','K'],
                    'Numerical':[1, 2, 3, 4, 5, 6, 7]})
print(train)
print(test)
  Categorical  Numerical
0           A          1
1           B          2
2           C          3
3           E          4
4           G          5
5           I          6
6           J          7
7           K          8
  Categorical  Numerical
0           A          1
1           B          2
2           D          3
3           F          4
4           H          5
5           J          6
6           K          7

Convert train and test to dummy variables separately.

train = pd.get_dummies(train)
test = pd.get_dummies(test)
print(train)
print(test)
   Numerical  Categorical_A  Categorical_B  Categorical_C  Categorical_E  \
0          1              1              0              0              0   
1          2              0              1              0              0   
2          3              0              0              1              0   
3          4              0              0              0              1   
4          5              0              0              0              0   
5          6              0              0              0              0   
6          7              0              0              0              0   
7          8              0              0              0              0   

   Categorical_G  Categorical_I  Categorical_J  Categorical_K  
0              0              0              0              0  
1              0              0              0              0  
2              0              0              0              0  
3              0              0              0              0  
4              1              0              0              0  
5              0              1              0              0  
6              0              0              1              0  
7              0              0              0              1  
   Numerical  Categorical_A  Categorical_B  Categorical_D  Categorical_F  \
0          1              1              0              0              0   
1          2              0              1              0              0   
2          3              0              0              1              0   
3          4              0              0              0              1   
4          5              0              0              0              0   
5          6              0              0              0              0   
6          7              0              0              0              0   

   Categorical_H  Categorical_J  Categorical_K  
0              0              0              0  
1              0              0              0  
2              0              0              0  
3              0              0              0  
4              1              0              0  
5              0              1              0  
6              0              0              1  

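Neither the number of columns nor the column names match: each DataFrame only gets dummy columns for the categories it actually contains.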
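
As a quick way to see exactly which columns differ, the sketch below compares the two sets of column names. The variable names `train_dummies`, `test_dummies`, `only_in_train`, and `only_in_test` are introduced here for illustration and do not appear in the original article.

```python
import pandas as pd

train = pd.DataFrame({'Categorical': ['A', 'B', 'C', 'E', 'G', 'I', 'J', 'K'],
                      'Numerical': [1, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({'Categorical': ['A', 'B', 'D', 'F', 'H', 'J', 'K'],
                     'Numerical': [1, 2, 3, 4, 5, 6, 7]})

# Encode each DataFrame on its own, as in the article
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Set differences of the column names show what each side is missing
only_in_train = sorted(set(train_dummies.columns) - set(test_dummies.columns))
only_in_test = sorted(set(test_dummies.columns) - set(train_dummies.columns))

print(train_dummies.shape, test_dummies.shape)  # (8, 9) (7, 8)
print(only_in_train)  # ['Categorical_C', 'Categorical_E', 'Categorical_G', 'Categorical_I']
print(only_in_test)   # ['Categorical_D', 'Categorical_F', 'Categorical_H']
```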

So we define a function that adds each missing column, filled entirely with 0.

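def fill_missing_columns(df_a, df_b):
    """Add columns found in only one DataFrame to the other, filled with 0 (in place)."""
    # Columns that exist in df_a but not in df_b
    columns_for_b = set(df_a.columns) - set(df_b.columns)
    for column in columns_for_b:
        df_b[column] = 0
    # Columns that exist in df_b but not in df_a
    # (the zero columns just added to df_b already exist in df_a, so they are skipped)
    columns_for_a = set(df_b.columns) - set(df_a.columns)
    for column in columns_for_a:
        df_a[column] = 0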

Run the function, then sort the columns so that both DataFrames use the same column order.

fill_missing_columns(train, test)

train.sort_index(axis=1, inplace=True)
test.sort_index(axis=1, inplace=True)

print('train')
print(train.shape)
print(train)
print('test')
print(test.shape)
print(test)
train
(8, 12)
   Categorical_A  Categorical_B  Categorical_C  Categorical_D  Categorical_E  \
0              1              0              0              0              0   
1              0              1              0              0              0   
2              0              0              1              0              0   
3              0              0              0              0              1   
4              0              0              0              0              0   
5              0              0              0              0              0   
6              0              0              0              0              0   
7              0              0              0              0              0   

   Categorical_F  Categorical_G  Categorical_H  Categorical_I  Categorical_J  \
0              0              0              0              0              0   
1              0              0              0              0              0   
2              0              0              0              0              0   
3              0              0              0              0              0   
4              0              1              0              0              0   
5              0              0              0              1              0   
6              0              0              0              0              1   
7              0              0              0              0              0   

   Categorical_K  Numerical  
0              0          1  
1              0          2  
2              0          3  
3              0          4  
4              0          5  
5              0          6  
6              0          7  
7              1          8  
test
(7, 12)
   Categorical_A  Categorical_B  Categorical_C  Categorical_D  Categorical_E  \
0              1              0              0              0              0   
1              0              1              0              0              0   
2              0              0              0              1              0   
3              0              0              0              0              0   
4              0              0              0              0              0   
5              0              0              0              0              0   
6              0              0              0              0              0   

   Categorical_F  Categorical_G  Categorical_H  Categorical_I  Categorical_J  \
0              0              0              0              0              0   
1              0              0              0              0              0   
2              0              0              0              0              0   
3              1              0              0              0              0   
4              0              0              1              0              0   
5              0              0              0              0              1   
6              0              0              0              0              0   

   Categorical_K  Numerical  
0              0          1  
1              0          2  
2              0          3  
3              0          4  
4              0          5  
5              0          6  
6              1          7   

The column names and the number of columns now match.
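
The same zero-filling can probably also be done without a helper function by using `DataFrame.align`. This is a minimal sketch of a swapped-in technique, not the method used in the article. It assumes a pandas version in which `get_dummies` accepts `dtype` (passed here so the dummies are 0/1 integers, as in the output above), and the names `train_aligned` and `test_aligned` are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({'Categorical': ['A', 'B', 'C', 'E', 'G', 'I', 'J', 'K'],
                      'Numerical': [1, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({'Categorical': ['A', 'B', 'D', 'F', 'H', 'J', 'K'],
                     'Numerical': [1, 2, 3, 4, 5, 6, 7]})

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# join='outer' keeps the union of the columns, axis=1 aligns columns only
# (rows are left alone), and fill_value=0 fills the newly created columns.
train_aligned, test_aligned = train_dummies.align(
    test_dummies, join='outer', axis=1, fill_value=0)

print(train_aligned.shape, test_aligned.shape)  # (8, 12) (7, 12)
print(list(train_aligned.columns) == list(test_aligned.columns))  # True
```

Unlike `fill_missing_columns`, this returns two new DataFrames and leaves the inputs unchanged.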

Addendum

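@nkay kindly showed me the neater approach below in a comment. Thank you. The idea is to declare the full set of categories on the column before calling get_dummies, so that a dummy column is created for every category, including those that do not occur in that DataFrame.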


import pandas as pd

train = pd.DataFrame({'Categorical':['A','B','C','E','G','I','J','K'],
                     'Numerical':[1, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({'Categorical':['A','B','D','F','H','J','K'],
                    'Numerical':[1, 2, 3, 4, 5, 6, 7]})


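# Collect every category that appears in either DataFrame
# (sorted so that the dummy columns come out in a stable order)
cat_list = sorted({*train.Categorical, *test.Categorical})
# Method 1: cast the column to a CategoricalDtype
train.Categorical = train.Categorical.astype(pd.CategoricalDtype(cat_list))
# Method 2: build a pd.Categorical directly
test.Categorical = pd.Categorical(test.Categorical, cat_list)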


train = pd.get_dummies(train)
test = pd.get_dummies(test)
print(train)
print(test)
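
Since no output is shown for this approach, here is a small self-contained check that it yields matching columns. It is a sketch under a few assumptions: the same categorical dtype is applied to both DataFrames via `astype`, the categories are sorted for a stable column order, and `dtype=int` is passed to get 0/1 integers. The names `categories` and `cat_dtype` are introduced here for illustration.

```python
import pandas as pd

train = pd.DataFrame({'Categorical': ['A', 'B', 'C', 'E', 'G', 'I', 'J', 'K'],
                      'Numerical': [1, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({'Categorical': ['A', 'B', 'D', 'F', 'H', 'J', 'K'],
                     'Numerical': [1, 2, 3, 4, 5, 6, 7]})

# Union of the categories seen in either DataFrame (11 values, A through K)
categories = sorted(set(train['Categorical']) | set(test['Categorical']))
cat_dtype = pd.CategoricalDtype(categories=categories)

train['Categorical'] = train['Categorical'].astype(cat_dtype)
test['Categorical'] = test['Categorical'].astype(cat_dtype)

# get_dummies emits one column per declared category, observed or not
train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

print(train_dummies.columns.equals(test_dummies.columns))  # True
print(train_dummies.shape, test_dummies.shape)  # (8, 12) (7, 12)
```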

References:
https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.api.types.CategoricalDtype.html
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Categorical.html
