
Getting easy-to-read column names after transforming with OneHotEncoder (category_encoders)

Posted at 2022-02-18

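I am studying machine learning. I have split my data into X_train and X_valid with sklearn's train_test_split and am now at the next stage, categorical encoding.
Here I run OneHotEncoder from category_encoders. The column names after the transform are not something you can understand at a glance.
They take the form [xxxx_n], where n is an integer giving the order in which each value first appears in the data. I looked for a built-in way to get more readable column names but did not find one, so I wrote my own code.
I hope it is useful as a reference. Suggestions are also welcome.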

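Below I walk through the details using the data from df = sns.load_dataset('titanic').
〇 Subroutine: def my_ce_ohe(X,y,category_selected):
  The column-renaming routine is in the middle of the function, under the column-renaming comment.
  (It uses .category_mapping and .feature_names.
  I found them by typing [.] and [Tab] and checking the members that came up one by one.)

〇 Main routine: defines the data (with a NaN present in the training data) and
  calls the encoder subroutine.
  To verify the renamed columns, it prints the results so they can be compared.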
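The idea behind the renaming routine can be sketched without the encoder itself. The example below uses a hand-built list shaped like the one the article reads from `.category_mapping` (one dict per encoded column, with a Series whose index holds the category values). The helper name `build_rename_map` is hypothetical, and it assumes the Series values are the 1-based integers that appear in the generated column names.

```python
import numpy as np
import pandas as pd

def build_rename_map(category_mapping):
    """Map generated names such as 'sex_1' to readable ones such as 'sex_female'."""
    rename = {}
    for entry in category_mapping:
        col = entry["col"]
        # entry["mapping"] is a Series: index = category value, value = integer code
        for category, code in entry["mapping"].items():
            if code < 1:
                continue  # assumption: non-positive codes are placeholders without a column
            label = "nan" if pd.isnull(category) else str(category)
            rename[f"{col}_{code}"] = f"{col}_{label}"
    return rename

# Hand-built stand-in for the encoder's mapping (not produced by the library here)
category_mapping = [
    {"col": "sex", "mapping": pd.Series([1, 2], index=["female", "male"])},
    {"col": "embarked", "mapping": pd.Series([1, 2, 3, 4], index=["C", np.nan, "S", "Q"])},
]

encoded = pd.DataFrame(
    [[3, 1, 0, 1, 0, 0, 0], [2, 0, 1, 0, 1, 0, 0]],
    columns=["pclass", "sex_1", "sex_2", "embarked_1", "embarked_2",
             "embarked_3", "embarked_4"],
)

rename = build_rename_map(category_mapping)
renamed = encoded.rename(columns=rename)
print(list(renamed.columns))
# ['pclass', 'sex_female', 'sex_male', 'embarked_C', 'embarked_nan', 'embarked_S', 'embarked_Q']
```

Building a dict and passing it to `DataFrame.rename` leaves columns that are not in the dict, such as pclass, untouched.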

def my_ce_ohe(X,y,category_selected):
    import pandas as pd
    import category_encoders as ce
    from IPython.display import display   # display() is used below; needed outside a notebook
    from sklearn.model_selection import train_test_split

    # Split the data into train and valid
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)
    
    print(f"X({len(X)}) = X_train({len(X_train)}) + X_valid({len(X_valid)})")
    print(f'\n-----------X_train{X_train.shape}-------------')
    display(X_train.head(5))

    print(f'\n-----------X_valid{X_valid.shape}-------------')
    display(X_valid.head(5))
    
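    # Define the features to convert and run the encoding.
    # Note: recent category_encoders documentation lists 'error', 'return_nan', 'value'
    # and 'indicator' for handle_unknown; 'ignore' is kept here as in the original run.
    ce_ohe  = ce.OneHotEncoder(cols=category_selected, handle_unknown='ignore')
    ce_encoded_T = ce_ohe.fit_transform(X_train)
    ce_encoded_V = ce_ohe.transform(X_valid).fillna(0).astype(int)

    # .fillna(0).astype(int): values that appear only in X_valid come out as NaN,
    # and this step sets them to 0. Whether this generalizes still needs to be verified.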

    print(f'\n-----------category_encoders OneHotencoding for X_train{ce_encoded_T.shape}-------------')
    display(ce_encoded_T.head(5))
    print(f'\n-----------category_encoders OneHotencoding for X_valid{ce_encoded_V.shape}-------------')
    display(ce_encoded_V.head(5))

    # Column-renaming routine
    a=pd.DataFrame(data=ce_ohe.category_mapping)   # list of dicts: one row per encoded column
    b=pd.DataFrame(data=ce_ohe.feature_names)      # list of generated names (same as ce_encoded_T.columns)
    for i in range(len(a)):
        ref=a.iloc[i,0]                            # source column name: sex, embarked
        categories=a.iloc[i,1].index               # e.g. ['female', 'male', nan]
        for j, category in enumerate(categories):
            yy=f'{ref}_{j+1}'                      # generated name, e.g. sex_1
            zz=f'{ref}_nan' if pd.isnull(category) else f'{ref}_{category}'
            # .loc assignment avoids the chained b.iloc[k][0] = zz, which may not write back
            b.loc[b[0] == yy, 0] = zz

    # Mapping table
    myMap=pd.concat([pd.DataFrame(data=ce_ohe.feature_names),b],axis=1)
    myMap.columns=['before','after']

    # Build the data used for verification
    print('\n\n\n\n****************************************************************\n\n')
    org=X_train[category_selected]
    test_before=pd.concat([org,ce_encoded_T],axis=1)
    print('column label as is -------------------------')
    display(test_before)
    
    ce_encoded_T.columns=b[0].tolist()  # overwrite the column names of ce_encoded_T
    test_after=pd.concat([org,ce_encoded_T],axis=1)
    print('after column label modified -------------------------')
    display(test_after)
    display(myMap)
    
    return
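The `.fillna(0).astype(int)` step in the function above deals with categories that appear only in X_valid. As a small illustration of the same concern, the sketch below swaps in plain pandas (`pd.get_dummies` plus `reindex`) instead of category_encoders: the validation frame is aligned to the columns learned from the training frame, and anything unseen ends up as all zeros. The two tiny frames are made up for the example.

```python
import pandas as pd

train = pd.DataFrame({"embarked": ["C", "S", "C"]})
valid = pd.DataFrame({"embarked": ["S", "Q"]})  # 'Q' never appears in train

train_encoded = pd.get_dummies(train, columns=["embarked"], dtype=int)
valid_encoded = pd.get_dummies(valid, columns=["embarked"], dtype=int)

# Keep only the columns learned from train; columns missing in valid are filled with 0,
# and the column for the unseen category 'Q' is dropped.
valid_aligned = valid_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(valid_aligned)
```

This is not what the encoder does internally; it only shows the outcome that the fillna step is aiming for, a row of zeros for a category the training data never contained.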

Main routine

import pandas as pd
import numpy as np
import seaborn as sns

# Load the dataset
df = sns.load_dataset('titanic')
df=df[['survived','pclass','sex','embarked','sibsp','parch']] # restrict the data to these columns

df.loc[439,'embarked']=np.nan    # add a NaN to make verification easier
   
X=df.drop('survived',axis=1)
y=df['survived']

category_selected=['sex','embarked']
my_ce_ohe(X,y,category_selected)

Printed output after running the code

X(891) = X_train(712) + X_valid(179)

-----------X_train(712, 5)-------------
     pclass     sex  embarked  sibsp  parch
140       3  female         C      0      2
439       2    male       NaN      0      0
817       2    male         C      1      1
378       3    male         C      0      0
491       3    male         S      0      0

-----------X_valid(179, 5)-------------
     pclass     sex  embarked  sibsp  parch
495       3    male         C      0      0
648       3    male         S      0      0
278       3    male         Q      4      1
 31       1  female         C      1      0
255       3  female         C      0      2

-----------category_encoders OneHotencoding for X_train(712, 9)-------------
     pclass  sex_1  sex_2  embarked_1  embarked_2  embarked_3  embarked_4  sibsp  parch
140       3      1      0           1           0           0           0      0      2
439       2      0      1           0           1           0           0      0      0
817       2      0      1           1           0           0           0      1      1
378       3      0      1           1           0           0           0      0      0
491       3      0      1           0           0           1           0      0      0

-----------category_encoders OneHotencoding for X_valid(179, 9)-------------
     pclass  sex_1  sex_2  embarked_1  embarked_2  embarked_3  embarked_4  sibsp  parch
495       3      0      1           1           0           0           0      0      0
648       3      0      1           0           0           1           0      0      0
278       3      0      1           0           0           0           1      4      1
 31       1      1      0           1           0           0           0      1      0
255       3      1      0           1           0           0           0      0      2




**************************************************************************************


column label as is -------------------------
        sex  embarked  pclass  sex_1  sex_2  embarked_1  embarked_2  embarked_3  embarked_4  sibsp  parch
140  female         C       3      1      0           1           0           0           0      0      2
439    male       NaN       2      0      1           0           1           0           0      0      0
817    male         C       2      0      1           1           0           0           0      1      1
378    male         C       3      0      1           1           0           0           0      0      0
491    male         S       3      0      1           0           0           1           0      0      0
...     ...       ...     ...    ...    ...         ...         ...         ...         ...    ...    ...
835  female         C       1      1      0           1           0           0           0      1      1
192  female         S       3      1      0           0           0           1           0      1      0
629    male         Q       3      0      1           0           0           0           1      0      0
559  female         S       3      1      0           0           0           1           0      1      0
684    male         S       2      0      1           0           0           1           0      1      1
712 rows × 11 columns

after column label modified -------------------------
        sex  embarked  pclass  sex_female  sex_male  embarked_C  embarked_nan  embarked_S  embarked_Q  sibsp  parch
140  female         C       3           1         0           1             0           0           0      0      2
439    male       NaN       2           0         1           0             1           0           0      0      0
817    male         C       2           0         1           1             0           0           0      1      1
378    male         C       3           0         1           1             0           0           0      0      0
491    male         S       3           0         1           0             0           1           0      0      0
...     ...       ...     ...         ...       ...         ...           ...         ...         ...    ...    ...
835  female         C       1           1         0           1             0           0           0      1      1
192  female         S       3           1         0           0             0           1           0      1      0
629    male         Q       3           0         1           0             0           0           1      0      0
559  female         S       3           1         0           0             0           1           0      1      0
684    male         S       2           0         1           0             0           1           0      1      1
712 rows × 11 columns

       before         after
0      pclass        pclass
1       sex_1    sex_female
2       sex_2      sex_male
3  embarked_1    embarked_C
4  embarked_2  embarked_nan
5  embarked_3    embarked_S
6  embarked_4    embarked_Q
7       sibsp         sibsp
8       parch         parch
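The printed tables above are compared by eye. As a complement, the check can also be done in code. The sketch below is one possible way, with a hypothetical helper `check_renamed_columns`: for each row it confirms that exactly one indicator column is set and that its name matches the original value. The four sample rows are copied from the output above (indices 140, 439, 491, 629).

```python
import numpy as np
import pandas as pd

def check_renamed_columns(original, encoded, col):
    """True if, in every row, the single 1 among col's indicator columns
    sits in the column named after that row's original value."""
    labels = original[col].map(lambda v: "nan" if pd.isnull(v) else str(v))
    indicators = encoded[[c for c in encoded.columns if c.startswith(f"{col}_")]]
    one_hot = (indicators.sum(axis=1) == 1).all()
    names_match = (indicators.idxmax(axis=1) == f"{col}_" + labels).all()
    return bool(one_hot and names_match)

original = pd.DataFrame(
    {"sex": ["female", "male", "male", "male"], "embarked": ["C", np.nan, "S", "Q"]},
    index=[140, 439, 491, 629],
)
encoded = pd.DataFrame(
    [[1, 0, 1, 0, 0, 0],
     [0, 1, 0, 1, 0, 0],
     [0, 1, 0, 0, 1, 0],
     [0, 1, 0, 0, 0, 1]],
    columns=["sex_female", "sex_male", "embarked_C", "embarked_nan",
             "embarked_S", "embarked_Q"],
    index=original.index,
)

print(check_renamed_columns(original, encoded, "sex"))       # True
print(check_renamed_columns(original, encoded, "embarked"))  # True
```

In the article's function, `original` would correspond to X_train[category_selected] and `encoded` to ce_encoded_T after its columns have been renamed.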
