Getting readable column names after encoding with OneHotEncoder (category_encoders)


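I am studying machine learning. After splitting the data into X_train and X_valid with sklearn's train_test_split, I am now at the next stage: categorical encoding.
I run OneHotEncoder from category_encoders. The column names it produces are not immediately understandable.
They take the form [xxxx_n], where n is an integer giving the order in which each value first appears in the data. I looked for a built-in way to get more readable column names but did not find one, so I wrote my own code.
I hope it serves as a reference. Suggestions are also welcome.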

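Below I walk through a concrete example using the data from df = sns.load_dataset('titanic').
〇 Subroutine: def my_ce_ohe(X,y,category_selected):
 The column-renaming routine is in the middle of the function, under the column-renaming comment.
 (It uses .category_mapping and .feature_names.
 I found them by typing [.][tab] and checking the listed members one by one.)

〇 Main routine: defines the data (with a NaN present in the training data) and
 calls the encoder subroutine.
 To verify the renaming, it prints the results so they can be compared.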
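
The renaming idea described above can be sketched with pandas alone. This is only an illustration under stated assumptions: `categories_by_col`, `feature_names`, and `build_rename_map` are hypothetical stand-ins, not the library's API, and the sketch assumes the encoder numbers each category by its position in order of first appearance.

```python
import pandas as pd

# Hypothetical stand-ins (assumed shapes, not category_encoders' API):
# categories per column in order of first appearance, and the encoded column names.
categories_by_col = {
    "sex": ["female", "male"],
    "embarked": ["C", float("nan"), "S", "Q"],
}
feature_names = ["pclass", "sex_1", "sex_2", "embarked_1", "embarked_2",
                 "embarked_3", "embarked_4", "sibsp", "parch"]


def build_rename_map(categories_by_col, feature_names):
    """Map encoder-style names (col_n) to readable names (col_category)."""
    rename_map = {}
    for col, cats in categories_by_col.items():
        for position, cat in enumerate(cats, start=1):
            old = f"{col}_{position}"
            if old in feature_names:
                # NaN has no string form worth showing, so label it explicitly.
                label = "nan" if pd.isnull(cat) else str(cat)
                rename_map[old] = f"{col}_{label}"
    return rename_map


rename_map = build_rename_map(categories_by_col, feature_names)
# Columns that were not encoded keep their original names.
new_names = [rename_map.get(name, name) for name in feature_names]
print(new_names)
```

Building a dictionary first keeps the lookup separate from the renaming itself, so the same map can be reused on any frame that has the encoded columns. Separately, category_encoders documents a `use_cat_names` option on `OneHotEncoder` that may make manual renaming unnecessary; how it labels NaN is worth checking against the installed version.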

def my_ce_ohe(X,y,category_selected):
    import pandas as pd
    import category_encoders as ce
    from IPython.display import display   # display() is built in only inside notebooks
    from sklearn.model_selection import train_test_split

    # Split the data into train and valid
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

    print(f"X({len(X)}) = X_train({len(X_train)}) + X_valid({len(X_valid)})")
    print(f'\n-----------X_train{X_train.shape}-------------')
    display(X_train.head(5))

    print(f'\n-----------X_valid{X_valid.shape}-------------')
    display(X_valid.head(5))

    # Define the features to convert and run the encoding
    ce_ohe  = ce.OneHotEncoder(cols=category_selected, handle_unknown='ignore')
    ce_encoded_T = ce_ohe.fit_transform(X_train)
    ce_encoded_V = ce_ohe.transform(X_valid).fillna(0).astype(int)

    # .fillna(0).astype(int): values that appear only in X_valid come out as NaN.
    # This step turns them into 0. Whether it works in general still needs to be verified.

    print(f'\n-----------category_encoders OneHotencoding for X_train{ce_encoded_T.shape}-------------')
    display(ce_encoded_T.head(5))
    print(f'\n-----------category_encoders OneHotencoding for X_valid{ce_encoded_V.shape}-------------')
    display(ce_encoded_V.head(5))

    # Column-renaming routine
    a = pd.DataFrame(data=ce_ohe.category_mapping)   # ce_ohe.category_mapping is a list
    b = pd.DataFrame(data=ce_ohe.feature_names)      # ce_ohe.feature_names is a list
    for i in range(len(a)):
        col = a.iloc[i, 0]                           # sex, embarked
        categories = a.iloc[i, 1].index              # e.g. ['female', 'male', nan]
        for j, cat in enumerate(categories):
            yy = f'{col}_{j + 1}'                    # name generated by the encoder
            zz = f'{col}_nan' if pd.isnull(cat) else f'{col}_{cat}'   # readable name
            for k in range(len(b)):
                xx = b.iloc[k, 0]    # pclass sex_1 sex_2 embarked_1 embarked_2 ...
                if xx == yy:
                    b.iloc[k, 0] = zz   # single-step indexing so the write reaches b

    # Mapping table
    myMap = pd.concat([pd.DataFrame(data=ce_ohe.feature_names), b], axis=1)
    myMap.columns = ['before', 'after']

    # Build the data used for verification
    print('\n\n\n\n****************************************************************\n\n')
    org = X_train[category_selected]
    test_before = pd.concat([org, ce_encoded_T], axis=1)
    print('column label as is -------------------------')
    display(test_before)

    ce_encoded_T.columns = b[0]           # overwrite the column names of ce_encoded_T
    test_after = pd.concat([org, ce_encoded_T], axis=1)
    print('after column label modified -------------------------')
    display(test_after)
    display(myMap)

    return
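
On the open question about `.fillna(0).astype(int)`: applied to the whole frame, it would also zero out genuine missing values in columns that were never encoded. A minimal sketch of one way to limit it to the indicator columns follows. The frame, the `age` column, and the `embarked_` prefix test are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded validation frame: the second row holds a category
# the encoder never saw, so its indicator columns came back as NaN.
encoded_valid = pd.DataFrame({
    "pclass": [3, 1, 2],
    "age": [22.0, np.nan, 35.0],        # a genuine missing value, not encoder output
    "embarked_1": [1.0, np.nan, 0.0],
    "embarked_2": [0.0, np.nan, 1.0],
})

# Only touch the one-hot indicator columns.
indicator_cols = [c for c in encoded_valid.columns if c.startswith("embarked_")]

cleaned = encoded_valid.copy()
for c in indicator_cols:
    cleaned[c] = cleaned[c].fillna(0).astype(int)

print(cleaned)
```

In this sketch the NaN in `age` survives, so it can still be handled by whatever imputation step is used for numeric features.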

Main routine

import pandas as pd
import numpy as np
import seaborn as sns

# Load the dataset
df = sns.load_dataset('titanic')
df = df[['survived','pclass','sex','embarked','sibsp','parch']]  # restrict to the columns used

df.loc[439,'embarked'] = np.nan    # add a NaN so the result is easy to verify

X = df.drop('survived',axis=1)
y = df['survived']

category_selected = ['sex','embarked']
my_ce_ohe(X,y,category_selected)

Printed output after running the code

X(891) = X_train(712) + X_valid(179)

-----------X_train(712, 5)-------------
     pclass     sex  embarked  sibsp  parch
140       3  female         C      0      2
439       2    male       NaN      0      0
817       2    male         C      1      1
378       3    male         C      0      0
491       3    male         S      0      0

-----------X_valid(179, 5)-------------
     pclass     sex  embarked  sibsp  parch
495       3    male         C      0      0
648       3    male         S      0      0
278       3    male         Q      4      1
31        1  female         C      1      0
255       3  female         C      0      2

-----------category_encoders OneHotencoding for X_train(712, 9)-------------
     pclass  sex_1  sex_2  embarked_1  embarked_2  embarked_3  embarked_4  sibsp  parch
140       3      1      0           1           0           0           0      0      2
439       2      0      1           0           1           0           0      0      0
817       2      0      1           1           0           0           0      1      1
378       3      0      1           1           0           0           0      0      0
491       3      0      1           0           0           1           0      0      0

-----------category_encoders OneHotencoding for X_valid(179, 9)-------------
     pclass  sex_1  sex_2  embarked_1  embarked_2  embarked_3  embarked_4  sibsp  parch
495       3      0      1           1           0           0           0      0      0
648       3      0      1           0           0           1           0      0      0
278       3      0      1           0           0           0           1      4      1
31        1      1      0           1           0           0           0      1      0
255       3      1      0           1           0           0           0      0      2




**************************************************************************************


column label as is -------------------------
        sex  embarked  pclass  sex_1  sex_2  embarked_1  embarked_2  embarked_3  embarked_4  sibsp  parch
140  female         C       3      1      0           1           0           0           0      0      2
439    male       NaN       2      0      1           0           1           0           0      0      0
817    male         C       2      0      1           1           0           0           0      1      1
378    male         C       3      0      1           1           0           0           0      0      0
491    male         S       3      0      1           0           0           1           0      0      0
...     ...       ...     ...    ...    ...         ...         ...         ...         ...    ...    ...
835  female         C       1      1      0           1           0           0           0      1      1
192  female         S       3      1      0           0           0           1           0      1      0
629    male         Q       3      0      1           0           0           0           1      0      0
559  female         S       3      1      0           0           0           1           0      1      0
684    male         S       2      0      1           0           0           1           0      1      1
712 rows × 11 columns

after column label modified -------------------------
        sex  embarked  pclass  sex_female  sex_male  embarked_C  embarked_nan  embarked_S  embarked_Q  sibsp  parch
140  female         C       3           1         0           1             0           0           0      0      2
439    male       NaN       2           0         1           0             1           0           0      0      0
817    male         C       2           0         1           1             0           0           0      1      1
378    male         C       3           0         1           1             0           0           0      0      0
491    male         S       3           0         1           0             0           1           0      0      0
...     ...       ...     ...         ...       ...         ...           ...         ...         ...    ...    ...
835  female         C       1           1         0           1             0           0           0      1      1
192  female         S       3           1         0           0             0           1           0      1      0
629    male         Q       3           0         1           0             0           0           1      0      0
559  female         S       3           1         0           0             0           1           0      1      0
684    male         S       2           0         1           0             0           1           0      1      1
712 rows × 11 columns

   before      after
0  pclass      pclass
1  sex_1       sex_female
2  sex_2       sex_male
3  embarked_1  embarked_C
4  embarked_2  embarked_nan
5  embarked_3  embarked_S
6  embarked_4  embarked_Q
7  sibsp       sibsp
8  parch       parch
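
A before/after table like this one can also be turned into a dictionary and applied to both the training and validation frames with `DataFrame.rename`, so the two stay consistent. The sketch below uses small hypothetical frames (`my_map`, `encoded_train`, `encoded_valid`) in place of myMap, ce_encoded_T, and ce_encoded_V.

```python
import pandas as pd

# Hypothetical mapping table with the same 'before'/'after' layout as myMap.
my_map = pd.DataFrame({
    "before": ["pclass", "sex_1", "sex_2"],
    "after": ["pclass", "sex_female", "sex_male"],
})
rename_dict = dict(zip(my_map["before"], my_map["after"]))

# Hypothetical encoded frames standing in for ce_encoded_T and ce_encoded_V.
encoded_train = pd.DataFrame({"pclass": [3, 2], "sex_1": [1, 0], "sex_2": [0, 1]})
encoded_valid = pd.DataFrame({"pclass": [1], "sex_1": [1], "sex_2": [0]})

# rename() returns new frames and leaves the originals untouched.
renamed_train = encoded_train.rename(columns=rename_dict)
renamed_valid = encoded_valid.rename(columns=rename_dict)

print(list(renamed_train.columns))
print(list(renamed_valid.columns))
```

Because `rename` matches by column name instead of position, it does not depend on the column order of the frame it is applied to.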