So my understanding is that this tool automates feature engineering for you. Fair enough: nobody is going to hand-craft 100 or 1,000 features.
Feature engineering is one of the genuinely fresh, fun parts of machine learning for me; it feels like getting a new toy.
That said, I assume it just churns out things like arithmetic means, or polynomial terms like the PolynomialFeatures we used before. Something along those lines, right?
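As a refresher on that kind of expansion, here is a minimal PolynomialFeatures sketch on toy data (the array contents are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy features; a degree-2 expansion yields 1, x1, x2, x1^2, x1*x2, x2^2
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X_poly.shape)  # (2, 6)
```

Two input columns become six output columns, which is exactly the kind of combinatorial blow-up that makes hand-managing hundreds of features impractical.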
Combine train and test.
# Add identifying column
app_train['set'] = 'train'
app_test['set'] = 'test'
app_test["TARGET"] = np.nan
# Append the dataframes
app = app_train.append(app_test, ignore_index = True)
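Note that `DataFrame.append` was removed in pandas 2.0; the same combine step can be written with `pd.concat`. A minimal sketch with toy stand-ins for app_train / app_test (the columns here are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for app_train / app_test
app_train = pd.DataFrame({'SK_ID_CURR': [1, 2], 'TARGET': [0.0, 1.0]})
app_test = pd.DataFrame({'SK_ID_CURR': [3, 4]})

# Tag each set so they can be split apart again later
app_train['set'] = 'train'
app_test['set'] = 'test'
app_test['TARGET'] = np.nan

# pd.concat is the modern replacement for DataFrame.append
app = pd.concat([app_train, app_test], ignore_index=True)
print(app.shape)  # (4, 3)
```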
This makes me think this approach might actually be easier. I have seen all sorts of procedures for applying the same steps to train and test, and patterns like
datasets = [train, test]
for data in datasets:
always struck me as wasteful.
Relationship
# Entity set with id applications
es = ft.EntitySet(id = 'clients')
# Entities with a unique index
es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR')
es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')
es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV')
# Entities that do not have a unique index
es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance,
make_index = True, index = 'bureaubalance_index')
es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash,
make_index = True, index = 'cash_index')
es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,
make_index = True, index = 'installments_index')
es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,
make_index = True, index = 'credit_index')
Wait, does this mean that if you just feed in the relations between the dataframes, it will generate features for you automatically? That would be almost too good.
# Relationship between app and bureau
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])
# Relationship between bureau and bureau balance
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])
# Relationship between current app and previous apps
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])
# Relationships between previous apps and cash, installments, and credit
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])
I have a good feeling about this.
The way relationships are declared feels a little unusual: it is apparently ft.Relationship(parent, child).
# Add in the defined relationships
es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous,
r_previous_cash, r_previous_installments, r_previous_credit])
# Print out the EntitySet
es
Entityset: clients
Entities:
app [Rows: 2002, Columns: 123]
bureau [Rows: 1001, Columns: 17]
previous [Rows: 1001, Columns: 37]
bureau_balance [Rows: 1001, Columns: 4]
cash [Rows: 1001, Columns: 9]
installments [Rows: 1001, Columns: 9]
credit [Rows: 1001, Columns: 24]
Relationships:
bureau.SK_ID_CURR -> app.SK_ID_CURR
bureau_balance.SK_ID_BUREAU -> bureau.SK_ID_BUREAU
previous.SK_ID_CURR -> app.SK_ID_CURR
cash.SK_ID_PREV -> previous.SK_ID_PREV
installments.SK_ID_PREV -> previous.SK_ID_PREV
credit.SK_ID_PREV -> previous.SK_ID_PREV
Note that the Relationships are displayed as child -> parent.
DFS with Default Primitives
# Default primitives from featuretools
default_agg_primitives = ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]
default_trans_primitives = ["day", "year", "month", "weekday", "haversine", "numwords", "characters"]
The primitive names above did not work as-is with my local version of featuretools, so I renamed them:
# Default primitives from featuretools
default_agg_primitives = ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "num_unique", "mode"]
default_trans_primitives = ["day", "year", "month", "weekday", "haversine", "num_words", "num_characters"]
Here we will use the default aggregation and transformation primitives, a max depth of 2, and calculate primitives for the app entity. Because this process is computationally expensive, we can run the function using features_only = True to return only a list of the features and not calculate the features themselves.
That made sense, but when I set max_depth=3 and checked Total Features, the number of usable features was unchanged from max_depth=2: 2221 Total Features.
Is it simply because the current tables' relations only go two levels deep?
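To make the depth idea concrete, here is a hand-rolled pandas sketch (toy data, made-up values) of what a depth-2 feature like MEAN(previous.MEAN(credit.AMT)) means: aggregate credit up to previous, then aggregate that result up to the current application.

```python
import pandas as pd

# Toy tables mimicking the app <- previous <- credit hierarchy
previous = pd.DataFrame({'SK_ID_PREV': [10, 11, 12],
                         'SK_ID_CURR': [1, 1, 2]})
credit = pd.DataFrame({'SK_ID_PREV': [10, 10, 11, 12],
                       'AMT': [100.0, 200.0, 300.0, 400.0]})

# Depth 1: MEAN(credit.AMT) per previous application
per_prev = (credit.groupby('SK_ID_PREV')['AMT'].mean()
            .rename('MEAN(credit.AMT)').reset_index())
prev_feat = previous.merge(per_prev, on='SK_ID_PREV')

# Depth 2: MEAN(previous.MEAN(credit.AMT)) per current application
per_app = prev_feat.groupby('SK_ID_CURR')['MEAN(credit.AMT)'].mean()
print(per_app.to_dict())  # {1: 225.0, 2: 400.0}
```

With only two hops from app down to the leaf tables, stacking a third aggregation has nothing new to aggregate, which would explain the unchanged count.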
Collinear Features
correlations['MEAN(credit.AMT_PAYMENT_TOTAL_CURRENT)'].sort_values(ascending=False).head()
Variable
MEAN(credit.AMT_PAYMENT_TOTAL_CURRENT) 1.000000
MEAN(previous_app.MEAN(credit.AMT_PAYMENT_TOTAL_CURRENT)) 0.999382
MIN(previous_app.MEAN(credit.AMT_PAYMENT_TOTAL_CURRENT)) 0.999024
MAX(previous_app.MEAN(credit.AMT_PAYMENT_TOTAL_CURRENT)) 0.995957
SUM(previous_app.MEAN(credit.AMT_PAYMENT_TOTAL_CURRENT)) 0.995484
Name: MEAN(credit.AMT_PAYMENT_TOTAL_CURRENT), dtype: float64
Pairs with a correlation of 0.99 or more are practically identical, so they are redundant as training data. That much makes sense.
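A common recipe for dropping one feature from each highly correlated pair is an upper-triangle mask over the correlation matrix, sketched here on toy data (column names and the 0.99 threshold are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({'a': a,
                   'b': a + rng.normal(scale=0.01, size=100),  # near-duplicate of a
                   'c': rng.normal(size=100)})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
print(to_drop)  # ['b']
```

Only one member of the (a, b) pair is flagged, so dropping `to_drop` keeps the information while removing the redundancy.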
Feature Importances
The suggestion: features with zero importance are presumably useless, so just drop them.
print('There are %d features with 0 importance' % sum(fi['importance'] == 0.0))
There are 237 features with 0 importance
Hm.
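The drop step itself is a one-liner once you have the importance table; a sketch assuming fi has 'feature' and 'importance' columns in the kernel's shape (the values below are made up):

```python
import pandas as pd

# Hypothetical importance table
fi = pd.DataFrame({'feature': ['f1', 'f2', 'f3', 'f4'],
                   'importance': [0.6, 0.0, 0.4, 0.0]})

zero_features = fi.loc[fi['importance'] == 0.0, 'feature'].tolist()
print('There are %d features with 0 importance' % len(zero_features))
# train = train.drop(columns=zero_features) would then remove them
```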
from featuretools import selection
# Remove features with only one unique value
feature_matrix2 = selection.remove_low_information_features(feature_matrix)
print('Removed %d features' % (feature_matrix.shape[1]- feature_matrix2.shape[1]))
featuretools even ships with a handy utility, selection.remove_low_information_features, that quickly drops features with only one unique value or that are entirely NA.
Removed 371 features
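As a rough pandas approximation of what remove_low_information_features does (the actual featuretools implementation may differ in details):

```python
import numpy as np
import pandas as pd

def drop_low_information(df):
    # Keep columns that are not entirely NA and have more than one unique value
    keep = [c for c in df.columns
            if df[c].notna().any() and df[c].nunique() > 1]
    return df[keep]

df = pd.DataFrame({'ok': [1, 2, 3],
                   'constant': [7, 7, 7],
                   'all_na': [np.nan, np.nan, np.nan]})
print(drop_low_information(df).columns.tolist())  # ['ok']
```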
Align Train and Test Sets
# Separate out the train and test sets
train = feature_matrix2[feature_matrix2['set'] == 'train']
test = feature_matrix2[feature_matrix2['set'] == 'test']
# One hot encoding
train = pd.get_dummies(train)
test = pd.get_dummies(test)
# Align dataframes on the columns
train, test = train.align(test, join = 'inner', axis = 1)
test = test.drop(columns = ['TARGET'])
print('Final Training Shape: ', train.shape)
print('Final Testing Shape: ', test.shape)
Final Training Shape: (1001, 2051)
Final Testing Shape: (1001, 2050)
About this part: couldn't you just run get_dummies on feature_matrix2 as-is, without splitting into train and test first? Maybe the concern is that information which train.align(test) would drop could otherwise survive.
pd.get_dummies(feature_matrix2).shape
(2002, 2162)
The column count does indeed go up, though...
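One way to see the difference is a category that appears only in the test rows: encoding the combined frame keeps its dummy column, while encoding the sets separately and aligning with join='inner' drops it. A toy sketch (made-up data):

```python
import pandas as pd

combined = pd.DataFrame({'set': ['train', 'train', 'test', 'test'],
                         'cat': ['A', 'A', 'A', 'B']})  # 'B' occurs only in test

# Encoding the combined frame: cat_B survives
all_dummies = pd.get_dummies(combined[['cat']])
print(sorted(all_dummies.columns))  # ['cat_A', 'cat_B']

# Encoding separately and aligning: cat_B is dropped
tr = pd.get_dummies(combined.loc[combined['set'] == 'train', ['cat']])
te = pd.get_dummies(combined.loc[combined['set'] == 'test', ['cat']])
tr, te = tr.align(te, join='inner', axis=1)
print(tr.columns.tolist())  # ['cat_A']
```

The cat_B column encodes a value the model could never learn from train anyway, so the split-then-align route arguably removes dead weight.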
Conclusions
In short: just using featuretools' basic functionality gets you roughly the same benefit as manual feature engineering in much less time. Seems handy.
English vocabulary
- tedious: long-winded and boring; tiresome
- conceive: to imagine, think of, devise; to be conceived or created
- socioeconomic: relating to both social and economic factors
- supervised classification: classification trained on labeled data
- feasible: practicable, workable
- infer: to deduce, to conclude from evidence
- fall into two categories: to divide into two categories
- discrepancy: an inconsistency or mismatch (between statements, calculations, etc.)
- rigorous: strict, exact, precise, thorough
- interpretability: how well a model's behavior can be explained