Continued from the previous post. What other handy features are there?
Use pandas to turn the dates into timestamps.
# Establish a starting date for all applications at Home Credit
start_date = pd.Timestamp("2016-01-01")
start_date
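A start date does nothing by itself. As far as I can tell, its purpose is to turn the relative DAYS_* offsets into real dates that featuretools can treat as a time index. Below is a minimal sketch of that conversion in plain pandas. The toy `bureau` table, its values, and the column name `credit_application_date` are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Arbitrary reference date, same value as in the notebook
start_date = pd.Timestamp("2016-01-01")

# Toy stand-in for a table with relative day offsets (values invented)
bureau = pd.DataFrame({"DAYS_CREDIT": [-497, -208, -30]})

# Convert the integer offsets to Timedeltas, then shift the reference date
offsets = pd.to_timedelta(bureau["DAYS_CREDIT"], unit="D")
bureau["credit_application_date"] = start_date + offsets  # hypothetical column name

print(bureau)
```

Only the gaps between the resulting dates carry meaning, which is why the choice of reference date is arbitrary.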
If we were doing manual feature engineering, we might want to create new columns such as by subtracting DAYS_CREDIT_ENDDATE from DAYS_CREDIT to get the planned length of the loan in days, or subtracting DAYS_CREDIT_ENDDATE from DAYS_ENDDATE_FACT to find the number of days the client paid off the loan early. However, in this notebook we will not make any features by hand, but rather let featuretools develop useful features for us.
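For comparison, here is roughly what the manual version of those two features might look like in pandas. This is only a sketch: the rows are invented, the new column names are hypothetical, and the sign convention (end minus start, so that positive numbers mean "longer" or "earlier") is my own assumption.

```python
import pandas as pd

# Toy rows; all offsets are in days relative to the application date (values invented)
bureau = pd.DataFrame({
    "DAYS_CREDIT": [-500, -200],          # when the credit was taken out
    "DAYS_CREDIT_ENDDATE": [-135, 165],   # when it was scheduled to end
    "DAYS_ENDDATE_FACT": [-150, 165],     # when it actually ended
})

# Planned length of the loan in days (assumed convention: scheduled end minus start)
bureau["PLANNED_LENGTH"] = bureau["DAYS_CREDIT_ENDDATE"] - bureau["DAYS_CREDIT"]

# Days paid off early (assumed convention: scheduled end minus actual end)
bureau["DAYS_PAID_EARLY"] = bureau["DAYS_CREDIT_ENDDATE"] - bureau["DAYS_ENDDATE_FACT"]

print(bureau[["PLANNED_LENGTH", "DAYS_PAID_EARLY"]])
```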
I see. If the date-based features can also be generated automatically in one go, nothing beats that.
time_features, time_feature_names = ft.dfs(entityset = es, target_entity = 'app_train',
                                           trans_primitives = ['cum_sum', 'time_since_previous'], max_depth = 2,
                                           agg_primitives = ['trend'],
                                           features_only = False, verbose = True,
                                           chunk_size = len(app_train),
                                           ignore_entities = ['app_test'])
So does time_since_previous just take care of building these for us... or not?
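To get a feel for what the two transform primitives produce, here is a plain pandas sketch of the same ideas. It is not featuretools' implementation. The toy `loans` table, grouping by client, and reporting the gap in days are all assumptions made for illustration.

```python
import pandas as pd

# Toy loan history for two clients (values invented)
loans = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2015-01-01", "2015-01-11", "2015-02-10", "2015-03-01", "2015-03-02"]
    ),
    "amount": [100.0, 150.0, 50.0, 20.0, 30.0],
}).sort_values(["client_id", "date"])

# Idea behind time_since_previous: gap to the previous row, here per client.
# The first row of each client has no predecessor, so it becomes NaN.
gap = loans.groupby("client_id")["date"].diff()
loans["days_since_previous"] = gap.dt.total_seconds() / 86400

# Idea behind cum_sum: running total, here per client
loans["amount_cum_sum"] = loans.groupby("client_id")["amount"].cumsum()

print(loans)
```

Both columns only make sense once the rows are ordered in time, which is presumably why the timestamps had to be set up first.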
seed feature
I didn't really understand this one.
By using seed features, we can include domain specific knowledge in feature engineering automation.
In other words, domain-specific information can be added as well.
Taking the example from the page above:
In [4]: expensive_purchase = ft.Feature(es["transactions"]["amount"]) > 125
In [5]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["percent_true"],
   ...:                                       seed_features=[expensive_purchase])
   ...:
In [6]: feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]
Out[6]:
             PERCENT_TRUE(transactions.amount > 125)
customer_id
5                                           0.227848
4                                           0.220183
1                                           0.119048
3                                           0.182796
2                                           0.129032
Presumably the idea is that, from domain knowledge, a large es["transactions"]["amount"] (here, above 125) is what matters, so that condition is handed over as the seed.
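To check my reading, this is what PERCENT_TRUE(transactions.amount > 125) would amount to if computed by hand in pandas. It is a sketch on an invented `transactions` table, not the data from the documentation, so the numbers will not match the output above.

```python
import pandas as pd

# Toy transactions (values invented)
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "amount": [130.0, 20.0, 60.0, 126.0, 125.0, 200.0, 10.0, 5.0],
})

# The "seed": a boolean column that encodes the domain knowledge
transactions["expensive_purchase"] = transactions["amount"] > 125

# percent_true per customer: the mean of a boolean column is the fraction of True
percent_true = transactions.groupby("customer_id")["expensive_purchase"].mean()

print(percent_true)
```

The difference with featuretools, as I understand it, is that the seed feature is not just aggregated once but can also be stacked with other primitives.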
Conclusion
In this notebook we explored some of the advanced functionality in featuretools including:
Time Variables: allow us to track trends over time
Interesting Variables: condition new features on values of existing features
Seed Features: define new features manually that featuretools will then build on top of
Custom feature primitives: design any transformation or aggregation feature that can incorporate domain knowledge
That said... my understanding hasn't caught up yet. At the very least, it looks like this will save effort on the first brute-force pass.