

Kaggle: Tuning Automated Feature Engineering (Exploratory)


A continuation of the previous post. Let's see what handy features featuretools has to offer.

First, convert the dates to timestamps with pandas.

import pandas as pd

# Establish a starting date for all applications at Home Credit
start_date = pd.Timestamp("2016-01-01")
start_date
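The date columns in this dataset (for example DAYS_CREDIT) are day offsets relative to the application, so the arbitrary start date is what turns them into real timestamps. Here is a minimal sketch of that conversion, assuming a toy `bureau` frame with made-up values; the output column name `credit_application_date` is only illustrative.

```python
import pandas as pd

# Arbitrary anchor date, same value as in the notebook
start_date = pd.Timestamp("2016-01-01")

# Toy data (made-up values): offsets in days relative to the application
bureau = pd.DataFrame({"DAYS_CREDIT": [-1, -31, 0]})

# Day offset -> timedelta -> absolute timestamp
bureau["credit_application_date"] = start_date + pd.to_timedelta(
    bureau["DAYS_CREDIT"], unit="D"
)
print(bureau)
```

Once the column holds real timestamps, it can be handed to featuretools as a time column instead of a plain integer.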

If we were doing manual feature engineering, we might want to create new columns such as by subtracting DAYS_CREDIT_ENDDATE from DAYS_CREDIT to get the planned length of the loan in days, or subtracting DAYS_CREDIT_ENDDATE from DAYS_ENDDATE_FACT to find the number of days the client paid off the loan early. However, in this notebook we will not make any features by hand, but rather let featuretools develop useful features for us.
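For comparison, the manual version described in the quote would look roughly like the sketch below. The data is made up, the new column names are hypothetical, and the operand order is my own choice so that the results come out positive (the opposite order just flips the sign).

```python
import pandas as pd

# Toy data (made-up values); every column is an offset in days
bureau = pd.DataFrame({
    "DAYS_CREDIT": [-500, -300],
    "DAYS_CREDIT_ENDDATE": [-100, 65],
    "DAYS_ENDDATE_FACT": [-130, 65],
})

# Planned length of the loan in days: planned end minus start
bureau["planned_length_days"] = (
    bureau["DAYS_CREDIT_ENDDATE"] - bureau["DAYS_CREDIT"]
)

# Days paid off early: planned end minus actual end (positive means early)
bureau["days_paid_early"] = (
    bureau["DAYS_CREDIT_ENDDATE"] - bureau["DAYS_ENDDATE_FACT"]
)
print(bureau)
```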

I see. If date-based features can also be generated quickly and automatically, nothing beats that.

time_features, time_feature_names = ft.dfs(entityset = es, target_entity = 'app_train', 
                                           trans_primitives = ['cum_sum', 'time_since_previous'], max_depth = 2,
                                           agg_primitives = ['trend'] ,
                                           features_only = False, verbose = True,
                                           chunk_size = len(app_train),
                                           ignore_entities = ['app_test'])


So time_since_previous takes care of building these for us... or does it?
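To get a feel for what these three primitives compute, here is a rough pandas/NumPy analogue on a toy table. This is not featuretools' implementation. It only assumes that time_since_previous is the gap to the previous row, cum_sum is a running total, and trend is the slope of a fitted line over time. The units featuretools uses may differ from the ones here.

```python
import numpy as np
import pandas as pd

# Toy loan records for a single client (made-up values)
loans = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2016-01-11", "2016-01-31"]),
    "amount": [100.0, 200.0, 400.0],
}).sort_values("date")

# Analogue of time_since_previous: gap to the previous row, in seconds
loans["time_since_previous"] = loans["date"].diff().dt.total_seconds()

# Analogue of cum_sum: running total of a numeric column
loans["cum_sum_amount"] = loans["amount"].cumsum()

# Analogue of trend: slope of a least-squares line of amount over time,
# expressed here per day
days = (loans["date"] - loans["date"].min()).dt.days
slope_per_day = np.polyfit(days, loans["amount"], deg=1)[0]

print(loans)
print(slope_per_day)
```

The first row has no previous record, so its gap is NaN. Any model fed these features has to cope with that missing value.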

seed feature

I didn't really understand this one.

By using seed features, we can include domain specific knowledge in feature engineering automation.

In other words, domain-specific information can be added as well.

Taking the example from the page above:

In [4]: expensive_purchase = ft.Feature(es["transactions"]["amount"]) > 125

In [5]: feature_matrix, feature_defs = ft.dfs(entityset=es,
   ...:                                       target_entity="customers",
   ...:                                       agg_primitives=["percent_true"],
   ...:                                       seed_features=[expensive_purchase])
   ...: 

In [6]: feature_matrix[['PERCENT_TRUE(transactions.amount > 125)']]
Out[6]: 
             PERCENT_TRUE(transactions.amount > 125)
customer_id                                         
5                                           0.227848
4                                           0.220183
1                                           0.119048
3                                           0.182796
2                                           0.129032

I suppose the idea is that a large es["transactions"]["amount"] is what matters from a domain-knowledge standpoint, so that condition is supplied as the seed.
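Writing the same calculation by hand in pandas helped me here. The sketch below uses made-up transactions and assumes PERCENT_TRUE is simply the fraction of True values per customer, which matches the 0-to-1 values in the output above. The seed is the boolean flag, and the aggregation is built on top of it.

```python
import pandas as pd

# Toy transactions (made-up values)
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "amount": [50.0, 130.0, 20.0, 200.0, 126.0, 10.0, 5.0, 3.0],
})

# The "seed": a hand-written boolean feature encoding domain knowledge
transactions["expensive_purchase"] = transactions["amount"] > 125

# Analogue of PERCENT_TRUE: fraction of True values per customer
percent_expensive = (
    transactions.groupby("customer_id")["expensive_purchase"].mean()
)
print(percent_expensive)
```

With featuretools you only define the flag. The library then stacks its aggregation primitives on top of it.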

Conclusion

In this notebook we explored some of the advanced functionality in featuretools including:

Time Variables: allow us to track trends over time
Interesting Variables: condition new features on values of existing features
Seed Features: define new features manually that featuretools will then build on top of
Custom feature primitives: design any transformation or aggregation feature that can incorporate domain knowledge

That said, my understanding hasn't fully caught up yet. At the very least, this looks like it will save effort when cranking out a first pass.
