
データサイエンスに関して By データミックスコミュニティ Advent Calendar 2019 (an Advent calendar on data science, run by the Datamix community)

@daigomiyoshi (大悟三好), 株式会社データミックス

Applying featuretools, a library for automated feature engineering, to a single table

Posted and last updated on 2019-12-20

Introduction

This article is the Day 20 entry in the データサイエンスに関して By データミックスコミュニティ Advent Calendar 2019.
We organized this Advent calendar to liven up the Datamix community.

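Overview

Feature engineering is one of the key steps in building a machine learning prediction model.
You face the data (sometimes plotting it, sometimes staring at individual records) and create features as you form hypotheses,
but because this is preprocessing work, doing it properly takes a fair amount of time.
With AutoML (automated machine learning) gaining momentum in recent years, automation is reaching feature engineering as well.

featuretools is one Python library that automates feature creation.
This article briefly walks through how it behaves and how to use it.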

API reference

Several articles on featuretools already exist, so please refer to them for the overview, definitions, and details.
(I relied on them as well; thank you.)

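The articles above actually assume that featuretools is applied to multiple tables.
That is, there is a Master table and a Transaction table; using the ID they share as the key, aggregate functions (SUM, MEAN, and so on) are applied to the Transaction table to create features at the granularity of the Master table.

On the other hand, when I tried applying featuretools directly to a single table (think of a simple Kaggle task), I noticed several things about how best to proceed, so I list what I learned here.

As described later, applying featuretools to a single table let me do the following:

Extract date and time components from datetime variables
Apply arithmetic operations (in various combinations) between continuous numeric variables
If there is a variable that can serve as an index but is not a unique key, use it as the key for a GROUP BY with chosen aggregate functions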

1. Installing the library

Install it with pip as usual.

pip install featuretools
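
The code in this article uses the Entity-based API (entity_from_dataframe, featuretools.variable_types) as it existed when the article was written in 2019. To my knowledge, the 1.0 release reworked that API (for instance, add_dataframe takes the place of entity_from_dataframe), so it may be worth checking which version is installed. A minimal sketch; uses_entity_api and installed_featuretools_version are hypothetical helper names, not part of featuretools:

```python
from importlib import metadata


def uses_entity_api(version: str) -> bool:
    """Return True for a pre-1.0 version string such as '0.13.0'."""
    major = int(version.split(".")[0])
    return major < 1


def installed_featuretools_version():
    """Return the installed featuretools version, or None if it is absent."""
    try:
        return metadata.version("featuretools")
    except metadata.PackageNotFoundError:
        return None


version = installed_featuretools_version()
if version is None:
    print("featuretools is not installed")
elif uses_entity_api(version):
    print(f"featuretools {version}: the Entity-based calls in this article should apply")
else:
    print(f"featuretools {version}: expect renamed calls such as add_dataframe")
```

If the check reports a 1.x version, the listings below would need adapting before they run.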

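2. Loading and preparing the data

Ideally I would have used data from my actual work, but
for lack of time and other reasons, this article sticks to the demo data that ships with featuretools.
(I would like to cover other data in a future post when I have time.)

featuretools provides about three demo datasets; here we use the flight data from 2017.
(The official API reference has a brief description.)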

import featuretools as ft
import featuretools.variable_types as vtypes

# month_filter, categorical_filter:
# restrict the load to a subset to keep the row count small (you can ignore them)
demo_flight = ft.demo.load_flight(
    verbose=True, 
    month_filter=[1], 
    categorical_filter={'origin_city':['Boston, MA']}
)
demo_flight

Output of demo_flight

Entityset: Flight Data
  Entities:
    trip_logs [Rows: 9456, Columns: 21]
    flights [Rows: 613, Columns: 9]
    airlines [Rows: 10, Columns: 1]
    airports [Rows: 55, Columns: 3]
  Relationships:
    trip_logs.flight_id -> flights.flight_id
    flights.carrier -> airlines.carrier
    flights.dest -> airports.dest

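To mirror ordinary data processing, we first store the data we will use as a pandas DataFrame.
demo_flight holds four Entities (in short, tables); since we want to work with a single table, we use only the trip_logs Entity.

# .df returns the Entity's contents as a DataFrame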
demo_flight['trip_logs'].df.head(3)

	trip_log_id	flight_id	date_scheduled	scheduled_dep_time	scheduled_arr_time	dep_time	arr_time	dep_delay	taxi_out	taxi_in	arr_delay	scheduled_elapsed_time	air_time	distance	carrier_delay	national_airspace_delay	late_aircraft_delay
5	5	AA-605:BOS->PHL	2016-09-03 00:00:00	2017-01-01 11:15:00	2017-01-01 12:46:00	2017-01-01 11:05:00	2017-01-01 12:24:00	-10	15	3	-22	5460000000000	61	280	0	0	0
19	19	AA-1509:BOS->MIA	2016-09-03 00:00:00	2017-01-01 11:00:00	2017-01-01 14:38:00	2017-01-01 13:13:00	2017-01-01 16:22:00	133	11	3	104	13080000000000	175	1258	0	0	104
73	73	AA-1608:BOS->PHX	2016-09-03 00:00:00	2017-01-01 09:25:00	2017-01-01 15:31:00	2017-01-01 09:40:00	2017-01-01 15:48:00	15	19	9	17	21960000000000	340	2300	6	2	9

There are quite a few columns, so let's narrow them down to the minimum.

trip_logs_df = demo_flight['trip_logs'].df[[
    'trip_log_id', 'flight_id', 'scheduled_dep_time', 'distance', 'late_aircraft_delay', 'security_delay'
]]
trip_logs_df.head()

trip_logs_df (9456 rows × 6 columns; the security_delay column is not shown in the listing below)

	trip_log_id	flight_id	scheduled_dep_time	distance	late_aircraft_delay
5	5	AA-605:BOS->PHL	2017-01-01 11:15:00	280	0
19	19	AA-1509:BOS->MIA	2017-01-01 11:00:00	1258	104
73	73	AA-1608:BOS->PHX	2017-01-01 09:25:00	2300	9
166	166	AA-2251:BOS->DFW	2017-01-01 10:00:00	1562	0
218	218	AA-2303:BOS->DFW	2017-01-01 08:00:00	1562	0

trip_log_id is the unique key.
As an aside, grouping by flight_id gives row counts like the following.

from collections import Counter
Counter(trip_logs_df['flight_id']).most_common()

Output

[('AA-1509:BOS->MIA', 31),
 ('AA-2303:BOS->DFW', 31),
 ('B6-1969:BOS->FLL', 31),
 ('B6-2191:BOS->TPA', 31),
 ('B6-2269:BOS->FLL', 31),
 ('B6-2379:BOS->EWR', 31),
 ('DL-105:BOS->ATL', 31),
 ('DL-476:BOS->JFK', 31),
...

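So, when creating features from this single table, it would be nice to be able to do the following:

Extract date and time components from datetime variables
Apply arithmetic operations (in various combinations) between continuous numeric variables
If there is a variable that can serve as an index but is not a unique key (flight_id in this case), use it as the key for a GROUP BY with chosen aggregate functions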

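3. Concepts to understand before creating features with featuretools

As a prerequisite, featuretools has two types of feature functions (primitives) that you need to understand: aggregation and transform.
Many calculation functions are available for each type.

aggregation groups rows by an index used as the key (GROUP BY) and turns the value computed by an aggregate function into a variable
- Example here: GROUP BY flight_id and use the MEAN of distance for each id as a variable
transform performs some calculation on one or more columns and uses the result as a variable
- Example 1 here: use the product of distance and late_aircraft_delay as a new variable
- Example 2 here: extract the month (MONTH) from scheduled_dep_time as a new variable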
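
To make the distinction concrete, the three examples above can be sketched in plain Python without featuretools. This only illustrates what each type computes, not how the library implements it. The first two rows are taken from the trip_logs listing; the third row is hypothetical and is there only so that one flight_id has more than one row.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Rows 1 and 2 come from the article's listing; row 3 is hypothetical.
rows = [
    {"flight_id": "AA-605:BOS->PHL", "scheduled_dep_time": datetime(2017, 1, 1, 11, 15),
     "distance": 280, "late_aircraft_delay": 0},
    {"flight_id": "AA-1509:BOS->MIA", "scheduled_dep_time": datetime(2017, 1, 1, 11, 0),
     "distance": 1258, "late_aircraft_delay": 104},
    {"flight_id": "AA-1509:BOS->MIA", "scheduled_dep_time": datetime(2017, 1, 2, 11, 0),
     "distance": 1258, "late_aircraft_delay": 0},
]

# aggregation: group by flight_id, then reduce each group to ONE value per key
distances_by_flight = defaultdict(list)
for row in rows:
    distances_by_flight[row["flight_id"]].append(row["distance"])
mean_distance = {fid: mean(values) for fid, values in distances_by_flight.items()}

# transform: compute ONE value per row from that row's own columns
products = [row["distance"] * row["late_aircraft_delay"] for row in rows]
months = [row["scheduled_dep_time"].month for row in rows]

print(mean_distance)
print(products)
print(months)
```

The aggregation result has one entry per flight_id, while each transform result has one entry per row.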

The functions available for both types can be listed with the following code.

ft.primitives.list_primitives()

	name	type	description
0	skew	aggregation	Computes the extent to which a distribution differs from a normal distribution.
1	mean	aggregation	Computes the average for a list of values.
2	count	aggregation	Determines the total number of values, excluding `NaN`.
3	time_since_first	aggregation	Calculates the time elapsed since the first datetime (in seconds).
4	n_most_common	aggregation	Determines the `n` most common elements.
5	all	aggregation	Calculates if all values are 'True' in a list.
6	num_true	aggregation	Counts the number of `True` values.
7	last	aggregation	Determines the last value in a list.
8	max	aggregation	Calculates the highest value, ignoring `NaN` values.
9	entropy	aggregation	Calculates the entropy for a categorical variable
10	std	aggregation	Computes the dispersion relative to the mean value, ignoring `NaN`.
11	mode	aggregation	Determines the most commonly repeated value.
12	median	aggregation	Determines the middlemost number in a list of values.
13	sum	aggregation	Calculates the total addition, ignoring `NaN`.
14	trend	aggregation	Calculates the trend of a variable over time.
15	avg_time_between	aggregation	Computes the average number of seconds between consecutive events.
16	time_since_last	aggregation	Calculates the time elapsed since the last datetime (default in seconds).
17	percent_true	aggregation	Determines the percent of `True` values.
18	any	aggregation	Determines if any value is 'True' in a list.
19	num_unique	aggregation	Determines the number of distinct values, ignoring `NaN` values.
20	first	aggregation	Determines the first value in a list.
21	min	aggregation	Calculates the smallest value, ignoring `NaN` values.
22	less_than_equal_to_scalar	transform	Determines if values are less than or equal to a given scalar.
23	year	transform	Determines the year value of a datetime.
24	less_than_scalar	transform	Determines if values are less than a given scalar.
25	not	transform	Negates a boolean value.
26	modulo_by_feature	transform	Return the modulo of a scalar by each element in the list.
27	week	transform	Determines the week of the year from a datetime.
28	subtract_numeric	transform	Element-wise subtraction of two lists.
29	divide_numeric	transform	Element-wise division of two lists.
30	greater_than_equal_to_scalar	transform	Determines if values are greater than or equal to a given scalar.
31	diff	transform	Compute the difference between the value in a list and the
32	longitude	transform	Returns the second tuple value in a list of LatLong tuples.
33	second	transform	Determines the seconds value of a datetime.
34	add_numeric	transform	Element-wise addition of two lists.
35	greater_than_scalar	transform	Determines if values are greater than a given scalar.
36	multiply_numeric_scalar	transform	Multiply each element in the list by a scalar.
37	day	transform	Determines the day of the month from a datetime.
38	modulo_numeric_scalar	transform	Return the modulo of each element in the list by a scalar.
39	percentile	transform	Determines the percentile rank for each value in a list.
40	time_since	transform	Calculates time from a value to a specified cutoff datetime.
41	cum_max	transform	Calculates the cumulative maximum.
42	not_equal	transform	Determines if values in one list are not equal to another list.
43	num_characters	transform	Calculates the number of characters in a string.
44	scalar_subtract_numeric_feature	transform	Subtract each value in the list from a given scalar.
45	divide_by_feature	transform	Divide a scalar by each value in the list.
46	cum_count	transform	Calculates the cumulative count.
47	time_since_previous	transform	Compute the time since the previous entry in a list.
48	equal	transform	Determines if values in one list are equal to another list.
49	cum_min	transform	Calculates the cumulative minimum.
50	is_weekend	transform	Determines if a date falls on a weekend.
51	less_than_equal_to	transform	Determines if values in one list are less than or equal to another list.
52	month	transform	Determines the month value of a datetime.
53	less_than	transform	Determines if values in one list are less than another list.
54	negate	transform	Negates a numeric value.
55	minute	transform	Determines the minutes value of a datetime.
56	haversine	transform	Calculates the approximate haversine distance between two LatLong
57	or	transform	Element-wise logical OR of two lists.
58	and	transform	Element-wise logical AND of two lists.
59	add_numeric_scalar	transform	Add a scalar to each value in the list.
60	greater_than_equal_to	transform	Determines if values in one list are greater than or equal to another list.
61	isin	transform	Determines whether a value is present in a provided list.
62	hour	transform	Determines the hour value of a datetime.
63	latitude	transform	Returns the first tuple value in a list of LatLong tuples.
64	multiply_boolean	transform	Element-wise multiplication of two lists of boolean values.
65	weekday	transform	Determines the day of the week from a datetime.
66	is_null	transform	Determines if a value is null.
67	not_equal_scalar	transform	Determines if values in a list are not equal to a given scalar.
68	greater_than	transform	Determines if values in one list are greater than another list.
69	multiply_numeric	transform	Element-wise multiplication of two lists.
70	modulo_numeric	transform	Element-wise modulo of two lists.
71	cum_mean	transform	Calculates the cumulative mean.
72	num_words	transform	Determines the number of words in a string by counting the spaces.
73	absolute	transform	Computes the absolute value of a number.
74	equal_scalar	transform	Determines if values in a list are equal to a given scalar.
75	subtract_numeric_scalar	transform	Subtract a scalar from each element in the list.
76	divide_numeric_scalar	transform	Divide each element in the list by a scalar.
77	cum_sum	transform	Calculates the cumulative sum.

That is a lot...
I do not understand all of them myself, but it is clear that quite a range of things is possible.

3-1. Extracting date and time components from datetime variables

First, create the EntitySet and the Entity. The earlier articles mentioned at the beginning are a good reference for this part.

es = ft.EntitySet(id='demo')  # any id will do here
es.entity_from_dataframe(
    entity_id='trip_logs',  # this id is also up to you
    dataframe=trip_logs_df,  # the DataFrame to register
    index='trip_log_id',  # the index variable that serves as the unique key
    variable_types={'scheduled_dep_time': vtypes.Datetime}  # declare the datetime variable explicitly, just in case
)

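Extracting the year, month, day, day of week, hour, and minute from a datetime variable is basic preprocessing for datetime data, so let's do it.

As noted above, extracting date and time components uses transform-type functions.

# These lists select which of the functions listed above to use
# Here we pick a few datetime-related transform functions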
aggregation_list = []
transform_list  = ['year', 'month', 'hour', 'weekday', 'is_weekend']

feature_matrix, features_dfs = ft.dfs(
    entityset=es,  # the EntitySet defined above
    target_entity="trip_logs",  # the entity_id defined in that EntitySet
    agg_primitives=aggregation_list,  # aggregation-type functions to use (none this time)
    trans_primitives=transform_list,  # transform-type functions to use (five this time)
    max_depth=1  # how many times the processing is repeated during feature creation (details later)
)

print(feature_matrix.shape)
feature_matrix.head(3)

The output is as follows.
feature_matrix (9456 rows × 9 columns; the security_delay column is not shown in the listing)

trip_log_id	flight_id	distance	late_aircraft_delay	YEAR(scheduled_dep_time)	MONTH(scheduled_dep_time)	HOUR(scheduled_dep_time)	WEEKDAY(scheduled_dep_time)	IS_WEEKEND(scheduled_dep_time)
5	AA-605:BOS->PHL	280	0	2017	1	11	6	True
19	AA-1509:BOS->MIA	1258	104	2017	1	11	6	True
73	AA-1608:BOS->PHX	2300	9	2017	1	9	6	True

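The following components were extracted from scheduled_dep_time as expected.

YEAR: year
MONTH: month
HOUR: hour of day
WEEKDAY: day of week (0 = Monday through 6 = Sunday)
IS_WEEKEND: whether the date falls on a Saturday or Sunday

As the function list above shows, there are several other datetime functions, so use whichever you need.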
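
As a cross-check, the same five values can be computed by hand with the standard datetime module. This is a sketch of what the primitives return for the first row of the listing (scheduled_dep_time 2017-01-01 11:15:00), not of how featuretools computes them; datetime_features is a hypothetical helper name.

```python
from datetime import datetime


def datetime_features(ts: datetime) -> dict:
    """Mirror the YEAR, MONTH, HOUR, WEEKDAY and IS_WEEKEND columns for one timestamp."""
    return {
        "YEAR": ts.year,
        "MONTH": ts.month,
        "HOUR": ts.hour,
        "WEEKDAY": ts.weekday(),         # Monday = 0 ... Sunday = 6
        "IS_WEEKEND": ts.weekday() >= 5,  # Saturday (5) or Sunday (6)
    }


features = datetime_features(datetime(2017, 1, 1, 11, 15))
print(features)
```

January 1, 2017 was a Sunday, which is why the listing shows WEEKDAY = 6 and IS_WEEKEND = True.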

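3-2. Applying arithmetic operations between continuous numeric variables

Create the EntitySet and the Entity once more.
(Normally doing this once is enough. It is redefined here because the features are generated in several separate runs.)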

es = ft.EntitySet(id='demo')
es.entity_from_dataframe(
    entity_id='trip_logs',
    dataframe=trip_logs_df,
    index='trip_log_id',
    variable_types={'scheduled_dep_time': vtypes.Datetime}
)

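Arithmetic operations simply compute each row from a pair of columns, so they are transform-type functions.
The four basic operations between columns are available: addition, subtraction, multiplication, and division.

add_numeric: add two columns
subtract_numeric: subtract one column from another
multiply_numeric: multiply two columns
divide_numeric: divide one column by another

3-2-1. With max_depth=1

(max_depth, which appears in the heading, is explained later.)
Too many columns would be hard to read, so let's start by applying only subtraction and multiplication.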

aggregation_list = []
transform_list  = ['subtract_numeric', 'multiply_numeric']

feature_matrix, features_dfs = ft.dfs(
    entityset=es,
    target_entity="trip_logs",
    agg_primitives=aggregation_list,
    trans_primitives=transform_list,
    max_depth=1
)

print(feature_matrix.shape)
feature_matrix.head(3)

The result is below.
feature_matrix (9456 rows × 10 columns; the listing shows only some of them)

trip_log_id	flight_id	distance	late_aircraft_delay	distance - late_aircraft_delay	late_aircraft_delay - security_delay	distance - security_delay	distance * late_aircraft_delay
5	AA-605:BOS->PHL	280	0	280	0	280	0
19	AA-1509:BOS->MIA	1258	104	1154	104	1258	130832
73	AA-1608:BOS->PHX	2300	9	2291	9	2300	20700

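The columns have been subtracted from and multiplied by each other as expected.
For the three continuous variables in the original data (distance, late_aircraft_delay, and security_delay),

3 subtractions between variables (the number of 2-combinations of 3 items, i.e. 3C2 = 3)
3 multiplications between variables, likewise

have been added.
(scheduled_dep_time has disappeared because no datetime transform function was specified.)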
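
The count of 10 columns can be checked with a short enumeration. This sketch only reproduces the arithmetic of the combinations, assuming one feature per unordered pair (which is what the 10-column shape suggests); it does not call featuretools.

```python
from itertools import combinations

numeric_columns = ["distance", "late_aircraft_delay", "security_delay"]

# Every unordered pair of the three numeric columns: 3C2 = 3 pairs
pairs = list(combinations(numeric_columns, 2))

subtract_names = [f"{a} - {b}" for a, b in pairs]
multiply_names = [f"{a} * {b}" for a, b in pairs]

# flight_id + 3 original numeric columns + 3 differences + 3 products
total_columns = 1 + len(numeric_columns) + len(subtract_names) + len(multiply_names)

print(subtract_names)
print(multiply_names)
print(total_columns)
```

The total matches the 9456 × 10 shape reported above.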

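3-2-2. With max_depth=2

When we called ft.dfs above, we passed max_depth=1.
max_depth controls how many times the processing may be applied again to features that have already been created;
with 1, the specified transform calculations are performed once and that is all.
So what happens if we set it to 2?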

aggregation_list = []
transform_list  = ['subtract_numeric', 'multiply_numeric']

feature_matrix, features_dfs = ft.dfs(
    entityset=es,
    target_entity="trip_logs",
    agg_primitives=aggregation_list,
    trans_primitives=transform_list,
    max_depth=2  # set to 2 here
)
)

print(feature_matrix.shape)
feature_matrix.head(3)

Here is the output.
feature_matrix (9456 rows × 22 columns; the listing shows only some of them)

trip_log_id	flight_id	distance	late_aircraft_delay	distance - security_delay	late_aircraft_delay - security_delay	distance - late_aircraft_delay	distance * late_aircraft_delay	late_aircraft_delay * late_aircraft_delay - security_delay	distance * late_aircraft_delay - security_delay	distance - late_aircraft_delay * distance - security_delay	distance - security_delay * late_aircraft_delay - security_delay	distance - late_aircraft_delay * late_aircraft_delay	distance * distance - late_aircraft_delay	distance - security_delay * late_aircraft_delay	distance * distance - security_delay	distance - late_aircraft_delay * late_aircraft_delay - security_delay
5	AA-605:BOS->PHL	280	0	280	0	280	0	0	0	78400	0	0	78400	0	78400	0
19	AA-1509:BOS->MIA	1258	104	1258	104	1154	130832	10816	130832	1.45173e+06	130832	120016	1.45173e+06	130832	1.58256e+06	120016
73	AA-1608:BOS->PHX	2300	9	2300	9	2291	20700	81	20700	5.2693e+06	20700	20619	5.2693e+06	20700	5.29e+06	20619

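This is about where featuretools starts to get hard to follow.
The new features obtained from the earlier subtractions and multiplications are combined again with other continuous variables to build yet more features (in the listing above, every new column is a product that involves at least one difference).
Naturally, the number of features jumps as well (from the earlier 10 columns to 22).

I have no idea whether features like these will actually help, but
in any case, increasing max_depth lets transform-based feature creation be nested repeatedly.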
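
One way to account for the jump from 10 to 22 columns is sketched below for the AA-1509 row (distance 1258, late_aircraft_delay 104, security_delay 0, the last value inferred from the differences in the listing). It assumes that the depth-2 features are exactly the products in which at least one operand is a depth-1 difference. That assumption is consistent with the listing and with the column count, but it is my reading of the output, not a statement of the library's rules.

```python
from itertools import combinations

# Values of the AA-1509 row; security_delay = 0 is inferred from the listing.
row = {"distance": 1258, "late_aircraft_delay": 104, "security_delay": 0}

# Depth 1: pairwise differences of the original columns
diffs = {f"{a} - {b}": row[a] - row[b] for a, b in combinations(row, 2)}
depth1_values = {**row, **diffs}

# Depth 2 (assumed rule): products where at least one operand is a difference
depth2 = {
    f"{a} * {b}": depth1_values[a] * depth1_values[b]
    for a, b in combinations(depth1_values, 2)
    if a in diffs or b in diffs
}

# 10 columns at depth 1 plus the new depth-2 products
total_columns = 10 + len(depth2)

print(len(depth2), total_columns)
print(depth2["distance * distance - late_aircraft_delay"])
print(depth2["late_aircraft_delay * distance - late_aircraft_delay"])
```

The two printed products, 1258 × 1154 and 104 × 1154, correspond to the values 1.45173e+06 and 120016 in the AA-1509 row of the listing; the generated names may order the operands differently from featuretools.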

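3-3. Using a non-unique index variable as the key for a GROUP BY with chosen aggregate functions

Finally, when a single table contains a variable that can be grouped on (a key), we create features with aggregate functions keyed on that variable.
Here we GROUP BY flight_id and compute the following aggregate functions (on the applicable variables) for each id.

SUM: total (per key)
MEAN: mean (per key)
STD: standard deviation (per key)
MAX: maximum (per key)
COUNT: count (per key)
SKEW: skewness (per key)

There are many other aggregate functions; see the aggregation-type functions in the list above.
The more you specify, the more features are created.

The code is therefore as follows.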

aggregation_list = ['sum', 'mean', 'std', 'max', 'count', 'skew']
transform_list  = []

es = ft.EntitySet(id='demo')
es.entity_from_dataframe(
    entity_id='trip_logs', 
    dataframe=trip_logs_df, 
    index='trip_log_id',
    variable_types={'scheduled_dep_time': vtypes.Datetime}
)
es.normalize_entity(base_entity_id='trip_logs', new_entity_id='trip_logs_flight_id_norm', index='flight_id')

feature_matrix, features_dfs = ft.dfs(
    entityset=es, 
    target_entity="trip_logs",
    agg_primitives=aggregation_list, 
    trans_primitives=transform_list,
    max_depth=2
)

print(feature_matrix.shape)
feature_matrix.head(3)

The point to note is that the following line has been added.

es.normalize_entity(base_entity_id='trip_logs', new_entity_id='trip_logs_flight_id_norm', index='flight_id')

This specifies which key to GROUP BY.
featuretools works in terms of Entities, so we create an Entity keyed on flight_id.
The DataFrame of the new Entity, trip_logs_flight_id_norm, can be displayed as follows.

es['trip_logs_flight_id_norm'].df.head()

	flight_id
AA-605:BOS->PHL	AA-605:BOS->PHL
AA-1509:BOS->MIA	AA-1509:BOS->MIA
AA-1608:BOS->PHX	AA-1608:BOS->PHX
AA-2251:BOS->DFW	AA-2251:BOS->DFW
AA-2303:BOS->DFW	AA-2303:BOS->DFW

The result looks like this.

trip_log_id	flight_id	distance	late_aircraft_delay	trip_logs_flight_id_norm.SUM(trip_logs.distance)	trip_logs_flight_id_norm.SUM(trip_logs.late_aircraft_delay)	trip_logs_flight_id_norm.MEAN(trip_logs.distance)	trip_logs_flight_id_norm.MEAN(trip_logs.late_aircraft_delay)	trip_logs_flight_id_norm.STD(trip_logs.late_aircraft_delay)	trip_logs_flight_id_norm.MAX(trip_logs.distance)	trip_logs_flight_id_norm.MAX(trip_logs.late_aircraft_delay)	trip_logs_flight_id_norm.COUNT(trip_logs)	trip_logs_flight_id_norm.SKEW(trip_logs.security_delay)	trip_logs_flight_id_norm.SKEW(trip_logs.distance)	trip_logs_flight_id_norm.SKEW(trip_logs.late_aircraft_delay)
5	AA-605:BOS->PHL	280	0	560	0	280	0	0	280	0	2	nan	nan	nan
19	AA-1509:BOS->MIA	1258	104	38998	351	1258	11.3226	29.3932	1258	104	31	0	0	2.49488
73	AA-1608:BOS->PHX	2300	9	18400	9	2300	1.125	3.18198	2300	9	8	0	0	2.82843

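Quite a few columns have been added...
The column names take the form
(new Entity name).AGGREGATE(target variable), for example trip_logs_flight_id_norm.SUM(trip_logs.security_delay).

The specified aggregate functions have been applied automatically to every variable they can be applied to.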
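
The aggregated values in the AA-1608 row can be reproduced by hand. The sketch below assumes that the eight trips of that flight have late_aircraft_delay values of 9 and seven zeros. That is an inference, not data from the source: it is the only possibility consistent with SUM = 9, MAX = 9, and COUNT = 8 if delays are never negative. The column names are built to follow the naming pattern described above.

```python
from statistics import mean, stdev

# Assumed delays for the 8 trips of AA-1608:BOS->PHX (inferred, see the lead-in)
delays = [9, 0, 0, 0, 0, 0, 0, 0]

new_entity, base_entity, variable = "trip_logs_flight_id_norm", "trip_logs", "late_aircraft_delay"

aggregates = {
    f"{new_entity}.SUM({base_entity}.{variable})": sum(delays),
    f"{new_entity}.MEAN({base_entity}.{variable})": mean(delays),
    f"{new_entity}.STD({base_entity}.{variable})": stdev(delays),  # sample standard deviation
    f"{new_entity}.MAX({base_entity}.{variable})": max(delays),
    f"{new_entity}.COUNT({base_entity})": len(delays),
}

for name, value in aggregates.items():
    print(name, value)
```

The sample standard deviation (dividing by n - 1) is what matches the 3.18198 in the listing; in the feature matrix, these per-flight values are repeated on every trip_logs row that shares the flight_id.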

4. Summary

For reference, here is the code for creating the variables from 3-1, 3-2, and 3-3 all at once.


aggregation_list = ['sum', 'mean', 'std', 'max', 'count', 'skew']
transform_list  = ['year', 'month', 'hour', 'weekday', 'is_weekend', 'subtract_numeric', 'multiply_numeric']

es = ft.EntitySet(id='demo')
es.entity_from_dataframe(
    entity_id='trip_logs', 
    dataframe=trip_logs_df, 
    index='trip_log_id',
    variable_types={'scheduled_dep_time': vtypes.Datetime}
)
es.normalize_entity(base_entity_id='trip_logs', new_entity_id='trip_logs_flight_id_norm', index='flight_id')

feature_matrix, features_dfs = ft.dfs(
    entityset=es, 
    target_entity="trip_logs",
    agg_primitives=aggregation_list, 
    trans_primitives=transform_list,
    max_depth=2
)

All you need to do is add more functions to the aggregation and transform lists.
Very easy.
(The output has an enormous number of columns, so I will not show it here.)

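Closing thoughts

To reiterate, I think the right time to use this is when you want to
"just build a prediction model pipeline with minimal effort."
It creates many features automatically, but
(although I could not show it here) when I built a model on data I actually work with, only a small fraction of the variables ranked high in feature importance.

That is to be expected: the library simply applies arithmetic and aggregate functions without any thought, so most of the results are bound to be garbage variables.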
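
Some of that redundancy is already visible in the 3-2 listings: because security_delay is 0 in the rows shown, "distance - security_delay" simply repeats distance. A minimal sketch of one possible first clean-up step, dropping columns whose values exactly duplicate an earlier column, is shown below. It uses only the three rows printed in the article, so a column that looks duplicated here may differ on the full data, and drop_duplicate_columns is a hypothetical helper, not a featuretools function.

```python
# Column values for the three rows shown in the max_depth=1 listing
columns = {
    "distance": [280, 1258, 2300],
    "distance - security_delay": [280, 1258, 2300],
    "late_aircraft_delay": [0, 104, 9],
    "late_aircraft_delay - security_delay": [0, 104, 9],
    "distance - late_aircraft_delay": [280, 1154, 2291],
}


def drop_duplicate_columns(columns: dict) -> dict:
    """Keep the first column for each distinct sequence of values."""
    seen = set()
    kept = {}
    for name, values in columns.items():
        key = tuple(values)
        if key in seen:
            continue  # same values as an earlier column: carries no extra information
        seen.add(key)
        kept[name] = values
    return kept


kept = drop_duplicate_columns(columns)
print(list(kept))
```

This removes only exact duplicates; judging which of the remaining features are actually useful still takes feature importance or a dedicated selection method.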

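I believe what truly matters is forming hypotheses and carefully working out promising variables and data processing,
so I expect that kind of work to remain for some time.

Still, if even a few useful variables turn up, that is worth being grateful for,
and as a first version that creates features automatically without any thought, I think it is a very useful tool.
(It is free, after all.)

Going forward, I hope to present a model-building pipeline that also combines variable selection (using borutapy, for example) and model selection (using h2o, for example).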
