
Processing the Titanic data with the preprocessing library DataLiner (Append edition)

Posted at 2020-05-03

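Introduction

This is the fourth article in a series introducing the individual preprocessing steps of DataLiner, a preprocessing library for Python.
This time I'd like to cover the Append family. With this, every preprocessing step implemented so far has been introduced.

I plan to add a few more preprocessing steps and release Ver1.2 after the Golden Week holidays, and I'd like to write another introductory article at that point.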

Release article:
https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
Documentation:
https://shallowdf20.github.io/dataliner/preprocessing.html

Installation

! pip install -U dataliner

Data preparation

As usual, we prepare the Titanic data.

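import pandas as pd
import dataliner as dl
from sklearn.pipeline import make_pipeline  # needed by the pipelines in each section

# Load the Titanic training data
df = pd.read_csv('train.csv')
target_col = 'Survived'

# Split into features and target
X = df.drop(target_col, axis=1)
y = df[target_col]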

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.250	NaN	S
2	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.283	C85	C
3	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	NaN	S
4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.100	C123	S
5	3	Allen, Mr. William Henry	male	35	0	373450	8.050	NaN	S

AppendAnomalyScore

Trains an Isolation Forest on the data and appends the resulting anomaly score as a new feature.
Missing values must be imputed and categorical variables encoded before use.

trans = dl.AppendAnomalyScore()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    trans
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Anomaly_Score
1	3	640	2	22	1	141	7.250	144	3	0.04805
2	1	554	1	38	1	351	71.283	101	1	-0.06340
3	3	717	1	26	0	278	7.925	144	3	0.04050
4	1	803	1	35	1	92	53.100	33	3	-0.04854
5	3	602	2	35	0	113	8.050	144	3	0.06903
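To get a feel for what a transformer like this computes, here is a minimal sketch that uses scikit-learn's IsolationForest directly. It is not DataLiner's implementation: the helper name append_anomaly_score is hypothetical, and the choice of decision_function as the score is my assumption (it yields negative values for more anomalous rows, which is at least consistent with the signs in the output above). The data is synthetic so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def append_anomaly_score(X: pd.DataFrame, random_state: int = 0) -> pd.DataFrame:
    """Hypothetical helper: return a copy of X with an 'Anomaly_Score' column."""
    model = IsolationForest(random_state=random_state)
    model.fit(X)
    out = X.copy()
    # Assumption: use decision_function, where lower (negative) means more anomalous
    out["Anomaly_Score"] = model.decision_function(X)
    return out


# Synthetic numeric frame standing in for the imputed, encoded Titanic data
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
demo.loc[0] = [8.0, 8.0, 8.0]  # plant one obvious outlier in the first row

scored = append_anomaly_score(demo)
print(scored.head())
```

The planted outlier in the first row should receive a clearly lower score than typical rows. Isolation Forest cannot handle NaN or string columns, which is why imputation and encoding come first in the pipeline.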

AppendCluster

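Clusters the data with KMeans++ and appends the number of the cluster each row belongs to as a new feature.
Missing values must be imputed and categorical variables encoded before use. Scaling is also recommended.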

trans = dl.AppendCluster()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    dl.StandardScaling(),
    trans
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Cluster_Number
-1.729	0.8269	0.7538	0.7373	-0.5921	0.4326	-0.4734	-0.8129	-0.5022	0.4561	0.5856	5
-1.725	-1.5652	0.4197	-1.3548	0.6384	0.4326	-0.4734	0.1102	0.7864	-0.6156	-1.9412	2
-1.721	0.8269	1.0530	-1.3548	-0.2845	-0.4743	-0.4734	-0.2107	-0.4886	0.4561	0.5856	4
-1.717	-1.5652	1.3872	-1.3548	0.4077	0.4326	-0.4734	-1.0282	0.4205	-2.3103	0.5856	0
-1.714	0.8269	0.6062	0.7373	0.4077	-0.4743	-0.4734	-0.9359	-0.4861	0.4561	0.5856	5
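The same idea can be sketched with scikit-learn's KMeans, which supports k-means++ initialization. This is only an illustration, not DataLiner's code: append_cluster is a hypothetical helper, and the default of 8 clusters here is scikit-learn's default, which I am assuming rather than taking from DataLiner. The demo uses three synthetic blobs instead of the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def append_cluster(X: pd.DataFrame, n_clusters: int = 8, random_state: int = 0) -> pd.DataFrame:
    """Hypothetical helper: return a copy of X with a 'Cluster_Number' column."""
    model = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                   random_state=random_state)
    out = X.copy()
    # fit_predict returns the index of the nearest cluster center for each row
    out["Cluster_Number"] = model.fit_predict(X)
    return out


# Three well-separated synthetic blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
demo = pd.DataFrame(points, columns=["x", "y"])

clustered = append_cluster(demo, n_clusters=3)
print(clustered["Cluster_Number"].value_counts())
```

Because KMeans relies on Euclidean distance, features on a large scale dominate the clustering, which is the reason scaling is recommended beforehand.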

AppendClusterDistance

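Clusters the data with KMeans++ and appends the distance from each row to each cluster as new features.
Missing values must be imputed and categorical variables encoded before use. Scaling is also recommended.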

trans = dl.AppendClusterDistance()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    dl.StandardScaling(),
    trans
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Cluster_Distance_0	Cluster_Distance_1	Cluster_Distance_2	Cluster_Distance_3	Cluster_Distance_4	Cluster_Distance_5	Cluster_Distance_6	Cluster_Distance_7
-1.729	0.8269	0.7538	0.7373	-0.5921	0.4326	-0.4734	-0.8129	-0.5022	0.4561	0.5856	4.580	2.794	3.633	4.188	3.072	2.363	4.852	5.636
-1.725	-1.5652	0.4197	-1.3548	0.6384	0.4326	-0.4734	0.1102	0.7864	-0.6156	-1.9412	3.434	4.637	3.374	4.852	3.675	4.619	6.044	3.965
-1.721	0.8269	1.0530	-1.3548	-0.2845	-0.4743	-0.4734	-0.2107	-0.4886	0.4561	0.5856	4.510	3.410	3.859	3.906	2.207	2.929	5.459	5.608
-1.717	-1.5652	1.3872	-1.3548	0.4077	0.4326	-0.4734	-1.0282	0.4205	-2.3103	0.5856	2.604	5.312	4.063	5.250	4.322	4.842	6.495	4.479
-1.714	0.8269	0.6062	0.7373	0.4077	-0.4743	-0.4734	-0.9359	-0.4861	0.4561	0.5856	4.482	2.632	3.168	4.262	3.097	2.382	5.724	5.593
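As a rough sketch of how distance features like these can be produced, scikit-learn's KMeans.fit_transform returns the Euclidean distance from every row to every cluster center. The helper append_cluster_distance below is hypothetical and is not taken from DataLiner. The eight Cluster_Distance columns in the output above match scikit-learn's default of 8 clusters, but that this is where the number comes from is my assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def append_cluster_distance(X: pd.DataFrame, n_clusters: int = 8,
                            random_state: int = 0) -> pd.DataFrame:
    """Hypothetical helper: append one 'Cluster_Distance_i' column per cluster."""
    model = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                   random_state=random_state)
    # fit_transform gives an (n_samples, n_clusters) array of distances to the centers
    distances = model.fit_transform(X)
    out = X.copy()
    for i in range(n_clusters):
        out[f"Cluster_Distance_{i}"] = distances[:, i]
    return out


# Three well-separated synthetic blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
demo = pd.DataFrame(points, columns=["x", "y"])

with_dist = append_cluster_distance(demo, n_clusters=3)
print(with_dist.head())
```

Compared with a single cluster label, the distances keep information about how close a row is to every cluster, not only the nearest one.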

AppendPrincipalComponent

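Runs principal component analysis on the data and appends the principal components as new features.
Missing values must be imputed and categorical variables encoded before use. Scaling is also recommended.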

trans = dl.AppendPrincipalComponent()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    dl.StandardScaling(),
    trans
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Principal_Component_0	Principal_Component_1	Principal_Component_2	Principal_Component_3	Principal_Component_4
-1.729	0.8269	0.7538	0.7373	-0.5921	0.4326	-0.4734	-0.8129	-0.5022	0.4561	0.5856	-1.0239	0.1683	0.2723	-0.7951	-1.839
-1.725	-1.5652	0.4197	-1.3548	0.6384	0.4326	-0.4734	0.1102	0.7864	-0.6156	-1.9412	2.2205	0.1572	1.3115	-0.9589	-1.246
-1.721	0.8269	1.0530	-1.3548	-0.2845	-0.4743	-0.4734	-0.2107	-0.4886	0.4561	0.5856	-0.6973	0.2542	0.6843	-0.5943	-1.782
-1.717	-1.5652	1.3872	-1.3548	0.4077	0.4326	-0.4734	-1.0282	0.4205	-2.3103	0.5856	2.7334	0.2536	-0.2722	-1.5439	-1.530
-1.714	0.8269	0.6062	0.7373	0.4077	-0.4743	-0.4734	-0.9359	-0.4861	0.4561	0.5856	-0.7770	-0.7732	0.2852	-0.9750	-1.641
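For illustration, the following sketch appends principal components using scikit-learn's PCA. It is not DataLiner's implementation: append_principal_component is a hypothetical helper, and the default of 5 components simply mirrors the five Principal_Component columns in the output above. The demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def append_principal_component(X: pd.DataFrame, n_components: int = 5,
                               random_state: int = 0) -> pd.DataFrame:
    """Hypothetical helper: append 'Principal_Component_i' columns to a copy of X."""
    model = PCA(n_components=n_components, random_state=random_state)
    # fit_transform centers the data and projects it onto the leading components
    components = model.fit_transform(X)
    out = X.copy()
    for i in range(n_components):
        out[f"Principal_Component_{i}"] = components[:, i]
    return out


# Synthetic frame in which column 'b' is strongly correlated with column 'a'
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
demo = pd.DataFrame({
    "a": base[:, 0],
    "b": base[:, 0] * 2 + rng.normal(scale=0.1, size=100),
    "c": base[:, 1],
    "d": rng.normal(size=100),
})

result = append_principal_component(demo, n_components=2)
print(result.head())
```

PCA picks the directions of largest variance, so unscaled features with large ranges would dominate the components. That is why scaling is recommended before this step.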

Conclusion

This article introduced the Append family of DataLiner.
Going forward, I'd like to write introductory articles about new features whenever DataLiner is updated.

DataLiner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/
