Introducing DataLiner ver. 1.3 and How to Use UnionAppend

Last updated at 2020-05-12, posted at 2020-05-12

Introduction

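With the release of DataLiner 1.3.1, all of the major features I originally had in mind are now implemented.
From here on, development will slow to bug fixes plus the occasional new preprocessing step.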

Release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
GitHub: https://github.com/shallowdf20/dataliner
Document: https://shallowdf20.github.io/dataliner/preprocessing.html

Installation

! pip install -U dataliner

Changes in ver. 1.3

There are four changes:

UnionAppend added
StandardizeData removed (renamed to StandardScaling)
ArithmeticFeatureGenerator removed (renamed to AppendArithmeticFeatures)
load_titanic added
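
For pipelines written against an earlier version, the two renames listed above are the only names that need updating. As a rough sketch, the mapping can be written down as a plain dictionary. The helper `current_class_name` and the list of step names are hypothetical and exist only for illustration; nothing here imports or calls DataLiner itself.

```python
# Old class name -> new class name, taken from the ver. 1.3 change list.
RENAMED_IN_1_3 = {
    "StandardizeData": "StandardScaling",
    "ArithmeticFeatureGenerator": "AppendArithmeticFeatures",
}


def current_class_name(name):
    """Return the ver. 1.3 name for a class, or the name unchanged if it was not renamed."""
    return RENAMED_IN_1_3.get(name, name)


# Hypothetical list of step names used by an older pipeline.
old_steps = ["ImputeNaN", "StandardizeData", "ArithmeticFeatureGenerator"]
new_steps = [current_class_name(name) for name in old_steps]
print(new_steps)
```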

Let's walk through how to use them.

Usage

First, import the packages used in this article.

import dataliner as dl
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

Starting with this version, the Titanic data is bundled with the package, which makes it easier to try things out.
Use the load_titanic function to load the sample Titanic data.

X, X_test, y = dl.load_titanic()

X now holds train.csv without the 'Survived' column, X_test holds test.csv, and y holds the 'Survived' column of train.csv.

Using UnionAppend

This release adds a step called UnionAppend, which I introduce here.

In DataLiner, any preprocessing step that adds new features derived from existing ones is, as a rule, given a class name of the form Append*.

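That is why ArithmeticFeatureGenerator was renamed to AppendArithmeticFeatures in this version.
The exceptions are BinarizeNaN and CountRowNaN. Since these should generally run before missing-value imputation and categorical encoding, I deliberately kept their names.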

Now suppose you want to add a full set of features and build a pipeline like this:

process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.StandardScaling(),
    dl.AppendCluster(),
    dl.AppendAnomalyScore(),
    dl.AppendPrincipalComponent(),
    dl.AppendClusterTargetMean(),
    dl.AppendClassificationModel(model=RandomForestClassifier(n_estimators=100, max_depth=5)),
    dl.AppendClusterDistance(),
    dl.AppendArithmeticFeatures(),
)
process.fit_transform(X, y)

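With this approach, the features added by AppendCluster, for example, become part of the input to the following AppendAnomalyScore. (The same holds for every later Append* step, so the feature set keeps growing.)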
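
The effect of serial chaining can be seen in a small toy sketch. The two functions below are hypothetical stand-ins for Append* steps, written with plain Python dictionaries (column name to list of values) rather than DataLiner or pandas. Each one derives a new column from every column it receives, so the second step also picks up the column added by the first.

```python
def append_row_sum(table):
    """Toy Append-style step: add 'row_sum', computed from every column it is given."""
    n_rows = len(next(iter(table.values())))
    out = dict(table)
    out["row_sum"] = [sum(col[i] for col in table.values()) for i in range(n_rows)]
    return out


def append_row_max(table):
    """Toy Append-style step: add 'row_max', computed from every column it is given."""
    n_rows = len(next(iter(table.values())))
    out = dict(table)
    out["row_max"] = [max(col[i] for col in table.values()) for i in range(n_rows)]
    return out


base = {"a": [1, 2], "b": [10, 20]}

# Serial chaining: append_row_max receives 'row_sum' as one of its inputs.
serial = append_row_max(append_row_sum(base))
print(serial["row_sum"])  # [11, 22]
print(serial["row_max"])  # [11, 22], because 'row_sum' is now the largest column
```

Computed on the base columns alone, `row_max` would be `[10, 20]`. The serial result differs only because the earlier step's output leaked into the later step's input.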

Sometimes you want these steps to run in parallel rather than in series, so that every Append* step starts from the same base features.
UnionAppend is for that case.

process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.StandardScaling(),
    dl.UnionAppend([
        dl.AppendCluster(),
        dl.AppendAnomalyScore(),
        dl.AppendPrincipalComponent(),
        dl.AppendClusterTargetMean(),
        dl.AppendClassificationModel(model=RandomForestClassifier(n_estimators=100, max_depth=5)),
        dl.AppendClusterDistance(),
        dl.AppendArithmeticFeatures(),
    ]),
)
process.fit_transform(X, y)

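Pass the Append* instances you want to apply to UnionAppend as a list. Every step inside UnionAppend then works from the same base features, and the outputs of the individual Append* steps are concatenated and returned.
The result looks like this: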

PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Cluster_Number	Anomaly_Score	Principal_Component_0	Principal_Component_1	Principal_Component_2	Principal_Component_3	Principal_Component_4	cluster_mean	Predicted_RandomForestClassifier	Cluster_Distance_0	Cluster_Distance_1	Cluster_Distance_2	Cluster_Distance_3	Cluster_Distance_4	Cluster_Distance_5	Cluster_Distance_6	Cluster_Distance_7	Age_multiply_SibSp	PassengerId_multiply_SibSp	SibSp_multiply_Parch
-1.729	0.8269	-0.9994	-0.7373	-0.5921	0.4326	-0.4734	-0.1954	-0.5022	-0.3479	-0.5397	1	0.094260	-1.4177	0.1906	-0.35640	-1.398	-0.5801	0.1677	0	2.861	1.265	4.352	3.466	5.616	3.461	2.782	5.667	-0.2561	-0.7479	-0.2048
-1.725	-1.5652	-0.9994	1.3548	0.6384	0.4326	-0.4734	-0.1954	0.7864	0.1665	2.0434	5	-0.047463	1.9956	0.1777	-0.14888	-2.449	0.6941	0.4874	1	3.768	4.335	5.799	3.681	3.946	3.028	4.993	4.830	0.2762	-0.7463	-0.2048
-1.721	0.8269	-0.9994	1.3548	-0.2845	-0.4743	-0.4734	-0.1954	-0.4886	-0.3479	-0.5397	0	0.076929	-0.8234	0.2181	-1.24773	-1.380	-1.2529	0.7321	1	1.870	2.311	4.937	3.759	5.490	3.548	3.376	5.467	0.1349	0.8164	0.2245
-1.717	-1.5652	-0.9994	1.3548	0.4077	0.4326	-0.4734	0.2317	0.4205	0.8250	-0.5397	0	-0.000208	1.5823	0.2699	0.10503	-1.536	-1.6788	0.7321	1	2.835	3.547	5.352	3.058	4.090	3.970	4.338	3.846	0.1763	-0.7429	-0.2048
-1.714	0.8269	-0.9994	-0.7373	0.4077	-0.4743	-0.4734	-0.1954	-0.4861	-0.3479	-0.5397	1	0.106421	-1.2160	-0.7344	-0.09900	-1.500	-0.7327	0.1677	0	2.866	1.148	5.064	2.921	5.567	3.463	2.689	5.594	-0.1934	0.8127	0.2245
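
To make the "same base features, concatenated results" behaviour concrete, here is a minimal conceptual sketch in plain Python. It is not DataLiner's implementation. `union_append` and the two toy appenders are hypothetical names, and tables are modelled as dictionaries of column lists instead of DataFrames.

```python
def append_row_sum(table):
    """Toy Append-style step: add 'row_sum', computed from every column it is given."""
    n_rows = len(next(iter(table.values())))
    out = dict(table)
    out["row_sum"] = [sum(col[i] for col in table.values()) for i in range(n_rows)]
    return out


def append_row_max(table):
    """Toy Append-style step: add 'row_max', computed from every column it is given."""
    n_rows = len(next(iter(table.values())))
    out = dict(table)
    out["row_max"] = [max(col[i] for col in table.values()) for i in range(n_rows)]
    return out


def union_append(table, appenders):
    """Run every appender on the same untouched base table and merge only the new columns."""
    out = dict(table)
    for appender in appenders:
        result = appender(table)  # always the base table, never an earlier step's output
        for name, values in result.items():
            if name not in table:
                out[name] = values
    return out


base = {"a": [1, 2], "b": [10, 20]}
parallel = union_append(base, [append_row_sum, append_row_max])
print(parallel["row_sum"])  # [11, 22]
print(parallel["row_max"])  # [10, 20], computed from 'a' and 'b' only
```

Because each appender only ever sees `base`, the order of the list affects column order but not the values of the appended columns.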

Test data is processed the same way as usual.

process.transform(X_test)

PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Cluster_Number	Anomaly_Score	Principal_Component_0	Principal_Component_1	Principal_Component_2	Principal_Component_3	Principal_Component_4	cluster_mean	Predicted_RandomForestClassifier	Cluster_Distance_0	Cluster_Distance_1	Cluster_Distance_2	Cluster_Distance_3	Cluster_Distance_4	Cluster_Distance_5	Cluster_Distance_6	Cluster_Distance_7	Age_multiply_SibSp	PassengerId_multiply_SibSp	SibSp_multiply_Parch
1.733	0.8269	-0.9994	-0.7373	0.3692	-0.4743	-0.4734	-0.1954	-0.4905	-0.3479	0.06949	6	0.08314	-0.92627	-1.0572	0.17814	1.514	0.78456	0.1087	0	3.095	2.724	5.232	2.986	5.397	3.045	1.1646	5.405	-0.17512	-0.8219	0.2245
1.737	0.8269	-0.9994	1.3548	1.3306	0.4326	-0.4734	-0.1954	-0.5072	-0.3479	-0.53969	0	0.01921	-0.45407	-0.2239	0.40615	1.531	-0.36302	0.7321	0	2.677	3.744	5.022	3.451	5.503	3.924	2.7926	5.414	0.57556	0.7513	-0.2048
1.741	-0.3692	-0.9994	-0.7373	2.4843	-0.4743	-0.4734	-0.1954	-0.4531	-0.3479	0.06949	3	0.02651	0.04527	-2.0548	1.70715	1.119	0.49872	0.2277	0	4.047	3.880	6.207	2.345	5.441	3.955	2.9554	5.527	-1.17825	-0.8256	0.2245
1.745	0.8269	-0.9994	-0.7373	-0.2076	-0.4743	-0.4734	-0.1954	-0.4737	-0.3479	-0.53969	6	0.11329	-1.17022	-0.7993	0.02809	1.770	0.37658	0.1087	0	3.011	2.615	5.099	3.238	5.522	3.420	0.9194	5.466	0.09846	-0.8275	0.2245
1.749	0.8269	-0.9994	1.3548	-0.5921	0.4326	0.7672	-0.1954	-0.4008	-0.3479	-0.53969	0	0.02122	-0.63799	1.2879	-0.38498	1.920	-0.06859	0.7321	1	2.269	3.601	3.888	4.139	5.238	3.679	2.6813	5.425	-0.25613	0.7563	0.3319
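
The reason `transform` rather than `fit_transform` is called on the test data is that a fitted step reuses what it learned from the training data. The toy class below sketches that fit/transform split in plain Python. `AppendCenteredValue` is a hypothetical example and not a DataLiner class, and the `fare` values are made up for illustration.

```python
from statistics import mean


class AppendCenteredValue:
    """Toy stateful step: learn a column mean in fit, reuse it in transform."""

    def __init__(self, column):
        self.column = column
        self.mean_ = None

    def fit(self, table):
        # Statistics are learned from the training table only.
        self.mean_ = mean(table[self.column])
        return self

    def transform(self, table):
        if self.mean_ is None:
            raise RuntimeError("call fit before transform")
        out = dict(table)
        out[self.column + "_centered"] = [v - self.mean_ for v in table[self.column]]
        return out


train = {"fare": [10.0, 20.0, 30.0]}  # made-up values
test = {"fare": [40.0]}

step = AppendCenteredValue("fare").fit(train)
test_out = step.transform(test)  # centered with the training mean (20.0)
print(test_out["fare_centered"])  # [20.0]
```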

Conclusion

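With this, the features I originally planned and the preprocessing steps I actually use are all implemented.
Going forward I will fix bugs and add new preprocessing steps as I find or think of them.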
