
[Python] What on earth is a Pipeline...?

Last updated at 2020-09-21, posted at 2020-09-20

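Note: this page is for people who want to get a rough feel for what Pipeline is.

Hello.

Out of the blue: I've been interested in machine learning and deep learning, so I recently tried entering a Kaggle competition.
Kaggle has a Notebook feature, so I was all fired up to read and understand the code people share there, but...

"What is this? I don't understand any of it."

I had basically no programming knowledge, so the code in Kaggle Notebooks looked like nothing but cipher text to me (lol).
So I decided to work through it slowly, one piece at a time, and write it up here like a diary.

This time the topic is "Pipeline".

Note: the data used in this article is the iris dataset.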

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris_data = datasets.load_iris()
input_data = iris_data.data
correct = iris_data.target
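
For anyone who wants to see what was just loaded, here is a small sketch that prints the basic shape of the iris data. It reuses the variable names from the snippet above; nothing else is assumed.

```python
from sklearn import datasets

iris_data = datasets.load_iris()
input_data = iris_data.data      # feature matrix (one row per flower)
correct = iris_data.target       # class label for each row

print(input_data.shape)                  # (150, 4)
print(sorted(set(correct.tolist())))     # [0, 1, 2]
print(iris_data.feature_names)           # names of the four measurements
```

So there are 150 samples with 4 numeric features each, and three classes to predict.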

To start with, I went to the following page.
sklearn.pipeline.Pipeline — scikit-learn 0.23.2 documentation

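According to it, the basic form is

from sklearn.pipeline import Pipeline
pipe = Pipeline([(preprocessing step), (learning method)])
pipe.fit(explanatory_variables, target_variable)

and apparently it lets you make your code more concise.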
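
To make the basic form a bit more concrete: each entry in the list is a `(name, estimator)` tuple, and the names can be used to look the steps up later. The following is a minimal sketch; the step names `"scaler"` and `"clf"` are arbitrary labels chosen for this illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Each step is a (name, estimator) pair; the last step is the model itself.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier()),
])

# The steps are stored in order and can be looked up by name.
step_names = [name for name, _ in pipe.steps]
print(step_names)                                   # ['scaler', 'clf']
print(type(pipe.named_steps["scaler"]).__name__)    # StandardScaler
```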

Based on this, I trained a random forest on the iris data.

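from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Split iris into training and test sets (75% / 25% by default)
X_train, X_test, y_train, y_test = train_test_split(input_data, correct)
# Step 1 standardizes the features, step 2 trains the classifier
pipe = Pipeline([('scaler', StandardScaler()),
                 ('RandomForestClassifier', RFC())])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

# 0.9473684210526315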

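With the above, the explanatory variables are standardized and then a random forest is trained on them.
Bundling the steps into a Pipeline like this makes the code more concise.

Below is the code I wrote to double-check this.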

from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(input_data, correct)
tr_x, te_x, tr_y, te_y = X_train.copy(), X_test.copy(), y_train.copy(), y_test.copy() # copies for the manual check

pipe = Pipeline([('scaler', StandardScaler()),
                 ('Classifier', RFC())])
pipe.fit(X_train, y_train)
print("pipe score = " + str(pipe.score(X_test, y_test)))


# Manual version: fit the scaler on the training data only,
# then reuse the same fitted scaler for the test data (as Pipeline does)
stdsc = StandardScaler()
tr_x = stdsc.fit_transform(tr_x)
te_x = stdsc.transform(te_x)

clf = RFC()
clf.fit(tr_x, tr_y)
print("RFC score = ", clf.score(te_x, te_y))

# Output reported in the original post (RFC is unseeded, so values vary between runs):
# pipe score = 0.9473684210526315
# RFC score =  0.9473684210526315

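The manual check gave the same score, which suggests the preprocessing inside the Pipeline is working as expected. (The random forest here is not seeded, so the exact scores can differ from run to run.)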
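
One way to make this kind of check reproducible is to fix the random seeds and compare predictions directly rather than scores. The sketch below is my own variation, not the code from the run above: the `random_state=0` values and the step name `"clf"` are assumptions added for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline version: scaling and training in one object.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)

# Manual version: fit the scaler on the training data only, then reuse it.
scaler = StandardScaler().fit(X_train)
clf = RandomForestClassifier(random_state=0)
clf.fit(scaler.transform(X_train), y_train)

# The scaler inside the pipeline learned its statistics from X_train only.
same_mean = np.allclose(pipe.named_steps["scaler"].mean_, X_train.mean(axis=0))
# With identical seeds and identical inputs, the two models predict the same labels.
same_pred = np.array_equal(pipe.predict(X_test),
                           clf.predict(scaler.transform(X_test)))
print(same_mean, same_pred)
```

The point to notice is that the test data is only ever passed through `transform`, never `fit`, both inside the Pipeline and in the manual version.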

I see, I think I roughly get what Pipeline is now.
But there are lots of different preprocessing steps. Can it only run one of them?

Apparently you can bundle multiple steps together.

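from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# These two steps are meant for categorical (string) columns
preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # fill missing values with the string 'missing'
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])                    # one-hot encoding


rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RFC())])

# X_train is assumed to hold categorical features here; the purely numeric iris
# arrays from above are not a fit for fill_value='missing'
rf.fit(X_train, y_train)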

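In this way,
for the (preprocessing step) slot of the basic form pipe = Pipeline([(preprocessing step), (learning method)]),
one option seems to be nesting a Pipeline inside another Pipeline, a bit like how rules nest in BNF notation (this is only a loose analogy).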
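
To try the nested form end to end, here is a small self-contained sketch. The toy data (`X_cat`, `y_cat`) is entirely made up for illustration: imputing with the string 'missing' and one-hot encoding only make sense for categorical columns, so the iris data is not used here. The `random_state=0` is also my own addition.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns, np.nan marks missing entries.
X_cat = np.array([
    ["red", "small"],
    ["blue", np.nan],
    ["red", "large"],
    [np.nan, "small"],
    ["blue", "large"],
    ["red", "small"],
], dtype=object)
y_cat = np.array([0, 1, 0, 1, 1, 0])

# Inner pipeline: two preprocessing steps chained together.
preprocessing = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Outer pipeline: the inner pipeline is used as a single step.
nested = Pipeline([
    ("preprocess", preprocessing),
    ("classifier", RandomForestClassifier(random_state=0)),
])
nested.fit(X_cat, y_cat)

# Steps of the inner pipeline are reachable through the outer one.
onehot = nested.named_steps["preprocess"].named_steps["onehot"]
print([list(c) for c in onehot.categories_])

# "green" was never seen in training; handle_unknown="ignore" lets it through.
new_pred = nested.predict(np.array([["green", "small"]], dtype=object))
print(new_pred)
```

After fitting, 'missing' shows up as one of the learned categories, which shows that the imputer ran before the encoder.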
