Notes on using sklearn's Pipeline (with SelectFromModel and GridSearchCV) and how it behaves internally

Last updated at 2022-09-07. Posted at 2021-11-25.

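In scikit-learn, feature selection and a predictive model can be chained and run as a single unit with Pipeline.

However, Pipeline hides the chained processing, so these notes record what I checked about how each step is actually executed.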

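As a premise, Pipeline accepts a sequence of models,
conceptually Pipeline(model_1, model_2, model_3, model_4) (in practice a list of (name, model) tuples), under two conditions:
・The final model (model_4 in this example) only needs .fit
・The intermediate models (model_1, 2, 3) need both .fit and .transform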

As long as these conditions are met, models from outside sklearn (XGBoost, LightGBM) can also be put into a pipeline.
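The fit/transform requirement can be sketched with a hand-written intermediate step. ColumnDropper below is a hypothetical class made up for this illustration (it is not part of sklearn), and the data is synthetic; the point is only that any object exposing fit and transform can sit in the middle of a pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class ColumnDropper(TransformerMixin, BaseEstimator):
    """Hypothetical intermediate step: drops zero-variance columns."""

    def fit(self, X, y=None):
        # Remember which columns vary in the training data.
        self.keep_ = np.var(np.asarray(X), axis=0) > 0
        return self

    def transform(self, X):
        # Apply the mask learned in fit; nothing is re-estimated here.
        return np.asarray(X)[:, self.keep_]


# Synthetic data (an assumption for the example): column 2 is constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 2] = 1.0
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

pipe = Pipeline([("dropper", ColumnDropper()), ("estimator", LinearRegression())])
pipe.fit(X, y)

# The final step only ever saw the three non-constant columns.
n_features_seen = pipe.named_steps["estimator"].coef_.shape[0]
print(n_features_seen)
```

The final step here needs only fit for pipe.fit to work, but it also needs predict if pipe.predict is going to be called.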

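Hyperparameter optimization (GridSearchCV), MultiOutputRegressor, and similar wrappers can be combined with it as well, so the whole workflow can be written in a few lines.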

Basic form

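from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
selector = SelectFromModel(Lasso())
estimator = RandomForestRegressor()
pipe = Pipeline([('selector', selector), ('estimator', estimator)])
pipe.fit(X, y)  # X, y: training features and targets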

During pipe.fit, SelectFromModel is fitted and its transform is applied to X,
and the feature-selected X_selected is what gets passed to the estimator.
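One way to confirm this behavior is to compare the number of features the selector kept with the number of features the final estimator was fitted on. The sketch below uses synthetic data from make_regression, and alpha=1.0 and n_estimators=10 are arbitrary choices for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

# Synthetic regression data (an assumption for the example).
X, y = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=1.0, random_state=0
)

pipe = Pipeline([
    ("selector", SelectFromModel(Lasso(alpha=1.0))),
    ("estimator", RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipe.fit(X, y)

# Boolean mask of the columns kept by the fitted selector.
mask = pipe.named_steps["selector"].get_support()
n_selected = int(mask.sum())

# Number of columns the random forest was actually trained on.
n_seen = pipe.named_steps["estimator"].n_features_in_
print(n_selected, n_seen)
```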

Optimizing a Pipeline with GridSearchCV

from sklearn.model_selection import GridSearchCV

selector = SelectFromModel(Lasso())
estimator = RandomForestRegressor()
pipe = Pipeline([('selector', selector), ('estimator', estimator)])
parameters = {'estimator__n_estimators': [10, 100, 200]}
gscv = GridSearchCV(pipe, parameters)
gscv.fit(X, y)

Note that the parameter names passed to GridSearchCV change to the form
<step name>__<parameter name>, as in 'estimator__n_estimators'.
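When unsure which names are valid, the pipeline's get_params() lists them. A minimal sketch, using the same step names as above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("selector", SelectFromModel(Lasso())),
    ("estimator", RandomForestRegressor()),
])

# Every key returned here is a valid key for a GridSearchCV parameter grid.
param_names = sorted(pipe.get_params().keys())

# Nested models chain further: step name, attribute name, then parameter name.
nested = [name for name in param_names if name.startswith("selector__estimator__")]
print(nested)
```

This also shows that the Lasso inside SelectFromModel can be tuned through names such as 'selector__estimator__alpha'.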

Going further, a trickier arrangement is also possible: putting GridSearchCV inside the pipeline.

# Put GridSearchCV inside the pipeline
selector = SelectFromModel(Lasso())
estimator = RandomForestRegressor()
parameters = {'n_estimators': [10, 100, 200]}
searcher = GridSearchCV(estimator, parameters)
pipe = Pipeline([('selector', selector), ('searcher', searcher)])
pipe.fit(X, y)

pipe.predict(X_test)

In this case,
pipe.predict uses the RandomForest that GridSearchCV already tuned during fit;
the grid search is not run again at predict time.
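This can be checked by reproducing pipe.predict by hand from the fitted steps. The sketch below assumes synthetic data and a deliberately small grid so it runs quickly; the searcher's default refit=True is what makes best_estimator_ available.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data split into train and test parts (an assumption for the example).
X, y = make_regression(
    n_samples=120, n_features=10, n_informative=3, noise=1.0, random_state=0
)
X_train, X_test, y_train = X[:90], X[90:], y[:90]

searcher = GridSearchCV(
    RandomForestRegressor(random_state=0), {"n_estimators": [5, 20]}, cv=3
)
pipe = Pipeline([
    ("selector", SelectFromModel(Lasso(alpha=1.0))),
    ("searcher", searcher),
])
pipe.fit(X_train, y_train)

fitted_searcher = pipe.named_steps["searcher"]
best_before = dict(fitted_searcher.best_params_)

pred_pipe = pipe.predict(X_test)

# Manual path: apply the fitted selector, then the refitted best estimator.
X_test_selected = pipe.named_steps["selector"].transform(X_test)
pred_manual = fitted_searcher.best_estimator_.predict(X_test_selected)

print(np.allclose(pred_pipe, pred_manual))
```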

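One caveat: the features SelectFromModel selects are determined by the data it is fitted on. They are reused unchanged at predict time, but they can differ each time the pipeline is refitted (for example on another training split), and there is no simple way to pin them to a fixed set, so sklearn's stock implementation may be awkward to use in this setup.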

OptunaSearchCV

The same optimization can be done with OptunaSearchCV, Optuna's hyperparameter search class.

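If scoring is not specified, the estimator's default score (R² for regressors) is used.
However, when LeaveOneOut is specified as the inner CV, R² cannot be computed on a single-sample test fold and an error results,
so "neg_mean_squared_error" is a good substitute.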

from optuna.integration import OptunaSearchCV
# Note: IntUniformDistribution is deprecated since Optuna 3.0 in favor of IntDistribution
from optuna.distributions import IntUniformDistribution

selector = SelectFromModel(Lasso())
estimator = RandomForestRegressor()
pipe = Pipeline([('selector', selector), ('estimator', estimator)])
parameters = {'estimator__n_estimators':IntUniformDistribution(3, 15)}
opcv = OptunaSearchCV(pipe, parameters, scoring="neg_mean_squared_error")
opcv.fit(X,y)
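The scoring note above is not specific to Optuna. As a sketch that runs with sklearn alone, the same LeaveOneOut plus "neg_mean_squared_error" combination can be tried with GridSearchCV; the data is synthetic and the grid is kept tiny because LeaveOneOut fits one model per sample.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Small synthetic data set (an assumption for the example).
X, y = make_regression(
    n_samples=20, n_features=5, n_informative=2, noise=1.0, random_state=0
)

gscv = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [5, 10]},
    cv=LeaveOneOut(),                   # every test fold holds exactly one sample
    scoring="neg_mean_squared_error",   # well defined even for a single sample
)
gscv.fit(X, y)

# Scores are negated errors, so values closer to zero are better.
best_score = gscv.best_score_
mean_scores = gscv.cv_results_["mean_test_score"]
print(best_score, mean_scores)
```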

Retrieving the estimator

To retrieve the instance of each model from the pipeline after training:

def get_estimator(
    model,
    model_type="estimator",
    remove_multioutput=False,
    remove_pipeline=True,
    remove_searcher=True,
):
    if model_type in ["pipeline", "Pipeline", "pipe", "Pipe"]:
        remove_pipeline = False

    if model_type is None:
        return model, "model"

    if (
        model.__class__.__name__
        in [
            "OptunaSearchCV",
            "GridSearchCV",
        ]
    ) and remove_searcher:
        print("SearchCV")
        print(f"Search {model.estimator} {model_type}")
        if hasattr(model, "best_estimator_"):
            return get_estimator(
                model.best_estimator_,
                model_type,
                remove_multioutput,
                remove_pipeline,
            )
        else:
            return get_estimator(
                model.estimator, model_type, remove_multioutput, remove_pipeline
            )

    if model.__class__.__name__ == "Pipeline":
        print("Pipeline")

        # When remove_pipeline is False, fall through and return the Pipeline itself.
        if remove_pipeline:
            indexes = [
                idx
                for idx, (name, class_) in enumerate(model.steps)
                if model_type in str(name)
            ]
            if len(indexes) == 0:
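                # IdentityMapping is not defined in this article; it is assumed to be
                # a pass-through placeholder class defined elsewhere by the author.
                return IdentityMapping(), ""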
            else:
                print(f"Search {model.steps[indexes[0]][1]} {model_type}")

                return get_estimator(
                    model.steps[indexes[0]][1],
                    model_type,
                    remove_multioutput,
                    remove_pipeline,
                )

    if model.__class__.__name__ in [
        "MultiOutputRegressor",
        "MultiOutputClassifier",
    ]:
        print("MultiOutput")

        if remove_multioutput:
            print(f"Search {model.estimator} {model_type}")

            return get_estimator(
                model.estimator, model_type, remove_multioutput, remove_pipeline
            )

    if model.__class__.__name__ in [
        "MultiOutputRegressor",
        "MultiOutputClassifier",
    ]:
        model_name = model.estimator.__class__.__name__
    else:
        model_name = model.__class__.__name__

    return model, model_name
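When the nesting is known in advance, sklearn's own attributes reach the same objects without a helper. The sketch below is an alternative to get_estimator rather than a replacement for it, and it assumes synthetic data and the step names used earlier: best_estimator_ unwraps a fitted search, named_steps unwraps a pipeline, and estimators_ holds the per-target models of a MultiOutputRegressor.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline

# Synthetic data with two targets (an assumption for the example).
X, Y = make_regression(
    n_samples=60, n_features=8, n_informative=3, n_targets=2,
    noise=1.0, random_state=0,
)
y = Y[:, 0]

pipe = Pipeline([
    ("selector", SelectFromModel(Lasso(alpha=1.0))),
    ("estimator", RandomForestRegressor(random_state=0)),
])
gscv = GridSearchCV(pipe, {"estimator__n_estimators": [5, 10]}, cv=3)
gscv.fit(X, y)

# Search -> refitted pipeline -> individual fitted steps.
best_pipe = gscv.best_estimator_
inner = best_pipe.named_steps["estimator"]
selector = best_pipe.named_steps["selector"]

# MultiOutputRegressor keeps one fitted model per target in estimators_.
multi = MultiOutputRegressor(RandomForestRegressor(n_estimators=5, random_state=0))
multi.fit(X, Y)
per_target = multi.estimators_

print(type(inner).__name__, inner.n_estimators, len(per_target))
```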
