A cheat sheet for using scikit-learn in a practical workflow

Posted at 2026-06-17

Overview

  -> Dataset
  -> Preprocessing
  -> Training
  -> Evaluation
  -> Tuning
  -> Wrapping in a Pipeline

datasets         -> load_breast_cancer / load_diabetes / load_iris / make_classification / make_regression
preprocessing    -> StandardScaler / MinMaxScaler / MaxAbsScaler / RobustScaler / OneHotEncoder / LabelEncoder
impute           -> SimpleImputer
feature_selection -> SelectKBest
linear_model     -> LinearRegression / LogisticRegression / Ridge / Lasso / ElasticNet
neighbors        -> KNeighborsClassifier / KNeighborsRegressor
tree             -> DecisionTreeClassifier / DecisionTreeRegressor
svm              -> SVC / SVR
ensemble         -> RandomForestClassifier / RandomForestRegressor / AdaBoostClassifier / AdaBoostRegressor / GradientBoostingClassifier / GradientBoostingRegressor / VotingClassifier / StackingClassifier
metrics          -> accuracy_score / precision_score / recall_score / f1_score / confusion_matrix / mean_absolute_error / mean_squared_error (take np.sqrt for RMSE) / r2_score
model_selection  -> train_test_split / StratifiedKFold / cross_val_score / GridSearchCV / RandomizedSearchCV
pipeline         -> Pipeline
compose          -> ColumnTransformer
cluster          -> KMeans / AgglomerativeClustering / DBSCAN / MeanShift
decomposition    -> PCA / KernelPCA / TruncatedSVD / NMF

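An introduction to the scikit-learn library through the practical machine learning workflow

Part 1: Basics of the practical workflow

1. The common sklearn flow

Purpose: learn the minimal pattern shared by the estimator APIs.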

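from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Any feature matrix X and target y work here; the bundled breast cancer data is used as a stand-in.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)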

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=17,
    stratify=y,
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)         # predicted class labels
score = model.score(X_test, y_test)  # mean accuracy for classifiers

2. Loading sample data

Purpose: get practice data to try things out right away.

from sklearn.datasets import (
    load_breast_cancer,
    load_diabetes,
    load_iris,
    make_classification,
    make_regression,
)

breast_cancer = load_breast_cancer(as_frame=True)  # binary classification
diabetes = load_diabetes(as_frame=True)            # regression
iris = load_iris(as_frame=True)                    # multi-class classification

X_cls, y_cls = make_classification(n_samples=200, n_features=10, random_state=17)
X_reg, y_reg = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=17)

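3. Minimum checks when framing the problem and doing EDA

Purpose: fix the points to check before choosing a model.

What do you want to predict?
Is it classification or regression?
Which is costlier: a missed detection (false negative) or a false alarm (false positive)?
Are missing values, outliers, or class imbalance severe?
Are any columns mixed in that could leak the target?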
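
The checks above can be sketched with pandas as follows. This is a minimal sketch on the bundled breast cancer data. The 1.5 * IQR rule for flagging outliers, the correlation check as a leakage hint, and all variable names are illustrative assumptions, not part of the original checklist.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame  # feature columns plus a "target" column
features = df.drop(columns="target")

# Classification or regression? Count the distinct target values.
n_classes = df["target"].nunique()

# Class imbalance: share of each class.
class_ratio = df["target"].value_counts(normalize=True)

# Missing values per column.
missing_counts = df.isna().sum()

# Outliers: values outside 1.5 * IQR per column (a common rule of thumb).
q1, q3 = features.quantile(0.25), features.quantile(0.75)
iqr = q3 - q1
outlier_counts = ((features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)).sum()

# Leakage hint: features almost perfectly correlated with the target deserve a second look.
target_corr = features.corrwith(df["target"]).abs().sort_values(ascending=False)

print(class_ratio)
print(missing_counts.sum())
print(outlier_counts.sort_values(ascending=False).head())
print(target_corr.head())
```

Which error type is costlier cannot be read from the data; it has to come from the business context and then drives the choice of metric and threshold.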

Part 2: Preprocessing and feature engineering

1. Scaling

Purpose: even out scale differences for distance-based and linear models.

from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler

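standard_scaler = StandardScaler()  # the default first choice
minmax_scaler = MinMaxScaler()      # rescales to the range 0 to 1
maxabs_scaler = MaxAbsScaler()      # works well with sparse matrices
robust_scaler = RobustScaler()      # more robust to outliers

X_scaled = standard_scaler.fit_transform(X_train)
X_test_scaled = standard_scaler.transform(X_test)  # reuse the training statistics; do not refit on test data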

2. Distribution transforms

Purpose: make heavily skewed numeric features easier to handle.

from sklearn.preprocessing import PowerTransformer, QuantileTransformer

quantile_normal = QuantileTransformer(output_distribution="normal")
quantile_uniform = QuantileTransformer(output_distribution="uniform")
power_transformer = PowerTransformer()

X_quantile = quantile_normal.fit_transform(X_num)
X_power = power_transformer.fit_transform(X_num)

3. Binning

Purpose: cut continuous values into intervals.

from sklearn.preprocessing import KBinsDiscretizer

binning_quantile = KBinsDiscretizer(n_bins=5, strategy="quantile", encode="ordinal")
binning_uniform = KBinsDiscretizer(n_bins=5, strategy="uniform", encode="ordinal")

X_binned = binning_quantile.fit_transform(X_num)

4. Missing-value imputation

Purpose: use different imputation strategies for numeric and categorical columns.

from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")
constant_imputer = SimpleImputer(strategy="constant", fill_value="missing")

X_num_filled = num_imputer.fit_transform(X_num)
X_cat_filled = cat_imputer.fit_transform(X_cat)

5. Categorical encoding

Purpose: use different encoders for the target and for the features.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder = LabelEncoder()  # mainly for the target y
y_encoded = label_encoder.fit_transform(y)

onehot_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_cat_encoded = onehot_encoder.fit_transform(X_cat)

categories = onehot_encoder.categories_
feature_names = onehot_encoder.get_feature_names_out()

In scikit-learn versions before 1.2, use sparse=False instead of sparse_output=False.
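
If the same code has to run on both old and new installations, one option is to pick the keyword from the installed version. The following is a minimal sketch under that assumption; the version parsing and the toy `X_cat_demo` array are illustrative and not from the original article.

```python
import numpy as np
import sklearn
from sklearn.preprocessing import OneHotEncoder

# Parse "major.minor" from a version string such as "1.4.2".
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])

# sparse_output was introduced in 1.2; earlier releases only know sparse.
dense_kwarg = {"sparse_output": False} if (major, minor) >= (1, 2) else {"sparse": False}

encoder = OneHotEncoder(handle_unknown="ignore", **dense_kwarg)

X_cat_demo = np.array([["tokyo"], ["osaka"], ["tokyo"]])  # hypothetical toy data
encoded = encoder.fit_transform(X_cat_demo)  # dense ndarray, one column per category

print(encoder.categories_)
print(encoded)
```

Categories are sorted, so the first output column corresponds to "osaka" and the second to "tokyo".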

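6. Feature selection

Purpose: reduce unneeded features and curb overfitting.

f_regression is for regression; mutual_info_classif and chi2 are for classification. chi2 requires non-negative feature values.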

from sklearn.feature_selection import SelectKBest, chi2, f_regression, mutual_info_classif

selector_reg = SelectKBest(score_func=f_regression, k=10)
selector_cls = SelectKBest(score_func=mutual_info_classif, k=10)
selector_chi2 = SelectKBest(score_func=chi2, k=10)

X_reg_selected = selector_reg.fit_transform(X_reg_train, y_reg_train)
X_cls_selected = selector_cls.fit_transform(X_cls_train, y_cls_train)
X_non_negative_selected = selector_chi2.fit_transform(X_non_negative, y_cls_train)

support_mask = selector_cls.get_support()

7. ColumnTransformer and Pipeline

Purpose: combine everything from preprocessing to training into a single object.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income"]
categorical_features = ["city", "plan"]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
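
The pipeline above is only defined, not fitted. The following minimal sketch rebuilds it and runs it end to end on a tiny DataFrame. The data values, `train_df`, and `new_df` are hypothetical and exist only to show that missing values and an unseen category ("nagoya") pass through without errors.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in numeric and categorical columns.
train_df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [300, 420, 380, np.nan, 610, 500],
    "city": ["tokyo", "osaka", "tokyo", np.nan, "osaka", "tokyo"],
    "plan": ["basic", "pro", "basic", "pro", "pro", "basic"],
})
y_train = [0, 1, 0, 1, 1, 0]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_transformer, ["age", "income"]),
    ("cat", categorical_transformer, ["city", "plan"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])

# Imputation, scaling, and encoding statistics are learned from the training data only.
pipeline.fit(train_df, y_train)

# New row: missing income and a city never seen during fit.
new_df = pd.DataFrame({"age": [29], "income": [np.nan], "city": ["nagoya"], "plan": ["pro"]})
pred = pipeline.predict(new_df)
proba = pipeline.predict_proba(new_df)

# Width of the preprocessed matrix: 2 numeric + 2 city + 2 plan columns.
n_transformed_columns = pipeline[:-1].transform(new_df).shape[1]
```

Because the unseen city is encoded as all zeros by handle_unknown="ignore", prediction still works; whether that behavior is acceptable depends on the use case.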

Part 3: Models

1. Regression models

Purpose: predict continuous values.

from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

linear_regression = LinearRegression()                  # baseline model
ridge = Ridge(alpha=1.0)                               # L2 regularization
lasso = Lasso(alpha=0.01)                              # L1 regularization
elastic_net = ElasticNet(alpha=0.01, l1_ratio=0.5)     # mix of L1 and L2
knn_regressor = KNeighborsRegressor(n_neighbors=8)     # distance-based
tree_regressor = DecisionTreeRegressor(max_depth=4, random_state=17)
svr = SVR(C=3.0, epsilon=0.2)                          # nonlinear regression

2. Classification models

Purpose: predict classes.

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

logistic_regression = LogisticRegression(max_iter=500, random_state=17)
gaussian_nb = GaussianNB()
knn_classifier = KNeighborsClassifier(n_neighbors=5, weights="uniform")
tree_classifier = DecisionTreeClassifier(max_depth=4, random_state=17)
svc = SVC(C=1.0, gamma="scale", probability=True, random_state=17)

3. Ensemble learning

Purpose: build a stronger baseline than a single model.

from sklearn.ensemble import (
    AdaBoostClassifier,
    AdaBoostRegressor,
    GradientBoostingClassifier,
    GradientBoostingRegressor,
    RandomForestClassifier,
    RandomForestRegressor,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

random_forest_classifier = RandomForestClassifier(n_estimators=200, random_state=17)
random_forest_regressor = RandomForestRegressor(n_estimators=200, random_state=17)
adaboost_classifier = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=17)
adaboost_regressor = AdaBoostRegressor(n_estimators=200, learning_rate=0.1, random_state=17)
gradient_boosting_classifier = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, random_state=17)
gradient_boosting_regressor = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=17)

voting_classifier = VotingClassifier(
    estimators=[
        ("lr", Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=500, random_state=17))])),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=17)),
        ("svc", Pipeline([("scaler", StandardScaler()), ("model", SVC(probability=True, random_state=17))])),
    ],
    voting="soft",
)

stacking_classifier = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=17)),
        ("svc", Pipeline([("scaler", StandardScaler()), ("model", SVC(probability=True, random_state=17))])),
    ],
    final_estimator=LogisticRegression(max_iter=500, random_state=17),
)

4. Unsupervised learning

Purpose: perform clustering and dimensionality reduction.

from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans, MeanShift
from sklearn.decomposition import KernelPCA, NMF, PCA, TruncatedSVD

kmeans = KMeans(n_clusters=4, random_state=17, n_init=10)
agglomerative = AgglomerativeClustering(n_clusters=4)
dbscan = DBSCAN(eps=0.5, min_samples=5)
mean_shift = MeanShift()

pca = PCA(n_components=3)
kernel_pca = KernelPCA(n_components=3, kernel="rbf")
truncated_svd = TruncatedSVD(n_components=50, random_state=17)
nmf = NMF(n_components=10, random_state=17)
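
The estimators above are only instantiated. As a minimal sketch of how they are typically used, the following fits KMeans and PCA on the standardized iris data. The choice of dataset, n_clusters=3, and the use of silhouette_score as a quality check are assumptions made for this illustration and do not come from the original article.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_iris, _ = load_iris(return_X_y=True)  # labels are ignored on purpose
X_std = StandardScaler().fit_transform(X_iris)

# Clustering: fit_predict returns one cluster label per sample.
kmeans = KMeans(n_clusters=3, random_state=17, n_init=10)
labels = kmeans.fit_predict(X_std)

# Silhouette ranges from -1 to 1; higher means better separated clusters.
silhouette = silhouette_score(X_std, labels)

# Dimensionality reduction: project to 2 components and check how much variance is kept.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
explained = pca.explained_variance_ratio_

print(silhouette)
print(explained.sum())
```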

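Part 4: Evaluation metrics

1. Evaluating classification

Purpose: do not judge by accuracy alone.

The following is a binary classification example. For multi-class classification, specify average explicitly and state which class's probability you are looking at.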

from sklearn.metrics import (
    accuracy_score,
    auc,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_curve,
    precision_score,
    recall_score,
    roc_curve,
)

pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)
matrix = confusion_matrix(y_test, pred)
report = classification_report(y_test, pred)

precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, prob)
fpr, tpr, roc_thresholds = roc_curve(y_test, prob)
roc_auc = auc(fpr, tpr)
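
For the multi-class case mentioned above, a minimal sketch on the iris data could look like the following. The choice of average="macro" and "weighted", the one-vs-rest ROC AUC via roc_auc_score, and picking class 2 as the class of interest are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)
prob = model.predict_proba(X_test)  # shape (n_samples, n_classes)

# "macro" averages per-class scores equally; "weighted" weights them by class frequency.
precision_macro = precision_score(y_test, pred, average="macro")
recall_macro = recall_score(y_test, pred, average="macro")
f1_macro = f1_score(y_test, pred, average="macro")
f1_weighted = f1_score(y_test, pred, average="weighted")

# One-vs-rest ROC AUC needs the full probability matrix, not a single column.
roc_auc_ovr = roc_auc_score(y_test, prob, multi_class="ovr")

# To inspect a single class, look up its column through classes_ instead of hard-coding it.
class_index = list(model.classes_).index(2)
prob_class_2 = prob[:, class_index]
```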

2. Evaluating regression

Purpose: look at the typical error and the large misses separately.

import numpy as np

from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
mape = mean_absolute_percentage_error(y_test, pred)
r2 = r2_score(y_test, pred)

3. Threshold tuning

Purpose: change the decision criterion for classification.

import numpy as np

prob = model.predict_proba(X_test)[:, 1]
pred_threshold = np.where(prob >= 0.7, 1, 0)
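
The value 0.7 above is a fixed example. One way to derive a threshold from a requirement is to read it off the precision-recall curve. The sketch below assumes a hypothetical requirement of recall >= 0.95 and chooses the threshold on out-of-fold training predictions so that the test set stays untouched; both choices are illustrative, not the article's method.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])

# Out-of-fold probabilities of the positive class on the training data.
oof_prob = cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")[:, 1]

precision_vals, recall_vals, thresholds = precision_recall_curve(y_train, oof_prob)

# precision_vals and recall_vals have one more entry than thresholds; drop that last point.
target_recall = 0.95  # hypothetical requirement
candidates = np.where(recall_vals[:-1] >= target_recall)[0]

# Among thresholds that meet the recall requirement, take the one with the best precision.
best_index = candidates[np.argmax(precision_vals[:-1][candidates])]
chosen_threshold = thresholds[best_index]

# Apply the chosen threshold to the test set.
model.fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]
pred_threshold = (prob >= chosen_threshold).astype(int)

print(chosen_threshold, recall_score(y_test, pred_threshold))
```

The recall guarantee holds on the out-of-fold training predictions only; the recall on the test set can be lower or higher.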

Part 5: Validation, improvement, and reproducibility

1. Cross-validation

Purpose: reduce the variation caused by a lucky or unlucky split.

from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=17)

cv_scores = cross_val_score(model, X, y, cv=stratified_cv)
mean_cv_score = cv_scores.mean()

2. Hyperparameter search

Purpose: systematically search the parameters that tend to matter most.

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=17),
    param_grid=param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1,
)

param_distributions = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [3, 5, 7, None],
    "max_features": ["sqrt", "log2", None],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=17),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="f1",
    random_state=17,
    n_jobs=-1,
)
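
The search objects above are defined but not fitted. As a minimal sketch, the following runs a grid search over a Pipeline, where step parameters are addressed as "<step name>__<parameter>". The dataset, the step names, and the grid of C values are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=17)),
])

# "model__C" targets the C parameter of the step named "model".
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)  # the scaler is refit inside every fold, so no leakage across folds

best_params = grid.best_params_
best_cv_score = grid.best_score_          # mean cross-validated F1 of the best setting
test_score = grid.score(X_test, y_test)   # F1 of the refitted best pipeline on held-out data

print(best_params, best_cv_score, test_score)
```

RandomizedSearchCV is used the same way; only param_distributions and n_iter replace param_grid.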

3. How to spot overfitting

Purpose: look at the gap between the training score and the test score.

model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
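
To make the gap concrete, the following minimal sketch compares decision trees of different depths. The dataset and the depth values are assumptions for illustration; a training score far above the test score is the pattern to watch for.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y
)

scores = {}
for max_depth in [1, 3, None]:  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=17)
    tree.fit(X_train, y_train)
    train_score = tree.score(X_train, y_train)
    test_score = tree.score(X_test, y_test)
    scores[max_depth] = (train_score, test_score)
    print(f"max_depth={max_depth}: train={train_score:.3f} test={test_score:.3f} "
          f"gap={train_score - test_score:.3f}")
```

A single split can be misleading, so it is usually worth confirming the gap with cross-validation as well.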

4. Checking feature importance

Purpose: get a rough idea of which columns matter in tree-based models.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=300, random_state=17)
rf_model.fit(X_train, y_train)

# X_train is assumed to be a pandas DataFrame so that column names are available.
importance = pd.Series(rf_model.feature_importances_, index=X_train.columns)
top_importance = importance.sort_values(ascending=False).head(10)

Part 6: Item list

datasets / model_selection

load_breast_cancer / load_diabetes / load_iris
make_classification / make_regression
train_test_split
KFold / StratifiedKFold
cross_val_score
GridSearchCV / RandomizedSearchCV

preprocessing / impute / feature_selection

StandardScaler / MinMaxScaler / MaxAbsScaler / RobustScaler
QuantileTransformer / PowerTransformer
KBinsDiscretizer
LabelEncoder / OneHotEncoder
SimpleImputer
SelectKBest / f_regression / chi2 / mutual_info_classif

Regression models

LinearRegression
Ridge / Lasso / ElasticNet
KNeighborsRegressor
DecisionTreeRegressor
SVR
RandomForestRegressor / AdaBoostRegressor / GradientBoostingRegressor

Classification models

LogisticRegression
KNeighborsClassifier
GaussianNB
DecisionTreeClassifier
SVC
RandomForestClassifier / AdaBoostClassifier / GradientBoostingClassifier
VotingClassifier / StackingClassifier

Unsupervised learning

KMeans / AgglomerativeClustering / DBSCAN / MeanShift
PCA / KernelPCA / TruncatedSVD / NMF

metrics

Classification: accuracy_score / precision_score / recall_score / f1_score / confusion_matrix / classification_report / roc_curve / precision_recall_curve / auc
Regression: mean_absolute_error / mean_squared_error / RMSE (np.sqrt of mean_squared_error) / mean_absolute_percentage_error / r2_score

pipeline / compose

Pipeline
ColumnTransformer

Part 7: Key-points tree

Scikit-learn practical workflow
├─ Data preparation
│  ├─ datasets: load_breast_cancer / load_diabetes / load_iris / make_classification / make_regression
│  ├─ Splitting: train_test_split / KFold / StratifiedKFold
│  └─ EDA checks: missing values / outliers / class imbalance / leakage
├─ Preprocessing
│  ├─ Scaling: StandardScaler / MinMaxScaler / MaxAbsScaler / RobustScaler
│  ├─ Distribution transforms: QuantileTransformer / PowerTransformer
│  ├─ Binning: KBinsDiscretizer
│  ├─ Missing-value imputation: SimpleImputer
│  ├─ Categorical encoding: LabelEncoder / OneHotEncoder
│  └─ Feature selection: SelectKBest / f_regression / chi2 / mutual_info_classif
├─ Regression
│  ├─ Baseline: LinearRegression
│  ├─ Regularized: Ridge / Lasso / ElasticNet
│  ├─ Distance-based: KNeighborsRegressor
│  ├─ Tree: DecisionTreeRegressor
│  └─ Nonlinear: SVR / RandomForestRegressor / GradientBoostingRegressor / AdaBoostRegressor
├─ Classification
│  ├─ Baseline: LogisticRegression
│  ├─ Distance-based: KNeighborsClassifier
│  ├─ Lightweight baseline: GaussianNB
│  ├─ Tree: DecisionTreeClassifier
│  └─ Nonlinear and ensembles: SVC / RandomForestClassifier / GradientBoostingClassifier / AdaBoostClassifier / VotingClassifier / StackingClassifier
├─ Unsupervised learning
│  ├─ Clustering: KMeans / AgglomerativeClustering / DBSCAN / MeanShift
│  └─ Dimensionality reduction: PCA / KernelPCA / TruncatedSVD / NMF
├─ Evaluation
│  ├─ Classification: accuracy / precision / recall / f1 / confusion_matrix / classification_report / ROC / PR
│  ├─ Regression: MAE / MSE / RMSE / MAPE / R2
│  └─ Threshold tuning: predict_proba / np.where
└─ Improvement and reproducibility
   ├─ Cross-validation: cross_val_score
   ├─ Search: GridSearchCV / RandomizedSearchCV
   ├─ Bundled preprocessing: Pipeline / ColumnTransformer
   └─ Importance check: feature_importances_
