Overview

  -> Dataset
  -> Preprocessing
  -> Training
  -> Evaluation
  -> Tuning
  -> Building a Pipeline

datasets         -> load_breast_cancer / load_diabetes / load_iris / make_classification / make_regression
preprocessing    -> StandardScaler / MinMaxScaler / MaxAbsScaler / RobustScaler / OneHotEncoder / LabelEncoder
impute           -> SimpleImputer
feature_selection-> SelectKBest
linear_model     -> LinearRegression / LogisticRegression / Ridge / Lasso / ElasticNet
neighbors        -> KNeighborsClassifier / KNeighborsRegressor
tree             -> DecisionTreeClassifier / DecisionTreeRegressor
svm              -> SVC / SVR
ensemble         -> RandomForestClassifier / RandomForestRegressor / AdaBoostClassifier / AdaBoostRegressor / GradientBoostingClassifier / GradientBoostingRegressor / VotingClassifier / StackingClassifier
metrics          -> accuracy_score / precision_score / recall_score / f1_score / confusion_matrix / mean_absolute_error / mean_squared_error (RMSE via np.sqrt) / r2_score
model_selection  -> train_test_split / StratifiedKFold / cross_val_score / GridSearchCV / RandomizedSearchCV
pipeline         -> Pipeline
compose          -> ColumnTransformer
cluster          -> KMeans / AgglomerativeClustering / DBSCAN / MeanShift
decomposition    -> PCA / KernelPCA / TruncatedSVD / NMF

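An Introduction to the Scikit-learn Library Through a Practical Machine Learning Workflow

Part 1: Basics of the Practical Workflow

1. The common sklearn flow

Purpose: learn the minimal pattern shared by the training APIs.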

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X (features) and y (labels) are assumed to be loaded already.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=17,
    stratify=y,
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)
score = model.score(X_test, y_test)
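
The snippet above leaves X and y undefined. A minimal runnable sketch of the same flow follows, assuming the bundled breast cancer dataset stands in for real data (that choice is an assumption made here for illustration only).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a bundled practice dataset replaces the real X and y.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the rows, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y
)

# Scaling and the classifier are fitted together on the training rows only.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)         # predicted class labels
score = model.score(X_test, y_test)  # mean accuracy for a classifier
print(pred[:5], round(score, 3))
```

For a classifier, score returns accuracy; for a regressor it returns R2, so the same four steps (split, fit, predict, score) carry over to regression.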

2. Loading sample data

Purpose: try out practice datasets right away.

from sklearn.datasets import (
    load_breast_cancer,
    load_diabetes,
    load_iris,
    make_classification,
    make_regression,
)

breast_cancer = load_breast_cancer(as_frame=True)  # classification
diabetes = load_diabetes(as_frame=True)            # regression
iris = load_iris(as_frame=True)                    # multiclass classification

X_cls, y_cls = make_classification(n_samples=200, n_features=10, random_state=17)
X_reg, y_reg = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=17)

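3. Framing the problem and the minimum to check in EDA

Purpose: fix the points to check before choosing a model.

  • What do you want to predict?
  • Is it classification or regression?
  • Which is more costly: a miss (false negative) or a false alarm (false positive)?
  • Are missing values, outliers, or class imbalance severe?
  • Are any columns mixed in that could leak the target?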
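
Several of these checks can be sketched in a few lines of pandas. The example below is a rough illustration, assuming the breast cancer dataset loaded with as_frame=True; the correlation-based leak check is only a heuristic chosen here, not a complete test for leakage.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame  # feature columns plus a "target" column

# Missing values per column.
missing_per_column = df.isna().sum()

# Class imbalance: share of each class in the target.
class_ratio = df["target"].value_counts(normalize=True)

# Rough outlier check: compare min and max against the mean.
summary = df.describe().T[["min", "max", "mean"]]

# Heuristic leak check: a feature that correlates almost perfectly
# with the target deserves a closer look.
target_corr = (
    df.corr()["target"].drop("target").abs().sort_values(ascending=False)
)

print(missing_per_column.sum())
print(class_ratio)
print(target_corr.head(3))
```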

Part 2: Preprocessing and Feature Engineering

1. Scaling

Purpose: align feature scales for distance-based and linear models.

from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler

standard_scaler = StandardScaler()  # default first choice
minmax_scaler = MinMaxScaler()      # rescales to the 0 to 1 range
maxabs_scaler = MaxAbsScaler()      # works well with sparse matrices
robust_scaler = RobustScaler()      # more robust to outliers

X_scaled = standard_scaler.fit_transform(X_train)
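
One point worth making explicit: the scaler should learn its statistics from the training rows and then be reused on the test rows. A small sketch with hypothetical toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately zero for each column
print(X_test_scaled)
```

Calling fit_transform on the test data instead would recompute the mean and standard deviation from the test rows, which leaks test information and makes the two sets inconsistent.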

2. Distribution transforms

Purpose: make heavily skewed numeric features easier to handle.

from sklearn.preprocessing import PowerTransformer, QuantileTransformer

quantile_normal = QuantileTransformer(output_distribution="normal")
quantile_uniform = QuantileTransformer(output_distribution="uniform")
power_transformer = PowerTransformer()

X_quantile = quantile_normal.fit_transform(X_num)
X_power = power_transformer.fit_transform(X_num)

3. Binning

Purpose: cut continuous values into intervals.

from sklearn.preprocessing import KBinsDiscretizer

binning_quantile = KBinsDiscretizer(n_bins=5, strategy="quantile", encode="ordinal")
binning_uniform = KBinsDiscretizer(n_bins=5, strategy="uniform", encode="ordinal")

X_binned = binning_quantile.fit_transform(X_num)
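
To see what the ordinal output looks like, here is a small sketch using the uniform strategy on a hypothetical toy column, chosen so the bin edges can be checked by hand:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy column with values spread evenly over 0 to 10.
X_num = np.array([[0.0], [2.5], [5.0], [7.5], [10.0]])

binner = KBinsDiscretizer(n_bins=5, strategy="uniform", encode="ordinal")
X_binned = binner.fit_transform(X_num)

print(binner.bin_edges_[0])  # equal-width edges between min and max
print(X_binned.ravel())      # ordinal bin index for each row
```

With strategy="uniform" the edges are equally spaced; with strategy="quantile" they are placed so that each bin holds roughly the same number of rows.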

4. Missing-value imputation

Purpose: use different imputation strategies for numeric and categorical columns.

from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")
constant_imputer = SimpleImputer(strategy="constant", fill_value="missing")

X_num_filled = num_imputer.fit_transform(X_num)
X_cat_filled = cat_imputer.fit_transform(X_cat)
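
A small sketch of what each strategy fills in, using hypothetical toy columns (the values and category names are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy columns, each with one missing entry.
X_num = np.array([[1.0], [np.nan], [3.0], [5.0]])
X_cat = np.array([["a"], ["b"], [np.nan], ["a"]], dtype=object)

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

X_num_filled = num_imputer.fit_transform(X_num)  # NaN becomes the median, 3.0
X_cat_filled = cat_imputer.fit_transform(X_cat)  # NaN becomes the mode, "a"

print(X_num_filled.ravel())
print(X_cat_filled.ravel())
```

The fill values are learned in fit, so the same imputer should be applied to the test data with transform rather than refitted.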

5. Categorical handling

Purpose: use different encoders for the target variable and for the features.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder = LabelEncoder()  # mainly for y
y_encoded = label_encoder.fit_transform(y)

onehot_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_cat_encoded = onehot_encoder.fit_transform(X_cat)

categories = onehot_encoder.categories_
feature_names = onehot_encoder.get_feature_names_out()

In older versions (before scikit-learn 1.2), you may need sparse=False instead of sparse_output=False.

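6. Feature selection

Purpose: reduce unneeded features and curb overfitting.

f_regression is for regression; mutual_info_classif and chi2 are for classification. Use chi2 only on non-negative features.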

from sklearn.feature_selection import SelectKBest, chi2, f_regression, mutual_info_classif

selector_reg = SelectKBest(score_func=f_regression, k=10)
selector_cls = SelectKBest(score_func=mutual_info_classif, k=10)
selector_chi2 = SelectKBest(score_func=chi2, k=10)

X_reg_selected = selector_reg.fit_transform(X_reg_train, y_reg_train)
X_cls_selected = selector_cls.fit_transform(X_cls_train, y_cls_train)
X_non_negative_selected = selector_chi2.fit_transform(X_non_negative, y_cls_train)

support_mask = selector_cls.get_support()
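
A runnable sketch of the classification case, assuming synthetic data from make_classification with three informative features (the dataset parameters are chosen here for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: 10 features, of which 3 carry signal.
X_cls, y_cls = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=17
)

selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_selected = selector.fit_transform(X_cls, y_cls)

support_mask = selector.get_support()              # boolean mask over the 10 columns
selected_idx = selector.get_support(indices=True)  # integer positions of the kept columns

print(X_selected.shape)
print(selected_idx)
```

In a real workflow the selector would be fitted on the training split only, for example as a Pipeline step, so that the test rows do not influence which columns are kept.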

7. ColumnTransformer and Pipeline

Purpose: combine everything from preprocessing to training into a single object.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income"]
categorical_features = ["city", "plan"]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
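
To show the combined object in use, the sketch below rebuilds the same pipeline and fits it on a tiny hypothetical DataFrame. The column values and labels are made up; only the column names follow the snippet above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in numeric and categorical columns.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [300, 420, 380, np.nan, 610, 500],
    "city": ["tokyo", "osaka", "tokyo", np.nan, "osaka", "tokyo"],
    "plan": ["basic", "pro", "basic", "pro", "pro", "basic"],
})
y = [0, 1, 0, 1, 1, 0]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_transformer, ["age", "income"]),
    ("cat", categorical_transformer, ["city", "plan"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
pipeline.fit(df, y)

# A new row with an unseen city still goes through the same preprocessing.
new_row = pd.DataFrame({"age": [40], "income": [450], "city": ["nagoya"], "plan": ["pro"]})
pred = pipeline.predict(new_row)
n_columns = pipeline.named_steps["preprocess"].transform(df).shape[1]
print(pred, n_columns)
```

Because imputation, scaling, and encoding live inside the pipeline, a single fit and a single predict call keep training and inference consistent.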

Part 3: Models

1. Regression models

Purpose: predict continuous values.

from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

linear_regression = LinearRegression()                 # baseline model
ridge = Ridge(alpha=1.0)                               # L2 regularization
lasso = Lasso(alpha=0.01)                              # L1 regularization
elastic_net = ElasticNet(alpha=0.01, l1_ratio=0.5)     # mix of L1 and L2
knn_regressor = KNeighborsRegressor(n_neighbors=8)     # distance-based
tree_regressor = DecisionTreeRegressor(max_depth=4, random_state=17)
svr = SVR(C=3.0, epsilon=0.2)                          # nonlinear regression

2. Classification models

Purpose: predict classes.

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

logistic_regression = LogisticRegression(max_iter=500, random_state=17)
gaussian_nb = GaussianNB()
knn_classifier = KNeighborsClassifier(n_neighbors=5, weights="uniform")
tree_classifier = DecisionTreeClassifier(max_depth=4, random_state=17)
svc = SVC(C=1.0, gamma="scale", probability=True, random_state=17)

3. Ensemble learning

Purpose: build a stronger baseline than a single model.

from sklearn.ensemble import (
    AdaBoostClassifier,
    AdaBoostRegressor,
    GradientBoostingClassifier,
    GradientBoostingRegressor,
    RandomForestClassifier,
    RandomForestRegressor,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

random_forest_classifier = RandomForestClassifier(n_estimators=200, random_state=17)
random_forest_regressor = RandomForestRegressor(n_estimators=200, random_state=17)
adaboost_classifier = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=17)
adaboost_regressor = AdaBoostRegressor(n_estimators=200, learning_rate=0.1, random_state=17)
gradient_boosting_classifier = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, random_state=17)
gradient_boosting_regressor = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=17)

voting_classifier = VotingClassifier(
    estimators=[
        ("lr", Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=500, random_state=17))])),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=17)),
        ("svc", Pipeline([("scaler", StandardScaler()), ("model", SVC(probability=True, random_state=17))])),
    ],
    voting="soft",
)

stacking_classifier = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=17)),
        ("svc", Pipeline([("scaler", StandardScaler()), ("model", SVC(probability=True, random_state=17))])),
    ],
    final_estimator=LogisticRegression(max_iter=500, random_state=17),
)

4. Unsupervised learning

Purpose: perform clustering and dimensionality reduction.

from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans, MeanShift
from sklearn.decomposition import KernelPCA, NMF, PCA, TruncatedSVD

kmeans = KMeans(n_clusters=4, random_state=17, n_init=10)
agglomerative = AgglomerativeClustering(n_clusters=4)
dbscan = DBSCAN(eps=0.5, min_samples=5)
mean_shift = MeanShift()

pca = PCA(n_components=3)
kernel_pca = KernelPCA(n_components=3, kernel="rbf")
truncated_svd = TruncatedSVD(n_components=50, random_state=17)
nmf = NMF(n_components=10, random_state=17)
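
The estimators above are only constructed, not fitted. A minimal sketch of one common combination, assuming the iris features with labels ignored: scale, reduce to two components with PCA, then cluster with KMeans. The choice of 2 components and 3 clusters is an assumption made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # labels are ignored on purpose
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: keep the two directions with the most variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

# Clustering on the reduced features.
kmeans = KMeans(n_clusters=3, random_state=17, n_init=10)
labels = kmeans.fit_predict(X_pca)

print(X_pca.shape, round(explained, 3))
print(sorted(set(labels.tolist())))
```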

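Part 4: Evaluation Metrics

1. Evaluating classification

Purpose: do not judge by accuracy alone.

The example below is for binary classification. For multiclass classification, set average explicitly and state which class's probability you are looking at.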

from sklearn.metrics import (
    accuracy_score,
    auc,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_curve,
    precision_score,
    recall_score,
    roc_curve,
)

pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)
matrix = confusion_matrix(y_test, pred)
report = classification_report(y_test, pred)

precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, prob)
fpr, tpr, roc_thresholds = roc_curve(y_test, prob)
roc_auc = auc(fpr, tpr)
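
Why accuracy alone can mislead is easier to see with numbers small enough to check by hand. The labels below are hypothetical: 4 positives and 6 negatives, with half of the positives missed.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # 7 of 10 correct
precision = precision_score(y_true, y_pred)  # 2 of 3 predicted positives are right
recall = recall_score(y_true, y_pred)        # 2 of 4 actual positives are found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
matrix = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted class

print(accuracy, precision, recall, f1)
print(matrix)
```

Here accuracy is 0.7 while recall is only 0.5, so a model that looks acceptable by accuracy still misses half of the positive cases.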

2. Evaluating regression

Purpose: look at average error and large misses separately.

import numpy as np

from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
mape = mean_absolute_percentage_error(y_test, pred)
r2 = r2_score(y_test, pred)
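
The difference between the average error and the penalty for large misses shows up even on four hypothetical points, where the values can be verified by hand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions; the errors are -1, 0, 2, 1.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2 + 1) / 4
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4 + 1) / 4
rmse = np.sqrt(mse)                        # same unit as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(mae, mse, round(rmse, 4), round(r2, 4))
```

RMSE (about 1.22) comes out larger than MAE (1.0) because squaring weights the single error of 2 more heavily than the errors of 1.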

3. Threshold tuning

Purpose: change the decision criterion of a classifier.

import numpy as np

prob = model.predict_proba(X_test)[:, 1]
pred_threshold = np.where(prob >= 0.7, 1, 0)
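
Raising the threshold usually trades recall for precision. A small sketch with hypothetical probabilities that makes the trade-off visible:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
prob = np.array([0.95, 0.8, 0.65, 0.4, 0.6, 0.3, 0.2, 0.1])

results = {}
for threshold in (0.5, 0.7):
    pred = np.where(prob >= threshold, 1, 0)
    results[threshold] = (
        precision_score(y_true, pred),
        recall_score(y_true, pred),
    )

for threshold, (precision, recall) in results.items():
    print(threshold, precision, recall)
```

At 0.5 both precision and recall are 0.75; at 0.7 precision rises to 1.0 while recall falls to 0.5. Which side to favor depends on whether misses or false alarms cost more.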

Part 5: Validation, Improvement, and Reproducibility

1. Cross-validation

Purpose: reduce the variance caused by one lucky or unlucky split.

from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)  # classification: keeps class ratios in each fold
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=17)                 # regression or unstratified splits

cv_scores = cross_val_score(model, X, y, cv=stratified_cv)
mean_cv_score = cv_scores.mean()
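
A runnable sketch, assuming the breast cancer dataset and a scaler plus logistic regression pipeline as the model. Passing the whole pipeline to cross_val_score means the scaler is refitted inside each fold, so the validation rows never influence the scaling.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Report the spread as well as the mean.
print(cv_scores.round(3))
print(round(cv_scores.mean(), 3), round(cv_scores.std(), 3))
```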

2. Hyperparameter search

Purpose: systematically search the parameters that tend to matter most.

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=17),
    param_grid=param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1,
)

param_distributions = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [3, 5, 7, None],
    "max_features": ["sqrt", "log2", None],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=17),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="f1",
    random_state=17,
    n_jobs=-1,
)
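
The search objects above are defined but never fitted. A minimal sketch of running one and reading the result, assuming the breast cancer dataset and a deliberately tiny grid so that it finishes quickly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y
)

# Deliberately tiny grid (an assumption for this sketch): 4 combinations x 3 folds.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

grid = GridSearchCV(
    RandomForestClassifier(random_state=17),
    param_grid=param_grid,
    cv=3,
    scoring="f1",
)
grid.fit(X_train, y_train)  # the search only sees the training split

print(grid.best_params_)
print(round(grid.best_score_, 3))        # mean cross-validated F1 of the best combination
test_score = grid.score(X_test, y_test)  # F1 of the refitted best model on held-out data
print(round(test_score, 3))
```

When the estimator is a Pipeline, the grid keys take the step name as a prefix joined by a double underscore, for example "model__max_depth" for a step named "model".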

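3. Spotting overfitting

Purpose: look at the gap between the training score and the test score. A training score far above the test score suggests overfitting.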

model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
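
The gap can be made concrete by comparing a decision tree with no depth limit against a shallow one. This is a sketch assuming the breast cancer dataset; the depth of 3 is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y
)

scores = {}
for name, depth in [("unlimited", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=17)
    tree.fit(X_train, y_train)
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for name, (train_score, test_score) in scores.items():
    gap = train_score - test_score
    print(name, round(train_score, 3), round(test_score, 3), "gap:", round(gap, 3))
```

The unlimited tree memorizes the training rows, so its training score is perfect and any shortfall on the test rows shows up directly as the gap.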

4. Checking feature importance

Purpose: get a rough idea of which columns matter in tree-based models.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=300, random_state=17)
rf_model.fit(X_train, y_train)

# X_train must be a pandas DataFrame so that .columns is available.
importance = pd.Series(rf_model.feature_importances_, index=X_train.columns)
top_importance = importance.sort_values(ascending=False).head(10)

Part 6: Item List

datasets / model_selection

  • load_breast_cancer / load_diabetes / load_iris
  • make_classification / make_regression
  • train_test_split
  • KFold / StratifiedKFold
  • cross_val_score
  • GridSearchCV / RandomizedSearchCV

preprocessing / impute / feature_selection

  • StandardScaler / MinMaxScaler / MaxAbsScaler / RobustScaler
  • QuantileTransformer / PowerTransformer
  • KBinsDiscretizer
  • LabelEncoder / OneHotEncoder
  • SimpleImputer
  • SelectKBest / f_regression / chi2 / mutual_info_classif

Regression models

  • LinearRegression
  • Ridge / Lasso / ElasticNet
  • KNeighborsRegressor
  • DecisionTreeRegressor
  • SVR
  • RandomForestRegressor / AdaBoostRegressor / GradientBoostingRegressor

Classification models

  • LogisticRegression
  • KNeighborsClassifier
  • GaussianNB
  • DecisionTreeClassifier
  • SVC
  • RandomForestClassifier / AdaBoostClassifier / GradientBoostingClassifier
  • VotingClassifier / StackingClassifier

Unsupervised learning

  • KMeans / AgglomerativeClustering / DBSCAN / MeanShift
  • PCA / KernelPCA / TruncatedSVD / NMF

metrics

  • Classification: accuracy_score / precision_score / recall_score / f1_score / confusion_matrix / classification_report / roc_curve / precision_recall_curve / auc
  • Regression: mean_absolute_error / mean_squared_error / RMSE / mean_absolute_percentage_error / r2_score

pipeline / compose

  • Pipeline
  • ColumnTransformer

Part 7: Key-Points Tree

Scikit-learn practical workflow
├─ Data preparation
│  ├─ datasets: load_breast_cancer / load_diabetes / load_iris / make_classification / make_regression
│  ├─ Splitting: train_test_split / KFold / StratifiedKFold
│  └─ EDA prerequisites: missing values / outliers / class imbalance / leak check
├─ Preprocessing
│  ├─ Scaling: StandardScaler / MinMaxScaler / MaxAbsScaler / RobustScaler
│  ├─ Distribution transforms: QuantileTransformer / PowerTransformer
│  ├─ Binning: KBinsDiscretizer
│  ├─ Missing-value imputation: SimpleImputer
│  ├─ Categorical handling: LabelEncoder / OneHotEncoder
│  └─ Feature selection: SelectKBest / f_regression / chi2 / mutual_info_classif
├─ Regression
│  ├─ Baseline: LinearRegression
│  ├─ Regularization: Ridge / Lasso / ElasticNet
│  ├─ Distance-based: KNeighborsRegressor
│  ├─ Tree: DecisionTreeRegressor
│  └─ Nonlinear: SVR / RandomForestRegressor / GradientBoostingRegressor / AdaBoostRegressor
├─ Classification
│  ├─ Baseline: LogisticRegression
│  ├─ Distance-based: KNeighborsClassifier
│  ├─ Lightweight baseline: GaussianNB
│  ├─ Tree: DecisionTreeClassifier
│  └─ Nonlinear and ensemble: SVC / RandomForestClassifier / GradientBoostingClassifier / AdaBoostClassifier / VotingClassifier / StackingClassifier
├─ Unsupervised learning
│  ├─ Clustering: KMeans / AgglomerativeClustering / DBSCAN / MeanShift
│  └─ Dimensionality reduction: PCA / KernelPCA / TruncatedSVD / NMF
├─ Evaluation
│  ├─ Classification: accuracy / precision / recall / f1 / confusion_matrix / classification_report / ROC / PR
│  ├─ Regression: MAE / MSE / RMSE / MAPE / R2
│  └─ Threshold tuning: predict_proba / np.where
└─ Improvement and reproducibility
   ├─ Cross-validation: cross_val_score
   ├─ Search: GridSearchCV / RandomizedSearchCV
   ├─ Unified preprocessing: Pipeline / ColumnTransformer
   └─ Importance check: feature_importances_