Overview
-> Dataset
-> Preprocessing
-> Training
-> Evaluation
-> Tuning
-> Pipeline construction
datasets -> load_breast_cancer / load_diabetes / load_iris / make_classification / make_regression
preprocessing -> StandardScaler / MinMaxScaler / MaxAbsScaler / RobustScaler / OneHotEncoder / LabelEncoder
impute -> SimpleImputer
feature_selection -> SelectKBest
linear_model -> LinearRegression / LogisticRegression / Ridge / Lasso / ElasticNet
neighbors -> KNeighborsClassifier / KNeighborsRegressor
tree -> DecisionTreeClassifier / DecisionTreeRegressor
svm -> SVC / SVR
ensemble -> RandomForestClassifier / RandomForestRegressor / AdaBoostClassifier / AdaBoostRegressor / GradientBoostingClassifier / GradientBoostingRegressor / VotingClassifier / StackingClassifier
metrics -> accuracy_score / precision_score / recall_score / f1_score / confusion_matrix / mean_absolute_error / RMSE / r2_score
model_selection -> train_test_split / StratifiedKFold / cross_val_score / GridSearchCV / RandomizedSearchCV
pipeline -> Pipeline
compose -> ColumnTransformer
cluster -> KMeans / AgglomerativeClustering / DBSCAN / MeanShift
decomposition -> PCA / KernelPCA / TruncatedSVD / NMF
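An Introduction to the Scikit-learn Library Through the Practical Machine Learning Workflow
Part 1: Basics of the Practical Workflow
1. The common sklearn flow
Purpose: learn the minimal split / fit / predict / score pattern shared by the estimator APIs.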
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=17,
    stratify=y,
)
model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
model.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)
2. Loading sample data
Purpose: try out practice datasets right away.
from sklearn.datasets import (
    load_breast_cancer,
    load_diabetes,
    load_iris,
    make_classification,
    make_regression,
)
breast_cancer = load_breast_cancer(as_frame=True) # binary classification
diabetes = load_diabetes(as_frame=True) # regression
iris = load_iris(as_frame=True) # multiclass classification
X_cls, y_cls = make_classification(n_samples=200, n_features=10, random_state=17)
X_reg, y_reg = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=17)
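With as_frame=True the loaders return a Bunch object whose data, target, and frame attributes are pandas objects, while the make_* generators return plain NumPy arrays. A minimal sketch for inspecting both, assuming pandas is installed (the variable names are illustrative):

```python
from sklearn.datasets import load_iris, make_classification

iris = load_iris(as_frame=True)
X_iris = iris.data      # pandas DataFrame of features
y_iris = iris.target    # pandas Series of integer class labels
print(X_iris.shape)     # (150, 4)
print(list(iris.target_names))  # names that the integer labels map to

# The synthetic generators return a (features, target) tuple of NumPy arrays.
X_cls, y_cls = make_classification(n_samples=200, n_features=10, random_state=17)
print(X_cls.shape, y_cls.shape)  # (200, 10) (200,)
```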
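3. Problem framing and the minimum to check in EDA
Purpose: fix the points to check before choosing a model.
What do you want to predict?
Is it classification or regression?
Which is more costly: a miss (false negative) or a false alarm (false positive)?
Are missing values, outliers, or class imbalance severe?
Are any columns mixed in that could leak the target?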
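Some of the checks above can be scripted with pandas. The following is a rough sketch on the breast cancer sample data, not a full EDA; the 3-standard-deviation cutoff for outliers is an arbitrary assumption, and leakage still has to be judged by reading the column definitions.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame  # feature columns plus a "target" column

# Missing values per column (this sample dataset has none).
missing_counts = df.isna().sum()

# Class balance: share of each class in the target.
class_share = df["target"].value_counts(normalize=True)

# Rough outlier scan: share of rows beyond 3 standard deviations, per column.
features = df.drop(columns="target")
z_scores = (features - features.mean()) / features.std()
outlier_share = (z_scores.abs() > 3).mean().sort_values(ascending=False)

print(missing_counts.sum())
print(class_share.round(3))
print(outlier_share.head())
```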
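Part 2: Preprocessing and Feature Engineering
1. Scaling
Purpose: even out scale differences for distance-based and linear models.
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler
standard_scaler = StandardScaler() # the default starting point
minmax_scaler = MinMaxScaler() # rescales each feature to the 0-1 range
maxabs_scaler = MaxAbsScaler() # works well with sparse matrices
robust_scaler = RobustScaler() # more robust to outliers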
X_scaled = standard_scaler.fit_transform(X_train)
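The line above fits the scaler on the training data. The test data should then go through transform only, so that it reuses the training mean and standard deviation rather than its own. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy arrays; the two columns have very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

print(scaler.mean_)    # [  2. 200.]
print(X_test_scaled)   # second value is far from 0 because 400 lies outside the train range
```

Calling fit_transform on the test set instead would leak test statistics into preprocessing and make the two sets inconsistent.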
2. Distribution transforms
Purpose: make heavily skewed numeric features easier to handle.
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
quantile_normal = QuantileTransformer(output_distribution="normal")
quantile_uniform = QuantileTransformer(output_distribution="uniform")
power_transformer = PowerTransformer()
X_quantile = quantile_normal.fit_transform(X_num)
X_power = power_transformer.fit_transform(X_num)
3. Binning
Purpose: cut continuous values into intervals.
from sklearn.preprocessing import KBinsDiscretizer
binning_quantile = KBinsDiscretizer(n_bins=5, strategy="quantile", encode="ordinal")
binning_uniform = KBinsDiscretizer(n_bins=5, strategy="uniform", encode="ordinal")
X_binned = binning_quantile.fit_transform(X_num)
4. Missing-value imputation
Purpose: use different imputation strategies for numeric and categorical columns.
from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")
constant_imputer = SimpleImputer(strategy="constant", fill_value="missing")
X_num_filled = num_imputer.fit_transform(X_num)
X_cat_filled = cat_imputer.fit_transform(X_cat)
5. Categorical encoding
Purpose: use different encoders for the target and for the features.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder = LabelEncoder() # mainly for y
y_encoded = label_encoder.fit_transform(y)
onehot_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_cat_encoded = onehot_encoder.fit_transform(X_cat)
categories = onehot_encoder.categories_
feature_names = onehot_encoder.get_feature_names_out()
In scikit-learn versions before 1.2, use sparse=False instead of sparse_output=False.
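The handle_unknown="ignore" setting used above decides what happens when transform meets a category that was absent during fit: the row gets all zeros for that feature instead of raising an error. A small sketch with hypothetical city names; it keeps the default sparse output and calls toarray(), which sidesteps the sparse / sparse_output naming difference:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column.
X_cat_train = np.array([["tokyo"], ["osaka"], ["tokyo"]])
X_cat_new = np.array([["osaka"], ["nagoya"]])  # "nagoya" was never seen in fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_cat_train)

encoded_new = encoder.transform(X_cat_new).toarray()
print(encoder.categories_)  # categories are sorted: osaka, tokyo
print(encoded_new)          # [[1. 0.]
                            #  [0. 0.]]  <- unknown category becomes all zeros
```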
6. Feature selection
Purpose: drop unneeded features and curb overfitting.
f_regression is for regression; mutual_info_classif and chi2 are for classification. chi2 requires non-negative feature values.
from sklearn.feature_selection import SelectKBest, chi2, f_regression, mutual_info_classif
selector_reg = SelectKBest(score_func=f_regression, k=10)
selector_cls = SelectKBest(score_func=mutual_info_classif, k=10)
selector_chi2 = SelectKBest(score_func=chi2, k=10)
X_reg_selected = selector_reg.fit_transform(X_reg_train, y_reg_train)
X_cls_selected = selector_cls.fit_transform(X_cls_train, y_cls_train)
X_non_negative_selected = selector_chi2.fit_transform(X_non_negative, y_cls_train)
support_mask = selector_cls.get_support()
7. ColumnTransformer and Pipeline
Purpose: combine everything from preprocessing to training into a single object.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
numeric_features = ["age", "income"]
categorical_features = ["city", "plan"]
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
Part 3: Models
1. Regression models
Purpose: predict continuous values.
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
linear_regression = LinearRegression() # baseline model
ridge = Ridge(alpha=1.0) # L2 regularization
lasso = Lasso(alpha=0.01) # L1 regularization
elastic_net = ElasticNet(alpha=0.01, l1_ratio=0.5) # mix of L1 and L2
knn_regressor = KNeighborsRegressor(n_neighbors=8) # distance-based
tree_regressor = DecisionTreeRegressor(max_depth=4, random_state=17)
svr = SVR(C=3.0, epsilon=0.2) # nonlinear regression
2. Classification models
Purpose: predict classes.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
logistic_regression = LogisticRegression(max_iter=500, random_state=17)
gaussian_nb = GaussianNB()
knn_classifier = KNeighborsClassifier(n_neighbors=5, weights="uniform")
tree_classifier = DecisionTreeClassifier(max_depth=4, random_state=17)
svc = SVC(C=1.0, gamma="scale", probability=True, random_state=17)
3. Ensemble learning
Purpose: build a stronger baseline than a single model.
from sklearn.ensemble import (
    AdaBoostClassifier,
    AdaBoostRegressor,
    GradientBoostingClassifier,
    GradientBoostingRegressor,
    RandomForestClassifier,
    RandomForestRegressor,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
random_forest_classifier = RandomForestClassifier(n_estimators=200, random_state=17)
random_forest_regressor = RandomForestRegressor(n_estimators=200, random_state=17)
adaboost_classifier = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=17)
adaboost_regressor = AdaBoostRegressor(n_estimators=200, learning_rate=0.1, random_state=17)
gradient_boosting_classifier = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, random_state=17)
gradient_boosting_regressor = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=17)
voting_classifier = VotingClassifier(
    estimators=[
        ("lr", Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression(max_iter=500, random_state=17))])),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=17)),
        ("svc", Pipeline([("scaler", StandardScaler()), ("model", SVC(probability=True, random_state=17))])),
    ],
    voting="soft",
)
stacking_classifier = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=17)),
        ("svc", Pipeline([("scaler", StandardScaler()), ("model", SVC(probability=True, random_state=17))])),
    ],
    final_estimator=LogisticRegression(max_iter=500, random_state=17),
)
4. Unsupervised learning
Purpose: perform clustering and dimensionality reduction.
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans, MeanShift
from sklearn.decomposition import KernelPCA, NMF, PCA, TruncatedSVD
kmeans = KMeans(n_clusters=4, random_state=17, n_init=10)
agglomerative = AgglomerativeClustering(n_clusters=4)
dbscan = DBSCAN(eps=0.5, min_samples=5)
mean_shift = MeanShift()
pca = PCA(n_components=3)
kernel_pca = KernelPCA(n_components=3, kernel="rbf")
truncated_svd = TruncatedSVD(n_components=50, random_state=17)
nmf = NMF(n_components=10, random_state=17)
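Part 4: Evaluation Metrics
1. Evaluating classification
Purpose: do not judge on accuracy alone.
The following is a binary classification example. For multiclass classification, specify average explicitly and state which class's probability you are looking at.
from sklearn.metrics import (
    accuracy_score,
    auc,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_curve,
    precision_score,
    recall_score,
    roc_curve,
)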
pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)
matrix = confusion_matrix(y_test, pred)
report = classification_report(y_test, pred)
precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, prob)
fpr, tpr, roc_thresholds = roc_curve(y_test, prob)
roc_auc = auc(fpr, tpr)
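For the multiclass case, a sketch on the iris data might look like the following. The choice of average="macro" and "weighted" is only an example, and roc_auc_score with multi_class="ovr" is an addition that does not appear in the binary snippet above; it takes the full probability matrix rather than a single column.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
model.fit(X_train, y_train)

pred = model.predict(X_test)
prob = model.predict_proba(X_test)  # shape (n_samples, 3): one column per class

# average must be set explicitly once there are more than two classes.
macro_precision = precision_score(y_test, pred, average="macro")
macro_recall = recall_score(y_test, pred, average="macro")
macro_f1 = f1_score(y_test, pred, average="macro")        # every class counts equally
weighted_f1 = f1_score(y_test, pred, average="weighted")  # weighted by class frequency
ovr_auc = roc_auc_score(y_test, prob, multi_class="ovr")  # one-vs-rest AUC, macro-averaged

print(macro_precision, macro_recall, macro_f1, weighted_f1, ovr_auc)
```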
2. Evaluating regression
Purpose: look at the typical deviation and the large misses separately.
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
mape = mean_absolute_percentage_error(y_test, pred)
r2 = r2_score(y_test, pred)
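A small hand-checkable example helps show why MAE and RMSE are read together: MAE reflects the typical deviation, while RMSE is pulled up by a single large miss. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 13.0])  # the last prediction misses by 4

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 1 + 4) / 4 = 1.5
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 1 + 16) / 4 = 4.5
rmse = np.sqrt(mse)                        # about 2.12, dominated by the one large error
r2 = r2_score(y_true, y_pred)              # 1 - 18 / 20 = 0.1

print(mae, mse, rmse, r2)
```

Here RMSE is noticeably larger than MAE, which is a hint that the errors are uneven rather than uniformly small.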
3. Threshold adjustment
Purpose: change the decision cutoff of a classifier.
import numpy as np
prob = model.predict_proba(X_test)[:, 1]
pred_threshold = np.where(prob >= 0.7, 1, 0)
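Raising the threshold usually trades recall for precision. The sketch below makes that visible with hypothetical labels and probabilities, comparing the default-like cutoff of 0.5 with the 0.7 used above.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted positive-class probabilities.
y_true = np.array([0, 0, 1, 1, 1])
prob = np.array([0.2, 0.6, 0.65, 0.8, 0.9])

results = {}
for threshold in (0.5, 0.7):
    pred = np.where(prob >= threshold, 1, 0)
    results[threshold] = (
        precision_score(y_true, pred),
        recall_score(y_true, pred),
    )
    print(threshold, results[threshold])
# threshold 0.5 -> precision 0.75, recall 1.0
# threshold 0.7 -> precision 1.0,  recall about 0.667
```

Which side to favor depends on whether misses or false alarms are more costly for the problem at hand.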
Part 5: Validation, Improvement, and Reproducibility
1. Cross-validation
Purpose: reduce the variance caused by a lucky or unlucky split.
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=17)
cv_scores = cross_val_score(model, X, y, cv=stratified_cv)
mean_cv_score = cv_scores.mean()
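A self-contained run might look like this. Passing a Pipeline (rather than pre-scaled data) to cross_val_score means the scaler is re-fit inside each fold, so no validation-fold statistics leak into training. The dataset and the scoring="f1" choice are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=17)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

# One score per fold; the spread shows how much the result depends on the split.
cv_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(cv_scores)
print(cv_scores.mean(), cv_scores.std())
```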
2. Hyperparameter search
Purpose: systematically search the parameters that matter most.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=17),
    param_grid=param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1,
)
param_distributions = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [3, 5, 7, None],
    "max_features": ["sqrt", "log2", None],
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=17),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="f1",
    random_state=17,
    n_jobs=-1,
)
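The search objects above are only constructed; fit runs the search. When the estimator is a Pipeline, parameter names take the form step__parameter. The following is a sketch under those assumptions, using a small grid over LogisticRegression's C on the breast cancer data (both choices are illustrative, not taken from the snippet above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=17)),
])
# "model__C" targets the C parameter of the step named "model".
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)  # searches on the training split only

best_params = grid.best_params_
best_cv_score = grid.best_score_        # mean cross-validated F1 of the best setting
test_f1 = grid.score(X_test, y_test)    # F1 of the refit best model on held-out data
print(best_params, best_cv_score, test_f1)
```

Keeping the test split out of the search leaves it available as an unbiased final check.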
3. Spotting overfitting
Purpose: look at the gap between the training score and the test score.
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
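As an illustration of what such a gap can look like, the sketch below compares an unrestricted decision tree with a depth-limited one on the breast cancer data. The dataset and the depth of 3 are assumptions chosen for the example; the exact scores will vary with the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17, stratify=y
)

gaps = {}
for name, max_depth in [("unrestricted", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=17)
    tree.fit(X_train, y_train)
    train_score = tree.score(X_train, y_train)
    test_score = tree.score(X_test, y_test)
    # A large positive gap suggests the model memorized the training data.
    gaps[name] = (train_score, test_score, train_score - test_score)
    print(name, gaps[name])
```

An unrestricted tree typically scores close to 1.0 on the training data, so the size of its drop on the test data is the number to watch.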
4. Checking feature importance
Purpose: get a rough idea of which columns a tree-based model relies on.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=300, random_state=17)
rf_model.fit(X_train, y_train)
importance = pd.Series(rf_model.feature_importances_, index=X_train.columns)
top_importance = importance.sort_values(ascending=False).head(10)
Part 6: Item Index
datasets / model_selection
- load_breast_cancer / load_diabetes / load_iris
- make_classification / make_regression
- train_test_split
- KFold / StratifiedKFold
- cross_val_score
- GridSearchCV / RandomizedSearchCV
preprocessing / impute / feature_selection
- StandardScaler / MinMaxScaler / MaxAbsScaler / RobustScaler
- QuantileTransformer / PowerTransformer
- KBinsDiscretizer
- LabelEncoder / OneHotEncoder
- SimpleImputer
- SelectKBest / f_regression / chi2 / mutual_info_classif
Regression models
- LinearRegression
- Ridge / Lasso / ElasticNet
- KNeighborsRegressor
- DecisionTreeRegressor
- SVR
- RandomForestRegressor / AdaBoostRegressor / GradientBoostingRegressor
Classification models
- LogisticRegression
- KNeighborsClassifier
- GaussianNB
- DecisionTreeClassifier
- SVC
- RandomForestClassifier / AdaBoostClassifier / GradientBoostingClassifier
- VotingClassifier / StackingClassifier
Unsupervised learning
- KMeans / AgglomerativeClustering / DBSCAN / MeanShift
- PCA / KernelPCA / TruncatedSVD / NMF
metrics
- Classification: accuracy_score / precision_score / recall_score / f1_score / confusion_matrix / classification_report / roc_curve / precision_recall_curve / auc
- Regression: mean_absolute_error / mean_squared_error / RMSE (np.sqrt of mean_squared_error) / mean_absolute_percentage_error / r2_score
pipeline / compose
- Pipeline
- ColumnTransformer
Part 7: Key-Point Tree
Scikit-learn practical workflow
├─ Data preparation
│ ├─ datasets: load_breast_cancer / load_diabetes / load_iris / make_classification / make_regression
│ ├─ Splitting: train_test_split / KFold / StratifiedKFold
│ └─ EDA prerequisites: missing values / outliers / class imbalance / leakage check
├─ Preprocessing
│ ├─ Scaling: StandardScaler / MinMaxScaler / MaxAbsScaler / RobustScaler
│ ├─ Distribution transforms: QuantileTransformer / PowerTransformer
│ ├─ Binning: KBinsDiscretizer
│ ├─ Imputation: SimpleImputer
│ ├─ Categorical encoding: LabelEncoder / OneHotEncoder
│ └─ Feature selection: SelectKBest / f_regression / chi2 / mutual_info_classif
├─ Regression
│ ├─ Baseline: LinearRegression
│ ├─ Regularized: Ridge / Lasso / ElasticNet
│ ├─ Distance-based: KNeighborsRegressor
│ ├─ Tree: DecisionTreeRegressor
│ └─ Nonlinear: SVR / RandomForestRegressor / GradientBoostingRegressor / AdaBoostRegressor
├─ Classification
│ ├─ Baseline: LogisticRegression
│ ├─ Distance-based: KNeighborsClassifier
│ ├─ Lightweight baseline: GaussianNB
│ ├─ Tree: DecisionTreeClassifier
│ └─ Nonlinear and ensembles: SVC / RandomForestClassifier / GradientBoostingClassifier / AdaBoostClassifier / VotingClassifier / StackingClassifier
├─ Unsupervised learning
│ ├─ Clustering: KMeans / AgglomerativeClustering / DBSCAN / MeanShift
│ └─ Dimensionality reduction: PCA / KernelPCA / TruncatedSVD / NMF
├─ Evaluation
│ ├─ Classification: accuracy / precision / recall / f1 / confusion_matrix / classification_report / ROC / PR
│ ├─ Regression: MAE / MSE / RMSE / MAPE / R2
│ └─ Threshold adjustment: predict_proba / np.where
└─ Improvement and reproducibility
  ├─ Cross-validation: cross_val_score
  ├─ Search: GridSearchCV / RandomizedSearchCV
  ├─ Integrated preprocessing: Pipeline / ColumnTransformer
  └─ Importance check: feature_importances_