Linear SVC (Classification) and Hyperparameter Tuning

Last updated at 2019-11-14, posted at 2019-05-23

Introduction

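This article builds a classifier for the Wisconsin breast cancer dataset, which is used to judge whether a breast tumor is benign or malignant, using linear SVC with hyperparameter tuning. The data ship with sklearn: 569 samples, of which 212 are malignant and 357 are benign, with 30 features.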
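The figures above can be checked directly. The following is a minimal sketch that loads the dataset and counts the samples per class; the variable name `counts` is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer dataset bundled with sklearn
cancer_data = load_breast_cancer()

# Number of samples and number of features
print(cancer_data.data.shape)

# Count the samples in each class (target 0 and target 1)
counts = np.bincount(cancer_data.target)
for name, count in zip(cancer_data.target_names, counts):
    print(name, count)
```

Note that in this dataset the target value 0 is labeled malignant and 1 is labeled benign.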

What is a support vector machine?

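A support vector machine is a pattern recognition model that uses supervised learning. It can be applied to both classification and regression. Vladimir N. Vapnik and Alexey Ya. Chervonenkis published the linear support vector machine in 1963, and Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik extended it to the nonlinear case in 1992.

The support vector machine is one of the learning models with the best recognition performance among currently known methods. It achieves this because it includes a device for obtaining high discrimination performance on unseen data.
(From Wikipedia)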
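The "device" mentioned in the quotation is usually understood to be margin maximization. As a rough sketch in standard textbook notation (the symbols $w$, $b$, $C$, $x_i$ and $y_i \in \{-1, +1\}$ are assumptions of this sketch and do not appear in the original article), a soft-margin linear SVM solves:

```latex
\min_{w,\,b} \;\; \frac{1}{2}\,\lVert w \rVert^{2}
  \;+\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^{\top} x_i + b)\bigr)
```

Minimizing $\lVert w \rVert^{2}$ widens the margin $2 / \lVert w \rVert$ between the two classes, while $C$ controls how strongly training points that violate the margin are penalized. The second term is the hinge loss; LinearSVC uses its squared form by default.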

Hyperparameters of linear SVC

See the following for details.
sklearn.svm.LinearSVC

Hyperparameter	Options	Default
penalty	l1, l2	l2
loss	hinge, squared_hinge	squared_hinge
dual	bool	True
tol	float	0.0001
C	float	1
multi_class	ovr, crammer_singer	ovr
fit_intercept	bool	True
intercept_scaling	float	1
class_weight	dict, balanced	None (weight 1 for all classes)
verbose	int	0
random_state	int	None
max_iter	int	1000
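As an illustration of how these hyperparameters are passed, here is a minimal sketch that sets a few of them explicitly and reads them back with `get_params()`. The particular values are arbitrary choices for the example, not recommendations, and fitting on the unscaled data may emit a convergence warning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# Set some of the hyperparameters from the table explicitly
# (the values here are illustrative only)
clf = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0,
                class_weight="balanced", max_iter=1000, random_state=0)

# get_params() returns every hyperparameter, including those left at default
params = clf.get_params()
for name in ("penalty", "loss", "C", "class_weight", "max_iter"):
    print(name, "=", params[name])

# For a two-class problem the model learns one weight per feature
clf.fit(X, y)
print(clf.coef_.shape, clf.intercept_.shape)
```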

Procedure

Load the breast cancer data
Set the search conditions
Split into training data and test data
Grid search
Random search
Compare with the case where hyperparameters are not tuned
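The grid search and random search steps differ in how candidate settings are generated. The sketch below uses sklearn's `ParameterGrid` and `ParameterSampler` on the same search spaces as this article's implementation to show the difference; the names `grid_space`, `random_space`, `n_grid` and `samples` are only illustrative, and `n_iter=10` is assumed because it is the default of `RandomizedSearchCV`.

```python
import scipy.stats
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search spaces copied from this article's implementation
grid_space = {"C": [10 ** i for i in range(-5, 6)],
              "multi_class": ["ovr", "crammer_singer"],
              "class_weight": ["balanced"],
              "random_state": [i for i in range(0, 101)]}
random_space = {"C": scipy.stats.uniform(0.00001, 1000),
                "multi_class": ["ovr", "crammer_singer"],
                "class_weight": ["balanced"],
                "random_state": scipy.stats.randint(0, 100)}

# Grid search evaluates every combination: 11 * 2 * 1 * 101 candidates
n_grid = len(ParameterGrid(grid_space))
print("grid candidates:", n_grid)

# Random search draws n_iter candidates from the distributions.
# uniform(loc, scale) samples from [loc, loc + scale];
# randint(low, high) samples integers from low to high - 1.
samples = list(ParameterSampler(random_space, n_iter=10, random_state=0))
print("random candidates:", len(samples))
for s in samples[:3]:
    print(s)
```

With these settings the grid search fits far more candidates than the random search, so it accounts for most of the run time.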

Implementation in Python

%%time
import scipy.stats
from sklearn.datasets import load_breast_cancer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load the breast cancer data
cancer_data = load_breast_cancer()

# Set the search conditions
max_score = 0
SearchMethod = 0
LSVC_grid = {LinearSVC(): {"C": [10 ** i for i in range(-5, 6)],
                           "multi_class": ["ovr", "crammer_singer"],
                           "class_weight": ["balanced"],
                           "random_state": [i for i in range(0, 101)]}}
LSVC_random = {LinearSVC(): {"C": scipy.stats.uniform(0.00001, 1000),
                             "multi_class": ["ovr", "crammer_singer"],
                             "class_weight": ["balanced"],
                             "random_state": scipy.stats.randint(0, 100)}}

# Split into training data and test data
train_X, test_X, train_y, test_y = train_test_split(cancer_data.data, cancer_data.target, random_state=0)

# Grid search
for model, param in LSVC_grid.items():
    clf = GridSearchCV(model, param)
    clf.fit(train_X, train_y)
    pred_y = clf.predict(test_X)
    score = f1_score(test_y, pred_y, average="micro")
    
    if max_score < score:
        max_score = score
        best_param = clf.best_params_
        best_model = model.__class__.__name__

# Random search
for model, param in LSVC_random.items():
    clf = RandomizedSearchCV(model, param)
    clf.fit(train_X, train_y)
    pred_y = clf.predict(test_X)
    score = f1_score(test_y, pred_y, average="micro")
    
    if max_score < score:
        SearchMethod = 1
        max_score = score
        best_param = clf.best_params_
        best_model = model.__class__.__name__
    
if SearchMethod == 0:
    print("Search method: grid search")
else:
    print("Search method: random search")
print("Best score: {}".format(max_score))
print("Model: {}".format(best_model))
print("Parameters: {}".format(best_param))

# Comparison with the case where hyperparameters are not tuned
model = LinearSVC()
model.fit(train_X, train_y)
score = model.score(test_X, test_y)
print("")
print("Default score:", score)

Results

Search method: grid search
Best score: 0.972027972027972
Model: LinearSVC
Parameters: {'C': 1, 'class_weight': 'balanced', 'multi_class': 'crammer_singer', 'random_state': 58}

Default score: 0.937062937063
Wall time: 1h 2min 31s
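To put the two scores in perspective, they can be converted back into numbers of correctly classified test samples. This is a small sketch that assumes the same split as the implementation (`train_test_split` with `random_state=0` and the default test size); the names `n_test`, `best_correct` and `default_correct` are only illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer_data = load_breast_cancer()
train_X, test_X, train_y, test_y = train_test_split(
    cancer_data.data, cancer_data.target, random_state=0)

# Size of the test set produced by the default 25% split
n_test = len(test_y)

# Convert the reported scores into counts of correct predictions
best_correct = round(0.972027972027972 * n_test)
default_correct = round(0.937062937063 * n_test)

print("test samples:", n_test)
print("correct (tuned):", best_correct)
print("correct (default):", default_correct)
```

On this single split, the gap between the tuned and default scores therefore corresponds to a handful of test samples.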

Conclusion

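Even with linear SVC, tuning the hyperparameters gave a higher accuracy on the test data than the default settings (about 0.972 versus about 0.937).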
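The tuned score was computed with `f1_score(..., average="micro")` while the default score came from `model.score`, which is accuracy. For single-label classification the two are the same quantity, so the comparison is fair. A minimal check with hypothetical labels (the lists `y_true` and `y_pred` are made up for this example):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted labels: 6 of the 8 predictions are correct
y_true = [0, 1, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# Micro-averaged F1 pools true positives, false positives and false negatives
# over all classes, which reduces to accuracy when each sample has one label
micro_f1 = f1_score(y_true, y_pred, average="micro")
acc = accuracy_score(y_true, y_pred)

print(micro_f1, acc)
```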
