Qiita Teams that are logged in
You are not logged in to any team

Community
Service
Qiita JobsQiita ZineQiita Blog
26
Help us understand the problem. What is going on with this article?
@canard0328

# scikit-learnのSVM（SVC）の処理速度について

More than 3 years have passed since last update.

2016.09.14 処理時間のバラつきについて追記しました

scikit-learnのSVC（rbfカーネルとlinearカーネル）とLinearSVCの処理速度を比較してみました．

ラベルはspam:1813サンプル，nonspam:2788サンプルです．

サンプル数，次元数を変えた時の結果は以下の通りです．

SVCのlinearカーネルが遅すぎますね．
ついついカーネル種別まで含めてグリッドサーチしてしまいたくなりますが，
きちんとLinearSVCを使ったほうが良さそうです．

また特徴量選択（次元削減）はRandomForestのfeature importanceを利用しました．
これは適当に選択したところ，逆に処理時間が長くなったためです．

test_svm.py
``````# -*- coding: utf-8 -*-

import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from scipy.stats.mstats import mquantiles

def grid_search(X, y, estimator, params, cv, n_jobs=3):
mdl = GridSearchCV(estimator, params, cv=cv, n_jobs=n_jobs)
t1 = time.clock()
mdl.fit(X, y)
t2 = time.clock()
return t2 - t1

if __name__=="__main__":
y = data['type']
del data['type']

data, y = shuffle(data, y, random_state=0)
data = StandardScaler().fit_transform(data)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(data, y)

ndim, elp_rbf, elp_lnr, elp_lsvm = [], [], [], []
for thr in mquantiles(clf.feature_importances_, prob=np.linspace(1., 0., 5)):
print thr,
X = data[:,clf.feature_importances_ >= thr]
ndim.append(X.shape[1])

cv = cross_validation.StratifiedShuffleSplit(y, test_size=0.2, random_state=0)

print 'rbf',
elp_rbf.append(grid_search(X, y, SVC(random_state=0),
[{'kernel': ['rbf'], 'C': [1, 10, 100]}], cv))

print 'linear',
elp_lnr.append(grid_search(X, y, SVC(random_state=0),
[{'kernel': ['linear'], 'C': [1, 10, 100]}], cv))

print 'lsvm'
elp_lsvm.append(grid_search(X, y, LinearSVC(random_state=0),
[{'C': [1, 10, 100]}], cv))

plt.figure()
plt.title('Elapsed time - # of dimensions')
plt.ylabel('Elapsed time [sec]')
plt.xlabel('# of dimensions')
plt.grid()
plt.plot(ndim, elp_rbf, 'o-', color='r',
label='SVM(rbf)')
plt.plot(ndim, elp_lnr, 'o-', color='g',
label='SVM(linear)')
plt.plot(ndim, elp_lsvm, 'o-', color='b',
label='LinearSVM')
plt.legend(loc='best')
plt.savefig('dimensions.png', bbox_inches='tight')
plt.close()

nrow, elp_rbf, elp_lnr, elp_lsvm = [], [], [], []
for r in np.linspace(0.1, 1., 5):
print r,
X = data[:(r*data.shape[0]),:]
yy = y[:(r*data.shape[0])]
nrow.append(X.shape[0])

cv = cross_validation.StratifiedShuffleSplit(yy, test_size=0.2, random_state=0)

print 'rbf',
elp_rbf.append(grid_search(X, yy, SVC(random_state=0),
[{'kernel': ['rbf'], 'C': [1, 10, 100]}], cv))

print 'linear',
elp_lnr.append(grid_search(X, yy, SVC(random_state=0),
[{'kernel': ['linear'], 'C': [1, 10, 100]}], cv))

print 'lsvm'
elp_lsvm.append(grid_search(X, yy, LinearSVC(random_state=0),
[{'C': [1, 10, 100]}], cv))

plt.figure()
plt.title('Elapsed time - # of samples')
plt.ylabel('Elapsed time [sec]')
plt.xlabel('# of samples')
plt.grid()
plt.plot(nrow, elp_rbf, 'o-', color='r',
label='SVM(rbf)')
plt.plot(nrow, elp_lnr, 'o-', color='g',
label='SVM(linear)')
plt.plot(nrow, elp_lsvm, 'o-', color='b',
label='LinearSVM')
plt.legend(loc='best')
plt.savefig('samples.png', bbox_inches='tight')
plt.close()
``````

### 追記

SVM(linear)の処理時間についてコメントをいただいたので調べてみました．
Python2.7.12，scikit-learn0.17.1で，
データ数1000，特徴量数29，200回試行したときの処理時間のバラつきは下図のようになりました．

SVM(linear)，怪しいですね…

26
Help us understand the problem. What is going on with this article?
Why not register and get more from Qiita?
1. We will deliver articles that match you
By following users and tags, you can catch up information on technical fields that you are interested in as a whole
2. you can read useful information later efficiently
By "stocking" the articles you like, you can search right away