
Notes on calling classification methods in scikit-learn

Last updated at 2019-02-20, posted at 2019-01-02

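When writing machine learning programs with scikit-learn, I often forget function names and what the arguments mean, so I am keeping these notes as a reference.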

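1. Preparing the data

1.1 Variable names for the data

Variable	Meaning
x_train	Features of the training data
y_train	Ground-truth labels of the training data
x_test	Features of the test data
y_test	Ground-truth labels of the test data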

1.2 Splitting into training and test data

CrossValidation.py

# sklearn.cross_validation was removed in newer scikit-learn; use sklearn.model_selection
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=149)

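Arguments

Argument	Meaning
x	Features of the data
y	Ground-truth labels of the data
test_size	Fraction of the data assigned to the test set after the split
random_state	Random seed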
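
As a quick check of what the arguments above do, here is a minimal sketch. The Iris dataset is used only as a stand-in (it is an assumption, not part of the original notes); any `x` and `y` with matching lengths would work the same way.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris (150 samples, 4 features) is a stand-in dataset chosen for illustration.
x, y = load_iris(return_X_y=True)

# test_size=0.2 puts 20% of the samples into the test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=149
)
print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)

# Reusing the same random_state reproduces exactly the same split.
x_train_again, _, _, _ = train_test_split(x, y, test_size=0.2, random_state=149)
print(np.array_equal(x_train, x_train_again))  # True
```

Fixing `random_state` is what makes scores comparable between runs; leaving it as `None` gives a different split each time.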

1.3 Standardizing the data

Standard.py

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(x_train)
x_train_standard = sc.transform(x_train)
x_test_standard = sc.transform(x_test)

After standardization, each feature of the data the scaler was fitted on (here x_train) has:
- mean μ = 0
- standard deviation σ = 1
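
The two properties above can be verified numerically. The following is a small sketch on synthetic data (the random arrays and their parameters are assumptions made up for illustration). It also shows that the test data is transformed with the training statistics, so its mean and standard deviation are only approximately 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features with mean around 10 and standard deviation around 3 (illustrative only).
rng = np.random.default_rng(0)
x_train = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
x_test = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

sc = StandardScaler()
sc.fit(x_train)  # learns the per-feature mean and standard deviation from x_train only
x_train_standard = sc.transform(x_train)
x_test_standard = sc.transform(x_test)

print(x_train_standard.mean(axis=0))  # approximately 0 for every feature
print(x_train_standard.std(axis=0))   # approximately 1 for every feature
print(x_test_standard.mean(axis=0))   # near 0, but not exactly 0

# transform() is equivalent to (x - mean) / std using the statistics learned in fit().
manual = (x_train - sc.mean_) / sc.scale_
print(np.allclose(manual, x_train_standard))
```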

1.4 Principal component analysis (PCA)

PCA.py

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
x_train_pca = pca.fit_transform(x_train_standard)
x_test_pca = pca.transform(x_test_standard)

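What PCA does
- Compresses the dimensionality of the features while preserving the main characteristics of the data (the directions of largest variance)

Arguments (PCA)

Argument	Meaning
n_components	Number of feature dimensions after compression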
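
To see how much information survives the compression, `explained_variance_ratio_` can be inspected after fitting. This is a minimal sketch that assumes the Iris dataset as a stand-in; the dataset choice is not from the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Iris has 4 features; it is used here only for illustration.
x, _ = load_iris(return_X_y=True)
x_standard = StandardScaler().fit_transform(x)

pca = PCA(n_components=3)
x_pca = pca.fit_transform(x_standard)

print(x_pca.shape)  # (150, 3): 4 features compressed to 3
# Fraction of the total variance captured by each component, largest first.
print(pca.explained_variance_ratio_)
# Total fraction of variance kept after compression.
print(pca.explained_variance_ratio_.sum())
```

If the summed ratio is already close to 1 with fewer components, a smaller `n_components` may be enough.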

2. Classification methods

Support vector machine (SVM)
Logistic regression
Random forest
k-nearest neighbors

2.1 Support vector machine (SVM)

SVM.py

from sklearn.svm import SVC
svm = SVC(kernel='linear', C=1.0, random_state=149)
svm.fit(x_train, y_train)

score = svm.score(x_test, y_test)
print('Accuracy: {}'.format(score))

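Arguments (SVC)

Argument	Default	Meaning
kernel	'rbf'	Type of kernel to use
C	1.0	Regularization parameter. The smaller C is, the stronger the regularization.
random_state	None	Random seed

Kernel types

Value	Meaning
'rbf'	RBF (Gaussian) kernel
'linear'	Linear kernel
'poly'	Polynomial kernel
'sigmoid'	Sigmoid kernel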
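
The four kernels listed above can be tried side by side by looping over the `kernel` argument. The sketch below assumes the Iris dataset and standardized features (both are illustrative choices, not from the original notes); which kernel scores best depends on the data, so no particular ranking should be expected.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset and split (assumptions for illustration).
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=149
)

# SVMs are sensitive to feature scale, so standardize with the training statistics.
sc = StandardScaler().fit(x_train)
x_train_standard = sc.transform(x_train)
x_test_standard = sc.transform(x_test)

scores = {}
for kernel in ('rbf', 'linear', 'poly', 'sigmoid'):
    svm = SVC(kernel=kernel, C=1.0, random_state=149)
    svm.fit(x_train_standard, y_train)
    scores[kernel] = svm.score(x_test_standard, y_test)
    print('{}: accuracy = {:.3f}'.format(kernel, scores[kernel]))
```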

2.2 Logistic regression

LogisticRegression.py

from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression(penalty='l2', C=100, random_state=149)
logistic_regression.fit(x_train, y_train)

# Use the same variable name the model was assigned to above
score = logistic_regression.score(x_test, y_test)
print('Accuracy: {}'.format(score))

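Arguments

Argument	Default	Meaning
penalty	'l2'	Type of regularization ('l1': L1 regularization, 'l2': L2 regularization)
C	1.0	Inverse of the regularization strength. The smaller C is, the stronger the regularization.
random_state	None	Random seed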
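
The effect of `C` is easiest to see in the size of the learned coefficients: stronger regularization (small `C`) shrinks them. Here is a minimal sketch under assumed settings (Iris as a stand-in dataset, the default L2 penalty, and `max_iter=1000` chosen so the solver has room to converge).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in dataset, standardized so the coefficients are comparable across features.
x, y = load_iris(return_X_y=True)
x_standard = StandardScaler().fit_transform(x)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # The default penalty is L2; max_iter is raised from the default as a precaution.
    model = LogisticRegression(C=C, max_iter=1000, random_state=149)
    model.fit(x_standard, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print('C={}: coefficient norm = {:.3f}'.format(C, coef_norms[C]))
```

The printed norm grows as `C` grows, which matches the table: a small `C` means strong regularization.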

2.3 Random forest

RandomForest.py

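# A random forest is RandomForestClassifier in sklearn.ensemble;
# DecisionTreeClassifier would train only a single tree.
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(criterion='entropy', max_depth=3, random_state=149)
random_forest.fit(x_train, y_train)

score = random_forest.score(x_test, y_test)
print('Accuracy: {}'.format(score))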

Arguments

Argument	Default	Meaning
criterion	'gini'	Impurity measure ('gini': Gini impurity, 'entropy': entropy)
max_depth	None	Maximum depth of each tree
random_state	None	Random seed
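
A random forest is an ensemble of decision trees, and the arguments above apply to every tree in it. The sketch below illustrates this with `RandomForestClassifier` from `sklearn.ensemble`; the Iris dataset and `n_estimators=100` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in dataset and split (assumptions for illustration).
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=149
)

random_forest = RandomForestClassifier(
    n_estimators=100, criterion='entropy', max_depth=3, random_state=149
)
random_forest.fit(x_train, y_train)

# The fitted forest holds one decision tree per estimator.
print(len(random_forest.estimators_))  # 100
# max_depth limits every individual tree.
deepest = max(tree.get_depth() for tree in random_forest.estimators_)
print(deepest)
# Relative importance of each feature, normalized to sum to 1.
print(random_forest.feature_importances_)
print('Accuracy: {}'.format(random_forest.score(x_test, y_test)))
```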

2.4 K-nearest neighbors (KNN)

KNN.py

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(x_train, y_train)

score = knn.score(x_test, y_test)
print('Accuracy: {}'.format(score))

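Arguments

Argument	Default	Meaning
n_neighbors	5	Number of nearest data points to consult
p	2	Power parameter of the Minkowski metric (1: Manhattan distance, 2: Euclidean distance)
metric	'minkowski'	Distance metric; the Minkowski distance generalizes the Euclidean and Manhattan distances

Minkowski distance (p=1: Manhattan distance, p=2: Euclidean distance)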

d(x^{i}, x^{j}) = \sqrt[p]{\sum_{k}\bigl|x_k^i - x_k^j\bigr|^{p}}
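
The formula can be checked by hand with a few lines of NumPy. The helper name `minkowski_distance` and the two sample points are hypothetical, introduced only for this illustration; scikit-learn computes the distance internally.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance between two points (illustrative helper, not a scikit-learn API)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # p-th root of the sum of absolute coordinate differences raised to the power p.
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = [0.0, 0.0]
b = [3.0, 4.0]
print(minkowski_distance(a, b, p=1))  # 7.0 (Manhattan: 3 + 4)
print(minkowski_distance(a, b, p=2))  # 5.0 (Euclidean: sqrt(9 + 16))
```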

References

Official scikit-learn website
