[Machine Learning] Classifying car evaluations

Last updated at 2019-03-03. Posted at 2019-03-02.

[Retrospective] I forgot about nominal and interval scales when doing classification [A common beginner mistake?]

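I have been studying programming for three months and machine learning for three weeks, and have now skimmed broadly but shallowly through the basics.
This time I used a SIGNATE practice problem to check my understanding of classification, and this post is a record of that.
I still have a lot to learn, so I would appreciate any feedback.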

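1. Purpose of the analysis

The goal is to predict the evaluation value (class) of a car from its attribute data.
[Practice problem] Car evaluation

The data contains the following columns: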

Header name	Data type	Description
id	int	Used as the index
class	varchar	Evaluation value (unacc, acc, good, vgood)
buying	varchar	Purchase price of the car (vhigh, high, med, low)
maint	varchar	Maintenance cost (vhigh, high, med, low)
doors	int	Number of doors (2, 3, 4, 5more)
persons	int	Seating capacity (2, 4, more)
lug_boot	varchar	Trunk size (small, med, big)
safety	varchar	Safety (low, med, high)

2. Loading the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df=pd.read_table('train.tsv')
df.head()

id	class	buying	maint	doors	persons	lug_boot	safety
0	0	unacc	low	med	3	2	small	low
1	3	acc	low	high	3	more	small	med
2	7	unacc	vhigh	high	5more	2	small	med
3	11	acc	high	high	3	more	big	med
4	12	unacc	high	high	3	2	med	high

3. Converting strings
As a first pass, I replaced every string with a number.

df['class']=df['class'].apply(lambda x:1 if x=='unacc' else(2 if x=='acc' else(3 if x=='good' else 4)))
df['buying']=df['buying'].apply(lambda x:1 if x=='low' else(2 if x=='med' else(3 if x=='high' else 4)))
df['maint']=df['maint'].apply(lambda x:1 if x=='low' else(2 if x=='med' else(3 if x=='high' else 4)))
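# Use .loc rather than chained indexing so the assignment reliably modifies df,
# then cast to int because the other values in these columns may still be strings such as '2'.
df.loc[df['doors']=='5more','doors']=5
df['doors']=df['doors'].astype(int)
df.loc[df['persons']=='more','persons']=5
df['persons']=df['persons'].astype(int)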
df['lug_boot']=df['lug_boot'].apply(lambda x:1 if x=='small' else(2 if x=='med' else 3))
df['safety']=df['safety'].apply(lambda x:1 if x=='low' else(2 if x=='med' else 3))
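The nested conditional expressions above can also be written as dictionary lookups, which keeps each ordering visible in one place. This is only a sketch: `toy`, `price_order`, and `level_order` are hypothetical names, and the rows are made up rather than taken from train.tsv.

```python
import pandas as pd

# Hypothetical toy rows using the same category labels as the post.
toy = pd.DataFrame({
    'buying': ['low', 'med', 'high', 'vhigh'],
    'safety': ['low', 'med', 'high', 'low'],
})

# One dictionary per ordered scale; the orderings match the lambdas above.
price_order = {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}
level_order = {'low': 1, 'med': 2, 'high': 3}

encoded = toy.copy()
encoded['buying'] = toy['buying'].map(price_order)
encoded['safety'] = toy['safety'].map(level_order)
print(encoded)
```

One difference to keep in mind: with `map`, a label missing from the dictionary becomes NaN, whereas the `else` branch of the lambdas silently assigns the largest code.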

The id feature is not needed, so I dropped it before displaying the result.

index	class	buying	maint	doors	persons	lug_boot	safety
0	1	1	2	3	2	1	1
1	2	1	3	3	5	1	2
2	1	4	3	5	2	1	2
3	2	3	3	3	5	3	2
4	1	3	3	3	2	2	3
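The step that drops id is described but not shown in the post. A minimal sketch of how it could be done, assuming a frame laid out like train.tsv (the `toy` frame and its values are made up):

```python
import pandas as pd

# Hypothetical two-row frame with a subset of the train.tsv columns.
toy = pd.DataFrame({
    'id': [0, 3],
    'class': [1, 2],
    'buying': [1, 1],
    'safety': [1, 2],
})

# Drop the identifier so that 'class' becomes the first column,
# which is what a later iloc-based X/y split would rely on.
toy = toy.drop(columns='id')
print(toy.columns.tolist())
```

After this, column 0 is the target and the remaining columns are features.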

4. Inspecting the data

sns.countplot(data=df, x='class')
plt.title('1:unacceptable 2:acc 3:good 4:very-good')

unacc (unacceptable) appears to account for the majority of the samples.

Next, I check the correlations between the columns.

sns.heatmap(df.corr(),annot=True,cmap='cool')

class is somewhat correlated with safety and almost uncorrelated with lug_boot (trunk size).

I also look at each feature individually.

figure,axes=plt.subplots(2,2,figsize=(10,10))
sns.countplot(data=df,x='buying',hue='class',ax=axes[0,0])
sns.countplot(data=df,x='maint',hue='class',ax=axes[0,1])
sns.countplot(data=df,x='lug_boot',hue='class',ax=axes[1,0])
sns.countplot(data=df,x='safety',hue='class',ax=axes[1,1])
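A numeric companion to the count plots above is a row-normalized cross-tabulation, which shows what share of each feature level falls into each class. The sketch below uses a made-up `toy` frame, not the competition data, so the numbers only illustrate the shape of the output.

```python
import pandas as pd

# Made-up encoded values: safety level and class for eight cars.
toy = pd.DataFrame({
    'safety': [1, 1, 1, 2, 2, 3, 3, 3],
    'class':  [1, 1, 1, 1, 2, 1, 2, 4],
})

# normalize='index' turns each row into proportions that sum to 1.
rates = pd.crosstab(toy['safety'], toy['class'], normalize='index')
print(rates.round(2))
```

Reading across a row answers questions such as "what fraction of the lowest-safety cars are class 1?", which is harder to judge by eye from bar heights.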

5. Splitting the data

X=df.iloc[:,1:]
y=df.iloc[:,0]

X=np.array(X)
y=np.array(y)

This time I split the data with stratified k-fold, using only the first of the 10 folds as the train/test split.

from sklearn.model_selection import StratifiedKFold
ss=StratifiedKFold(n_splits=10,shuffle=True)

train_index, test_index=next(ss.split(X,y))

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
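Taking only the first fold gives a single score that depends on one random split. A possible alternative is to score a model on all ten folds with `cross_val_score` and look at the mean and spread. The sketch below runs on synthetic stand-in data from `make_classification` (an assumption, since train.tsv is not available here); `X_demo`, `y_demo`, `cv`, and `fold_scores` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption): 6 features, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0)

# Same splitter as in the post, with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy value per fold.
fold_scores = cross_val_score(SVC(kernel='rbf'), X_demo, y_demo, cv=cv)
print(fold_scores.mean(), fold_scores.std())
```

With the real data, `X` and `y` from this section would take the place of `X_demo` and `y_demo`.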

6-1 Logistic regression

from sklearn.linear_model import LogisticRegression
clf=LogisticRegression(C=1000)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

The score was 0.8764044943820225.

6-2 Linear SVC

from sklearn.svm import SVC
clf=SVC(kernel='linear')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Score: 0.9325842696629213

6-3 Nonlinear SVC

from sklearn.svm import SVC
clf=SVC(kernel='rbf')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Score: 0.9550561797752809

6-4 k-nearest neighbors

from sklearn import neighbors
clf=neighbors.KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Score: 0.9550561797752809

Next, I try different values of k.

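n_range = range(1,20)
scores = []
for n in n_range:
    # Fit a fresh model for each k; a separate name leaves clf untouched
    knn = neighbors.KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    print(n, score)
    scores.append(score)
scores = np.array(scores)

# Put k on the x-axis instead of the list index
plt.plot(n_range, scores)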

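k=18 reached 98%, but this probably does not generalize well, because k was picked by looking at scores on the same test fold.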
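One way to reduce that risk is to choose k by cross-validation on the training portion only and touch the test portion once at the end. The sketch below uses `GridSearchCV` on synthetic stand-in data (an assumption); all `*_demo`, `*_tr`, and `*_te` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption): 6 features, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0)

# Hold out 10% for the final check, stratified by class.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0)

# Try k = 1..19 with 5-fold cross-validation inside the training data.
search = GridSearchCV(
    KNeighborsClassifier(), {'n_neighbors': list(range(1, 20))}, cv=5)
search.fit(X_tr, y_tr)

# The held-out score is computed once, with the k chosen above.
print(search.best_params_, search.score(X_te, y_te))
```

The held-out score is then an estimate for a k that was not tuned against that same data.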

6-5 Random forest

from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Next, I vary the number of trees.

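n_range = range(1,100,10)
scores = []
for n in n_range:
    # Refit for every setting: a fitted forest ignores later changes to n_estimators
    rf = RandomForestClassifier(n_estimators=n)
    rf.fit(X_train, y_train)
    score = rf.score(X_test, y_test)
    print(n, score)
    scores.append(score)
scores = np.array(scores)

Changing n_estimators did not change the score. That is expected for the loop as I originally ran it: it only reassigned clf.n_estimators without calling fit again, so the same 100 fitted trees were scored every time. The loop above refits the model for each value.
I also tried max_depth and saw no change, probably for the same reason.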

7 Evaluating the results

from sklearn.metrics import classification_report
y_pred=clf.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))

class	precision	recall	f1-score	support
1	0.9833	0.9833	0.9833	60
2	0.8333	0.9524	0.8889	21
3	0.5000	0.2500	0.3333	4
4	1.0000	0.7500	0.8571	4

micro avg	0.9326	0.9326	0.9326	89
macro avg	0.8292	0.7339	0.7657	89
weighted avg	0.9270	0.9326	0.9262	89

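For the random forest from 6-5 (the model that clf refers to at this point), class 3 (good) was classified poorly: precision 0.50 and recall 0.25, on only 4 test samples.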
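The report says that class 3 is missed, but not which class it is mistaken for. A confusion matrix shows that directly. The labels in this sketch are made up to keep it self-contained; with the real data, `y_test` and `y_pred` would be passed instead of the hypothetical `y_true_demo` and `y_pred_demo`.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (1=unacc, 2=acc, 3=good, 4=vgood).
y_true_demo = [1, 1, 1, 2, 2, 3, 3, 4]
y_pred_demo = [1, 1, 2, 2, 2, 2, 3, 4]

# Rows are true classes, columns are predicted classes, in label order.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3, 4])
print(cm)
```

In this toy example the third row has one count in the class 2 column, meaning one "good" car was predicted as "acc".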

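8 Conclusion

In this experiment, k-nearest neighbors (k=5) reached about 95% accuracy on the held-out fold, the same score as the RBF-kernel SVC.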

9 What I did not understand

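1. A smarter way to convert features
Dummy variables did not work well for me, so I need to learn other approaches.
2. Effective data analysis
Honestly, I could not extract anything from this data analysis that was useful for the classification step.
What should I focus on when organizing the data, and how do I connect that to classification?
3. Class 3 (good) was classified poorly, but I do not know why.
I cannot turn this observation into a concrete change that would raise accuracy, so as it stands the evaluation is of little use.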
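For reference on item 1, this is roughly what dummy-variable (one-hot) encoding looks like with `pd.get_dummies`. Integer codes treat the categories as evenly spaced numbers, while one-hot columns make no assumption about order or spacing. This is only a sketch on a made-up `toy` frame, and it makes no claim that it would improve the scores reported above.

```python
import pandas as pd

# Made-up rows using raw labels like those in train.tsv.
toy = pd.DataFrame({
    'lug_boot': ['small', 'med', 'big', 'small'],
    'doors': ['2', '3', '5more', '4'],
})

# Each distinct label becomes its own indicator column.
dummies = pd.get_dummies(toy, columns=['lug_boot', 'doors'])
print(dummies.columns.tolist())
```

Three trunk sizes and four door labels give seven indicator columns in total.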
