More than 5 years have passed since last update.

[SIGNATE練習問題]自動車の評価をやってみた

Last updated at 2019-09-08Posted at 2019-09-08

データ分析の練習として、SIGNATEの練習問題に取り組んでみた
【練習問題】自動車の評価

「自動車の情報から自動車の評価値を予測するモデル」と書いてある

データ概要

課題種別：分類
データ種別：多変量
学習データサンプル数：864
説明変数の数：6
欠損値：無し

カラム	ヘッダ名称	データ型	説明
0	id	int	インデックスとして使用
1	class	varchar	評価値（unacc, acc, good, vgood）
2	buying	varchar	車の売値（vhigh, high, med, low）
3	maint	varchar	整備代（vhigh, high, med, low）
4	doors	int	ドアの数（2, 3, 4, 5, more.）
5	persons	int	定員（2, 4, more.）
6	lug_boot	varchar	トランクの大きさ（small, med, big.）
7	safety	varchar	安全性（low, med, high）

解析

まずは手を動かしてみるということで、ライブラリのインポートから

# ライブラリ
import numpy as np
import pandas as pd
import pandas_profiling as pdpf
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
from xgboost import XGBRFClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

データ読み込み

# train data
train = pd.read_csv('./train.tsv',sep='\t')
# test data
test = pd.read_csv('./test.tsv', sep='\t')
# 確認
display(train)

それぞれがどんな分布をしているかを確認

DoorとPersonで値がカブっているものがあるので、それぞれ分かりやすく変更


# Door数
train.doors[train.doors == '2'] = 'two_doors'
train.doors[train.doors == '3'] = 'three_doors'
train.doors[train.doors == '4'] = 'four_doors'

test.doors[test.doors == '2'] = 'two_doors'
test.doors[test.doors == '3'] = 'three_doors'
test.doors[test.doors == '4'] = 'four_doors'

# 乗員数
train.persons[train.persons == '2'] = '2seater'
train.persons[train.persons == '4'] = '4seater'
test.persons[test.persons == '2'] = '2seater'
test.persons[test.persons == '4'] = '4seater'

目的変数を分ける

今回は、"class"を正解値として当てる必要があるため、学習データから分離する

# 説明変数
train_ = train.drop(['id','class'], axis=1)
# test用の説明変数
test_ = test.drop('id', axis=1)
# 目的変数
y = train['class']

# 確認
print("train_:",train_.shape,
      "y:",y.shape,
      "test_ :",test_.shape)

train_: (864, 6) y: (864,) test_ : (864, 6)

カテゴリ変数のエンコード

# trainとtestでカテゴリーが異なるとダメだから、trainとtestをマージ
length = train.shape[0]

df = pd.concat([train_, test_], axis=0)
print(df.columns)
## Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'], dtype='object')

pandasのget_dummiesを使ってエンコード

df_dummy = pd.get_dummies(df, prefix=df.columns)
display(df_dummy)

new_train = df_dummy.iloc[:length,:]
new_test = df_dummy.iloc[length:,:]

目的変数も同様にエンコード


# LabelEncoderのインスタンスを生成
le = LabelEncoder()
# ラベルを覚えさせる
le = le.fit(["unacc", "acc", "good", "vgood"])
# ラベルを整数に変換
y_ = le.transform(y)
y__ = le.inverse_transform(y_)

trainとvalidationに分ける

学習モデルの精度を評価したいから、trainデータを20%分だけ、validationへ

X_train, X_test, y_train, y_test = train_test_split(new_train, y_, 
                                                    test_size = .2,
                                                    random_state = 666,
                                                    stratify = y_)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (691, 21) (173, 21) (691,) (173,)

XGBoostを使って学習

今回はまだまだ根強い人気を誇っているXGBoostを使ってみる
※パラメータ調整とかは次回(?)更新予定

# 学習
xgb = XGBRFClassifier()
xgb.fit(X_train, y_train)

# 評価
xgb.score(X_test, y_test)
# 0.791907514450867

だいたい80%くらいの精度
おそらく提出用のtestデータでも同じくらいか若干下がるかくらいの精度だと思われる

結果提出

まずは、test用データで予測させ、sample_submitの提出用ファイルを書き換え

# 予測
pred = xgb.predict(new_test)
# ラベルをデコーディング
pred_ = le.inverse_transform(pred)

# 提出用ファイル読み込み
submit = pd.read_csv('./sample_submit.csv', sep=',',header=None)
# 予測結果に書き換え
submit.loc[:,1] = pred_
# 予測結果ファイル出力
submit.to_csv("test_submit.csv", index = None, header = None)

結果

この結果をSIGNATEにアップロードしたところ
やっぱり80%くらいの精度だった

ここから、説明変数を増やしたりモデルを改良したりなど様々な手法を組み合わせて精度を上げていく
次回に続く...

参考：

https://qiita.com/uratatsu/items/8bedbf91e22f90b6e64b
https://qiita.com/yh0sh/items/1df89b12a8dcd15bd5aa

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up