More than 3 years have passed since last update.

Learn Python In Seminar〜自動車の走行距離を予測するモデル作成〜

Last updated at 2021-05-22Posted at 2021-05-16

Page 1 of 19

概要

自動車の属性データからガソリン１ガロンあたりの走行距離を予測するモデル作成を行います。
データはこちらからダウンロードしてください。

他にも「Learn Python In Seminar」のシリーズとして書いているので、よければご覧ください！

利用するデータの説明

カラム	ヘッダ名称	データ型	説明
1	mpg	float	ガソリン１ガロンあたりの走行距離
2	cylinders	varchar	シリンダー
3	displacement	float	排気量
4	horsepower	float	馬力
5	weight	float	重量
6	acceleration	float	加速度
7	model year	varchar	年式
8	origin	varchar	起源
9	car name	varchar	車名

ライブラリのインストール

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.linear_model import LinearRegression as LR

データの読み込み

df_train = pd.read_csv('data/train.tsv', sep='\t')

print(df_train.head())

↓

   id   mpg  cylinders  displacement horsepower  weight  acceleration  model year  origin          car name
0   0  29.0          4         135.0      84.00  2525.0          16.0          82       1    dodge aries se
1   3  31.9          4          89.0      71.00  1925.0          14.0          79       2  vw rabbit custom
2   9  19.0          6         156.0      108.0  2930.0          15.5          76       3    toyota mark ii
3  11  28.0          4          90.0      75.00  2125.0          14.5          74       1        dodge colt
4  13  37.7          4          89.0      62.00  2050.0          17.3          81       3     toyota tercel

欠損値の確認

print(df_train.isnull().sum())

↓

id              0
mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model year      0
origin          0
car name        0
dtype: int64

基本統計量

print(df_train.describe(include='all'))

↓

                id         mpg   cylinders  displacement horsepower       weight  acceleration  model year      origin    car name
count   199.000000  199.000000  199.000000    199.000000        199   199.000000    199.000000  199.000000  199.000000         199
unique         NaN         NaN         NaN           NaN         71          NaN           NaN         NaN         NaN         167
top            NaN         NaN         NaN           NaN      90.00          NaN           NaN         NaN         NaN  ford pinto
freq           NaN         NaN         NaN           NaN         12          NaN           NaN         NaN         NaN           4
mean    200.170854   24.307035    5.296482    183.311558        NaN  2883.839196     15.647739   76.165829    1.582915         NaN
std     113.432759    7.797665    1.644562     98.400457        NaN   819.766870      2.701885    3.802928    0.798932         NaN
min       0.000000    9.000000    3.000000     71.000000        NaN  1613.000000      8.500000   70.000000    1.000000         NaN
25%      98.500000   18.000000    4.000000     98.000000        NaN  2217.500000     14.000000   73.000000    1.000000         NaN
50%     202.000000   24.000000    4.000000    140.000000        NaN  2702.000000     15.500000   76.000000    1.000000         NaN
75%     294.500000   30.500000    6.000000    250.000000        NaN  3426.500000     17.150000   80.000000    2.000000         NaN
max     396.000000   44.600000    8.000000    454.000000        NaN  5140.000000     23.700000   82.000000    3.000000         NaN

相関係数

print(df_train.corr())

↓

                    id       mpg  cylinders  displacement    weight  acceleration  model year    origin
id            1.000000 -0.052688   0.103419      0.098416  0.070563     -0.087649   -0.093272 -0.031421
mpg          -0.052688  1.000000  -0.770160     -0.804635 -0.820660      0.379721    0.568471  0.533671
cylinders     0.103419 -0.770160   1.000000      0.950600  0.893256     -0.479561   -0.303462 -0.497373
displacement  0.098416 -0.804635   0.950600      1.000000  0.933038     -0.523955   -0.329817 -0.579805
weight        0.070563 -0.820660   0.893256      0.933038  1.000000     -0.401757   -0.265562 -0.535893
acceleration -0.087649  0.379721  -0.479561     -0.523955 -0.401757      1.000000    0.194854  0.189193
model year   -0.093272  0.568471  -0.303462     -0.329817 -0.265562      0.194854    1.000000  0.124279
origin       -0.031421  0.533671  -0.497373     -0.579805 -0.535893      0.189193    0.124279  1.000000

相関係数の可視化

mpg(ガロンあたりの走行距離)とdisplacement、weightに強い負の相関関係

相関関係の可視化

sns.pairplot(df_train[["mpg", "weight", "displacement"]])

↓

重回帰モデルの作成

# データの準備
y_train = df_train[["mpg"]] # 被説明変数
X_train = df_train[["weight", "displacement"]] # 説明変数

# モデルの生成
lr = LR()

# モデルの学習
lr.fit(X_train, y_train)

学習結果

print(lr.coef_) # 傾き
print(lr.intercept_) # 切片
print(lr.score(X_train, y_train)) # 決定係数

↓

[[-0.00513703 - 0.02383218]] # 傾き
[43.4901176] # 切片
0.6851901992035567 # 決定係数

y = b1 × -0.00513703 + b2 × -0.02383218 + 43.4901176

学習結果と実データの比較

df_train['pred'] = y_pred_train

# 実データと予測結果の差をresとして代入
df_train['res'] = df_train['mpg'] - df_train['pred']

# ソート
print(df_train.sort_values(by='res'))

↓

mpg   displacement  weight  model year                            car name       pred        res
19.0         122.0  2310.0          73                          ford pinto  28.716052 - 9.716052
20.0         140.0  2408.0          72                      chevrolet vega  27.783644 - 7.783644
18.0         121.0  2933.0          72                     volvo 145e(sw)  25.539515 - 7.539515
19.0         121.0  2868.0          73                         volvo 144ea  25.873422 - 6.873422
21.0         140.0  2401.0          73                      chevrolet vega  27.819603 - 6.819603
...           ...     ...         ...                                 ...        ...        ...
41.5          98.0  2144.0          80                           vw rabbit  30.140772  11.359228
43.1          90.0  1985.0          78     volkswagen rabbit custom diesel  31.148217  11.951783
44.6          91.0  1850.0          80                 honda civic 1500 gl  31.817884  12.782116
43.4          90.0  2335.0          80                  vw dasher(diesel)  29.350257  14.049743
38.0         262.0  3015.0          82   oldsmobile cutlass ciera(diesel)  21.757941  16.242059

model yearが古い時にマイナス、新しい時にプラスにずれている
すでにmodel yearとmpgに相関関係が確認

特徴量の追加

説明変数にmodel yearを追加

y_train = df_train[["mpg"]]
X_train = df_train[["weight", "displacement", "model year"]]


lr = LR()
lr.fit(X_train, y_train)
y_pred_train = lr.predict(X_train)

print(lr.score(X_train, y_train))

↓

0.806202914350324

model yearを説明変数に設定したところ決定変数の値が約0.2上昇

再検証

df_train['pred'] = y_pred_train
df_train['res'] = df_train['mpg'] - df_train['pred']
print(df_train.sort_values(by='res'))

↓

mpg  displacement  weight  model year                           car name       pred        res
19.0         122.0  2310.0          73                         ford pinto  25.822268 - 6.822268
22.0         232.0  2835.0          82                     ford granada l  28.804316 - 6.804316
19.8         200.0  2990.0          79                   mercury zephyr 6  25.711137 - 5.911137
21.5         121.0  2600.0          77                           bmw 320i  27.049786 - 5.549786
23.5         173.0  2725.0          81                 chevrolet citation  29.040204 - 5.540204
 ...           ...     ...         ...                                ...        ...        ...
36.4         121.0  2950.0          80                audi 5000s(diesel)  27.133402   9.266598
44.6          91.0  1850.0          80                honda civic 1500 gl  34.205448  10.394552
38.0         262.0  3015.0          82  oldsmobile cutlass ciera(diesel)  27.517075  10.482925
43.1          90.0  1985.0          78    volkswagen rabbit custom diesel  31.838867  11.261133
43.4          90.0  2335.0          80                 vw dasher(diesel)  31.161030  12.238970

通常年式だけではなく、車の種類とセットのため、car nameもセットで関係していると予測

ダミー変数化

ダミー変数化できるカラムを自動で認識

X_train = pd.get_dummies(
    df_train[["weight", "displacement", "model year", "car name"]]) 
print(X_train)

↓

weight  displacement  model year  ...  car name_vw dasher(diesel)  car name_vw rabbit  car name_vw rabbit custom
2525.0         135.0          82  ...                            0                   0                    0
1925.0          89.0          79  ...                            0                   0                    1
2930.0         156.0          76  ...                            0                   0                    0
2125.0          90.0          74  ...                            0                   0                    0
2050.0          89.0          81  ...                            0                   0                    0
...           ...         ...  ...                          ...                 ...                  ...
2110.0          85.0          80  ...                            0                   0                    0
3570.0         302.0          78  ...                            0                   0                    0
3433.0         304.0          70  ...                            0                   0                    0
2335.0          90.0          80  ...                            1                   0                    0
2265.0          98.0          73  ...                            0                   0                    0

再度モデルの作成と検証

y_train = df_train[["mpg"]]
X_train = pd.get_dummies(
    df_train[["weight", "displacement", "model year", "car name"]]) 

lr = LR()
lr.fit(X_train, y_train)
y_pred_train = lr.predict(X_train)

df_train['pred'] = y_pred_train
df_train['res'] = df_train['mpg'] - df_train['pred']
print(df_train.sort_values(by='res'))

print(lr.score(X_train, y_train))

↓

mpg  displacement  weight  model year            car name       pred       res
29.0          90.0  1937.0          76           vw rabbit  34.656354 - 5.656354
31.8          85.0  2020.0          79          datsun 210  36.276274 - 4.476274
19.0         122.0  2310.0          73          ford pinto  23.364118 - 4.364118
25.0         116.0  2220.0          76           opel 1900  27.625883 - 2.625883
23.5         173.0  2725.0          81  chevrolet citation  26.067394 - 2.567394
 ...           ...     ...         ...                 ...        ...       ...
28.0         116.0  2123.0          71           opel 1900  25.374117  2.625883
26.5         140.0  2565.0          76          ford pinto  23.871847  2.628153
30.5          97.0  2190.0          77   volkswagen dasher  27.862259  2.637741
40.8          85.0  2110.0          80          datsun 210  36.323726  4.476274
41.5          98.0  2144.0          80           vw rabbit  35.843646  5.656354

0.9834170393004322

予測データと実データの差が減少
car nameのダミー変数化前に比べて決定係数が約0.3上昇

結果

horsepower、weight、model yearに自動車の走行距離(mpg)に影響
car nameをダミー変数化することで決定係数が大幅に改善

課題

car nameを説明変数に入れると未知の車種を予測できなかった(過学習)
→メーカーと車種を分けて分析する必要がある
originやcylindersが未検証
null以外の欠損値があとで発見

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up