
Data exploration and regression with Python [Kaggle, EDA, random forest]

Last updated at 2019-07-30, posted at 2019-07-27

This time I introduce this one and retype its code as I go.

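About the data

The data used here is (apparently) insurance premium data.

Added 2019/07/30: the data turns out to be the costs billed by health insurance.

It contains region of residence, age, smoking status, number of children and so on, along with the charges.

The idea is to look at this data from various angles with EDA, choose the variables for a model,
and finally estimate the charges by regression.

Let's get started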

import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as pl
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv('C:/・・・/insurance.csv')

The usual first steps

data.head()
data.isnull().sum()

 	age 	sex 	bmi 	children 	smoker 	region 	charges
0 	19 	female 	27.900 	0 	yes 	southwest 	16884.92400
1 	18 	male 	33.770 	1 	no 	southeast 	1725.55230
2 	28 	male 	33.000 	3 	no 	southeast 	4449.46200
3 	33 	male 	22.705 	0 	no 	northwest 	21984.47061
4 	32 	male 	28.880 	0 	no 	northwest 	3866.85520

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

A day when the data contains no NaN is a lucky day.
Next, encode the categorical data and compute the correlations between the variables.

from sklearn.preprocessing import LabelEncoder
# sex
le = LabelEncoder()
le.fit(data.sex.drop_duplicates()) 
data.sex = le.transform(data.sex)
# smoker or not
le.fit(data.smoker.drop_duplicates()) 
data.smoker = le.transform(data.smoker)
# region
le.fit(data.region.drop_duplicates()) 
data.region = le.transform(data.region)

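For each categorical column, the encoder is fitted on the de-duplicated values and the column is then converted to integer codes.
A variable like region should really go through something like OneHotEncoder, but this time the notebook lazily sticks with LabelEncoder.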
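For reference, here is a minimal sketch of what one-hot encoding region could look like. It uses pd.get_dummies rather than OneHotEncoder (my choice, not the original notebook's), and the toy frame and its age values are made up for illustration.

```python
import pandas as pd

# Toy frame with the same region labels as the dataset (ages are made up).
toy = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "age": [19, 18, 33, 45],
})

# One 0/1 indicator column per region, so no artificial ordering
# between regions is implied (unlike the integer codes 0..3).
encoded = pd.get_dummies(toy, columns=["region"], dtype=int)
print(encoded.columns.tolist())
```

With integer codes, a linear model treats region as if 3 were "three times" 1; the indicator columns avoid that.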

data.corr()['charges'].sort_values()

region     -0.006208
sex         0.057292
children    0.067998
bmi         0.198341
age         0.299008
smoker      0.787251
charges     1.000000

f, ax = pl.subplots(figsize=(10, 8))
corr = data.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(240,10,as_cmap=True),
            square=True, ax=ax)

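The correlations show that being a smoker has a strong positive correlation with charges (0.79).
The original author apparently expected a correlation with bmi instead.

Next, check the distribution of charges.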

from bokeh.io import output_notebook, show
from bokeh.layouts import gridplot
from bokeh.plotting import figure
output_notebook()
p = figure(title="Distribution of charges",tools="save",background_fill_color="#E8DDCB")
hist, edges = np.histogram(data.charges)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],fill_color="#036564", line_color="#033649")
p.xaxis.axis_label = 'x'
p.yaxis.axis_label = 'Pr(x)'

The original source code has

show(gridplot(p,ncols = 2, plot_width=400, plot_height=400, toolbar_location=None))

but that did not run for me, so I used

show(p)

Looking at the distribution of charges, extremely high values are rare.
Most are around 10000.
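As a side note on how the histogram above is built: np.histogram returns the bin counts and the bin edges, and there is always one more edge than there are bins. That is why the quad call pairs edges[:-1] (left sides) with edges[1:] (right sides). A small sketch with made-up values standing in for data.charges:

```python
import numpy as np

# Made-up values standing in for data.charges.
values = np.array([1200.0, 1800.0, 2500.0, 4000.0, 9000.0, 12000.0, 35000.0])

# With no bins argument, np.histogram uses 10 equal-width bins
# spanning the minimum to the maximum of the data.
hist, edges = np.histogram(values)

print(len(hist), len(edges))
left, right = edges[:-1], edges[1:]  # one (left, right) pair per bin
```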

f= pl.figure(figsize=(12,5))

ax=f.add_subplot(121)
sns.distplot(data[(data.smoker == 1)]["charges"],color='c',ax=ax)
ax.set_title('Distribution of charges for smokers')

ax=f.add_subplot(122)
sns.distplot(data[(data.smoker == 0)]['charges'],color='b',ax=ax)
ax.set_title('Distribution of charges for non-smokers')

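Comparing the distribution of charges for smokers and non-smokers,
smokers show a two-humped distribution and their charges tend to be higher.
For non-smokers, most of the data falls at around 10000 or below.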
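The visual comparison can be backed with a few numbers per group. A minimal sketch, using a made-up toy frame in place of data (with the real frame, the same groupby call applies to the encoded smoker column):

```python
import pandas as pd

# Toy stand-in for the real data; smoker is already encoded as 0/1.
toy = pd.DataFrame({
    "smoker": [0, 0, 0, 1, 1, 1],
    "charges": [1700.0, 4400.0, 3900.0, 16900.0, 39000.0, 21000.0],
})

# Count, median and mean of charges for each smoking group.
summary = toy.groupby("smoker")["charges"].agg(["count", "median", "mean"])
print(summary)
```

The median is included because a skewed or two-humped distribution can pull the mean away from the typical value.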

sns.catplot(x="smoker", kind="count",hue = 'sex', palette="pink", data=data)

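After encoding, female corresponds to 0 and male to 1 (LabelEncoder numbers the labels in sorted order).
Looking at the smokers, men appear more likely to smoke.
Given that, total charges for men will probably come out higher than for women.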
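Which integer stands for which label is easy to get backwards, so it is worth checking against the encoder itself. A minimal sketch, fitted on the two sex labels only:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["male", "female"])

# classes_ holds the labels in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)  # {'female': 0, 'male': 1}
```

The same check works for smoker and region: print le.classes_ right after each fit.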

sns.catplot(x="sex", y="charges", hue="smoker",
            kind="violin", data=data, palette = 'magma')

Let's look at the charges and the density of people with a violin plot.
This visualization is easier to read, isn't it?

pl.figure(figsize=(12,5))
pl.title("Box plot for charges of women")
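# LabelEncoder codes female as 0 and male as 1 (sorted label order)
sns.boxplot(y="smoker", x="charges", data =  data[(data.sex == 0)] , orient="h", palette = 'magma')

Box plots of charges by smoking status, starting with women.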

pl.figure(figsize=(12,5))
pl.title("Box plot for charges of men")
# LabelEncoder codes female as 0 and male as 1 (sorted label order)
sns.boxplot(y="smoker", x="charges", data =  data[(data.sex == 1)] , orient="h", palette = 'rainbow')

And the same for men.

Now let's look at the relationship with age.

pl.figure(figsize=(12,5))
pl.title("Distribution of age")
ax = sns.distplot(data["age"], color = 'g')

The youngest patient is 18.
The oldest is 64.
Are there any smokers among the 18-year-olds??

sns.catplot(x="smoker", kind="count",hue = 'sex', palette="rainbow", data=data[(data.age == 18)])
pl.title("The number of smokers and non-smokers (18 years old)")

Oh my.
There are smokers even at 18.
Do charges run high even at 18?

pl.figure(figsize=(12,5))
pl.title("Box plot for charges 18 years old smokers")
sns.boxplot(y="smoker", x="charges", data = data[(data.age == 18)] , orient="h", palette = 'pink')

The distribution of charges at age 18.
Even at 18, smokers' charges appear to be higher.

g = sns.jointplot(x="age", y="charges", data = data[(data.smoker == 0)],kind="kde", color="m")
g.plot_joint(pl.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$X$", "$Y$")
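# set the title on the joint plot's own figure (ax still points at an earlier plot)
g.fig.suptitle('Distribution of charges and age for non-smokers')

Distribution of charges for non-smokers
(simply put, the younger the cheaper)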

g = sns.jointplot(x="age", y="charges", data = data[(data.smoker == 1)],kind="kde", color="c")
g.plot_joint(pl.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$X$", "$Y$")
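# set the title on the joint plot's own figure (ax still points at an earlier plot)
g.fig.suptitle('Distribution of charges and age for smokers')

Distribution of charges for smokers
Plotted against age, the charges form two humps.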

# non - smokers
p = figure(plot_width=500, plot_height=450)
p.circle(x=data[(data.smoker == 0)].age,y=data[(data.smoker == 0)].charges, size=7, line_color="navy", fill_color="pink", fill_alpha=0.9)

show(p)

# smokers
p = figure(plot_width=500, plot_height=450)
p.circle(x=data[(data.smoker == 1)].age,y=data[(data.smoker == 1)].charges, size=7, line_color="navy", fill_color="red", fill_alpha=0.9)
show(p)

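sns.lmplot(x="age", y="charges", hue="smoker", data=data, palette = 'inferno_r', height = 7)
pl.title('Smokers and non-smokers')

Overlaid in one plot, it looks like this.

A regression line is fitted separately for each of the two groups, smokers and non-smokers.

For non-smokers, the line rises steadily with age.
That's the way of nature. Health first!

For smokers the spread is very large, and if asked whether charges simply rise with age, much of the data cannot be explained that way.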
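What the per-group regression amounts to can be sketched with np.polyfit: split by smoker and fit a straight line to each part. The toy values below are made up and perfectly linear, so they only illustrate the mechanics, not the real data.

```python
import numpy as np
import pandas as pd

# Made-up toy data: two groups with different slopes and intercepts.
toy = pd.DataFrame({
    "age":     [20, 30, 40, 50, 20, 30, 40, 50],
    "smoker":  [0, 0, 0, 0, 1, 1, 1, 1],
    "charges": [2000.0, 4000.0, 6000.0, 8000.0,
                20000.0, 23000.0, 26000.0, 29000.0],
})

# Fit charges = slope * age + intercept separately for each group.
fits = {}
for flag, group in toy.groupby("smoker"):
    slope, intercept = np.polyfit(group["age"], group["charges"], deg=1)
    fits[flag] = (slope, intercept)
    print(f"smoker={flag}: charges ~ {slope:.1f} * age + {intercept:.1f}")
```

On the real data, comparing the two slopes and intercepts gives a number to go with the picture.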

Let's compare with BMI.
Is there a relationship between charges and bmi?

pl.figure(figsize=(12,5))
pl.title("Distribution of bmi")
ax = sns.distplot(data["bmi"], color = 'm')

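The BMI distribution is a clean bell shape with a mean of about 30.
Let's ask Google what a value of 30 means.
Apparently 30 is where obesity begins. Let's split at a BMI of 30 and look at the distribution of charges for 30 and above versus below 30.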

pl.figure(figsize=(12,5))
pl.title("Distribution of charges for patients with BMI greater than 30")
ax = sns.distplot(data[(data.bmi >= 30)]['charges'], color = 'm')

Distribution for BMI of 30 and above

pl.figure(figsize=(12,5))
pl.title("Distribution of charges for patients with BMI less than 30")
ax = sns.distplot(data[(data.bmi < 30)]['charges'], color = 'b')

Distribution for BMI below 30

Patients with a BMI of 30 or more tend to pay higher amounts.
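The same split can be summarized numerically. A minimal sketch with made-up toy values; note that a BMI of exactly 30.0 lands in the upper group, matching the >= 30 and < 30 masks used in the plots above.

```python
import numpy as np
import pandas as pd

# Made-up toy values standing in for the bmi and charges columns.
toy = pd.DataFrame({
    "bmi":     [22.7, 27.9, 28.9, 30.0, 33.0, 33.8],
    "charges": [2000.0, 3000.0, 4000.0, 5000.0, 9000.0, 10000.0],
})

# Label each row by which side of the threshold it falls on.
toy["group"] = np.where(toy["bmi"] >= 30, "bmi>=30", "bmi<30")
by_group = toy.groupby("group")["charges"].agg(["count", "mean", "median"])
print(by_group)
```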

g = sns.jointplot(x="bmi", y="charges", data = data,kind="kde", color="r")
g.plot_joint(pl.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$X$", "$Y$")
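# set the title on the joint plot's own figure (ax still points at an earlier plot)
g.fig.suptitle('Distribution of bmi and charges')

Here BMI is shown together with the charges.

And what happens when smoking is combined with bmi?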

pl.figure(figsize=(10,6))
ax = sns.scatterplot(x='bmi',y='charges',data=data,palette='magma',hue='smoker')
ax.set_title('Scatter plot of charges and bmi')

sns.lmplot(x="bmi", y="charges", hue="smoker", data=data, palette = 'magma', height = 8)

For smokers, the line relating bmi to charges has a steeper slope.

Does having children show any pattern?

sns.catplot(x="children", kind="count", palette="ch:.25", data=data, height = 6)

Having no children is the most common case.
Do people smoke when they have children to look after?

sns.catplot(x="smoker", kind="count", palette="rainbow",hue = "sex",
            data=data[(data.children > 0)], height = 6)
pl.title('Smokers and non-smokers who have children')

There are smokers among the parents.
But non-smoking parents are clearly far more numerous.

Regression on the variables

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.ensemble import RandomForestRegressor

x = data.drop(['charges'], axis = 1)
y = data.charges

x_train,x_test,y_train,y_test = train_test_split(x,y, random_state = 0)
lr = LinearRegression().fit(x_train,y_train)

y_train_pred = lr.predict(x_train)
y_test_pred = lr.predict(x_test)

print(lr.score(x_test,y_test))

0.7962732059725786

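I tried a linear regression.
score compares the predictions with the actual values on the test set and returns the coefficient of determination R², which came out at about 0.80 (0.796).
Data is not always this clean, so treat the number as a rough guide.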
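For a regressor, score is the coefficient of determination R² = 1 - SS_res / SS_tot. A small sketch with made-up numbers, computing it by hand and comparing against sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, and 0 means no better than always predicting the mean.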

X = data.drop(['charges','region'], axis = 1)
Y = data.charges

quad = PolynomialFeatures (degree = 2)
x_quad = quad.fit_transform(X)

X_train,X_test,Y_train,Y_test = train_test_split(x_quad,Y, random_state = 0)

plr = LinearRegression().fit(X_train,Y_train)

Y_train_pred = plr.predict(X_train)
Y_test_pred = plr.predict(X_test)

print(plr.score(X_test,Y_test))

0.8849197344147237

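The variables are reduced to keep things simple: region is dropped, and degree-2 polynomial features are added.
With that, R² on the test set went up to 0.88. Both changes are made at once, so the gain cannot be attributed to dropping region alone.

Now let's try a random forest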
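To see what PolynomialFeatures(degree=2) actually adds, here is a small sketch on a single made-up row with two features a and b. The output columns are 1, a, b, a², a·b, b².

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up row with two features: a = 2, b = 3.
X_toy = np.array([[2.0, 3.0]])

quad = PolynomialFeatures(degree=2)
expanded = quad.fit_transform(X_toy)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]

# With 5 input features (as after dropping charges and region),
# degree 2 yields 1 + 5 + 15 = 21 columns.
n_out = PolynomialFeatures(degree=2).fit(np.zeros((1, 5))).n_output_features_
print(n_out)  # 21
```

The cross terms are what let a linear model pick up interactions, for example between smoker and bmi.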

# criterion is left at its default, which is squared error (MSE)
forest = RandomForestRegressor(n_estimators = 100,
                              random_state = 1,
                              n_jobs = -1)
forest.fit(x_train,y_train)
forest_train_pred = forest.predict(x_train)
forest_test_pred = forest.predict(x_test)

print('MSE train data: %.3f, MSE test data: %.3f' % (
mean_squared_error(y_train,forest_train_pred),
mean_squared_error(y_test,forest_test_pred)))
print('R2 train data: %.3f, R2 test data: %.3f' % (
r2_score(y_train,forest_train_pred),
r2_score(y_test,forest_test_pred)))

MSE train data: 3729086.094, MSE test data: 19933823.142
R2 train data: 0.974, R2 test data: 0.873

When it comes to random forests I feel like I mostly see the classifier,
but this time the regressor is used.

Check the mean squared error (MSE) and the R² value.
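MSE is in squared units, so it is hard to read directly. Taking the square root gives RMSE, which is in the same units as charges. A quick sketch using the two MSE values printed above:

```python
import math

# MSE values copied from the printed output above.
mse_train = 3729086.094
mse_test = 19933823.142

# RMSE is the square root of MSE, in the same units as charges.
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"RMSE train: {rmse_train:.1f}, RMSE test: {rmse_test:.1f}")
```

That works out to roughly 1,900 on the training data and roughly 4,500 on the test data.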

pl.figure(figsize=(10,6))

pl.scatter(forest_train_pred,forest_train_pred - y_train,
          c = 'black', marker = 'o', s = 35, alpha = 0.5,
          label = 'Train data')
pl.scatter(forest_test_pred,forest_test_pred - y_test,
          c = 'c', marker = 'o', s = 35, alpha = 0.7,
          label = 'Test data')
pl.xlabel('Predicted values')
pl.ylabel('Residuals')
pl.legend(loc = 'upper left')
pl.hlines(y = 0, xmin = 0, xmax = 60000, lw = 2, color = 'red')
pl.show()

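The x-axis shows the predicted values.
The y-axis shows the difference between the predicted and the actual values.
Both train and test are plotted.

That's all

seaborn and bokeh make plots much easier to read, and it looks like they can produce very nice material.
Whether to spend the time on that is another matter, but a readable plot deepens understanding more than an unreadable one, so I want to master them.

Personally, I was glad to see a usage example of LabelEncoder.

It was also instructive to see how regression with a random forest runs in practice.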
