
<Subject> Machine Learning, Chapter 3: Logistic Regression Model

Last updated at 2019-12-14, posted at 2019-12-12

<Subject> Machine Learning

Table of contents
Chapter 1: Linear Regression Model
[Chapter 2: Nonlinear Regression Model](https://qiita.com/matsukura04583/items/baa3f2269537036abc57)
[Chapter 3: Logistic Regression Model](https://qiita.com/matsukura04583/items/0fb73183e4a7a6f06aa5)
[Chapter 4: Principal Component Analysis](https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c)
[Chapter 5: Algorithm 1 (k-nearest neighbors (kNN))](https://qiita.com/matsukura04583/items/543719b44159322221ed)
[Chapter 6: Algorithm 2 (k-means)](https://qiita.com/matsukura04583/items/050c98c7bb1c9e91be71)
[Chapter 7: Support Vector Machine](https://qiita.com/matsukura04583/items/6b718642bcbf97ae2ca8)

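Chapter 3: Logistic Regression Model

Explanation of the logistic regression model

Classification problems
- A problem of assigning a class to a given (numeric) input
- Despite the word "regression" in its name, logistic regression addresses classification
Data used in classification
- Input (each element is called an explanatory variable or feature): an m-dimensional vector (a scalar when m=1)
- Output (target variable): a value of 0 or 1
- Examples include the Titanic and Iris datasets
Explanatory variables

   x=(x_1,x_2,\cdots,x_m)^T \in \mathbb{R}^m

Target variable

     y \in \left\{0,1\right\}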

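Logistic (linear) regression model
- A supervised machine learning model for solving classification problems (it learns from labeled training data)
- The linear combination of the input and the m-dimensional parameters is fed into the sigmoid function
- The output is the probability that y=1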
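
Written as a formula, the model described in this list can be sketched as follows. The symbols w (weight vector), w_0 (bias) and \sigma (sigmoid function) are notation assumed here for illustration; they do not appear in the original text.

```latex
P(Y=1 \mid x) = \sigma\left(w^T x + w_0\right), \qquad \sigma(z) = \frac{1}{1+\exp(-z)}
```
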
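Sigmoid function
- The input is any real number; the output always lies between 0 and 1
- It represents the probability (of being assigned to class 1)
- It is monotonically increasing
The shape of the sigmoid function changes with its parameters
- Increasing the coefficient a on the input makes the curve steeper around x=0
- As a becomes very large, the curve approaches the unit step function (f(x)=0 for x<0 and f(x)=1 for x>0)
- Changing the bias shifts the position of the step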
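
The effect of these parameters can be checked numerically. The sketch below assumes the parameterization f(x) = 1 / (1 + exp(-(a*x + b))), which the text does not spell out; the function names `sigmoid` and `scaled_sigmoid` and the parameter name `b` for the bias are illustrative choices.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid for a single real number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def scaled_sigmoid(x, a=1.0, b=0.0):
    """Sigmoid of a*x + b (a: gain, b: bias; assumed parameterization)."""
    return sigmoid(a * x + b)


def slope_at(x, a=1.0, b=0.0, h=1e-5):
    """Central finite-difference estimate of the slope at x."""
    return (scaled_sigmoid(x + h, a, b) - scaled_sigmoid(x - h, a, b)) / (2 * h)


# A larger gain makes the curve steeper around x = 0
for a in (1.0, 2.0, 10.0):
    print("a =", a, "slope at 0 =", slope_at(0.0, a=a))

# With a very large gain the curve behaves almost like a unit step
print(scaled_sigmoid(-0.1, a=100.0), scaled_sigmoid(0.1, a=100.0))

# A bias moves the point where the output crosses 0.5 to x = -b / a
print(scaled_sigmoid(2.0, a=1.0, b=-2.0))
```

Under this parameterization the slope at the crossing point is a/4, so doubling a doubles the steepness.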
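Properties of the sigmoid function
- The derivative of the sigmoid function can be expressed in terms of the sigmoid function itself
- Using this fact makes differentiating the likelihood function easier
- The label Y is predicted as 1 if the probability is 0.5 or higher, and 0 otherwise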
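
The first and last properties can be illustrated with a short sketch. It uses the standard identity sigma'(z) = sigma(z) * (1 - sigma(z)) and compares it with a finite-difference estimate; the helper names `sigmoid_grad` and `predict_label` are made up for this example.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid for a single real number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sigmoid_grad(z):
    """Derivative of the sigmoid, written using the sigmoid itself."""
    s = sigmoid(z)
    return s * (1.0 - s)


def predict_label(prob, threshold=0.5):
    """Return 1 if the probability is at or above the threshold, else 0."""
    return 1 if prob >= threshold else 0


# Compare the closed form with a central finite difference at a few points
h = 1e-6
for z in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    print(z, sigmoid_grad(z), numeric)

# Thresholding a probability at 0.5 gives the predicted class
print(predict_label(0.19), predict_label(0.5), predict_label(0.81))
```
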

(Exercise 3) Predict the survival probability of a 30-year-old male using the Titanic dataset

Mounting Google Drive

from google.colab import drive
drive.mount('/content/drive')

0. Displaying the data

# from module_name import class_name (or a function or variable name)
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Magic command to display matplotlib plots inline (no need to call plt.show())
%matplotlib inline

The code below uses a study_ai_ml folder placed directly under My Drive in Google Drive.

# Load the Titanic data CSV file
titanic_df = pd.read_csv('/content/drive/My Drive/study_ai_ml/data/titanic_train.csv')

# Display the first rows to check the dataset
titanic_df.head(5)

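I looked up what each variable means.

PassengerId: passenger ID
Survived: survival outcome (1: survived, 0: died)
Pclass: passenger class (1 is the highest class)
Name: passenger name
Sex: sex
Age: age
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
Ticket: ticket number
Fare: fare paid
Cabin: cabin number
Embarked: port of embarkation (one of three: Cherbourg, Queenstown, Southampton)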

1. Logistic regression

Removing unnecessary data and filling in missing values

# Drop the columns considered unnecessary for prediction
titanic_df.drop(['PassengerId','Pclass', 'Name', 'SibSp','Parch','Ticket','Fare','Cabin','Embarked'], axis=1, inplace=True)

# Display the data after dropping some columns
titanic_df.head()

# Display rows that contain null values
titanic_df[titanic_df.isnull().any(axis=1)].head(10)

# Fill nulls in the Age column with the mean

titanic_df['AgeFill'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())

# Display rows containing nulls again (the nulls in Age are filled in AgeFill)
titanic_df[titanic_df.isnull().any(axis=1)]
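
To make the imputation step concrete, here is a plain-Python sketch of what filling missing ages with the mean does, next to the median as an alternative. The ages are made up for illustration and `None` stands in for a missing value; the pandas `mean()` used in the article likewise skips missing values by default.

```python
from statistics import mean, median

# Made-up ages; None marks a missing value
ages = [22.0, 38.0, None, 35.0, None, 54.0, 2.0]

# Statistics are computed from the observed values only
observed = [a for a in ages if a is not None]
age_mean = mean(observed)      # value used by mean imputation
age_median = median(observed)  # value used by median imputation

# Replace each missing entry with the chosen statistic
filled_with_mean = [age_mean if a is None else a for a in ages]
filled_with_median = [age_median if a is None else a for a in ages]

print("mean:", age_mean, "median:", age_median)
print(filled_with_mean)
print(filled_with_median)
```

The two statistics can differ noticeably when the age distribution is skewed, so the comment and the code should name the same one.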

# titanic_df.dtypes
# titanic_df.head()

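1. Logistic regression

Implementation (predicting survival from gender and age)

# The missing values have been filled in AgeFill, so Age could be dropped
# titanic_df = titanic_df.drop(['Age'], axis=1)
# Set Gender to 0 for female and 1 for male
titanic_df['Gender'] = titanic_df['Sex'].map({'female': 0, 'male': 1}).astype(int)
titanic_df.head()

Plot the distribution of survival by gender and age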

# Fix the random seed so that the jitter below is reproducible
np.random.seed(0)

xmin, xmax = -5, 85
ymin, ymax = -0.5, 1.3

# Row indices of passengers who died (Survived == 0) and who survived (Survived == 1)
index_notsurvived = titanic_df[titanic_df["Survived"]==0].index
index_survived = titanic_df[titanic_df["Survived"]==1].index

fig, ax = plt.subplots()
# Add a small vertical jitter to Gender so that overlapping points stay visible
sc = ax.scatter(titanic_df.loc[index_notsurvived, 'AgeFill'],
                titanic_df.loc[index_notsurvived, 'Gender']+(np.random.rand(len(index_notsurvived))-0.5)*0.1,
                color='r', label='Not Survived', alpha=0.3)
sc = ax.scatter(titanic_df.loc[index_survived, 'AgeFill'],
                titanic_df.loc[index_survived, 'Gender']+(np.random.rand(len(index_survived))-0.5)*0.1,
                color='b', label='Survived', alpha=0.3)
ax.set_xlabel('AgeFill')
ax.set_ylabel('Gender')
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
ax.legend(bbox_to_anchor=(1.4, 1.03))

Male is 1 and female is 0; red marks deaths and blue marks survivors, so the plot shows that comparatively more women survived.

# Build an array containing only age and gender
data2 = titanic_df.loc[:, ["AgeFill", "Gender"]].values
data2

Result

array([[22.        ,  1.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       ...,
       [29.69911765,  0.        ],
       [26.        ,  1.        ],
       [32.        ,  1.        ]])

Plot survival by age

split_data = []
for survived in [0,1]:
    split_data.append(titanic_df[titanic_df.Survived==survived])

temp = [i["AgeFill"].dropna() for i in split_data ]
plt.hist(temp, histtype="barstacked", bins=16)

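Because the missing ages were filled with the mean, the middle of the histogram is inflated.
Plot it again using only the rows where the age is not missing.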

temp = [i["Age"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)

Check survival by gender with a stacked histogram as well

temp = [i["Gender"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)

That looks about right.
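
The stacked histogram shows counts; the survival rate per gender can also be computed directly. The sketch below uses made-up (gender, survived) pairs rather than the Titanic data, with the same encoding as above (0 = female, 1 = male).

```python
from collections import defaultdict

# Made-up (gender, survived) pairs: gender 0 = female, 1 = male
rows = [(1, 0), (0, 1), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (0, 0)]

# gender -> [number of survivors, number of passengers]
counts = defaultdict(lambda: [0, 0])
for gender, survived in rows:
    counts[gender][0] += survived
    counts[gender][1] += 1

# Survival rate = survivors / passengers for each gender
survival_rate = {g: s / n for g, (s, n) in counts.items()}
print(survival_rate)
```
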

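1. Logistic regression

Implementation (predicting survival from two variables)

## Build an array containing only the survival flag
# Note: this is a column vector of shape (n_samples, 1), which triggers the DataConversionWarning in the result; .ravel() would give the expected 1-D shape
label2 =  titanic_df.loc[:,["Survived"]].values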
from sklearn.linear_model import LogisticRegression
model2 = LogisticRegression()
model2.fit(data2, label2)

Result

/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
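
Conceptually, fitting a logistic regression means choosing the weights that maximize the likelihood of the training labels. The sketch below is a simplified stand-in, not what scikit-learn runs: it uses plain batch gradient ascent without regularization, whereas the output above shows an L2 penalty with C=1.0 and a library solver. The toy data, the rescaling of age to age/100, and the names `fit_logistic` and `predict_proba` are all assumptions made for this example.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid for a single real number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Unregularized logistic regression by batch gradient ascent."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = yi - p  # gradient of the log-likelihood w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj + lr * gj / n for wj, gj in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Return [P(y=0), P(y=1)] for one sample."""
    p1 = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return [1.0 - p1, p1]


# Toy data (made up): [age / 100, gender] with gender 0 = female, 1 = male
X = [[0.22, 1], [0.38, 0], [0.26, 0], [0.35, 0],
     [0.35, 1], [0.54, 1], [0.02, 1], [0.27, 0]]
y = [0, 1, 1, 1, 0, 0, 0, 1]

w, b = fit_logistic(X, y)
print("weights:", w, "bias:", b)
print("30-year-old male:", predict_proba([0.30, 1], w, b))
print("30-year-old female:", predict_proba([0.30, 0], w, b))
```

The update uses the error term (y - p) directly, which is where the self-referential derivative of the sigmoid simplifies the likelihood gradient.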

Predict for a 30-year-old male

model2.predict([[30,1]])

Result

array([0])

A prediction of 0 (death) is returned.

Check the probability behind that decision

model2.predict_proba([[30,1]])

Result

array([[0.80664059, 0.19335941]])

This shows a probability of about 80% for death and about 20% for survival.
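
The two numbers returned by predict_proba can be related back to the sigmoid. The sketch below takes the reported probabilities as given, picks the more probable class, and recovers the log-odds, that is, the value of the linear part that the sigmoid would have received in a binary model. Only the two probabilities come from the article; the variable names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid for a single real number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Probabilities reported for a 30-year-old male: [P(death), P(survival)]
proba = [0.80664059, 0.19335941]

# The predicted class is the index of the larger probability
predicted_class = max(range(len(proba)), key=lambda k: proba[k])

# Log-odds of survival; negative because survival is the less likely outcome
log_odds = math.log(proba[1] / proba[0])

# Passing the log-odds through the sigmoid recovers P(survival)
recovered = sigmoid(log_odds)

print("predicted class:", predicted_class)
print("log-odds:", log_odds)
print("recovered P(survival):", recovered)
```
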
