
Basic Steps of Machine Learning: Reviewing Coursera's Machine Learning After Completing the Course (Part 1)

Last updated at 2019-10-22, posted at 2019-10-22

Notation used in this article

m: number of examples
n: number of features
y: Vector of Actual Values
X: Matrix of Features
θ: Vector of Parameters

$$
y =
\begin{pmatrix}
y^{(1)}\\
y^{(2)}\\
\vdots\\
y^{(m)}
\end{pmatrix}
$$

$$
X =
\begin{pmatrix}
x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_n \\
x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_n \\
\vdots & \vdots & \ddots & \vdots \\
x^{(m)}_1 & x^{(m)}_2 & \cdots & x^{(m)}_n
\end{pmatrix}
$$

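$$
\theta =
\begin{pmatrix}
\theta_0\\ \theta_1\\ \theta_2\\ \vdots\\ \theta_n
\end{pmatrix}
$$

These are the definitions used throughout. Here $\theta_0$ is the intercept term, paired with the constant feature $x_0 = 1$ that appears in the hypothesis in section 1-1, so when the hypothesis is computed X is treated as having a leading column of ones.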

From here on, I describe the overall flow of machine learning, using linear regression as the example.

1. Build a model using the training set

1-1 Start by creating a model [the hypothesis hθ(x)]

$ h_\theta\left( x\right) = \theta^{T}x = \theta_{0}x_{0}+\theta_{1}x_{1}+\ldots +\theta_{n}x_{n} $
(Note: $ x_{0}=1 $)

Evaluated on every example in the training set at once:

$$
h_\theta\left( X\right) =
\begin{pmatrix}
h_\theta\left( x^{(1)}\right)\\
h_\theta\left( x^{(2)}\right)\\
\vdots\\
h_\theta\left( x^{(m)}\right)
\end{pmatrix}
$$
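As a rough illustration of the hypothesis, here is a minimal NumPy sketch. The helper names `add_bias_column` and `hypothesis` and the toy numbers are my own assumptions for illustration, not something from the course. It prepends the constant feature $x_0 = 1$ to X and computes all m predictions with a single matrix-vector product.

```python
import numpy as np

def add_bias_column(X):
    """Prepend a column of ones (x_0 = 1) to an (m, n) feature matrix."""
    m = X.shape[0]
    return np.hstack([np.ones((m, 1)), X])

def hypothesis(X_b, theta):
    """Return h_theta(x^(i)) for every example as a vector of length m."""
    return X_b @ theta

# Hypothetical toy data: m = 3 examples, n = 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
theta = np.array([0.5, 1.0, -1.0])  # theta_0, theta_1, theta_2

X_b = add_bias_column(X)             # shape (3, 3): [1, x_1, x_2] per row
predictions = hypothesis(X_b, theta)
print(predictions)
```

Each entry of `predictions` is $\theta_0 + \theta_1 x_1 + \theta_2 x_2$ for one row of X.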

1-2 Define the cost function J(θ) [a function that computes the error between the hypothesis's values and the actual values]

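J(θ) = [the mean squared error between the values computed by the hypothesis hθ(x) and the actual values y (in the case of linear regression); the extra factor of 1/2 only simplifies the derivative]
$J\left( \theta \right) =\dfrac {1}{2m}\sum ^{m}_{i=1}\left( h_\theta \left( x^{(i)}\right) -y^{(i)}\right) ^{2}$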
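The cost function can be written in a few lines of NumPy. This is only a sketch: the function name `compute_cost` and the toy data are assumptions for illustration, and `X_b` is assumed to already contain the leading column of ones for $x_0 = 1$.

```python
import numpy as np

def compute_cost(X_b, y, theta):
    """J(theta) = 1/(2m) * sum((h_theta(x^(i)) - y^(i))^2)."""
    m = y.shape[0]
    residuals = X_b @ theta - y      # h_theta(x^(i)) - y^(i) for every i
    return (residuals @ residuals) / (2 * m)

# Hypothetical toy data lying exactly on the line y = x
X_b = np.array([[1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

cost_perfect = compute_cost(X_b, y, np.array([0.0, 1.0]))  # theta that matches the data
cost_zero = compute_cost(X_b, y, np.array([0.0, 0.0]))     # theta that predicts 0 everywhere
print(cost_perfect, cost_zero)
```

A θ that reproduces the data gives a cost of 0, while θ = 0 gives (1 + 4 + 9) / 6.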

1-3 Find the θ that minimizes the cost function J(θ) [the error]

Example: Batch Gradient Descent
Repeat {
$\theta _{j}:=\theta _{j}-\alpha \dfrac {\partial }{\partial \theta _{j}}J\left( \theta \right)$
(simultaneously update for every j = 0, ..., n)
}
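For the linear regression cost function from section 1-2, the partial derivative in the update rule can be worked out explicitly. This is the standard result, written in this article's notation:

$$
\dfrac{\partial}{\partial \theta_j} J(\theta)
= \dfrac{1}{m}\sum_{i=1}^{m}\left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
$$

so each step becomes

$$
\theta_j := \theta_j - \alpha \dfrac{1}{m}\sum_{i=1}^{m}\left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
$$

The factor 2 produced by differentiating the square cancels the 1/2 in J(θ), which is why the cost is defined with 1/(2m).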

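By repeatedly updating θ until J(θ) converges (that is, until $\dfrac {\partial }{\partial \theta _{j}}J\left( \theta \right) = 0$ for every j, where J(θ) takes a local minimum), we find the θ that approximately minimizes the cost function J(θ) [the error]. For linear regression J(θ) is convex, so this local minimum is also the global minimum.
→ This makes hθ(x) more accurate.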
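The loop can be sketched as follows. The learning rate `alpha`, the iteration cap `max_iters`, the stopping tolerance `tol`, and the toy data are all assumptions chosen for illustration; in practice they need tuning. Convergence is detected here by the change in J(θ) between iterations rather than by the gradient reaching exactly zero.

```python
import numpy as np

def compute_cost(X_b, y, theta):
    """J(theta) = 1/(2m) * sum of squared residuals."""
    residuals = X_b @ theta - y
    return (residuals @ residuals) / (2 * y.shape[0])

def batch_gradient_descent(X_b, y, alpha=0.1, max_iters=10000, tol=1e-12):
    """Minimize J(theta) using all m examples in every update."""
    m = y.shape[0]
    theta = np.zeros(X_b.shape[1])
    history = [compute_cost(X_b, y, theta)]
    for _ in range(max_iters):
        # Gradient for all j at once: (1/m) * X^T (X theta - y)
        gradient = X_b.T @ (X_b @ theta - y) / m
        # Vectorized assignment updates every theta_j simultaneously
        theta = theta - alpha * gradient
        history.append(compute_cost(X_b, y, theta))
        if abs(history[-2] - history[-1]) < tol:
            break  # J(theta) has stopped changing
    return theta, history

# Hypothetical noise-free data generated from y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X_b = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta, history = batch_gradient_descent(X_b, y)
print(theta, history[0], history[-1])
```

Recording `history` makes it easy to plot J(θ) against the iteration count and check that it decreases; if it grows instead, `alpha` is probably too large.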

2 Validate the model's accuracy using a new dataset [the cross-validation set]

If the model is too simple...

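Because the model cannot capture the characteristics (distribution) of the data, underfitting (high bias) occurs.
In this case the problem is that the model is too simple to capture the structure of the data, so adding more data does not solve it unless the structure of the model itself is changed.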
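The point that more data alone does not cure underfitting can be checked with a small experiment. In this sketch a straight line is fitted to hypothetical data generated from $y = x^2$, and the training cost is compared for different numbers of examples. For brevity the fit uses NumPy's direct least-squares solver `np.linalg.lstsq` instead of gradient descent, and the function name `linear_fit_training_cost` is my own.

```python
import numpy as np

def linear_fit_training_cost(m):
    """Fit h(x) = theta_0 + theta_1 * x to m points of y = x^2; return J(theta)."""
    x = np.linspace(-2.0, 2.0, m)
    y = x ** 2                                   # hypothetical curved data
    X_b = np.column_stack([np.ones(m), x])
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    residuals = X_b @ theta - y
    return (residuals @ residuals) / (2 * m)

costs = {m: linear_fit_training_cost(m) for m in (9, 101, 1001)}
for m, cost in costs.items():
    print(m, cost)
```

Even with over a hundred times more examples, the cost stays far from zero, because a straight line cannot represent the curve.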

Solution: make the overly simple model more complex

・Try using more features.
・Try creating polynomial features.
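As a sketch of the polynomial-features idea, the example below fits the same hypothetical data ($y = x^2$) twice: once with the features $[1, x]$ and once with $[1, x, x^2]$. The helper names are assumptions for illustration, and the fit again uses `np.linalg.lstsq` rather than gradient descent to keep the example short.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Columns x^0, x^1, ..., x^degree (the x^0 column is the bias x_0 = 1)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(X_b, y):
    """Return the theta that minimizes the squared error."""
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return theta

def compute_cost(X_b, y, theta):
    residuals = X_b @ theta - y
    return (residuals @ residuals) / (2 * y.shape[0])

# Hypothetical data that a straight line cannot represent
x = np.linspace(-2.0, 2.0, 9)
y = x ** 2

X_lin = polynomial_design_matrix(x, 1)    # features [1, x]
X_quad = polynomial_design_matrix(x, 2)   # features [1, x, x^2]

cost_lin = compute_cost(X_lin, y, fit_least_squares(X_lin, y))
cost_quad = compute_cost(X_quad, y, fit_least_squares(X_quad, y))
print(cost_lin, cost_quad)
```

The model is still linear in θ; only the feature set changed, and that is enough to remove the bias on this data.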

If the model is too complex...

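The model fits the data in the training set very well. Put another way, however, it fits too well, and the model then fails to fit new datasets. In other words, overfitting (high variance) occurs because the model is so complex that it captures the characteristics of the training set too closely.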

Solution: reduce the model's complexity, or get more data.

・Use a smaller set of features.
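Overfitting shows up as a gap between the training cost and the cross-validation cost. The sketch below uses hypothetical data: the true relation is $y = x$, the eight training targets carry an alternating ±0.5 disturbance standing in for noise, and the cross-validation points sit halfway between the training points. A model with few features (degree 1) is compared with one that has many polynomial features (degree 7). The helper names are my own, and the fit uses `np.linalg.lstsq` rather than gradient descent.

```python
import numpy as np

def design_matrix(x, degree):
    """Columns x^0 ... x^degree, so more columns means a more complex model."""
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(X_b, y):
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return theta

def compute_cost(X_b, y, theta):
    residuals = X_b @ theta - y
    return (residuals @ residuals) / (2 * y.shape[0])

# Hypothetical training set: y = x plus an alternating +/-0.5 disturbance
x_train = np.linspace(-1.0, 1.0, 8)
y_train = x_train + 0.5 * (-1.0) ** np.arange(8)

# Hypothetical cross-validation set: midpoints, following the true relation y = x
x_cv = (x_train[:-1] + x_train[1:]) / 2
y_cv = x_cv

results = {}
for degree in (1, 7):
    theta = fit_least_squares(design_matrix(x_train, degree), y_train)
    train_cost = compute_cost(design_matrix(x_train, degree), y_train, theta)
    cv_cost = compute_cost(design_matrix(x_cv, degree), y_cv, theta)
    results[degree] = (train_cost, cv_cost)
    print(degree, train_cost, cv_cost)
```

The degree-7 model passes through every training point, so its training cost is essentially zero, yet its cross-validation cost is much higher than that of the degree-1 model. Dropping the extra features trades a little training error for much better behavior on new data.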
