# Supervised Learning
## Classification Problem
1. Hypothesis Function:
- "Sigmoid Function" or "Logistic Function":
$g(z) = \frac{1}{1+e^{-z}} $
h_\theta(x) = g(\theta^Tx) =
\frac{1}{1 + e^{-\theta^Tx}}
-
Interpretation:
the probability that our output is 1
$h_\theta(x) = P(y=1|x;\theta) = 1 - P(y=0|x;\theta) $ -
- Decision Boundary:
- when $h_\theta(x) \geq 0.5 \rightarrow y=1$ and $h_\theta(x) < 0.5 \rightarrow y=0$
- this means $g(z) \geq 0.5 \rightarrow z \geq 0 \rightarrow y=1$, where $z$ is the input to the sigmoid (e.g. $z = \theta^Tx$)
- the decision boundary can be any shape: with polynomial features (e.g. $z = \theta_0 + \theta_1x_1^2 + \theta_2x_2^2$) it can be non-linear, such as a circle
2. Cost Function:
$$J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Bigl[y^{(i)}\log\bigl(h_\theta(x^{(i)})\bigr) + (1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr]$$
Vectorized implementation:
$h=g(X\theta)$
$J(\theta)=\frac{1}{m} \bigl(-y^T\log(h)-(1-y)^T\log(1-h)\bigr)$
3. Gradient Descent:
Repeat {
$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$$
}
Vectorized implementation:
$\theta := \theta - \frac{\alpha}{m}X^T\bigl(g(X\theta)-\vec{y}\bigr)$
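A minimal Octave sketch of the three pieces above; the function name `logisticCost` and the variables `X` (the $m \times (n+1)$ design matrix), `y`, and `alpha` are my own labels, not fixed by these notes:

```matlab
% logisticCost.m -- vectorized cost J(theta) and gradient for logistic regression
function [jVal, gradient] = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % h = g(X*theta), the sigmoid hypothesis
  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % vectorized J(theta)
  gradient = (1/m) * X' * (h - y);                         % vectorized gradient (1/m) X'(g(X*theta) - y)
end
```

Since `gradient` already includes the $\frac{1}{m}$ factor, one gradient descent step is simply `theta = theta - alpha * gradient`, repeated until convergence.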
## Advanced Optimization
- Optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
Code:
First, we need to provide a function that evaluates both $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$:
```matlab
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end
```
Then we use the "fminunc()" optimization function along with "optimset()", which creates an object containing the options we want to send to "fminunc()":
```matlab
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
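For example, with the made-up objective $J(\theta) = (\theta_1-5)^2 + (\theta_2-5)^2$ (chosen only to illustrate the calling convention), the cost function could look like:

```matlab
% Toy objective J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2, minimized at theta = [5; 5]
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);   % dJ/dtheta_1
  gradient(2) = 2 * (theta(2) - 5);   % dJ/dtheta_2
end
```

With this function, the "fminunc()" call above should return an `optTheta` close to `[5; 5]`. If the real cost function also needs the training data, wrap it in an anonymous function, e.g. `fminunc(@(t) logisticCost(t, X, y), initialTheta, options)`.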
## Multiclass Classification: One-vs-all
Multiclass means $y \in \{0, 1, \dots, n\}$. Simply apply the same logistic regression algorithm to each class:
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
To make a prediction on a new $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
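A minimal Octave sketch of the one-vs-all prediction step, assuming a matrix `all_theta` whose $i$-th row holds the trained parameters for class $i$ (the names `all_theta`, `X`, and `predictions` are illustrative, and classes are indexed from 1 here because of Octave's 1-based indexing):

```matlab
% One-vs-all prediction: evaluate every classifier and pick the most probable class.
% all_theta : num_labels x (n+1), one row of logistic regression parameters per class
% X         : m x (n+1) design matrix (first column all ones)
probabilities = 1 ./ (1 + exp(-X * all_theta'));   % m x num_labels matrix of h_theta^(i)(x)
[~, predictions] = max(probabilities, [], 2);      % index of the class with the highest probability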
## The Problem of Overfitting
Too many features or an overly complicated hypothesis function leads to high variance (overfitting):
1. Regularized Cost Function:
regularize all of our theta parameters in a single summation:
$$\min_\theta J(\theta) = \frac{1}{2m}\Biggl[\sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2 + \lambda\sum_{j=1}^n\theta_j^2 \Biggr]$$
The $\lambda$, or lambda, is the regularization parameter.
2. Regularized Gradient Descent:
- Regularized Linear Regression:
Repeat {
$$\theta_0 := \theta_0 - \alpha\frac{1}{m} \sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_0^{(i)}$$
$$\theta_j := \theta_j - \alpha \Biggl[ \Bigl(\frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}\Bigr)+\frac{\lambda}{m}\theta_j \Biggr]\qquad j\in \{1,2,\dots,n\}$$
}
$\theta_j$ can also be represented as:
$\theta_j := \theta_j\bigl(1-\alpha\frac{\lambda}{m}\bigr)-\alpha\frac{1}{m} \sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$
Intuitively, the factor $(1-\alpha\frac{\lambda}{m}) < 1$ shrinks $\theta_j$ a little on every update, while the second term is the usual (unregularized) gradient step.
- Normal Equation:
$\theta = (X^TX+\lambda L)^{-1}X^Ty$
where $L$ is the $(n+1) \times (n+1)$ matrix
$$L = \begin{bmatrix}
0 & & & & \\
 & 1 & & & \\
 & & 1 & & \\
 & & & \ddots & \\
 & & & & 1 \\
\end{bmatrix}$$
- Regularized Logistic Regression:
Repeat {
$$\theta_0 := \theta_0 - \alpha\frac{1}{m} \sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_0^{(i)}$$
$$\theta_j := \theta_j - \alpha \Biggl[ \Bigl(\frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}\Bigr)+\frac{\lambda}{m}\theta_j \Biggr]\qquad j\in \{1,2,\dots,n\}$$
}
The update looks identical to regularized linear regression, but here $h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$.
Cost function (regularized):
$$J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \Bigl[y^{(i)}\log\bigl(h_\theta(x^{(i)})\bigr) + (1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$