# Supervised Learning
## Linear Regression Model
1. Hypothesis Function: (the function we fit to the training set)
(Chinese: 假设函数 / Japanese: 仮定関数)
h_\theta(x) = \theta_0 + \theta_1 x
$\theta_i$'s: Parameters
2. Cost Function: (measures the performance/accuracy of the hypothesis function)
(Chinese: 代价函数 / Japanese: 目的関数)
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \Bigl( h_\theta(x^{(i)}) - y^{(i)} \Bigr)^2
$m$ is the number of training examples (the size of the training set)
$(x^{(i)}, y^{(i)})$ is the $i$-th training example
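A minimal Octave sketch of this cost function, assuming a design matrix `X` whose first column is all ones, a target vector `y`, and a parameter vector `theta` (these names and `computeCost` are my own, not from these notes):

```octave
% Compute the cost J(theta) for linear regression.
% X: m-by-(n+1) design matrix whose first column is all ones,
% y: m-by-1 vector of targets, theta: (n+1)-by-1 parameter vector.
function J = computeCost(X, y, theta)
  m = length(y);                    % number of training examples
  errors = X * theta - y;           % h_theta(x^(i)) - y^(i) for every example
  J = (1 / (2 * m)) * sum(errors .^ 2);
end
```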
3. Gradient Descent: (the algorithm used to find the parameters $\theta$ that minimize the cost function $J$)
(Chinese: 梯度下降法 / Japanese: 最急降下法)
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
$\alpha$ is the learning rate (the step size of descent)
$\frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1)$ is the partial derivative of $J$ with respect to $\theta_j$
- At each iteration, simultaneously update all the parameters $\theta_0, \theta_1, \ldots, \theta_n$ (compute every new value before overwriting any $\theta_j$).
- As gradient descent approaches a local minimum, it automatically takes smaller steps because the gradient shrinks, so there is no need to decrease $\alpha$ over time.
- Note that while gradient descent can be susceptible to local minima in general, the cost function for linear regression is convex: it has a single global minimum and no other local optima. Gradient descent therefore always converges to the global minimum (assuming the learning rate $\alpha$ is not too large).
- This method looks at every example in the entire training set on every step, and is called batch gradient descent.
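A minimal Octave sketch of one iteration of the update rule above, for the two-parameter model $h_\theta(x) = \theta_0 + \theta_1 x$; the variable names (`x`, `y`, `theta0`, `theta1`, `alpha`) are my own:

```octave
% One batch iteration, written out to make the simultaneous update explicit.
% x and y are hypothetical m-by-1 vectors; alpha, theta0, theta1 are scalars.
m      = length(y);
h      = theta0 + theta1 * x;                           % predictions for all m examples
temp0  = theta0 - alpha * (1 / m) * sum(h - y);         % uses dJ/d(theta0)
temp1  = theta1 - alpha * (1 / m) * sum((h - y) .* x);  % uses dJ/d(theta1)
theta0 = temp0;   % overwrite only after both new values are computed
theta1 = temp1;
```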
## Multivariate Linear Regression
1. Hypothesis Function:
\begin{align}
h_\theta(x)
&= \theta_0 x_0 + \theta_1 x_1 + ... +\theta_n x_n \\
&= \theta^T x
\end{align}
This is the vectorized form of the hypothesis function, using the convention $x_0 = 1$.
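A tiny Octave example of the vectorized hypothesis for a single example, with made-up numbers:

```octave
% The bias term x_0 = 1 is included as the first entry of x.
theta = [1; 2; 3];     % theta_0, theta_1, theta_2 (made up)
x     = [1; 5; 10];    % x_0 = 1, x_1 = 5, x_2 = 10
h     = theta' * x;    % h_theta(x) = theta^T x = 1 + 10 + 30 = 41
```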
2. Gradient Descent for Multiple Variables:
repeat until convergence: {
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \Bigl( h_\theta(x^{(i)}) - y^{(i)} \Bigr) \cdot x^{(i)}_j
(simultaneously for every $j := 0, \ldots, n$)
}
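A vectorized Octave sketch of the loop above, assuming `X` is an m-by-(n+1) design matrix with a leading column of ones and `y` is an m-by-1 target vector (the function name `gradientDescentMulti` is my own):

```octave
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    grad  = (1 / m) * X' * (X * theta - y);     % all partial derivatives at once
    theta = theta - alpha * grad;               % simultaneous update of theta_0..theta_n
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);   % track J
  end
end
```

Because the gradient is computed from the old `theta` before any element is overwritten, this is exactly the simultaneous update required above.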
3. Gradient Descent in Practice:
- Feature Scaling and Mean Normalization:
x_i := \frac {x_i - \mu_i}{s_i}
where $\mu_i$ is the average of all the values for feature $i$, and $s_i$ is the range of values (max - min) or the standard deviation.
We can speed up gradient descent by having each of our input values in roughly the same range.
Ideally, $-1 \leq x_i \leq 1$ or $-0.5 \leq x_i \leq 0.5$ for every feature $x_i$.
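A short Octave sketch of mean normalization and feature scaling, assuming `X` holds the raw features as an m-by-n matrix (no bias column yet):

```octave
% Dividing by the range (max - min) instead of the standard deviation also works.
mu     = mean(X);              % 1-by-n vector of per-feature means
sigma  = std(X);               % 1-by-n vector of per-feature standard deviations
X_norm = (X - mu) ./ sigma;    % broadcasting subtracts/divides row-wise
```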
- Debugging gradient descent
Plot the cost function $J(\theta)$ against the number of iterations of gradient descent. If $J(\theta)$ ever increases, you probably need to decrease $\alpha$.
It has been proven that if the learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
To choose $\alpha$, try values roughly three times apart: ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
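An Octave sketch of this debugging plot, reusing the hypothetical `gradientDescentMulti` from the sketch above:

```octave
% Plot J(theta) against the iteration number for one choice of alpha.
alpha         = 0.01;           % re-run with 0.003, 0.03, 0.1, ... and compare curves
num_iters     = 400;
initial_theta = zeros(size(X, 2), 1);
[theta, J_history] = gradientDescentMulti(X, y, initial_theta, alpha, num_iters);
plot(1:num_iters, J_history);
xlabel('Number of iterations');
ylabel('Cost J(\theta)');
```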
- Features and Polynomial Regression (ways to improve the hypothesis function):
    - Combine multiple features into one (e.g. combine frontage and depth into a single area feature).
    - Change the behavior or curve of the hypothesis by adding quadratic, cubic, or square-root terms of a feature.
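A small Octave sketch of polynomial features built from one raw feature `x` (a hypothetical m-by-1 column, e.g. house size); the usual linear-regression machinery then applies to `X_poly`:

```octave
X_poly = [ones(length(x), 1), x, x .^ 2, x .^ 3];   % bias, x, x^2, x^3
% Feature scaling matters a lot here: x^3 spans a much larger range than x.
```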
## Normal Equation
An alternative way of minimizing the cost function $J$: solve for $\theta$ analytically in a single step, with no iterations.
\theta = (X^T X) ^{-1} X^Ty
- Octave code: `pinv(X'*X)*X'*y` (a fuller sketch follows at the end of this section)
- There is no need to do feature scaling with the normal equation.
- When to use the Normal Equation vs. Gradient Descent ($n$ is the number of features $x$):
    - If $n$ is small (roughly $n < 1{,}000$), the normal equation is usually the better choice.
    - If $n$ is very large (usually $n > 10{,}000$), gradient descent is better, since computing $(X^T X)^{-1}$ costs roughly $O(n^3)$.
- The normal equation is not applicable to more complex algorithms (classification algorithms, logistic regression, ...).
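A slightly fuller Octave sketch of the normal equation, assuming `X_raw` is an m-by-n matrix of raw (unscaled) features and `y` is an m-by-1 target vector (both names are my own):

```octave
X     = [ones(size(X_raw, 1), 1), X_raw];   % prepend the bias column x_0 = 1
theta = pinv(X' * X) * X' * y;              % no alpha, no iterations, no feature scaling
```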