
Machine Learning Study Notes (Week 3)

Posted at 2020-04-30

#Supervised Learning

##Classification Problem

1. Hypothesis Function:

  • "Sigmoid Function" or "Logistic Function":
    $g(z) = \frac{1}{1+e^{-z}} $
    $h_\theta(x) = g(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}}$


  • Interpretation:
    $h_\theta(x)$ is the estimated probability that the output is 1 for input $x$, parameterized by $\theta$:
    $h_\theta(x) = P(y=1|x;\theta) = 1 - P(y=0|x;\theta) $

  • Decision Boundary:

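    • predict $y=1$ when $h_\theta(x) \geq 0.5$ and $y=0$ when $h_\theta(x) < 0.5$
    • since $g(z) \geq 0.5$ exactly when $z \geq 0$, this means $y=1$ whenever $\theta^Tx \geq 0$ and $y=0$ whenever $\theta^Tx < 0$
    • $z$ is the input to the sigmoid (e.g. $z = \theta^Tx$)
    • the decision boundary is the set of points where $z = 0$; it does not have to be a straight line (e.g. $z = -1 + x_1^2 + x_2^2$ gives a circle of radius 1)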
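
The hypothesis and the 0.5 threshold can be sketched in a few lines of Python. This is only an illustration: the helper names (`sigmoid`, `hypothesis`, `predict`) and the parameter values are assumptions made for the example, not part of the course material.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must include the bias term x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

def predict(theta, x):
    """Predict y = 1 when h_theta(x) >= 0.5, i.e. when theta^T x >= 0."""
    return 1 if hypothesis(theta, x) >= 0.5 else 0

# Hypothetical parameters: the decision boundary is the line x_1 + x_2 = 3.
theta = [-3.0, 1.0, 1.0]
print(predict(theta, [1.0, 2.0, 2.0]))  # theta^T x = 1, so this prints 1
print(predict(theta, [1.0, 1.0, 1.0]))  # theta^T x = -1, so this prints 0
```

Note that `math.exp` overflows for very negative `z`, so a production implementation would need a numerically safer form.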


2. Cost Function:

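$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \Bigl[ y^{(i)}\log\bigl(h_\theta(x^{(i)})\bigr) + (1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr) \Bigr]$

Vectorized implementation:
$h = g(X\theta)$
$J(\theta)=\frac{1}{m} \bigl(-y^T\log(h)-(1-y)^T\log(1-h)\bigr)$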
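
As a rough check of the cost function, here is a minimal Python sketch that assumes the usual logistic regression cost, $J(\theta) = -\frac{1}{m}\sum_i [y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$. The function name `logistic_cost` and the toy data are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h))."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / m

# Toy data (made up); the first column is the bias term x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
print(logistic_cost([0.0, 0.0], X, y))   # theta = 0 gives h = 0.5, so J = log(2)
print(logistic_cost([-4.5, 2.0], X, y))  # a better fit gives a lower cost
```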

3. Gradient Descent:
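Repeat {

$\theta_j := \theta_j- \frac{\alpha}{m} \sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$

} (update all $\theta_j$ simultaneously)
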
Vectorized implementation:
$\theta := \theta - \frac{\alpha}{m}X^T\bigl(g(X\theta)-\vec{y}\bigr)$
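
The update can be sketched in Python as follows, assuming the standard rule $\theta_j := \theta_j - \frac{\alpha}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$ applied to every $j$ at once. The name `gradient_step`, the toy data, the learning rate and the iteration count are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One simultaneous gradient descent update of every theta_j."""
    m = len(y)
    # errors[i] = h_theta(x^(i)) - y^(i), computed with the old theta
    errors = [sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
              for x_i, y_i in zip(X, y)]
    return [t - alpha / m * sum(e * x_i[j] for e, x_i in zip(errors, X))
            for j, t in enumerate(theta)]

# Toy data (made up); the first column is the bias term x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)
print(theta)
```

Because all errors are computed before any parameter changes, the update is simultaneous, which matches the vectorized form.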

##Advanced Optimization

  • Optimization algorithms:
    • Gradient descent
    • Conjugate gradient
    • BFGS
    • L-BFGS

First, we need to provide a function that evaluates both $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$ for a given input value of $\theta$:

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];

Then we can use the `fminunc()` optimization routine together with the `optimset()` function, which creates an object containing the options we want to pass to `fminunc()`.

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
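
To make the "return both the cost and the gradient" contract concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the quadratic cost $J(\theta) = (\theta_1-5)^2 + (\theta_2-5)^2$ is a made-up example, and `minimize` is a plain gradient descent loop standing in for `fminunc()`, not a reimplementation of it.

```python
def cost_function(theta):
    """Return (j_val, gradient) for J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2."""
    j_val = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    gradient = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    return j_val, gradient

def minimize(cost, initial_theta, alpha=0.1, max_iter=100):
    """Plain gradient descent; a stand-in for fminunc, not the same algorithm."""
    theta = list(initial_theta)
    for _ in range(max_iter):
        _, gradient = cost(theta)
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    j_val, _ = cost(theta)
    return theta, j_val

opt_theta, function_val = minimize(cost_function, [0.0, 0.0])
print(opt_theta, function_val)  # theta close to [5, 5], cost close to 0
```

The optimizer only ever talks to the cost function through that pair of return values, which is why the same interface works for more sophisticated algorithms.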

##Multiclass Classification: One-vs-all
Multiclass means $y \in \{0, 1, \dots, n\}$. Simply apply the same logistic regression algorithm to each class:

Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.

To make a prediction on a new $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
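
The prediction step can be sketched in Python as below. The function name `predict_one_vs_all` is an assumption, and the three parameter vectors are hand-picked for illustration rather than trained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(all_theta, x):
    """Return the class i whose classifier h^(i)(x) gives the highest probability."""
    scores = [sigmoid(sum(t * v for t, v in zip(theta, x))) for theta in all_theta]
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical, hand-picked parameters for three classes (not trained values).
all_theta = [
    [2.0, -1.0, -1.0],   # class 0: small x_1 and small x_2
    [-3.0, 2.0, -1.0],   # class 1: large x_1
    [-3.0, -1.0, 2.0],   # class 2: large x_2
]
print(predict_one_vs_all(all_theta, [1.0, 4.0, 0.5]))  # class 1 scores highest
```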


##The Problem of Overfitting

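Overfitting (high variance) is usually caused by too many features or an overly complicated hypothesis: the model fits the training data well but generalizes poorly to new examples. Regularization keeps all the features but reduces the magnitude of the parameters $\theta_j$.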

1. Regularized Cost Function:
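We can regularize all of the parameters except $\theta_0$ in a single summation:

$\min_\theta\ J(\theta)= \frac{1}{2m}\Biggl[ \sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)} \bigr)^2 + \lambda\sum_{j=1}^n\theta_j^2 \Biggr]$

$\lambda$ is the regularization parameter. It determines how heavily large parameter values are penalized; if it is chosen too large, the hypothesis is smoothed out too much and underfits.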
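
A minimal Python sketch of this cost for linear regression, assuming the penalty sums over $j = 1, \dots, n$ and leaves $\theta_0$ out. The name `regularized_cost` and the toy data are made up for the example.

```python
def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]."""
    m = len(y)
    squared_errors = sum(
        (sum(t * v for t, v in zip(theta, x_i)) - y_i) ** 2
        for x_i, y_i in zip(X, y)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta_0 is not penalized
    return (squared_errors + penalty) / (2 * m)

# Toy data (made up); the first column is the bias term x_0 = 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
print(regularized_cost([0.0, 1.0], X, y, lam=0.0))  # perfect fit, no penalty: 0.0
print(regularized_cost([0.0, 1.0], X, y, lam=3.0))  # same fit plus penalty: 0.5
```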

2. Regularized Gradient Descent:

  • Regularized Linear Regression:

Repeat {

$\theta_0 := \theta_0 - \alpha\frac{1}{m} \sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_0^{(i)}$

$\theta_j := \theta_j - \alpha \Biggl[ \Bigl(\frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)})-y^{(i)} \bigr)x_j^{(i)}\Bigr)+\frac{\lambda}{m}\theta_j \Biggr]\qquad j\in \{1,2,\dots,n\}$

}


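With some rearranging, the update rule for $\theta_j$ can also be written as:
$\theta_j := \theta_j(1-\alpha\frac{\lambda}{m})-\alpha\frac{1}{m} \sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$

For $\alpha, \lambda > 0$ the factor $1-\alpha\frac{\lambda}{m}$ is less than 1, so each update first shrinks $\theta_j$ by a small amount and then applies the usual unregularized gradient step.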
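
The shrinking effect is easy to see in a small Python sketch of one regularized update for linear regression. It assumes the "shrink, then step" form $\theta_j := \theta_j(1-\alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_i(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$ for $j \geq 1$; the name `regularized_step` and the toy data are made up.

```python
def regularized_step(theta, X, y, alpha, lam):
    """One regularized gradient descent update for linear regression."""
    m = len(y)
    errors = [sum(t * v for t, v in zip(theta, x_i)) - y_i
              for x_i, y_i in zip(X, y)]
    new_theta = []
    for j, t in enumerate(theta):
        grad_j = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        # theta_0 is not regularized, so its shrink factor is 1
        shrink = 1.0 if j == 0 else 1.0 - alpha * lam / m
        new_theta.append(t * shrink - alpha * grad_j)
    return new_theta

# Toy data (made up); theta = [0, 1] already fits it perfectly.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
print(regularized_step([0.0, 1.0], X, y, alpha=0.1, lam=3.0))
```

With a perfect fit the gradient term is zero, so the only change is that $\theta_1$ shrinks from 1 to roughly 0.9 while $\theta_0$ stays at 0.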

  • Normal Equation:

$\theta = (X^TX+\lambda L)^{-1}X^Ty$
where $L$ is the $(n+1) \times (n+1)$ matrix

$L = \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix}$

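
For a feel of how $\lambda$ changes the solution, here is a Python sketch of the regularized normal equation $\theta = (X^TX+\lambda L)^{-1}X^Ty$, restricted to a single feature plus a bias term so that the matrix is $2 \times 2$ and can be inverted by hand. That restriction, the function name and the toy data are all assumptions made to keep the example dependency-free.

```python
def regularized_normal_equation(xs, ys, lam):
    """Solve theta = (X^T X + lam * L)^(-1) X^T y for one feature plus a bias term."""
    m = len(xs)
    # Entries of A = X^T X + lam * L, where X has rows [1, x] and L = diag(0, 1).
    a00, a01 = float(m), sum(xs)
    a11 = sum(x * x for x in xs) + lam
    # Entries of b = X^T y.
    b0 = sum(ys)
    b1 = sum(x * y for x, y in zip(xs, ys))
    # Solve the 2x2 system A * theta = b directly (Cramer's rule).
    det = a00 * a11 - a01 * a01
    theta0 = (a11 * b0 - a01 * b1) / det
    theta1 = (a00 * b1 - a01 * b0) / det
    return [theta0, theta1]

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
print(regularized_normal_equation(xs, ys, lam=0.0))  # [0.0, 1.0], the exact fit
print(regularized_normal_equation(xs, ys, lam=2.0))  # [1.0, 0.5], slope shrunk
```

With more features the same formula applies, but a linear algebra library would be the practical choice for solving the system.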
  • Regularized Logistic Regression:

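Repeat {

$\theta_0 := \theta_0 - \alpha\frac{1}{m} \sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_0^{(i)}$

$\theta_j := \theta_j - \alpha \Biggl[ \Bigl(\frac{1}{m} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)})-y^{(i)} \bigr)x_j^{(i)}\Bigr)+\frac{\lambda}{m}\theta_j \Biggr]\qquad j\in \{1,2,\dots,n\}$

}

The update looks identical to the one for regularized linear regression, but here $h_\theta(x) = g(\theta^Tx)$ is the sigmoid hypothesis.
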

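Cost function (regularized):
$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \Bigl[ y^{(i)}\log\bigl(h_\theta(x^{(i)})\bigr) + (1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr) \Bigr] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$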
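
A minimal Python sketch of the regularized logistic cost, assuming the usual form: the cross-entropy term plus $\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$, with $\theta_0$ left out of the penalty. The function name and the test values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_logistic_cost(theta, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * sum_{j>=1} theta_j^2."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t ** 2 for t in theta[1:])  # skip theta_0
    return -total / m + penalty

# Made-up data where the feature is always 0, so h = 0.5 for every example.
X = [[1.0, 0.0], [1.0, 0.0]]
y = [0, 1]
print(regularized_logistic_cost([0.0, 2.0], X, y, lam=0.0))  # log(2)
print(regularized_logistic_cost([0.0, 2.0], X, y, lam=1.0))  # log(2) + 1
```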


