## Support Vector Machines (SVM)
### 1. SVM Hypothesis
$$\min_\theta \; C\sum_{i=1}^m \Bigl[ y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)}) \Bigr] + \frac{1}{2}\sum_{j=1}^n \theta_j^2$$
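The two cost terms are piecewise-linear surrogates for the logistic costs: flat at zero once the margin is satisfied, linear otherwise. Below is a minimal NumPy sketch of the objective, assuming the common hinge forms $\text{cost}_1(z) = \max(0, 1-z)$ and $\text{cost}_0(z) = \max(0, 1+z)$; the function names are illustrative, not from any particular library.

```python
import numpy as np

def cost1(z):
    # Surrogate cost for y = 1: zero once z >= 1, linear below that.
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Surrogate cost for y = 0: zero once z <= -1, linear above that.
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """Regularized SVM objective from the formula above.

    X is the (m, n+1) design matrix with x_0 = 1; y holds labels in {0, 1}.
    """
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)  # theta_0 is not regularized
    return data_term + reg_term
```

Note that $\theta_0$ is excluded from the regularization term, matching the sum starting at $j = 1$.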
### 2. Kernels
- Gaussian kernel (a NumPy sketch follows the notes below):
$$f_1 = \text{similarity}(x, l^{(1)}) = \exp\Bigl(-\frac{\|x - l^{(1)}\|^2}{2\sigma^2}\Bigr) = \exp\Bigl(-\frac{\sum_{j=1}^n (x_j - l_j^{(1)})^2}{2\sigma^2}\Bigr)$$
- SVM with Kernels
Hypothesis: Given $x$, compute features $f \in \mathbb{R}^{m+1}$ (one similarity feature per training-example landmark $l^{(i)} = x^{(i)}$, plus $f_0 = 1$).
Predict "$y = 1$" if $\theta^T f \geq 0$.
Training:
$$\min_\theta \; C \sum_{i=1}^m \Bigl[ y^{(i)}\,\text{cost}_1(\theta^T f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T f^{(i)}) \Bigr] + \frac{1}{2}\sum_{j=1}^n \theta_j^2$$
(with kernels $n = m$: one parameter per landmark)
Notes:
- $C$ $\bigl(= \frac{1}{\lambda}\bigr)$:
  - Large $C$: lower bias, higher variance (less regularization; prone to overfitting).
  - Small $C$: higher bias, lower variance (more regularization; prone to underfitting).
- $\sigma^2$:
  - Large $\sigma^2$: features $f_i$ vary more smoothly; higher bias, lower variance.
  - Small $\sigma^2$: features $f_i$ vary more sharply; lower bias, higher variance.
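To make the kernelized hypothesis concrete, here is a small NumPy sketch that computes the similarity features $f$ from landmarks (placed at the training examples, as above) and applies the decision rule $\theta^T f \geq 0$. All names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # f = exp(-||x - l||^2 / (2 sigma^2)), the similarity defined above.
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

def kernel_features(x, landmarks, sigma):
    # Map x to f in R^{m+1}: f_0 = 1, then one similarity per landmark.
    f = [gaussian_kernel(x, l, sigma) for l in landmarks]
    return np.concatenate(([1.0], f))

def predict(theta, x, landmarks, sigma):
    # Hypothesis: predict y = 1 exactly when theta^T f >= 0.
    return int(theta @ kernel_features(x, landmarks, sigma) >= 0)

# Usage: with landmarks at the m training examples,
#   theta = ...                # learned by minimizing the objective above
#   predict(theta, x_new, X_train, sigma=0.5)
```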
### 3. Using an SVM
Use an SVM software package (e.g. liblinear, libsvm) to solve for the parameters $\theta$. You need to specify:
- Choice of parameter $C$
- Choice of kernel: no kernel ("linear kernel"), Gaussian kernel, ...
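As one concrete example, scikit-learn's `SVC` exposes exactly these two choices. Its `gamma` parameter plays the role of $\frac{1}{2\sigma^2}$ in the Gaussian similarity above; this is a sketch of one package, not the only option.

```python
from sklearn.svm import SVC

# No kernel ("linear kernel"): the decision function is theta^T x directly.
linear_clf = SVC(kernel="linear", C=1.0)

# Gaussian kernel: scikit-learn calls it "rbf", and gamma = 1 / (2 * sigma^2)
# in the similarity formula above.
sigma = 0.5
rbf_clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2.0 * sigma ** 2))

# rbf_clf.fit(X_train, y_train)
# rbf_clf.predict(X_test)
```

Perform feature scaling before using the Gaussian kernel, since the similarity depends on raw distances between features.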
### 4. Logistic Regression vs. SVMs
$n$ = number of features, $m$ = number of training examples
- if $n$ is large (relative to $m$): use logistic regression, or SVM without a kernel ("linear kernel") -- e.g. $n = 10{,}000$, $m = 10$ to $1{,}000$
- if $n$ is small and $m$ is intermediate: use SVM with a Gaussian kernel -- e.g. $n = 1{,}000$, $m = 10$ to $10{,}000$
- if $n$ is small and $m$ is large: create/add more features, then use logistic regression or SVM without a kernel (the Gaussian kernel becomes slow to train at this scale) -- e.g. $n = 1{,}000$, $m = 50{,}000+$
- A neural network is likely to work well in most of these settings, but may be slower to train.
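A hedged sketch that encodes these rules of thumb as code; the exact cutoffs are illustrative judgment calls, not part of the guideline.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def pick_classifier(n_features, m_examples):
    """Rule-of-thumb model choice per the guidelines above."""
    if n_features >= m_examples:        # n large relative to m
        return SVC(kernel="linear")     # or LogisticRegression()
    if m_examples <= 10_000:            # n small, m intermediate
        return SVC(kernel="rbf")        # Gaussian kernel
    # n small, m large: engineer more features first, then go linear
    return LogisticRegression()
```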