## Gradient Descent with Large Datasets
- Batch gradient descent: use all $m$ examples in each iteration; this is very computationally expensive when $m$ is large (sketched below).
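
A minimal sketch of one batch-gradient update for linear regression (NumPy; the synthetic data, function name, and hyperparameters are illustrative, not from the notes):

```python
import numpy as np

def batch_gradient_step(theta, X, y, alpha):
    """One batch gradient descent update for linear regression:
    the gradient is computed from all m examples."""
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m   # sums over every example
    return theta - alpha * gradient

# Illustrative synthetic data: m = 1000 examples, 2 features plus a bias column.
rng = np.random.default_rng(0)
X = np.c_[np.ones(1000), rng.normal(size=(1000, 2))]
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=1000)

theta = np.zeros(3)
for _ in range(500):                       # each iteration touches all 1000 examples
    theta = batch_gradient_step(theta, X, y, alpha=0.1)
print(theta)                               # approaches [1, 2, -3]
```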
- Stochastic gradient descent: use 1 example in each iteration (sketched below).
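
A sketch of stochastic gradient descent in the same assumed linear-regression setting; each update uses one shuffled example:

```python
import numpy as np

def sgd_epoch(theta, X, y, alpha, rng):
    """One pass of stochastic gradient descent: shuffle the examples,
    then update theta using a single example per step."""
    for i in rng.permutation(len(y)):
        error = X[i] @ theta - y[i]            # residual for example i only
        theta = theta - alpha * error * X[i]   # step along that one example's gradient
    return theta

# Same kind of synthetic data as above.
rng = np.random.default_rng(0)
X = np.c_[np.ones(1000), rng.normal(size=(1000, 2))]
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=1000)

theta = np.zeros(3)
for _ in range(5):                             # a few passes over the data
    theta = sgd_epoch(theta, X, y, alpha=0.01, rng=rng)
print(theta)                                   # close to [1, 2, -3]
```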
- We can plot $\text{cost}\bigl(\theta,(x^{(i)},y^{(i)})\bigr)$, averaged over the last (say) 1000 examples, to monitor how stochastic gradient descent is doing (see the sketch below).
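
A sketch of that monitoring idea, assuming the per-example squared-error cost $\text{cost}\bigl(\theta,(x^{(i)},y^{(i)})\bigr)=\tfrac{1}{2}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2$; the cost is recorded just before each update and averaged over windows of 1000 examples (the synthetic data and matplotlib plotting are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: 20,000 examples, so there are 20 windows of 1000.
rng = np.random.default_rng(0)
m = 20_000
X = np.c_[np.ones(m), rng.normal(size=(m, 2))]
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.5, size=m)

theta, alpha = np.zeros(3), 0.01
costs = []
for i in rng.permutation(m):
    error = X[i] @ theta - y[i]
    costs.append(0.5 * error ** 2)   # cost(theta, (x_i, y_i)) recorded BEFORE the update
    theta -= alpha * error * X[i]    # one stochastic gradient step

# Average the per-example cost over consecutive windows of 1000 examples and plot it.
window = 1000
averaged = [np.mean(costs[k:k + window]) for k in range(0, len(costs), window)]
plt.plot(averaged)
plt.xlabel("window of 1000 examples")
plt.ylabel("cost averaged over the last 1000 examples")
plt.show()
```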
- If we reduce the learning rate $\alpha$ (and run stochastic gradient descent long enough), we may end up with a slightly better set of parameters than with a larger $\alpha$.
- If we want stochastic gradient descent to converge to a (local) minimum rather than wander or "oscillate" around it, we should slowly decrease $\alpha$ over time (a common schedule is sketched below).
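
One common schedule is $\alpha = \dfrac{\text{const1}}{\text{iterationNumber} + \text{const2}}$; the sketch below assumes that form, with illustrative constants:

```python
def decayed_alpha(iteration, const1=1.0, const2=100.0):
    """Slowly shrink the learning rate so stochastic gradient descent settles
    near a minimum instead of oscillating around it. const1 and const2 are
    extra hyperparameters that have to be tuned."""
    return const1 / (iteration + const2)

# alpha starts at 0.01 here and decays toward 0 as the iteration count grows.
for t in (0, 1_000, 10_000, 100_000):
    print(t, decayed_alpha(t))
```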
- Mini-batch gradient descent: use $b$ examples in each iteration ($b$ = mini-batch size, typically 2-100); see the sketch below.
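
A sketch of mini-batch gradient descent, assuming the same linear-regression gradient as above; the batch size $b = 10$ in the usage comment is just an example:

```python
import numpy as np

def minibatch_epoch(theta, X, y, alpha, b, rng):
    """One pass of mini-batch gradient descent: update theta using
    b examples at a time (b is the mini-batch size)."""
    m = len(y)
    order = rng.permutation(m)
    for start in range(0, m, b):
        idx = order[start:start + b]
        Xb, yb = X[idx], y[idx]
        gradient = Xb.T @ (Xb @ theta - yb) / len(idx)   # averaged over the b examples
        theta = theta - alpha * gradient
    return theta

# Usage (with X, y, theta as in the earlier sketches):
# theta = minibatch_epoch(theta, X, y, alpha=0.05, b=10, rng=np.random.default_rng(0))
```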
- Map-reduce (data parallelism): use multiple computers or cores to parallelize the learning algorithm, e.g. by splitting the gradient sum over the training set across machines (sketched below).
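
A minimal sketch of the map-reduce idea for the batch gradient: split the training set across workers, have each compute the gradient sum over its slice (the map step), then add the partial sums and divide by $m$ on a central node (the reduce step). Python's `multiprocessing` stands in here for separate machines or cores; this is an assumption, not the course's implementation:

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Map step: gradient sum over one slice of the training set."""
    theta, X_slice, y_slice = args
    return X_slice.T @ (X_slice @ theta - y_slice)

def mapreduce_gradient(theta, X, y, n_workers=4):
    """Reduce step: add up the workers' partial sums and divide by m."""
    X_parts = np.array_split(X, n_workers)
    y_parts = np.array_split(y, n_workers)
    with Pool(n_workers) as pool:
        partials = pool.map(partial_gradient,
                            [(theta, Xp, yp) for Xp, yp in zip(X_parts, y_parts)])
    return sum(partials) / len(y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.c_[np.ones(10_000), rng.normal(size=(10_000, 2))]
    y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=10_000)
    theta = np.zeros(3)
    for _ in range(200):
        theta -= 0.1 * mapreduce_gradient(theta, X, y)
    print(theta)   # same answer as batch gradient descent, computed in parallel
```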