## Gradient Descent with Large Datasets
- Batch gradient descent: use all $m$ examples in each iteration; this is very computationally expensive when $m$ is large (sketched below).
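
A minimal sketch of one batch-gradient update for linear regression (NumPy; the synthetic data, function name, and hyperparameters are illustrative, not from the notes):

```python
import numpy as np

def batch_gradient_step(theta, X, y, alpha):
    """One batch gradient descent update for linear regression:
    the gradient is computed from all m examples."""
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m   # sums over every example
    return theta - alpha * gradient

# Illustrative synthetic data: m = 1000 examples, 2 features plus a bias column.
rng = np.random.default_rng(0)
X = np.c_[np.ones(1000), rng.normal(size=(1000, 2))]
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=1000)

theta = np.zeros(3)
for _ in range(500):                       # each iteration touches all 1000 examples
    theta = batch_gradient_step(theta, X, y, alpha=0.1)
print(theta)                               # approaches [1, 2, -3]
```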
- Stochastic gradient descent: use 1 example in each iteration (sketched below).
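
A sketch of stochastic gradient descent in the same assumed linear-regression setting; each update uses one shuffled example:

```python
import numpy as np

def sgd_epoch(theta, X, y, alpha, rng):
    """One pass of stochastic gradient descent: shuffle the examples,
    then update theta using a single example per step."""
    for i in rng.permutation(len(y)):
        error = X[i] @ theta - y[i]            # residual for example i only
        theta = theta - alpha * error * X[i]   # step along that one example's gradient
    return theta

# Same kind of synthetic data as above.
rng = np.random.default_rng(0)
X = np.c_[np.ones(1000), rng.normal(size=(1000, 2))]
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=1000)

theta = np.zeros(3)
for _ in range(5):                             # a few passes over the data
    theta = sgd_epoch(theta, X, y, alpha=0.01, rng=rng)
print(theta)                                   # close to [1, 2, -3]
```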
- We can plot $\text{cost}\bigl(\theta,(x^{(i)},y^{(i)})\bigr)$, averaged over the last (say) 1000 examples, to monitor how stochastic gradient descent is doing (see the sketch below).
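
A sketch of that monitoring idea, assuming the per-example squared-error cost $\text{cost}\bigl(\theta,(x^{(i)},y^{(i)})\bigr)=\tfrac{1}{2}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2$; the cost is recorded just before each update and averaged over windows of 1000 examples (the synthetic data and matplotlib plotting are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: 20,000 examples, so there are 20 windows of 1000.
rng = np.random.default_rng(0)
m = 20_000
X = np.c_[np.ones(m), rng.normal(size=(m, 2))]
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.5, size=m)

theta, alpha = np.zeros(3), 0.01
costs = []
for i in rng.permutation(m):
    error = X[i] @ theta - y[i]
    costs.append(0.5 * error ** 2)   # cost(theta, (x_i, y_i)) recorded BEFORE the update
    theta -= alpha * error * X[i]    # one stochastic gradient step

# Average the per-example cost over consecutive windows of 1000 examples and plot it.
window = 1000
averaged = [np.mean(costs[k:k + window]) for k in range(0, len(costs), window)]
plt.plot(averaged)
plt.xlabel("window of 1000 examples")
plt.ylabel("cost averaged over the last 1000 examples")
plt.show()
```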
- If we reduce the learning rate $\alpha$ (and run stochastic gradient descent long enough), we may end up with a slightly better set of parameters than with a larger $\alpha$.
- If we want stochastic gradient descent to converge to a (local) minimum rather than wander or "oscillate" around it, we should slowly decrease $\alpha$ over time (a common schedule is sketched below).
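
One common schedule is $\alpha = \dfrac{\text{const1}}{\text{iterationNumber} + \text{const2}}$; the sketch below assumes that form, with illustrative constants:

```python
def decayed_alpha(iteration, const1=1.0, const2=100.0):
    """Slowly shrink the learning rate so stochastic gradient descent settles
    near a minimum instead of oscillating around it. const1 and const2 are
    extra hyperparameters that have to be tuned."""
    return const1 / (iteration + const2)

# alpha starts at 0.01 here and decays toward 0 as the iteration count grows.
for t in (0, 1_000, 10_000, 100_000):
    print(t, decayed_alpha(t))
```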
- Mini-batch gradient descent: use $b$ examples in each iteration ($b$ = mini-batch size, typically 2-100); see the sketch below.
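
A sketch of mini-batch gradient descent, assuming the same linear-regression gradient as above; the batch size $b = 10$ in the usage comment is just an example:

```python
import numpy as np

def minibatch_epoch(theta, X, y, alpha, b, rng):
    """One pass of mini-batch gradient descent: update theta using
    b examples at a time (b is the mini-batch size)."""
    m = len(y)
    order = rng.permutation(m)
    for start in range(0, m, b):
        idx = order[start:start + b]
        Xb, yb = X[idx], y[idx]
        gradient = Xb.T @ (Xb @ theta - yb) / len(idx)   # averaged over the b examples
        theta = theta - alpha * gradient
    return theta

# Usage (with X, y, theta as in the earlier sketches):
# theta = minibatch_epoch(theta, X, y, alpha=0.05, b=10, rng=np.random.default_rng(0))
```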
- Map-reduce (data parallelism): use multiple computers or cores to parallelize the learning algorithm, e.g. by splitting the gradient sum over the training set across machines (sketched below).
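
A minimal sketch of the map-reduce idea for the batch gradient: split the training set across workers, have each compute the gradient sum over its slice (the map step), then add the partial sums and divide by $m$ on a central node (the reduce step). Python's `multiprocessing` stands in here for separate machines or cores; this is an assumption, not the course's implementation:

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Map step: gradient sum over one slice of the training set."""
    theta, X_slice, y_slice = args
    return X_slice.T @ (X_slice @ theta - y_slice)

def mapreduce_gradient(theta, X, y, n_workers=4):
    """Reduce step: add up the workers' partial sums and divide by m."""
    X_parts = np.array_split(X, n_workers)
    y_parts = np.array_split(y, n_workers)
    with Pool(n_workers) as pool:
        partials = pool.map(partial_gradient,
                            [(theta, Xp, yp) for Xp, yp in zip(X_parts, y_parts)])
    return sum(partials) / len(y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.c_[np.ones(10_000), rng.normal(size=(10_000, 2))]
    y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=10_000)
    theta = np.zeros(3)
    for _ in range(200):
        theta -= 0.1 * mapreduce_gradient(theta, X, y)
    print(theta)   # same answer as batch gradient descent, computed in parallel
```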