Gradient Descent

Updated: 2019-01-13

Gradient Descent vs Stochastic Gradient Descent

  • Gradient Descent (or Batch Gradient Descent): compute the cost gradient over the complete training set before each weight update. Each update is costly on large datasets, so training can take longer to converge.
  • Stochastic Gradient Descent (or Online Gradient Descent): update the weights after each single training sample, or after each mini-batch (hundreds or thousands) of training samples. The Widrow-Hoff (LMS) learning rule used to train ADALINE is an early example.
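
The difference between the two variants comes down to how many samples feed each weight update. The sketch below is a minimal illustration, not a reference implementation: it fits a one-variable linear model with squared-error loss, and the function name `fit_line`, the learning rate, the epoch count, and the toy data are all assumptions made for the example.

```python
import random

def fit_line(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w*x + b by minimizing mean squared error.

    batch_size == len(xs) -> batch gradient descent (one update per pass)
    batch_size == 1       -> stochastic gradient descent (one update per sample)
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a fresh random order each pass
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Prediction error for every sample in the current batch.
            errors = [w * xs[i] + b - ys[i] for i in batch]
            # Gradient of the mean squared error over this batch only.
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free toy data drawn from y = 2x + 1 (made up for illustration).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x + 1.0 for x in xs]

w_batch, b_batch = fit_line(xs, ys, batch_size=len(xs))  # batch GD
w_mini, b_mini = fit_line(xs, ys, batch_size=2)          # mini-batch
w_sgd, b_sgd = fit_line(xs, ys, batch_size=1)            # SGD

print("batch:     ", w_batch, b_batch)
print("mini-batch:", w_mini, b_mini)
print("sgd:       ", w_sgd, b_sgd)
```

With five samples, batch gradient descent makes one update per pass over the data while SGD makes five, which is why SGD usually needs fewer passes on large datasets even though each individual step is noisier.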

Use Cases

  • fits logistic regression when trained with the log loss (loss=log)
  • fits a linear soft-margin support vector machine when trained with the hinge loss (loss=hinge)
  • combined with the backpropagation algorithm: the de facto standard algorithm for training artificial neural networks.
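
The first two use cases share one training loop and differ only in the loss derivative. The following is a rough sketch under stated assumptions: labels are encoded as -1/+1, a small L2 penalty `alpha` is added, and the names `sgd_linear_classifier` and `predict`, the hyperparameter defaults, and the toy data are all hypothetical. The `loss` strings simply echo the names used in the list above.

```python
import math
import random

def sgd_linear_classifier(X, y, loss, lr=0.1, epochs=200, alpha=0.01, seed=0):
    """Train a linear classifier with per-sample SGD. Labels must be -1 or +1.

    loss="log"   -> logistic regression
    loss="hinge" -> linear soft-margin SVM
    alpha is an L2 regularization strength (illustrative default).
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            margin = y[i] * score
            if loss == "log":
                # Derivative of log(1 + exp(-margin)) with respect to margin.
                dloss = -1.0 / (1.0 + math.exp(margin))
            elif loss == "hinge":
                # Subgradient of max(0, 1 - margin): active only inside the margin.
                dloss = -1.0 if margin < 1.0 else 0.0
            else:
                raise ValueError("loss must be 'log' or 'hinge'")
            # Chain rule: d(margin)/dw = y * x, plus the L2 penalty on w.
            w = [wj - lr * (dloss * y[i] * xj + alpha * wj)
                 for wj, xj in zip(w, X[i])]
            b -= lr * dloss * y[i]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data (made up for illustration).
X = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
     [-1.0, -2.0], [-2.0, -1.5], [-3.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]

w_log, b_log = sgd_linear_classifier(X, y, loss="log")
w_hinge, b_hinge = sgd_linear_classifier(X, y, loss="hinge")

preds_log = [predict(w_log, b_log, x) for x in X]
preds_hinge = [predict(w_hinge, b_hinge, x) for x in X]
print(preds_log, preds_hinge)
```

The hinge loss stops updating on a sample once its margin reaches 1, while the log loss keeps nudging the weights on every sample with a shrinking step; that is the practical difference between the two fits.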

Other algorithms

  • conjugate gradient
  • BFGS
  • Limited-memory BFGS (L-BFGS or LM-BFGS)
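  • maximum likelihood estimation (MLE): an estimation criterion rather than an optimizer; the likelihood is typically maximized with a numerical method such as those in this list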
  • least mean squares (LMS) adaptive filter