# Gradient Descent

Updated: 2021-11-19

## Gradient Descent vs Stochastic Gradient Descent

- Gradient Descent (or Batch Gradient Descent): computes the cost gradient over the complete training set before each weight update. Every update is costly for large datasets, so convergence can take longer.
- Stochastic Gradient Descent (or Online Gradient Descent): updates the weights after each single training sample, or after each mini-batch (hundreds or thousands) of training samples. An early instance is the Widrow-Hoff (LMS) learning rule used to train ADALINE.
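
The difference between the two update schedules can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the squared-error cost, the toy data, the learning rates, and the function names (`gradient`, `batch_gd`, `sgd`) are all assumptions chosen for the example.

```python
import random

def gradient(w, b, samples):
    """Mean gradient of the cost 0.5 * (w*x + b - y)**2 over `samples`."""
    n = len(samples)
    gw = sum((w * x + b - y) * x for x, y in samples) / n
    gb = sum((w * x + b - y) for x, y in samples) / n
    return gw, gb

def batch_gd(data, lr=0.5, epochs=200):
    """Batch gradient descent: one update per pass over the full training set."""
    w = b = 0.0
    for _ in range(epochs):
        gw, gb = gradient(w, b, data)
        w -= lr * gw
        b -= lr * gb
    return w, b

def sgd(data, lr=0.05, epochs=200, batch_size=1, seed=0):
    """Stochastic gradient descent: one update per sample (batch_size=1)
    or per mini-batch (batch_size > 1), in a freshly shuffled order each epoch."""
    rng = random.Random(seed)
    data = list(data)
    w = b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            gw, gb = gradient(w, b, data[i:i + batch_size])
            w -= lr * gw
            b -= lr * gb
    return w, b

# Hypothetical noise-free toy data drawn from y = 2x + 1
data = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]

print("batch:     ", batch_gd(data))
print("online:    ", sgd(data, batch_size=1))
print("mini-batch:", sgd(data, batch_size=5))
```

All three variants should recover a slope near 2 and an intercept near 1 on this data. The batch version makes 200 updates in total, while the online version makes 200 × 21, each one computed from a single sample.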

## Use Cases

- fitting logistic regression: log loss (loss=log)
- fitting (linear) support vector machines: hinge loss (loss=hinge), which gives the soft-margin formulation
- combined with the backpropagation algorithm: the de facto standard approach for training neural networks.
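
The first two use cases differ only in the loss whose gradient drives the update. The sketch below assumes labels in {-1, +1}, a linear score z = w·x + b, and a small L2 penalty `alpha`. The helper names and the toy data are hypothetical. The `loss=log` / `loss=hinge` notation above looks like the `loss` parameter of scikit-learn's `SGDClassifier`, but that link is a guess and this code does not use that library.

```python
import math
import random

def log_loss_grad(margin, y):
    """d/dz of log(1 + exp(-y*z)), with margin = y*z (logistic regression)."""
    # sigmoid(-margin), written so exp() never receives a large positive argument
    if margin >= 0:
        s = math.exp(-margin) / (1.0 + math.exp(-margin))
    else:
        s = 1.0 / (1.0 + math.exp(margin))
    return -y * s

def hinge_grad(margin, y):
    """Subgradient d/dz of max(0, 1 - y*z), with margin = y*z (linear SVM)."""
    return -y if margin < 1.0 else 0.0

def sgd_linear_classifier(data, loss_grad, lr=0.1, alpha=1e-4, epochs=50, seed=0):
    """Per-sample SGD on a linear model; `alpha` is an assumed L2 penalty weight."""
    rng = random.Random(seed)
    data = list(data)
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            g = loss_grad(y * z, y)  # derivative of the loss w.r.t. the score z
            w = [wi - lr * (g * xi + alpha * wi) for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical linearly separable toy data: the label is the sign of x0 + x1
points = [(-2.0, -1.0), (-1.0, -1.5), (-1.5, -0.5), (1.0, 2.0), (2.0, 0.5), (1.5, 1.5)]
data = [(p, 1 if p[0] + p[1] > 0 else -1) for p in points]

w_log, b_log = sgd_linear_classifier(data, log_loss_grad)
w_svm, b_svm = sgd_linear_classifier(data, hinge_grad)
```

With the hinge loss, a sample whose margin is at least 1 contributes no loss gradient, so only points near the decision boundary keep moving the weights. The log loss keeps nudging the weights for every sample.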

## Other Algorithms

- conjugate gradient
- BFGS
- Limited-memory BFGS (L-BFGS or LM-BFGS)
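- maximum likelihood estimation (MLE): strictly an estimation criterion rather than an optimizer; its likelihood objective is usually maximized with an iterative method such as the ones listed here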
- least mean squares (LMS) adaptive filter