- Gradient Descent (or Batch Gradient Descent): computes the cost gradient over the complete training set before each weight update. Each update is costly for large datasets, so training can take longer to converge.
- Stochastic Gradient Descent (or Online Gradient Descent): updates the weights after each single training sample, or after each mini-batch (hundreds or thousands) of training samples. Its early use for training linear regression models, around 1960, went under the name ADALINE.
  - fits logistic regression: `loss=log`
  - fits (linear) support vector machines: `loss=hinge` (soft-margin)
  - combined with the backpropagation algorithm: the de facto standard for training artificial neural networks.
- conjugate gradient
- Limited-memory BFGS (L-BFGS or LM-BFGS)
- least mean squares (LMS) adaptive filter
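
The difference between the batch and stochastic updates described in the list can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the squared-error loss, the synthetic noise-free data, the learning rates, and all function names (`batch_gd`, `sgd`, `make_data`, and so on) are assumptions chosen for the example. With `batch_size=1` and a squared-error loss, the `sgd` update has the same form as the LMS rule.

```python
import random


def predict(w, x):
    """Linear model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def gradient(w, batch):
    """Gradient of the mean squared error 1/(2m) * sum((w.x - y)^2) over `batch`."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(batch)
    return grad


def batch_gd(data, n_features, lr=0.5, epochs=200):
    """Batch gradient descent: one weight update per pass over the full set."""
    w = [0.0] * n_features
    for _ in range(epochs):
        g = gradient(w, data)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def sgd(data, n_features, lr=0.1, epochs=50, batch_size=1, seed=0):
    """Stochastic (batch_size=1) or mini-batch gradient descent:
    one weight update per sample or per mini-batch."""
    rng = random.Random(seed)
    data = list(data)  # copy so shuffling does not reorder the caller's list
    w = [0.0] * n_features
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            g = gradient(w, data[start:start + batch_size])
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def make_data(n=200, seed=1):
    """Hypothetical noise-free data from y = 2*x1 - 3*x2 + 0.5.
    The bias is modelled as a constant third feature equal to 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append(([x1, x2, 1.0], 2.0 * x1 - 3.0 * x2 + 0.5))
    return rows


if __name__ == "__main__":
    data = make_data()
    print("batch GD  :", batch_gd(data, 3))
    print("SGD       :", sgd(data, 3, batch_size=1))
    print("mini-batch:", sgd(data, 3, batch_size=32))
```

All three runs should recover weights close to `[2, -3, 0.5]` on this toy problem. The batch version performs 200 updates, each touching every sample, while the single-sample version performs 10,000 much cheaper updates over 50 passes.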