Machine Learning Algorithms Pros and Cons

TL;DR

Start with Logistic Regression, then try Tree Ensembles, and/or Neural Networks.

Occam's Razor principle: use the least complicated algorithm that can address your needs and only go for something more complicated if strictly necessary.

Based on my own experience, only Neural Networks and Gradient Boosted Decision Trees (GBDT) are widely used in industry. I have witnessed Logistic Regression and Random Forest being deprecated more than once (meaning they are good starters). I have never heard anybody talk about SVMs in companies.

Pros and Cons

This page discusses the most popular algorithms. For a full list of machine learning algorithms, check out the cheatsheet.

Naive Bayes

  • super simple; you are essentially just doing a bunch of counts.
  • if the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge more quickly than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn't hold, an NB classifier still often performs surprisingly well in practice.
  • a good bet if you want to do some kind of semi-supervised learning, or want something embarrassingly simple that performs pretty well.
  • no distribution requirements
  • good for categorical variables with few categories
  • computes the product of independent per-feature distributions (see the sketch below)
  • suffers from multicollinearity
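
To make the counts-and-independence idea concrete, here is a minimal sketch of a Naive Bayes text classifier in scikit-learn. The tiny corpus, labels, and the spam/ham framing are made up purely for illustration.

```python
# Minimal Naive Bayes sketch with scikit-learn (toy data, for illustration only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap meds now", "meeting at noon", "win cash now", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # hypothetical labels: 1 = spam, 0 = ham

# Bag-of-words counts feed directly into MultinomialNB, which just counts word
# occurrences per class and applies Bayes' rule under the independence assumption.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["cash now"]))        # predicted class
print(model.predict_proba(["cash now"]))  # class probabilities
```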

Logistic Regression

Logistic regression is still the most widely used classification algorithm.


  • a pretty well-behaved classification algorithm that can be trained as long as you expect your features to be roughly linear and the problem to be linearly separable.
  • you can do some feature engineering to turn most non-linear features into linear ones pretty easily.
  • it is also pretty robust to noise and you can avoid overfitting and even do feature selection by using l2 or l1 regularization.
  • logistic regression can also be used in Big Data scenarios since it is pretty efficient and can be distributed using, for example, ADMM (see logreg).
  • the output can be interpreted as a probability: you can use it for ranking instead of classification.
  • run a simple l2-regularized LR to come up with a baseline (see the sketch after this list)
  • no distribution requirement
  • performs well with categorical variables that have few categories
  • models class probabilities with the logistic (sigmoid) function
  • easy to interpret
  • can compute confidence intervals (CIs)
  • suffers from multicollinearity
  • lots of ways to regularize your model
  • no need to worry about features being correlated, unlike with Naive Bayes.
  • easily update the model to take in new data (using an online gradient descent method)
  • use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you're unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
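
Here is a minimal sketch of the l2-regularized LR baseline mentioned in the list above, using scikit-learn. The synthetic dataset and the parameter values (C, max_iter) are placeholders; on a real problem you would tune them.

```python
# Minimal l2-regularized logistic regression baseline (synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# penalty="l2" is the default; C is the inverse regularization strength.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

print("baseline accuracy:", clf.score(X_test, y_test))
# predict_proba outputs can be used directly for ranking or threshold tuning.
print(clf.predict_proba(X_test[:3]))
```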

Lasso

  • no distribution requirement
  • uses an L1 penalty
  • performs variable selection
  • suffers from multicollinearity

Ridge

  • no distribution requirement
  • uses an L2 penalty
  • no variable selection (coefficients are shrunk but not zeroed out; see the contrast sketched below)
  • does not suffer from multicollinearity
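
A small sketch contrasting the two penalties on the same synthetic regression problem: the L1 penalty drives many coefficients exactly to zero (variable selection), while the L2 penalty only shrinks them. The data and alpha values are illustrative.

```python
# Lasso (L1 penalty) vs. Ridge (L2 penalty) on synthetic data (illustrative alphas).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso zeroes out most coefficients (variable selection); Ridge keeps them all.
print("non-zero lasso coefs:", np.sum(lasso.coef_ != 0))
print("non-zero ridge coefs:", np.sum(ridge.coef_ != 0))
```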

When not to use

  • if the variables are normally distributed and the categorical variables all have 5+ categories: use Linear Discriminant Analysis
  • if the correlations are mostly non-linear: use SVM
  • if sparsity and multicollinearity are a concern: use Adaptive Lasso with Ridge (for the initial weights) + Lasso

Linear discriminant analysis

LDA: Linear discriminant analysis, not latent Dirichlet allocation

  • requires normally distributed features
  • not good for categorical variables with few categories
  • models each class with a multivariate normal distribution (shared covariance), as sketched below
  • can compute confidence intervals (CIs)
  • suffers from multicollinearity
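
A minimal LDA sketch, assuming roughly Gaussian features; note that scikit-learn's LinearDiscriminantAnalysis is the classifier meant here, not latent Dirichlet allocation. The dataset is synthetic and only for illustration.

```python
# Linear discriminant analysis sketch (synthetic, roughly Gaussian features).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LDA fits one multivariate Gaussian per class with a shared covariance matrix.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print("LDA accuracy:", lda.score(X_test, y_test))
```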

Support Vector Machines

SVM vs LR:

  • Support Vector Machines (SVMs) use a different loss function (Hinge) from LR.
  • They are also interpreted differently (maximum-margin).
  • However, in practice, an SVM with a linear kernel is not very different from a Logistic Regression (If you are curious, you can see how Andrew Ng derives SVMs from Logistic Regression in his Coursera Machine Learning Course).
  • The main reason you would want to use an SVM instead of a Logistic Regression is that your problem might not be linearly separable. In that case, you will have to use an SVM with a non-linear kernel (e.g. RBF), as sketched at the end of this section.
  • The truth is that a Logistic Regression can also be used with a different kernel, but at that point you might be better off going for SVMs for practical reasons.
  • Another related reason to use SVMs is if you are in a very high-dimensional space. For example, SVMs have been reported to work better for text classification.
  • High accuracy, nice theoretical guarantees regarding overfitting
  • with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space.
  • Especially popular in text classification problems where very high-dimensional spaces are the norm.
  • no distribution requirement
  • uses the hinge loss
  • flexible choice of kernels for non-linear correlations
  • does not suffer from multicollinearity
  • hard to interpret

Cons:

  • can be painfully inefficient to train, so they are not recommended for problems with many training examples or for most "industry scale" applications. Anything beyond a toy/lab problem might be better approached with a different algorithm. They are also memory-intensive and kind of annoying to run and tune, which is why I think random forests are starting to steal the crown.
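
A minimal sketch of an SVM with an RBF kernel on a toy dataset that is not linearly separable. The two-moons data, and the C/gamma values, are illustrative; in practice you would tune both with a grid search.

```python
# SVM with an RBF kernel on a non-linearly-separable toy dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling matters a lot for SVMs; C and gamma would normally be tuned.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("RBF-SVM accuracy:", svm.score(X_test, y_test))
```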

Decision Tree

  • Easy to interpret and explain
  • Non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). Their main disadvantage is that they easily overfit, but that's where ensemble methods like random forests (or boosted trees) come in.
  • Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they're fast and scalable, and you don't have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
  • no distribution requirement
  • trained with a greedy heuristic
  • good for categorical variables with few categories
  • does not suffer from multicollinearity (it simply picks one of the correlated features)
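
Following up on the list above, here is a minimal single decision tree sketch; limiting the depth is one simple way to curb the overfitting mentioned earlier. The built-in dataset and the chosen depth are only illustrative.

```python
# Single decision tree (depth-limited to curb overfitting) on a built-in dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("tree accuracy:", tree.score(X_test, y_test))

# The learned rules are human-readable, which is the main selling point.
print(export_text(tree, max_depth=2))
```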

Bagging, boosting, and other ensemble methods generally outperform a single algorithm.

Tree Ensembles: Random Forests and Gradient Boosted Trees.

Tree Ensembles vs. LR:

  • they do not expect linear features or even features that interact linearly. Something I did not mention about LR is that it can hardly handle categorical (binary) features. Tree Ensembles, because they are nothing more than a bunch of Decision Trees combined, handle this very well. The other main advantage is that, because of how they are constructed (using bagging or boosting), these algorithms handle high-dimensional spaces and large numbers of training examples very well.
  • both are fast and scalable; random forests tend to beat logistic regression in terms of accuracy, but logistic regression can be updated online and gives you useful probabilities.

Random Forests

Random Forests train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree and less likely to overfit the training data. There are typically two parameters in RF: the number of trees and the number of features to consider at each node (a minimal sketch follows the list below).

  • RF is good for parallel or distributed computing.
  • Almost always have lower classification error and better f-scores than decision trees.
  • Almost always perform as well as or better than SVMs, but are far easier for humans to understand.
  • Deal really well with uneven data sets that have missing variables.
  • Give you a really good idea of which features in your data set are the most important for free.
  • Generally train faster than SVMs (though this obviously depends on your implementation).
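
Here is the minimal random forest sketch referenced above, showing the two typical parameters (number of trees, features per split) and the feature importances you get for free. The dataset and parameter values are illustrative.

```python
# Random forest: n_estimators (number of trees) and max_features (features per split).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print("RF accuracy:", rf.score(X_test, y_test))

# Feature importances come essentially for free.
top = sorted(zip(rf.feature_importances_, data.feature_names), reverse=True)[:5]
print(top)
```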

Gradient Boosted Decision Trees

GBTs build trees one at a time, where each new tree helps to correct errors made by the previously trained trees. With each tree added, the model becomes even more expressive. There are typically three parameters: the number of trees, the depth of the trees, and the learning rate; each tree built is generally shallow (a minimal sketch follows the list below).

  • prone to overfitting
  • GBDTs will usually perform better than RF, but they are harder to get right. More concretely, GBDTs have more hyper-parameters to tune and are also more prone to overfitting. RFs almost work "out of the box", which is one reason they are so popular.
  • GBDT training generally takes longer because trees are built sequentially
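
The minimal gradient-boosted trees sketch referenced above, showing the three typical knobs (number of trees, tree depth, learning rate). The values here are illustrative and would normally be tuned, often with early stopping or cross-validation.

```python
# Gradient boosted trees: shallow trees added sequentially, each correcting the last.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(
    n_estimators=200,   # number of trees
    max_depth=3,        # depth of each (shallow) tree
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
    random_state=0,
)
gbdt.fit(X_train, y_train)
print("GBDT accuracy:", gbdt.score(X_test, y_test))
```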

Neural Network

Pros

  • good at modeling non-linear data with a large number of input features
  • widely used in industry
  • many open source implementations

Cons

  • NNs are usable only for numerical inputs, vectors with a constant number of values, and datasets with no missing data.
  • The classification boundaries are hard to understand intuitively, and ANNs are computationally expensive.
  • black box: this makes them difficult to work with; it's like trying to interrogate the human unconscious for the reasons behind our conscious actions.
  • difficult to train: the training outcome can be nondeterministic and depends crucially on the choice of initial parameters
  • This makes them difficult to troubleshoot when they don't work as you expect, and when they do work, you will never really feel confident that they will generalize well to data not included in your training set because, fundamentally, you don't understand how your network is solving the problem.
  • multi-layer neural networks are usually hard to train and require tuning lots of parameters
  • Neural networks are not probabilistic, unlike their more statistical or Bayesian counterparts. A neural network might give you a continuous number as its output (e.g. a score), but translating that into a probability is often difficult. Approaches with stronger theoretical foundations usually give you those probabilities directly.
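
A minimal multi-layer perceptron sketch with scikit-learn. The architecture, iteration count, and synthetic dataset are arbitrary choices for illustration; real problems usually need more careful tuning and often a dedicated deep learning framework.

```python
# Small multi-layer perceptron; inputs must be numeric, fixed-length, and non-missing.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling helps training converge; the hidden layer sizes are arbitrary here.
nn = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
)
nn.fit(X_train, y_train)
print("MLP accuracy:", nn.score(X_test, y_test))
```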

Deep Learning

  • not a general-purpose technique for classification.
  • good at image classification, video, audio, and text.

Summary

Factors to Consider

  • Number of training examples (how large is your training set?)

    • If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN or logistic regression), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren't powerful enough to provide accurate models. You can also think of this as a generative model vs. discriminative model distinction (a rough learning-curve sketch follows this list).
  • Dimensionality of the feature space
  • Do I expect the problem to be linearly separable?
  • Are features independent?
  • Are features expected to be linearly related to the target variable?
  • Is overfitting expected to be a problem?
  • What are the system's requirements in terms of speed/performance/memory usage?
  • Does it require variables to be normally distributed?
  • Does it suffer from multicollinearity?
  • Does it do as well with categorical variables as with continuous variables?
  • Does it calculate confidence intervals without cross-validation?
  • Does it conduct variable selection without stepwise selection?
  • Does it apply to sparse data?
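
To make the small-data vs. big-data point above concrete, here is a rough learning-curve sketch comparing a high-bias Naive Bayes model with logistic regression. The synthetic dataset and training-set fractions are arbitrary, so treat the printed scores as illustrative only.

```python
# Learning curves: high-bias Naive Bayes vs. lower-bias logistic regression (toy data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, random_state=0)

for model in (GaussianNB(), LogisticRegression(max_iter=1000)):
    sizes, _, test_scores = learning_curve(
        model, X, y, train_sizes=[0.05, 0.2, 1.0], cv=5
    )
    # With little data NB often holds up well; LR tends to win as the training set grows.
    print(type(model).__name__, [round(s, 3) for s in test_scores.mean(axis=1)])
```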

Start with something simple like Logistic Regression to set a baseline and only make it more complicated if you need to. At that point, tree ensembles, and in particular Random Forests since they are easy to tune, might be the right way to go. If you feel there is still room for improvement, try GBDT or get even fancier and go for Deep Learning.