Machine Learning Algorithms Pros and Cons
TL;DR
Start with Logistic Regression, then try Tree Ensembles, and/or Neural Networks.
Occam's Razor principle: use the least complicated algorithm that can address your needs and only go for something more complicated if strictly necessary.
Neural Networks (both traditional and deep neural nets) and Gradient Boosted Decision Trees (GBDT) are widely used in industry.
Pros and Cons
This section discusses the most popular algorithms. For a full list of machine learning algorithms, check out the cheatsheet.
Naive Bayes
 super simple (just doing some counts) yet performs well in practice; see the sketch after this list.
 computes a product of independent per-feature distributions
 requires less training data
 no distribution requirements
 converges quicker than discriminative models (e.g. logistic regression) when the conditional independence assumption holds
 good for categorical variables with few categories
 suffers from multicollinearity
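A minimal sketch of the "just counting" flavor of Naive Bayes, assuming scikit-learn; the toy spam corpus and labels below are made up for illustration:

```python
# Naive Bayes for text: MultinomialNB works from word counts,
# which is why training is little more than counting.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap meds now", "meeting at noon", "win cash now", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (toy data)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["cash now"]))  # likely [1] on this toy corpus
```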
Logistic Regression
Logistic regression is probably the most widely used algorithm in practice.
 easy to interpret: the output can be interpreted as a probability, so you can use it for ranking instead of classification.
 good for cases where features are expected to be roughly linear and the problem to be linearly separable.
 most nonlinear features can easily be engineered into linear ones ("feature engineering").
 robust to noise
 can use L1 or L2 regularization to avoid overfitting (and for feature selection); see the sketch after this list
 efficient, and can be distributed (ADMM)
 no distribution requirement
 models the log-odds of the target with the logistic (sigmoid) function
 cannot handle categorical (binary) variables well
 provides confidence intervals
 suffers from multicollinearity
 unlike Naive Bayes, no need to worry about features being correlated.
 easy to update the model to take in new data (using an online gradient descent method)
 use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you're unsure, or to get confidence intervals)
 use it if you expect to receive more training data in the future and want to be able to quickly incorporate it into the model.
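A rough sketch of the points above (probabilistic output, L1/L2 regularization, online updates), assuming scikit-learn; the data and hyperparameters are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # roughly linear target

# L1 regularization drives some coefficients to zero (feature selection);
# penalty="l2" is the default and shrinks weights instead.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
print(clf.predict_proba(X[:3]))  # probabilities, usable for ranking
print(clf.coef_)                 # sparse weights under L1

# Incorporating new data quickly: online logistic regression via SGD.
online = SGDClassifier(loss="log_loss")  # logistic loss in recent scikit-learn
online.partial_fit(X, y, classes=[0, 1])
```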
Lasso (L1)
 no distribution requirement
 adds an L1 penalty to the loss
 performs variable selection (drives some coefficients to exactly zero)
 suffers from multicollinearity
Ridge (L2)
 no distribution requirement
 adds an L2 penalty to the loss
 no variable selection
 does not suffer from multicollinearity (see the sketch after this list)
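A small sketch of how the two penalties behave on nearly collinear features, assuming scikit-learn; the alpha values are arbitrary illustrations:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)  # nearly collinear with x1
X = np.column_stack([x1, x2, rng.normal(size=200)])
y = 3 * x1 + rng.normal(size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # L1 typically zeroes one of the twins
print(Ridge(alpha=1.0).fit(X, y).coef_)  # L2 spreads weight across both
```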
When NOT to use
 if the variables are normally distributed and the categorical variables all have 5+ categories: use Linear discriminant analysis
 if the correlations are mostly nonlinear: use SVM
 if sparsity and multicollinearity are a concern: Adaptive Lasso with Ridge (weights) + Lasso
Linear discriminant analysis
LDA: Linear discriminant analysis, not latent Dirichlet allocation
 requires normally distributed variables
 not good for variables with few categories
 models each class with a multivariate normal distribution (see the sketch after this list)
 provides confidence intervals
 suffers from multicollinearity
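A minimal LDA sketch, assuming scikit-learn and its built-in iris data; the predicted probabilities come from the fitted Gaussian model:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict_proba(X[:2]))  # class posteriors under the Gaussian assumption
```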
Support Vector Machines
 Support Vector Machines (SVMs) use a different loss function (Hinge) from LR.
 they are also interpreted differently (maximum margin).
 SVM with a linear kernel is similar to a Logistic Regression in practice
 if the problem is not linearly separable, use an SVM with a nonlinear kernel (e.g. RBF; see the sketch after this list). (Logistic Regression can also be used with a different kernel)
 good in a high-dimensional space (e.g. text classification).
 high accuracy
 good theoretical guarantees regarding overfitting
 no distribution requirement
 uses the hinge loss
 flexible selection of kernels for nonlinear correlation
 does not suffer from multicollinearity
 hard to interpret
Cons:
 can be inefficient to train, memory-intensive, and annoying to run and tune
 not suited to problems with very many training examples
 not suited to most "industry scale" applications (anything beyond a toy or lab problem)
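A sketch contrasting the linear and RBF kernels on data that is not linearly separable, assuming scikit-learn; C and gamma are left at illustrative defaults and would normally need tuning:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# The RBF kernel should fit the curved boundary noticeably better.
print(linear.score(X, y), rbf.score(X, y))
```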
Decision Tree
 Easy to interpret and explain
 Nonparametric, no need to worry about outliers or whether the data is linearly separable.
 no distribution requirement
 training is heuristic (greedy splits, with no guarantee of a globally optimal tree)
 good for categorical variables with few categories
 does not suffer from multicollinearity (it simply picks one of the correlated features)
 can easily overfit; prune the tree or use an ensemble (see the sketch below)
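A minimal decision-tree sketch, assuming scikit-learn; max_depth is one simple way to rein in the overfitting noted above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # the fitted tree is directly readable, hence easy to explain
```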

Tree Ensembles
 e.g. Random Forests and Gradient Boosted Trees, using bagging or boosting
 generally outperform a single decision tree.
 handle high-dimensional spaces and large numbers of training examples very well.
Random Forests
 train each tree independently, using a random sample of the data, so the trained model is more robust than a single decision tree, and less likely to overfit
 2 parameters: number of trees and number of features to be selected at each node.
 good for parallel or distributed computing.
 lower classification error and better F-scores than decision trees.
 perform as well as or better than SVMs, but far easier for humans to understand.
 good with uneven data sets and data with missing variables.
 calculates feature importance (see the sketch after this list)
 train faster than SVMs
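A minimal random-forest sketch, assuming scikit-learn; it shows the two key parameters and the built-in feature-importance estimate (the parameter values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=200,     # parameter 1: number of trees
    max_features="sqrt",  # parameter 2: features considered at each split
    n_jobs=-1,            # trees are independent, so training parallelizes well
    random_state=0,
).fit(X, y)
print(rf.feature_importances_[:5])  # per-feature importance scores
```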
Gradient Boosted Decision Trees
 builds trees one at a time; each new tree corrects some errors made by the previous trees, so the model becomes progressively more expressive.
 3 parameters: number of trees, depth of trees, and learning rate; trees are generally shallow (see the sketch after this list).
 usually perform better than Random Forests, but harder to get right. The hyperparameters are harder to tune and more prone to overfitting. RFs can almost work "out of the box".
 training takes longer since trees are built sequentially
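A minimal GBDT sketch with the three parameters named above, assuming scikit-learn; the values are illustrative and, unlike a Random Forest, would usually need tuning:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(
    n_estimators=300,   # number of trees, built one at a time
    max_depth=3,        # shallow trees, as is typical for boosting
    learning_rate=0.1,  # shrinks each new tree's contribution
).fit(X_tr, y_tr)
print(gbdt.score(X_te, y_te))
```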
Neural Network
 good for modeling nonlinear data with a large number of input features
 widely used in industry
 many open source implementations
 only for numerical inputs: fixed-length vectors with no missing values.
 "black boxy": the classification boundaries are hard to understand intuitively ("like trying to interrogate the human unconscious for the reasons behind our conscious actions").
 computationally expensive.
 the trained model depends crucially on initial parameters
 difficult to troubleshoot when they don't work as expected
 no guarantee they will generalize well to data outside the training set
 multilayer neural networks are usually hard to train, and require tuning lots of parameters
 not probabilistic, unlike their more statistical or Bayesian counterparts; the continuous output (e.g. a score) can be difficult to translate into a probability. See the sketch after this list.
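A minimal neural-network sketch using scikit-learn's MLPClassifier (an assumption; any NN library would do). It reflects the constraints above: fixed-length numeric inputs, no missing values, and results that depend on the random initialization:

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # fixed-length numeric vectors, no missing values
mlp = make_pipeline(
    StandardScaler(),  # NNs are sensitive to feature scale
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
).fit(X, y)
print(mlp.score(X, y))  # training accuracy only; says nothing about generalization
```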
Deep Learning
 not a general-purpose technique for classification.
 good for image classification, video, audio, and text.
Summary
Factors to consider

number of training examples (how large is your training set?)
 if small: high bias/low variance classifiers (e.g., Naive Bayes), less likely to overfit
 if large: low bias/high variance classifiers (e.g., kNN or logistic regression)
 dimensionality of the feature space
 is the problem linearly separable?
 are features independent?
 are the features expected to be linearly related to the target variable?
 is overfitting expected to be a problem?
 system requirement: speed, performance, memory usage
 Does it require variables to be normally distributed?
 Does it suffer from multicollinearity?
 Does it do as well with categorical variables as with continuous variables?
 Does it calculate confidence intervals without cross-validation?
 Does it perform variable selection without stepwise regression?
 Does it apply to sparse data?
Start with something simple like Logistic Regression to set a baseline, and only make things more complicated if you need to. At that point, tree ensembles, and in particular Random Forests since they are easy to tune, may be the right way to go. If you feel there is still room for improvement, try GBDT, or get even fancier and go for Deep Learning. A sketch of this workflow follows.
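A sketch of that workflow, assuming scikit-learn and one of its built-in datasets; the point is the ordering (simple baseline first), not the specific scores:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
models = [
    ("logistic regression", make_pipeline(StandardScaler(), LogisticRegression())),
    ("random forest", RandomForestClassifier(random_state=0)),
    ("gbdt", GradientBoostingClassifier(random_state=0)),
]
for name, model in models:
    # Escalate past the baseline only if the gain justifies the complexity.
    print(name, cross_val_score(model, X, y, cv=5).mean())
```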