Machine Learning Interview Questions

1. "What is X?"

The basic concepts.

Statistics questions, such as: what is an F-test?
implement logistic regression training for binary classification
explain overfitting, underfitting, bias, and variance, and how they relate
gradient descent
L1/L2 regularization
Bayes' theorem
collaborative filtering
dimensionality reduction
What is batch normalization? What benefits does it provide?
Explain naive Bayes. (What does the independence assumption mean?) Explain how to use it to build a spam filter.
Explain the ROC curve. What does the curve look like for random guessing? What is the difference between two points on the ROC curve? Explain the precision-recall curve. Explain the confusion matrix. Explain the F1 score; why do we use it?
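
For the "implement logistic regression training" item above, one possible answer is sketched here. It assumes batch gradient descent on the mean log-loss and uses only the standard library. The function names, the toy data, and the learning rate are illustrative choices, not part of the original question.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b, threshold=0.5):
    # Predict class 1 when the estimated probability reaches the threshold.
    return [
        int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold)
        for xi in X
    ]

# Toy, linearly separable data (an assumption): label is 1 when x0 + x1 > 1.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [int(x0 + x1 > 1.0) for x0, x1 in X]

w, b = train_logistic_regression(X, y, lr=1.0, epochs=2000)
accuracy = sum(p == t for p, t in zip(predict(X, w, b), y)) / len(y)
print("weights:", w, "bias:", b, "train accuracy:", accuracy)
```

In an interview it is worth mentioning what this sketch leaves out: regularization, a held-out validation set, and a convergence check instead of a fixed number of epochs.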

2. "Compare X and Y" or "Pros and Cons" of X

One step beyond the basic concepts; these need a better understanding of the topics.

bias-variance trade-off
bagging vs boosting
Difference between convex and non-convex optimization problems
why stochastic gradient descent is appropriate for distributed training
how XGBoost differs from traditional GBDT, e.g. what is special about its loss function, and why it needs to compute the second-order derivative
AdaBoost (adaptive boosting) vs gradient boosting
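
For the XGBoost question above, the role of the second-order derivative can be written out as follows. This is a sketch of the standard second-order Taylor approximation of the objective at boosting round t, with constant terms dropped. The symbols (g_i, h_i, f_t, T, w_j, gamma, lambda) follow the usual presentation and are not defined elsewhere in this list.

```latex
% g_i and h_i are the first and second derivatives of the loss
% l(y_i, \hat{y}_i^{(t-1)}) with respect to the previous prediction.
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Big] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

% Optimal weight of leaf j, where G_j and H_j are the sums of g_i and h_i
% over the samples that fall into leaf j:
w_j^{*} = -\frac{G_j}{H_j + \lambda}
```

Traditional GBDT fits each tree to the first-order gradient only. Here the Hessian sum H_j scales the leaf weight, and the regularizer appears directly in the objective.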

Check out the Versus page

3. Practical Questions

These need a deeper understanding of the topic or hands-on experience.

How do you adjust the cost parameter for the SVM regularizer?
How do you assess the quality of a clustering, especially to know when you have the right number of clusters?
How do you pick the features to use?
model: over calibration issue
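
For the clustering-quality question above, one common answer (among several, such as the elbow method) is the silhouette coefficient. The sketch below assumes Euclidean distance and pure Python (3.8+ for `math.dist`). The function name and the toy points are illustrative.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != l
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two well-separated blobs (toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]
print(silhouette_score(points, good_labels))
print(silhouette_score(points, bad_labels))
```

To choose the number of clusters, one would typically run the clustering for several candidate values of k and prefer the k with the highest mean silhouette. The score ranges from -1 to 1.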

4. Design Questions

"How would you approach ..."

Questions about real-world problems:

How would you approach the Netflix Prize?
How would you generate related searches on Bing?
How would you suggest followers on Twitter?

More Questions

Describe how a decision tree works from the viewpoint of information gain. Why might pruning help? What benefits do we get from pruning a tree?
What is a random forest? How is bagging used to build one? Does a random forest need pruning, and why?
What is the difference between sigmoid and ReLU? What are their advantages and disadvantages? (sparsity, vanishing gradients, activation blow-up, complexity)
Which optimizer did you use in your DL model? Explain Adam, momentum, and SGD.
Explain transfer learning and fine-tuning. Can you arbitrarily take out one layer from a CNN model? Why? Can you run a CNN on images of different sizes? Why?
Explain learning rate decay; why use it? Explain L2 regularization; why use it? What is the relation/difference between weight decay and L2 regularization?
Explain K-fold cross-validation. How do you use it to train your model?
Explain LR (linear regression), OLS (ordinary least squares), and PCA. What is the difference/relation between them?
Does PCA keep the directions of largest or smallest variance when we use it to compress data? Explain why. Bonus question: explain linear discriminant analysis and how it differs from PCA.
If your data is corrupted by noise, how does the noise affect your model: does it overfit or underfit? Why?
How does the value of K affect a KNN model? Does a larger K overfit or underfit? Does a smaller K overfit or underfit?
What is the major problem with training RNNs using BPTT (backpropagation through time)? Why may the gradient vanish or explode?
Illustrate the basic ideas of collaborative filtering and matrix factorization.
Compare K-means with Gaussian mixture models. What are the relations and differences?
Why use mini-batches in training? Why not just use SGD, or use all the training data in one batch when computing the gradient update? Why use momentum (which takes the history of gradients into account) with SGD?
How would you sample uniformly from a continuous stream of data? (Or: randomly pick n elements from a given array of m elements.) Hint: reservoir sampling.
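
For the stream-sampling question above, a minimal sketch of reservoir sampling (the classic "Algorithm R") could look like this. The function name and the optional `rng` parameter are illustrative choices. The stream is assumed to be any Python iterable of unknown length.

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Return n items chosen uniformly at random from an iterable.

    Uses O(n) memory and a single pass. If the stream holds fewer than
    n items, all of them are returned.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Keep the new item with probability n / (i + 1) by drawing
            # a slot index from 0..i (randint is inclusive on both ends).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1000), 10, random.Random(0))
print(sample)
```

The key argument to give in an interview: after processing i + 1 items, each item seen so far is in the reservoir with probability n / (i + 1), which follows by induction on i.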

What are the problems with feature importance in random forests and gradient boosted trees?

Feature selection based on impurity reduction is biased towards variables with more categories.
With correlated features, strong features can end up with low scores.