Updated: 2019-01-13

Drop RNN and LSTM (long short-term memory)


  • tensor: an array with any number of dimensions (including 0 dimensions, which is a scalar)
  • rank: the number of dimensions (0 for a scalar)
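As a minimal sketch of the two definitions above, assuming a tensor is represented as nested Python lists (the helper name `rank` is hypothetical and not taken from any library):

```python
def rank(tensor):
    """Return the number of dimensions of a nested-list tensor (0 for a scalar)."""
    depth = 0
    # Each level of list nesting adds one dimension.
    while isinstance(tensor, list):
        depth += 1
        tensor = tensor[0] if tensor else None
    return depth

print(rank(3.0))               # 0: a scalar
print(rank([1, 2, 3]))         # 1: a vector
print(rank([[1, 2], [3, 4]]))  # 2: a matrix
```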


Instead of being trained to predict the target value Y given inputs X, autoencoders are trained to reconstruct their own inputs X.

An autoencoder, autoassociator or Diabolo network is an artificial neural network used for learning efficient codings. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction.


Calibration is defined as the ratio between the CTR predicted by the model (modelctr) and the CTR of the actual data (truectr):

calibration = modelctr / truectr

If calibration:

  • == 1.0: the model's prediction distribution is well aligned with the true labels.
  • < 1.0: under-predicting. The model predicts fewer positive labels than the actual data contains.
  • > 1.0: over-predicting. The model predicts more positive labels than the actual data contains.
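A small sketch of the ratio above. It assumes modelctr is the mean of the predicted click probabilities and truectr is the fraction of positive labels; the notes do not spell this out, and the data below is made up for illustration.

```python
def calibration(predictions, labels):
    """Ratio of the mean predicted CTR to the observed CTR (assumed definitions)."""
    model_ctr = sum(predictions) / len(predictions)
    true_ctr = sum(labels) / len(labels)
    return model_ctr / true_ctr

preds = [0.10, 0.20, 0.05, 0.25]  # hypothetical predicted click probabilities
clicks = [0, 1, 0, 0]             # hypothetical observed clicks

# Mean prediction is 0.15 and observed CTR is 0.25, so the ratio is about 0.6:
# the model is under-predicting.
print(calibration(preds, clicks))
```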

Random Forest

One-hot Encoding

one-hot encoding/one-cold
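A minimal sketch of both encodings, assuming a fixed, ordered list of categories (the function names are hypothetical). One-hot marks the matching category with a single 1; one-cold is the inverse, a single 0 among 1s.

```python
def one_hot(value, categories):
    """Vector with 1 at the position of `value` and 0 elsewhere."""
    return [1 if c == value else 0 for c in categories]

def one_cold(value, categories):
    """Vector with 0 at the position of `value` and 1 elsewhere."""
    return [0 if c == value else 1 for c in categories]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))   # [0, 1, 0]
print(one_cold("green", colors))  # [1, 0, 1]
```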


When there are only two groups for the one-way ANOVA F-test, $F = t^2$, where $t$ is the Student's t statistic.
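The identity can be checked numerically. The sketch below computes both statistics by hand, using only the standard library and assuming the pooled (equal-variance) form of the t statistic; the sample values are made up.

```python
from statistics import mean

def pooled_t(a, b):
    """Student's two-sample t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

group_a = [5.1, 4.9, 5.6, 5.8, 6.0]  # hypothetical measurements
group_b = [6.2, 6.8, 7.1, 6.5, 7.0]

t = pooled_t(group_a, group_b)
f = anova_f([group_a, group_b])
print(t ** 2, f)  # the two values agree up to floating-point error
```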


In the bag-of-words model, the frequency of occurrence of each word is used as a feature for training a classifier, disregarding grammar and word order. It is used in natural language processing, information retrieval, and computer vision.
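A minimal sketch of the idea, assuming whitespace tokenization and lowercasing (real pipelines usually tokenize more carefully). The function name is hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Map each word to its number of occurrences, ignoring order."""
    return Counter(text.lower().split())

print(bag_of_words("the cat sat on the mat"))

# Word order is discarded, so these two sentences get the same features.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True
```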

Feature Generation

TF, IDF, PageRank, cosine distance

Statistics Test

  • A test of goodness of fit establishes whether or not an observed frequency distribution differs from a theoretical distribution.
  • A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each other (e.g. polling responses from people of different nationalities to see if one's nationality is related to the response).

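  • chi-square test: for categorical data
  • chi-square distribution: a special case of the gamma distribution
  • t-distribution: arises when estimating the mean of a normally distributed population from a small sample with unknown standard deviation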

Student's t-test assumes that the situations produce "normal" data that differ only in that the average outcome in one situation is different from the average outcome in the other situation.


A chi-squared test is used to compare binned data (e.g. a histogram) with another set of binned data or the predictions of a model binned in the same way.

A K-S test is applied to unbinned data to compare the cumulative frequency of two distributions or compare a cumulative frequency against a model prediction of a cumulative frequency.

Both chi-squared and K-S will give a probability of rejecting the null hypothesis. Artificially binning data loses information so should be avoided if possible. On the other hand the chi-squared statistic does give useful shortcuts if you are trying to model the parameters that describe a set of data and the uncertainties on those parameters. The K-S test should not really be used if there are adjustable parameters which are being optimised to fit the data.

Specific trivial example. I measure the height of 1000 people. Let's say they're all between 1.5 m and 2 m. I have a model I wish to test that says the distribution is Gaussian with a mean of 1.76 m and a dispersion (sigma) of 0.1 m.

So, how do I test whether this model well represents the data? One approach is to construct the cumulative distribution of heights and then compare it against the cumulative normal distribution described using a KS test. However, an alternative would be to put the data into say 5cm bins and then find the chi-squared statistic compared with the model. Both of these would give you a probability of rejecting the null hypothesis. For such a purpose though I would favour the K-S test, because binning the data takes away some information.

On the other hand, maybe your hypothesis is that the distribution is normal and you want to find what the mean and dispersion are. In that case you can't use the K-S test; that's not what it is for. However, you can minimise chi-squared to find the best-fitting parameters using the binned data. A caveat here would be that when dealing with frequencies, chi-squared should not be used when you have small numbers per bin (say less than 9), because Poisson statistics become important. In these instances there are alternatives like the "Cash statistic".

I suppose at some level data are always binned. But when doing the K-S test there is usually only one object in each bin!
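The height example can be sketched with the standard library alone. This computes only the two test statistics, not the probabilities; the data is simulated, the function names are hypothetical, and the chi-squared part simply ignores values that fall outside the 1.5 m to 2 m binning range, which is a simplification.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(data, mu, sigma):
    """Largest gap between the empirical CDF (unbinned) and the model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mu, sigma)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def chi_squared_statistic(data, mu, sigma, lo, hi, width):
    """Sum of (observed - expected)^2 / expected over fixed-width bins."""
    n = len(data)
    n_bins = round((hi - lo) / width)
    chi2 = 0.0
    for b in range(n_bins):
        left, right = lo + b * width, lo + (b + 1) * width
        observed = sum(1 for x in data if left <= x < right)
        expected = n * (normal_cdf(right, mu, sigma) - normal_cdf(left, mu, sigma))
        chi2 += (observed - expected) ** 2 / expected
    return chi2

random.seed(0)
heights = [random.gauss(1.76, 0.1) for _ in range(1000)]  # simulated sample

print(ks_statistic(heights, 1.76, 0.1))                           # unbinned
print(chi_squared_statistic(heights, 1.76, 0.1, 1.5, 2.0, 0.05))  # 5 cm bins
```

Turning either statistic into a probability requires the corresponding reference distribution, which this sketch leaves out.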


Linear discriminant analysis (LDA) searches for a linear combination of variables (predictors) that best separates two classes (targets).

LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities.

Latent Space

The "Latent Space" is the vector space within which the vectors that make up the topics found by LDA are found (LDA here means latent Dirichlet allocation, the topic model, not linear discriminant analysis). These topics are latent within the text - that is, they are not immediately apparent, but are found or discovered by the LDA algorithm. In the same way, the vector space within which they reside is latent, or waiting, to be populated.


Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail. If the data are multi-modal, then this may affect the sign of the skewness.


Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.

The kurtosis for a standard normal distribution is three.

Classification vs Reinforcement learning

Classification/supervised learning is essentially about learning labels. We have some examples that have already been labeled: cats vs. dogs, suspicious transactions vs legitimate ones, As vs Bs vs...Zs. From those examples, we want to learn some way to assign new, unlabeled instances, to one of those classes.

Reinforcement learning, in contrast, is fundamentally about learning how to behave. Agents learn by interacting with their environment: some combinations of states and actions eventually lead to a reward (which the agent "likes") and others do not. The reward might even be disconnected from the most recent state or action and instead depend on decisions made earlier. The goal is to learn a "policy" that describes what should be done in each state and balances learning more about the environment ("exploration", which may pay off by letting us collect more rewards later) and using what we know about the environment to maximize our current reward intake (exploitation).

Recommender System

2 Ways:

  • Collaborative Filtering: similar users (thus "collaborative") are also buying/watching/listening to these items
  • Content-based Filtering: based on the attributes/characteristics/hashtags of the items that users may be interested in

Collaborative Filtering

Based on user-item matrix, where each row is a user, each column is an item, values are ratings or likes.

Matrix Factorization (e.g. SVD): find 2 low-rank matrices whose product approximates the original matrix. Purpose: reduce dimensions from N to K, where K << N.

Random Notes

The scikit-learn documentation explains in detail that GBDT and other ensemble models tend to produce biased calibration by nature, especially for values approaching the lower bound 0 and the upper bound 1.

degree of freedom

P-values report if the numbers differ significantly

log-scale: applies only to positive data

twitter stats package:

string input columns will be one-hot encoded, and numeric columns will be cast to doubles.


regularization: it's enough to understand that $A^{-1}$ is ill-conditioned






$$Lift(A,B) = \frac{p(A,B)}{p(A)p(B)}$$


$$Leverage(A,B) = p(A,B) - p(A)p(B)$$
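Both measures can be estimated from co-occurrence counts. A minimal sketch, assuming the probabilities are estimated as fractions of transactions (the basket data and function name are made up):

```python
def lift_and_leverage(transactions, a, b):
    """Estimate lift and leverage of items a and b from a list of item sets."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if a in t) / n
    p_b = sum(1 for t in transactions if b in t) / n
    p_ab = sum(1 for t in transactions if a in t and b in t) / n
    lift = p_ab / (p_a * p_b)
    leverage = p_ab - p_a * p_b
    return lift, leverage

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]

# p(bread) = p(milk) = 0.75 and p(bread, milk) = 0.5,
# so lift is about 0.889 and leverage is -0.0625.
print(lift_and_leverage(baskets, "bread", "milk"))
```

A lift of 1 (equivalently, a leverage of 0) means the two items occur together exactly as often as independence would predict.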

the normal distribution is the distribution that maximizes entropy for a given mean and variance.

SVD: recommendation(PCA is implemented by SVD in spark)

When MLP have shared weights (eg:Convolutional nets)

MLP (multilayer perceptron): fully connected

hyperparameter optimization or model selection

Bayesian optimization is a sequential design strategy for global optimization of black-box functions

The infinite-dimensional generalization of the Dirichlet distribution is the Dirichlet process.

data collection
feature engineering/feature selection/data prep
auto decision/logging

latent feature space (also called embedding space), latent feature vectors

Knowledge transfer from a source domain (user organic engagement) to a target domain (ads) through representation.
using a bag of attributes to represent users instead of user id.

Note that GBDT is not a suitable learning algorithm to consume dense embedding vectors due to its myopic nature of looking at one coordinate at a time while embedding vectors match with each other as a whole.

deep learning

caffe2/tf MNIST

sequence modeling: Sequence to sequence problems address areas such as machine translation, where an input sequence in one language is converted into a sequence in another language.

derivative -> gradient -> gradient descent -> SGD, gradient boosting
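The first three links of that chain can be sketched in a few lines: the gradient is the vector of partial derivatives, and gradient descent repeatedly steps against it. The objective function, learning rate, and step count below are arbitrary choices for illustration.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly move against the gradient, starting from x0."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def grad_f(p):
    """Gradient of f(x, y) = (x - 3)^2 + (y + 1)^2."""
    x, y = p
    return [2 * (x - 3), 2 * (y + 1)]

minimum = gradient_descent(grad_f, [0.0, 0.0])
print(minimum)  # close to the true minimum at (3, -1)
```

SGD follows the same update rule but estimates the gradient from a random subset of the training data at each step.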

ML Glossary

Bayesian probability

Bayesian statistics is often used for inferring latent variables.

Bayesian vs Frequentist: @manisha72617183: My favorite quote, when reading up on Frequentists and Bayesians: A frequentist is a person whose long-run ambition is to be wrong 5% of the time. A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule

Latent Variables

Latent variables (from Latin: present participle of lateo (“lie hidden”), as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).

Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.

One advantage of using latent variables is that they can serve to reduce the dimensionality of data. A large number of observable variables can be aggregated in a model to represent an underlying concept, making it easier to understand the data.

Model Selection

model comparison: either a metric that compares classifier efficacy along the whole score range like area under ROC curve, or at least comparing recall at a preset precision point.


Generative Adversarial Network:


Central Moment

$${\bf Central Moment} = \mu_n = {\bf E} [(X - {\bf E}[X])^n] = \int_{-\infty}^{+\infty} (x-\mu)^n f(x)dx$$

where ${\bf E}$ is the expectation operator and $f(x)$ is the probability density function.

For random variables that have no mean, such as the Cauchy distribution, central moments are not defined.

  • First moment is the mean: $\mu = mean$
  • Zeroth central moment is 1: $\mu_0 = 1$
  • First central moment is 0: $\mu_1 = 0$
  • Second central moment is the variance: $\mu_2 = \sigma^2$, where $\sigma$ is the standard deviation
  • The 3rd and 4th central moments are used to define the standardized moments

Standardized Moment

$${\bf Standardized Moment} = \frac{\mu_k}{\sigma^k}$$

  • The first standardized moment is zero, because the first moment about the mean is zero.
  • The second standardized moment is one, because the second moment about the mean is equal to the variance (the square of the standard deviation).
  • The third standardized moment is the skewness.
  • The fourth standardized moment is the kurtosis.

$${\mu_1 \over \sigma} = {0 \over \sigma} = 0$$
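The definitions above translate directly into code for a finite sample. This sketch uses the population (divide by n) form of each moment, which is an assumption; sample estimators with bias corrections differ slightly. The data and function names are made up.

```python
def central_moment(data, n):
    """n-th central moment: the mean of (x - mean)^n."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** n for x in data) / len(data)

def standardized_moment(data, k):
    """k-th central moment divided by sigma^k."""
    sigma = central_moment(data, 2) ** 0.5
    return central_moment(data, k) / sigma ** k

data = [1, 2, 3, 4, 10]  # hypothetical sample with a long right tail

print(central_moment(data, 0))       # zeroth central moment
print(central_moment(data, 1))       # first central moment
print(central_moment(data, 2))       # variance
print(standardized_moment(data, 3))  # skewness, positive for this sample
print(standardized_moment(data, 4))  # kurtosis
```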

Laplace Distribution


$$f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$$
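The density is simple to evaluate directly. A minimal sketch (the function name and default parameters are arbitrary choices):

```python
import math

def laplace_pdf(x, mu=0.0, b=1.0):
    """Laplace density with location mu and scale b (b must be positive)."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

print(laplace_pdf(0.0))                            # 0.5, the peak 1/(2b) at x = mu
print(laplace_pdf(2.0, mu=1.0), laplace_pdf(0.0, mu=1.0))  # symmetric around mu
```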


A good binning algorithm should follow the following guidelines (ref):

  • Missing values are binned separately.
  • Each bin should contain at least 5% of observations.
  • No bins have 0 accounts for good or bad.

Binning Options:

  • each bin contains equal number of items
  • each bin contains equal number of positive items

Refined Binning

  • for continuous fields: merge bins so WOE of the bins has a monotonic trend
  • for categorical fields: merge small bins
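A sketch of equal-frequency binning followed by a weight-of-evidence (WOE) check. It assumes WOE for a bin is ln(share of goods / share of bads); the sign convention varies between references. The scores, labels, and function names are made up, and ties between values are not treated specially.

```python
import math

def equal_frequency_bins(values, n_bins):
    """Group record indices into n_bins bins of (nearly) equal size, ordered by value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = len(values) / n_bins
    return [order[round(b * size):round((b + 1) * size)] for b in range(n_bins)]

def weight_of_evidence(bins, is_bad):
    """WOE per bin, assuming WOE = ln(share of goods / share of bads)."""
    total_bad = sum(is_bad)
    total_good = len(is_bad) - total_bad
    woe = []
    for idx in bins:
        bad = sum(is_bad[i] for i in idx)
        good = len(idx) - bad
        # Undefined if a bin has 0 goods or 0 bads, which is why such bins are avoided.
        woe.append(math.log((good / total_good) / (bad / total_bad)))
    return woe

scores = [300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850]
is_bad = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

bins = equal_frequency_bins(scores, 3)
woe = weight_of_evidence(bins, is_bad)
print(woe)  # increases from bin to bin, so no merging is needed here
```

If the WOE values were not monotonic, adjacent bins would be merged until the trend is monotonic.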


Andrej Karpathy

Approaching Almost Any machine learning Problem