Last Updated: 2021-11-19


  • tensor: an array with any number of dimensions (including 0 dimensions, which is a scalar)
  • rank: the number of dimensions of a tensor (0 for a scalar)
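As a small illustration, assume a tensor is represented as nested Python lists (an assumption made only for this sketch; numerical libraries use dedicated array types). The rank can then be found by counting nesting levels. The helper name `rank` is hypothetical.

```python
def rank(tensor):
    """Return the number of dimensions of a nested-list tensor (0 for a scalar)."""
    depth = 0
    while isinstance(tensor, list):
        depth += 1
        if not tensor:
            break  # an empty list still counts as one dimension
        tensor = tensor[0]  # descend into the first element
    return depth

scalar = 3.0                        # rank 0
vector = [1.0, 2.0, 3.0]            # rank 1
matrix = [[1.0, 2.0], [3.0, 4.0]]   # rank 2
```

The sketch only inspects the first element at each level, so it assumes the nested lists are rectangular.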


Instead of being trained to predict a target value Y given inputs X, autoencoders are trained to reconstruct their own inputs X.

An autoencoder, autoassociator or Diabolo network is an artificial neural network used for learning efficient codings. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction.


Calibration is defined as the ratio between the click-through rate (CTR) predicted by the model (model_ctr) and the CTR observed in the actual data (true_ctr).

calibration = (model_ctr / true_ctr).

If calibration:

  • == 1.0: the model's average prediction matches the positive rate of the true labels.
  • < 1.0: under-predicting. The model predicts fewer positive labels than the actual data contains.
  • > 1.0: over-predicting. The model predicts more positive labels than the actual data contains.
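The ratio can be sketched in a few lines, assuming model_ctr is the mean predicted click probability and true_ctr is the mean of the 0/1 labels. The function name and the sample data are hypothetical.

```python
def calibration(predictions, labels):
    """Ratio of the mean predicted CTR to the observed CTR."""
    model_ctr = sum(predictions) / len(predictions)
    true_ctr = sum(labels) / len(labels)
    if true_ctr == 0:
        raise ValueError("true_ctr is zero; calibration is undefined")
    return model_ctr / true_ctr

predicted = [0.1, 0.2, 0.4, 0.1]   # hypothetical predicted click probabilities
clicked = [0, 0, 1, 0]             # hypothetical observed clicks
ratio = calibration(predicted, clicked)
```

Here model_ctr is 0.2 and true_ctr is 0.25, so the ratio is about 0.8 and the model is under-predicting.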

One-hot Encoding

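One-hot encoding represents a categorical value as a group of bits in which exactly one bit is 1 and all others are 0. One-cold is the inverse: exactly one bit is 0 and all others are 1. See https://en.wikipedia.org/wiki/One-hot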
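A minimal sketch of one-hot encoding for a list of categorical values, assuming the categories are simply the sorted distinct values. The function name `one_hot` is hypothetical.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector containing a single 1."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1   # mark the position of this value's category
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot(["red", "green", "red", "blue"])
```

With this input the column order is blue, green, red, so "red" becomes `[0, 0, 1]`.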


When there are only two groups for the one-way ANOVA F-test, $F = t^2$, where $t$ is the Student's t statistic.
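The identity between F and the squared t statistic can be checked numerically. The sketch below assumes the pooled-variance (equal variance) form of the two-sample t statistic and uses made-up data. Both helper names are hypothetical.

```python
from statistics import mean

def pooled_t(a, b):
    """Student's two-sample t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def anova_f(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

group_a = [4.1, 5.0, 5.6, 4.8, 5.2]   # made-up measurements
group_b = [5.9, 6.3, 5.5, 6.8, 6.1]
t = pooled_t(group_a, group_b)
f = anova_f([group_a, group_b])       # agrees with t ** 2 up to rounding
```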

Statistics Test

  • A test of goodness of fit establishes whether or not an observed frequency distribution differs from a theoretical distribution.
  • A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each other (e.g. polling responses from people of different nationalities to see if one's nationality is related to the response).
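Both tests are commonly carried out with Pearson's chi-square statistic. That choice is an assumption here, since the text above does not name the statistic. A minimal sketch that computes only the statistic and degrees of freedom, not a p-value, with hypothetical function names and made-up counts:

```python
def chi_square_gof(observed, expected):
    """Goodness of fit: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_independence(table):
    """Independence test statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# A die rolled 60 times; a fair die expects 10 per face.
gof_stat = chi_square_gof([8, 12, 9, 11, 10, 10], [10] * 6)

# Rows: two nationalities; columns: yes/no responses (made-up counts).
ind_stat, ind_dof = chi_square_independence([[20, 30], [30, 20]])
```

The statistic is then compared against a chi-square distribution with the given degrees of freedom (5 for the die example, 1 for the 2x2 table).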

chi-square distribution vs t-distribution

  • chi-square distribution: a special case of gamma distribution
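  • t-distribution: the distribution of the standardized sample mean of normally distributed data when the standard deviation is estimated from the sample; it has heavier tails than the normal distribution and approaches it as the sample size grows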

Student's t-test assumes that the situations produce "normal" data that differ only in that the average outcome in one situation is different from the average outcome in the other situation.

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) searches for a linear combination of variables (predictors) that best separates two or more classes (targets).

LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities.

Latent Dirichlet Allocation (LDA)

Groups unclassified text into a number of categories (topics). Often used in natural language processing (NLP) to find texts that are similar, i.e. topic modeling.

Latent Space

The "Latent Space" is the vector space within which the vectors that make up the topics found by LDA are found. These topics are latent within the text - that is, they are not immediately apparent, but are found or discovered by the LDA algorithm. In the same way, the vector space within which they reside is latent, or waiting, to be populated.


Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail. If the data are multi-modal, then this may affect the sign of the skewness.


Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. Data sets with high kurtosis tend to have heavy tails, or outliers; they are often described as having a distinct peak near the mean, but the measure is driven mainly by the tails. Data sets with low kurtosis tend to have light tails, or a lack of outliers, and often a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.

The kurtosis of a normal distribution is three. Excess kurtosis, defined as kurtosis minus three, is therefore zero for a normal distribution.

Classification vs Reinforcement learning

Classification / supervised learning: about learning labels. We have some examples that have already been labeled: cats vs. dogs, suspicious transactions vs legitimate ones, As vs Bs vs...Zs. From those examples, we want to learn some way to assign new, unlabeled instances, to one of those classes.

Reinforcement learning: about learning how to behave. Agents learn by interacting with their environment: some combinations of states and actions eventually lead to a reward (which the agent "likes") and others do not. The reward might even be disconnected from the most recent state or action and instead depend on decisions made earlier. The goal is to learn a "policy" that describes what should be done in each state and that balances learning more about the environment ("exploration", which may pay off by letting us collect more rewards later) against using what we already know about the environment to maximize our current reward intake ("exploitation").
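The exploration/exploitation trade-off can be sketched with an epsilon-greedy agent on a multi-armed bandit. This is a deliberately simplified, single-state setting chosen for illustration, not a full reinforcement learning algorithm with states and delayed rewards. The reward probabilities, `epsilon`, and the function name are all assumptions.

```python
import random

def epsilon_greedy_bandit(true_rates, steps=5000, epsilon=0.1, seed=0):
    """Estimate the value of each action while balancing exploration and exploitation."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    values = [0.0] * len(true_rates)   # running mean reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a random action to learn more about it
            action = rng.randrange(len(true_rates))
        else:
            # Exploitation: pick the action with the best current estimate
            action = max(range(len(values)), key=values.__getitem__)
        reward = 1.0 if rng.random() < true_rates[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

values, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
```

The learned "policy" here is simply "pick the action with the highest estimated value"; with more exploration the estimates improve, at the cost of pulling worse arms more often.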

Recommender System

Two approaches:

  • Collaborative Filtering: recommends items that similar users (thus "collaborative") are also buying, watching, or listening to
  • Content-based Filtering: recommends items based on the attributes, characteristics, or hashtags of items the user may be interested in

Collaborative Filtering

Based on a user-item matrix, where each row is a user, each column is an item, and the values are ratings or likes.
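A minimal sketch of the user-item matrix and of finding the most similar user, assuming cosine similarity as the similarity measure (one common choice, not the only one). The user names and ratings are made up.

```python
import math

# Rows are users, columns are items; 0 means "not rated" (hypothetical data).
ratings = {
    "alice": [5, 4, 0, 1],
    "bob":   [4, 5, 0, 1],
    "carol": [1, 0, 5, 4],
}

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(user):
    """Return the other user whose ratings are closest to the given user's."""
    others = (name for name in ratings if name != user)
    return max(others, key=lambda name: cosine(ratings[user], ratings[name]))

neighbor = most_similar("alice")
```

Items that the nearest neighbor rated highly, and that the target user has not rated yet, would then be candidates for recommendation.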

Matrix Factorization

Find two matrices whose product approximates the original matrix. Purpose: reduce dimensions from N to K, where K << N.
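One way to find such factors is stochastic gradient descent on the squared reconstruction error. The sketch below assumes that approach, plus a fully observed matrix and made-up hyperparameters (`k`, `epochs`, `lr`); SVD and NMF are alternative methods. All names are hypothetical.

```python
import random

def factorize(matrix, k=2, epochs=2000, lr=0.01, seed=0):
    """Learn user factors (n x k) and item factors (m x k) by gradient descent."""
    rng = random.Random(seed)
    n, m = len(matrix), len(matrix[0])
    users = [[rng.random() for _ in range(k)] for _ in range(n)]
    items = [[rng.random() for _ in range(k)] for _ in range(m)]
    for _ in range(epochs):
        for i in range(n):
            for j in range(m):
                pred = sum(users[i][f] * items[j][f] for f in range(k))
                err = matrix[i][j] - pred
                for f in range(k):
                    u, v = users[i][f], items[j][f]
                    users[i][f] += lr * err * v   # gradient step on the user factor
                    items[j][f] += lr * err * u   # gradient step on the item factor
    return users, items

def squared_error(matrix, users, items):
    """Total squared difference between the matrix and the product of the factors."""
    k = len(users[0])
    return sum(
        (matrix[i][j] - sum(users[i][f] * items[j][f] for f in range(k))) ** 2
        for i in range(len(matrix)) for j in range(len(matrix[0]))
    )

R = [[5, 4, 1, 1], [4, 5, 1, 2], [1, 1, 5, 4], [2, 1, 4, 5]]   # made-up ratings
error_before = squared_error(R, *factorize(R, epochs=0))   # random initial factors
error_after = squared_error(R, *factorize(R))              # trained factors
```

Each user and each item is now described by K = 2 numbers instead of a full row or column of the matrix.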

SVD: singular value decomposition (PCA is implemented with SVD in Spark)

NMF: non-negative matrix factorization, http://en.wikipedia.org/wiki/Non-negative_matrix_factorization

ML Glossary


Bayesian probability

https://stats.stackexchange.com/questions/31867/bayesian-vs-frequentist-interpretations-of-probability https://en.wikipedia.org/wiki/Frequentist_probability

Bayesian statistics is often used for inferring latent variables: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Bayesian vs Frequentist: @manisha72617183: My favorite quote, when reading up on Frequentists and Bayesians: "A frequentist is a person whose long-run ambition is to be wrong 5% of the time. A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."

Latent Variables

Latent variables (from Latin: present participle of lateo, “lie hidden”), as opposed to observable variables, are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).

Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.

One advantage of using latent variables is that they can serve to reduce the dimensionality of data. A large number of observable variables can be aggregated in a model to represent an underlying concept, making it easier to understand the data.

Model Selection

Model comparison: use either a metric that compares classifier efficacy along the whole score range, such as the area under the ROC curve, or at least compare recall at a preset precision point.
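Both comparison styles can be sketched in plain Python. The sketch assumes binary 0/1 labels, at least one positive and one negative example, and distinct scores within each model. The function names, scores, and the 0.75 precision target are made up for illustration.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def recall_at_precision(scores, labels, min_precision):
    """Highest recall over score thresholds whose precision is at least min_precision."""
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), reverse=True)
    best, true_pos = 0.0, 0
    for n_predicted, (_, label) in enumerate(ranked, start=1):
        true_pos += label   # everything ranked so far is predicted positive
        if true_pos / n_predicted >= min_precision:
            best = max(best, true_pos / total_pos)
    return best

labels = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]   # hypothetical scores
model_b = [0.9, 0.4, 0.8, 0.3, 0.7, 0.2]

auc_a, auc_b = roc_auc(model_a, labels), roc_auc(model_b, labels)
recall_a = recall_at_precision(model_a, labels, 0.75)
recall_b = recall_at_precision(model_b, labels, 0.75)
```

On this toy data model_a is better on both criteria: it has the higher AUC and reaches full recall at 75% precision, while model_b reaches only one third.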


Generative Adversarial Network: https://en.wikipedia.org/wiki/Generative_adversarial_network


Central Moment

$$\text{Central Moment} = \mu_n = \mathbf{E}\left[(X - \mathbf{E}[X])^n\right] = \int_{-\infty}^{+\infty} (x - \mu)^n f(x)\,dx$$

where $\mathbf{E}$ is the expectation operator and $f(x)$ is the probability density function.

For random variables that have no mean, such as the Cauchy distribution, central moments are not defined.

  • First raw moment (about zero) is the mean: $\mu = \mathbf{E}[X]$
  • Zeroth central moment is 1: $\mu_0 = 1$
  • First central moment is 0: $\mu_1 = 0$
  • Second central moment is the variance: $\mu_2 = \sigma^2$, where $\sigma$ is the standard deviation
  • The 3rd and 4th central moments are used to define the standardized moments

Standardized Moment

$$\text{Standardized Moment} = \frac{\mu_k}{\sigma^k}$$

  • The first standardized moment is zero, because the first moment about the mean is zero.
  • The second standardized moment is one, because the second moment about the mean is equal to the variance (the square of the standard deviation).
  • The third standardized moment is the skewness.
  • The fourth standardized moment is the kurtosis.

$$\frac{\mu_1}{\sigma} = \frac{0}{\sigma} = 0$$
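For a finite sample, the central and standardized moments can be estimated by replacing the expectation with an average. The sketch below assumes the simple population-style estimator (dividing by the sample size, with no bias correction). The function names and the sample are hypothetical.

```python
def central_moment(data, n):
    """n-th central moment: average of (x - mean)^n."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** n for x in data) / len(data)

def standardized_moment(data, k):
    """k-th standardized moment: mu_k / sigma^k."""
    sigma = central_moment(data, 2) ** 0.5
    return central_moment(data, k) / sigma ** k

sample = [1, 2, 3, 4, 10]                    # long right tail
skewness = standardized_moment(sample, 3)    # positive: skewed right
kurtosis = standardized_moment(sample, 4)
```

For this sample the mean is 4 and the variance is 10, and the positive skewness reflects the long right tail; a symmetric sample such as `[1, 2, 3, 4, 5]` gives a skewness of zero.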

Laplace Distribution


$$f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$$
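The density translates directly into code, with $\mu$ as the location and $b > 0$ as the scale. The function name and default arguments are assumptions made for this sketch.

```python
import math

def laplace_pdf(x, mu=0.0, b=1.0):
    """Laplace density: exp(-|x - mu| / b) / (2b)."""
    if b <= 0:
        raise ValueError("scale b must be positive")
    return math.exp(-abs(x - mu) / b) / (2 * b)

peak = laplace_pdf(1.0, mu=1.0, b=2.0)   # the density is highest at x = mu
```

At $x = \mu$ the exponent is zero, so the peak height is $1/(2b)$, and the density is symmetric around $\mu$.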


A good binning algorithm should follow these guidelines (ref):

  • Missing values are binned separately.
  • Each bin should contain at least 5% of observations.
  • No bin has zero good accounts or zero bad accounts.

Binning Options:

  • each bin contains an equal number of items
  • each bin contains an equal number of positive items
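The first option (equal number of items per bin) can be sketched as follows. The sketch assumes that leftover items go to the earliest bins when the count does not divide evenly, and it does not try to keep tied values in the same bin. The function name is hypothetical.

```python
def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins each take one additional item
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

bins = equal_frequency_bins([7, 1, 9, 3, 5, 2, 8, 4, 6, 10], 3)
```

Ten values in three bins give bin sizes of 4, 3, and 3.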

Refined Binning

  • for continuous fields: merge bins so that the weight of evidence (WOE) of the bins follows a monotonic trend
  • for categorical fields: merge small bins
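The monotonic-trend check for continuous fields can be sketched once a WOE convention is fixed. The sketch assumes WOE = ln(share of all good accounts in the bin / share of all bad accounts in the bin); some references use the reciprocal, which flips the sign but not the monotonicity. The counts and function names are hypothetical.

```python
import math

def weight_of_evidence(bins):
    """bins: list of (good_count, bad_count) per bin, in bin order."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    # A zero count in any bin would make the ratio or the logarithm undefined
    return [math.log((good / total_good) / (bad / total_bad)) for good, bad in bins]

def is_monotonic(values):
    """True if the sequence never increases or never decreases."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

counts = [(80, 20), (60, 40), (40, 60)]   # hypothetical (good, bad) counts per bin
woe = weight_of_evidence(counts)
trend_ok = is_monotonic(woe)
```

If `trend_ok` were false, adjacent bins would be merged and the WOE recomputed until the trend becomes monotonic. The logarithm also shows why no bin may have zero good or zero bad accounts.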