- tensor: an array with any number of dimensions (including 0 dimensions, i.e. a scalar)
- rank: the number of dimensions (0 for a scalar)
Instead of being trained to predict a target value Y given inputs X, autoencoders are trained to reconstruct their own inputs X.
An autoencoder, autoassociator or Diabolo network is an artificial neural network used for learning efficient codings. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction.
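The reconstruction objective can be sketched with a linear autoencoder with tied weights, which NumPy's SVD solves in closed form (the variable names are illustrative; a real autoencoder would learn nonlinear encoder/decoder networks by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 samples, 10 features
X = X - X.mean(axis=0)                  # center the data

k = 3                                   # size of the bottleneck (the "code")
# For a linear autoencoder with tied weights, the optimal encoder is
# spanned by the top-k right singular vectors of X (same subspace as PCA).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k].T                            # encoder: 10 -> 3

Z = X @ W                               # encode: compressed representation
X_hat = Z @ W.T                         # decode: reconstruct the input

mse = np.mean((X - X_hat) ** 2)         # reconstruction error to minimize
print(X.shape, Z.shape, mse)
```

The bottleneck forces the network to keep only the directions that explain the most variance, which is exactly the dimensionality-reduction use case mentioned above.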
Calibration is defined as the ratio between
- the CTR predicted by the model (model_ctr), and
- the CTR of the actual data (true_ctr):

calibration = model_ctr / true_ctr

- == 1.0: the model's prediction distribution is well aligned with the true labels.
- < 1.0: under-predicting; the model predicts fewer positive labels than the actual data contains.
- > 1.0: over-predicting; the model predicts more positive labels than the actual data contains.
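A minimal sketch of the ratio, using hypothetical predicted click probabilities and actual click labels:

```python
import numpy as np

# Toy data (hypothetical values for illustration).
predicted_ctr = np.array([0.1, 0.3, 0.2, 0.4, 0.25])  # model outputs
actual_clicks = np.array([0, 1, 0, 0, 1])             # observed labels

model_ctr = predicted_ctr.mean()        # average predicted positive rate
true_ctr = actual_clicks.mean()         # empirical positive rate
calibration = model_ctr / true_ctr

print(calibration)  # 0.625 here: < 1.0, so the model under-predicts
```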
one-hot / one-cold encoding: https://en.wikipedia.org/wiki/One-hot
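A quick NumPy sketch of both encodings (indexing the identity matrix is a common idiom; the label values are illustrative):

```python
import numpy as np

labels = np.array([0, 2, 1, 2])         # integer class ids, 3 classes
num_classes = 3

one_hot = np.eye(num_classes)[labels]   # each row has exactly one 1
one_cold = 1 - one_hot                  # complement: exactly one 0 per row

print(one_hot)
```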
When there are only two groups, the one-way ANOVA F-test reduces to F = t², where t is the Student's t statistic.
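The identity can be checked numerically by computing both statistics by hand in NumPy (random toy groups; the group sizes and means are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=12)       # group 1
b = rng.normal(0.5, 1.0, size=15)       # group 2
n1, n2 = len(a), len(b)

# Pooled two-sample t statistic
sp2 = (((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum())
sp2 /= (n1 + n2 - 2)                    # pooled variance
t = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# One-way ANOVA F statistic for the same two groups
grand = np.concatenate([a, b]).mean()
ss_between = n1 * (a.mean() - grand) ** 2 + n2 * (b.mean() - grand) ** 2
ss_within = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()
F = (ss_between / 1) / (ss_within / (n1 + n2 - 2))

print(np.isclose(F, t ** 2))  # True: F equals t squared
```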
- A test of goodness of fit establishes whether or not an observed frequency distribution differs from a theoretical distribution.
- A test of independence assesses whether paired observations on two variables, expressed in a contingency table, are independent of each other (e.g. polling responses from people of different nationalities to see if one's nationality is related to the response).
chi-square distribution vs t-distribution
- chi-square distribution: a special case of gamma distribution
- t-distribution: arises when estimating the mean of a normally distributed population from a small sample with unknown standard deviation
Student's t-test assumes that the situations produce "normal" data that differ only in that the average outcome in one situation is different from the average outcome in the other situation.
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis LDA is based upon the concept of searching for a linear combination of variables (predictors) that best separates two classes (targets)
LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities.
Latent Dirichlet Allocation (LDA)
Groups unclassified text into a number of categories; often used in natural language processing (NLP) to find texts that are similar, i.e. topic modeling.
The "Latent Space" is the vector space within which the vectors that make up the topics found by LDA are found. These topics are latent within the text - that is, they are not immediately apparent, but are found or discovered by the LDA algorithm. In the same way, the vector space within which they reside is latent, or waiting, to be populated.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail. If the data are multi-modal, then this may affect the sign of the skewness.
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.
The kurtosis for a standard normal distribution is three.
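Both quantities fall out of the standardized moments defined later in these notes; a NumPy sketch on a deliberately right-skewed sample (an exponential, whose theoretical skewness is 2 and kurtosis is 9, versus 0 and 3 for the standard normal):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=100_000)     # right-skewed sample

mu = x.mean()
sigma = x.std()

skewness = np.mean((x - mu) ** 3) / sigma ** 3   # 3rd standardized moment
kurtosis = np.mean((x - mu) ** 4) / sigma ** 4   # 4th standardized moment

# Exponential(1): skewness ~ 2, kurtosis ~ 9 (heavy right tail, peaked).
print(skewness, kurtosis)
```

Note this is the Pearson (non-excess) kurtosis; some libraries report excess kurtosis, which subtracts 3 so the normal distribution scores 0.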
Classification vs Reinforcement learning
Classification / supervised learning: about learning labels. We have some examples that have already been labeled: cats vs. dogs, suspicious transactions vs legitimate ones, As vs Bs vs...Zs. From those examples, we want to learn some way to assign new, unlabeled instances, to one of those classes.
Reinforcement learning: about learning how to behave. Agents learn by interacting with their environment: some combinations of states and actions eventually lead to a reward (which the agent "likes") and others do not. The reward might even be disconnected from the most recent state or action and instead depend on decisions made earlier. The goal is to learn a "policy" that describes what should be done in each state and balances learning more about the environment ("exploration", which may pay off by letting us collect more rewards later) and using what we know about the environment to maximize our current reward intake (exploitation).
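The exploration/exploitation balance above can be sketched with a minimal epsilon-greedy bandit agent (a toy setup with hypothetical reward values, not tied to any particular library):

```python
import numpy as np

rng = np.random.default_rng(4)
true_rewards = np.array([0.2, 0.5, 0.8])    # hidden mean reward per action
n_actions = len(true_rewards)

q = np.zeros(n_actions)                     # estimated value of each action
counts = np.zeros(n_actions)
epsilon = 0.1                               # fraction of steps spent exploring

for step in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))  # explore: random action
    else:
        action = int(np.argmax(q))             # exploit: best known action
    reward = rng.normal(true_rewards[action], 0.1)
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]  # running mean update

print(int(np.argmax(q)))  # the agent identifies the best action (2)
```

The policy here is trivial (pick the highest-valued action), but the loop shows the trade-off: occasional random actions keep discovering the environment while greedy actions harvest the current best estimate.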
- Collaborative Filtering: similar users (thus "collaborative") are also buying/watching/listening to these items
- Content-based Filtering: based on the attributes/characteristics/hashtags of the items that users may be interested in
Based on a user-item matrix, where each row is a user, each column is an item, and the values are ratings or likes.
Matrix factorization: find two matrices whose product approximates the original matrix. Purpose: reduce the dimensionality from N to K, where K << N.
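A sketch of the factorization using truncated SVD on a toy user-item matrix (the ratings are made up; production recommenders typically use ALS or SGD on the observed entries only):

```python
import numpy as np

# Toy user-item rating matrix: 4 users x 5 items (hypothetical ratings).
R = np.array([
    [5, 4, 0, 1, 1],
    [4, 5, 1, 0, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 2, 4, 5],
], dtype=float)

k = 2                                   # number of latent factors, K << N
U, s, Vt = np.linalg.svd(R, full_matrices=False)

user_factors = U[:, :k] * s[:k]         # 4 x K user matrix
item_factors = Vt[:k]                   # K x 5 item matrix

R_hat = user_factors @ item_factors     # rank-K approximation of R
print(np.linalg.norm(R - R_hat))        # reconstruction error
```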
SVD: Singular value decomposition (PCA is implemented by SVD in spark)
Bayesian statistics is often used for inferring latent variables: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Bayesian vs Frequentist: @manisha72617183: My favorite quote, when reading up on Frequentists and Bayesians: A frequentist is a person whose long-run ambition is to be wrong 5% of the time. A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule
latent variables (from Latin: present participle of lateo (“lie hidden”), as opposed to observable variables), are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured).
Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models
One advantage of using latent variables is that they can serve to reduce the dimensionality of data. A large number of observable variables can be aggregated in a model to represent an underlying concept, making it easier to understand the data.
Model comparison: use either a metric that compares classifier efficacy across the whole score range, like area under the ROC curve, or at least compare recall at a preset precision point.
Generative Adversarial Network: https://en.wikipedia.org/wiki/Generative_adversarial_network
The n-th central moment of a random variable X is μ_n = E[(X − μ)^n] = ∫ (x − μ)^n f(x) dx, where E is the expectation operator and f(x) is the probability density function.
For random variables that have no mean, such as the Cauchy distribution, central moments are not defined.
- First moment is the mean: E[X] = μ
- Zeroth central moment is 1: E[(X − μ)^0] = 1
- First central moment is 0: E[X − μ] = 0
- Second central moment is the variance: E[(X − μ)^2] = σ², where σ is the standard deviation
- The 3rd and 4th central moments are used to define the standardized moments
- The first standardized moment is zero, because the first moment about the mean is zero.
- The second standardized moment is one, because the second moment about the mean is equal to the variance (the square of the standard deviation).
- The third standardized moment is the skewness.
- The fourth standardized moment is the kurtosis.
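The identities above can be checked numerically (any distribution with a mean works; a normal sample is used here just for concreteness):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)

sigma = x.std()

def central_moment(x, n):
    """n-th sample moment about the mean."""
    return np.mean((x - x.mean()) ** n)

print(central_moment(x, 0))                          # exactly 1
print(central_moment(x, 1))                          # ~0 (floating point)
print(np.isclose(central_moment(x, 2), sigma ** 2))  # the variance
print(central_moment(x, 2) / sigma ** 2)             # 2nd standardized: 1
```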
A good binning algorithm should follow the following guidelines (ref):
- Missing values are binned separately.
- Each bin should contain at least 5% of observations.
- No bin has zero counts of either good or bad accounts.
- Each bin contains an equal number of items.
- Each bin contains an equal number of positive items.
- For continuous fields: merge bins so that the WOE (weight of evidence) of the bins has a monotonic trend.
- For categorical fields: merge small bins.
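A sketch of the per-bin WOE computation behind the monotonicity guideline (hypothetical bin counts, in the credit-scoring convention where "good" is the negative class and "bad" the positive class):

```python
import numpy as np

# Hypothetical counts after binning one continuous field into 4 bins.
goods = np.array([40, 35, 25, 15])      # good accounts per bin
bads = np.array([5, 10, 20, 30])        # bad accounts per bin

# WOE = ln( (% of goods in bin) / (% of bads in bin) ).
# This is why no bin may have zero goods or zero bads: the log blows up.
pct_good = goods / goods.sum()
pct_bad = bads / bads.sum()
woe = np.log(pct_good / pct_bad)

print(woe)  # monotonically decreasing: risk rises steadily across bins
```

A monotonic WOE trend means the binned variable has a consistent directional relationship with the target, which keeps downstream scorecard models interpretable.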