Embeddings (a.k.a. Latent Features)

Last Updated: 2021-11-19


  • An embedding is a short numeric description (representation) of an object.
  • It is constructed by a learning process.
  • Why use them? They encode information beyond what we can collect from data mining.
  • They enable knowledge transfer from a source domain to a target domain through representation (e.g. using a bag of attributes to represent users instead of a user ID).

Famous algorithms:

  • seq2seq: e.g. machine translation, where an input sequence in one language is converted into a sequence in another language.
  • word2vec: created using 2 algorithms: Continuous Bag-of-Words model (CBOW) and the Skip-Gram model.
  • doc2vec: the goal is to create a numeric representation of a document, regardless of its length.

Bag of words (BOW): loses word ordering.
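
A minimal sketch of why a bag of words loses ordering, assuming simple lowercase whitespace tokenization. The helper name bag_of_words and the two sentences are made up for illustration:

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens; positions are discarded."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings but the same words.
a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# The bags are identical, so the representation cannot tell the sentences apart.
same_bag = (a == b)
```

Here same_bag is True, because only the word counts survive, not the positions.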

These embeddings come from multiple learning processes such as matrix factorization (MF), skip-gram negative down-sampling (SKND), and Restricted Boltzmann Machines (RBMs).


To describe a web page or Facebook page:

  • non-embeddings: fans/visitors and their attributes, number of words, etc.
  • embeddings: latent features discovered in the learning process.

Use cosine similarity to compare two embeddings:

score = CosineSimilarity(Embedding(Object1), Embedding(Object2))
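
The score above can be sketched in plain Python as follows. This is an illustration only: the function name cosine_similarity, the choice to return 0.0 for zero vectors, and the two example vectors are assumptions, not part of the original notes.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings for two objects.
embedding_1 = [0.2, 0.7, 0.1]
embedding_2 = [0.1, 0.9, 0.0]
score = cosine_similarity(embedding_1, embedding_2)
```

Because the score depends only on the angle between the vectors, rescaling either embedding does not change it.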

Quoted from the TensorFlow word2vec tutorial (https://www.tensorflow.org/tutorials/word2vec):

Motivation: Why Learn Word Embeddings?

Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors of the individual raw pixel-intensities for image data, or e.g. power spectral density coefficients for audio data. For tasks like object or speech recognition we know that all the information required to successfully perform the task is encoded in the data (because humans can perform these tasks from the raw data). However, natural language processing systems traditionally treat words as discrete atomic symbols, and therefore 'cat' may be represented as Id537 and 'dog' as Id143. These encodings are arbitrary, and provide no useful information to the system regarding the relationships that may exist between the individual symbols. This means that the model can leverage very little of what it has learned about 'cats' when it is processing data about 'dogs' (such that they are both animals, four-legged, pets, etc.). Representing words as unique, discrete ids furthermore leads to data sparsity, and usually means that we may need more data in order to successfully train statistical models. Using vector representations can overcome some of these obstacles.

Count-based methods (e.g. Latent Semantic Analysis) compute the statistics of how often some word co-occurs with its neighbor words in a large text corpus, and then map these count-statistics down to a small, dense vector for each word. Predictive models (e.g. neural probabilistic language models) directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model). Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text.

word2vec comes in 2 flavors:

Continuous Bag-of-Words model (CBOW): CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'). CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets.

Skip-Gram model: Skip-gram does the inverse and predicts source context words from the target words. Skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.
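
The difference between the two flavors shows up in how training examples are built from a sentence. The sketch below is illustrative only: it assumes a symmetric window of 2 words on each side (the example above uses the full left context), and the function names are made up here. It generates examples but does not train a model.

```python
def cbow_examples(tokens, window=2):
    """One (context_words, target) example per position: the whole context is one observation."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """One (target, context_word) example per pair: each pair is its own observation."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the cat sits on the mat".split()
cbow = cbow_examples(tokens)          # 6 examples, one per word
skipgram = skipgram_examples(tokens)  # 18 examples, one per (target, context) pair
```

With the same sentence, skip-gram yields three times as many observations as CBOW, which is one way to see why CBOW smooths more and skip-gram benefits from larger datasets.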

Word embeddings: representations of words as points in a high-dimensional space.

Distributional hypothesis: words that are used and occur in the same contexts tend to purport similar meanings. The distributional hypothesis is the basis for statistical semantics.

Recent work in neural machine translation has explored the idea of jointly training encoder and decoder recurrent neural nets such that each phrase is mapped to a point in a latent vector space representing both syntactic and semantic structure. That is, the encoder is a function f: N -> L that maps a phrase in the native language space (N) to a latent vector space (L). The decoder is a function g: L -> T that maps the vector in the latent vector space (L) to a phrase in the target translated language (T). The two nets are jointly trained such that, given ground truth phrases n (native phrase) and t (translated phrase), g(f(n)) produces the correct output, t. Thus, the output of the encoder function f(n) is a dense vector space representation of the given phrase, which is then used to generate an equivalent phrase in another language. It stands to reason that we can use this representation to find similar phrases in our corpus.

Words that mean the same thing appear very close to each other in the embedding space.

In this way, we can perform a k-Nearest Neighbors operation on the space.
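
A minimal sketch of such a k-Nearest Neighbors lookup, assuming brute-force search with cosine similarity as the closeness measure. The toy 3-dimensional vectors are invented for illustration and do not come from a trained model; a real system would more likely use an approximate nearest-neighbor index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_neighbors(query, embeddings, k=2):
    """Return the k keys whose vectors are most similar to the query's vector."""
    scored = [
        (cosine(embeddings[query], vec), key)
        for key, vec in embeddings.items()
        if key != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Hypothetical embeddings, made up for this example.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.95, 0.05, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
neighbors = nearest_neighbors("cat", toy_embeddings, k=2)
```

With these toy vectors, "kitten" and "dog" rank ahead of "car" as neighbors of "cat".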

Consider a phrase to be a sequence of word embeddings P = {w_1, w_2, …, w_n}, where w_i is the word embedding vector for the word at position i. We need some function f(P) to encode a variable-length sequence of word embeddings into a space where we can easily compute pairwise similarities.
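
One simple candidate for f(P) is to average the word vectors element-wise (mean pooling). The notes do not say which f to use, so this is an assumed stand-in, and the name mean_pool and the example vectors are hypothetical. It maps a phrase of any length to a fixed-size vector, though like a bag of words it ignores word order, which a learned encoder would not.

```python
def mean_pool(phrase):
    """Average a non-empty list of equal-length word vectors into one vector."""
    if not phrase:
        raise ValueError("phrase must contain at least one word embedding")
    dim = len(phrase[0])
    if any(len(w) != dim for w in phrase):
        raise ValueError("all word embeddings must have the same dimension")
    n = len(phrase)
    return [sum(w[d] for w in phrase) / n for d in range(dim)]


# Hypothetical 2-dimensional word embeddings for phrases of different lengths.
short_phrase = [[1.0, 2.0], [3.0, 4.0]]
long_phrase = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]

f_short = mean_pool(short_phrase)  # [2.0, 3.0]
f_long = mean_pool(long_phrase)    # [1.0, 1.0]
```

Both outputs have the same dimension, so pairwise similarities between phrases of different lengths can be computed directly.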

Note that GBDT is not a suitable learning algorithm to consume dense embedding vectors due to its myopic nature of looking at one coordinate at a time while embedding vectors match with each other as a whole.