
# Feature Engineering

Updated: 2021-11-19

## Dense vs Sparse vs Rich Features

• Dense features: float values. "Dense" because every record has them.
• Sparse features: categorical values encoded as integers, or a very long 0/1 vector that is mostly 0s, e.g. ids or keywords in text.
• Rich features: embeddings.
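A minimal sketch of the three feature kinds on one record (the field names and sizes here are made-up illustrations, not from any real schema):

```python
# Hypothetical record showing dense, sparse, and rich features side by side.
record = {
    # dense: float values, present on every record
    "age_days": 1234.0,
    "ctr_7d": 0.042,
    # sparse: categorical ids; shorthand for a huge, mostly-zero 0/1 vector
    "page_ids": [102, 88710, 4031],
    # rich: a learned embedding vector
    "user_embedding": [0.12, -0.53, 0.08, 0.91],
}

# The sparse ids above are equivalent to a long indicator vector:
vocab_size = 100_000
indicator = [0] * vocab_size
for i in record["page_ids"]:
    indicator[i] = 1
# only 3 of the 100,000 entries are nonzero, hence "sparse"
```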

## Sparse Features

Sparse features are captured through embedding matrices with some form of pooling (e.g. SumPooling), which usually doesn't require a lot of compute but does require a lot of memory.

Pooling: an embedding matrix of ~10,000,000 x 64 floating-point numbers is looked up by a record's sparse ids, and the resulting rows are pooled into a single 64-dimensional floating-point vector.

All dense features go through multiple fully-connected layers and are fused with the sparse features at some point.
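The lookup-then-pool step above can be sketched in a few lines of numpy (the table size here is shrunk from ~10M rows to keep the example small; the shapes are assumptions for illustration):

```python
import numpy as np

# Toy embedding table: vocab_size x dim (real tables can be ~10,000,000 x 64).
vocab_size, dim = 1000, 64
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, dim)).astype(np.float32)

def sum_pool(sparse_ids):
    """Look up each id's embedding row and sum them into one dim-vector.

    This is SumPooling: cheap compute (a gather and a sum), but the
    embedding table itself dominates memory.
    """
    return embedding_table[sparse_ids].sum(axis=0)

pooled = sum_pool([3, 17, 256])  # three sparse ids for one record
# pooled.shape == (64,): the variable-length id list became a fixed vector
```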

Sparse features to embeddings: this is a shift from feature engineering to feature learning. Instead of manually creating dense features, use the raw data and "learn" the latent features. The goal is to replicate the success of CNN/RNN on images/speech/text, but for event prediction and personalization.

embedding => latent features

Learning to represent low-level data is one of the most powerful ideas from Deep Learning.

## Decay vs Static Features

• halflife: how fast the value decays. For example, if halflife = 60, the daily decay factor is exp(-ln 2 / 60) ≈ 0.988. https://en.wikipedia.org/wiki/Exponential_decay
• weights: the weight of each action when computing the daily value.
• cap: or cutoff, the maximum value of the sum of the daily actions.
v_today = v_yesterday * decay + min(count_today, cutoff)

e.g. halflife = 60 (decay = 0.988) and no cap (cutoff = infinity):

observation_today = observation_yesterday * 0.988 + some_feature_today

A static feature can be regarded as a decay feature whose halflife and daily cap are both infinite.
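The update rule above can be written as a small helper; this is a minimal sketch of the recurrence, with the function name and signature being illustrative assumptions:

```python
import math

def decayed_counter(daily_counts, halflife=60.0, cap=float("inf")):
    """Roll daily action counts into one decayed value.

    Implements: v_today = v_yesterday * decay + min(count_today, cap)
    where decay = exp(-ln(2) / halflife)  (≈ 0.988 for halflife = 60).
    """
    decay = math.exp(-math.log(2) / halflife)
    v = 0.0
    for count in daily_counts:
        v = v * decay + min(count, cap)
    return v

# With halflife = infinity the decay factor is exp(0) = 1, and with
# cap = infinity the min() is a no-op, so the value reduces to a plain
# running sum -- the "static feature" special case described above.
```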

## Static vs Dynamic Features

• Static features: the feature value does not change much with time.
• Dynamic features: the feature value significantly depends on time.

## Usage features / counter-based features / velocity

• Events/Interactions log: how people interact with the product.
• Events/Interactions log aggregator: system that computes counters - how many interactions happened for certain periods of time for certain contexts.
• Storage: place where all the data is stored; a cache or flash-based storage for quick lookup.
• Feature extractor: code that retrieves counters from storage and computes the actual features.
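The aggregator/storage/extractor split above can be sketched with an in-memory stand-in (class and method names here are hypothetical; a real system would back this with a cache or flash store):

```python
from collections import defaultdict

class CounterStore:
    """Toy stand-in for the counter storage + aggregator + extractor."""

    def __init__(self):
        # (entity_id, event_type, day) -> count
        self.counters = defaultdict(int)

    def record_event(self, entity_id, event_type, day):
        # Aggregator: bump the counter for this context and period.
        self.counters[(entity_id, event_type, day)] += 1

    def get_count(self, entity_id, event_type, days):
        # Feature extractor: read counters back and aggregate over a window.
        return sum(self.counters[(entity_id, event_type, d)] for d in days)

store = CounterStore()
store.record_event("u1", "click", "2021-11-18")
store.record_event("u1", "click", "2021-11-19")
clicks_2d = store.get_count("u1", "click", ["2021-11-18", "2021-11-19"])
```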

## How to generate the same features in both online and offline systems

Define offline and online:

• Offline: training
• Online: inference

### Option 1

Compute features offline for training models, then re-implement the same logic online for scoring.

• Con: hard to match offline and online; any discrepancies between offline and online data sources can create unexpected differences in the model output.
• Pro: all of our data is available offline.

### Option 2

• Pro: log online data to offline storage, so there is no discrepancy.
• Con: each new feature idea must be deployed into production, and we must wait for the data to collect before we can determine whether the feature is useful.

### Option 3

• Log the raw data online.
• Compute features by offline simulation over the raw log.

## Remove Extreme Values

• winsorized mean: replace values above p99 with p99 and values below p1 with p1, then take the mean.
• cutoff by standard deviation: e.g. drop values beyond +/-4 standard deviations from the mean (i.e. |z-score| >= 4).
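Both techniques are a few lines of numpy; a minimal sketch on synthetic data (the data itself is made up for illustration):

```python
import numpy as np

# Synthetic sample: 1000 standard-normal values plus two extreme outliers.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0, 1, 1000), [50.0, -50.0]])

# Winsorize: clamp everything below p1 / above p99 to those percentiles.
p1, p99 = np.percentile(values, [1, 99])
winsorized = np.clip(values, p1, p99)

# Std-dev cutoff: drop values with |z-score| >= 4.
z = (values - values.mean()) / values.std()
kept = values[np.abs(z) < 4]
# the +/-50 outliers are clamped by winsorizing and dropped by the z cutoff
```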

## Text Features

TF, IDF, PageRank, cosine distance.

bag-of-words: the frequency of occurrence of each word is used as a feature for training a classifier, disregarding grammar and word order. Used in natural language processing, information retrieval, and computer vision.
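A minimal bag-of-words plus TF-IDF sketch on a toy corpus (the corpus and the whitespace tokenizer are illustrative assumptions; real pipelines use proper tokenizers or libraries such as scikit-learn's TfidfVectorizer):

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]
tokenized = [d.split() for d in docs]

# Bag-of-words: per-document term frequencies, word order ignored.
bows = [Counter(tokens) for tokens in tokenized]

# IDF: terms appearing in fewer documents get a higher weight.
n_docs = len(docs)
df = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log(n_docs / count) for term, count in df.items()}

# TF-IDF: term frequency times inverse document frequency.
tfidf = [{term: tf * idf[term] for term, tf in bow.items()} for bow in bows]
# "the" appears in every document, so its idf -- and tf-idf -- is 0.
```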