Feature Engineering
Dense vs Sparse vs Rich Features
- Dense features: float values; called "dense" because every record has them.
- Sparse features: categorical values as integers, or a very long 0/1 vector with mostly 0s. e.g. ids, keywords in text.
- Rich features: embeddings.
Sparse Features
Sparse features are captured through embedding matrices with some form of pooling (e.g. SumPooling), which usually doesn't require a lot of compute but does require a lot of memory.
Pooling: an embedding matrix of ~10,000,000 x 64 floating-point numbers gets reduced to a 64-dimensional tensor of floating-point numbers after pooling.
All dense features go through multiple fully-connected layers and are fused with the pooled sparse features at some point.
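A minimal sketch of this setup, assuming PyTorch; the table size, layer widths, and dense feature dimension are illustrative, not taken from the notes above:

import torch
import torch.nn as nn

class SparseDenseNet(nn.Module):
    def __init__(self, num_ids=1_000_000, emb_dim=64, dense_dim=16):
        super().__init__()
        # embedding table (in production this can be ~10,000,000 x 64);
        # EmbeddingBag with mode="sum" performs the SumPooling over each record's ids
        self.sparse_emb = nn.EmbeddingBag(num_ids, emb_dim, mode="sum")
        # dense features go through fully-connected layers
        self.dense_mlp = nn.Sequential(nn.Linear(dense_dim, 64), nn.ReLU())
        # fusion of pooled sparse embeddings with the dense representation
        self.head = nn.Sequential(nn.Linear(emb_dim + 64, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, ids, offsets, dense_x):
        pooled = self.sparse_emb(ids, offsets)   # (batch, 64) after pooling
        dense_h = self.dense_mlp(dense_x)        # (batch, 64)
        return self.head(torch.cat([pooled, dense_h], dim=1))

# usage: two records, the first with 3 sparse ids and the second with 1
model = SparseDenseNet()
ids = torch.tensor([3, 17, 42, 7])
offsets = torch.tensor([0, 3])
scores = model(ids, offsets, torch.randn(2, 16))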
Sparse features to embeddings: a shift from feature engineering to feature learning. Instead of manually creating dense features, use the raw data and "learn" the latent features. The goal is to replicate the success of CNNs/RNNs on images/speech/text, but for event prediction and personalization.
embedding => latent features
Learning to represent low-level data is one of the most powerful ideas from Deep Learning
Decay vs Static Features
- halflife: how fast the value decays. For example, if halflife = 60, then the daily decay factor is exp(-ln 2 / 60) ≈ 0.988. https://en.wikipedia.org/wiki/Exponential_decay
- weights: the weight of each action for the daily value.
- cap: or cutoff, the maximum value of the sum of the daily actions.
v_today = v_yesterday * decay + min(count_today, cutoff)
e.g. halflife = 60 (decay = 0.988) and no cap (cutoff = infinity):
observation_today = observation_yesterday * 0.988 + some_feature_today
A static feature can be regarded as an online feature whose halflife and daily cap are both infinite.
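A minimal sketch of this daily update rule in Python (function and variable names are illustrative):

import math

def decayed_update(v_yesterday, count_today, halflife=60.0, cutoff=float("inf")):
    # v_today = v_yesterday * decay + min(count_today, cutoff)
    decay = math.exp(-math.log(2) / halflife)   # halflife = 60 -> decay ~= 0.988
    return v_yesterday * decay + min(count_today, cutoff)

# a static feature is the limiting case: halflife = inf gives decay = 1.0,
# and cutoff = inf leaves the daily counts uncapped
v = 0.0
for daily_count in [3, 0, 5, 2]:
    v = decayed_update(v, daily_count)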
static vs dynamic
- Static features: the feature value does not change much with time.
- Dynamic features: the feature value significantly depends on time.
Usage features / counter-based features / velocity
- Events/Interactions log: how people interact with the product.
- Events/Interactions log aggregator: system that computes counters - how many interactions happened for certain periods of time for certain contexts.
- Storage: place where all the data is stored, typically a cache or flash-based storage for quick lookup.
- Feature extractor: code that retrieves counters from storage and computes the actual features.
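A minimal sketch of an aggregator plus feature extractor, assuming a simple in-memory dict as the counter storage (the event fields and feature names are made up for illustration):

from collections import defaultdict

counters = defaultdict(int)   # storage: (entity, event_type, day) -> count

def aggregate(event_log):
    # events/interactions log aggregator: count interactions per context and period
    for e in event_log:
        counters[(e["user_id"], e["type"], e["day"])] += 1

def extract_features(user_id, day):
    # feature extractor: retrieve counters from storage and compute the actual features
    clicks = counters[(user_id, "click", day)]
    views = counters[(user_id, "view", day)]
    return {"clicks_1d": clicks,
            "views_1d": views,
            "ctr_1d": clicks / views if views else 0.0}   # velocity-style ratio

aggregate([{"user_id": 1, "type": "view", "day": "2024-01-01"},
           {"user_id": 1, "type": "click", "day": "2024-01-01"}])
print(extract_features(1, "2024-01-01"))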
How to generate the same features in both online and offline systems
Define offline and online:
- Offline: training
- Online: inference
Option 1
- compute features offline for training models
- re-implement the same logic online for scoring
- hard to match offline and online: any discrepancies between offline and online data sources can create unexpected differences in the model output
- all of our data is available offline
Option 2
- log online data to the offline store, so there is no discrepancy
- cons: need to deploy each new feature idea into production and wait for the data to accumulate before we can determine whether the feature is useful.
Option 3
- log raw data
- offline simulation: recompute the features offline from the raw logs using the same logic as online
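A minimal sketch of Option 3's appeal, assuming raw events are logged: the feature logic lives in one function, the offline simulation replays the raw logs through it, and the online scorer calls the same function, so the two paths cannot drift apart. Names and event fields are illustrative:

def compute_features(raw_events):
    # single feature definition shared by offline simulation and online scoring
    clicks = sum(1 for e in raw_events if e["type"] == "click")
    views = sum(1 for e in raw_events if e["type"] == "view")
    return {"clicks": clicks, "ctr": clicks / views if views else 0.0}

# offline: replay logged raw data through the same code to build training features
offline_raw_logs = [[{"type": "view"}, {"type": "click"}], [{"type": "view"}]]
training_rows = [compute_features(events) for events in offline_raw_logs]

# online: the scorer would call the identical function on the live raw events
# features = compute_features(request_raw_events)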
Remove Extreme Values
- winsorized mean: replace values above the 99th percentile with p99 and values below the 1st percentile with p1.
- cutoff by standard deviation: e.g. drop values beyond +/- 4 standard deviations from the mean (i.e. |z-score| >= 4).
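A minimal sketch of both treatments with NumPy; the percentile and z-score thresholds match the notes above:

import numpy as np

def remove_extremes(values, z_cutoff=4.0):
    x = np.asarray(values, dtype=float)
    # winsorize: replace values above p99 with p99 and below p1 with p1
    p1, p99 = np.percentile(x, [1, 99])
    winsorized = np.clip(x, p1, p99)
    # standard-deviation cutoff: drop values with |z-score| >= z_cutoff
    z = (x - x.mean()) / x.std()
    kept = x[np.abs(z) < z_cutoff]
    return winsorized, kept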
Text Features
TF, IDF, PageRank, cosine distance
bag-of-words: the frequency of occurrence of each word is used as a feature for training a classifier, disregarding grammar and word order. Used in natural language processing, information retrieval, and computer vision.
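A minimal sketch of bag-of-words and TF-IDF features plus cosine distance, assuming scikit-learn is available; the toy documents are illustrative:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "the dog sat on the log"]

# bag-of-words: raw term counts per document, grammar and word order ignored
bow = CountVectorizer()
counts = bow.fit_transform(docs)          # sparse (n_docs, vocab_size) matrix

# TF-IDF: term frequency down-weighted by how common the term is across documents
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

# cosine distance between documents = 1 - cosine similarity of their TF-IDF vectors
print(1 - cosine_similarity(weights))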