Feature Engineering

Dense vs Sparse vs Rich Features

Dense features: floating-point values. "Dense" means every record has a value for them.
Sparse features: categorical values encoded as integers, or a very long 0/1 vector that is mostly 0s, e.g. ids or keywords in text.
Rich features: embeddings.

Sparse Features

Sparse features are captured through embedding matrices with some form of pooling (e.g. SumPooling). This usually doesn't require much compute, but it does require a lot of memory.

Pooling: the embedding matrix holds ~10,000,000 x 64 floating-point numbers. The rows for the ids present in a record are looked up and pooled into a single 64-dimensional tensor of floating-point numbers.
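To make the lookup-then-pool step concrete, here is a minimal sketch of sum pooling in plain Python. The table is shrunk to 1,000 rows so it runs instantly, and the names make_embedding_table and sum_pool are hypothetical helpers, not part of any library. In a real model the table would be a learned parameter rather than random numbers.

```python
import random

def make_embedding_table(num_rows, dim, seed=0):
    """Build a small random embedding table as a list of rows (illustrative only)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def sum_pool(table, ids):
    """Look up the rows for the active ids and add them element-wise."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for i in ids:
        for d, value in enumerate(table[i]):
            pooled[d] += value
    return pooled

# A toy table: 1,000 rows instead of ~10,000,000, but the same 64 columns.
table = make_embedding_table(num_rows=1000, dim=64)
pooled = sum_pool(table, ids=[3, 17, 256])
print(len(pooled))  # 64
```

The compute cost is proportional to the handful of active ids, while the memory cost is proportional to the full table, which matches the "little compute, lots of memory" trade-off noted above.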

All dense features go through multiple fully-connected layers and are fused with the sparse features at some point.
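A rough sketch of that data flow, assuming the fusion is a plain concatenation (the notes only say the two are fused "at some point"; other choices such as dot-product interactions are equally possible). The helpers linear, relu and fuse are hypothetical names, and the weights are hand-picked rather than learned.

```python
def linear(x, weights, bias):
    """One fully-connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in v]

def fuse(dense_features, pooled_sparse, layers):
    """Run dense features through the layers, then concatenate with pooled sparse features."""
    hidden = dense_features
    for weights, bias in layers:
        hidden = relu(linear(hidden, weights, bias))
    # Assumption: fusion by concatenation.
    return hidden + pooled_sparse

# One toy layer mapping 2 dense inputs to 2 hidden units.
layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])]
fused = fuse([2.0, 1.0], [0.1, 0.2], layers)
print(fused)  # [1.0, 1.5, 0.1, 0.2]
```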

Sparse features to embeddings: a move from feature engineering to feature learning. Instead of manually creating dense features, use the raw data and "learn" the latent features. The goal is to replicate the success of CNNs/RNNs on images/speech/text, but for event prediction and personalization.

embedding => latent features

Learning to represent low-level data is one of the most powerful ideas from Deep Learning

Decay vs Static Features

halflife: how fast the value decays, i.e. the number of days after which a past contribution counts for half. For example, if halflife = 60, then the daily decay is exp(-ln(2) / 60) = 0.988. https://en.wikipedia.org/wiki/Exponential_decay
weights: the weight of each action in the daily value
cap: or cutoff, the maximum value of the sum of the daily actions.

v_today = v_yesterday * decay + min(count_today, cutoff)

e.g. halflife = 60 (decay = 0.988) and no cap (cutoff = infinity):

observation_today = observation_yesterday * 0.988 + some_feature_today

A static feature can be regarded as an online feature whose halflife and daily cap are both infinite (decay = 1, so nothing is forgotten and nothing is capped).
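The update rule above can be sketched in a few lines of Python. The function names decay_from_halflife and update are hypothetical, and the halflife is assumed to be measured in days with one update per day.

```python
import math

def decay_from_halflife(halflife_days):
    """Daily decay factor such that a value halves after `halflife_days` updates."""
    return math.exp(-math.log(2) / halflife_days)

def update(v_yesterday, count_today, halflife_days, cutoff=math.inf):
    """v_today = v_yesterday * decay + min(count_today, cutoff)."""
    return v_yesterday * decay_from_halflife(halflife_days) + min(count_today, cutoff)

decay = decay_from_halflife(60)
print(round(decay, 3))  # 0.988

# Decayed feature with a daily cap of 5: today's 100 actions count as 5.
capped = update(v_yesterday=0.0, count_today=100, halflife_days=60, cutoff=5)

# Static case: infinite halflife gives decay = 1, so the value is a plain running sum.
static = update(v_yesterday=10.0, count_today=3, halflife_days=math.inf)
```

Applying the decay 60 times in a row multiplies the old value by 0.5, which is what "halflife = 60" means.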

static vs dynamic

Static features: the feature value does not change much with time.
Dynamic features: the feature value depends significantly on time.

Usage features / counter-based features / velocity

Events/Interactions log: how people interact with the product.
Events/Interactions log aggregator: system that computes counters - how many interactions happened for certain periods of time for certain contexts.
Storage: place where all the data is stored, e.g. a cache or flash-based storage, for quick lookup.
Feature extractor: code that retrieves counters from storage and computes the actual features.
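The four components above can be sketched end to end as follows. This is a toy illustration under several assumptions: events are (user_id, action, day) tuples, the storage is an in-memory dict, the windows are 7 and 28 days, and the "velocity" feature is defined here as the ratio of short-window to long-window counts. None of these choices come from the notes.

```python
from collections import defaultdict

def aggregate(events, windows_days, now_day):
    """Aggregator: count events per (user, action, window) for each look-back window."""
    counters = defaultdict(int)
    for user_id, action, day in events:
        age = now_day - day
        for window in windows_days:
            if 0 <= age < window:
                counters[(user_id, action, window)] += 1
    return counters

def extract_features(storage, user_id):
    """Feature extractor: read counters from storage and derive the actual features."""
    clicks_7d = storage.get((user_id, "click", 7), 0)
    clicks_28d = storage.get((user_id, "click", 28), 0)
    # Hypothetical velocity feature: share of recent activity in the longer window.
    velocity = clicks_7d / clicks_28d if clicks_28d else 0.0
    return {"clicks_7d": clicks_7d, "clicks_28d": clicks_28d, "click_velocity": velocity}

# Events log: (user_id, action, day)
events = [
    ("u1", "click", 10), ("u1", "click", 9), ("u1", "click", 1),
    ("u1", "view", 10), ("u2", "click", 10),
]
storage = dict(aggregate(events, windows_days=(7, 28), now_day=10))
features = extract_features(storage, "u1")
print(features["clicks_7d"], features["clicks_28d"])  # 2 3
```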

How to generate the same features in both online and offline systems

Define offline and online:

Offline: training
Online: inference

Option 1

Compute features offline to train models.

Re-implement the same logic online to score.

cons: hard to match offline and online. Any discrepancies between offline and online data sources can create unexpected differences in the model output.
pros: all of our data is available offline.

Option 2

Log online data to offline: no discrepancy.
cons: need to deploy each idea for a new feature into production and wait for the data to collect before we can determine if a feature is useful.

Option 3

log raw data
offline simulation
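One possible reading of Option 3, sketched below: raw events are logged once, and a single feature function is replayed over the log offline for any past point in time. Because the same function runs online and offline, new feature ideas can be tried on historical data without waiting for new logs. The event schema and the names compute_features and simulate_offline are assumptions made for illustration.

```python
def compute_features(raw_events, as_of):
    """Compute features from raw events that happened strictly before `as_of`."""
    past = [e for e in raw_events if e["ts"] < as_of]
    clicks = sum(1 for e in past if e["type"] == "click")
    return {"n_events": len(past), "n_clicks": clicks}

def simulate_offline(raw_log, scoring_times):
    """Replay the raw log and rebuild the features as they would have looked at each time."""
    return [compute_features(raw_log, t) for t in scoring_times]

raw_log = [
    {"ts": 1, "type": "view"},
    {"ts": 2, "type": "click"},
    {"ts": 5, "type": "click"},
]

# Offline: rebuild training rows for two past scoring times.
training_rows = simulate_offline(raw_log, scoring_times=[3, 6])

# Online: at time 3 only the first two events have been seen.
online_row = compute_features(raw_log[:2], as_of=3)
print(online_row == training_rows[0])  # True
```

Filtering on `ts < as_of` is what keeps the offline replay from leaking future events into a training row.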

Remove Extreme Values

winsorized mean: replace high and low extremes with p99 and p1 respectively, then take the mean.
cutoff by standard deviation: e.g. cut off values beyond +/-4 standard deviations from the mean (i.e. |z-score| > 4).
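Both techniques can be sketched with the standard library. Two assumptions are made here: percentiles use the simple nearest-rank definition (libraries often interpolate instead), and "cutoff" is interpreted as clipping values to the +/-4 standard deviation bounds rather than dropping them.

```python
import statistics

def nearest_rank(sorted_values, p):
    """Nearest-rank percentile for an integer p in [0, 100]."""
    n = len(sorted_values)
    rank = (p * n + 99) // 100  # integer ceil(p * n / 100)
    return sorted_values[max(0, rank - 1)]

def winsorize(values, low_p=1, high_p=99):
    """Replace values below p1 with p1 and values above p99 with p99."""
    ordered = sorted(values)
    low, high = nearest_rank(ordered, low_p), nearest_rank(ordered, high_p)
    return [min(max(v, low), high) for v in values]

def clip_by_std(values, n_std=4.0):
    """Clip values to mean +/- n_std population standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    low, high = mean - n_std * std, mean + n_std * std
    return [min(max(v, low), high) for v in values]

data = list(range(1, 100)) + [10_000]  # 1..99 plus one extreme value
winsorized = winsorize(data)
print(max(winsorized))  # 99
print(statistics.mean(winsorized))  # winsorized mean

spiky = [0.0] * 99 + [1000.0]
clipped = clip_by_std(spiky)
```

Note that the mean and standard deviation used for clipping are themselves inflated by the outlier, so the percentile-based approach is usually the more robust of the two.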

Text Features

TF (term frequency), IDF (inverse document frequency), PageRank, cosine distance

bag-of-words: the frequency of occurrence of each word is used as a feature for training a classifier, disregarding grammar and word order. Used in natural language processing, information retrieval, and computer vision.
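A small sketch tying bag-of-words, TF, IDF and cosine similarity together (cosine distance is 1 minus this similarity). It assumes whitespace tokenization and the unsmoothed IDF formula log(N / df); real systems typically use better tokenizers and smoothed variants. All function names are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Term frequencies (TF): word counts, ignoring grammar and word order."""
    return Counter(text.lower().split())

def idf_table(bows):
    """Inverse document frequency: log(N / number of documents containing the word)."""
    doc_freq = Counter()
    for bow in bows:
        doc_freq.update(bow.keys())
    return {word: math.log(len(bows) / df) for word, df in doc_freq.items()}

def tfidf(bow, idf):
    """Weight each term frequency by the word's IDF."""
    return {word: count * idf.get(word, 0.0) for word, count in bow.items()}

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = ["the cat sat on the mat", "the dog sat on the log", "stock prices fell sharply"]
bows = [bag_of_words(d) for d in docs]
idf = idf_table(bows)
vectors = [tfidf(b, idf) for b in bows]

print(cosine_similarity(bows[0], bows[1]))  # 0.75 on raw counts
print(cosine_similarity(bows[0], bows[2]))  # 0.0, no shared words
```

Rare words such as "stock" get a higher IDF than common ones such as "the", so the TF-IDF vectors downweight words that appear in most documents.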