
# Kolmogorov–Smirnov Test

Updated: 2021-11-19

## Empirical Distribution Function

• A step function: it jumps by 1/n at each of the n sample points
• Estimates the true underlying cdf of the distribution the sample was drawn from
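As a rough illustration of the two points above, here is a minimal sketch of an empirical distribution function using only the standard library. The function name `ecdf` and the sample values are made up for this example.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the fraction of sample values <= x."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")

    def F(x):
        # bisect_right counts how many of the sorted values are <= x
        return bisect_right(xs, x) / n

    return F


# Hypothetical data, only to show the step behaviour
F = ecdf([3.1, 1.2, 2.5, 4.8])
print(F(0.0), F(2.5), F(10.0))  # 0.0 0.5 1.0
```

The returned function is flat between data points and steps up by 1/n at each one, which is why the empirical curve in the charts looks jagged.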

http://www.physics.csbsju.edu/stats/KS-test.html

(Example and charts are borrowed from Wikipedia)

### Empirical vs Theory

First impression from the charts: one line is smooth, the others are not.

• Theory: the red line in the left chart is the theoretical cumulative distribution function (cdf), so it is smooth.
• Empirical: the blue line is the empirical distribution function, computed from your data, so it is a zig-zag step line. Both lines in the right chart are empirical.

### One-sample vs Two sample

• One-sample K-S test (left chart): tests whether one sample fits a reference distribution.
• Two-sample K-S test (right chart): tests whether two samples come from the same distribution, with no assumption about what that distribution is (in other words, whether the two datasets differ significantly).
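For the one-sample case, a minimal sketch could look like the following. It assumes the reference distribution is a normal distribution, which is only an example choice. The names `normal_cdf` and `ks_one_sample` are hypothetical helpers, not a library API.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution (assumed reference for this example)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_one_sample(sample, cdf):
    """Largest vertical gap between the sample's ECDF and a reference cdf."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The ECDF jumps from (i-1)/n to i/n at x, so check both sides of the step
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# A single point at the centre of the reference distribution
print(ks_one_sample([0.0], normal_cdf))  # 0.5
```

This returns only the statistic, not a p-value. In practice, SciPy's `scipy.stats.kstest` is the usual choice for the full one-sample test.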

### How to Calculate in Math

Two-sample, from the chart:

The K-S statistic is the maximum vertical deviation between the two curves (in plain English: the longest vertical line you can draw between them).
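In symbols, the "maximum vertical deviation" is usually written as below. The symbol names are chosen here for illustration: $F_{1,n}$ and $F_{2,m}$ are the empirical distribution functions of the two samples (of sizes $n$ and $m$), and $F$ is the reference cdf in the one-sample case.

$$
D_{n,m} = \sup_x \left| F_{1,n}(x) - F_{2,m}(x) \right|
\qquad \text{(two-sample)}
$$

$$
D_n = \sup_x \left| F_n(x) - F(x) \right|
\qquad \text{(one-sample)}
$$

Both quantities are differences of values in $[0, 1]$, so the statistic itself always lies in $[0, 1]$.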

### How to Calculate in Code

The input should be continuous data.

• sort the data
• (optional) generate bins/calculate percentiles
• calculate K-S test result

In the two-sample case: take the ECDF of the positive samples and the ECDF of the negative samples; the test effectively returns the largest difference between the two distributions. The larger the value, the better the variable distinguishes positive from negative. The value lies in [0, 1].
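The two-sample calculation can be sketched in plain Python as follows. The function name `ks_two_sample` and the score lists `pos` and `neg` are invented for illustration. Only the statistic is computed, not a p-value.

```python
from bisect import bisect_right


def ks_two_sample(a, b):
    """Largest vertical gap between the ECDFs of samples a and b."""
    xs, ys = sorted(a), sorted(b)  # step 1: sort the data
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both ECDFs only change at observed values, so checking those is enough
    for v in xs + ys:
        fa = bisect_right(xs, v) / n  # ECDF of a at v
        fb = bisect_right(ys, v) / m  # ECDF of b at v
        d = max(d, abs(fa - fb))
    return d


# Hypothetical model scores for positive and negative examples
pos = [0.9, 0.8, 0.7, 0.6]
neg = [0.1, 0.2, 0.3, 0.65]
print(ks_two_sample(pos, neg))  # 0.75
```

Identical samples give 0 and completely separated samples give 1, matching the [0, 1] range. For real work, SciPy's `scipy.stats.ks_2samp` computes the same statistic and also returns a p-value.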