# Kolmogorov–Smirnov Test

## Empirical Distribution Function

- A step function that rises by `1/n` at each of the `n` sample points
- Estimates the true underlying CDF that the sample points were drawn from
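A minimal sketch of an empirical distribution function in plain Python may make the "step function" idea concrete. The helper name `ecdf` and the toy sample are made up for illustration; they are not from any particular library.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the fraction of sample points <= x."""
    data = sorted(sample)
    n = len(data)

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(data, x) / n

    return F


F = ecdf([3.0, 1.0, 2.0, 2.0])
# The function is flat between data points and jumps at each one
# (the jump at 2.0 is twice as large because the value appears twice).
print(F(0.5), F(1.0), F(2.0), F(2.5), F(3.0))  # 0.0 0.25 0.75 0.75 1.0
```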

http://www.physics.csbsju.edu/stats/KS-test.html

(Example and charts are borrowed from Wikipedia)

### Empirical vs Theory

First impression from the charts: one line is smooth, others are not.

**Theory**: the red line in the left chart shows the theoretical cumulative distribution function (CDF), so it is **smooth**.

**Empirical**: the blue line shows the empirical distribution function, which is computed from your data, so it is **zig-zagged**. Both lines in the right chart are empirical.

### One-sample vs Two-sample

**One-sample K-S test (left chart)**: tests whether one sample fits a reference distribution.

**Two-sample K-S test (right chart)**: tests whether two samples come from the same distribution, without assuming what that distribution is (in other words, whether the two datasets differ significantly).

### How to Calculate in Math

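For the two-sample test, the statistic can be read straight off the chart: it is the maximum vertical deviation between the two empirical distribution functions (in plain English: the longest vertical line you can draw between the two curves).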
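In symbols, the two-sample statistic is usually written as below. The notation is an assumption chosen here: $F_{1,n}$ and $F_{2,m}$ stand for the empirical distribution functions of the two samples, of sizes $n$ and $m$.

$$
D_{n,m} = \sup_x \left| F_{1,n}(x) - F_{2,m}(x) \right|
$$

Because each empirical distribution function takes values between 0 and 1, $D_{n,m}$ also lies between 0 and 1.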

### How to Calculate in Code

The input should be continuous data.

- sort the data
- (optional) generate bins/calculate percentiles
- calculate K-S test result
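The steps above can be sketched in plain Python as follows. This is a minimal illustration of the statistic only (no p-value and no binning), and the function name `ks_2samp_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_2samp_statistic(a, b):
    """Two-sample K-S statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(a), sorted(b)  # step 1: sort the data
    d = 0.0
    # Both ECDFs only change at observed values, so checking those is enough.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)  # ECDF of the first sample at x
        fb = bisect_right(b, x) / len(b)  # ECDF of the second sample at x
        d = max(d, abs(fa - fb))
    return d


sample_1 = [0.1, 0.4, 0.7, 1.2]
sample_2 = [0.9, 1.5, 2.3, 2.8]
print(ks_2samp_statistic(sample_1, sample_2))  # 0.75
```

In practice a library routine is likely the better choice: for example, `scipy.stats.ks_2samp` returns both the statistic and a p-value.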

### Advantages

- makes no assumption about the distribution of the data in the two-sample test
- non-parametric
- result can be visualized on a chart of CDFs as the maximum vertical deviation
- (unlike the t-statistic) result is not affected by monotonic changes of scale such as taking the log, because it depends only on the relative ordering of the data
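The scale-invariance point can be checked with a small experiment: applying the same strictly increasing transformation (here `math.log`) to both samples should leave the statistic unchanged. The helper `ks_stat` and the data are illustrative assumptions, not part of any library.

```python
import math
from bisect import bisect_right


def ks_stat(a, b):
    """Maximum vertical distance between the ECDFs of a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


x = [1.0, 2.0, 4.0, 8.0]
y = [3.0, 16.0, 32.0, 64.0]
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]

# The log changes the values but not their ordering, so the statistic is the same.
print(ks_stat(x, y), ks_stat(log_x, log_y))  # 0.75 0.75
```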

### Use Cases

In the two-sample case: take the ECDF of a variable for the **positive** data and for the **negative** data; the test statistic is then the largest difference between the two distributions. The larger the value, the better the variable distinguishes positive from negative. The value always lies in `[0, 1]`.
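As a rough sketch of this use case, the snippet below scores two made-up variables by how well they separate positive (`1`) from negative (`0`) rows. The function name `ks_separation`, the labels, and the feature values are all hypothetical.

```python
from bisect import bisect_right


def ks_separation(values, labels):
    """K-S statistic between the values of positive (1) and negative (0) rows."""
    pos = sorted(v for v, y in zip(values, labels) if y == 1)
    neg = sorted(v for v, y in zip(values, labels) if y == 0)
    return max(
        abs(bisect_right(pos, x) / len(pos) - bisect_right(neg, x) / len(neg))
        for x in pos + neg
    )


labels = [1, 1, 1, 1, 0, 0, 0, 0]
strong = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # classes do not overlap
weak = [0.9, 0.2, 0.7, 0.4, 0.8, 0.3, 0.6, 0.1]    # classes are interleaved

print(ks_separation(strong, labels))  # 1.0
print(ks_separation(weak, labels))    # 0.25
```

A value near 1 suggests the variable separates the two classes almost perfectly, while a value near 0 suggests the two distributions are nearly identical.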