Kolmogorov–Smirnov Test
Empirical Distribution Function
- A step function that rises at each observed data point
- Estimates the true underlying CDF of the population the sample was drawn from
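As a minimal sketch of the idea, the empirical distribution function can be built in a few lines of Python. The helper name `ecdf` and the sample values are made up for illustration; only the standard library is used.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function F where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F

# Illustrative data only
F = ecdf([3.1, 1.2, 2.7, 1.2, 4.0])
print(F(1.0), F(1.2), F(3.0), F(4.0))  # 0.0 0.4 0.6 1.0
```

The function is flat between data points and jumps at each one, which is why the empirical line looks like a staircase on a chart.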
http://www.physics.csbsju.edu/stats/KS-test.html
(The example and charts are borrowed from Wikipedia)
Empirical vs Theory
First impression from the charts: one line is smooth, the others are not.
- Theory: the red line in the left chart is the theoretical cumulative distribution function (CDF) of the reference distribution, so it is smooth.
- Empirical: the blue line is the empirical distribution function, which is computed from your data, so it is a zigzag (a step function). Both lines in the right chart are empirical.
One-sample vs Two sample
- One-sample K-S test (left chart): tests whether one sample fits a reference distribution.
- Two-sample K-S test (right chart): tests whether two samples come from the same distribution, without assuming what that distribution is (i.e. whether the two datasets differ significantly).
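For the one-sample case, a rough sketch of the statistic might look like the following. The function names `ks_one_sample` and `normal_cdf` are hypothetical helpers written for this note, and the standard normal is only an assumed example of a reference distribution. The sketch returns the distance only, not a p-value.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, used here as an example reference."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_one_sample(sample, cdf):
    """Largest vertical gap between the sample's ECDF and a reference CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The ECDF steps from i/n up to (i+1)/n at x, so check the gap
        # just below and just above the step.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Illustrative data only: three points compared with Uniform(0, 1)
print(ks_one_sample([0.25, 0.5, 0.75], lambda x: x))
```

Both sides of each step are checked because the reference CDF is continuous while the ECDF jumps, so the largest gap can sit on either side of a data point.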
How to Calculate in Math
Two-sample:
From the chart:
the statistic is the maximum vertical deviation (in plain English: the longest vertical distance you can draw between the two lines)
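Written as a formula, the maximum vertical deviation is conventionally defined as below. The symbols are notation chosen for this note: F_{1,n} and F_{2,m} are the empirical distribution functions of the two samples (of sizes n and m), F is the reference CDF in the one-sample case, and sup is the supremum over all x.

```latex
% Two-sample statistic
D_{n,m} = \sup_x \left| F_{1,n}(x) - F_{2,m}(x) \right|

% One-sample statistic, for comparison
D_n = \sup_x \left| F_n(x) - F(x) \right|
```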
How to Calculate in Code
The input should be continuous data.
- sort the data
- (optional) generate bins/calculate percentiles
- calculate the K-S statistic (the largest gap between the two cumulative curves)
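The steps above can be sketched in plain Python for the two-sample case. The name `ks_two_sample` is a hypothetical helper, the optional binning step is skipped, and only the statistic is computed (no p-value). In practice a library routine such as `scipy.stats.ks_2samp`, which also reports a p-value, is likely the better choice.

```python
from bisect import bisect_right

def ks_two_sample(a, b):
    """Largest vertical gap between the ECDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)  # step 1: sort the data
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # Both ECDFs are step functions, so the gap between them can only
    # change at an observed value. Evaluating at every data point is enough.
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / n  # fraction of a that is <= x
        fb = bisect_right(b_sorted, x) / m  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d

# Illustrative data only
print(ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```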
Advantage
- makes no assumption about the distribution of the data in the two-sample test
- non-parametric
- result can be visualized on a chart (of CDFs) as the maximum vertical deviation
- (unlike the t-statistic) the result is not affected by monotonic rescaling such as taking the log (because it depends only on the relative ordering of the data, not on the raw values)
Use Cases
In the two-sample case: take the ECDF of a variable for the positive examples and for the negative examples. The test statistic is effectively the largest difference between the two distributions. The larger the value, the better the variable distinguishes positive from negative. The value lies in [0, 1].
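A small sketch of this use case, assuming labels are encoded as 1 (positive) and 0 (negative). The function name `ks_separation` and the toy data are invented for illustration. It walks through the values in sorted order and tracks the cumulative share of positives and negatives seen so far.

```python
def ks_separation(values, labels):
    """K-S distance between a variable's distribution on positives (1) and negatives (0)."""
    pairs = sorted(zip(values, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    seen_pos = seen_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        v = pairs[i][0]
        # Consume every row tied at this value before comparing the ECDFs
        while i < len(pairs) and pairs[i][0] == v:
            if pairs[i][1] == 1:
                seen_pos += 1
            else:
                seen_neg += 1
            i += 1
        best = max(best, abs(seen_pos / n_pos - seen_neg / n_neg))
    return best

# Illustrative data only: a variable that separates the classes perfectly
print(ks_separation([0.1, 0.2, 0.3, 0.8, 0.9, 0.7], [0, 0, 0, 1, 1, 1]))  # 1.0
```

A value near 1 means the two classes barely overlap on this variable; a value near 0 means the variable carries little information about the label.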