
Kolmogorov–Smirnov Test

Empirical Distribution Function

  • A step function that rises by 1/n at each of the n sample points
  • Estimates the true underlying CDF of the distribution the sample was drawn from
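
As a rough illustration, the empirical distribution function can be sketched in a few lines of plain Python. The helper name `ecdf` and the toy sample are made up for this example; the only idea taken from above is "the fraction of sample points less than or equal to x".

```python
from bisect import bisect_right


def ecdf(sample):
    """Build the empirical distribution function of a non-empty sample.

    The returned function F(x) gives the fraction of sample values <= x.
    """
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right returns how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F


# Toy data (hypothetical): four points, one of them repeated
F = ecdf([3.0, 1.0, 2.0, 2.0])
for x in (0.5, 1.0, 2.0, 2.5, 3.0):
    print(x, F(x))
```

The printed values only change at the sample points, which is the step shape described above; the repeated value 2.0 produces a double-height step.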

http://www.physics.csbsju.edu/stats/KS-test.html

(Example and charts are borrowed from Wikipedia)

Empirical vs Theory

First impression from the charts: one line is smooth, the others are not.

  • Theory: the red line in the left chart is the theoretical cumulative distribution function (CDF) of the reference distribution, so it is smooth.
  • Empirical: the blue line is the empirical distribution function, which is computed from your data, so it is a zig-zag of steps. Both lines in the right chart are empirical.

One-sample vs Two-sample

  • One-sample K-S test (left chart): tests whether a sample fits a reference distribution.
  • Two-sample K-S test (right chart): tests whether two samples come from the same distribution, without assuming what that distribution is (in other words, whether the two datasets differ significantly).
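
To make the one-sample case concrete, here is a minimal sketch in plain Python. It measures the largest vertical gap between the sample's empirical distribution function and a reference CDF. The names `ks_one_sample` and `normal_cdf` are illustrative, and the standard normal is only an assumed reference distribution; any function that returns a CDF value could be passed instead.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution (assumed reference for this example)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_one_sample(sample, cdf):
    """Largest vertical gap between the sample's ECDF and a reference CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x, so compare the
        # reference CDF against both the top and the bottom of the step.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical data: three evenly spaced points against a uniform(0, 1) CDF
d_uniform = ks_one_sample([0.25, 0.5, 0.75], lambda x: x)
# A single observation at the centre of a standard normal
d_normal = ks_one_sample([0.0], normal_cdf)
print(d_uniform, d_normal)
```

Checking both sides of each step matters because the reference curve is continuous while the empirical curve jumps, so the largest gap can sit just before a jump as well as just after it.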

How to Calculate in Math

Two-sample: read from the chart, the statistic is the maximum vertical deviation between the two empirical distribution functions (in plain English: the longest vertical line you can draw between the two curves).
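
In symbols, the standard definitions match this description. The notation is chosen here for illustration: F_n is the empirical distribution function of a sample of size n, F is the reference CDF, and F_{1,n} and F_{2,m} are the empirical distribution functions of two samples of sizes n and m.

```latex
% One-sample statistic: largest gap between the ECDF and the reference CDF
D_n = \sup_x \left| F_n(x) - F(x) \right|

% Two-sample statistic: largest gap between the two ECDFs
D_{n,m} = \sup_x \left| F_{1,n}(x) - F_{2,m}(x) \right|
```

Both are suprema of an absolute difference between two functions with values in [0, 1], so the statistic itself always lies in [0, 1].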

How to Calculate in Code

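The input should be continuous numeric data. The test assumes continuous distributions, so with heavily tied or discrete data the usual p-values are only approximate.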

  • sort the data
  • (optional) generate bins/calculate percentiles
  • calculate K-S test result
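
The steps above can be sketched in plain Python for the two-sample case. This is a minimal illustration, not a reference implementation: the name `ks_two_sample` is made up, the optional binning step is skipped, both inputs are assumed to be non-empty, and only the statistic is computed (no p-value).

```python
def ks_two_sample(a, b):
    """Two-sample K-S statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(a), sorted(b)  # step 1: sort the data
    n, m = len(a), len(b)
    i = j = 0  # number of values consumed so far from each sample
    d = 0.0
    # Walk through the pooled values in increasing order.
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        # i/n and j/m are the two ECDF values at x.
        d = max(d, abs(i / n - j / m))
    # Once one sample is exhausted its ECDF is 1 and the gap can only shrink.
    return d


print(ks_two_sample([1, 2, 3], [4, 5, 6]))        # no overlap at all
print(ks_two_sample([1, 2, 3], [1, 2, 3]))        # identical samples
print(ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6]))  # partial overlap
```

After sorting, a single merge-style pass is enough, because the gap between two step functions can only change at an observed value.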

Advantage

  • makes no assumption about the distribution of the data in the two-sample test
  • non-parametric
  • result can be visualized on a chart (of CDFs) as the maximum vertical deviation
  • (unlike the t-statistic) result is not affected by monotonic rescaling such as taking the log of both samples, because it depends only on the relative ordering of the data
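
The last point can be checked with a small experiment. The sketch below uses made-up positive-valued data and an illustrative helper `ks_two_sample`; it computes the two-sample statistic on the raw values and again after a log transform of both samples.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Two-sample K-S statistic, evaluated at every observed value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # The largest gap between two step functions occurs at an observed value.
    return max(
        abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b
    )


# Hypothetical samples; values must be positive so that log is defined
pos = [1.5, 2.0, 8.0, 20.0, 55.0]
neg = [0.5, 1.0, 1.2, 3.0, 4.0]

raw = ks_two_sample(pos, neg)
logged = ks_two_sample([math.log(x) for x in pos], [math.log(x) for x in neg])
print(raw, logged)  # the two values are equal
```

The log changes every value but not the order in which the pooled values appear, so the two step functions keep the same heights and the maximum gap is unchanged. A t-statistic computed on the same data would generally differ before and after the transform.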

Use Cases

In the two-sample setting: compute the ECDF of a variable separately for the positive examples and for the negative examples. The test statistic is then the largest difference between the two distributions. The larger the value, the better the variable distinguishes positive from negative. The value always lies in [0, 1].
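
A minimal sketch of this use case follows. It assumes binary labels coded as 1 (positive) and 0 (negative) with at least one example of each; the function name `ks_separation` and the toy scores are hypothetical. It also reports the value at which the largest gap occurs, which is a convenience added here and not something the test itself requires.

```python
def ks_separation(values, labels):
    """Largest gap between the ECDFs of the positive and negative examples.

    Returns (ks, threshold), where threshold is the first value at which
    the largest gap is reached.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    best, best_t = 0.0, None
    for t in sorted(set(values)):
        # Fraction of each class with a value <= t (the two ECDFs at t).
        f_pos = sum(v <= t for v in pos) / len(pos)
        f_neg = sum(v <= t for v in neg) / len(neg)
        gap = abs(f_pos - f_neg)
        if gap > best:
            best, best_t = gap, t
    return best, best_t


# Toy variable (hypothetical): higher values tend to be positive
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
ks, t = ks_separation(scores, labels)
print(ks, t)
```

The loop rescans both classes for every candidate value, which keeps the code short but is quadratic in the number of points; for large datasets a sorted single pass would be the more practical choice.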

Further Readings

http://www.physics.csbsju.edu/stats/KS-test.html