Data Sampling

Updated: 2019-01-13

Why Sampling

  • Statisticians: obtaining the entire data set is too expensive
  • Data miners: processing the entire data set is too expensive

With/Without Replacement

  • Sample without replacement: each selected item is removed from the population, so it cannot be selected again.
  • Sample with replacement: the selected item is not removed, so the same item may be selected more than once. This is simpler to analyze, since the probability of selecting any given item remains constant during the sampling process.
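
The difference between the two modes can be seen in a minimal sketch using Python's standard `random` module. The population, the sample size, and the seed are made up for illustration.

```python
import random

population = list(range(10))   # illustrative population of 10 items
rng = random.Random(0)         # fixed seed so the run is repeatable

# Without replacement: every drawn item is distinct.
without_replacement = rng.sample(population, k=5)

# With replacement: each draw is independent, so repeats are possible.
with_replacement = rng.choices(population, k=5)

print(without_replacement)
print(with_replacement)
```

`random.sample` raises `ValueError` if `k` is larger than the population, whereas `random.choices` accepts any `k`, because items are never used up.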

Advanced Sampling

  • Stratified Sampling: sample from each group separately, possibly at a different sampling rate per group, e.g. draw a number of items proportional to the size of each group.
  • Progressive Sampling: start with a small sample and increase its size. The accuracy of a predictive model increases as the sample size increases; once the increase in accuracy levels off, stop increasing the size.
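
Both ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `stratified_sample` and `progressive_sample_size`, the doubling schedule, the `tolerance` threshold, and the toy accuracy curve are all hypothetical, not part of any library.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, fraction, rng):
    """Draw about `fraction` of each group, without replacement."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for members in groups.values():
        # Proportional allocation, but keep at least one item per group.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def progressive_sample_size(evaluate, start, maximum, tolerance=0.01):
    """Double the sample size until the accuracy gain drops below `tolerance`.

    `evaluate(n)` is assumed to train a model on n examples and
    return its accuracy.
    """
    size = start
    best = evaluate(size)
    while size * 2 <= maximum:
        score = evaluate(size * 2)
        if score - best < tolerance:   # accuracy has levelled off
            break
        size, best = size * 2, score
    return size

# Toy data: 80 items in group "a", 20 items in group "b".
data = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
sample = stratified_sample(data, key=lambda x: x[0], fraction=0.1,
                           rng=random.Random(0))

# Toy accuracy curve that flattens as n grows (an assumption, not real data).
chosen_size = progressive_sample_size(lambda n: 1 - 1 / n,
                                      start=10, maximum=10_000)
```

In practice `evaluate` would train and validate a real model, which is the expensive step. The point of the schedule is to stop paying for it once larger samples no longer help.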

Uniform vs Negative

  • Uniform downsampling is applied when you have more examples than you need to train your model, or when the examples do not all fit in the machine's memory.
  • Negative downsampling is applied when your data distribution is unbalanced and you have far more negative examples than positive examples. With unbalanced data, the trained model tends to learn mostly from the class that has more training examples. To avoid this, you can downsample only the negative examples, so that the gap between the number of positive and negative examples is not too large.
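
The two strategies might look like this in Python. The function names, the `(id, label)` example format, and the `ratio` parameter (negatives kept per positive) are assumptions made for this sketch.

```python
import random

def uniform_downsample(examples, n, rng):
    """Keep at most n examples, chosen uniformly regardless of label."""
    return rng.sample(examples, min(n, len(examples)))

def negative_downsample(examples, is_positive, ratio, rng):
    """Keep every positive and at most `ratio` negatives per positive."""
    positives = [e for e in examples if is_positive(e)]
    negatives = [e for e in examples if not is_positive(e)]
    keep = min(len(negatives), int(len(positives) * ratio))
    sampled = positives + rng.sample(negatives, keep)
    rng.shuffle(sampled)   # avoid all positives appearing first
    return sampled

rng = random.Random(0)
# Toy unbalanced data: 10 positives (label 1) and 990 negatives (label 0).
examples = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 1000)]

smaller = uniform_downsample(examples, n=100, rng=rng)
balanced = negative_downsample(examples, lambda e: e[1] == 1,
                               ratio=2, rng=rng)
```

Uniform downsampling keeps the class proportions roughly as they were. Negative downsampling changes them on purpose, here to about two negatives per positive.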