Data Sanity Check

Updated: 2019-01-13

Check the following items:

  • missing/invalid values: e.g. more than half of all the values are empty or null
  • data size: e.g. the data load is 10x smaller or larger than expected, perhaps because floating-point numbers are being stored as strings
  • data volume: total count of records
  • data distribution: e.g. the distribution of activity normally seen in the data is off by a large factor
  • cardinality: count of unique values
  • uniqueness: e.g. a field that must hold only unique values contains a duplicate
  • median/percentile: 50%, 95%, 99% etc.
  • mode: most frequent discrete value (for categorical variables)
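
Several of the checks above can be sketched for a single column with the Python standard library alone. This is a minimal illustration, not a complete validator: the names column_report and within_factor, the treatment of None and "" as missing, and the default factor of 10 are all assumptions made for the example.

```python
from collections import Counter
from statistics import quantiles


def within_factor(actual, expected, factor=10):
    """Data size/volume check: True if actual is within `factor`x of expected."""
    return expected / factor < actual < expected * factor


def column_report(values, must_be_unique=False):
    """Summarize one column. None and "" are treated as missing (an assumption)."""
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    report = {
        "volume": len(values),                                  # total record count
        "missing_ratio": 1 - len(present) / len(values) if values else 0.0,
        "cardinality": len(counts),                             # count of unique values
        "mode": counts.most_common(1)[0][0] if counts else None,
        # Only reported when the field is required to be unique.
        "duplicates": [v for v, n in counts.items() if n > 1] if must_be_unique else [],
    }
    # Percentiles only make sense for numeric values; bool is excluded on purpose.
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if len(numeric) >= 2:
        cuts = quantiles(numeric, n=100, method="inclusive")    # 99 cut points
        report["p50"], report["p95"], report["p99"] = cuts[49], cuts[94], cuts[98]
    return report


if __name__ == "__main__":
    ids = [1, 2, 2, 3, None, ""]
    print(column_report(ids, must_be_unique=True))
    print(within_factor(actual=50, expected=1000))
```

A report with missing_ratio above 0.5, a non-empty duplicates list, or a within_factor result of False would flag the corresponding item in the list.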

ANOVA

  • ANOVA: Analysis of Variance, includes only one dependent variable
    • there can be several error terms, whereas regression has only a single error term
    • mainly used to determine whether data from different groups share a common mean
  • MANOVA: Multivariate Analysis of Variance, includes multiple dependent variables
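
To make the "common mean" test concrete, here is a rough sketch of the one-way ANOVA F statistic: the between-group variance divided by the within-group variance, for one dependent variable. The function name one_way_anova_f and the sample groups are made up for illustration. In practice a library routine such as scipy.stats.f_oneway would normally be used, since it also returns a p-value.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)                        # number of groups
    n = sum(len(g) for g in groups)        # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)      # k - 1 degrees of freedom
    ms_within = ss_within / (n - k)        # n - k degrees of freedom
    return ms_between / ms_within


if __name__ == "__main__":
    # Hypothetical measurements from three groups.
    print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F value suggests that at least one group mean differs from the others. Groups with identical means give an F of 0.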

Read more: ANOVA vs MANOVA