Data Sanity Check
Last Updated: 2021-11-19
Check the following items:
- missing/invalid values: e.g. more than half of all the values are empty or null
- data size: e.g. the data load is 10x smaller or larger than expected, possibly because double-precision numbers are being stored as strings
- data volume: total count of records
- data distribution: e.g. the distribution of activity normally seen in the data has shifted by a large factor
- cardinality: count of unique values
- uniqueness: in a field that must have all unique values, there is a duplicate
- median/percentile: 50%, 95%, 99% etc.
- mode: most frequent discrete value (for categorical variables)
- ANOVA: Analysis of Variance; includes only one dependent variable
  - there can be several error terms, whereas regression has only a single error term
  - mainly used to determine whether data from several groups share a common mean
- MANOVA: Multivariate Analysis of Variance; includes multiple dependent variables
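
As a rough illustration of the ANOVA item above, the one-way F statistic can be computed by hand with the standard library. This is a minimal sketch, not a full test: the function name `one_way_anova_f` and the sample groups are hypothetical, and no p-value is computed (that would need the F distribution, e.g. from a statistics package).

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups.

    Hypothetical helper for illustration; assumes one dependent variable
    measured in two or more non-empty groups.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = fmean([x for g in groups for x in g])
    group_means = [fmean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of each observation around its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up sample data: three groups of three observations each.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to the critical value for the given degrees of freedom suggests that the groups do not share a common mean.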
Read more: ANOVA vs MANOVA
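
The simpler checks in the list above (missing values, size, volume, cardinality, uniqueness, percentiles, mode) can be sketched with the Python standard library alone. The helper names `profile_column` and `size_is_suspicious`, the definition of "missing" (None, empty string, NaN), and the sample data are all assumptions made for illustration; adapt them to the actual data source.

```python
import math
from collections import Counter
from statistics import quantiles


def is_missing(value):
    """Assumed definition of missing: None, empty string, or NaN."""
    return value is None or value == "" or (isinstance(value, float) and math.isnan(value))


def size_is_suspicious(actual, expected, factor=10):
    """Flag a load at least `factor` times smaller or larger than expected."""
    return actual * factor <= expected or actual >= expected * factor


def profile_column(values, must_be_unique=False):
    """Return basic sanity metrics for one column given as a list of scalars."""
    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)
    report = {
        "volume": len(values),
        "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
        "cardinality": len(counts),
        "mode": counts.most_common(1)[0][0] if counts else None,
        # Only reported when the field is supposed to hold unique values.
        "duplicates": [v for v, c in counts.items() if c > 1] if must_be_unique else [],
    }
    numeric = [v for v in present if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if len(numeric) >= 2:
        cuts = quantiles(numeric, n=100, method="inclusive")  # 99 percentile cut points
        report.update(p50=cuts[49], p95=cuts[94], p99=cuts[98])
    return report


# Made-up sample column with two missing entries and one duplicate.
report = profile_column([1, 2, 2, 3, None, ""], must_be_unique=True)
if report["missing_ratio"] > 0.5:
    print("more than half of the values are missing")
if report["duplicates"]:
    print("duplicate values in a unique field:", report["duplicates"])
```

Each metric can then be compared against a threshold or a previous run, for example `size_is_suspicious(actual_rows, expected_rows)` for the 10x rule.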