We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. We use this metric to propose a widely applicable yet scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion and cost-related criteria. Existing metrics focus on glitch improvement and cost, but not on the statistical impact of data cleaning strategies. We illustrate our framework on real-world data, with a comprehensive suite of experiments and analyses.
All Science Journal Classification (ASJC) codes
- Computer Science (miscellaneous)
- General Computer Science