Abstract
We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. We use this metric to propose a widely applicable yet scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion and cost-related criteria. Existing metrics focus on glitch improvement and cost, but not on the statistical impact of data cleaning strategies. We illustrate our framework on real-world data, with a comprehensive suite of experiments and analyses.
Original language | English (US) |
---|---|
Pages (from-to) | 1674-1683 |
Number of pages | 10 |
Journal | Proceedings of the VLDB Endowment |
Volume | 5 |
Issue number | 11 |
State | Published - Jul 2012 |
Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Computer Science (miscellaneous)
- General Computer Science