Abstract
We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. We use this metric to propose a widely applicable yet scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion and cost-related criteria. Existing metrics focus on glitch improvement and cost, but not on the statistical impact of data cleaning strategies. We illustrate our framework on real world data, with a comprehensive suite of experiments and analyses.
| Original language | English (US) |
|---|---|
| Pages (from-to) | 1674-1683 |
| Number of pages | 10 |
| Journal | Proceedings of the VLDB Endowment |
| Volume | 5 |
| Issue number | 11 |
| State | Published - Jul 2012 |
| Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Computer Science (miscellaneous)
- General Computer Science