Statistical distortion: Consequences of data cleaning

Tamraparni Dasu, Ji Meng Loh

Research output: Contribution to journal › Article › peer-review

43 Scopus citations

Abstract

We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. We use this metric to propose a widely applicable yet scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion and cost-related criteria. Existing metrics focus on glitch improvement and cost, but not on the statistical impact of data cleaning strategies. We illustrate our framework on real world data, with a comprehensive suite of experiments and analyses.

Original language: English (US)
Pages (from-to): 1674-1683
Number of pages: 10
Journal: Proceedings of the VLDB Endowment
Volume: 5
Issue number: 11
DOIs
State: Published - Jul 2012
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • General Computer Science
