Empirical glitch explanations

Tamraparni Dasu, Ji Meng Loh, Divesh Srivastava

Research output: Chapter in Book/Report/Conference proceedingConference contribution

6 Scopus citations

Abstract

Data glitches are unusual observations that do not conform to data quality expectations, be they logical, semantic or statistical. By applying data integrity constraints, potentially large sections of data could be flagged as being noncompliant. Ignoring or repairing significant sections of the data could fundamentally bias the results and conclusions drawn from analyses. In the context of Big Data where large numbers and volumes of feeds from disparate sources are integrated, it is likely that significant portions of seemingly noncompliant data are actually legitimate usable data. In this paper, we introduce the notion of Empirical Glitch Explanations - concise, multi-dimensional descriptions of subsets of potentially dirty data - and propose a scalable method for empirically generating such explanatory characterizations. The explanations could serve two valuable functions: (1) Provide a way of identifying legitimate data and releasing it back into the pool of clean data. In doing so, we reduce cleaning-related statistical distortion of the data; (2) Used to refine existing data quality constraints and generate and formalize domain knowledge. We conduct experiments using real and simulated data to demonstrate the scalability of our method and the robustness of explanations. In addition, we use two real world examples to demonstrate the utility of the explanations where we reclaim over 99% of the suspicious data, keeping data repair related statistical distortion close to 0.

Original languageEnglish (US)
Title of host publicationKDD 2014 - Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PublisherAssociation for Computing Machinery
Pages572-581
Number of pages10
ISBN (Print)9781450329569
DOIs
StatePublished - 2014
Event20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014 - New York, NY, United States
Duration: Aug 24 2014Aug 27 2014

Publication series

NameProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014
Country/TerritoryUnited States
CityNew York, NY
Period8/24/148/27/14

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Keywords

  • crossover subsampling
  • data quality
  • glitch explanations
  • quantitative data cleaning

Fingerprint

Dive into the research topics of 'Empirical glitch explanations'. Together they form a unique fingerprint.

Cite this