BayesWipe: A scalable probabilistic framework for improving data quality

Sushovan De, Yuheng Hu, Venkata Vamsikrishna Meduri, Yi Chen, Subbarao Kambhampati

Research output: Contribution to journal › Article › peer-review

17 Scopus citations

Abstract

Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like Conditional Functional Dependencies (which have to be provided by domain experts or learned from a clean sample of the database). In this article, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.

Original language: English (US)
Article number: 5
Journal: Journal of Data and Information Quality
Volume: 8
Issue number: 1
State: Published - Oct 2016

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Information Systems and Management

Keywords

  • Data quality
  • Offline and online cleaning
  • Statistical data cleaning
