BayesWipe: A multimodal system for data cleaning and consistent query answering on structured bigdata

Sushovan De, Yuheng Hu, Yi Chen, Subbarao Kambhampati

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

8 Scopus citations

Abstract

Recent efforts in cleaning structured data have focused exclusively on problems such as data deduplication, record matching, and data standardization; none of these addresses incorrect attribute values in tuples. Such values are typically corrected by a minimum-cost repair of tuples that violate static constraints such as conditional functional dependencies (CFDs), which must either be provided by domain experts or learned from a clean sample of the database. In this paper, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned directly from the noisy database. We thus avoid the need for a domain expert or clean master data. We also show how to efficiently perform consistent query answering with this model over a dirty database when write permissions to the database are unavailable. We evaluate our methods on both synthetic and real data.
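The correction idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes one plausible reading of the approach: for an observed tuple T, pick the candidate clean tuple T* that maximizes P(T*) · P(T | T*). Both components here are stand-ins chosen for brevity: the prior P(T*) is the empirical frequency of each distinct tuple in the noisy table, and the error model P(T | T*) is a hypothetical per-edit penalty based on edit distance. The names `edit_distance`, `error_likelihood`, `clean_tuple`, `toy_table`, and the `penalty` parameter are all illustrative.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def error_likelihood(observed, candidate, penalty=0.2):
    """Assumed error model: P(observed | candidate) shrinks by `penalty` per edit."""
    total_edits = sum(edit_distance(o, c) for o, c in zip(observed, candidate))
    return penalty ** total_edits


def clean_tuple(observed, table, penalty=0.2):
    """Return the candidate tuple maximizing prior * error likelihood.

    Candidates are the distinct tuples already present in the noisy table;
    the prior is their empirical frequency (an assumption of this sketch).
    """
    counts = Counter(table)
    total = sum(counts.values())
    return max(
        counts,
        key=lambda cand: (counts[cand] / total)
        * error_likelihood(observed, cand, penalty),
    )


# Toy noisy table: one misspelled tuple among many correct ones.
toy_table = (
    [("Honda", "Civic")] * 8
    + [("Honda", "Accord")] * 3
    + [("Honda", "Civik")]
)

print(clean_tuple(("Honda", "Civik"), toy_table))   # ('Honda', 'Civic')
print(clean_tuple(("Honda", "Accord"), toy_table))  # ('Honda', 'Accord')
```

The misspelled tuple is rewritten because the frequent neighbor's prior outweighs the one-edit penalty, while an already common tuple is left unchanged. Everything is estimated from the noisy table itself, mirroring the abstract's point that no clean master data or expert-supplied constraints are required.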

Original language: English (US)
Title of host publication: Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
Editors: Jimmy Lin, Jian Pei, Xiaohua Tony Hu, Wo Chang, Raghunath Nambiar, Charu Aggarwal, Nick Cercone, Vasant Honavar, Jun Huan, Bamshad Mobasher, Saumyadipta Pyne
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 15-24
Number of pages: 10
ISBN (Electronic): 9781479956654
State: Published - 2014
Event: 2nd IEEE International Conference on Big Data, IEEE Big Data 2014 - Washington, United States
Duration: Oct 27 2014 → Oct 30 2014

Publication series

Name: Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014

Other

Other: 2nd IEEE International Conference on Big Data, IEEE Big Data 2014
Country/Territory: United States
City: Washington
Period: 10/27/14 → 10/30/14

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Information Systems

Keywords

  • Data cleaning
  • Databases
  • Query rewriting
  • Uncertainty
  • Web databases
