Abstract
Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focuses on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum-cost repair of tuples that violate static constraints such as Conditional Functional Dependencies (which have to be provided by domain experts or learned from a clean sample of the database). In this article, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model, both learned directly from the noisy database. We thus avoid the need for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database when write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.
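The idea of combining a generative model with an error model can be illustrated with a small noisy-channel sketch. This is not the article's implementation: the empirical tuple distribution stands in for the Bayesian generative model, the string-similarity product stands in for the statistical error model, and the names `learn_prior`, `error_likelihood`, and `correct` are hypothetical. Both components are estimated from the noisy rows themselves, with no clean master data.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_prior(rows):
    """Empirical distribution over whole tuples (assumed stand-in for a generative model)."""
    counts = Counter(rows)
    total = sum(counts.values())
    return {row: n / total for row, n in counts.items()}


def error_likelihood(observed, candidate):
    """Toy error model: product of per-attribute string similarities."""
    p = 1.0
    for obs_value, cand_value in zip(observed, candidate):
        similarity = SequenceMatcher(None, obs_value, cand_value).ratio()
        p *= max(similarity, 1e-6)  # floor keeps one mismatch from zeroing the product
    return p


def correct(observed, prior):
    """Return the candidate tuple maximizing prior(candidate) * P(observed | candidate)."""
    return max(prior, key=lambda cand: prior[cand] * error_likelihood(observed, cand))


# Noisy data: one row contains a misspelled make.
rows = (
    [("Honda", "Civic")] * 5
    + [("Honda", "Accord")] * 3
    + [("Hnda", "Civic")]
)
prior = learn_prior(rows)
repaired = correct(("Hnda", "Civic"), prior)
```

Here the misspelled row is mapped to the frequent, similar tuple `("Honda", "Civic")`, because its higher prior outweighs the small penalty from the error model. The same scoring could in principle be applied at query time rather than by rewriting stored rows, which is the situation the abstract describes when write permissions are unavailable.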
| Original language | English (US) |
|---|---|
| Article number | 5 |
| Journal | Journal of Data and Information Quality |
| Volume | 8 |
| Issue number | 1 |
| State | Published - Oct 2016 |
All Science Journal Classification (ASJC) codes
- Information Systems
- Information Systems and Management
Keywords
- Data quality
- Offline and online cleaning
- Statistical data cleaning