TY - JOUR
T1 - BayesWipe
T2 - A scalable probabilistic framework for improving data quality
AU - De, Sushovan
AU - Hu, Yuheng
AU - Meduri, Venkata Vamsikrishna
AU - Chen, Yi
AU - Kambhampati, Subbarao
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/10
Y1 - 2016/10
AB - Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like Conditional Functional Dependencies (which have to be provided by domain experts or learned from a clean sample of the database). In this article, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.
KW - Data quality
KW - Offline and online cleaning
KW - Statistical data cleaning
UR - http://www.scopus.com/inward/record.url?scp=84994571337&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84994571337&partnerID=8YFLogxK
U2 - 10.1145/2992787
DO - 10.1145/2992787
M3 - Article
AN - SCOPUS:84994571337
SN - 1936-1955
VL - 8
JO - Journal of Data and Information Quality
JF - Journal of Data and Information Quality
IS - 1
M1 - 5
ER -