TY - JOUR
T1 - Model-based autoencoders for imputing discrete single-cell RNA-seq data
AU - Tian, Tian
AU - Min, Martin Renqiang
AU - Wei, Zhi
N1 - Funding Information:
The research reported in this publication was partially supported by the National Center for Advancing Translational Sciences (NCATS), a component of the National Institutes of Health (NIH), under award number UL1TR003017, and by the Extreme Science and Engineering Discovery Environment (XSEDE) through allocations CIE160021 and CIE170034 (supported by National Science Foundation grant no. ACI-1548562). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.
Publisher Copyright:
© 2020 Elsevier Inc.
PY - 2021/8
Y1 - 2021/8
N2 - Deep neural networks have been widely applied to missing data imputation. However, most existing studies have focused on imputing continuous data, while discrete data imputation remains under-explored. Discrete data are common in the real world, especially in bioinformatics, genetics, and biochemistry. In particular, large amounts of recent genomic data are discrete count data generated by single-cell RNA sequencing (scRNA-seq) technology. Most scRNA-seq studies produce a discrete matrix with prevalent ‘false’ zero count observations (missing values). To make downstream analyses more effective, imputation, which recovers the missing values, is often conducted as the first step in pre-processing scRNA-seq data. In this paper, we propose a novel Zero-Inflated Negative Binomial (ZINB) model-based autoencoder for imputing discrete scRNA-seq data. The novelties of our method are twofold. First, in addition to optimizing the ZINB likelihood, we explicitly model the dropout events that cause missing values using the Gumbel-Softmax distribution. Second, the zero-inflated reconstruction is further optimized with respect to the raw count matrix. Extensive experiments on simulated datasets demonstrate that the zero-inflated reconstruction significantly improves imputation accuracy. Real-data experiments show that the proposed imputation enhances the separation of different cell types and improves the accuracy of differential expression analysis.
AB - Deep neural networks have been widely applied to missing data imputation. However, most existing studies have focused on imputing continuous data, while discrete data imputation remains under-explored. Discrete data are common in the real world, especially in bioinformatics, genetics, and biochemistry. In particular, large amounts of recent genomic data are discrete count data generated by single-cell RNA sequencing (scRNA-seq) technology. Most scRNA-seq studies produce a discrete matrix with prevalent ‘false’ zero count observations (missing values). To make downstream analyses more effective, imputation, which recovers the missing values, is often conducted as the first step in pre-processing scRNA-seq data. In this paper, we propose a novel Zero-Inflated Negative Binomial (ZINB) model-based autoencoder for imputing discrete scRNA-seq data. The novelties of our method are twofold. First, in addition to optimizing the ZINB likelihood, we explicitly model the dropout events that cause missing values using the Gumbel-Softmax distribution. Second, the zero-inflated reconstruction is further optimized with respect to the raw count matrix. Extensive experiments on simulated datasets demonstrate that the zero-inflated reconstruction significantly improves imputation accuracy. Real-data experiments show that the proposed imputation enhances the separation of different cell types and improves the accuracy of differential expression analysis.
KW - Deep learning
KW - Imputation
KW - scRNA-seq
UR - http://www.scopus.com/inward/record.url?scp=85092251117&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85092251117&partnerID=8YFLogxK
U2 - 10.1016/j.ymeth.2020.09.010
DO - 10.1016/j.ymeth.2020.09.010
M3 - Article
C2 - 32971193
AN - SCOPUS:85092251117
SN - 1046-2023
VL - 192
SP - 112
EP - 119
JO - Methods
JF - Methods
ER -