TY - JOUR
T1 - Analysis of high dimensional data using pre-defined set and subset information, with applications to genomic data
AU - Guo, Wenge
AU - Yang, Mingan
AU - Xing, Chuanhua
AU - Peddada, Shyamal D.
N1 - Funding Information:
The research of Wenge Guo is supported by NSF Grant DMS-1006021 and the research of Shyamal Peddada is supported [in part] by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (Z01 ES101744). Authors thank Drs. Leping Li and Keith Shockley for carefully reading the manuscript and making numerous suggestions which substantially improved the presentation.
PY - 2012/7/24
Y1 - 2012/7/24
N2 - Background: Based on available biological information, genomic data can often be partitioned into pre-defined sets (e.g. pathways) and subsets within sets. Biologists are often interested in determining whether some pre-defined sets of variables (e.g. genes) are differentially expressed under varying experimental conditions. Several procedures are available in the literature for making such determinations; however, they do not take into account information regarding the subsets within each set. Secondly, variables (e.g. genes) belonging to a set or a subset are potentially correlated, yet such information is often ignored and univariate methods are used. This may result in loss of power and/or an inflated false positive rate. Results: We introduce a multiple testing-based methodology which makes use of available information regarding biologically relevant subsets within each pre-defined set of variables while exploiting the underlying dependence structure among the variables. Using this methodology, a biologist may not only determine whether a set of variables is differentially expressed between two experimental conditions, but may also test whether specific subsets within a significant set are also significant. Conclusions: The proposed methodology (a) is easy to implement, (b) does not require inverting potentially singular covariance matrices, (c) controls the family-wise error rate (FWER) at the desired nominal level, and (d) is robust to the underlying distribution and covariance structures. Although the methodology is described for microarray gene expression data for simplicity of exposition, it is also applicable to any high dimensional data, such as mRNA-seq data, CpG methylation data, etc.
AB - Background: Based on available biological information, genomic data can often be partitioned into pre-defined sets (e.g. pathways) and subsets within sets. Biologists are often interested in determining whether some pre-defined sets of variables (e.g. genes) are differentially expressed under varying experimental conditions. Several procedures are available in the literature for making such determinations; however, they do not take into account information regarding the subsets within each set. Secondly, variables (e.g. genes) belonging to a set or a subset are potentially correlated, yet such information is often ignored and univariate methods are used. This may result in loss of power and/or an inflated false positive rate. Results: We introduce a multiple testing-based methodology which makes use of available information regarding biologically relevant subsets within each pre-defined set of variables while exploiting the underlying dependence structure among the variables. Using this methodology, a biologist may not only determine whether a set of variables is differentially expressed between two experimental conditions, but may also test whether specific subsets within a significant set are also significant. Conclusions: The proposed methodology (a) is easy to implement, (b) does not require inverting potentially singular covariance matrices, (c) controls the family-wise error rate (FWER) at the desired nominal level, and (d) is robust to the underlying distribution and covariance structures. Although the methodology is described for microarray gene expression data for simplicity of exposition, it is also applicable to any high dimensional data, such as mRNA-seq data, CpG methylation data, etc.
UR - http://www.scopus.com/inward/record.url?scp=84866451197&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866451197&partnerID=8YFLogxK
U2 - 10.1186/1471-2105-13-177
DO - 10.1186/1471-2105-13-177
M3 - Article
C2 - 22827252
AN - SCOPUS:84866451197
SN - 1471-2105
VL - 13
JO - BMC Bioinformatics
JF - BMC Bioinformatics
IS - 1
M1 - 177
ER -