A feature sampling strategy for analysis of high dimensional genomic data

Jie Zhang, Zhigen Zhao, Kai Zhang, Zhi Wei

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

With the development of high-throughput technology, it has become feasible and common to profile tens of thousands of gene activities simultaneously. These genomic data typically have a sample size of hundreds or fewer, far smaller than the feature size (the number of genes). In addition, genes, particularly those in the same pathway, are often highly correlated. These issues make it very challenging to select meaningful genes from a large number of (correlated) candidates in many genomic studies. Quite a few methods have been proposed to address this challenge. Among them, regularization-based techniques such as the lasso have become especially appealing because they perform model fitting and variable selection at the same time. However, lasso regression has known limitations. One is that the number of genes selected by the lasso cannot exceed the number of samples. Another is that, if causal genes are highly correlated, the lasso tends to select only one or a few of them, whereas biologists want to identify them all. To overcome these limitations, we present a novel, robust, and stable variable selection method. Through simulation studies and a real application to transcriptome data, we demonstrate the superiority of the proposed method in selecting highly correlated causal genes. We also provide theoretical justification for this feature sampling strategy based on mean and variance analyses.
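The second lasso limitation mentioned in the abstract can be illustrated with a small toy example. The sketch below is not the method proposed in the article. It is a minimal pure-Python cyclic coordinate descent for the standard lasso, assuming the objective (1/(2n))·||y − Xb||² + lam·||b||₁. The function names (`soft_threshold`, `lasso_cd`), the penalty value, and the four-sample data set are all hypothetical and chosen only for illustration. Two "gene" columns are exact copies of each other and both carry the signal, yet the fit puts all the weight on one of them.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the lasso (illustrative sketch).

    X is a list of n rows, each a list of p floats; y is a list of n floats.
    Returns the list of p fitted coefficients.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue  # an all-zero column can never enter the model
            # Correlation of column j with the partial residual
            # (column j's current contribution is added back in).
            rho = sum(col[i] * resid[i] for i in range(n)) / n + norm * beta[j]
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= col[i] * delta
                beta[j] = new
    return beta


# Hypothetical toy data: columns 0 and 1 are identical "causal genes",
# column 2 is an unrelated feature orthogonal to the signal.
g = [1.0, -1.0, 1.0, -1.0]
other = [1.0, 1.0, -1.0, -1.0]
X = [[g[i], g[i], other[i]] for i in range(4)]
y = [2.0 * v for v in g]  # response driven by the shared signal

beta = lasso_cd(X, y, lam=0.5)
print(beta)  # [1.5, 0.0, 0.0]
```

With identical columns the lasso solution is not unique: any split of the total weight 1.5 between columns 0 and 1 attains the same objective value. Coordinate descent started from zero simply keeps the first copy and leaves the second at zero, which mirrors the behaviour the abstract describes for highly correlated causal genes.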

Original language: English (US)
Article number: 8126867
Pages (from-to): 434-441
Number of pages: 8
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Volume: 16
Issue number: 2
State: Published - Mar 1 2019
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Biotechnology
  • Genetics
  • Applied Mathematics

Keywords

  • Feature sampling
  • L1 regression
  • feature selection
  • high dimensional genomic data analysis
