Background: Mixed models are a useful tool for evaluating the association between an outcome variable and genetic variables from a family-based genetic study, taking into account the kinship coefficients. When there are ultrahigh dimensional genetic variables (ie, p n), it is challenging to fit any mixed effect model. Methods: We propose a two-stage strategy, screening genetic variables in the first stage and then fitting the mixed effect model in the second stage to those variables that survive the screening. For the screening stage, we can use the sure independence screening (SIS) procedure, which fits the mixed effect model to one genetic variable at a time. Because the SIS procedure may fail to identify those marginally unimportant but jointly important genetic variables, we propose a joint screening (JS) procedure that screens all the genetic variables simultaneously. We evaluate the performance of the proposed JS procedure via a simulation study and an application to the GAW20 data. Results: We perform the proposed JS procedure on the GAW20 representative simulated data set (n = 680 participant(s) and p = 463,995 CpG cytosine-phosphate-guanine [CpG] sites) and select the top d = n/ log(n) variables. Then we fit the mixed model using these top variables. Under significance level, 5%, 43 CpG sites are found to be significant. Some diagnostic analyses based on the residuals show the fitted mixed model is appropriate. Conclusions: Although the GAW20 data set is ultrahigh dimensional and family-based having within group variances, we were successful in performing subset selection using a two-step strategy that is computationally simple and easy to understand.
All Science Journal Classification (ASJC) codes
- Biochemistry, Genetics and Molecular Biology(all)