1) The document proposes a new method called Multiple Imputation Random Lasso (MIRL) to perform variable selection and prediction on incomplete high-dimensional data, such as from an epidemiological study of eating and activity in teens where 80% of individuals had at least one missing variable.
2) MIRL combines penalized regression, multiple imputation, and stability selection. Simulation studies show MIRL outperforms alternatives in high-dimensional scenarios, especially when variable correlation and missing proportions are high.
3) MIRL achieved improved performance over other methods when applied to the teen eating and activity study, predicting outcomes for boys and girls separately and for a subgroup of low SES Asian boys at high risk for obesity.
1) The document proposes a new method called Multiple Imputation Random Lasso (MIRL) to perform variable selection and prediction on incomplete high-dimensional data, such as from an epidemiological study of eating and activity in teens where 80% of individuals had at least one missing variable.
2) MIRL combines penalized regression, multiple imputation, and stability selection. Simulation studies show MIRL outperforms alternatives in high-dimensional scenarios, especially when variable correlation and missing proportions are high.
3) MIRL achieved improved performance over other methods when applied to the teen eating and activity study, predicting outcomes for boys and girls separately and for a subgroup of low SES Asian boys at high risk for obesity.
1) The document proposes a new method called Multiple Imputation Random Lasso (MIRL) to perform variable selection and prediction on incomplete high-dimensional data, such as from an epidemiological study of eating and activity in teens where 80% of individuals had at least one missing variable.
2) MIRL combines penalized regression, multiple imputation, and stability selection. Simulation studies show MIRL outperforms alternatives in high-dimensional scenarios, especially when variable correlation and missing proportions are high.
3) MIRL achieved improved performance over other methods when applied to the teen eating and activity study, predicting outcomes for boys and girls separately and for a subgroup of low SES Asian boys at high risk for obesity.
B Y Y ING L IU , Y UANJIA WANG , YANG F ENG AND M ELANIE M. WALL
Columbia University We propose a Multiple Imputation Random Lasso (MIRL) method to select important variables and to predict the outcome for an epidemiologi- cal study of Eating and Activity in Teens. In this study 80% of individuals have at least one variable missing. Therefore, using variable selection meth- ods developed for complete data after listwise deletion substantially reduces prediction power. Recent work on prediction models in the presence of in- complete data cannot adequately account for large numbers of variables with arbitrary missing patterns. We propose MIRL to combine penalized regres- sion techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and it has greater advantage when the correlation among variables is high and missing proportion is high. MIRL is shown to have improved perfor- mance when comparing with other applicable methods when applied to the study of Eating and Activity in Teens for the boys and girls separately, and to a subgroup of low social economic status (SES) Asian boys who are at high risk of developing obesity.
1. Motivating example. In large epidemiological studies, accurately predict-
ing outcomes and selecting variables important for explaining the outcomes are two main research goals. One commonly encountered complication in these stud- ies is missing data due to subjects’ loss to follow up or nonresponses. It is not straightforward to handle missing data when performing variable selection since most existing variable selection approaches require complete data. Our motivating study is the Eating and Activity in Teens (Project EAT) with a focus of identifying risk and protective factors for adolescent obesity [Larson et al. (2013), Neumark-Sztainer et al. (2012)]. A primary research goal is to identify the most important household, family, peer, school and neighborhood environmental characteristics predicting a teenagers’ weight status in order to provide recom- mendations for potential prevention strategies. A strength of Project EAT is the
Received July 2014; revised December 2015.
1 The Project EAT was supported by grant number R01HL084064 (D. Neumark-Sztainer, principal investigator) from the National Heart, Lung, and Blood Institute. Supported in part by NIH Grants NS073671, NS082062 and NSF Grant DMS-13-08566. Key words and phrases. Missing data, random lasso, multiple imputation, variable selection, sta- bility selection, variable ranking. 418