Mathematics Ph.D. Dissertations


Two-Stage SCAD Lasso for Linear Mixed Model Selection

Date of Award


Document Type


Degree Name

Doctor of Philosophy (Ph.D.)



First Advisor

Junfeng Shang (Advisor)

Second Advisor

Gabriel Matney (Other)

Third Advisor

Andrew Layden (Committee Member)

Fourth Advisor

Wei Ning (Committee Member)


The linear regression model is the classical approach to explaining the relationship between a response (dependent) variable and predictors (independent variables). However, as the number of predictors in the data increases, so does the likelihood of correlation among the predictors, which is problematic. To address this, the linear mixed effects model was proposed; it consists of a fixed effects term and a random effects term. The fixed effects term represents the traditional linear regression coefficients, and the random effects term represents values that are drawn randomly from the population. Thus, the linear mixed model allows us to represent the mean as well as the covariance structure of the data in a single model.
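For orientation, the linear mixed model described here is commonly written as follows. The notation is an assumption taken from the standard formulation and is not given in the abstract: y_i is the response vector for subject i, X_i and Z_i are the fixed and random effects design matrices, D is the covariance matrix of the random effects, and b_i and the errors are assumed independent.

```latex
y_i = X_i \beta + Z_i b_i + \varepsilon_i, \qquad
b_i \sim N(0, D), \qquad
\varepsilon_i \sim N(0, \sigma^2 I_{n_i}), \qquad i = 1, \dots, m

\operatorname{E}(y_i) = X_i \beta, \qquad
\operatorname{Var}(y_i) = Z_i D Z_i^{\top} + \sigma^2 I_{n_i}
```

The second line shows how the fixed effects determine the mean while the random effects shape the covariance of the observations within a subject.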

As the dimensions of the fixed and random effects terms increase, selecting an appropriate model, one that provides the optimal fit, becomes increasingly difficult. Because of this complexity inherent in the linear mixed model, in this dissertation we propose a two-stage method for selecting fixed and random effects terms.

In the first stage, we select the most significant fixed effects in the model based on the conditional distribution of the response variable given the random effects. This is achieved by minimizing a penalized least squares criterion with a SCAD Lasso penalty term. We use the Newton-Raphson optimization algorithm to compute the parameter estimates. In this process, the coefficients of the unimportant predictors shrink to exactly zero, thus eliminating the noise from the model.
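To illustrate how a penalty of this kind can set coefficients to exactly zero, the following is a minimal sketch of the standard SCAD penalty from the literature and its closed-form thresholding rule for a single coefficient. It is an assumption that this form is relevant: the abstract does not define the exact SCAD Lasso penalty used in the dissertation, and the function names `scad_penalty` and `scad_threshold`, as well as the default a = 3.7, are illustrative choices only.

```python
def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty evaluated at |theta| (assumes lam > 0, a > 2)."""
    t = abs(theta)
    if t <= lam:
        # Lasso-like linear region near zero
        return lam * t
    if t <= a * lam:
        # Quadratic transition region
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Constant region: large coefficients receive no additional penalty
    return lam * lam * (a + 1) / 2


def scad_threshold(z, lam, a=3.7):
    """Minimizer of 0.5 * (z - b)**2 + scad_penalty(b, lam, a) over b.

    This is the one-dimensional (orthonormal design) case only; it is not
    the Newton-Raphson procedure mentioned in the abstract.
    """
    sign = 1.0 if z >= 0 else -1.0
    t = abs(z)
    if t <= 2 * lam:
        # Soft thresholding: small estimates are set to exactly zero
        return sign * max(t - lam, 0.0)
    if t <= a * lam:
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    # Large estimates are left unshrunk
    return z


if __name__ == "__main__":
    for z in (0.5, 1.5, 3.0, 10.0):
        print(z, scad_threshold(z, lam=1.0))
```

Small inputs are mapped to zero, moderate inputs are shrunk, and large inputs pass through unchanged, which is the behavior that distinguishes SCAD from the plain Lasso.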

In the second stage, we choose the most important random effects by maximizing the penalized profile log-likelihood function. This maximization is carried out with the Newton-Raphson optimization algorithm. As in the first stage, the penalty term is SCAD Lasso. Unlike the fixed effects, the random effects are drawn randomly from the population; hence, they need to be predicted. This prediction is done by estimating the diagonal elements (variances) of the covariance structure of the random effects. During this step, the variance components corresponding to unimportant random effects shrink to exactly zero, just as the fixed effects parameters do in the first stage. In this way, noise is eliminated from the model and only the significant effects are retained, which completes the selection of the random effects.
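The link between a zero variance component and the removal of a random effect can be checked numerically. The sketch below assumes a diagonal random effects covariance and independent errors with variance sigma2, so that the marginal covariance for one subject is Z diag(d) Z^T + sigma2 I; the helper `marginal_cov` and the toy design matrix are hypothetical and are not taken from the dissertation.

```python
def marginal_cov(Z, d, sigma2):
    """Return V = Z diag(d) Z^T + sigma2 * I for one subject (lists of lists)."""
    n = len(Z)
    q = len(d)
    V = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            V[r][c] = sum(Z[r][k] * d[k] * Z[c][k] for k in range(q))
            if r == c:
                V[r][c] += sigma2
    return V


# Toy design: random intercept and random slope at times 0, 1, 2
Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]

# The slope variance has been shrunk to exactly zero
V_full = marginal_cov(Z, d=[2.0, 0.0], sigma2=1.0)

# The same model with the slope column dropped altogether
Z_reduced = [[row[0]] for row in Z]
V_reduced = marginal_cov(Z_reduced, d=[2.0], sigma2=1.0)

# Both models imply the same covariance for the response
print(V_full == V_reduced)
```

Because the two covariance matrices coincide, a random effect whose variance is estimated as zero contributes nothing to the model, and setting the variance to zero is equivalent to deselecting that effect.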

For both stages of the proposed approach, it is shown that the unimportant effects are eliminated with probability tending to one. This indicates that the proposed method consistently identifies all true effects, fixed as well as random. It is also shown that the proposed method satisfies the oracle properties, namely asymptotic normality and sparsity. At the end of these two stages, we have the optimal linear mixed model, which can be readily applied to correlated data.
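In their generic textbook form, the two oracle properties named above can be stated as follows. This is a sketch under assumed notation, not the dissertation's theorem: A denotes the index set of the truly nonzero parameters, n a sample size, and Sigma_A a limiting covariance matrix; the exact rate, conditions, and covariance established in the dissertation are not given in the abstract.

```latex
% Sparsity: the truly zero parameters are estimated as exactly zero
P\bigl(\hat{\beta}_{A^{c}} = 0\bigr) \longrightarrow 1
\quad \text{as } n \to \infty

% Asymptotic normality of the estimates of the nonzero parameters
\sqrt{n}\,\bigl(\hat{\beta}_{A} - \beta_{A}\bigr)
\xrightarrow{\;d\;} N\bigl(0, \Sigma_{A}\bigr)
```

Together these say that, asymptotically, the estimator behaves as if the true set of nonzero parameters had been known in advance.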

To test the overall effectiveness of the proposed approach, four simulation studies are conducted. Each scenario has a different number of subjects, a different number of observations per subject, and a different covariance structure from which the data are generated. The simulation results illustrate that the proposed method can effectively select the fixed effects and random effects in the linear mixed model. In the simulations, the proposed method is also compared with other model selection methods, and the results show that it performs better in choosing the true model. Subsequently, two applications, the Amsterdam Growth and Health Study data (Kemper, 1995) and the Messier 69 data, an astronomy application (Husband, 2017), are used to investigate how the proposed approach behaves with real-life data. In both applications, the proposed method is compared with other methods and proves to be more effective than its counterparts in identifying the appropriate mixed model.