Mathematics Ph.D. Dissertations

Title

Linear Mixed Model Selection by Partial Correlation

Date of Award

2020

Document Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Statistics

First Advisor

Junfeng Shang (Advisor)

Second Advisor

Andy Garcia (Committee Member)

Third Advisor

Hanfeng Chen (Committee Member)

Fourth Advisor

Craig Zirbel (Committee Member)

Abstract

Linear mixed models (LMMs) are commonly used when observations are not independent of each other but are instead clustered into two or more groups. In the LMM, the mean response for each subject is modeled by a combination of fixed effects and random effects. The fixed effects are characteristics shared by all individuals in the study; they are analogous to the coefficients of the linear model. The random effects are specific to each group or cluster and help describe the correlation structure of the observations. Because of this, linear mixed models are popular when multiple measurements are made on the same subject or when there is a natural clustering or grouping of observations. Our goal in this dissertation is to perform fixed effect selection in the high-dimensional linear mixed model.
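For orientation, the model described above can be written in the usual textbook form. The notation here (y_i, X_i, Z_i, beta, b_i, D, sigma^2) is standard LMM notation assumed for illustration; the abstract itself does not fix a notation.

```latex
% Response vector for group i = 1, ..., m (assumed notation)
\[
  y_i = X_i \beta + Z_i b_i + \varepsilon_i ,
  \qquad
  b_i \sim N(0, D), \quad
  \varepsilon_i \sim N(0, \sigma^2 I_{n_i}),
\]
% beta: fixed effects shared by all groups
% b_i: random effects specific to group i, independent of epsilon_i
```

Under these assumptions, observations within a group share b_i and are therefore correlated, while observations from different groups are independent.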

We define data as high-dimensional when the number of potential predictors is large relative to the sample size. High-dimensional data are common in genomic and other biological datasets. In the high-dimensional setting, selecting the fixed effect coefficients can be difficult due to the number of potential models to choose from. However, it is important to be able to do so in order to build models that are easy to interpret.

Many current techniques for fixed effect selection in the high-dimensional LMM are based on the penalized log likelihood. However, adding a penalty term to the log likelihood results in a non-convex optimization problem, which requires numerical methods to solve and a data-dependent tuning parameter to select the amount of regularization.

In contrast to the penalized likelihood, the partial correlation is based on the marginal measures of association between each predictor and the conditioned response. Techniques based on the partial correlation have two main advantages over those based on penalized likelihoods: no data-dependent tuning parameter is required to select the fixed effects, and the partial correlation is not influenced by strong correlation between covariates. In this dissertation we propose using the partial correlation between the covariates and the response variable conditioned on the random effects to select fixed effects in the LMM. This extends the partial correlation variable selection method developed by Bühlmann, Kalisch, and Maathuis (2010) to the linear mixed model. At the time of this writing, partial correlation has not been used to select fixed effects in the linear mixed model.
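As a rough illustration of the quantity involved, the sample partial correlation between a predictor and a response given a set of controlling variables can be computed by correlating least-squares residuals. The sketch below is not the dissertation's code; the function names `partial_corr` and `is_nonzero`, and the use of a Fisher z-test at a fixed significance level `alpha`, are assumptions made for this example.

```python
import numpy as np
from statistics import NormalDist


def partial_corr(x, y, Z=None):
    """Sample partial correlation of x and y given the columns of Z.

    Computed as the correlation of the residuals left after regressing
    x and y (with an intercept) on Z. With Z=None this reduces to the
    ordinary sample correlation.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.shape[0]
    design = np.ones((n, 1))  # intercept column centers the residuals
    if Z is not None:
        Z = np.asarray(Z, dtype=float).reshape(n, -1)
        design = np.hstack([design, Z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))


def is_nonzero(rho, n, k, alpha=0.05):
    """Fisher z-test of H0: partial correlation = 0.

    n is the sample size and k the number of controlling variables.
    """
    z = 0.5 * np.log((1 + rho) / (1 - rho))
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)
    return bool(np.sqrt(n - k - 3) * abs(z) > cutoff)
```

If two variables are associated only through a shared third variable, `partial_corr` given that third variable should be close to zero even when their marginal correlation is large.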

This dissertation proposes a two-stage procedure for selecting the fixed effects in the high-dimensional linear mixed model. In the first stage, we use the partial correlation to perform an initial fixed effect screening in order to estimate an initial linear mixed model. In the second stage, we use the initial linear mixed model to predict the values of the random effects using the Best Linear Unbiased Predictor (BLUP). These predicted values are used to condition the response variable by subtracting the group-specific random effects from the response. After conditioning on the random effects, the observations are effectively independent, and we select variables using the partial correlation between the covariates and the conditioned response. In this dissertation, we show that this procedure consistently selects fixed effects in the linear mixed model.
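The conditioning step in the second stage can be sketched for the simplest case, a random-intercept model. This is an illustration under strong assumptions, not the dissertation's implementation: the function name `condition_response` is hypothetical, the variance components `sigma2_b` and `sigma2_e` are treated as known, and the fixed effects are estimated by ordinary least squares rather than generalized least squares, so the predicted intercepts only approximate the BLUP.

```python
import numpy as np


def condition_response(y, X, groups, sigma2_b, sigma2_e):
    """Subtract predicted random intercepts from the response.

    Assumes y_ij = x_ij' beta + b_i + e_ij with Var(b_i) = sigma2_b and
    Var(e_ij) = sigma2_e, both treated as known. X holds the covariates
    kept by the initial screening stage.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    n = y.shape[0]

    # Initial fixed-effect fit (OLS here as a simplification)
    design = np.hstack([np.ones((n, 1)), X.reshape(n, -1)])
    beta = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ beta

    y_cond = y.copy()
    for g in np.unique(groups):
        idx = groups == g
        n_g = idx.sum()
        # Shrinkage factor of the random-intercept predictor
        shrink = n_g * sigma2_b / (sigma2_e + n_g * sigma2_b)
        # Predicted intercept = shrunken mean residual of the group
        y_cond[idx] -= shrink * resid[idx].mean()
    return y_cond
```

The returned conditioned response could then be passed, together with each covariate, to a partial correlation test to carry out the final selection.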

To use the partial correlation to select variables in the LMM, we require the assumption of partial faithfulness on the design matrix X. The partial faithfulness assumption in the LMM describes the relationship between the response conditioned on the random effects and the coefficients of the fixed effects of the LMM. Partial faithfulness in the LMM says that the fixed effect coefficient of a predictor is equal to zero if and only if the partial correlation between the conditioned response and that predictor is equal to zero for some set of controlling variables. We present theoretical results demonstrating that, when partial faithfulness holds for the LMM, this relationship between the partial correlation and the coefficients of the fixed effects holds.
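The equivalence stated above can be written compactly. The symbols are assumed for illustration and are not taken from the dissertation: \tilde{y} denotes the response conditioned on the random effects, X_j the j-th predictor, p the number of predictors, and S a set of controlling variables.

```latex
% Partial faithfulness in the LMM (assumed notation)
\[
  \beta_j = 0
  \quad\Longleftrightarrow\quad
  \rho\bigl(\tilde{y},\, X_j \mid X_S\bigr) = 0
  \ \text{ for some } S \subseteq \{1, \dots, p\} \setminus \{j\}.
\]
```

Read from right to left, a single vanishing partial correlation is enough to exclude predictor j, which is what allows selection to proceed by testing partial correlations.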

We investigate the performance of this method in a variety of simulated high-dimensional scenarios, including non-normal distributions of the random effects. We find that the method is effective at selecting the active set of variables even in the presence of many covariates. Through these simulations, we observe that the proposed technique selects variables quickly and with few false positives, especially in the case where the covariates are highly correlated with each other. We also apply the method to a real high-dimensional dataset regarding the production of riboflavin.
