Mathematics Ph.D. Dissertations

Feature Screening for High-Dimensional Variable Selection in Generalized Linear Models

Date of Award

2021

Document Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Statistics

First Advisor

Junfeng Shang (Advisor)

Second Advisor

Emily Pence Brown (Committee Member)

Third Advisor

Hanfeng Chen (Committee Member)

Fourth Advisor

Wei Ning (Committee Member)

Abstract

Over the past few decades, high-dimensional data have been widely encountered in a great variety of areas such as bioinformatics, medicine, marketing, and finance. The curse of high dimensionality presents both methodological and computational challenges. Many traditional statistical modeling techniques perform well for low-dimensional data, but their performance begins to deteriorate when they are extended to high-dimensional data. Among all modeling techniques, variable selection plays a fundamental role in high-dimensional data modeling.

To deal with the high-dimensionality problem, a large number of variable selection approaches based on regularization have been developed, including but not limited to LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), and the Dantzig selector (Candes and Tao, 2007). However, as the dimensionality grows, those regularization approaches may not perform well due to the simultaneous challenges in computational expediency, statistical accuracy, and algorithmic stability (Fan et al., 2009). To address those challenges, a series of feature screening procedures has been proposed. Sure independence screening (SIS), which is based on the Pearson correlation, is a well-known procedure for variable selection in linear models with high- and ultrahigh-dimensional data (Fan and Lv, 2008). Yet the original SIS procedure mainly focused on linear models with a continuous response variable. Fan and Song (2010) extended this method to generalized linear models by ranking the maximum marginal likelihood estimator (MMLE) or the maximum marginal likelihood itself. In this dissertation, we consider extending the SIS procedure to high-dimensional generalized linear models with a binary response variable.

We propose a two-stage feature screening procedure for generalized linear models with a binary response based on the point-biserial correlation. The point-biserial correlation is an estimate of the correlation between one continuous variable and one binary variable. The two-stage point-biserial sure independence screening (PB-SIS) can be implemented as straightforwardly as the original SIS procedure, but it is aimed more specifically at high-dimensional generalized linear models with a binary response variable. In the first stage, we perform the SIS procedure using the point-biserial correlation to reduce the high dimensionality of a model to a moderate size. In the second stage, we apply a regularization method, such as LASSO, SCAD, or MCP, to further select important variables and find the final sparse model.
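The first screening stage described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the dissertation's implementation: the function names `point_biserial` and `pb_sis_screen`, the 0/1 coding of the response, and the column-wise data layout are all assumptions made for the example. The submodel size `d` is left to the caller; a common choice in the SIS literature is the integer part of n / log(n). The second stage (LASSO, SCAD, or MCP on the retained predictors) is not shown.

```python
import math


def point_biserial(x, y):
    """Point-biserial correlation between a continuous x and a binary y (coded 0/1).

    Uses r_pb = (mean1 - mean0) / s_x * sqrt(n1 * n0) / n, where s_x is the
    standard deviation of x computed with divisor n. This equals the Pearson
    correlation between x and the 0/1 indicator y.
    """
    n = len(x)
    group1 = [xi for xi, yi in zip(x, y) if yi == 1]
    group0 = [xi for xi, yi in zip(x, y) if yi == 0]
    n1, n0 = len(group1), len(group0)
    if n1 == 0 or n0 == 0:
        return 0.0  # correlation is undefined with one empty group; treat as no signal
    mean_x = sum(x) / n
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    if sd_x == 0.0:
        return 0.0  # constant predictor carries no information
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    return (mean1 - mean0) / sd_x * math.sqrt(n1 * n0) / n


def pb_sis_screen(columns, y, d):
    """Stage one (hypothetical helper): keep the d predictors with the largest |r_pb|.

    `columns` is a list of predictor columns, each a list of length n.
    Returns the retained column indices in ascending order.
    """
    scores = [abs(point_biserial(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


# Tiny usage example: columns 0 and 2 separate the two classes, column 1 does not.
y = [0, 0, 1, 1]
columns = [[1, 2, 3, 4], [1, -1, 1, -1], [4, 3, 2, 1]]
kept = pb_sis_screen(columns, y, d=2)
```

Ranking by the absolute value of the correlation keeps predictors that are strongly associated with the response in either direction, which mirrors how the original SIS ranks marginal Pearson correlations.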

We establish the sure screening property of the PB-SIS method, under certain conditions, for high-dimensional generalized linear models with a binary response variable. The sure screening property shows that our proposed method retains all the important variables in the screened submodel with probability very close to one.

We also conduct simulation studies for generalized linear models with a binary response variable by generating data from different link functions. To evaluate the performance of our proposed method after the first-stage screening, we compare it with the MMLE and Kolmogorov filter methods on two measures: the proportion P of submodels of size d, among 1000 simulations, that contain all the true predictors, and the computing time. We also compare the performance of the two-stage PB-SIS method under different penalized methods and different tuning parameter selection criteria. The simulation results demonstrate that PB-SIS outperforms the Kolmogorov filter methods in both selection accuracy and computational cost across different settings, and that it has almost the same selection accuracy as MMLE at a much lower computational cost. A real data application is given to illustrate the performance of the proposed two-stage PB-SIS method.
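As a rough illustration of the coverage measure P, the sketch below estimates the proportion of simulated data sets in which a point-biserial screen of size d retains every true predictor. It is not the dissertation's simulation design: the logit link, independent standard normal predictors, the common coefficient `beta`, the sample sizes, and all function names are assumptions chosen to keep the example small. The MMLE and Kolmogorov filter comparisons are not reproduced.

```python
import math
import random


def abs_point_biserial(x, y):
    """Absolute point-biserial correlation between continuous x and binary y (0/1)."""
    n = len(x)
    n1 = sum(y)
    n0 = n - n1
    if n1 == 0 or n0 == 0:
        return 0.0
    mean_x = sum(x) / n
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    if sd_x == 0.0:
        return 0.0
    mean1 = sum(v for v, c in zip(x, y) if c == 1) / n1
    mean0 = sum(v for v, c in zip(x, y) if c == 0) / n0
    return abs((mean1 - mean0) / sd_x * math.sqrt(n1 * n0) / n)


def simulate_once(rng, n, p, true_idx, beta, d):
    """Generate one data set and report whether the screen keeps all true predictors."""
    # Assumed design: p independent standard normal predictors, stored column-wise.
    cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    y = []
    for i in range(n):
        eta = sum(beta * cols[j][i] for j in true_idx)  # linear predictor
        prob = 1.0 / (1.0 + math.exp(-eta))             # logit link (assumed)
        y.append(1 if rng.random() < prob else 0)
    scores = [abs_point_biserial(col, y) for col in cols]
    kept = set(sorted(range(p), key=lambda j: scores[j], reverse=True)[:d])
    return set(true_idx) <= kept


def coverage_proportion(reps=100, n=100, p=50, true_idx=(0, 1),
                        beta=3.0, d=10, seed=0):
    """Estimate P: the share of replications whose size-d submodel is a superset of the truth."""
    rng = random.Random(seed)
    hits = sum(simulate_once(rng, n, p, true_idx, beta, d) for _ in range(reps))
    return hits / reps


# Small run for illustration; the abstract's studies use 1000 simulations.
P_hat = coverage_proportion(reps=20, seed=1)
```

Swapping the line that computes `prob` for another inverse link (for example the probit) is one way to mimic the "different link functions" mentioned above, under the same caveats.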
