Mathematics Ph.D. Dissertations

Methodology for Estimation and Model Selection in High-Dimensional Regression with Endogeneity

Date of Award

2023

Document Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Statistics

First Advisor

Junfeng Shang (Committee Chair)

Second Advisor

Meagan Docherty (Other)

Third Advisor

John Chen (Committee Member)

Fourth Advisor

Wei Ning (Committee Member)

Abstract

Over the past few decades, high-dimensional data have become common in areas such as the medical and biological sciences, economics, and marketing research, and the need for statistical techniques to model such data has grown accordingly. Model selection is an important aspect of high-dimensional statistical modeling. Its purpose is to select the most appropriate model from all candidate models when the number of explanatory variables is larger than the sample size. Endogeneity is a challenging issue in high-dimensional model selection. Endogeneity arises when a predictor variable (X) in a regression model is related to the model error term (ϵ), which results in inconsistent model selection. By contrast, an exogeneity assumption states that a predictor variable (X) in a regression model is not related to the model error term (ϵ). Fan and Liao (2014) pointed out that, because of endogeneity, the exogeneity assumptions underlying most statistical methods cannot be validated in high-dimensional model selection. To avoid the effect of endogeneity, Fan and Liao (2014) proposed the focused generalized method of moments (FGMM) approach for consistently selecting significant variables in high-dimensional linear models with endogeneity. We propose a modified FGMM approach for high-dimensional linear and nonlinear models with endogeneity to choose all of the significant variables. The theorems in Fan and Liao (2014) show that the FGMM approach consistently chooses the true model as the sample size goes to infinity in both linear and nonlinear models. In linear models with endogeneity, we modify the penalty term to improve selection performance. In nonlinear models with endogeneity, we adjust the loss function in the FGMM approach to achieve model selection consistency, that is, selection of the true model as the sample size n goes to infinity.
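As a reading aid, the exogeneity and endogeneity conditions described above can be written in one common form. The notation below (response y, predictors X_1, ..., X_p, true coefficient vector β_0, mean-zero error ϵ) is assumed for illustration, and the dissertation's exact conditions may be stated differently.

```latex
% Illustrative notation (an assumption, not taken from the dissertation):
% y is the response, X = (X_1, ..., X_p) the predictors, \beta_0 the true coefficients,
% and \epsilon a mean-zero error term.
\[
  y = X^{\top}\beta_0 + \epsilon, \qquad p > n .
\]
% Exogeneity: every predictor is uncorrelated with the error.
\[
  \mathbb{E}[\epsilon X_j] = 0 \quad \text{for all } j = 1, \dots, p .
\]
% Endogeneity: at least one predictor is correlated with the error.
\[
  \mathbb{E}[\epsilon X_j] \neq 0 \quad \text{for some } j .
\]
```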
The modified approach uses instrumental variables to satisfy an exogeneity assumption, so that the most appropriate model is selected consistently. An instrumental variable W is a variable that is correlated with the independent variable X and uncorrelated with the error term ϵ. In other words, instrumental variables do not suffer from endogeneity. In the modified approach, instrumental variables are used to construct the loss function and the penalized objective function for consistently selecting the significant variables in the model. Further, the modified approach performs model selection and estimation simultaneously. Simulations for high-dimensional linear and nonlinear models with endogeneity are conducted to illustrate the performance of the modified approach. In the simulations, we compare the performance of the modified FGMM approach with that of the penalized least squares method under a variety of penalty functions, such as Lasso, Adaptive Lasso, SCAD, and MCP, in selecting significant variables in the optimal model. The simulation results demonstrate that the modified FGMM approach has better model selection performance and higher estimation accuracy than the penalized least squares method in high-dimensional linear and nonlinear models. The simulation results also indicate that using other penalty terms, such as Adaptive Lasso, SCAD, and MCP, can improve the estimation accuracy of the model parameters compared with the Lasso. A real-world example is employed to evaluate the effectiveness of the modified FGMM approach.
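The role of an instrumental variable can be seen in a small, low-dimensional sketch. This is not the FGMM approach or the dissertation's simulation design. It is a toy example with a single predictor, assuming a hypothetical data-generating process in which an unobserved factor u drives both X and ϵ. The function names (`simulate`, `ols_slope`, `iv_slope`) and all parameter values are invented for illustration.

```python
import numpy as np


def simulate(n=20000, beta=2.0, seed=0):
    """Draw (w, x, y) from a toy model with one endogenous predictor (assumed design)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=n)            # instrument: affects x, independent of the error
    u = rng.normal(size=n)            # unobserved factor shared by x and the error
    x = w + u + rng.normal(size=n)    # predictor depends on both w and u
    eps = u + rng.normal(size=n)      # error depends on u, so it is correlated with x
    y = beta * x + eps
    return w, x, y


def ols_slope(x, y):
    """Least-squares slope without intercept (all variables have mean zero)."""
    return float(x @ y / (x @ x))


def iv_slope(w, x, y):
    """Simple instrumental-variable slope: sample cov(w, y) / cov(w, x)."""
    return float(w @ y / (w @ x))


if __name__ == "__main__":
    w, x, y = simulate()
    print("true slope:", 2.0)
    print("OLS estimate (affected by endogeneity):", ols_slope(x, y))
    print("IV estimate (uses the instrument w):", iv_slope(w, x, y))
```

In this design the least-squares estimate stays away from the true slope however large n becomes, because x and the error share the factor u, while the instrument-based estimate settles near the true value. The high-dimensional setting of the abstract adds a penalty term and many candidate predictors on top of this basic idea.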
