Mathematics Ph.D. Dissertations

Title

Energy Distance Correlation with Extended Bayesian Information Criteria for Feature Selection in High Dimensional Models

Date of Award

2021

Document Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Statistics

First Advisor

Hanfeng Chen (Advisor)

Second Advisor

Yuning Fu (Committee Member)

Third Advisor

Wei Ning (Committee Member)

Fourth Advisor

Maria Rizzo (Committee Member)

Abstract

In this research, we investigate the sequential lasso method for feature selection in sparse high-dimensional linear models, recently proposed by Luo and Chen (2014). We propose a new method that replaces the ordinary correlation in Luo and Chen's algorithm with the energy distance correlation of Szekely et al. (2007). We continue to adopt the extended Bayesian Information Criteria (EBIC) as the stopping criterion in the computing algorithm. The advantage of energy distance correlation is that it can detect both linear and non-linear association between two variables, whereas the ordinary correlation can detect only the linear part of the association. As a result, the new method appears to be more powerful than Luo and Chen's method for feature selection. This is demonstrated by simulation studies and illustrated by two real-life examples. The proposed algorithm is also shown to be selection consistent.
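The contrast between the two correlation measures can be illustrated with a short sketch. It is not taken from the dissertation: it computes the sample distance correlation from double-centered pairwise distance matrices, following the definition in Szekely et al. (2007), and the function names, the simulated data and the sample size are assumptions chosen for illustration.

```python
import numpy as np

def centered_distance_matrix(v):
    """Double-centered matrix of pairwise Euclidean distances."""
    v = np.asarray(v, dtype=float)
    v = v.reshape(v.shape[0], -1)  # treat 1-D input as n observations of 1 variable
    d = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=2))
    # Subtract row and column means, then add back the grand mean.
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation between x and y (value in [0, 1])."""
    a = centered_distance_matrix(x)
    b = centered_distance_matrix(y)
    dcov2 = (a * b).mean()                               # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())     # product of distance std devs
    return float(np.sqrt(max(dcov2, 0.0) / denom)) if denom > 0 else 0.0

# Illustrative data (an assumption): y depends on x only through x squared,
# so the association is purely non-linear.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
y = x ** 2

pearson = float(np.corrcoef(x, y)[0, 1])
dcor = distance_correlation(x, y)
print(f"Pearson: {pearson:.3f}, distance correlation: {dcor:.3f}")
```

In this example the ordinary correlation is close to zero because the dependence is symmetric around zero, while the distance correlation is clearly positive.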

In the first part of our research, we examine through simulations the model sizes selected by Adaptive Lasso and SCAD after the data are first screened with the sure screening method based on distance correlation proposed by Li et al. (2012). We observe that the average selected model size is quite high.

In the second part, we describe the new sequential variable selection method, which we call energy distance correlation with extended Bayesian Information Criteria (Edc+EBIC). At each stage of the sequential procedure, we select the predictor variable that maximizes the energy distance correlation with the response. Once a variable has been selected, its contribution to the response is removed so that it cannot be selected again at a later stage. The active set of selected variables is updated each time a variable is selected, and the EBIC of the set is calculated. The process stops when the EBIC of the current active set is greater than the EBIC of the previous active set. We compare the performance of Edc+EBIC with sequential Lasso, Adaptive Lasso, SCAD and SIS+SCAD. We observe that, on average, our proposed method has a positive discovery rate close to 100%, a low false discovery rate and an average model size in line with what is expected under our simulation set-up.
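The sequential procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the contribution of selected variables is removed by taking least-squares residuals of the response on the active set, the EBIC is taken in its usual linear-model form n log(RSS/n) + k log(n) + 2 gamma log C(p, k), the first candidate is compared against the empty model, and the names `ebic`, `edc_ebic_select` and the default `gamma=1.0` are hypothetical.

```python
import numpy as np
from math import lgamma, log

def _centered(v):
    """Double-centered pairwise distance matrix of a 1-D sample."""
    d = np.abs(v[:, None] - v[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    a, b = _centered(np.asarray(x, float)), _centered(np.asarray(y, float))
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return float(np.sqrt(max((a * b).mean(), 0.0) / denom)) if denom > 0 else 0.0

def residuals(y, X, active):
    """Response with the least-squares contribution of the active set removed."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in active])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ beta

def ebic(y, X, active, gamma=1.0):
    """Assumed form: n*log(RSS/n) + k*log(n) + 2*gamma*log(C(p, k))."""
    n, p = X.shape
    k = len(active)
    r = residuals(y, X, active)
    log_binom = lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)
    return n * log(float(r @ r) / n) + k * log(n) + 2.0 * gamma * log_binom

def edc_ebic_select(X, y, gamma=1.0):
    n, p = X.shape
    active = []
    current = ebic(y, X, active, gamma)      # EBIC of the empty model (assumption)
    while len(active) < min(n - 2, p):
        r = residuals(y, X, active)          # remove contribution of selected variables
        candidates = [j for j in range(p) if j not in active]
        scores = [distance_correlation(X[:, j], r) for j in candidates]
        best = candidates[int(np.argmax(scores))]
        proposed = ebic(y, X, active + [best], gamma)
        if proposed > current:               # stop when EBIC increases
            break
        active.append(best)
        current = proposed
    return active

# Illustrative simulated data (an assumption): only columns 2 and 7 are relevant.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.5 * rng.normal(size=200)
selected = edc_ebic_select(X, y)
print("selected columns:", sorted(selected))
```

Because the stopping rule uses a linear-model EBIC, this sketch rewards a variable only to the extent that it reduces the residual sum of squares of a linear fit; how the dissertation handles non-linear contributions at that step is not stated in the abstract.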
