Mathematics Ph.D. Dissertations

Title

Improving the Accuracy of Variable Selection Using the Whole Solution Path

Date of Award

2015

Document Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Statistics

First Advisor

Hanfeng Chen (Committee Chair)

Second Advisor

Peng Wang (Advisor)

Third Advisor

James Albert (Committee Member)

Fourth Advisor

Jonathan Bostic (Other)

Abstract

The performances of penalized least squares approaches profoundly depend on the selection of the tuning parameter; however, statisticians did not reach consensus on the criterion for choosing the tuning parameter. Moreover, the penalized least squares estimation that based on a single value of the tuning parameter suffers from several drawbacks. The tuning parameter selected by the traditional selection criteria such as AIC, BIC, CV tends to pick excessive variables, which results in an over-fitting model. On the contrary, many other criteria, such as the extended BIC that favors an over-sparse model, may run the risk of dropping some relevant variables in the model.

In the dissertation, a novel approach for the feature selection based on the whole solution paths is proposed, which significantly improves the selection accuracy. The key idea is to partition the variables into the relevant set and the irrelevant set at each tuning parameter, and then select the variables which have been classified as relevant for at least one tuning parameter. The approach is named as Selection by Partitioning the Solution Paths (SPSP). Compared with other existing feature selection approaches, the proposed SPSP algorithm allows feature selection by using a wide class of penalty functions, including Lasso, ridge and other strictly convex penalties.

Based on the proposed SPSP procedure, a new type of scores are presented to rank the importance of the variables in the model. The scores, noted as Area-out-of-zero-region Importance Scores (AIS), are defined by the areas between the solution paths and the boundary of the partitions over the whole solution paths. By applying the proposed scores in the stepwise selection, the false positive error of the selection is remarkably reduced.

The asymptotic properties for the proposed SPSP estimator have been well established. It is showed that the SPSP estimator is selection consistent when the original estimator is either estimation consistent or selection consistent. Specially, the SPSP approach on the Lasso has been proved to be consistent over the whole solution paths under the irrepresentable condition.

Additionally, a number of simulation studies have been conducted to illustrate the performance of the proposed approachs. The comparison between the SPSP algorithm and the existing selection criteria on the Lasso, the adaptive Lasso, the SCAD and the MCP were provided. The results showed the proposed method outperformed the existing variable selection methods in general.

Finally, two real data examples of identifying the informative variables in the Boston housing data and the glioblastoma gene expression data are given. Compared with the models selected by other existing approaches, the models selected by the SPSP procedure are much simpler with relatively smaller model errors.

COinS