Mathematics Ph.D. Dissertations


Innovations of random forests for longitudinal data

Date of Award


Document Type


Degree Name

Doctor of Philosophy (Ph.D.)



First Advisor

John Chen (Advisor)

Second Advisor

Junfeng Shang (Committee Member)

Third Advisor

Wei Ning (Committee Member)

Fourth Advisor

Andrew Gregory (Other)


Ensemble methods have gained attention over the past few decades and are effective tools in data mining. An example is the random forest, a state-of-the-art ensemble method proposed by Leo Breiman in 2001. Random forest is a powerful and widely used machine learning algorithm, popular for its prediction accuracy and its ability to handle large data sets. It has many attractive features and can be used for both regression and classification tasks. It also works well in high-dimensional settings, where the number of features p is larger than the number of observations (Cutler, Edwards, Beard, Cutler, Hess, Gibson, and Lawler, 2007). However, random forest performs poorly if a small number of informative variables are hidden among a large number of noise variables (Hastie, Tibshirani, and Friedman, 2001). The presence of noise variables adds to the computational burden and makes the forest more difficult to train, especially when the number of variables exceeds the number of observations.

This research proposes a new development for random forest. Our focus is on regression problems rather than classification problems. The central idea is to construct the random forest using only the features that are relevant to predicting the response. To achieve this, we incorporate feature selection, specifically the filter method, as a pre-processing step in the random forest algorithm. We use the Pearson product-moment correlation to identify the features that are correlated with the response, and we use only those selected features to construct the random forest.
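The filter step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dissertation's implementation: the abstract does not say how the correlations are turned into a selection rule, so the absolute-correlation cutoff `threshold=0.3`, the function names `pearson` and `select_features`, and the synthetic data are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(X, y, threshold=0.3):
    """Return indices of columns of X whose |correlation| with y is >= threshold.

    The cutoff rule is an assumption made for this sketch.
    """
    n_features = len(X[0])
    selected = []
    for j in range(n_features):
        column = [row[j] for row in X]
        if abs(pearson(column, y)) >= threshold:
            selected.append(j)
    return selected


# Synthetic data (assumed): 2 informative features hidden among 8 noise features.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(200)]
y = [2.0 * row[0] - 3.0 * row[1] + rng.gauss(0, 0.5) for row in X]

selected = select_features(X, y, threshold=0.3)

# Keep only the selected columns; this reduced matrix is what would be
# handed to a random forest regression routine in the next step.
X_reduced = [[row[j] for j in selected] for row in X]
```

The reduced matrix `X_reduced` would then be passed, together with `y`, to whichever random forest regression implementation is in use. Note that a Pearson filter only detects linear association, so a feature with a purely nonlinear effect on the response could be discarded by a rule of this kind.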

The proposed method begins by identifying the features that are useful in predicting the response and then constructs the random forest using only the selected features. One advantage of our approach is that only relevant features are used in the construction of the random forest, and simulation results show that combining feature selection with random forest regression achieves better prediction accuracy. Although this comes at the cost of losing some information, the convergence rate is faster since it depends only on the relevant variables. Our method is computationally efficient and compares favorably with other methods.

In addition, we extend the proposed method to longitudinal data. Longitudinal data arise when measurements are taken repeatedly on the same individual at different time points over a period of time (Fitzmaurice, Laird, and Ware, 2004). In our formulation, we construct a random forest for each time point and then investigate the effect of subsetting features prior to the construction of the random forest. The dissertation closes with a brief conclusion and a discussion of possible further improvements.
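One way to picture the longitudinal formulation is to split the data by time point and run the feature filter separately within each slice, so that each time point gets its own feature subset and, subsequently, its own forest. The sketch below is a hypothetical illustration only: the long-format record layout `(subject, time, features, response)`, the cutoff `threshold=0.3`, the helper names, and the synthetic data are assumptions, and the forest-fitting step itself is left to whatever random forest implementation is in use.

```python
import math
import random
from collections import defaultdict


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def split_by_time(records):
    """Group long-format rows (subject, time, features, response) by time point."""
    by_time = defaultdict(lambda: ([], []))
    for _subject, time, features, response in records:
        X_t, y_t = by_time[time]
        X_t.append(features)
        y_t.append(response)
    return dict(by_time)


def select_per_time(records, threshold=0.3):
    """Apply the correlation filter separately at each time point.

    Returns {time: [indices of selected features]}. The cutoff rule is an
    assumption made for this sketch.
    """
    result = {}
    for time, (X_t, y_t) in sorted(split_by_time(records).items()):
        n_features = len(X_t[0])
        result[time] = [
            j for j in range(n_features)
            if abs(pearson([row[j] for row in X_t], y_t)) >= threshold
        ]
    return result


# Synthetic data (assumed): 300 subjects, 3 time points, 6 features;
# the informative feature changes with the time point.
rng = random.Random(1)
records = []
for subject in range(300):
    for time in range(3):
        features = [rng.gauss(0, 1) for _ in range(6)]
        response = 2.0 * features[time] + rng.gauss(0, 0.5)
        records.append((subject, time, features, response))

selected_by_time = select_per_time(records, threshold=0.3)
```

For each time point, the columns listed in `selected_by_time[time]` would be the inputs to that time point's forest. Fitting the forests on all features as well would give the baseline needed to examine the effect of subsetting.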