Mathematics Ph.D. Dissertations


Influence of Correlation and Missing Data on Sample Size Determination in Mixed Models

Date of Award


Document Type


Degree Name

Doctor of Philosophy (Ph.D.)



First Advisor

Junfeng Shang (Advisor)

Second Advisor

Mark Earley (Committee Member)

Third Advisor

Hanfeng Chen (Committee Member)

Fourth Advisor

John Chen (Committee Member)


Sample size determination plays an important role in clinical trials. At the design stage, we must decide how much data is needed to reach the desired accuracy in the subsequent hypothesis testing. For a stated significance level, the accuracy of a hypothesis test depends largely on its power: the probability of correctly detecting a significant difference when one exists, which indicates how trustworthy the test's decision is in that case. Test power and sample size are closely related. In general, power increases with the sample size, but the exact dependence of power on sample size is complicated to pin down. Moreover, missing data and correlated measurements, both of which are common in clinical trials, further complicate the relationship between sample size and test power. It is thus crucial to assess and analyze the effects of missing data and of the correlation structure of the measurements on sample size and test power.

In this dissertation, we focus on estimating an adequate sample size for testing means in longitudinal data with missing observations. Longitudinal data are common in clinical trials, where measurements are recorded repeatedly over time for each subject. Measurements from different subjects are mutually independent, while repeated measurements at different time points within the same subject are correlated. We derive formulas for determining the sample size under two correlation structures, compound symmetry and autoregressive of order one (AR(1)), assuming that the data are missing completely at random (MCAR). The generalized estimating equations (GEE) method for the analysis of longitudinal data, proposed by Liang and Zeger (1986), is applied to estimate the parameters in the mixed models. By introducing a working correlation matrix, the GEE methodology yields consistent estimators of the parameters and of their variances. Under the MCAR assumption, the GEE estimators remain consistent regardless of the choice of working correlation matrix. The sample size estimation procedure uses all the available data and accounts for both the within-subject correlation and the missing data, while the derived formulas stay simple. Simulation studies were carried out to evaluate the performance of the formulas through empirical power. The simulation results show that the derived sample size formulas effectively reflect the influence of correlation structures and missing data on sample size and test power.
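
To make the two correlation structures concrete, the following is a minimal sketch, not the dissertation's derived formulas. It builds compound symmetry and autoregressive order one correlation matrices and plugs them into a textbook normal-approximation sample size for comparing two group means averaged over m time points. It assumes complete data (no missing observations), a common variance, and a two-sided test. All function and parameter names (`cs_corr`, `ar1_corr`, `mean_variance_factor`, `n_per_group`, `delta`, `sigma`) are hypothetical and chosen for illustration only.

```python
import math
from statistics import NormalDist


def cs_corr(m, rho):
    """Compound symmetry: every pair of distinct time points has correlation rho."""
    return [[1.0 if j == k else rho for k in range(m)] for j in range(m)]


def ar1_corr(m, rho):
    """AR(1): correlation decays as rho ** |j - k| with the lag between time points."""
    return [[rho ** abs(j - k) for k in range(m)] for j in range(m)]


def mean_variance_factor(corr):
    """Var(subject's time-averaged mean) / sigma^2 = (sum of all entries of R) / m^2."""
    m = len(corr)
    return sum(sum(row) for row in corr) / (m * m)


def n_per_group(delta, sigma, corr, alpha=0.05, power=0.80):
    """Subjects per group to detect a mean difference delta (illustrative only).

    Assumption: standard two-sample normal approximation with complete data;
    this is NOT the missing-data formula derived in the dissertation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    factor = mean_variance_factor(corr)
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 * factor / delta ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Same rho and number of time points, different correlation structures.
    print("CS   :", n_per_group(0.5, 1.0, cs_corr(4, 0.5)))
    print("AR(1):", n_per_group(0.5, 1.0, ar1_corr(4, 0.5)))
```

Under these assumptions, stronger within-subject correlation inflates the variance of each subject's averaged response, so more subjects are needed. For the same rho, the AR(1) structure requires no more subjects than compound symmetry because its correlations decay with the time lag.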