Mathematics Ph.D. Dissertations

Title

Estimating the Proportion of True Null Hypotheses in Multiple Testing Problems

Date of Award

2016

Document Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Statistics

First Advisor

Hanfeng Chen (Advisor)

Second Advisor

John Laird (Other)

Third Advisor

John Chen (Committee Member)

Fourth Advisor

Junfeng Shang (Committee Member)

Abstract

The problem of estimating the proportion, π0, of true null hypotheses in a multiple testing problem is important when a large number of parallel hypothesis tests are performed independently. While π0 is a quantity of interest in its own right in many applications, a reliable estimate of π0 is crucial when we want to assess and/or control the false discovery rate in a multiple testing problem.

In this dissertation, we investigate the estimation problem coupled with assessing/controlling the false discovery rate. The dissertation develops a new estimating procedure under the two-component mixture model. The components of the mixture are the null and alternative distributions, with mixing proportions π0 and 1 − π0 respectively, where π0 is the unknown proportion to be estimated. To address this problem, we establish an innovative non-parametric maximum likelihood estimator of the density of the p-values, restricting the alternative to the multinomial distribution family with k categories.
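As a rough illustration of the model described above, the mixture can be written out as follows. This is a sketch under two assumptions that the abstract does not state: that p-values under the null are uniform on [0, 1], and that the k categories are equal-width cells. The symbols f_0, f_1, q_j and θ_j are notation introduced here for illustration, not the dissertation's own.

```latex
% Two-component mixture density of a p-value (f_0 = null, f_1 = alternative)
f(p) = \pi_0 f_0(p) + (1 - \pi_0) f_1(p), \qquad 0 \le p \le 1,
\qquad f_0(p) \equiv 1 \ \text{(assumed uniform null)}.

% After discretizing into k equal-width cells, with the alternative
% restricted to a multinomial with cell probabilities \theta_1, ..., \theta_k:
q_j = \Pr(\text{p-value falls in cell } j)
    = \frac{\pi_0}{k} + (1 - \pi_0)\,\theta_j,
\qquad \theta_j \ge 0, \quad \sum_{j=1}^{k} \theta_j = 1 .
```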

To apply this approach, we need to settle two things first: (a) select an integer k, and (b) convert the continuous-type observations (p-values) into discrete data with k categories. As many authors have noticed, in applications, the p-values are highly skewed, so we recommend Sturges' rule modified for skewness in determining k.
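The two preliminary steps can be sketched in a few lines of Python. One well-known skewness adjustment of Sturges' rule is Doane's formula; whether the dissertation uses this exact form is an assumption, as are the equal-width cells and the function names `doane_bin_count` and `bin_p_values`, which are hypothetical helpers written for this illustration.

```python
import math


def doane_bin_count(values):
    """Number of categories k from Doane's formula (assumed variant of
    Sturges' rule adjusted for skewness):
        k = 1 + log2(n) + log2(1 + |g1| / sigma_g1)
    where g1 is the sample skewness and
        sigma_g1 = sqrt(6 (n - 2) / ((n + 1) (n + 3))).
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    # With zero variance the skewness is undefined; fall back to g1 = 0,
    # which reduces the formula to plain Sturges' rule.
    g1 = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    sigma_g1 = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    return math.ceil(1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1))


def bin_p_values(p_values, k):
    """Convert continuous p-values into counts over k equal-width cells
    on [0, 1] (equal widths are an assumption of this sketch)."""
    counts = [0] * k
    for p in p_values:
        if not 0.0 <= p <= 1.0:
            raise ValueError("p-values must lie in [0, 1]")
        # p == 1.0 would index cell k, so clamp it into the last cell.
        counts[min(int(p * k), k - 1)] += 1
    return counts
```

A typical use would be `k = doane_bin_count(p_values)` followed by `counts = bin_p_values(p_values, k)`; the resulting counts are the discrete data to which a multinomial likelihood could then be fitted.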

We then propose an iterative optimization technique, the EM algorithm, to compute the maximum likelihood estimate under this model, which serves as an approximation to the maximum likelihood estimate of π0. Simulation studies are conducted to assess the performance of the proposed procedure. The simulation results show that our proposed procedure performs significantly better than the existing procedures. The new procedure is applied to the leukemia gene expression dataset and the inherited breast cancer cDNA dataset, both of which have been analyzed by many other statisticians. Again, our procedure provides an overall satisfactory performance.
