Mathematics Ph.D. Dissertations


Bayesian Model Checking Methods for Dichotomous Item Response Theory and Testlet Models

Date of Award


Document Type


Degree Name

Doctor of Philosophy (Ph.D.)



First Advisor

James Albert (Advisor)

Second Advisor

Lynne Hewitt (Committee Member)

Third Advisor

Hanfeng Chen (Committee Member)

Fourth Advisor

Maria Rizzo (Committee Member)


The predominant model checking method used in Bayesian item response theory (IRT) models has been the posterior predictive (PP) method. In recent years, two new Bayesian model checking methods have been proposed that may be used as alternatives to the PP method. We refer to these as the prior-predictive posterior simulation (PPPS) method of Dey et al. (1998) and the pivotal discrepancy measure (PDM) method of Johnson (2007). These methods have been shown to be effective in other Bayesian models, but have never been implemented with Bayesian IRT models. It is of practical interest to see whether either of these two new methods performs better than the PP method in assessing aspects of fit in an IRT model setting.

In this dissertation, we compared the effectiveness of the PPPS and PDM model checking methods with that of the PP method in evaluating person fit in two-parameter normal ogive (2PN) IRT models and overall goodness-of-fit in 2PN testlet models. Two simulation studies were performed. The first study explored the performance of each method (PP, PPPS, and PDM) in assessing person fit, that is, the goodness-of-fit of an individual's set of test answers with the assumed Bayesian 2PN IRT model. Several classical person fit measures were employed under each method. We also introduced the sum of squared Bayesian latent residuals as a person fit measure. Four different types of person misfit were taken from the literature, and response data sets were simulated with certain examinees' responses following these violations. We found that for most of the measures, the PPPS and PDM methods outperformed the PP method in detecting the examinees' response patterns simulated to be aberrant under the model. In particular, the sum of squared Bayesian latent residuals proved to be a very effective measure under the PPPS method.
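
As a rough illustration of the baseline PP method for person fit, the sketch below computes a posterior predictive p-value for one examinee. It is not the dissertation's implementation. The parameterization Phi(a_j * theta - b_j), the chi-square-type discrepancy, and the helper make_stand_in_draws (which fabricates placeholder draws in place of real MCMC output) are all assumptions made for the example.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_correct(theta, a, b):
    """Per-item success probabilities Phi(a_j * theta - b_j).

    This 2PN parameterization is an assumption of the sketch.
    Probabilities are clamped away from 0 and 1 to keep the
    discrepancy finite.
    """
    eps = 1e-12
    return [min(max(normal_cdf(aj * theta - bj), eps), 1.0 - eps)
            for aj, bj in zip(a, b)]


def discrepancy(y, p):
    """Sum of squared standardized residuals (illustrative choice only)."""
    return sum((yj - pj) ** 2 / (pj * (1.0 - pj)) for yj, pj in zip(y, p))


def pp_pvalue(y_obs, draws, rng):
    """Share of posterior draws where D(y_rep) >= D(y_obs).

    draws is a list of (theta, a, b) tuples for one examinee.
    """
    exceed = 0
    for theta, a, b in draws:
        p = prob_correct(theta, a, b)
        # Replicate a response vector from the model at this draw.
        y_rep = [1 if rng.random() < pj else 0 for pj in p]
        if discrepancy(y_rep, p) >= discrepancy(y_obs, p):
            exceed += 1
    return exceed / len(draws)


def make_stand_in_draws(rng, n_draws, b_center):
    """Hypothetical placeholder for MCMC output: noisy draws around
    theta = 0, a_j = 1, and the given item difficulties."""
    draws = []
    for _ in range(n_draws):
        theta = rng.gauss(0.0, 0.2)
        a = [1.0 + rng.gauss(0.0, 0.05) for _ in b_center]
        b = [bj + rng.gauss(0.0, 0.05) for bj in b_center]
        draws.append((theta, a, b))
    return draws
```

A small p-value flags a response pattern as more extreme than the replicated patterns. For instance, an examinee of average ability who misses the easy items and answers the hard ones correctly should receive a smaller p-value than one whose pattern is consistent with the item difficulties.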

The second simulation study compared the performance of the PP method and the PPPS method in assessing the overall goodness-of-fit of a Bayesian 2PN IRT model fitted to data generated under a Bayesian 2PN testlet model with equal variance across testlets. Under the PP method, we used three goodness-of-fit measures based on biserial correlations that were previously employed for checking the goodness-of-fit of a three-parameter logistic (3PL) IRT model to 3PL testlet data. For use under the PPPS method, we introduced three new goodness-of-fit measures, which are calculated from posterior values of the item discrimination parameters. Data sets were simulated under four different values of testlet variance, ranging from very low to fairly high. Under the PP method, the measures performed very poorly in detecting a lack of fit of the 2PN IRT model at all data-generating values of testlet variance. The detection rates of the new measures under the PPPS method were higher than those under the PP method. However, the measures under the PPPS method showed decent power in detecting lack of fit only for large data-generating values of testlet variance.