Criterion-related Validity of Forced-Choice Personality Measures: A Cautionary Note Regarding Thurstonian IRT versus Classical Test Theory Scoring

Recommended Citation: Fisher, Peter A.; Robie, Chet; Christiansen, Neil D.; Speer, Andrew B.; and Schneider, Leann (2019) "Criterion-related Validity of Forced-Choice Personality Measures: A Cautionary Note Regarding Thurstonian IRT versus Classical Test Theory Scoring," Personnel Assessment and Decisions: Vol. 5: Iss. 1, Article 3. DOI: https://doi.org/10.25035/pad.2019.01.003 Available at: https://scholarworks.bgsu.edu/pad/vol5/iss1/3

to which a single statement describes them), FC response formats present blocks of statements from which applicants must choose among equally desirable self-descriptions. FC measures of personality have been found to reduce applicants' ability to fake because it is more difficult to determine the "correct" response to any given block of statements and because the format increases the cognitive load involved in impression management (Tett & Simonet, 2011). Traditional classical test theory (CTT) scoring of FC assessments involves adding the inverted rank order of items in their blocks to their respective scales. As a result, a fixed number of points is allocated to an individual within each block, and so the same total number of points is allocated on each assessment. Ultimately, CTT scoring of multidimensional FC measures is, to varying extents, ipsative: Trait scores are relative within person rather than absolute on a normative scale. Ipsative scores present a variety of problems in selection testing (Brown & Maydeu-Olivares, 2011). In particular, ipsative scores are limited in their ability to support meaningful comparisons between individuals, which are critical to employee selection. Furthermore, construct validity, criterion validity, and reliability estimates are all distorted because scale scores are inherently negatively correlated, regardless of true-score relationships, and measurement errors are not independent (cf. Johnson, Wood, & Blinkhorn, 1988; Meade, 2004). Despite these concerns, the increased difficulty in faking on FC assessments results in comparable to slightly improved criterion-related validity versus single-stimulus assessments (Christiansen et al., 2005; Salgado & Táuriz, 2014).
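The claim that ipsative scale scores are inherently negatively correlated follows from the constant-total property described above. The short derivation below is an illustrative sketch: the symbols (k scales with scores S_i, constant total C, common variance σ², mean intercorrelation r̄) are introduced here, and the equal-variance simplification is an assumption rather than part of the original argument.

```latex
\begin{align*}
\sum_{i=1}^{k} S_i = C
  &\;\Rightarrow\; \operatorname{Var}\Bigl(\sum_{i=1}^{k} S_i\Bigr) = 0 \\
  &\;\Rightarrow\; \sum_{i=1}^{k} \operatorname{Var}(S_i)
      + \sum_{i \neq j} \operatorname{Cov}(S_i, S_j) = 0 \\
  &\;\Rightarrow\; k\sigma^{2} + k(k-1)\,\bar{r}\,\sigma^{2} = 0
      \quad \text{(assuming equal variances } \sigma^{2}\text{)} \\
  &\;\Rightarrow\; \bar{r} = -\frac{1}{k-1}.
\end{align*}
```

Under these assumptions, a fully ipsative instrument with k = 10 scales would show an average scale intercorrelation of about -.11 regardless of how the underlying traits are actually related.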

Thurstonian Item Response Theory
Maydeu-Olivares and Böckenholt (2005) introduced a promising technique to recover normative scores from a forced-choice instrument, thus overcoming the major weakness involved in CTT scoring of forced-choice instruments by allowing for direct between-person comparisons, which are critical to high-stakes assessments and employment selection. Brown and Maydeu-Olivares (2011) have since named this framework Thurstonian item response theory or TIRT (for a basic introduction to TIRT, which is well beyond the scope of this article, see Dueber, Love, Toland, & Turner, 2019). Work by Brown and Maydeu-Olivares (2013) has found measurement properties to improve using TIRT, including increased reliability, positive correlations among scale scores, and a cleaner factor structure. Furthermore, differences in criterion-related validity for employed call center operators were found, with an average .09 difference favoring TIRT over CTT across scales for a measure of personality predicting incentive bonus (an outcome awarded to employees based upon various performance indicators, N = 219). On the other hand, P. Lee, S. Lee, and Stark (2018) did not find improvement in the criterion-related validity of TIRT estimates over forced-choice CTT estimates in the prediction of nonwork external measures in a sample of university students (N = 417).
Although initial work on TIRT is promising, these conflicting results highlight the need for additional research into the use of TIRT scoring in high-stakes assessment situations. Both theoretically and via these empirical findings, it has been shown that TIRT is a potentially promising method that exhibits favorable FC features (i.e., nontransparent items) while avoiding the undesirable psychometric properties that plague traditional CTT FC scoring. However, implementing TIRT scoring in an applied setting can be challenging, given the more stringent data requirements and substantially more complex models necessary to derive scores. To offset the intensive data and modeling needs, TIRT should therefore consistently demonstrate better psychometric properties than the simpler CTT FC method. In employee selection contexts, this would involve improved estimates of criterion-related validity. Unfortunately, there have only been a handful of studies that have examined the criterion-related validity of TIRT-developed scales, which we reference above, and results from these were mixed. Furthermore, only one of these studies was conducted in an actual work context and used actual work criteria. To more confidently support the use of TIRT in high-stakes preemployment settings, research must be conducted to show superiority for TIRT across many work settings and to generalize those findings across different forced-choice scales.
In order to extend the existing research examining CTT and TIRT criterion-related validity, we directly compared criterion-related validity estimates for the two scoring methods for an existing, proprietary personality assessment across 11 concurrent validity sample data sets, meta-analytically corrected for measurement artifacts. Based on the previous research findings and theory presented above, it was expected that TIRT scoring would result in better criterion validity than traditional CTT scoring for selection testing and thus better selection outcomes. Contrary to our expectations, CTT scoring vastly outperformed TIRT scoring for the samples and assessment involved in this study. Below we present our methods and results in detail, and discuss these findings, as well as a potential explanation for our unexpected results.

Participants and Procedures
This study included one data set that was used for purposes of calibrating the TIRT model estimates (n = 12,018) and 11 concurrent validity data sets used as a collective validity sample (total N = 612). The organization owning the proprietary FC measure that provided the data does not collect demographic information. Three of the data sets came from an international marketing firm with jobs titled assistant marketing manager (n = 115), marketing coordinator (n = 54), and marketing executive (n = 75). The remaining eight data sets came from a sales-based organization. Although all of the jobs were sales representative positions, their roles were deemed different enough by subject matter experts to require different personality competencies: business-to-business (n = 23), energy (n = 48), insurance (n = 28), Internet 1 (n = 112), Internet 2 (n = 22), multiproduct (n = 15), personal security (n = 16), and television (n = 108). Participants were incumbents, instructed to complete the assessment in an unproctored setting for research purposes. Participants were informed that the assessment was meant to help determine hiring criteria for their jobs, and so they had little incentive to distort their responses. Indeed, meta-analytic evidence suggests that incumbents' personality assessment scores tend to be much more consistent with experimental samples instructed to respond honestly than with job applicants or experimental samples instructed to fake (Birkeland, Manson, Kisamore, Brannick, & Smith, 2006).

Measures
Personality. The personality measure used in this study is a five-factor model (FFM)-based (Goldberg, 1992) commercial instrument based on DeYoung, Quilty, and Peterson's (2007) 10-factor (two facets per FFM dimension) model that uses a partially ipsative, forced-choice methodology (see Salgado, Anderson, & Táuriz, 2015, for additional information on partially vs. fully ipsative FC measures). Facet-level scores were used in the current study because prediction of work outcomes is generally better at a level below the FFM (Christiansen & Robie, 2011), and the instrument was initially constructed with this in mind. Facet dimensions, number of items per facet dimension, test-retest reliabilities (n = 124 from a student sample with a 2- to 4-week retesting interval),¹ broad dimensions, and definitions can be found in Appendices A and B. Assessment takers are presented with 60 items arranged in 20 triplets that have been matched on attractiveness (i.e., social desirability). For each triplet, respondents are asked to choose the statement that is "least like you" and the statement that is "most like you" (e.g., "I tend to take an interest in other people's lives," "I don't mind taking charge," and "I usually need a creative outlet").
The CTT scoring of the measure was straightforward. If all statements in a triplet were positively keyed, items were scored 2 if chosen as "most like me," 0 if chosen as "least like me," and 1 if not chosen. If all statements in a triplet were negatively keyed, items were scored 0 if chosen as "most like me," 2 if chosen as "least like me," and 1 if not chosen. Scores were then summed across triplets for each facet. No calibration sample was necessary.
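The CTT scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the vendor's implementation: the function names, the response tuple layout, and the assignment of statements to facets in the example are all hypothetical.

```python
from collections import defaultdict


def score_triplet(facets, most_idx, least_idx, positively_keyed=True):
    """Score one triplet and return a list of (facet, points) pairs.

    facets: facet labels of the three statements (hypothetical assignment).
    most_idx / least_idx: positions (0-2) picked as "most" / "least like me".
    """
    if most_idx == least_idx:
        raise ValueError("'most' and 'least' choices must differ")
    # Positively keyed: most = 2, least = 0. Negatively keyed: reversed.
    most_pts, least_pts = (2, 0) if positively_keyed else (0, 2)
    scored = []
    for i, facet in enumerate(facets):
        if i == most_idx:
            scored.append((facet, most_pts))
        elif i == least_idx:
            scored.append((facet, least_pts))
        else:
            scored.append((facet, 1))  # statement not chosen
    return scored


def score_assessment(responses):
    """Sum triplet points into facet totals."""
    totals = defaultdict(int)
    for facets, most_idx, least_idx, positively_keyed in responses:
        for facet, pts in score_triplet(facets, most_idx, least_idx,
                                        positively_keyed):
            totals[facet] += pts
    return dict(totals)


# Two illustrative triplets (facet assignments are made up for the example).
responses = [
    (("Compassion", "Enthusiasm", "Orderliness"), 0, 2, True),
    (("Industriousness", "Compassion", "Self-regard"), 1, 0, False),
]
facet_totals = score_assessment(responses)
print(facet_totals)
print(sum(facet_totals.values()))  # always 3 points per triplet
```

Because every triplet distributes exactly 3 points whatever the respondent chooses, the grand total is fixed by the number of triplets, which is the ipsative property discussed earlier.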
The TIRT scoring was more complex, and its details are beyond the scope of this paper (see Brown & Maydeu-Olivares, 2012, for details). TIRT is a model-based scoring methodology with large numbers of parameters to estimate, so a calibration sample large enough to reliably estimate said parameters is recommended (Brown & Maydeu-Olivares, 2011). The TIRT model fit the 10-factor model in our calibration sample well (RMSEA = .029) using Mplus 7.4 (Muthén & Muthén, 2015). Model estimates from this calibration sample were used to score personality in the validity sample data sets that contained job performance data.
For jobs within each validity sample, personality composite scores were formed that only included the personality traits deemed relevant to each job (see Tett & Christiansen, 2007). These were formed to reflect the total composite that might be used when making decisions regarding which applicants to hire by combining only those trait scores theoretically linked to performance. The composites, which were formed for both CTT and TIRT scoring methods, aggregated personality facet scores separately for each of the 11 validity sample data sets. Decisions on which facets were relevant to each job were made by subject matter experts who were both experts in personality psychology and knowledgeable about the requirements for each of the jobs.
Job performance. Job performance was measured differently for the marketing versus the sales jobs, in both cases according to the standards of the client organizations. Marketing job performance was rated by the incumbent's supervisor and was an aggregate of several scales, including an average of key performance indicators for each role, an average proficiency rating of several skills, and an overall subjective rating. Sales job performance was based on a combination of multiple objective performance scores, adjusted for several organization-specific factors such as region and quota.

RESULTS
Criterion-related validity coefficients for the 11 different samples for CTT and TIRT scoring methods are presented in Table 1.² Tests of dependent correlations (|Z|; Steiger, 1980) were used to compare the correlations across scoring methods. Contrary to expectations, all of the comparisons favored CTT over TIRT; two of these were statistically significant at conventional levels (p < .05) and two more at a relaxed level (p < .10).

¹ Please note that test-retest reliability is generally considered to be more appropriate for estimating FC reliability (O'Neill et al., 2017).

² Note that the operational validity estimates for CTT scoring may be lower than the ones used for commercial purposes by the test vendor because some personality assessment items were eliminated from the present analyses. These are composed of 50 adjectives, presented in groups of 10, for which each group asks respondents to choose the three adjectives that were "most like you" and three adjectives that were "least like you." CTT-based estimates could be derived from these adjectives, but a TIRT model to score these could not be identified. Thus, analyses in the present study were restricted to the triplet statements so that a fair comparison could be made between scoring methods.

Table 1. Personality Criterion-Related Validity Estimates for CTT and TIRT Scoring Methods

Given the low sample sizes for some of the correlations, psychometric meta-analysis was used to compare aggregated correlations across scoring methods (Schmidt & Hunter, 2015; Schmidt & Le, 2014). The validity generalization meta-analytic framework developed by Schmidt and Hunter (1977) was designed explicitly for this purpose. Meta-analysis sample-size weights the correlations before aggregating to ensure that those derived from larger samples have better representation in the average than those from smaller samples.
Omitting correlations from the smaller samples would therefore provide a worse estimate of the population validity coefficient than weighting them appropriately. In fact, after controlling for sample size and other artifacts, there was no remaining variance in the correlations for CTT and very little for TIRT. Validity estimates were corrected for criterion unreliability using an estimate of .52 (Viswesvaran, Ones, & Schmidt, 1996) and average indirect range restriction of .89 (Salgado & Táuriz, 2014).
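The artifact corrections mentioned above can be illustrated with a short Python sketch. It is not the authors' procedure: the indirect range restriction correction used in psychometric meta-analysis involves additional steps that are not described in the text, so the simpler Thorndike Case II formula is swapped in here as an assumption. The function names and the order of the two corrections are likewise illustrative.

```python
import math


def correct_for_criterion_unreliability(r_obs, r_yy):
    """Disattenuate a validity coefficient for criterion unreliability."""
    return r_obs / math.sqrt(r_yy)


def correct_for_range_restriction(r, u):
    """Thorndike Case II correction (assumed here for illustration).

    u is the ratio of restricted to unrestricted predictor SD.
    """
    big_u = 1.0 / u
    return (big_u * r) / math.sqrt(1.0 + (big_u ** 2 - 1.0) * r ** 2)


def operational_validity(r_obs, r_yy=0.52, u=0.89):
    """Apply both corrections, unreliability first (an assumed order)."""
    r = correct_for_criterion_unreliability(r_obs, r_yy)
    return correct_for_range_restriction(r, u)


# Artifact values (.52, .89) are those cited in the text.
print(round(operational_validity(0.25), 2))
print(round(operational_validity(0.00), 2))
```

With an observed correlation of .25, this simplified chain yields roughly .38. That a simplified formula lands near a reported value should not be read as confirmation of the exact procedure the authors used.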
Meta-analysis of personality criterion-related validity estimates for CTT and TIRT scoring methods can be found in Table 2. The sample size-weighted mean observed correlations were .25 for CTT and .00 for TIRT. The sample-weighted mean corrected correlations were .38 for CTT and .00 for TIRT. The standard deviation of rho was 0 for the CTT scoring method but was .14 for the TIRT scoring method. Thus, no variance remained in the criterion-related validity estimates for the CTT scoring method after sample size, criterion unreliability, and indirect range restriction were accounted for, in contrast to the considerable variance remaining for the TIRT scoring method. The confidence intervals for the validity estimates from the two scoring methods did not overlap, indicating that the aggregated criterion-related validity estimate for CTT scoring was higher than that for TIRT.³
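To make the idea of "no remaining variance" concrete, the following Python sketch shows a bare-bones (sampling-error-only) meta-analysis in the Hunter-Schmidt style. The correlations and sample sizes are invented toy values, not the study's data, and the function name is hypothetical; the full analysis in the paper also corrects for criterion unreliability and range restriction, which this sketch omits.

```python
def bare_bones_meta(rs, ns):
    """Sample-size-weighted mean r and variance decomposition.

    Returns (mean_r, observed_var, sampling_error_var, residual_var).
    """
    total_n = sum(ns)
    k = len(rs)
    # Larger samples contribute proportionally more to the mean.
    mean_r = sum(n * r for r, n in zip(rs, ns)) / total_n
    observed_var = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    # Variance expected from sampling error alone, using the average N.
    mean_n = total_n / k
    sampling_error_var = (1.0 - mean_r ** 2) ** 2 / (mean_n - 1.0)
    # Residual variance is floored at zero.
    residual_var = max(0.0, observed_var - sampling_error_var)
    return mean_r, observed_var, sampling_error_var, residual_var


# Toy inputs (hypothetical, for illustration only).
toy_rs = [0.30, 0.20, 0.25]
toy_ns = [100, 50, 50]
mean_r, obs_var, err_var, res_var = bare_bones_meta(toy_rs, toy_ns)
print(mean_r, obs_var, err_var, res_var)
```

In this toy case the variance expected from sampling error exceeds the observed variance, so the residual variance is zero. This is the pattern the text reports for CTT, where differences among the sample correlations are no larger than chance would produce.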

DISCUSSION
There are many positive aspects to FC assessments in high-stakes testing settings, and TIRT has been promoted as a solution to the problems commonly found when CTT is used to score these measures. However, these findings contrast starkly with what one might expect upon implementing TIRT scoring for an FC employment selection assessment, particularly when considering the theoretical advantages that have previously been proposed. In an applied setting, using real-world incumbents and job performance criteria, TIRT scoring resulted in negligible criterion validity. On the other hand, CTT scoring resulted in acceptable criterion-related validity in the prediction of job performance outcomes. Compared to traditional CTT scoring of the same data, TIRT was clearly inferior and implementation would not have resulted in any benefit in a selection scenario. It is also worth noting that these differences may actually be understated in the results presented: In an attempt to compare the two scoring methods as fairly as possible, none of the minor modifications to the CTT scoring method that the test vendor otherwise includes in practice were made. For example, modifications such as differential weighting of specific items that have empirically demonstrated higher reliabilities, or allowing for cross-loading items that have evidenced significant facet overlap, were not implemented when computing the CTT results reported above. As a result, the criterion validity estimates evidenced by CTT scoring would be expected to be slightly higher in practice, further widening the gap between the two scoring methods. Ultimately, the results presented here are relatively consistent with existing research comparing classical test theory scoring of personality assessments with its item response theory counterparts, where IRT-derived scoring does not tend to improve trait estimation (Chernyshenko, Stark, Drasgow, & Roberts, 2007; Ferrando & Chico, 2007; Ling, Zhang, Locke, Li, & Li, 2016; Xu & Stone, 2012), and specifically for selection purposes (Speer, Robie, & Christiansen, 2016). Although TIRT is not without theoretical merits, and assessments constructed with TIRT scoring in mind may be useful for other purposes (e.g., low-stakes, developmental assessments; although more research is certainly required to make that claim), it is clear that applying TIRT scoring to an assessment that was designed to be scored with a CTT methodology may result in inadequate criterion validity. Overall, the CTT scoring method provided adequate and expected levels of criterion-related validity, consistent with the original goal and value proposition of the assessment. Thus, these results serve as a warning against the blind implementation of TIRT scoring over traditional CTT on existing FC assessments without conducting rigorous validation.

³ To help assuage doubts about our use of some of the smaller samples in the meta-analysis, we estimated the sample-weighted mean correlations omitting studies with n < 30 that an anonymous reviewer identified as being potentially problematic: For CTT the sample-weighted mean correlation was .24 (compared to the .25 estimate including them), and for TIRT the sample-weighted mean correlation was -.03 (compared to the .00 estimate including them). Thus, the substantive conclusion would be unchanged. In fact, omitting small-N samples (and pretending they do not exist) will actually bias estimation of the population validity coefficients as compared to including them and weighting appropriately.

Table 2. Meta-Analysis of Personality Criterion-Related Validity Estimates for CTT and TIRT Scoring Methods

Reconciling Results: Trait Retrieval
Although the results we present above favor CTT over TIRT, there are several design factors to consider, particularly with respect to how the mix of response options within an item block can impact trait recovery. In their seminal simulation study, Brown and Maydeu-Olivares (2011) demonstrate that blocks of homogeneously keyed (either all positively keyed or all negatively keyed) items merely highlight differences in the latent traits. Thus, TIRT-derived scores for these homogeneously keyed blocks draw conclusions about the relative positions of the underlying traits, rather than absolute, normative locations, similar to CTT scores. When all blocks in an assessment are homogeneously keyed, trait retrieval may be poor using TIRT, as little information is provided on absolute trait location.
Notably, the proprietary FC assessment involved in this study consists of entirely homogeneously keyed blocks of statements (e.g., "I tend to take an interest in other people's lives," "I don't mind taking charge," and "I usually need a creative outlet"). This was done by the test developers so that the response options could be equated on attractiveness in order to minimize applicant faking of the FC measure. The results above suggest that CTT scoring of such a measure produces acceptable levels of criterion-related validity for use in selection testing. Such a design is useful in contexts where only some of the scored traits will be considered important and hence when relative trait standing matters, and negative correlations between scores can be overlooked. The introduction of TIRT scoring, however, creates additional burdens that may not have been considered in the development of many contemporary FC personality assessments. In particular, it is especially difficult to match social desirability of heterogeneously keyed items within blocks, which is critical to maintaining construct validity in applicant contexts where faking is likely. As Heggestad, Morrison, Reeve, and McCloy (2006) noted in their development of a multidimensional FC personality assessment, despite rigorous attempts to match items on social desirability "respondents became reluctant to indicate that a statement indicative of a high standing was 'least like me' and that a statement indicative of a low trait standing was 'most like me' under conditions of faking" (p. 21). Failure to take adequate steps to prevent faking can seriously undermine criterion-related validity when a personality assessment is given to actual job applicants (Tett & Christiansen, 2007).

Note. The composite for personal security contains only a single dimension.
an FC assessment with TIRT scoring in mind and included heterogeneously keyed blocks of items to ensure that trait retrieval was not a concern. Nevertheless, the findings these authors present continue to cast doubt on the usefulness of TIRT scoring in applied, high-stakes testing, as the average criterion-related validity presented for even nonwork-related criteria tended to be smaller for TIRT, ultimately in favor of CTT scoring. In fact, recent theoretical and simulation work by Bürkner, Schulte, and Holling (2018) brings those authors to a very similar conclusion. Taken together with the current study, this highlights the uncertainty in understanding exactly what variance is being captured by TIRT that is distinct from CTT. Because TIRT should better assess the latent trait domain, according to basic validity theory (Binning & Barrett, 1989) it would be expected that TIRT-derived scores should correlate more strongly with theoretically linked outcomes such as job performance. However, at present it is difficult to argue that unique variance captured by TIRT represents true-score variance.
Perhaps this setting was one where relative standing across traits was more important than the absolute standing within traits (in which case TIRT could reflect unique true score variance that simply wasn't well-aligned to the performance criteria), or perhaps these findings are specific to the small number of instruments used and studies conducted thus far. Either way, more research on this topic is warranted.

Analysis at the Dimension Level
The analyses presented above are conducted at the practical, composite level, where composites are comprised of various dimensions at the discretion of subject-matter experts. This parallels the scoring system used for decision making in high-stakes testing situations but may make the interpretation of the differences between scoring approaches more difficult from an academic perspective. Three correlation matrices within and between scoring methods at the dimension level, one for each of the calibration, marketing, and sales samples, can be seen in the supplemental material. These correlation matrices highlight a possible alternative perspective on the findings we have presented above.⁴ It is apparent that TIRT scoring tended to result in expected positive correlations between personality dimensions, which contrasts starkly with the negative correlations that result from ipsative CTT scoring. However, dimensions that correlate relatively highly will contribute less unique information about a criterion when placed into a composite, which could help to explain the difference in validity between the two scoring methods. As can be seen in Table 3, the average correlations between dimensions within a composite for a job tended to be small and negative under CTT scoring, but larger and positive under TIRT scoring, consistent with this line of reasoning. However, a comparison of the average dimension-level criterion-related validity estimates for each scoring method suggests that CTT scores result in better average criterion-related validity, even at the dimension level, leading to the same substantive conclusion as above. Thus, the difference in the validity of composite scores appears to be due both to a decline in criterion-related validity of the TIRT dimension scores and to an increase in the intercorrelation of the dimension scores contributing to the composite. TIRT scoring reduces the amount that choices related to the dimensions identified as relevant in the job analyses increase scores on the composite, relative to simpler CTT scoring; hence, TIRT results in less criterion-related validity (cf. Christiansen et al., 2005).
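The effect of dimension intercorrelations on composite validity can be illustrated with the standard formula for a unit-weighted sum of standardized scores. The sketch below is an assumption-laden illustration: unit weighting of standardized facet scores is assumed (the text does not state how composites were weighted), the function name is hypothetical, and the validities and intercorrelations are invented values rather than results from Table 3.

```python
import math


def composite_validity(validities, mean_intercorrelation):
    """Validity of a unit-weighted composite of k standardized dimensions.

    validities: correlation of each dimension with the criterion.
    mean_intercorrelation: average correlation among the dimensions.
    """
    k = len(validities)
    # Variance of a sum of k standardized scores with average correlation r.
    composite_variance = k + k * (k - 1) * mean_intercorrelation
    if composite_variance <= 0:
        raise ValueError("mean intercorrelation too negative for k dimensions")
    return sum(validities) / math.sqrt(composite_variance)


# Same hypothetical dimension validities, two intercorrelation patterns.
dims = [0.15, 0.15, 0.15]
ipsative_like = composite_validity(dims, -0.10)   # small negative average r
normative_like = composite_validity(dims, 0.40)   # larger positive average r
print(round(ipsative_like, 3), round(normative_like, 3))
```

Holding dimension-level validities fixed, the composite built from weakly negatively correlated dimensions is more valid than the one built from positively correlated dimensions. This is the mechanism the paragraph above describes, operating in addition to any drop in the dimension-level validities themselves.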

Concluding Remarks on Implementing TIRT
Theoretically, TIRT has the potential to provide the "best of both worlds" in personality assessments when it comes to reducing applicant faking while also solving issues related to ipsative scores. The results presented here indicate that TIRT scoring should not be blindly implemented to replace CTT scoring on existing FC personality assessments in practice. As demonstrated in the present study, TIRT assessment scoring does not necessarily represent a panacea for high-stakes assessment situations. Assessments that were not originally constructed or validated with TIRT in mind may not be suitable candidates for TIRT scoring. Thus, care in development of FC assessments, as well as rigorous, empirical, concurrent validation should be undertaken before implementing TIRT scoring in applied assessments.

Facet dimension | # Items | Test-retest reliability | FFM dimension | Definition
Compassion | 6 | .41 | Agreeableness | The extent that someone shows empathy, sympathy, and warmth toward others; shows a tendency for being understanding and forgiving of mistakes. It is the degree to which someone is forgiving, helpful, and trusting.
Mannerliness | [missing] | .43 | Agreeableness | The extent that someone is pleasant, willing to cooperate, and considerate. It is the degree to which someone is modest, unassuming, and courteous. Mannerliness is sometimes referred to as compliance and politeness.
Industriousness | [missing] | .42 | Conscientiousness | The extent that someone maintains high standards, aspires to challenging goals, and is willing to put forth extra effort. It is the degree to which someone is purposeful, efficient, and ambitious.
Orderliness | [missing] | .71 | Conscientiousness | The extent that someone acts with deliberation, is focused on quality, and prefers to be organized and have a plan. It is the degree to which someone is thorough, methodical, and organized.
[missing] | [missing] | .69 | Extraversion | The extent that someone voices their opinions and is comfortable being the center of attention and giving direction to other employees. It is the degree to which someone is influential, persuasive, and self-confident.
Enthusiasm | 9 | .57 | Extraversion | The extent that someone is interested in meeting new people, initiates conversations, and is comfortable in social interactions. It is the degree to which someone is talkative, outgoing, and sociable.
Self-regard | 5 | .59 | Emotional Stability | The extent that someone has a positive self-image, is satisfied with who they are as a person and tends to be self-assured and optimistic. It is the degree to which someone is content, secure, and cheerful.
[missing] | [missing] | .53 | Emotional Stability | The extent that someone is calm under pressure, even tempered, and resistant to the effects of stress and unexpected changes. It is the degree to which someone is calm, steady, and composed.
Experiential disposition | 7 | .56 | Openness | The extent that someone seeks out new and different experiences, adapts to changes in the workplace, and is tolerant of differences between people. It is the degree to which someone is flexible, unconventional, and reflective.