Faking and the Validity of Personality Tests: An Experimental Investigation Using Modern Forced Choice Measures

Despite the established validity of personality measures for personnel selection, their susceptibility to faking has been a persistent concern. However, the lack of studies that combine generalizability with experimental control makes it difficult to determine the effects of applicant faking. This study addressed this deficit in two ways. First, we compared a subtle incentive to fake with the explicit “fake-good” instructions used in most faking experiments. Second, we compared standard Likert scales to multidimensional forced choice (MFC) scales designed to resist deception, including more and less fakable versions of the same MFC inventory. MFC scales substantially reduced motivated score elevation but also appeared to elicit selective faking on work-relevant dimensions. Despite reducing the effectiveness of impression management attempts, MFC scales did not retain more validity than Likert scales when participants faked. However, results suggested that faking artificially bolstered the criterion-related validity of Likert scales while diminishing their construct validity.


…and conscientiousness (.52). The larger effect sizes for emotional stability and conscientiousness mirror the findings from directed faking and validity generalization research, suggesting that applicants selectively fake on the most universally job-relevant traits. On the other hand, the lack of experimental control in applicant/non-applicant comparisons limits their ability to isolate the effects of faking. A variety of other factors, including selection, attrition, and differential motivation to take a personality test seriously, may influence group differences in personality scores (as well as validity coefficients). As previously mentioned, validity generalization research shows that applicant faking has not destroyed the predictive potential of personality measures. However, evidence for validity retained in spite of faking tells us little about the amount of potential validity lost. This loss is difficult to measure directly due to the tradeoff between experimental control and generalizability to operational testing, but there is reason to suspect that there is room for improvement. For example, recent meta-analyses have found substantially higher validity coefficients when other-reports are used instead of self-reports (Connelly & Ones, 2010; Oh et al., 2011), which may be partially attributable to differences in response distortion.
Meta-analytic research has also found higher validities for a category of faking-resistant personality measure known as quasi-ipsative multidimensional forced choice (MFC) scales (Salgado et al., 2014). Whereas single stimulus (SS) measures (e.g., Likert scales) have test takers rate one personality statement at a time, MFC items present choices between two or more statements representing different personality dimensions (see Figure 1). The statements can be paired based on estimates of their social desirability, making it difficult for test takers to discern which option will produce the most desirable personality profile.
Although the findings are promising, it is unclear whether any validity advantage of MFC scales can be attributed to their faking resistance. A few experimental studies have supported this connection by comparing MFC and SS scales while simultaneously manipulating the motivation to fake (Christiansen et al., 2005; Hirsh & Peterson, 2008; Mueller-Hanson et al., 2003). However, comparisons of MFC and SS measures cannot control for differences between the two formats other than faking resistance. In addition, all but one of these studies used fake-good instructions, which may exaggerate or otherwise distort the effects of faking (and therefore the effects of reducing faking) due to the artificial extremity of directed faking. For example, Ellingson et al. (1999) found that faked personality scores showed only modest correlations with honest scores, and a correction for socially desirable responding did not significantly improve convergence. However, as the authors noted, their conclusions about social desirability corrections could reflect the artificial nature of directed faking. Because extreme faking all but eliminated true personality variance from faked scores, the inability to recover true personality variance via a correction was almost a foregone conclusion.
The tension between experimental control and generalizability to typical applicant behavior has been a persistent issue in the faking literature, limiting our ability to draw nuanced conclusions about the effects of applicant faking. The present study was designed to address the limitations of previous research in order to provide a better understanding of the faking-validity relationship. Specifically, we employed more nuanced manipulations of motivation and ability to fake to elicit a gradient of faking behavior, allowing for a more comprehensive analysis of the effects of faking. To better approximate typical faking behavior, we manipulated faking motivation using a subtle incentive to fake. We also tested the effects of explicit fake-good instructions, allowing us to directly compare two methods to induce faking in experimental research. This produced three levels of the faking motivation variable: honest instructions, fake-good instructions, and fake-good incentive.
In addition to comparing MFC and SS scales, we manipulated the fakability of the same MFC measure to eliminate confounding differences between the two measurement formats. This was accomplished using a computer adaptive test (CAT) that allowed for varying restrictions on the social desirability matching (SDM) of statements that were paired to form a single item. Imposing stricter matching rules on the CAT algorithm has been shown to reduce fakability by increasing the perceived similarity of paired statements (Boyce & Capman, 2017).
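To make the ability manipulation concrete, consider the following minimal sketch in Python (hypothetical names and toy data; the operational CAT also uses IRT information criteria when selecting among eligible pairs, and the .10/.20 thresholds used below are described in the Method section):

# Minimal sketch of social desirability matching (SDM).
# Each statement carries a social desirability (sd) parameter on a 0-1 scale.
pool = [
    {"id": "dr01", "dimension": "drive",     "sd": 0.62},
    {"id": "cp04", "dimension": "composure", "sd": 0.55},
    {"id": "st02", "dimension": "structure", "sd": 0.81},
]

def eligible_pairs(statements, max_sd_gap):
    """Return cross-dimension statement pairs whose social desirability
    parameters differ by no more than max_sd_gap."""
    pairs = []
    for i, a in enumerate(statements):
        for b in statements[i + 1:]:
            if a["dimension"] != b["dimension"] and abs(a["sd"] - b["sd"]) <= max_sd_gap:
                pairs.append((a["id"], b["id"]))
    return pairs

strict_pairs = eligible_pairs(pool, max_sd_gap=0.10)   # stricter rule: fewer, closer pairs
relaxed_pairs = eligible_pairs(pool, max_sd_gap=0.20)  # relaxed rule: a wider pool of pairs

With the toy pool above, the strict rule permits only the drive/composure pair (gap = .07), whereas the relaxed rule also admits the drive/structure pair (gap = .19), illustrating how tighter matching shrinks the set of usable pairings.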
The faking motivation and ability manipulations produced a 3x3 design that allowed us to test several methodological and theoretical hypotheses. In keeping with past research (Boyce & Capman, 2017; Drasgow et al., 2012), we hypothesized that:

Hypothesis 1: MFC scales will show smaller mean differences between honest and faked responses than SS measures of the same dimensions.
Hypothesis 2: Using a stricter SDM rule will reduce mean differences between honest and faked responses.
Our next set of hypotheses concerned the relationship between faking and validity. Assuming faking reduces validity, factors that mitigate faking are likely to improve validity when there is motivation to fake. Therefore, we predicted that:

Hypothesis 3: MFC scales will produce higher criterion-related validity than SS measures of the same dimensions but only when respondents are instructed to fake.

Hypothesis 4: Using a stricter SDM rule will produce higher criterion-related validity but only when respondents are instructed to fake.
Finally, our research design allowed for a novel methodological comparison between directed and incentivized faking. Incentivized faking studies still show faking effects, but the effect sizes are more likely to resemble those found in applicant samples (e.g., Mueller-Hanson et al., 2003). Validity may be reduced but not obliterated, and mean scores may be moderately rather than severely inflated. Therefore, we proposed that:

Hypothesis 5: Directed faking results will replicate using an incentivized faking manipulation.

Method

Participants
Participants were recruited through Amazon Mechanical Turk (MTurk). Research has found that personality data from MTurk workers have reliability comparable or superior to that of traditional samples (Buhrmester et al., 2011). MTurk workers also appear to behave similarly to participants in traditional laboratory and field experiments (Casler et al., 2013; Horton et al., 2011).
In order to ensure the internal and external validity of results, participants were screened using a few criteria. First, we limited our participant pool to American MTurk workers over the age of 18. Second, we required participants to have at least 100 approved tasks on MTurk and an approval rate of 90% or higher. Third, participants had to be employed for at least 3 months within the past year in a position where they interacted with coworkers at least 1-2 days per week. This requirement was intended to ensure participants could complete our self-reported job performance measures (discussed below). All participants were paid $3 for their voluntary participation, and 10 were randomly selected to receive $10 bonuses.
Participants were included in the final sample if they passed two embedded attention checks, a manipulation check to ensure they had attended to their faking instructions, and a repetitive responding check. Of the 855 participants who completed the study, 652 passed these checks. The final sample was predominantly White (73%), female (57%), and currently employed (96%); see Table 1 for a breakdown of participants' occupations and educational status. Participants ranged from 19 to 70 years of age with a median age of 33. Participants were randomly assigned to one of six conditions that crossed two three-level independent variables: measurement format and faking instructions. See Table 2 for sample sizes by condition.

Measures

MFC Personality Inventory. Ten of the inventory's 15 dimensions map onto aspects of the five-factor model (Allen et al., 2017; DeYoung et al., 2016; DeYoung et al., 2007; DeYoung et al., 2009; Kaufman et al., 2016; Quilty et al., 2014). The remaining five dimensions capture work-relevant traits beyond the five-factor model. See Table 3 for the dimensions and their theoretical mappings.
The MFC inventory is scored using Stark's multi-unidimensional pairwise preference (MUPP) model, an item response theory (IRT) model for scoring binary MFC items (Stark, 2002; Stark et al., 2005). In this study, each MFC administration included 100 items, each consisting of two personality statements selected by the CAT algorithm, for a total of approximately 13 statements per personality dimension.
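For reference, the MUPP model expresses the probability of preferring statement s over statement t in a pairwise item in terms of each statement's unidimensional endorsement probabilities (Stark et al., 2005):

\[
P(s \succ t \mid \theta_{d_s}, \theta_{d_t}) = \frac{P_s(1 \mid \theta_{d_s})\, P_t(0 \mid \theta_{d_t})}{P_s(1 \mid \theta_{d_s})\, P_t(0 \mid \theta_{d_t}) + P_s(0 \mid \theta_{d_s})\, P_t(1 \mid \theta_{d_t})}
\]

where P_s(1 | θ) denotes the probability of endorsing statement s presented in isolation, given the respondent's standing θ on that statement's dimension d_s.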
In addition to IRT parameters, each statement has an associated social desirability parameter ranging from 0 to 1 (established based on a directed faking study). In the strict SDM conditions, the CAT algorithm was only allowed to pair statements whose social desirability parameters were within .10 of one another. In the relaxed SDM conditions, the social desirability parameters of paired statements could differ by up to .20.

SS Personality Scales. Participants in the SS conditions completed Likert-type measures of the 15 constructs assessed by the MFC inventory. To minimize differences with the MFC dimensions, we constructed the SS scales using items from the MFC CAT's statement pool. First, we used existing calibration data from a sample of MTurk workers (N = 6,333), as well as previously estimated item location parameters, to select 12 items per dimension for pilot testing. Next, we administered the chosen items using a four-point response format, followed by the MFC inventory, to a pilot sample of 269 MTurk workers. Finally, we used the pilot data to construct reliable six-item scales with good convergent validity with their MFC counterparts.
All 15 scales showed acceptable reliability, with coefficient alpha estimates ranging from .74 to .90. In addition, the scales demonstrated convergent and discriminant validity with their MFC counterparts. Monotrait-heteromethod correlations ranged from .42 to .74 with a mean of .58, whereas the average heterotrait-heteromethod correlation was only .18. See Table S1 in the supplemental materials for reliability and convergent validity results by dimension.
Self-Reported Job Performance. Participants completed Spector and Fox's 20-item organizational citizenship behavior checklist (OCB-C; Fox et al., 2012) and 10-item counterproductive work behavior checklist (CWB-C; Spector et al., 2010) as criterion measures. Fox et al. (2012) reported coefficient alphas of .89 and .94 for the OCB-C in two samples; Spector et al. (2010) reported an alpha of .79 for the CWB-C.
Self-Reported Academic Performance. Participants completed three criterion items assessing academic performance and achievement. First, they reported their highest academic degree completed, which ranged from high school to doctoral degrees. Second, participants reported their GPA at that degree level on an 11-point scale ranging from A+ to E or F (Freeberg et al., 1989). Finally, they reported their high school GPA using the same scale.
A meta-analysis by Kuncel et al. (2005) found an average correlation of .84 between self-reported and school-reported GPA. However, self-reported GPAs were also higher than actual GPAs on average, and individuals with lower GPAs provided far less valid self-reports. Thus, it appears that self-reported GPA is a valid indicator of academic performance but is also susceptible to nontrivial response distortion.
Emotion Management Task. To address the possibility of common method bias arising from self-report criteria, we included an objective performance task as an additional criterion measure. Specifically, we administered the 18-item Situational Test of Emotional Management-Brief (STEM-B; Allen et al., 2015), a performance-based emotional intelligence scale that requires examinees to identify the most effective response to a variety of emotional situations.

Risky Choice Framing. We administered Tversky and Kahneman's (1981) Asian disease problem as a final criterion measure. This problem requires participants to choose between two programs to combat a disease that threatens to kill 600 people. Preferences for safer or riskier options have been shown to vary depending on whether the potential outcomes are presented in positive or negative terms (i.e., lives saved vs. lives lost).
Risky choice problems can be scored to assess two distinct constructs. First, susceptibility to framing is quantified as a difference score between the negative and positive item scores. Second, general risk-taking tendency is assessed by combining the two scores.
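As a hedged illustration (the exact combination rule is not reproduced here; the averaging below is our assumption), the two scores can be computed as

\[
\text{Framing susceptibility} = R_{\text{neg}} - R_{\text{pos}}, \qquad \text{Risk tendency} = \tfrac{1}{2}\left(R_{\text{neg}} + R_{\text{pos}}\right)
\]

where R_pos and R_neg are the risk-preference scores obtained under the positively and negatively framed versions of the problem.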
Faking Instructions. Participants received one of three instruction sets before completing a personality inventory. The honest instructions, which we borrowed from Mueller-Hanson et al. (2003), asked participants to respond as honestly as possible and emphasized their anonymity. The fake-good instructions asked participants to pretend they were applying for a job and to make the best impression possible, responding as an ideal employee would. The incentivized fake-good instructions, adapted from Mueller-Hanson et al. (2003), explained that participants would automatically have a chance to receive one of ten $10 bonuses if they qualified for a fictitious "second part" of the study, which was said to be open only to participants whose personality traits were desired by employers. However, the instructions also warned that providing false responses could disqualify them from the study.

Procedure
As shown in Table 2, participants in Conditions 1-3 completed the same personality inventory (SS, MFC-relaxed, or MFC-strict) under both honest and fake-good instructions, with the order of instructions counterbalanced. Thus, these conditions represented six cells of the 3x3 manipulation. Conversely, participants in Conditions 4-6 completed a personality inventory only once, with incentivized fake-good instructions, and their results were compared to the honest results from Conditions 1-3. The purpose of this between-person comparison was to avoid anchoring effects. Unlike directed fakers, incentivized fakers were instructed to provide honest responses. Asking them to respond honestly once and then immediately asking them to respond honestly a second time with an incentive to distort (or vice versa) would likely elicit suspicion and reluctance to deviate from their initial responses.
All participants began by reading the consent form and indicating their informed consent. They then completed screening and optional demographic questions, followed by the criterion measures. Finally, they completed one of three personality inventories under their assigned faking instructions. The purpose of administering the criterion measures before the predictors was to ensure that criterion responses were not contaminated by subsequent faking instructions.

Results

Motivated Score Elevation
Our first two hypotheses predicted that the degree of score elevation due to directed faking would be inversely related to the faking resistance of the measurement format. To test these hypotheses, we first transformed all personality scores to z-scores (using honest means and SDs) to create a common metric across measurement formats. Next, we conducted a mixed-model MANOVA to assess the combined effects of faking instructions (honest and fake-good) and measurement format (SS, MFC-relaxed, and MFC-strict) across all 15 personality traits. The main effect of instructions was significant, F(1, 273) = 16.92, p < .001, indicating participants generally increased their scores when directed to fake. Furthermore, we found a significant interaction between instructions and measurement format, F(2, 548) = 2.73, p < .001, suggesting the degree of score elevation varied by format.
To test Hypothesis 1, we conducted follow-up 2x2 MANOVAs comparing the SS format to each MFC format. These revealed significant instruction-format interactions for both the SS/MFC-relaxed comparison, F(1, 187) = 4.05, p < .001, and the SS/MFC-strict comparison, F(1, 170) = 4.37, p < .001. We also computed standardized mean differences (Glass's Δ) between honest and faked personality scores for all three measurement formats (see Tables S2, S3, and S4 in the supplemental materials for associated means and standard deviations). As shown in Table 4, directed faking produced large gains on the SS personality scales (mean Δ = .81). In support of Hypothesis 1, the degree of faking was much smaller on both MFC formats compared to the SS format, with a mean Δ of .28 for the MFC-relaxed inventory and .27 for the MFC-strict inventory.
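For reference, Glass's Δ standardizes each faked-minus-honest mean difference by the honest-condition standard deviation, the same baseline used for the z-score metric above:

\[
\Delta = \frac{\bar{X}_{\text{faked}} - \bar{X}_{\text{honest}}}{SD_{\text{honest}}}
\]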
Hypothesis 2 predicted that using a stricter SDM rule would also reduce faking gains. However, the difference between the two MFC formats was minimal, and the format-by-instructions interaction was nonsignificant in a follow-up 2x2 MANOVA. Thus, Hypothesis 2 was not supported.
On the other hand, comparing mean effect sizes across dimensions may not fully capture the behavior of directed fakers. At the item level, SS scales allow respondents to fake on one dimension without affecting their scores on other dimensions. By contrast, each MFC item requires examinees to choose between two personality dimensions. As a result, fakers may focus their self-presentation on the dimensions they perceive to be more work relevant (e.g., drive) at the expense of others. A stricter SDM rule could have a similar effect by reducing the salience of an alternate cue (social desirability) for determining the "ideal" response.
To investigate this possibility, we calculated the standard deviation of Δ values across dimensions for each measurement format (see Table 4); a higher standard deviation indicates greater variation in faking across dimensions.
Both MFC formats, especially the MFC-strict format, had higher standard deviations than the SS format. This suggests that the MFC format, and perhaps stricter SDM, promoted a selective faking strategy.
Hypothesis 5 predicted that directed faking results would replicate using an incentivized faking manipulation. As shown in Table 4, incentivized faking produced small changes on the SS scales (mean Δ = .13) and even smaller changes on the MFC-relaxed (.08) and MFC-strict (.04) scales. A two-way MANOVA revealed a significant main effect of faking instructions, F(1, 632) = 2.17, p = .006. However, neither the main effect of measurement format nor the format-instructions interaction reached statistical significance. As such, the score elevation results did not support Hypothesis 5. More broadly, the average faking effect sizes suggested that the monetary incentive was only modestly successful at inducing faking. Without a strong incentive to fake in the first place, the relative advantage of the faking-resistant MFC format was greatly diminished.

Criterion-Related Validity
Due to the combination of predictors, criteria, and experimental conditions, it was necessary to summarize a total of 1,080 validity coefficients to test Hypotheses 3 and 4. One option would be to simply calculate a mean validity coefficient for each experimental condition. However, this incorrectly assumes that the true correlations between all predictors and criteria are positive. In fact, a negative predictor-criterion correlation can be equally useful for selection if it represents the true direction of the relationship. Therefore, we developed a universal set of keys to indicate the appropriate signs for all 120 predictor-criterion relationships.
To do so, we first calculated an unweighted mean of validity coefficients for every predictor-criterion pair across all conditions. To minimize the effects of sampling error on keying decisions, we discarded any pair whose mean validity coefficient was less than .10 in absolute value. For each of the remaining 23 predictor-criterion pairs, we treated the sign of the grand mean validity coefficient as the true direction of the relationship, so that conditions producing a relationship in the opposite direction were penalized. Validity coefficients for these 23 pairs are summarized in Table S5, and validity coefficients for all 120 predictor-criterion pairs are available in Tables S6-S14.
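A minimal sketch of this keying procedure, assuming the coefficients are arranged as a 9-condition x 120-pair array (variable names are ours, and the random matrix merely stands in for the observed coefficients):

import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(0.0, 0.15, size=(9, 120))  # stand-in for the 1,080 observed validities

grand_mean = r.mean(axis=0)             # unweighted mean validity per predictor-criterion pair
keep = np.abs(grand_mean) >= 0.10       # discard pairs likely dominated by sampling error
keys = np.sign(grand_mean[keep])        # +1/-1 key = presumed true direction of the relationship
keyed_r = r[:, keep] * keys             # opposite-direction validities now count against a condition
condition_means = keyed_r.mean(axis=1)  # mean keyed validity for each of the nine conditions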
Mean validity coefficients by condition are presented in Table 5. Under honest instructions, all three measurement formats had a mean validity of .15. Thus, as predicted in Hypotheses 3 and 4, no format was more valid than the others in the absence of faking. Contrary to our expectations, however, the SS scales had the highest overall validity under fake-good instructions (although z-tests contrasting the overall SS and MFC-strict/MFC-relaxed validity coefficients did not reach significance). This pattern held for every breakout category of criterion, including academic performance/achievement, job performance, and the STEM-B.
Thus, the results failed to support Hypothesis 3, which predicted that MFC scales would perform better than their SS counterparts when participants were directed to fake. An alternate version of Hypothesis 3 might predict that the relative advantage of SS scales would diminish when participants faked, thereby accounting for the possibility that the SS scales could be more valid to begin with but lose some of that advantage due to faking. However, even this qualified Hypothesis 3 was not supported.
Hypothesis 4 predicted that MFC-strict scales would be more valid than MFC-relaxed scales but only under faking instructions. On average, MFC-strict validity coefficients were .05 higher than MFC-relaxed ones when participants faked, but the difference was not statistically significant. Therefore, Hypothesis 4 was not supported.
Once again, Hypothesis 5 predicted that directed faking results would replicate using an incentivized faking manipulation. Because the directed faking manipulation did not produce the expected outcomes (or other significant results to replicate), we did not formally evaluate Hypothesis 5 with respect to the validity results. Regardless, it is worth noting that the SS scales produced the highest validity coefficients among incentivized fakers, although the SS-MFC differences in the incentivized group did not reach statistical significance.

Discussion

Motivated Score Elevation
Our directed faking results showed substantial differences between measurement formats in both the magnitude and pattern of faking. As expected, fakers were far less successful at raising their scores on the MFC scales. In addition, it appears that fakers selectively distorted on specific traits to a greater extent when responding to an MFC inventory. A closer examination of the distortion patterns suggests they favored traits with higher face validity for employee selection, including drive, cooperativeness, composure, ambition, and mastery.
Furthermore, except for openness, selective faking produced notable discrepancies between aspects of the same Big Five traits. Although the SS scales showed strong distortion on both aspects of conscientiousness, MFC fakers focused primarily on drive and showed only modest score elevation on structure. Extraversion showed a similar pattern, with fakers elevating their scores by nearly half a standard deviation on MFC-Liveliness but barely at all on MFC-Assertiveness. Fakers consistently elevated their scores on both aspects of emotional stability. However, whereas faking produced almost identical (very large) increases on both SS scales, participants faked more on composure than positivity in the MFC conditions. The difference between aspects was most pronounced for agreeableness: Participants raised their MFC-Cooperativeness scores by an average of .67 standard deviations, whereas faked MFC-Sensitivity scores were .20 standard deviations lower than honest scores.

…focusing on specific traits that are attractive to fakers. The latter suggestion may be especially helpful for producing clearer results with incentivized faking designs, given the modest strength of these manipulations compared to directed faking.
Our results also suggest that practitioners should consider potential tradeoffs between face validity and reducing impression management when designing selection systems. When response distortion is a concern, there may be substantial benefits to selecting on predictively valid traits that are less attractive to fakers. If an MFC inventory is used for selection, the inclusion of unscored "distractor" scales may reduce impression management on target dimensions while also increasing the assessment's face validity.

Criterion-Related Validity
Our validation results failed to replicate Salgado et al.'s (2014) meta-analytic findings, which suggested quasi-ipsative MFC scales should outperform their SS counterparts. As such, it is possible that quasi-ipsative MFC scales do not provide a robust validity advantage. In keeping with this possibility, Lee et al. (2018) compared three sets of personality scores obtained from an MFC measure (using one quasi-ipsative and two ipsative scoring methods) to scores from a Likert-type version of the measure. Although all four methods showed a similar pattern of correlations with criterion measures, the Likert-type measure generally produced larger validity coefficients. Although the reason for this difference was unclear, the authors speculated that it could be due to common method bias because the criterion measures were also Likert scales.
Regardless, it is interesting that even the presence of extreme response distortion did not cause a large decrement in the validity of SS scales or improve the relative advantage of faking-resistant alternatives. Furthermore, we observed a similar trend across criteria that varied in terms of potential common method variance with SS scales. On one end of this spectrum, our self-reported job performance measures shared a Likert-type response format with the SS scales, giving the SS scales a potential edge in predicting these criteria. Our measures of GPA and degree attainment requested objective information rather than self-assessments and did not use a Likert-type response scale, but they were still likely prone to some degree of socially desirable distortion (Kuncel et al., 2005). Finally, the STEM-B required participants to correctly identify the most effective responses to specific emotional situations, making it resistant to impression management (i.e., a test taker cannot "fake" knowing the correct response).
One reasonable explanation for our validity results is that faking fundamentally changed what the SS scales measured, adding a new source of variance that contributed to the prediction of various external criteria. Past factor analytic research has found evidence of a general "ideal employee" factor in applicant samples (e.g., Schmit & Ryan, 1993), which may capture predictively useful implicit theories about how to be a good employee. Although the present study was not designed to address this question, we did conduct supplemental analyses to explore the possibility. First, we computed an average correlation of only .32 between participants' honest and faked scores on the same SS scales, suggesting the faked scores no longer assessed the intended constructs. Next, we used confirmatory factor analysis to determine if faking introduced a general method factor. As shown in Table 5, faking strengthened an already substantial general factor in the SS (but not the MFC) scales.
To determine whether this general factor impacted validity, we calculated new validity coefficients with the general factor partialled out from the predictor scores (see Table 5). Removing general factor variance substantially reduced average validity coefficients for all conditions. This suggests that shared variance between personality dimensions, whether real or artifactual, did contribute to the predictive validity of the dimension scores. The SS scales showed the most precipitous decline in validity-especially in the directed faking condition, where the average validity coefficient dropped from .15 to .01. This indicates that (a) directed faking decimated the validity of the individual SS dimensions and (b) the SS scales retained their validity in the presence of faking by measuring a new construct. In other words, faking eroded the SS scales' construct validity while simultaneously preserving their criterion-related validity. This is problematic to the extent that employers are interested in selecting for specific personality traits, as opposed to simply achieving predictive validity. On the other hand, it is unclear to what extent this phenomenon occurs given typical levels of distortion in preemployment testing.
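As a rough sketch of the partialling step (not the study's CFA-based procedure; here the first principal component stands in for the general factor, and all names are ours):

import numpy as np

def partial_out_general_factor(X):
    """Residualize each column of X (respondents x scales) on the first
    principal component, a stand-in for a general method factor."""
    Xc = X - X.mean(axis=0)                      # center each scale
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    g = U[:, 0] * S[0]                           # PC1 scores as the "general factor"
    beta = Xc.T @ g / (g @ g)                    # per-scale regression weights on g
    return Xc - np.outer(g, beta)                # residual (general-factor-free) scores

# Usage: correlate residualized predictors with a criterion, e.g.,
# X_resid = partial_out_general_factor(scores)   # scores: (n_respondents, 15)
# r = np.corrcoef(X_resid[:, 0], criterion)[0, 1]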

Future Directions
A key feature of this study was that it manipulated both motivation and ability to fake in multiple ways. However, the observed patterns of faking suggested that the faking incentive and SDM manipulations were fairly weak, making it difficult to fully parse their effects. This limited our ability to make nuanced inferences about the effects of typical applicant faking or the merits of directed faking manipulations. Future research could remedy this issue with stronger incentives to fake and larger discrepancies between strict and relaxed SDM rules.
To the extent that quasi-ipsative MFC scales are generally better predictors of performance, it remains unclear why this is the case. The magnitude and causes of their predictive advantage remain important questions for the future of personality testing. Further experimental research using finely tuned faking manipulations, coupled with an increased focus on underlying constructs, should provide valuable insights and could substantially improve the accuracy of high-stakes personality assessment.