A New Investigation of Fake Resistance of a Multidimensional Forced-Choice Measure: An Application of Differential Item/Test Functioning

To address faking issues associated with Likert-type personality measures, multidimensional forced-choice (MFC) measures have recently come to light as important components of personnel assessment systems. Despite various efforts to investigate the fake resistance of MFC measures, previous research has mainly focused on the scale mean differences between honest and faking conditions. Given the recent psychometric advancements in MFC measures (e.g., Brown & Maydeu-Olivares, 2011; Stark et al., 2005; Lee et al., 2019; Joo et al., 2019), there is a need to investigate the fake resistance of MFC measures through a new methodological lens. This research investigates the fake resistance of MFC measures through recently proposed differential item functioning (DIF) and differential test functioning (DTF) methodologies for MFC measures (Lee, Joo, & Stark, 2020). Overall, our results show that MFC measures are more fake resistant than Likert-type measures at the item and test levels. However, MFC measures may still be susceptible to faking if MFC measures include many mixed blocks consisting of positively and negatively keyed statements within a block. It may be necessary for future research to find an optimal strategy to design mixed blocks in the MFC measures to satisfy the goals of validity and scoring accuracy. Practical implications and limitations are discussed in the paper.

Historically, personality measures have been widely used for managerial and organizational decision making (Stark et al., 2012; Hough et al., 2015). Interest in personality measures stems from research findings that personality predicts important job-related outcomes such as job performance (Barrick et al., 2001), training performance (Colquitt et al., 2000), and teamwork and team performance (Peeters et al., 2006). Additionally, the use of personality measures reduces adverse impact and provides incremental validity over cognitive ability tests in predicting job performance (Hough & Oswald, 2008).
Despite this popularity, there have been serious concerns about faking (i.e., conscious attempts to make a positive impression) associated with Likert-type measures. Likert-type measures present statements to respondents one at a time and ask them to indicate their level of agreement or disagreement on a set of response categories (e.g., five or seven options). However, in high-stakes settings such as personnel selection, respondents can easily fake their answers by simply choosing a more socially desirable response option. The resulting responses can distort test reliability and validity, change the rankings of applicants, and reduce the utility of selection systems (e.g., Bott et al., 2007; Komar et al., 2008; Mueller-Hanson et al., 2003; Peeters & Lievens, 2005; Salgado, 2016).
Personnel Assessment and Decisions: A New Investigation of Fake-Resistance of MFC Measure
To address faking issues associated with Likert-type personality measures, multidimensional forced-choice (MFC) measures have recently come to light as important components of personnel assessment systems (e.g., Anguiano-Carrasco et al., 2015; Brown & Maydeu-Olivares, 2011; Guenole et al., 2018; Lee et al., 2018; Lee et al., 2020; Stark et al., 2012; Wetzel & Greiff, 2018). MFC measures present two (i.e., a pair), three (i.e., a triplet), or four (i.e., a quartet) statements representing different constructs within an item block, which forces respondents either to select a "most like me" statement or to rank the statements from "most like me" to "least like me" in each block. Respondents may have difficulty discerning the most desirable answers because statements within a block are matched on a similar level of social desirability and/or item extremity. Therefore, faking responses can be reduced.
Research findings on the effectiveness of MFC measures have been somewhat mixed. For example, Heggestad et al. (2006) found that, in an individual-level analysis, MFC measures do not necessarily reduce faking relative to Likert-type measures. More recently, Young (2018) found that a pairwise preference MFC measure of the dark triad was not more fake resistant than a Likert-type measure. Similarly, Ng et al. (2021) found that a triplet MFC measure of character did not reduce faking responses relative to a Likert-type measure.
However, a multitude of studies have provided results more favorable to MFC measures, showing that MFC measures successfully reduce test score inflation (e.g., Cao & Drasgow, 2019; Martin et al., 2002; Christiansen et al., 2005; Jackson et al., 2000; Trent et al., 2020; Vasilopoulos et al., 2006; Lee et al., 2019) and maintain validity in motivated testing situations (e.g., Bartram, 2007; Hirsh & Peterson, 2008; Lee et al., 2018; O'Neill et al., 2017; Zhang et al., 2020).

Investigating Fake Resistance for MFC Measures
Despite various efforts to investigate the fake resistance of MFC measures, prior research has mainly focused on the scale mean differences between honest and faking conditions (e.g., Martin et al., 2002; Converse et al., 2008; Fisher et al., 2019; Jackson et al., 2000; O'Neill et al., 2017; Vasilopoulos et al., 2006). For example, Jackson et al. (2000) showed that the MFC measure is more effective in reducing faking than a Likert-type measure, as indicated by the mean differences (i.e., Cohen's d) between the honest and faking samples (i.e., 0.32 for the MFC measure vs. 0.95 for the Likert-type measure). Further, Martin et al. (2002) conducted an analysis of variance and found a significant interaction between test forms (MFC and Likert-type measures) and test conditions (honest and faking). The MFC measure yielded no differences in personality scores regardless of whether respondents were in the honest or the faking condition. In contrast, the Likert-type measure produced significant score inflation in the faking condition.
Nevertheless, previous studies do not provide an in-depth understanding of the response process of the two personality item formats under honest and motivated test conditions, as they focused exclusively on composite scale-level scores. Given the recent advancements of item response theory (IRT) for MFC measures (e.g., Brown & Maydeu-Olivares, 2011; Stark et al., 2005; Lee et al., 2019; Joo et al., 2020), there is a need to investigate the fake resistance of MFC measures through a new methodological lens. One approach is to apply differential item functioning (DIF) and differential test functioning (DTF) methodologies across different testing situations (Robie et al., 2001). DIF refers to the situation in which an item has different response probabilities for different groups of respondents even though they have the same latent trait level (Camilli & Shepard, 1994), and DTF refers to differences in the expected total test scores of respondents with an equal level of the latent traits (Drasgow & Hulin, 1990). Through DIF and DTF methodologies, it is possible to evaluate which personality measure (i.e., MFC or Likert-type) is more fake resistant at the item and test levels across different testing conditions. Research suggests that the presence of DIF and DTF in personality measures can be interpreted as evidence of faking (Griffin et al., 2004; Stark et al., 2001; Zickar & Robie, 1999).

Faking the Response Process in MFC and Likert-Type Measures
To model the faking response process, Zickar and Robie (1999) proposed a changing person paradigm and a changing items paradigm. The former assumes that faking changes respondents' latent trait levels (i.e., a theta shift). In contrast, the latter assumes that respondents perceive items differently, resulting in differences between item parameters. Although research has generally supported the changing person paradigm (Robie et al., 2001; Stark et al., 2004; Zickar & Robie, 1999), this study employs the changing items paradigm because DIF and DTF concern item- and test-level biases, and the changing items paradigm enables an evaluation of the differential nature of item responses between MFC and Likert-type measures under honest and faking conditions. Zickar (2000) noted that changes in how items are perceived and interpreted might yield different consequences of choosing particular items. Respondents may go through a different decision-making process for MFC and Likert-type items because the two formats involve distinct cognitive processes of perceiving and deciding among item responses. For Likert-type items, respondents are assumed to evaluate their absolute level of agreement or disagreement with each statement and indicate the response option that best fits their latent trait. In contrast, for MFC items, respondents are assumed to make comparative judgments among statements within a block and rank them according to their preference.
In MFC measures, ranking decisions involve a much more complicated interaction among statements within a block. Lin and Brown (2017) noted that item parameters (e.g., loadings and thresholds) for MFC measures could be affected by interactions of surrounding statements within a block, which is referred to as a contextual effect. Some statements become more socially desirable than other statements, depending on the combination of different traits within a block, leading to "desirability-induced response biases" (Lin & Brown, 2017, p. 409). The contextual effect of MFC measures would not only make DIF situations more complicated but also yield a different kind of differential functioning compared to Likert-type measures. Therefore, it is not guaranteed that item parameters obtained from the single-statement Likert-type measure remain invariant when the statements are combined in MFC blocks. Likewise, the measurement invariance of MFC measures between honest and faking test conditions should not simply be assumed. Nevertheless, previous research has generally accepted the invariance assumption without testing for measurement bias (Morillo et al., 2019; Pavlov et al., 2019). Considering that the main purpose of MFC measures is to reduce faking, it is particularly important to confirm the measurement invariance of the MFC measure between honest and faking conditions.

Recent Developments of the MFC DIF Method
Recently, Lee et al. (2020) proposed a new DIF detection method for triplet MFC measures at the block level based on the Thurstonian IRT (TIRT) model. Their work showed the efficacy of the proposed MFC DIF method through various Monte Carlo simulation conditions and an empirical demonstration. This MFC DIF method can be applied to test the fake resistance of MFC measures compared to Likert-type measures through a within-subject experimental design (e.g., honest and faking conditions). However, DIF results based on chi-square significance statistics have been criticized because of their sensitivity to sample size and their limited practical implications (Drasgow et al., 2018; Meade et al., 2012; Stark et al., 2004). Nye and Drasgow (2011) suggested that the statistical significance DIF test "does not address the practical importance of observed differences between groups and does not provide users with information about the effects of nonequivalence on the organizational outcomes of an assessment" (p. 966). To better understand the size of DIF, Lee et al. (2020) proposed a DIF effect size for the MFC measure by adapting Nye's (2011) DIF effect size.
Furthermore, from a practical perspective, "DTF is the primary concern for organizations because selection decisions are based on total test scores rather than individual items" (Stark et al., 2004, p. 498). Lee et al. (2020) also proposed DTF effect sizes for MFC measures by adopting the method used by Stark et al. (2004). By applying these methods, the measurement invariance of the MFC measure can be evaluated at both the item and test levels. If the MFC measure yields fewer DIF items, smaller DIF effect sizes, and smaller DTF effect sizes between honest and faking conditions, this could serve as further empirical evidence that the MFC measure may be more fake resistant than a Likert-type measure.

The Present Study
This study aims to (a) investigate the measurement equivalence of MFC and Likert-type personality measures between honest and faking conditions; (b) evaluate how DIF occurs differently between the two measures; and (c) determine which measure produces smaller DIF and DTF effect sizes. To achieve this, four research questions (RQs) are proposed:

Research Measure and Sample
This study uses the same Big Five personality MFC triplet measure and Likert-type measure as Lee et al. (2018). The measure comprises 12 statements per dimension, with positively and negatively keyed statements (e.g., 8 positively and 4 negatively keyed statements per dimension) mixed to enhance trait score estimation accuracy, as recommended by Brown and Maydeu-Olivares (2011).
For data collection, a within-subject design was used. In Korea, 537 college students answered the 20-triplet MFC personality measure and the corresponding Likert-type measure (i.e., the same 60 statements in 20 triplets) under honest responding instructions. Two weeks later, 460 of these participants took part in the faking condition. Under the honest instruction, participants were notified that the results would be used only for research purposes and were requested to answer as honestly as possible. Under the faking instruction, respondents were requested to imagine that they were applying for their dream job in a personnel selection process (e.g., Mueller-Hanson et al., 2006). Four hundred seventeen students provided complete answers in both conditions (50% male/female with an average age of 20.94 years), thereby creating the data analyzed in this study. Because two MFC blocks (all positively keyed) consistently yielded very large residual variances, which caused estimation problems for the DIF analysis, they were removed. The remaining 18 blocks were used for subsequent MFC DIF analyses. The same single statements were used for the Likert-type measure. The items for the MFC and Likert-type measures are presented in Tables 2 and 3.

Analytical Strategy
For the MFC DIF test, the TIRT model (Brown & Maydeu-Olivares, 2011) and the TIRT DIF method were applied. For the DIF test of the Likert-type measure, the categorical MACS DIF method was applied.
In the TIRT model for a triplet measure, rank response data were transformed into three sets of binary outcomes: the comparison between the first and the second statements (y i1i2), the comparison between the first and the third statements (y i1i3), and the comparison between the second and the third statements (y i2i3). The transformed binary outcomes were then modeled and analyzed with a two-dimensional standard normal ogive IRT model, as described in detail by Brown and Maydeu-Olivares (2011).
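The rank-to-binary transformation described above can be sketched in a few lines. This is a minimal illustration, assuming ranks are coded 1 for "most like me" and that an outcome of 1 means the first statement in a pair was preferred over the second; the function name `ranks_to_binary` and the example data are hypothetical and are not taken from the study.

```python
from itertools import combinations


def ranks_to_binary(ranks):
    """Convert one block's ranking into pairwise binary outcomes.

    ranks[j] is the rank given to statement j (1 = "most like me").
    Returns a dict mapping (i, k), i < k, to 1 if statement i was
    preferred over statement k and 0 otherwise.
    """
    if sorted(ranks) != list(range(1, len(ranks) + 1)):
        raise ValueError("ranks must be a permutation of 1..n")
    return {
        (i, k): int(ranks[i] < ranks[k])
        for i, k in combinations(range(len(ranks)), 2)
    }


# Hypothetical respondent: third statement ranked first, first statement
# ranked second, second statement ranked last.
outcomes = ranks_to_binary([2, 3, 1])
print(outcomes)  # {(0, 1): 1, (0, 2): 0, (1, 2): 0}
```

For a triplet, the three dictionary entries correspond to y i1i2, y i1i3, and y i2i3, respectively.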
In practical settings, it is generally impossible to know in advance which blocks are free from DIF and which are suitable anchors for free baseline DIF tests. Thus, a sequential free baseline approach was applied for TIRT DIF detection and categorical MACS DIF detection. The sequential free baseline approach has been found effective in detecting DIF with low Type I error and high power in simulation studies (Chun et al., 2016; Lopez et al., 2009; Meade & Wright, 2012).
Last, the DIF effect size was calculated to further investigate the identified DIF items of the MFC and Likert-type measures by adapting Nye's (2011) method. Furthermore, the DTF effect sizes for the MFC and Likert-type personality measures were computed by adapting Stark et al.'s (2004) method. The effect sizes can be interpreted in the same way as Cohen's d (0.2, 0.5, and 0.8 for small, medium, and large, respectively). The Appendix provides a detailed description of the DIF and DTF effect sizes.
A direct comparison of DIF results between MFC and Likert-type measures is difficult because MFC DIF is tested at the block level, whereas Likert-type DIF is tested at the single-statement level. Thus, the Likert-type measure was considered a baseline to evaluate how single-statement items in the Likert-type measure function differently when presented in the MFC measure. Also, this study relied more on describing how DIF occurs differently and evaluating the DIF and DTF effect sizes than on simply comparing the number of detected DIF items.
Table 1 presents descriptive statistics for the Likert-type and MFC personality measures across honest and faking conditions. We note that the MFC data in Table 1 were scored using the classical test scoring method. Classical test scoring for MFC measures is still commonly used in research and practical settings (e.g., Bowen et al., 2002; Converse et al., 2008; Fisher et al., 2019; Heggestad et al., 2006; Jackson et al., 2000; Martin et al., 2002; O'Neill et al., 2018; Vasilopoulos et al., 2006). Although there are different approaches to classical test scoring for MFC measures (Salgado & Lado, 2018), we chose the "inverse scoring" method. If a positively keyed statement was chosen as most like me or a negatively keyed statement was chosen as least like me, two points were assigned to the statement. In contrast, if a positively keyed statement was selected as least like me or a negatively keyed statement was selected as most like me, zero points were assigned. Second-ranked statements were scored as one point. Overall, smaller effect sizes (i.e., Cohen's d) were found for the MFC measure than for the Likert-type measure across the Big Five personality traits (Likert-type vs. MFC: d = 0.54 vs. 0.36 for agreeableness; d = 0.39 vs. -0.10 for openness; d = 1.05 vs. 0.92 for conscientiousness; d = 0.73 vs. 0.60 for extraversion; d = -0.68 vs. -0.59 for neuroticism).
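The inverse scoring rule described above can be written as a short function. This is a minimal sketch for a single triplet block, assuming ranks are coded 1 ("most like me") to 3 ("least like me") and keying is coded +1 or -1; the function name `inverse_score` and the example block are hypothetical.

```python
def inverse_score(ranks, keys):
    """Score one triplet block with the inverse scoring rule.

    ranks[j]: rank of statement j (1 = "most like me", 3 = "least like me").
    keys[j]:  +1 for a positively keyed statement, -1 for a negatively keyed one.
    Returns the points (0, 1, or 2) assigned to each statement.
    """
    if sorted(ranks) != [1, 2, 3]:
        raise ValueError("ranks must be a permutation of 1, 2, 3")
    points = []
    for rank, key in zip(ranks, keys):
        raw = 3 - rank  # most like me -> 2, second -> 1, least like me -> 0
        # Negatively keyed statements are reversed: least like me earns 2 points.
        points.append(raw if key > 0 else 2 - raw)
    return points


# Hypothetical mixed block: two positively keyed statements, one negatively keyed.
print(inverse_score([1, 2, 3], [+1, +1, -1]))  # [2, 1, 2]
```

Scale scores would then be obtained by summing each statement's points within the trait it measures.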
A preliminary analysis also tested whether the same five personality constructs were measured under the two instruction conditions. Configural invariance was tested, and both measures satisfied configural invariance between honest and faking conditions (RMSEA = .06, CFI = .89, and TLI = .88 for the Likert-type measure; RMSEA = .03, CFI = .91, and TLI = .91 for the MFC measure).
Table 2 shows the DIF analysis results for the Likert-type measure. Based on the Bonferroni-corrected alpha, 15 out of 54 items were classified as DIF items. More specifically, two items (items 3 and 25) were identified as DIF for conscientiousness; three items (items 5, 14, and 51) for extraversion; one item (item 28) for agreeableness; four items (items 18, 39, 40, and 46) for openness; and five items (items 16, 26, 42, 45, and 50) for neuroticism. Table 3 shows the DIF analysis results for the MFC measure using both a nominal alpha level and a Bonferroni-corrected alpha level. Interestingly, when single statements were combined into MFC blocks, only one block (i.e., block 11) was flagged as DIF based on the Bonferroni-corrected alpha. In sum, 15 items were identified as DIF in the Likert-type measure, whereas only one block was identified as DIF when the statements were formed into triplet MFC blocks (RQ1).
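For readers unfamiliar with the Bonferroni correction used above, the following sketch shows how items would be flagged. The nominal alpha of .05 is an assumption (the text does not state the value), and the p-values are invented for illustration only. Under that assumption, testing 54 items would give a corrected threshold of .05 / 54, or roughly .0009.

```python
def bonferroni_flags(p_values, nominal_alpha=0.05):
    """Flag tests whose p-value falls below the Bonferroni-corrected alpha.

    nominal_alpha defaults to 0.05, which is an assumed value.
    Returns the corrected alpha and a list of booleans (True = flagged as DIF).
    """
    if not p_values:
        raise ValueError("p_values must not be empty")
    corrected_alpha = nominal_alpha / len(p_values)
    return corrected_alpha, [p < corrected_alpha for p in p_values]


# Hypothetical p-values from four DIF tests (not results from the study).
corrected, flags = bonferroni_flags([0.0001, 0.004, 0.03, 0.20])
print(flags)  # [True, True, False, False]
```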

RQ2: How differently does DIF occur between the two measures?
Tables 2 and 3 show that fewer DIF items occurred when statements were formed into MFC blocks than when they were presented as single statements in the Likert-type measure. As an example, items 16 and 18 in the Likert-type measure were detected as DIF items, but MFC block 6 (corresponding to items 16, 17, and 18 in the Likert-type measure) was identified as a non-DIF block. Figure 1 shows the item characteristic curves (ICCs) for items 16, 17, and 18 in the Likert-type measure across the honest and faking conditions. Items 16 and 18 were identified as DIF favoring the faking condition, and the DIF effect sizes were 0.30 (small to medium DIF) for item 16 and 1.46 (large DIF) for item 18. In contrast, when items 16, 17, and 18 were formed into an MFC triplet (block 6), the DIF effect was substantially reduced. Figure 2 shows the item response surfaces of the three binary outcomes (i.e., y i16i17, y i16i18, and y i17i18), which were very similar across conditions for the triplet block. Importantly, the DIF effect sizes of the three binary outcomes were negligible (0.09, 0.06, and 0.09, respectively), and the average block effect size was 0.08. Although item 18 (i.e., Am not interested in abstract ideas) in the Likert-type measure showed a very large DIF effect size (d DIF = 1.46), the effect sizes of the binary outcomes associated with this statement were substantially smaller in MFC block 6. That is, d DIF decreased to 0.06 when the third statement (Am not interested in abstract ideas) was compared with the first statement (i.e., Fear for the worst) within the block. Also, d DIF decreased to 0.09 when the third statement was compared with the second statement (i.e., Keep in the background). Similar patterns were also found in other cases.
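The item response surfaces discussed above plot, for each binary outcome, the probability of preferring one statement over the other as a function of the two underlying traits. A minimal sketch follows, assuming the commonly cited normal ogive form of the TIRT model from Brown and Maydeu-Olivares (2011) with loadings λ, a threshold γ, and statement uniquenesses ψ². All parameter values are hypothetical, and `max_surface_gap` is only a descriptive summary introduced here for illustration; it is not the d DIF effect size used in the study.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def preference_prob(eta_a, eta_b, lam_i, lam_k, gamma, psi2_i=1.0, psi2_k=1.0):
    """P(statement i is preferred over statement k) under a normal ogive model.

    eta_a, eta_b: trait levels measured by statements i and k.
    lam_i, lam_k: factor loadings; gamma: threshold of the binary outcome.
    psi2_i, psi2_k: uniquenesses (assumed to be 1.0 by default here).
    """
    z = (-gamma + lam_i * eta_a - lam_k * eta_b) / math.sqrt(psi2_i + psi2_k)
    return normal_cdf(z)


def max_surface_gap(params_honest, params_faking, grid):
    """Largest absolute gap between two response surfaces over a trait grid."""
    return max(
        abs(preference_prob(a, b, **params_honest)
            - preference_prob(a, b, **params_faking))
        for a in grid
        for b in grid
    )


grid = [x / 2.0 for x in range(-6, 7)]  # trait values from -3.0 to 3.0
honest = {"lam_i": 0.8, "lam_k": 0.7, "gamma": 0.1}  # hypothetical parameters
faking = {"lam_i": 0.8, "lam_k": 0.7, "gamma": 0.6}  # hypothetical threshold shift
print(max_surface_gap(honest, faking, grid) > 0.0)  # True
```

When the two parameter sets are identical, the surfaces coincide and the gap is zero, which is the pattern the near-identical surfaces for block 6 suggest.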
Interestingly, we found that non-DIF statements in the Likert-type measure (e.g., items 31, 32, and 33) became a DIF block (e.g., block 11) when they were combined into a block in the MFC measure. Figure 3 shows quite similar ICCs for items 31, 32, and 33 (in the Likert-type measure) with small effect sizes (d DIF = 0.27, 0.12, and 0.15) between the honest and faking conditions. However, when the same statements were used in MFC block 11, the binary outcomes of the block yielded significant DIF. The y i31i32 and y i31i33 outcomes in Figure 4 show very different item response surfaces. In particular, the direction of the loading on conscientiousness changed from the honest to the faking condition when the statement was compared with other statements measuring extraversion (y i31i32) and agreeableness (y i31i33). This may reflect unexpected interactions among the surrounding statements within the block. We examined statement endorsement proportions for the binary comparison outcomes and found that the endorsement proportions of the three statements (A. Waste my time; B. Find it difficult to approach others; C. Trust what people say) were roughly evenly distributed in the honest condition (56.8% vs. 43.2% for the comparison between statements A and B; 51% vs. 49% for the comparison between statements A and C; 41% vs. 59% for the comparison between statements B and C). However, the endorsement proportions changed substantially in the faking condition when the positive statement was compared with a negative statement within the block (42% vs. 58% for the comparison between statements A and B; 16% vs. 84% for the comparison between statements A and C; 19% vs. 81% for the comparison between statements B and C). We suspect that "desirability-induced response biases" occurred in this case.
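The endorsement proportions reported above can be computed directly from the rank data. The following is a minimal sketch, assuming ranks are coded 1 for "most like me"; the function name `endorsement_proportions` and the four example respondents are hypothetical and do not reproduce the study's data.

```python
from itertools import combinations


def endorsement_proportions(rank_data):
    """Proportion of respondents preferring statement i over statement k.

    rank_data: list of rank lists, one per respondent, for a single block
               (1 = "most like me").
    Returns a dict mapping (i, k), i < k, to the proportion preferring i.
    """
    if not rank_data:
        raise ValueError("rank_data must not be empty")
    n_respondents = len(rank_data)
    n_statements = len(rank_data[0])
    proportions = {}
    for i, k in combinations(range(n_statements), 2):
        preferred = sum(1 for ranks in rank_data if ranks[i] < ranks[k])
        proportions[(i, k)] = preferred / n_respondents
    return proportions


# Hypothetical rankings from four respondents for one triplet block.
honest = [[1, 2, 3], [2, 1, 3], [3, 1, 2], [1, 3, 2]]
print(endorsement_proportions(honest))
# {(0, 1): 0.5, (0, 2): 0.75, (1, 2): 0.75}
```

Running the same function on the honest and faking responses to a block and comparing the two dictionaries would show shifts of the kind described for block 11.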

RQ3: How do DIF effect sizes differ across MFC and Likert-type measures?
Tables 2 and 3 generally show that larger DIF effect sizes were found in the Likert-type measure (M = 0.27, range = [0.00, 1.46]) than in the MFC measure (M = 0.18, range = [0.00, 0.91]). Overall, this finding indicates that MFC measures can be a more fake-resistant assessment tool. However, there were also notable exceptions. When differently keyed statements were compared in a mixed block (i.e., a block consisting of positively and negatively keyed statements), the corresponding pairwise comparison still yielded medium to large DIF effect sizes. For example, in MFC block 3 (i.e., A. Panic easily; B. Do not enjoy going to art museums; C. Know how to captivate people), when the first statement A was compared with the second statement B, the DIF effect size was 0.16. However, when the first statement A and the second statement B were compared with the third statement C, the DIF effect [...]
We found that five blocks yielded block-level DIF effect sizes ranging from 0.2 to 0.3, and one block yielded a medium effect size of 0.69, with all of them being mixed blocks. Overall, these results show that the MFC measure generally yields smaller DIF effect sizes than the Likert-type measure. However, DIF can still occur when positively and negatively keyed statements are mixed in the same MFC block.

RQ4: How do DTF effect sizes differ across two measures?
To examine the practical importance of measurement invariance at the test level, this study computed overall DTF effect sizes for the MFC and Likert-type measures across the five dimensions. The d DTF was -0.08 for the MFC measure and -0.48 for the Likert-type measure. At the test level, the MFC measure yielded minimal test bias between test conditions, whereas the Likert-type measure produced a moderate level of test bias favoring the faking condition.

DISCUSSION
This research employed the changing items paradigm to evaluate the differential nature of item responses between MFC and Likert-type measures under honest and faking conditions. The main findings are as follows. First, fewer DIF items occurred when statements were presented in an MFC block than when they were presented as single statements in the Likert-type measure. Based on the Bonferroni correction, only one block was identified as DIF for the MFC measure, whereas 15 items (i.e., statements) were detected as DIF for the Likert-type measure (RQ1). Second, when single statements in the Likert-type measure are used to make an MFC item, the same statements do not always show the same DIF results in both formats. Importantly, non-DIF items in the Likert-type measure do not guarantee item invariance in the MFC measure between the honest and faking conditions (RQ2). Third, lower DIF effect sizes were generally found for the MFC measure than for the Likert-type measure. However, pairwise comparisons involving positively and negatively keyed statements still present small to medium DIF effect sizes in MFC blocks (RQ3). Last, a much lower overall DTF effect size was found for the MFC measure than for the Likert-type measure (RQ4). Taken together, measurement invariance between test conditions can be better established in the MFC measure, which empirically supports the view that MFC measures could be more fake resistant than Likert-type measures.
Note (Figure 2). Block 6 is a non-DIF block. The d DIF = 0.09, 0.06, and 0.09 for y i16i17, y i16i18, and y i17i18, respectively. The horizontal axes represent the dimensions associated with the statements in the respective comparisons, and the vertical axis represents the probability of preferring the former statement to the latter in each instance. (a) and (b) are response surfaces for y i16i17 across honest and faking conditions; (c) and (d) are response surfaces for y i16i18 across honest and faking conditions; (e) and (f) are response surfaces for y i17i18 across honest and faking conditions. λ and γ represent factor loadings and thresholds.

Contributions to Faking Research on MFC Measures
This research provides important contributions to the personality faking research on MFC measures. Previous studies on the fake resistance of MFC measures mainly relied on the changing person paradigm by evaluating correlations of scores or scale mean differences between honest and faking conditions. However, to establish a meaningful score comparison between the test conditions, it is essential that items or tests provide equivalent measurement across test conditions (Nye &

Pavlov et al. (2019) introduced a regression-based moderation framework to model faking effects and investigated the scores from MFC and Likert-type measures. They first estimated item parameters of MFC measures from the honest sample, then scored the latent traits of the faking sample using the item parameters obtained from the honest sample. To this end, they "assumed measurement invariance across experimental conditions to ensure comparability of scores" (Pavlov et al., 2019, p. 720). However, if measurement invariance between honest and faking conditions is not satisfied, scores in the faking condition could be biased because they were obtained using non-invariant item parameters from the honest sample. If that happens, the research findings would not be tenable. In this vein, Pavlov et al. (2019) pointed out that "future studies are advised to more firmly establish the psychometric equivalence of the applied measures to optimize investigation of the forced-choice format as a faking mitigation strategy" (p. 732). The good news is that our results can serve as empirical evidence of measurement invariance between the test conditions and support previous faking research that compared MFC scores without testing item invariance (e.g., Pavlov et al., 2019).
Next, this research scored MFC response data using the TIRT model. Many studies examining the fake resistance of MFC measures have relied on the classical scoring method (e.g., Martin et al., 2002; Converse et al., 2008; Fisher et al., 2019; Heggestad et al., 2006; Jackson et al., 2000; O'Neill et al., 2017; Vasilopoulos et al., 2006). Fisher and colleagues (2019) recently showed that classical test scoring can be more valid than IRT-based scoring for MFC measures. Despite the wide use of and interest in classical scoring in organizational and research settings, this method has been criticized by applied psychometricians because it does not represent the comparative judgment process of selecting statements within a block (e.g., Brown & Maydeu-Olivares, 2011; Hontangas et al., 2015; Stark et al., 2012). By applying a model-based MFC IRT method and a newly developed DIF method for MFC measures, this study was able to represent the response process in MFC data more accurately (e.g., binary paired comparisons between statements in a block) and to evaluate measurement invariance at both the item level and the test level.
Last, this study not only examined the differential functioning of MFC and Likert-type measures at the item level but also investigated DTF effect sizes of the two formats at the test level. From an organizational perspective, hiring decisions are generally made based on test scores rather than individual item scores (Stark et al., 2004). This study showed that there was little test-level bias for the MFC measure, but there was a moderate level of test bias for the Likert-type measure. This result supports the conclusion that the MFC measure could be more effective at reducing faking at the test level than the Likert-type measure.
Note (Figure 4). Block 11 is a DIF block. The d DIF = 0.88, 0.91, and 0.27 for y i31i32, y i31i33, and y i32i33, respectively. The horizontal axes represent the dimensions associated with the statements in the respective comparisons, and the vertical axis represents the probability of preferring the former statement to the latter in each instance. (a) and (b) are response surfaces for y i31i32 across honest and faking conditions; (c) and (d) are response surfaces for y i31i33 across honest and faking conditions; (e) and (f) are response surfaces for y i32i33 across honest and faking conditions. λ and γ represent factor loadings and thresholds.

Practical Implications
Our study provides important practical implications for the development of MFC measures. A common practice for constructing MFC measures begins with developing single-statement item pools, evaluating the item invariance of single statements (via DIF analysis for single-statement items), and removing any problematic DIF items from the item pools. Then, researchers and practitioners construct MFC item blocks by grouping non-DIF single statements based on social desirability. In this process, measurement invariance between single-statement items and MFC items is generally assumed without testing differential item functioning of MFC measures between different test conditions (Morillo et al., 2019). However, this research shows that combining non-DIF single statements from the item pool does not necessarily guarantee item invariance between single statements and MFC blocks. During test development, we recommend that researchers and practitioners conduct MFC DIF tests and verify whether MFC blocks still achieve measurement invariance.
Although this research shows that the MFC measure holds measurement invariance across the test conditions better than the Likert-type measure, it is important to note that DIF can still occur depending on the combination of statements in an MFC block. This is particularly pronounced when positively and negatively keyed statements are compared in the same MFC block. Thus, MFC measures may still be susceptible to faking if they include many mixed blocks consisting of positively and negatively keyed statements within a block. We examined statement endorsement proportions within each MFC block in the honest condition to investigate whether self-enhancement bias could occur in honest MFC responses. We found that almost 30% (i.e., 16 out of 36 binary outcomes) of pairwise comparisons involved unequal endorsement (e.g., at least a 10% difference) favoring the more desirable statement. For example, for block 8 (A: Do things according to a plan; B: Get back at others; C: Feel comfortable around people), the B statement had a much lower endorsement proportion when it was compared to the A statement (30% vs. 70%) and to the C statement (27.6% vs. 72.4%). These findings indicate that participants, even in the honest condition, may strongly avoid a statement that appears to measure a negative personality trait. Thus, self-enhancement bias may still occur in honest research contexts or low-stakes settings.
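The endorsement check described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that each respondent ranks the three statements in a block, that a pairwise endorsement proportion is the share of respondents ranking the first statement above the second, and that "unequal" means the two proportions differ by at least 10 percentage points. The function names and the four toy respondents are hypothetical.

```python
from itertools import combinations


def pairwise_endorsement(rankings, statements):
    """Share of respondents preferring the first statement in each pair.

    rankings: list of dicts mapping statement label -> rank (1 = most preferred).
    """
    proportions = {}
    for first, second in combinations(statements, 2):
        prefer_first = sum(1 for r in rankings if r[first] < r[second])
        proportions[(first, second)] = prefer_first / len(rankings)
    return proportions


def flag_unequal(proportions, min_gap=0.10):
    """Keep pairs whose two endorsement proportions differ by at least min_gap."""
    return {pair: p for pair, p in proportions.items()
            if abs(p - (1.0 - p)) >= min_gap}


# Hypothetical toy data for one triplet block (not data from this study).
rankings = [
    {"A": 1, "B": 3, "C": 2},
    {"A": 2, "B": 3, "C": 1},
    {"A": 1, "B": 2, "C": 3},
    {"A": 3, "B": 1, "C": 2},
]
proportions = pairwise_endorsement(rankings, ["A", "B", "C"])
flagged = flag_unequal(proportions)
```

In this toy block only the A versus B comparison is flagged, because three of the four respondents rank A above B while the other two comparisons are split evenly.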
Following Brown and Maydeu-Olivares' (2011) suggestion, many studies have developed MFC measures by mixing positively and negatively keyed statements to improve scoring accuracy in the TIRT model (e.g., Bürkner et al., 2019; Lee et al., 2018; Ng et al., 2021). Although including negatively keyed statements may improve the scoring accuracy of MFC measures, several researchers have questioned whether mixed blocks undermine the original purpose of the MFC format, namely faking resistance (e.g., Bürkner et al., 2019; Fisher et al., 2019; Lin & Brown, 2017; Ng et al., 2021; Wang et al., 2017). Future research may need to identify an optimal strategy for designing mixed blocks in MFC measures that satisfies the goals of both validity and scoring accuracy (e.g., how many mixed blocks are needed, and how can effective mixed blocks be created?).

Limitations
This research has several limitations. First, this study used student samples in experimental settings rather than job applicant samples from real organizations. Future research could examine whether the results of this study generalize to real personnel selection settings. Second, this study used a somewhat unclear instruction for the faking condition. Respondents were asked to imagine their "dream job." However, as an anonymous reviewer pointed out, this method could be problematic in faking research because faking can emerge differently depending on the job type. Future research could provide respondents with more specific job instructions or could use real job applicants engaged in a real selection process. Third, this study used an MFC measure developed only for research purposes, not for personnel selection. Future research could verify this study's results using an MFC measure developed more elaborately for personnel selection purposes. Last, Lee et al. (2020) showed that the TIRT DIF method was effective for detecting DIF blocks with a large DIF size in the n = 500 condition and that Type I error rates were well controlled. However, the DIF tests were substantially underpowered in the small DIF size condition. This study's sample size of n = 417 may be too small to detect DIF blocks with small DIF sizes. Although our study gave more weight to DIF and DTF effect sizes than to the statistical significance of the DIF tests, future research should conduct measurement invariance tests with a larger sample to achieve good power even for small DIF.

Conclusions
In sum, MFC measures have been widely applied in noncognitive assessments in industrial and organizational psychology and education (Burrus et al., 2012). Overall, using advanced IRT methodology, we found greater support for measurement invariance between honest and faking conditions for the MFC measure than for the Likert-type measure, at both the item and test levels. However, we do not argue that the MFC format itself is inherently more fake resistant than Likert-type measures. As noted by Griffith and Robie (2013), "forced-choice measures of personality may both reduce faking and attain adequate levels of predictive validity if properly developed" (p. 272). We hope that practitioners and researchers will ensure the quality of MFC items by testing measurement invariance and will properly develop more fake-resistant MFC noncognitive assessments for various industrial and organizational settings.

Personnel Assessment and Decisions