Validity Evidence for Off-the-Shelf Language-Based Personality Assessment Using Video Interviews: Convergent and Discriminant Relationships with Self and Observer Ratings

Recommended Citation: Hickman, Louis; Tay, Louis; and Woo, Sang Eun (2019) "Validity Evidence for Off-the-Shelf Language-Based Personality Assessment Using Video Interviews: Convergent and Discriminant Relationships with Self and Observer Ratings," Personnel Assessment and Decisions: Vol. 5: Iss. 3, Article 3. DOI: https://doi.org/10.25035/pad.2019.03.003 Available at: https://scholarworks.bgsu.edu/pad/vol5/iss3/3

2005). Personality produces relatively lower adverse impact than general mental ability (Ryan, Ployhart, & Friedel, 1998) yet may have equivalent or superior predictive validity (e.g., conscientiousness; Connelly & Ones, 2010). As such, personality is increasingly used in personnel selection.
Although self-reports are the most common method of personality assessment, there are concerns about faking and self-presentation biases (see Hough & Oswald, 2008). In this vein, observer ratings based on observable behaviors (e.g., language) may be used to overcome undesirable response distortions in self-reported personality assessment. Indeed, personality ratings from coworkers, family, and friends have been found to have validities roughly double the magnitude of self-reports (Oh, Wang, & Mount, 2011).
Recently, researchers have sought to apply automated, language-based models as alternatives to self-reports for assessing personality (e.g., Kern et al., 2016; Park et al., 2015; Schwartz et al., 2013; Youyou, Kosinski, & Stillwell, 2015). These approaches have been imported into off-the-shelf applications for personnel assessment and selection. For example, a variety of language-based assessments have been integrated in platforms assessing applicant personality in video interviews (e.g., HireVue; Quantified Communications), and Linguistic Inquiry and Word Count (LIWC; Pennebaker, Booth, Boyd, & Francis, 2015) is part of the Receptiviti language-based executive and leadership assessment system (Receptiviti, n.d.).
Despite the potential promise of language-based models for assessing personality in the selection context, most models to date have been validated in the context of social media language use (e.g., IBM, 2018). It is unknown whether language-based models trained on social media text are effective at predicting personality when applied to a selection setting. Models trained on social media text may translate poorly to workplace applications.
In the present article, we seek to address this gap to understand whether off-the-shelf language-based models for assessing personality can be effectively applied to the selection context. Specifically, we investigate the convergent-discriminant validity evidence of an off-the-shelf language-based personality assessment tool validated for assessing personality on social media in the context of video interviews. It is unknown whether language-based personality models trained on social media text can be validly applied in other contexts. A similar study did the reverse: It applied a language-based model of personality created on workplace emails to social media, finding that it underperformed compared to existing solutions (Golbeck, 2017). The present study follows past work validating off-the-shelf technological solutions for personnel assessment, such as the convergence between human ratings and automated algorithms for achievement record scoring (Campion, Campion, Campion, & Reider, 2016), the potential to identify deceptive impression management in employment interviews (Auer, 2018), and automatically assessing applicant interview performance (Naim, Tanveer, Gildea, & Hoque, 2018).

The Present Study
Our goal is to provide initial evidence regarding the convergence of an off-the-shelf, language-based personality assessment with self and observer ratings of personality. Specifically, we selected IBM Watson Personality Insights (PI) because of its prominence and usage in higher education, finance, and other software solutions (HG Insights, 2018). To create IBM Watson PI, IBM recruited a set of active Twitter users to complete self-reports of personality (IBM, 2018). Their tweets were then analyzed via global vectors for word representation (GloVe; Pennington, Socher, & Manning, 2014), which estimates the similarity of words using the frequency of their co-occurrence and their proximity in texts within the training corpus. Each word is represented by a vector: Vectors close in value suggest words are similar in meaning, whereas vectors far apart in value suggest words are dissimilar in meaning. Those vectors are then fed into a machine learning algorithm to predict personality traits from language use. Although IBM Watson PI provides information in its documentation about its convergence with self-reports of personality when assessing social media language use, it is unknown whether IBM Watson PI can reliably and validly assess personality in selection contexts such as video interviews. IBM Watson PI's personality assessment has been applied to identify cyber bullies on Twitter (Balakrishnan, Khan, Fernandez, & Arabnia, 2019), but validity has not been assessed outside of social media. To the extent that it shows promise for mass, unproctored video interviews for personality assessment, it could lead to substantial cost savings for organizations. To the extent that it cannot, it would indicate that language-based models of personality trained on social media may be less useful for selection contexts.
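To make the notion of word vectors being "close" concrete, the following minimal sketch computes cosine similarity, one common way to compare word vectors. It is illustrative only: the three-dimensional vectors are invented for this example (they are not GloVe output), and the choice of cosine similarity is our assumption rather than a documented detail of how IBM Watson PI compares vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings"; real word vectors have many more dimensions.
toy_vectors = {
    "happy": [0.9, 0.1, 0.3],
    "joyful": [0.8, 0.2, 0.4],
    "invoice": [-0.2, 0.9, -0.5],
}

# Words with similar meaning should point in similar directions.
sim_close = cosine_similarity(toy_vectors["happy"], toy_vectors["joyful"])
sim_far = cosine_similarity(toy_vectors["happy"], toy_vectors["invoice"])
```

In this toy example, sim_close is near 1 and sim_far is negative, mirroring the idea that nearby vectors indicate similar meaning.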
Using automated approaches for assessment in selection represents a high level of structure. All candidates are asked the same questions, and all are judged using the same criteria. To assess personality, open-ended questions should be used because they provide fewer behavioral constraints (i.e., lower situational strength), increasing the variety of acceptable responses, and thereby obtaining the freest expression of behavior and personality-relevant information (Blackman, 2002). Participants recorded their responses to an open-ended prompt, and then we transcribed those responses and used IBM Watson PI to assess their personality. We assessed the extent to which personality scores converged with self-reports and reports from observers who rated participants' personality based on behaviors and speech observed in the videos. Doing so provided an initial investigation into the convergent and discriminant validity of off-the-shelf language-based personality assessment in interview settings.

Participants and Procedure
We recruited 180 participants via Amazon Mechanical Turk (MTurk) who participated in exchange for $2. MTurk workers are generally demographically diverse, better representing the general population than college students and providing data at least as reliable as student samples (Woo, Keith, & Thornton, 2015). Participants recorded their responses to the following open-ended prompt: "Talk about a topic or a story that you know and is personal to you. Do not hesitate to talk about your feelings and do not limit your answer to simple descriptions. Options include: 1. a personal experience (traveling, childhood memory, recent event). 2. your dreams (career, love, friends, hobbies). 3. your general views on a matter you feel strongly about."
Of the 180 videos submitted, two were removed from further analyses because the participants read content verbatim from a website, and one other was removed because it had no audio, leaving a final sample of 177 videos (61% female; 80% White). Participants were instructed to make their videos 2-4 minutes in length (M = 3 min 10 s; SD = 40 s; range 0 min 30 s-5 min 38 s). Participants also responded to a self-report personality questionnaire.
Three doctoral students first rated participants' personality for a set of sample videos, discussed behavioral cues and sources of agreement and disagreement, then independently watched the remaining videos and rated participants' personality. Then the videos were transcribed using Google Cloud Speech-to-Text, and the transcription was entered into IBM Watson PI to obtain its personality assessment.

Measures
Self-reports of personality. A 60-item personality survey comprising the 50-item International Personality Item Pool (IPIP; Goldberg, 1999) scale of markers for the FFM (Goldberg, 1992) and the Ten-Item Personality Inventory (TIPI; Gosling, Rentfrow, & Swann, 2003) was used to assess the FFM. The IPIP consists of 10 items for each FFM trait, whereas the TIPI consists of two items for each FFM trait. The combined scale therefore consisted of 12 items for each FFM trait. All five scales showed acceptable internal consistency (see Cronbach's alpha values in Table 1).
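For readers less familiar with the internal consistency index reported here, the sketch below computes Cronbach's alpha from a respondents-by-items matrix using the standard formula, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores). The responses are hypothetical and are not data from this study; the sketch also assumes any reverse-keyed items have already been recoded.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondent rows, each a list of item scores."""
    n_items = len(responses[0])
    if len(responses) < 2 or n_items < 2:
        raise ValueError("need at least two respondents and two items")
    if any(len(row) != n_items for row in responses):
        raise ValueError("every respondent must answer every item")
    # Transpose rows (respondents) into columns (items).
    item_columns = list(zip(*responses))
    sum_item_variances = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum_item_variances / total_variance)


# Hypothetical responses: 5 respondents x 4 items on a 1-5 scale.
toy_responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
]
alpha = cronbach_alpha(toy_responses)
```

Applied to one trait at a time, the same function would take the 12 combined IPIP and TIPI items for that trait as columns.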
Observer ratings on video data. Three doctoral students in industrial-organizational psychology watched the videos and assessed each participant's personality. The TIPI was adapted from a self-report to an other-report format by asking raters the extent to which the participant appeared to fulfill each of the TIPI's items (e.g., "Extraverted, enthusiastic"). The TIPI was chosen over other measures to reduce the time required to provide ratings. Ratings were averaged, and interrater reliabilities were adequate for all FFM traits (see Table 1). More visible traits such as extraversion and conscientiousness had higher intraclass correlations, whereas less visible traits such as openness and neuroticism (Allik, Realo, Mõttus, & Kuppens, 2010) had lower intraclass correlations.
Off-the-shelf language-based assessment through IBM Watson PI. IBM Watson PI originally used a closed vocabulary approach, including elements of LIWC (Pennebaker, Mehl, & Niederhoffer, 2003), to estimate personality. However, recent advances in text mining have led to the adoption of open vocabulary approaches that inductively associate words and/or phrases with outcomes of interest (in this case, personality traits). IBM Watson PI was developed and validated by using Twitter content to predict self-reports of personality (IBM, 2018). IBM does not provide specific trait correlations from the initial development work, but it does provide two summary scores describing the overall accuracy of the system's personality predictions from social media language. The mean absolute error (MAE) indexes the difference between self-reported and predicted personality scores, with 0 indicating no error and 1 indicating total error. The average correlation is the average correlation between self-reported and predicted personality scores across all the FFM traits. For the English language version, IBM reported that the average MAE was .12 and that the average correlation was .33 (IBM, 2018). By way of comparison, Schwartz et al.'s (2013) open vocabulary approaches for predicting personality from social media usage achieved a maximum average correlation of .35 across the FFM traits, and Park et al. (2015) achieved an average correlation of .39, suggesting that IBM Watson PI has accuracy comparable to state-of-the-art text mining approaches. Additionally, IBM found that Watson PI's inferred personality traits predicted a variety of consumption preferences, suggesting it holds potential for predicting real-world behavior. Of the 177 interview videos in the current study, IBM Watson PI was able to provide personality scores for 166 of the transcripts.
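The two summary accuracy scores described above can be illustrated with a short sketch. It assumes self-reported and predicted scores are both expressed on a 0-1 scale (so that MAE falls between 0 and 1) and that each index is computed per trait and then averaged across traits; the trait scores shown are invented for illustration and are not IBM's validation data.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired scores."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least two")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


def summarize_accuracy(self_reported, predicted):
    """Return (average MAE, average correlation) across traits."""
    traits = list(self_reported)
    maes = [mean_absolute_error(self_reported[t], predicted[t]) for t in traits]
    rs = [pearson_r(self_reported[t], predicted[t]) for t in traits]
    return sum(maes) / len(maes), sum(rs) / len(rs)


# Hypothetical 0-1 scaled scores for four people on two traits.
self_reported = {
    "openness": [0.62, 0.40, 0.81, 0.55],
    "extraversion": [0.30, 0.72, 0.48, 0.66],
}
predicted = {
    "openness": [0.58, 0.52, 0.70, 0.49],
    "extraversion": [0.45, 0.60, 0.50, 0.71],
}
average_mae, average_r = summarize_accuracy(self_reported, predicted)
```

Note that the two indices answer different questions: MAE reflects how far predictions sit from self-reports in absolute terms, whereas the correlation reflects whether people are rank-ordered similarly.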
RESULTS

Table 1 displays the means, standard deviations, Cronbach's alpha coefficients (for the self-reported personality), interrater reliabilities (for the observer-rated personality), and correlations for the three sources of personality ratings. Intercorrelations for IBM Watson PI scores ranged from -.29 (between agreeableness and openness) to .70 (between conscientiousness and neuroticism). Some of these correlations did not conform to the expected patterns of association from the personality literature. In particular, neuroticism was highly positively correlated with both extraversion and conscientiousness (rs = .54 and .70, respectively). This unusual pattern was reflected in IBM Watson PI neuroticism scores having negative correlations with self- and observer ratings of neuroticism, whereas the remaining traits had positive correlations with self- and observer ratings.

[Table 1: Correlation matrix of IBM Watson PI, self-, and observer ratings of FFM of personality]
For the self-report scale, alpha coefficients ranged from .83 (Openness) to .92 (Extraversion), with a mean of .88. These alphas meet or exceed those reported for the IPIP by Goldberg (1999) and Gow, Whiteman, Pattie, and Deary (2005). All factors were correlated with one another in a theoretically expected pattern, consistent with the literature (e.g., Ones, 1993). For example, neuroticism was negatively correlated with the other four FFM scales (rs ranging from -.51 to -.12), whereas the rest of the FFM scales showed modest to moderate correlations with one another (rs ranging from .09 to .47).
The observers had interrater reliabilities ranging from .66 (Neuroticism) to .80 (Conscientiousness), with a mean of .74. Similar to self-reported personality scores, all factors were correlated with one another in a theoretically meaningful way consistent with what has been found in the literature (e.g., Ones, 1993). For example, neuroticism was negatively correlated with the other four FFM scales (rs ranging from -.39 to -.22), whereas the rest of the FFM scales showed modest to moderate correlations with one another (rs ranging from .17 to .35).
We present the analyses of the average heterotrait-monomethod (HTMM) correlations for each method, the heterotrait-heteromethod (HTHM) correlations for each pair of methods, and the monotrait-heteromethod (MTHM) correlations for each pair of methods in Table 2. For IBM Watson PI, monotrait correlations with self-reports range from -.20 (Neuroticism) to .18 (Openness), whereas monotrait correlations with observer reports range from -.41 (Neuroticism) to .20 (Agreeableness).
The HTHM correlations were lowest, as expected. However, the HTMM correlations were higher than the MTHM correlations, indicating that methods, not traits, represent the major source of variance in the scores. In recent decades, researchers have utilized confirmatory factor analysis to objectively assess MTMM matrices (Kenny & Kashy, 1992). However, the number of estimated parameters in our model relative to our sample size led to model nonconvergence. Therefore, we analyzed the convergence/discrimination of these measurement methods using generalizability theory and ANOVA methods for partitioning the variance (Schmitt & Stults, 1986; Woehr, Putka, & Bowler, 2012). The bottom row of Table 2 presents these statistics. Specifically, these indices reveal that 15% of observed variance is attributable to shared variance specific to either trait or to person main effects (C1: average MTHM correlations). Only 7% of the trait-method units' observed variance is trait-specific variance (D1: average MTHM correlations minus average HTHM correlations). Contrasting D1 to C1 suggests over half of the convergence can be attributed to person main effects. Trait variance is 13 percentage points lower than the amount of variance attributable to a given method (D2: average MTHM correlations minus average HTMM correlations), and method accounts for 20% of the total variance (MV: average HTMM correlations minus average HTHM correlations). Overall, little trait variance is captured. Both analytical methods converge to suggest that convergent and discriminant evidence for construct validity is poor.
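The arithmetic behind these summary indices can be sketched as follows. The sketch takes D1 to be the average MTHM correlation minus the average HTHM correlation, which is what the contrast between D1 and C1 in the text implies. The average HTMM (.28) and HTHM (.08) values passed in are back-calculated from the percentages reported above and are therefore assumptions for illustration; they are not figures read from Table 2.

```python
def mtmm_summary_indices(mean_mthm, mean_htmm, mean_hthm):
    """Summary indices from average MTMM correlations, rounded to two decimals.

    C1: convergence (trait plus person main effects)
    D1: trait-specific variance
    D2: trait variance relative to method variance
    MV: method variance
    """
    indices = {
        "C1": mean_mthm,
        "D1": mean_mthm - mean_hthm,
        "D2": mean_mthm - mean_htmm,
        "MV": mean_htmm - mean_hthm,
    }
    return {name: round(value, 2) for name, value in indices.items()}


# Assumed averages: .15 is reported in the text; .28 and .08 are inferred.
indices = mtmm_summary_indices(mean_mthm=0.15, mean_htmm=0.28, mean_hthm=0.08)

# Share of convergence attributable to person main effects: (C1 - D1) / C1.
person_share = (indices["C1"] - indices["D1"]) / indices["C1"]
```

Under these assumed inputs, the person-effect share of convergence is a little over one half, consistent with the interpretation given in the text.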
To evaluate whether automated solutions may be able to replace a single rater among multiple raters to save organizations money (e.g., Campion et al., 2016), we compared the convergence of IBM Watson PI and of single observer ratings with self-reports. We calculated each single observer's correlations with self-reports and averaged those correlations, then compared the averages to IBM Watson PI's convergence with self-reports. Compared to the average of single observer correlations, IBM Watson PI showed larger correlations with self-reports for agreeableness (r_obs = .13 vs. r_PI = .17) and openness (r_obs = .10 vs. r_PI = .18), a similar correlation for conscientiousness (r_obs = .05 vs. r_PI = .05), and a lower correlation for extraversion (r_obs = .18 vs. r_PI = .06). For neuroticism, the correlation between IBM Watson PI and self-report scores was negative, which was theoretically uninterpretable as mentioned above (r_obs = .22 vs. r_PI = -.20). This suggests that for agreeableness, openness, and conscientiousness, IBM Watson PI can function as well as a single observer in assessing self-reported personality. A critical caveat is that these correlations are low in magnitude, even where IBM Watson PI performs as well as or better than a single observer.
We also assessed how IBM Watson PI's convergence with self-reports compares to personality ratings at zero acquaintance. Table 3 displays this information, using zero-acquaintance correlations from a meta-analysis (Connolly, Kavanagh, & Viswesvaran, 2007). Although overall, IBM Watson PI does not outperform zero-acquaintance ratings, its performance was most promising for openness and agreeableness.
Last, we inspected whether demographic differences were observed in the IBM Watson PI and observer ratings of personality. Men received a higher score on IBM Watson PI neuroticism compared to women (t = 2.48, df = 128, p = .01, 95% confidence interval for difference = .02, .20). In contrast, women rated themselves higher on neuroticism than did men (t = -2.09, df = 156, p = .04, 95% confidence interval for difference = -.82, -.02). No other demographic differences were observed.
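The single-observer comparison described above can be sketched in a few lines. The sketch assumes a plain arithmetic mean of the single-observer correlations (the text does not mention any transformation before averaging), and all scores and rater labels are hypothetical rather than data from this study.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least two")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


def compare_to_single_observers(self_report, observer_ratings, automated_scores):
    """Contrast the average single-observer r with the automated tool's r."""
    single_rs = {
        rater: pearson_r(self_report, ratings)
        for rater, ratings in observer_ratings.items()
    }
    return {
        "single_observer_rs": single_rs,
        "mean_single_observer_r": mean(single_rs.values()),
        "automated_r": pearson_r(self_report, automated_scores),
    }


# Hypothetical scores for one trait across five participants.
self_report = [3.0, 4.5, 2.0, 5.0, 3.5]
observer_ratings = {
    "rater_a": [3, 4, 2, 5, 4],
    "rater_b": [4, 4, 3, 4, 3],
    "rater_c": [2, 5, 3, 5, 3],
}
automated_scores = [0.4, 0.6, 0.5, 0.7, 0.3]

result = compare_to_single_observers(self_report, observer_ratings, automated_scores)
```

Running the comparison once per FFM trait would yield the pairs of r_obs and r_PI values of the kind reported in this paragraph.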

DISCUSSION
Technological advances afford researchers and practitioners the ability to supplement or even replace human judgment with objective assessments of job applicants. Such approaches hold potential to reduce appearance, gender, and race biases that influence selection decisions. Other researchers claim to have outperformed IBM Watson PI, Schwartz et al. (2013), and Park et al. (2015) in predicting personality from social media posts, but higher accuracy has only been achieved when language features were combined with self-reports of attitudes and behavior as predictors of self-reported personality (Hall & Caton, 2017). To our knowledge, this study is the first to examine the convergent and discriminant validity evidence of language-based personality assessment with self and observer ratings of personality in the context of a video interview. IBM Watson PI showed significant monotrait correlations with self and observer ratings of agreeableness. Additionally, self-reports of openness showed significant monotrait correlations with IBM Watson PI. However, these correlations were very low in magnitude, and no evidence supported convergence with conscientiousness or extraversion.
As noted by a reviewer, the low convergence may be emblematic of a larger concern: that research using language to estimate personality may suffer from a criterion problem (Boyd & Pennebaker, 2017). Specifically, because such approaches utilize self-reports as the gold standard for accuracy, they inherit and compound the known shortcomings of self-reports (i.e., constraints on self-knowledge and response biases). As such, these approaches for estimating personality do not advance our understanding of personality; rather, they can only advance our understanding of how language use corresponds to people's perceptions of their own personality. Approaches that utilize more valid sources of personality, such as coworkers and family members, may be more useful than models built on self-reports.
The negative correlation between IBM Watson PI's neuroticism score and the self- and observer ratings was a persistent concern in our analyses. IBM Watson PI's neuroticism score was related in the opposite way we would expect it to be with the monomethod extraversion and conscientiousness scores (rs = .54 and .70, respectively), as well as with the monotrait self- and observer ratings (rs = -.20 and -.41, respectively). Due to these unexpected results, we repeatedly inspected the system documentation to ensure we were interpreting the trait score correctly. The trait is labeled emotional range in their system, and they equate it with neuroticism. The various facet scores are all scored such that a high score indicates maladjustment, either through increased stress, anger, or depression. We searched for papers using IBM Watson PI that reported correlations among the trait scores. This search was unsuccessful for two reasons: The search was temporally restricted because of the recent change in IBM Watson PI from a closed vocabulary approach built on LIWC to an open vocabulary approach, and no recent papers we found utilizing IBM Watson PI reported trait correlations. Additionally, we contacted IBM directly to ask for the trait intercorrelations from validation studies, but they were unwilling to provide them, raising concerns that the trait intercorrelations do not match the accepted structure of the FFM. They stated that trait scores of neuroticism sometimes do not align with the other outputs in expected ways, but they plan to correct this in a forthcoming update. Off-the-shelf approaches require caution because they are often a "black box," requiring users to assess the level of rigor in product documentation prior to use and following each update.

[Table 3: Comparison to Zero Acquaintance Convergence With Self-Reports]

Limitations and Future Directions
Although the setting used here is more natural to the selection context than social media (Van Iddekinge, Lanivich, Roth, & Junco, 2016), the data examined here are not from an actual selection context. Using data from an actual selection decision would be ideal because it would allow for assessing the validity of the IBM Watson PI personality assessment for hiring decisions, job performance, and turnover. Relatedly, although the current study findings did not provide strong validity evidence based on convergent and discriminant relationships with self- and observer ratings, future research should investigate other types of validity evidence such as predictive relationships with important individual and organizational outcomes.
The average word count is another concern in our videos. The accuracy of IBM Watson PI's personality scores asymptotes at 3,000 words. The average correlation across all traits is .21 at 600 words, and caps out at .26 at 3,000 words. Future studies of IBM Watson PI may see better convergence with traditional measures of personality by using a longer sample of speech. This suggests that longer interviews may be required to fully utilize this tool.
Finally, one conceptual concern is the trait activation potential of the video prompts. In assessment centers, exercises with higher trait activation potential elicit more accurate ratings of personality (Lievens, Chasteen, Day, & Christiansen, 2006; Speer, Christiansen, & Honts, 2015). Future investigations could benefit from using multiple prompts and assessing whether prompts with greater trait activation potential elicit more accurate personality estimates when using language-based models.

Conclusion
Technological advances hold potential for changing the way we assess and select job applicants. However, to date, little evidence exists to guide researchers and practitioners as to which approaches can accurately assess job applicants. This study took initial steps to fill this gap by analyzing, in an interview context, an off-the-shelf language-based personality assessment tool, IBM Watson PI, that has been validated for assessing personality with social media data. The results showed that short video resumes, which are commonly used, apparently provide little personality-relevant information, and in that context, IBM Watson PI demonstrates little convergence with self- and observer ratings. More work is needed to understand whether this tool can be accurate in personnel assessment contexts.