Critical Analytic Thinking Skills: Do They Predict Job-Related Task Performance Above and Beyond General Intelligence?

Recommended Citation
Elson, Sara Beth; Hartman, Robert; Beatty, Adam; Trippe, Matthew; Buckley, Kerry; Bornmann, John; Bochniewicz, Elaine; Lehner, Mark; Korenovska, Liliya; Lee, Jessica; Servi, Les; Dingwall, Alison; Lehner, Paul E.; Soltis, Maurita; Brown, Mark; Beltz, Brandon; and Sprenger, Amber (2018) "Critical Analytic Thinking Skills: Do They Predict Job-Related Task Performance Above and Beyond General Intelligence?," Personnel Assessment and Decisions: Vol. 4: Iss. 1, Article 2. DOI: 10.25035/pad.2018.002. Available at: https://scholarworks.bgsu.edu/pad/vol4/iss1/2


PERSONNEL ASSESSMENT AND DECISIONS
CRITICAL THINKING SKILLS AND TASK PERFORMANCE
...measures, it was frequently the case that a given test instrument would feature one or more subscales that had no direct parallel in the other test instruments.
In addition to this uncertainty surrounding the elements of critical thinking, there is the question of whether critical thinking skills can be distinguished from general mental ability (GMA, i.e., intelligence or general cognitive ability; Hunter & Hunter, 1984; Schmidt & Hunter, 1998) or from general intelligence (i.e., g; Jensen, 1998). On the one hand, considerable research supports the "positive manifold" hypothesis that diverse measures of knowledge and reasoning skill tend to be significantly, positively intercorrelated (Hunt, 2011). As noted by Lake and Highhouse (2014), the Watson-Glaser Critical Thinking Appraisal (Watson & Glaser, 2009), which has a long history of use in organizational hiring and promotions since its development in 1925, diverges in format from conventional intelligence tests but can be expected to relate substantially to measures of intelligence, such as the Raven's Advanced Progressive Matrices (r = .53, Raven & Court, 1998) and the WAIS intelligence test (r = .52, Watson & Glaser, 2009). However, other scholars have argued that general intelligence alone cannot explain critical thinking. For example, Stanovich and West (2008) examined critical thinking skills in eight different experiments. They discovered that participants with high cognitive abilities (as measured by self-reported verbal, mathematical, and total SAT scores) displayed the same level of biases as participants with low cognitive abilities, suggesting that general intelligence does not in and of itself enable people to engage in the critical thinking tasks that have been discussed in the literature. Stanovich, West, and Toplak (2012) have also highlighted dual process models of cognition (e.g., Frederick, 2005) as helping to elucidate the difference between g/GMA and critical thinking. Such models posit a distinction between an automatic, heuristic mode of cognitive processing (Type 1) and a slower, more analytic and computationally expensive mode of processing (Type 2).
A key distinction between these two processing modes is that, whereas Type 1 processing happens rapidly and relatively automatically, people can make a conscious decision to engage in effortful Type 2 processing, and the willingness to do so can be viewed as a cognitive style. By this conceptualization, g could be considered a form of Type 1 processing, whereas critical thinking could be considered a form of Type 2 processing. On this basis, Stanovich et al. have contended that measures of g (such as IQ tests) do not capture the propensity to engage in effortful, critical thinking.
The question of whether critical thinking is a distinct construct from general intelligence and, in particular, whether it can explain technical performance above and beyond the ability of general intelligence constituted a key impetus for the current study.

Validity of Critical Thinking Measures
Although most studies of critical thinking test validity have focused on correlations with other critical thinking measures or with g (Liu et al., 2014), a set of notable studies has examined the relationship of critical thinking to behaviors, job performance, or life events. In their review of literature on the validity of critical thinking measures, Liu et al. (2014) concluded that many existing studies are missing a key component, namely incremental predictive validity of critical thinking above and beyond general cognitive measures. For example, Ejiogu, Yang, Trent, and Rose (2006) found that the Watson-Glaser Critical Thinking Assessment (WGCTA) correlated moderately with job performance (corrected r = .32 to .52). In addition, Watson and Glaser (2009) found that scores on the WGCTA predicted supervisor ratings of judgment and decision-making job performance (r = .23) in a sample of 142 managers across multiple industries. As noted by Lake and Highhouse (2014), judgment and decision-making performance are considered as part of an "analysis" construct, along with "decisiveness" and "adaptivity," which compose three constructs serving as of managerial decision-making competence than broad constructs like cognitive ability and personality (see Lievens & Chan, 2010). Watson and Glaser (2010) also found that the WGCTA correlated at .40 with supervisor ratings of analysis, problem-solving behaviors, and judgment and decision-making behaviors for analysts from a government agency. Butler (2012) found that scores on a different measure of critical thinking (the Halpern Critical Thinking Assessment, or HCTA) predicted real-world outcomes of critical thinking, that is, decision outcomes (as assessed by the Decision Outcomes Inventory; DOI; Bruine de Bruin, Parker, & Fischhoff, 2007). Garrett and Wulf (1978) found that Cornell Critical Thinking Test (CCTT) scores predicted academic success in graduate school, i.e., grade point average (GPA).
Finally, Stilwell, Dalessandro, and Reese (2011) found that Law School Admission Test (LSAT) scores predicted GPA. Unfortunately, none of these studies assessed whether critical thinking predicted criterion variables above and beyond the ability of general intelligence measures. This represents a significant gap in the critical thinking skills test validity literature (see Liu et al., 2014), because g is a psychometric indicator of individual job performance (Schmidt & Hunter, 1998; see also Heneman & Judge, 2012, on cognitive aptitude). For example, Hunter's (1980) meta-analysis with 32,000 employees in 515 jobs found that g and work performance correlated strongly (r = .51), with validity coefficients being highest for higher-complexity occupations (.58 vs. .23 for high- vs. low-complexity jobs). More recently, Ones, Dilchert, Viswesvaran, and Salgado (2010) reported operational validities (correlations corrected for range restriction and reliability) between .35 and .55.

Aims of the Present Research
The present research addresses the conceptual and empirical gaps within the literature. We review existing definitions and models of critical thinking skills to arrive at a consensus set of critical thinking elements or subconstructs. In addition, we summarize previously unpublished results from a test development effort, in which we developed a measure of critical analytical thinking skills for government analysts. Finally, we present the results of a criterion validity study that examined whether critical thinking skills predict technical performance generally and incrementally, above and beyond a measure of g as well as above and beyond job experience, educational attainment, and a series of other characteristics.
It should be noted that the current study emerged as part of a broader effort to develop the Critical Analytic Thinking Skills (CATS) test (MITRE Corporation, 2014a; MITRE Corporation, 2015), a measure of critical thinking skills intended for use among government analysts. In particular, the test was designed to have high face validity for government analysts, which was accomplished by couching the test items in terms of contextualized scenarios. Despite this contextualized framing, items were intended to tap classes of critical thinking skill of broad relevance to any occupation for which such skills are vital. As such, the CATS test can be regarded as an occupation-specific application of a general-purpose conceptual and test item development framework developed over the course of the project. Further, no specialized knowledge of content is required to comprehend the questions and reason to the correct answers.

Elements of Critical Thinking
Given a lack of consensus among researchers on how text in which we conducted the current study, we pursued the construct of critical thinking for this context. To idendefinitions, we held a CATS Workshop to elicit perspeccritical thinking, and analysis (n = 35). In addition, we assessed existing measures of critical thinking and related literature to understand the full scope of the critical thinking construct and various permutations thereof (e.g., Bondy, Koenigseder, Ishee, & Williams, 2001; Ennis & Weir, 1985; Facione, 1990; Frisby, 1992; Halpern, 2010; Klein, Benjamin, Shavelson, & Bolus, 2007; Watson & Glaser, 2010). We gathered additional input from an informal focus group (n = 4) and the CATS Technical Advisory Committee (TAC; n = 8). We also examined critical thinking skill elements amined 12 government critical thinking training course syllabi to investigate which elements were included as major topics. (Full details of these tasks are discussed in "Critical Analytical Thinking Skills Pilot Test Final Report" [MITRE Corporation, 2014b].) The end products of this effort were reflective use of cognitive skills to make good judgment" along with an associated set of critical thinking "elements" distinct sub-category of critical thinking skills grouped by similarity.
We initially considered several elements of critical thinking for inclusion in the CATS test. In selecting these elements, we prioritized the need to maximize content validity, that is, the degree to which the test represents all aspects of the critical thinking construct. At the same time, we sought to manage the overall test length. Given these considerations, we selected the four elements with the strongest support from the information sources surveyed: Identifying Assumptions, Causal Reasoning, Logical Reasoning, and Hypothesis Evaluation (see Table 1). Although the primary focus of this report is the assessment of the CATS test's predictive/criterion validity with respect to job performance, a review of prior (previously unpublished) CATS test development and validation work is necessary to help establish the measure's general psychometric properties, including test reliability and convergent validity with other relevant cognitive measures. Therefore, before presenting the core hypotheses for the present effort, we provide a short overview of prior psychometric evidence concerning CATS.
Item Analysis and Scale Construction. A total of 246 multiple-choice items were initially generated by trained item writers to measure the four elements of critical thinking, and 209 survived an expert review process. A pilot study was then conducted to collect item statistics using a sample of Amazon's Mechanical Turk (MT) participants (n = 511). The pilot test sample was restricted to US citizens

Identifying assumptions
Assumptions are statements that are assumed to be true in the absence of proof. Identifying assumptions helps to discover information gaps and to accurately assess the validity of arguments. Assumptions can be directly stated or unstated. Detecting assumptions and directly assessing their appropriateness to the situation helps individuals accurately evaluate the merits of arguments, proposals, policies, or practices.

Causal reasoning
Causal reasoning involves evaluating the likelihood of causal relationships among events or other variables. Good causal reasoning requires understanding the concepts of and differences between causation and correlation. Causal reasoning involves identifying proper comparison groups, understanding the role of randomness for inferring causation, considering the possible presence of confounding variables, and understanding the role of sample size and representativeness for making appropriate causal inferences.

Logical reasoning
Logical reasoning involves identifying logical connections among propositions and avoiding logical fallacies for inductive and deductive inference. These can include fallacious inferences (e.g., conclusions do not follow from premises, reversal of if-then relationships, circular reasoning), fallacies of relevance (e.g., ad hominem arguments), fallacies of ambiguity in language (e.g., equivocation, straw-man fallacy), and fallacies of presumption (e.g., false premises, tautology, false dichotomy). A capacity for logical reasoning protects against belief bias, or the tendency to incorrectly evaluate data in syllogistic reasoning because of prior preferences and expectations.

Hypothesis evaluation
Evaluating hypotheses requires the consideration of alternative explanations regarding a range of actual or potential evidence to test their relative strength. Hypotheses can be tested against the null hypothesis that nothing special is happening or against one or more competing alternative hypotheses to determine which hypothesis is most consistent with or explanatory of the relevant data.
items was selected based on traditional classical test theory tics, and interitem correlations. Items deemed eligible for discriminating, and had good statistics for all distractors, as gauged by the proportion of test takers answering each distractor item correctly (pvals) and by option-total, pointbiserial correlations (OTCs) used to identify items for which high ability test takers were drawn to one or more distractors.
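To make the item statistics mentioned above concrete, the following is a minimal sketch of how option-level p-values and option-total point-biserial correlations (OTCs) are commonly computed in classical test theory. The function names, the toy responses, and the toy total scores are all hypothetical; the sketch is not the procedure used in the CATS pilot study, whose details are not given here.

```python
from math import sqrt

def p_value(responses, option):
    """Proportion of test takers who chose `option` on a single item."""
    return sum(1 for r in responses if r == option) / len(responses)

def point_biserial(indicator, totals):
    """Pearson correlation between a 0/1 indicator and total test scores.

    Undefined (division by zero) if every test taker, or no test taker,
    chose the option, or if all total scores are identical.
    """
    n = len(indicator)
    mean_x = sum(indicator) / n
    mean_y = sum(totals) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(indicator, totals))
    var_x = sum((x - mean_x) ** 2 for x in indicator)
    var_y = sum((y - mean_y) ** 2 for y in totals)
    return cov / sqrt(var_x * var_y)

def option_total_correlations(responses, totals, options):
    """OTC for each response option of one item."""
    return {
        opt: point_biserial([1 if r == opt else 0 for r in responses], totals)
        for opt in options
    }

# Hypothetical data: six test takers, one item keyed "A".
responses = ["A", "B", "A", "C", "A", "B"]
totals = [28, 12, 25, 10, 30, 15]  # hypothetical total test scores

otc = option_total_correlations(responses, totals, ["A", "B", "C"])
print(p_value(responses, "A"), otc)
```

In this toy example the keyed option has a positive OTC and both distractors have negative OTCs, which is the desired pattern; a distractor with a positive OTC would flag an item on which high-ability test takers are drawn to a wrong answer.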
To meet the needs of potential test users, three forms of CATS were developed to accommodate practical constraints of testing time: a long form containing 156 items that measured all elements, a two-element test (CATS 2-Short) that consisted of only logical and causal reasoning items, and a four-element short form (CATS 4-Short). In determining test length and composition, key consideration was given to (a) the ability to maximize the test's reliability and content validity, (b) resistance to format effects, (c) ceiling effects, (d) guessing and compromise, (e) suitability for adaptive computer testing and item response theory (IRT) analyses, and (f) test development costs.
Mean scores, standard deviations, reliabilities, and interelement correlations were calculated for each element and test form. Reliabilities of the test forms were high, ranging from .84 to .96. Element scores were highly correlated with each other and with form scores, suggesting a high degree of homogeneity across elements. Results of a confirmatory factor analysis indicated that the CATS elements were correlated at .9 or higher, indicating that test interpretation should focus on the overall test score as opposed to using the element subscores, as the results did not support the hypothesis that the elements were unique.

Convergent Validity
After completing the scale construction study, a convergent validity study was conducted to evaluate the test's correspondence with well-established measures of critical thinking, including the Law School Admission Test Logical Reasoning Scale (LSAT LR; Roussos & Norton, 1998) and the Shipley Institute of Living Scale 2 (Shipley 2) Cognitive Ability test (Kaya, Delen, & Bulut, 2012). Based on analysis of data collected using the MT participant sample, the corrected correlations between the CATS elements and the established reasoning tests demonstrated convergent (r = .70 to .90) and discriminant (r = .30 to .40) validity.

Parallel Forms Development
As a follow-up to the pilot study discussed above, we conducted a separate MT study with almost double the number of participants (n = 943) and many newly constructed items. This study had several goals, including (a) confirming the findings of the pilot study, (b) conducting item response theory (IRT) calibration of the CATS items, and (c) developing parallel forms for testing scenarios when equivalent forms are desired.
Results from this follow-up study replicated the findings of the pilot study. The difficulty of CATS 2.0 items ranged widely; the items were reliable, appeared largely to measure one general factor, and had expected patterns of convergent validity with established cognitive ability measures. IRT calibration was successful, with a low percentage of items exhibiting local dependence.
After completing IRT calibration to obtain the final operational item pool, parallel forms were constructed. A total of three sets of parallel forms, focusing on different ability levels and testing scenarios, were developed. These forms exhibited high internal consistency and test-retest reliability.

Convergent Validity Replication
To determine the convergent validity of the parallel forms, a replication of the Year 1 convergent validity study was conducted, including the LSAT and Shipley-2 test as marker tests. Replicating the Year 1 results, the CATS total and form scores correlated strongly with the LSAT Logical Reasoning subtest (i.e., corrected correlations ranged from .81 to .91; see Table 2), demonstrating convergent validity. On the other hand, discriminant validity evidence comes from the corrected correlations between CATS scores and the Shipley Block Patterns test (i.e., .37 to .50), as would be expected given that this test measures a somewhat distinct construct from CATS. Finally, CATS elements and forms were correlated more highly with the LSAT Logical Reasoning test than with the Shipley Vocabulary or Abstraction tests (for which corrected correlations ranged from .39 to .63), thus showing patterns of convergent and discriminant validity.
Note. Split-half reliability estimates are corrected to test length using the Spearman-Brown formula. Correlations below the diagonal are correlations observed in the study. Correlations above the diagonal are corrected for unreliability, where $r_{1'2'} = r_{12} / \sqrt{r_{11} \, r_{22}}$. Corrected correlations greater than 1 are reported as 1.00.

Although the previous work established the psychometric soundness of the CATS test, this research was conducted with MT workers, and no relevant criteria were available to determine the criterion-related validity of the test. Therefore, we conducted the present study to examine the extent to which the test might have criterion-related validity, especially when administered to government analysts.
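The two corrections named in the table note can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the input values are hypothetical, and the code simply applies the standard Spearman-Brown formula and the standard correction for attenuation, with the cap at 1.00 that the note describes.

```python
from math import sqrt

def spearman_brown(r_half, length_factor=2.0):
    """Project a reliability estimate to a test `length_factor` times as long.

    With length_factor=2 this steps a split-half correlation up to the
    reliability of the full-length test.
    """
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)

def disattenuate(r_12, r_11, r_22, cap=1.0):
    """Correct an observed correlation r_12 for unreliability in both
    measures (reliabilities r_11 and r_22), reporting values above `cap`
    as `cap`."""
    return min(r_12 / sqrt(r_11 * r_22), cap)

# Hypothetical values, not taken from Table 2.
full_length_reliability = spearman_brown(0.75)      # split-half r of .75
corrected = disattenuate(0.60, 0.80, 0.90)          # ordinary case
capped = disattenuate(0.90, 0.80, 0.90)             # would exceed 1, so capped

print(round(full_length_reliability, 3), round(corrected, 3), capped)
```

Because the correction divides by the square root of the product of two reliabilities, it can push a high observed correlation past 1 when the reliabilities are modest, which is why a cap is needed.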

The Present Research: Criterion Validity and Incremental Validity
After establishing the reliability and convergent validity of the CATS test, our next step consisted of determining whether the test, and ultimately the construct of critical thinking, predicts job performance above and beyond general intelligence. As such, we conducted a criterion-related validity (CRV) study of the relationship between CATS test scores and a set of performance-related criterion measures. We examined this relationship in a sample of US government analysts. Our research entailed testing three overall hypotheses:

Hypothesis 1: Critical thinking test scores will predict performance on an analytic work sample task.

Hypothesis 2: Critical thinking skills will predict performance beyond the ability of general intelligence to do so.

Hypothesis 3: Critical thinking skills will predict performance beyond a set of individual characteristics, including general intelligence, educational attainment, gender, employment sector (i.e., whether civilian, military, or contractor), job experience related to the analytic work sample task, completion of training in structured analytic techniques, age, motivation on the CATS test, and motivation on the work sample task.

METHOD

Participants
Participants consisted of 140 government analysts from across a range of organizations. A priori power analysis indicated that 125 participants would allow detection of correlations greater than .22 (i.e., at the "small" or greater level; Cohen, 1992) with a power of .8. In addition to participants, 24 supervisory SMEs were recruited from 11 different agencies across the government for purposes of rating analytic products that the participants would provide during the study. All supervisory SMEs had supervisory-level experience and regularly evaluated analytic products of subordinates.
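For readers unfamiliar with this kind of calculation, a sample-size estimate for detecting a correlation can be sketched with the Fisher z approximation. This is an assumption-laden illustration: the text does not state which method, alpha level, or sidedness the a priori power analysis used, so the function below (a hypothetical name) will not necessarily reproduce the figure of 125.

```python
from math import atanh, ceil
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size needed to detect a population correlation r.

    Uses the Fisher z approximation: n = ((z_alpha + z_beta) / atanh(r))^2 + 3.
    The alpha level and sidedness are assumptions, not values from the study.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

n_one_sided = n_for_correlation(0.22, two_sided=False)
n_two_sided = n_for_correlation(0.22, two_sided=True)
print(n_one_sided, n_two_sided)
```

Under this approximation, a one-sided test at alpha = .05 requires a sample in the mid-to-high 120s, close to the reported figure, whereas a two-sided test requires roughly 160. The comparison shows how sensitive the required sample size is to the sidedness assumption.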

Materials
CATS test. Participants completed the multiple choice CATS test. For this study, half of participants completed Form A, and the other half completed parallel Form B.
Analytic Work Sample Task. In order to provide empirical evidence that scores on the CATS test predict government analyst job performance, an Analytic Work Sample Task (AWST) was developed to closely simulate the work government analysts perform on the job. The AWST materials were developed using a modeling approach with sigof the task, participants read a short background primer. After reading this background material, participants viewed a dossier of evidence consisting of reports describing simulated events. Then, participants were instructed to write a short report in the style of an analytic work product, which was evaluated by at least three supervisory SMEs using a standardized rubric developed for this project. The supervisory SMEs were all experienced in evaluating products. Their task scores provided a measurement of how well planations, evaluated the quality of information sources, drew logical conclusions, and reached accurate judgments with appropriate confidence when writing analytic work products. These performance measures are derived from two government publications on the topic of analytic tradecraft and standards for evaluating the quality of analytic products.[1] Further detail on the AWST can be found in Appendix A.
Cognitive ability measure. Our measure of cognitive ability consisted of self-reported Scholastic Aptitude Test (SAT) scores and self-reported ACT scores. According to Kanazawa (2006), the SAT Reasoning Test (usually known simply as the SAT or the SAT I) is a measure of genor inductively, think abstractly, use analogies, synthesize information, and apply knowledge to new domains, akin to Cattell's (1971Cattell's ( ) (2004 found that the total SAT score is an index of cognitive ability because it loads highly on psychometric g (see also Unsworth & Engle, 2007). Furthermore, Engle, Tuholski, Laughlin, and Conway (1999) characterized the verbal Coyle (2006) correlated scores on the SAT and ACT with performance on three highly g-loaded cognitive measures (college GPA, the Wonderlic Personnel Test, and a word recall task). The g, or general, factor is a common element among all tests of mental ability, the first shared factor that is extracted through factor analysis. Coyle performed a factor analysis that showed high g-loading for raw ACT and SAT scores, and the raw scores were significantly predictive of scores on measures of cognitive ability. In a review of existing research, Baade and Schoenberg (2004) looked at 15 studhigh correlation between a variety of achievement tests (including the ACT) and scores on the WAIS or WISC. Most college-bound students take either the Scholastic Aptitude Test (SAT; College Board Tests Inc., 1995) or the American College Test (ACT; American College Testing Program, 1987) as a college entrance requirement. These measures are employed as predictors of future academic success (e.g., American College Testing Program, 1987; College Board Tests Inc., 1995; Wikoff, 1979), and they correlate highly with measures of intelligence (e.g., Wechsler, 1991). One advantage of using ACT and SAT scores rather than an intelligence test is that intelligence tests administered on g.
Rather, in low-stakes settings motivation acts as a third-variable confound that inflates estimates of predictive validity of intelligence for life outcomes (Duckworth, Quinn, Lynam, Loeber, & Stouthamer-Loeber, 2011). ACT/SAT scores, which are administered in high-stakes settings wherein test results impact college selection decisions, may In addition, Lohman and Lakin (2011) have suggested that domain-independent reasoning, a hallmark characteristic of Gf, is a key ability that underlies performance on problems that require domain-specific knowledge, that is, Gc. According to Kanazawa (2006), the ACT is a measure of acquired knowledge, akin to Cattell's crystallized intelligence (Gc). For this reason, we incorporated self-reported ACT scores into a composite variable, along with self-reported SAT scores, to operationalize the construct of cognitive ability. For the present study, participants were asked to indicate their ACT score or their total SAT score (math and verbal if they took the version with two subtests used prior to March 2005, or math, critical reading/verbal, and writing if they took the version with three subtests used from March 2005 to present).
Several studies have indicated that the correlation between self-reported SAT scores and verified SAT scores is in the range of .80 to .90 (Cassady, 2001; Kuncel, Crede, & Thomas, 2005), and self-reported scores have been shown to correlate with a third variable to the same extent as verified scores. Stanovich and West (1998) found that the correlation between a vocabulary test and self-reported SAT total scores (.49) was quite similar to the .51 correlation beprevious investigation using the same vocabulary measure (West & Stanovich, 1991).
Demographic questionnaire. Participants completed a demographic questionnaire capturing the following information: gender, age, highest level of education completed, organizational affiliation, training received in structured analytic techniques, employment status (i.e., active duty military, civil service, contractor), years of service, rank/grade level at entry and current rank, and geographic regions worked.
Post-study questionnaire. Finally, participants completed questions indicating how well they felt the CATS test found the CATS test and analytic work sample task, how hard they tried on the CATS test and analytic work sample task, and suggestions for improvement.

Procedure
Administration procedure. Materials were distributed either via computer (n = 127) or paper-and-pencil format (n = 13), depending on participating organizations' preference. Test proctors guided participants through each step of the study.[2]

Analytic work sample rating procedure. The principal criterion variables comprised supervisory SME ratings of each participant's one- to two-page analytic work sample product. To maintain consistency across supervisory SMEs, all supervisory SMEs attended a training session lasting approximately 2 hours. See Appendix A for details on the training sessions. Supervisory SMEs had no access to analysts' CATS test scores, so knowledge of those scores could not bias analytic work sample ratings. Multiple supervisory SMEs rated each product on several discrete dimensions that are central to the task of analysis (i.e., key judgments, referencing, analysis of alternatives, assumptions and judgments, and logical argumentation) using an evaluation rubric (included in Appendix B, "Evaluation Rubric"). In addition to rating work products on these dimensions, supervisory SMEs provided an overall rating of each product from "Unacceptable" to "Excellent" (i.e., item 6 of the rubric in Appendix B).
To assign supervisory SMEs to work products, we used partial counterbalancing. Each supervisory SME rated 20 analytic work sample products, and each product was evaluated by two to four different supervisory SMEs (four analytic work sample products were each rated by two supervisory SMEs; 65 products were each rated by three supervisory SMEs; and 69 products were each rated by four supervisory SMEs). As such, the present study used an ill-structured measurement design (ISMD) wherein supervisory SMEs and participants were neither fully crossed nor nested (Putka, Le, McCloy, & Diaz, 2008). Although at least two supervisory SMEs judged each analytic work sample product, and most products were rated by three or four supervisory SMEs, not all supervisory SMEs scored all participants (i.e., our design was not fully crossed), and neither was there a separate group of supervisory SMEs scoring each participant (i.e., our design was not fully nested). Therefore, to calculate interrater reliability (IRR), we used the G(q,k) statistic proposed by Putka et al. (2008) as our primary measure. This statistic resolves problems with traditional estimators, such as Pearson r and the intraclass correlation (ICC), and serves equally well for crossed, nested, and ill-structured designs.
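The structure of the G(q,k) statistic can be sketched as follows, based on our reading of Putka et al. (2008): ratee (true score) variance divided by the sum of ratee variance, rater main-effect variance scaled by a design-dependent multiplier q, and residual variance divided by the harmonic mean number of raters per ratee. The function names are hypothetical, the variance components are those reported for the overall ratings in the footnote to the interrater reliability results, and the value of q is purely an assumed placeholder, because q depends on the rater-by-ratee overlap pattern, which is not reported here.

```python
def harmonic_mean_raters(counts):
    """Harmonic mean number of raters per product.

    `counts` maps raters-per-product to the number of products rated
    by that many raters.
    """
    n_products = sum(counts.values())
    return n_products / sum(n / k for k, n in counts.items())

def g_qk(var_ratee, var_rater, var_resid, q, k_hat):
    """G(q,k) as we understand it from Putka et al. (2008)."""
    return var_ratee / (var_ratee + q * var_rater + var_resid / k_hat)

# Rating design described in the text: 4 products with two raters,
# 65 with three raters, and 69 with four raters.
k_hat = harmonic_mean_raters({2: 4, 3: 65, 4: 69})

# Variance components reported for the overall product ratings.
# ASSUMED_Q is a placeholder, not a value from the study.
ASSUMED_Q = 0.25
g_overall = g_qk(var_ratee=0.52, var_rater=0.35, var_resid=0.47,
                 q=ASSUMED_Q, k_hat=k_hat)
print(round(k_hat, 2), round(g_overall, 2))
```

When q is 0, rater main effects do not contribute to error (as in a fully crossed design), and as q grows, rater leniency or severity differences increasingly reduce the estimate. This is why the statistic can accommodate crossed, nested, and ill-structured designs with a single formula.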

Participant Characteristics
A total of 140 government analysts were recruited and tested for the CRV study. Participants were predominantly male and had at least a bachelor's degree, with the largest percentage having a master's degree or equivalent. The largest percentage of participants were civil service employees. Their average age was nearly 37, and their average SAT and ACT scores were above the average of the general population. Appendix C summarizes participant characteristics.

CATS Test Scores
Out of a possible total score of 32, participants' mean score was 15.5, with a standard deviation of 5.8 and a range from 5 to 29. Scores exhibited a ceiling of 2.8 SDs above

Criterion-Related Validity Results
Scoring the Analytic Work Sample Task. Supervisory SMEs (n = 24) rated analytic work sample products using the evaluation rubric included in Appendix B: "Evaluafollowing five analytic performance dimensions, each of which contained at least two subcomponent ratings: (1) assumptions and judgments (two ratings), (2) analysis of alternatives (two ratings), (3) logical argumentation (four ratings), (4) key judgments (two ratings), and (5) referencing (two ratings). Appendix A contains a full description of how we derived composite scores. Ultimately, we summed dimension contributed equally to the overall score, we unit weighted each of the dimensions. For example, ratings for dimensions comprising two items were each multiplied by .5, and ratings for dimensions comprising four items were each multiplied by .25. After summing across all weighted to produce a single composite score for each participant. We will call this score the "product dimension rating." As noted above, supervisory SMEs also provided an overall rating of each product from "unacceptable" to "excellent" (i.e., item 6 of the rubric in Appendix B). To derive a score for each product, we took an average of supervisory SMEs' ratings. We will call this score the "overall product rating." For purposes of testing the hypotheses listed above, we will focus primarily on the criterion variables of product dimension ratings and overall product ratings.
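The unit-weighting scheme described above can be sketched as follows. The dimension names and item counts follow the text; the rating values are hypothetical, as is the rating scale, and averaging each rater's weighted sum across supervisory SMEs is our assumption about the final aggregation step, which is not fully legible in the text (Appendix A is cited as the authoritative description).

```python
# Number of subcomponent ratings per dimension, as listed in the text.
ITEMS_PER_DIMENSION = {
    "assumptions_and_judgments": 2,
    "analysis_of_alternatives": 2,
    "logical_argumentation": 4,
    "key_judgments": 2,
    "referencing": 2,
}

def rater_composite(ratings):
    """Weighted sum of one rater's item ratings for one product.

    Each item is weighted by 1 / (number of items in its dimension),
    i.e., .5 for two-item dimensions and .25 for the four-item dimension,
    so every dimension contributes equally.
    """
    total = 0.0
    for dimension, n_items in ITEMS_PER_DIMENSION.items():
        items = ratings[dimension]
        if len(items) != n_items:
            raise ValueError(f"{dimension}: expected {n_items} ratings")
        total += sum(items) / n_items
    return total

def product_dimension_rating(ratings_by_rater):
    """Average of rater composites (the averaging step is an assumption)."""
    composites = [rater_composite(r) for r in ratings_by_rater]
    return sum(composites) / len(composites)

# Hypothetical ratings from two supervisory SMEs for one product.
rater_1 = {"assumptions_and_judgments": [2, 3], "analysis_of_alternatives": [1, 2],
           "logical_argumentation": [2, 2, 3, 1], "key_judgments": [3, 3],
           "referencing": [2, 2]}
rater_2 = {"assumptions_and_judgments": [2, 2], "analysis_of_alternatives": [2, 2],
           "logical_argumentation": [3, 3, 2, 2], "key_judgments": [2, 3],
           "referencing": [1, 2]}

print(product_dimension_rating([rater_1, rater_2]))
```

Without the weights, the four-item logical argumentation dimension would count twice as much as each two-item dimension in a simple sum of item ratings.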
Assessing interrater reliability.[3] We examined interrater reliability with respect to product dimension ratings and overall product ratings. The interrater reliability (IRR) of supervisory SMEs' analytic work sample ratings was good (product dimension ratings: G(q,k) = .77; overall product ratings: G(q,k) = .70).[4, 5]

Quantifying predictive validity. As discussed above, we examined the ability of CATS scores to predict two criterion variables: product dimension ratings and overall product ratings. We took several approaches to examining predictive validity; these included running Pearson correlations (which is how predictive validity has typically been assessed) and regressions to allow for controlling the effects of general intelligence. As discussed above, our measure of cognitive ability consisted of self-reported Scholastic Aptitude Test (SAT) scores and self-reported ACT scores. (See Appendix D for details on how we created the SAT-ACT variable.) In support of Hypothesis 1, CATS test scores correlated strongly with analytic work sample performance (product dimension ratings: r = .55, p < .01; Pearson r corrected for measurement error = .64; Kendall's Tau = .40, p < .01. Overall product ratings: r = .56, p < .01; Pearson r corrected for measurement error = .68; Kendall's Tau = .41, p < .01; see Table 3).
To test Hypotheses 2 and 3, we ran a set of hierarchical regressions examining the ability of CATS test scores to predict analytic work sample performance above and beyond a set of other individual characteristics. In all models, we examined the ability of CATS scores to predict product dimension ratings and overall product ratings. In all cases, CATS test scores accounted for unique variance in ratings above and beyond all other characteristics examined. One of the most important individual characteristics examined consisted of a combined SAT-ACT ACT combined measure (r = .56, p < .001). Our first model, presented in Table 4, entailed predicting overall product ratings by first entering the combined SAT-ACT variable and then entering CATS test scores. The combined SAT-ACT variable alone (in Step 1) accounted for 10% of the variance in overall product ratings, but a model that included CATS test scores as well as the combined SAT-ACT variable (in Step 2) accounted for an additional 18% of the variance. 6 A look at the standardized beta weights also shows that CATS test scores predicted overall product ratings above and beyond the ability of SAT or ACT scores.
3 In no cases did a supervisory SME rate a work sample written by anyone reporting directly to her/him.
4 As recommended by Putka et al. (2008), we estimated the three variance components underlying the calculation of G(q,k) for both the overall ratings and for the composite scores. Regarding the calculation of G(q,k) for the overall ratings, the ratee main effect variance was .52, the rater main effect variance was .35, and the combination of Ratee x Rater interaction and residual error variance was .47. Regarding the calculation of G(q,k) for the composite scores, the ratee main effect variance was 3.09, the rater main effect variance was 1.57, and the combination of Ratee x Rater interaction and residual error variance was 1.69. As discussed by Putka et al. (2008), partitioning the variance underlying G(q,k) into these sub-components can help establish a meta-analytic database of organizational researchers and practitioners. Such a database could then be used to support the calculation of G(q,k) in primary studies that preclude its estimation on locally available data, as explained by Putka et al. (2008).
Note. ** to government, military, or contractor. CATS motivation was assessed at the end of the testing session via a question, "How hard did you try on the critical thinking test (i.e., the test with the multiple choice questions)?" AWST motivation was assessed at the end of the testing session via a question, "How hard did you try on the work sample task (i.e., the task that had simulated materials and you wrote an analytic essay)?" Focus on AWST topic refers to whether the participant focused on the AWST topic in their daily work (i.e., Middle East/Asia) vs. other topics. SAT Training refers to whether or not participants had received training in structured analytic techniques. Associations between categorical variables 9-12 are not meaningful in this context but are available on request.
Our second model, presented in Table 5, entailed predicting product dimension ratings by first entering the combined SAT-ACT variable and then entering CATS test scores. The combined SAT-ACT variable alone (in Step 1) accounted for 14% of the variance in product dimension ratings, but a model that included CATS test scores as well as the combined SAT-ACT variable (in Step 2) accounted for an additional 11% of the variance.
A look at the standardized beta weights also shows that CATS test scores predicted product dimension ratings above and beyond the ability of the combined SAT-ACT variable.
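The two-step logic of these hierarchical regressions (enter the control variable, then add the predictor of interest and inspect the change in R²) can be sketched as follows. The data are synthetic and every variable name, coefficient, and the sample size are hypothetical; the sketch does not reproduce the study's data or results.

```python
import numpy as np


def r_squared(X, y):
    """R^2 from an ordinary least squares fit of y on X (intercept added here)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


# SYNTHETIC data, for illustration only.
rng = np.random.default_rng(0)
n = 140                                       # hypothetical sample size
ability = rng.normal(size=n)                  # stand-in for the SAT-ACT variable
cats = 0.5 * ability + rng.normal(size=n)     # stand-in for CATS test scores
rating = 0.2 * ability + 0.5 * cats + rng.normal(size=n)

# Step 1: control variable only. Step 2: control variable plus CATS.
r2_step1 = r_squared(ability.reshape(-1, 1), rating)
r2_step2 = r_squared(np.column_stack([ability, cats]), rating)
delta_r2 = r2_step2 - r2_step1

print(f"Step 1 R^2 = {r2_step1:.2f}")
print(f"Step 2 R^2 = {r2_step2:.2f}")
print(f"Delta R^2  = {delta_r2:.2f}")
```

The quantity of interest is the change in R² between steps, which corresponds to the "additional" percentage of variance reported for each model.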
In the final set of regression models, we sought to control for a broader set of characteristics, in addition to the SAT-ACT variable, that might predict performance. We provided the full list of characteristics in Appendix C (Participant Characteristics). Table 6 presents the model in which we predicted overall product ratings by entering these characteristics in the first step and CATS test scores in the second step. The combination of variables entered in Step 1 accounted for 23% of the variance in overall product ratings, but a model that included these variables as well as CATS scores (in Step 2) accounted for an additional 13% of the variance.
A look at the standardized beta weights shows that CATS test scores significantly predicted overall product ratings above and beyond the combination of demographic factors discussed above. In fact, CATS scores constituted ratings within the entire model. 7 Our next model, presented in Table 7, entailed predicting product dimension ratings by first entering the same demographic characteristics as above and then entering CATS test scores. The combination of demographic characteristics (in Step 1) accounted for 28% of the variance in product dimension ratings, but a model that included CATS test scores as well as the demographic characteristics (in Step 2) accounted for an additional 7% of the variance. A look at the standardized beta weights shows that CATS test scores predicted product dimension ratings above and beyond the combination of demographic factors discussed above.
7 Note that the variables included in step 1 jointly explained 23% of predictors could be due to some multicollinearity. The change in the size suggests there could be some negative suppression in this analysis.

DISCUSSION
importance of critical thinking skills to job performance, the current study demonstrated the difference that these skills make when performing tasks that government analysts perform. As noted above, CATS test scores correlated strongly with analytic work sample performance (product dimension ratings: r = .55, p < .01; Pearson r corrected for measurement error = .64; Kendall's Tau = .40, p < .01; overall product ratings: r = .56, p < .01; Pearson r corrected for measurement error = .68; Kendall's Tau = .41, p < .01). As a point of reference, Hunter's (1980) meta-analysis with 32,000 employees in 515 medium-complexity jobs found r = .51 between general mental ability and work performance (corrected for reliability and range restriction on the predictor in incumbent samples relative to applicant populations). The value is higher for jobs with higher complexity (.58) and lower for jobs with lower complexity (down to .23).
Although the comparison between the current study and the Hunter meta-analysis is not direct, because the current study uses a work sample task whereas the Hunter meta-analysis is based on supervisor ratings of job performance, the Hunter meta-analysis provides an indication of the size of criterion values that are observed when strong predictors of job performance are assessed. Going a step further, however, the current study demonstrated the incremental predictive validity of critical thinking skills above and beyond a general intelligence measure (i.e., the combined SAT-ACT variable). In doing so, the current study addressed a gap discussed by both Kuncel (2011) and Liu et al. (2014) in the literature on the validity of critical thinking measures, in that many existing studies have not examined such incremental predictive validity.
In addition to showing that critical thinking skills predicted task performance above and beyond the ability of general intelligence, the current study entailed controlling for a variety of other individual characteristics that might have accounted for task performance. The fact that critical thinking skills accounted for performance on the work sample task above and beyond the combination of individual characteristics further attests to the importance of these skills to performance.
The findings of this study hold implications both for academic researchers investigating the predictors of job performance and for businesses. For academic studies, the findings suggest that it is worth measuring critical thinking in appropriate contexts. For businesses, the findings substantiate the interest shown in critical thinking skills by managers and government leaders (Pellegrino & Hilton, 2015 measuring and testing critical thinking skills when taking an evidence-based decision-making approach toward business management (Buluswar & Reeves, 2014). Although the tests developed in the current study were not designed as screening tools, the results of the study suggest the potential benefits of measuring critical thinking skills in the hiring process as well as before and after analytical training, to gauge the effectiveness of that training.

Strengths, Limitations, and Future Research Directions
The current study has certain methodological strengths, and ensure the validity of the Critical Analytic Thinking Skills (CATS) test as well as the analytical work sample task used as a proxy for analytical job performance. However, a limitation warrants discussion. Namely, the study included only one operationalization of g, that is, self-reported SAT and ACT scores. Although multiple studies point to the high correspondence between recalled and actual SAT scores (Cassady, 2001; Kuncel et al., 2005), future research can and should include more diverse measures of general intelligence.
In addition, the criterion and predictor variables both assessed maximal performance (what participants "can do") rather than typical performance (what participants "will do" on the job). A recent meta-analysis shows that measures of typical and maximum performance are only moderately related (r = 0.42; Beus & Whitman, 2012). One open question is the degree to which typical critical analytical thinking on the job is aligned with maximal performance. Although we do not have empirical data on this, the nature of participants' work has "high stakes" implications that may motivate them to work at their maximum capacity. Nonetheless, an important question left unanswered by the current study is whether CATS would be equally predictive of a different type of criterion measure that could capture typical performance, such as supervisor ratings.
As a third limitation, readers might note the conceptual overlap between certain elements of the CATS test and performance measures of the AWST (i.e., identifying assumptions, considering alternative explanations, and drawing logical conclusions), whereas other performance measures of the AWST are not elements of the CATS test (i.e., evaluating the quality of information sources or reachwriting analytic work products). As noted above, the performance measures of the AWST are derived from published standards for evaluating the analytic integrity of written products, and because elements of critical analytic thinking are central to analytic integrity (and therefore encapsulated among these standards), some conceptual overlap exists between the AWST and the construct of critical analytic ent project consisted of developing a test that would predict that cannot be predicted by intelligence alone. Notwithstanding the partial conceptual overlap between the CATS test and the AWST, it is worth noting that the CATS is a short, multiple-choice test, whereas the AWST takes multiple hours to complete. Furthermore, the SMEs who evaluated the work products were not trained in critical thinking but rather were trained in supervising analysts and evaluating their reports. As such, they were evaluating the work products from the perspective of good work generally (as encapsulated by overall product ratings) and not simply by the standards of critical thinking.
One could argue that supervisor ratings would be a more effective criterion variable than the AWST. Ideally, and in the future, supervisor ratings will be examined, but there are drawbacks to these. Supervisor ratings are subject to various forms of unreliability or limited validity. For example, they are known to be subjective, agreement across raters is often low, rating processes are often highly unstanvarious ways (e.g., the degree to which the members of the dyad work together closely, duration of the dyad relationship, and degree of supervisor experience in making evaluations), and there are significant variations in evaluation processes across organizations and organizational units. In contrast, some psychometricians have argued that work sample tests have the highest fidelity for measuring criterion performance (Borman, Bryant, & Dorio, 2010).
Finally, we note the issue of range restriction (e.g., the mean ACT score is approximately at the 90th percentile, and the standard deviation is substantially smaller than recent normative data would indicate) such that the correlations between the cognitive ability (i.e., SAT-ACT scores) and the criterion variables as well as the correlation between the SAT-ACT scores and CATS scores may have been atestimate of the incremental validity of CATS scores. Ordinarily, we would correct the attenuated correlations for the range restriction if suitable range restriction correction values can be found. Although such values can be found for purposes of correcting SAT and ACT scores relative to the general population, it is highly likely that CATS scores are heavily restricted relative to the general population or even high school test-taking population given reasonably high correlations with other cognitive ability tests (along with arguments about developing CATS-type skills in college). Given these circumstances, it would seem unwise to correct SAT-ACT scores back to the general population but leave CATS scores as they are just because data are available to do so. Proceeding this way would be erring in the other direction and risks attenuating the CATS-criterion correlations relative to the SAT-ACT score-criterion correlations. In short, the concern about range restriction is a valid one for which data are unavailable to make proper corrections.
In conclusion, the current study addresses the notion tors of job performance in contexts not requiring perceptual it may be necessary to measure critical thinking skills as well. We hope that this research will motivate additional studies into the possibility that critical thinking skills are distinct from and play a role beyond that of general intelligence in predicting job performance.
argumentation, key judgments, and appropriate citations.
Two of the evaluation rubric items asked the supervisors to provide overall ratings: one of the overall analytic work sample product, and one of the critical thinking skills displayed in the product. Each supervisory SME rated 20 analytic work sample products, and each product was evaluated by 2 to 4 different supervisory SMEs (four analytic work sample products were each rated by two supervisory SMEs; 65 products were each rated by three supervisory SMEs, and 69 products were each rated by four supervisory SMEs). See Appendix F for details on scoring the AWST.
Assessing Interrater Reliability. 8 To assign supervisory SMEs to rate participants, we used partial counterbalancing. We examined interrater reliability with respect to two criterion variables: (1) "product dimension ratings," derived by taking an average (across supervisory SMEs) of each summed, unit-weighted set of scores that supervisory SMEs assigned each analytic work performance and (2) "overall product ratings," derived by taking an average of supervisory SMEs' overall ratings of each analytic work sample product (i.e., item 6 of the analytic work sample evaluation rubric).
Scoring the AWST. Ratings for each evaluation rubric item were converted to a -1 to +1 scale, where -1 was assigned to the worst response option, +1 was assigned to the best response option, and all other response options were distributed evenly throughout. For instance, for the or refute judgments," never was coded as -1, sometimes was coded as 0, and almost always was coded as +1. Overall ratings were converted to a 0 to +4 scale, where 0 was assigned to the worst response option, and +4 was assigned to the best response option.
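The rescaling described above, in which ordered response options are spread evenly between the worst and best scores, can be sketched as follows. The function name is hypothetical, and the assumption that the overall rating has five response options (so that 0 to +4 maps to whole numbers) is not stated in the text.

```python
def rescale_options(n_options, low=-1.0, high=1.0):
    """Map ordered response options (worst to best) to evenly spaced scores."""
    if n_options < 2:
        raise ValueError("need at least two response options")
    step = (high - low) / (n_options - 1)
    return [low + i * step for i in range(n_options)]


# Three-option item from the text: never, sometimes, almost always.
labels = ["never", "sometimes", "almost always"]
three_point = dict(zip(labels, rescale_options(len(labels))))

# Overall rating on a 0 to +4 scale. ASSUMPTION: five response options.
overall_scale = rescale_options(5, low=0.0, high=4.0)

print(three_point)
print(overall_scale)
```

With three options the middle response lands on 0, which matches the coding of "sometimes" given in the text.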
A unit weighting approach was used to calculate the product dimension ratings. Previous research has shown that unit weights perform similarly to, or better than, regression weights, particularly when using smaller samples (Bobko et al., 2007; Einhorn & Hogarth, 1975; Schmidt, 1971; Claudy, 1972). Performance on each dimension was weighted equally, and scores on each dimension were summed to calculate the product dimension rating. Because most evaluation rubric dimensions had two items (i.e., analysis of alternatives; assumptions and judgments; key judgments; referencing), but one had four items (logical argumentation), dimension scores were normalized by the number of items on the dimension so that each dimension contributed equally to the overall composite score. For instance, ratings for dimensions comprising two items were each multiplied by .5, and ratings for dimensions comprising four items were each multiplied by .25. After summing across all weighted items, composite analytic performance scores were calculated by averaging across SMEs to produce a single composite score for each participant.
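The unit-weighting procedure just described can be sketched in a few lines. The item counts per dimension come from the text; the dictionary keys, function names, and example ratings are hypothetical.

```python
from statistics import mean

# Number of rubric items per dimension, as described in the text.
DIMENSION_ITEMS = {
    "assumptions_and_judgments": 2,
    "analysis_of_alternatives": 2,
    "logical_argumentation": 4,
    "key_judgments": 2,
    "referencing": 2,
}


def rater_composite(ratings):
    """Sum item ratings, each weighted by 1 / (number of items in its dimension)."""
    total = 0.0
    for dimension, n_items in DIMENSION_ITEMS.items():
        items = ratings[dimension]
        if len(items) != n_items:
            raise ValueError(f"{dimension}: expected {n_items} ratings")
        total += sum(items) / n_items  # e.g. weight .5 for two items, .25 for four
    return total


def product_dimension_rating(ratings_by_rater):
    """Average the weighted composites across the SMEs who rated one product."""
    return mean(rater_composite(r) for r in ratings_by_rater)


# HYPOTHETICAL ratings on the -1 to +1 item scale from two raters.
best = {d: [1.0] * n for d, n in DIMENSION_ITEMS.items()}
middle = {d: [0.0] * n for d, n in DIMENSION_ITEMS.items()}
print(product_dimension_rating([best, middle]))
```

Because each dimension contributes at most 1 point in absolute value after weighting, a single rater's composite under this scheme ranges from -5 to +5.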
We attempted to maximize consistency across supervisory SMEs by holding the pre-rating training sessions discussed in Appendix E. Importantly, supervisory SMEs were blind to analysts' performance on the CATS test, so that experimenter bias could not play a role in analytic work sample ratings. In other words, supervisory SMEs could not purposefully rate an analytic work sample higher because they knew someone did well on the CATS test, as they were blind to CATS test scores.
The present study used an ill-structured measurement design (ISMD), wherein supervisory SMEs and participants were neither fully-crossed nor nested (Putka et al., 2008). Although at least two supervisory SMEs judged each analytic work sample product, and most products were rated by three or four supervisory SMEs, not all supervisory SMEs scored all participants (i.e., our design was not fully crossed), and neither was there a separate group of supervisory SMEs scoring each participant (i.e., our design was not fully nested). Therefore, to calculate IRR, we used the G(q,k) statistic proposed by Putka et al. (2008) as our primary measure of interrater reliability. This statistic resolves problems with traditional estimators, such as Pearson r and the intraclass correlation (ICC) and serves equally well for crossed, nested, and ill-structured designs.
9 Please note that some participants put SAT and ACT scores that fell outside the ranges for these tests, so these participants were not included when reporting descriptive statistics or running analyses involving SAT and ACT scores. In the case of SAT scores, two participants put scores that fell outside the range, and two did not indicate which version of the test they took (whether before 2005 or starting in 2005). Therefore, these two participants had to be discarded from analyses due to our inability to scale their scores appropriately according to whether they took two subtests or three. Five participants who took the ACT had to be discarded from analysis because they put scores that fell out of range.
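To make the G(q,k) statistic concrete, the sketch below combines the variance components reported in the interrater reliability footnote with the rater counts reported above. It assumes the form G(q,k) = σ²_ratee / (σ²_ratee + q·σ²_rater + σ²_residual / k̂), where k̂ is the harmonic mean number of raters per product. The multiplier q depends on the rater-overlap pattern, which the text does not report, so the sketch only brackets it between 0 (fully crossed) and 1/k̂ (fully nested), which is the usual reading of Putka et al. (2008). Function names are hypothetical.

```python
def harmonic_mean_raters(counts):
    """counts maps raters-per-product to the number of products rated that often."""
    n_products = sum(counts.values())
    return n_products / sum(n / k for k, n in counts.items())


def g_qk(var_ratee, var_rater, var_resid, k_hat, q):
    """ASSUMED form of G(q,k): ratee variance over ratee plus error variance."""
    return var_ratee / (var_ratee + q * var_rater + var_resid / k_hat)


# Rater counts from the text: 4 products x 2 raters, 65 x 3, 69 x 4.
k_hat = harmonic_mean_raters({2: 4, 3: 65, 4: 69})

# Variance components from the footnote: (ratee, rater, interaction + residual).
components = {"overall": (0.52, 0.35, 0.47), "composite": (3.09, 1.57, 1.69)}

bounds = {}
for name, (v_t, v_r, v_e) in components.items():
    lower = g_qk(v_t, v_r, v_e, k_hat, q=1.0 / k_hat)  # fully nested bound
    upper = g_qk(v_t, v_r, v_e, k_hat, q=0.0)          # fully crossed bound
    bounds[name] = (lower, upper)
    print(f"{name}: G(q,k) between {lower:.2f} and {upper:.2f}")
```

The reported values of .70 (overall product ratings) and .77 (product dimension ratings) fall inside the corresponding brackets, as expected for a design that is neither fully crossed nor fully nested.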