Measuring Intelligence with the Sandia Matrices: Psychometric Review and Recommendations for Free Raven-Like Item Sets

Recommended Citation: Harris, Alexandra M.; McMillan, Jeremiah T.; Listyg, Benjamin; Matzen, Laura E.; and Carter, Nathan (2020) "Measuring Intelligence with the Sandia Matrices: Psychometric Review and Recommendations for Free Raven-Like Item Sets," Personnel Assessment and Decisions: Vol. 6: Iss. 3, Article 6. DOI: https://doi.org/10.25035/pad.2020.03.006 Available at: https://scholarworks.bgsu.edu/pad/vol6/iss3/6


1. Northwestern University 2. University of Georgia 3. Sandia National Laboratories
ABSTRACT
The Sandia Matrices are a free alternative to the Raven's Progressive Matrices (RPMs). This study offers a psychometric review of Sandia Matrices items focused on two of the most commonly investigated issues regarding the RPMs: (a) dimensionality and (b) sex differences. Model-data fit of three alternative factor structures is compared using confirmatory multidimensional item response theory (IRT) analyses, and measurement equivalence analyses are conducted to evaluate potential sex bias. Although results are somewhat inconclusive regarding factor structure, results do not show evidence of bias or mean differences by sex. Finally, although the Sandia Matrices software can generate an unlimited number of items, editing and validating items may be infeasible for many researchers. To aid implementation of the Sandia Matrices, we provide scoring materials for two brief static tests and a computer adaptive test. Implications and suggestions for future research using the Sandia Matrices are discussed.
KEYWORDS
intelligence, Raven's Progressive Matrices, Sandia Matrices
Corresponding author: Alexandra M. Harris. Email: alexandra.harris@northwestern.edu

The Raven's Progressive Matrices (RPMs; Raven et al., 1998) are widely used measures of analytical intelligence (Arthur Jr. & Woehr, 1993) in part because they are nonverbal. The RPMs¹ are matrix completion problems that require participants to solve patterns among objects. The Sandia Matrices are software-generated matrix completion problems designed to function similarly to the RPMs (Matzen et al., 2010). Although the Sandia Matrices software was created to remedy the limited number of RPMs, another advantage over the RPMs is that Matzen and colleagues have made the software, including a large bank of pre-generated items, available for free (https://github.com/LauraMatzen/Matrices). Because the RPMs are proprietary, they are often cost prohibitive for researchers. Consequently, the Sandia Matrices are likely to have some of the advantages of the RPMs without their monetary disadvantages. Matzen et al. (2010) provided an extensive review of the Sandia Matrices development and item properties relative to the RPMs in their introductory norming study. However, an in-depth psychometric review is needed prior to widespread implementation. The first aim of this study is to provide a psychometric review of select Sandia Matrices items. We begin by using item response theory (IRT) to review item parameters and screen for potentially problematic items.² Then, given the intended similarity between the Sandia Matrices and RPMs, we briefly review issues historically evaluated in the RPMs: (a) dimensionality and (b) potential sex differences, including measurement bias and score (i.e., trait estimate) differences. Additionally, although the Sandia Matrices software can generate infinite combinations of items, the time involved in generating and curating new items may hinder implementation for many researchers. Matzen et al. (2010) acknowledge that many items generated for the norming study required manual alterations to ensure appropriate distractors. As such, the second aim of this study is to identify appropriate sets of pregenerated stimuli and provide corresponding scoring information, including the IRT parameter estimates from our psychometric review. Although other free matrix-type cognitive ability measures similar to the RPMs exist (e.g., International Cognitive Ability Resource Team, 2014), IRT-based psychometric information is rarely available. By using the items recommended here and the associated parameters, researchers who lack the resources to generate and curate new items or the sample sizes necessary for IRT scoring can still benefit from the enhanced precision of IRT estimates.
¹ Although some researchers may use the RPMs as a generic term to refer to matrix-type problems generally, we use RPMs in the current paper to refer explicitly to the branded, proprietary Raven's tests. We use "matrix-type" to refer to problems that use a matrix-type format but are not a specific Raven's test.
² Relative to traditional psychometric approaches to test construction (e.g., classical test theory), IRT offers a number of advantages such as more precise reliability estimates, more readily interpretable difficulty parameters, and sample independence such that item parameters can be used to estimate latent trait scores in new samples (see Henson, 2003 and Zickar & Broadfoot, 2009 for further review of the benefits of IRT).
Thus, we culminate our psychometric review of Sandia Matrices items by recommending two 10-item sets that can be administered in a paper-and-pencil format. Additionally, due to the utility and efficiency of computerized adaptive testing for psychological assessment (van der Linden & Glas, 2010), we provide code for administering a computer adaptive test (CAT). Materials for both the 10-item sets and the CAT are provided such that researchers can administer and score the Sandia Matrices for as few as a single participant. Finally, we report mean raw scores (i.e., proportion correct) and standard deviations for the final item sets so that researchers who do not wish to use IRT parameter estimates can calculate standardized scores.

Dimensionality
One of the commonly debated properties of the RPMs is their dimensionality. Although the RPMs are intended as a unidimensional measure of analytical intelligence, some researchers have proposed that they also assess visuospatial abilities and are therefore two-dimensional (e.g., Dillon et al., 1981). This argument stems from a taxonomy that separates the rules underpinning RPM solutions into verbal-analytic and visuospatial-based strategies (Carpenter et al., 1990; DeShon et al., 1995). Despite its popularity, support for a two-dimensional structure driven by distinct cognitive processes has been weak thus far (Vigneau & Bors, 2008; Waschl et al., 2016).
A primary reason that researchers are concerned about the influence of visuospatial processing on the RPMs is that men generally demonstrate advantages over women in spatial ability tasks (Voyer et al., 1995). Some researchers have proposed that group differences between men and women on the RPMs (Lynn & Irwing, 2004) are attributable to those items that invoke visuospatial processes. Thus, despite inconclusive evidence regarding the factor structure of the RPMs, we evaluate the factor structure of the Sandia Matrices in order to better understand potential sex differences.
Relative to the RPMs, the Sandia Matrices utilize a narrower set of rules to inform solutions but can still be mapped onto the taxonomies used to distinguish RPM rules. The Sandia Matrices include object relation (OR) and logic items (see Figure 1). OR problems involve simple transformations (e.g., shape, shading, orientation) across the matrix and are subdivided by the number of transformations that participants must track (one, two, or three relations; here called OR-1, OR-2, and OR-3, respectively). In contrast, logic problems involve conjunction and disjunction rules. Table 1 summarizes approximately how these transformations correspond to four rules defined by DeShon et al. (1995).
Given the debate regarding a two-factor structure in the RPMs, it is possible that the strategies used to solve Sandia Matrices problems would similarly produce a two-factor structure. However, in the Sandia Matrices, the two types of processes (i.e., verbal-analytic vs. visuospatial) roughly correspond to the two problem types (i.e., OR and logic). According to DeShon et al.'s taxonomy, all of the Sandia Matrices logic problems involve visuospatial processing, whereas OR problems involve primarily verbal-analytic processing unless they include rotation. Consequently, evidence of a two-factor structure for the Sandia Matrices may stem from a distinction either between underlying visuospatial and verbal-analytic processes or simply between the two problem types. Thus, this study evaluates the dimensionality of the Sandia Matrices by considering a unidimensional model as well as two alternative two-dimensional models: visuospatial versus verbal-analytic processing and OR versus logic problems.

Sex Differences
As mentioned above, one of the primary reasons for investigating a two-dimensional visuospatial versus verbal-analytic structure is gender difference implications. Gender differences may manifest either as bias in item parameters such that men and women with the same trait scores show different likelihoods of answering correctly (i.e., measurement bias) or score differences even after accounting for biased items (i.e., trait estimate differences). Although some researchers have found no evidence of sex bias in the RPMs (Waschl et al., 2016), others suggest that sex differences on the RPMs persist even after accounting for measurement bias in items that invoke spatial processing (Abad et al., 2004). Thus, we evaluate gender differences by first conducting measurement equivalence (ME) analyses (Drasgow, 1984) to determine whether item parameters are different between groups (i.e., show bias). After accounting for potential bias, we compare trait estimates across genders.

Participants
Sample 1 participants were workers on Amazon Mechanical Turk (N = 1,276; mean age = 34.56, SD = 11.47; 66.9% female; 82.2% White). Sample 2 participants were undergraduates at a large university in the southeastern United States (N = 338; mean age = 19.18, SD = 1.47; 65.4% female; 77.8% White). Participants completed an online questionnaire including Sandia Matrices and demographic items. These final samples include only participants who passed a variety of attention check items and took longer than 5 minutes to complete the survey.

Sandia Matrices
In their norming study, Matzen et al. (2010) found that Sandia Matrices item types advance in difficulty from OR-1 to logic problems. To target a range of difficulties, in Sample 1 we administered 5 items of each Sandia Matrices item subtype (OR-1, OR-2, OR-3, and logic) for a total of 20 items. Within each item type, we further selected items according to the proportion correct (i.e., 0%, 25%, 50%, 75%, 100%) as reported by Matzen et al. (2010). In Sample 2, we selected additional OR-3 and logic items to increase the number of items expected to show moderate to high difficulty, for a total of 25 items. Participants completed all items in both samples. All Sandia Matrices items include eight response options, and all items were selected from those included in Matzen et al.'s (2010) norming study. Items were coded according to type (OR vs. logic) and whether the rules used to solve them required visuospatial, verbal-analytic, or both processes. Responses were coded as correct or incorrect.

Data Analysis
Confirmatory multidimensional IRT. Confirmatory multidimensional item response theory (CMIRT) analyses were conducted using the R package "mirt" (Chalmers, 2012) to compare three possible factor structures: unidimensional; visuospatial versus verbal-analytic processing (two-factor); OR versus logic problems (two-factor). OR items that utilized strategies thought to invoke both verbal-analytic and visuospatial processes were set to load onto both factors.
Sex differences. ME analyses were also conducted using the R package "mirt." To conduct ME analyses (Drasgow, 1984; Stark et al., 2006), we compared the fit of three models for the alternative two-factor structures in each sample: a fully freed model in which all item parameters, means, and factor correlations were allowed to vary between genders; a partially constrained model in which item parameters were constrained but means and factor correlations were allowed to vary across genders; and finally a fully constrained model in which all item parameters, means, and factor correlations were set to be equal across genders (i.e., a one group model). Improved fit of the partially constrained model relative to the fully freed model would suggest the Sandia Matrices do not show measurement bias, and improved fit of the fully constrained relative to the partially constrained model would further suggest the Sandia Matrices do not show structural or trait estimate differences across genders.
Recommended item sets. Before determining which items to include in our static test sets and CAT item bank, we conducted multiple groups analysis to test for meaningful group differences in means or item parameters between the two samples. We then considered item parameters as estimated by the final models to select items for recommendation. In the two 10-item sets, we selected items to represent a range of difficulty (b) parameters and approximately balance item types between the two sets.

RESULTS
Because, to our knowledge, no prior studies have conducted a model-based psychometric evaluation of Sandia Matrices items, we first reviewed model-data fit for unidimensional models and reviewed all items for problematic properties. Both the 2-parameter logistic model (2PLM) and 3-parameter logistic model (3PLM) are appropriate for dichotomously scored multiple-choice data. The 2PLM includes two item parameters: the item discrimination, a, which is conceptually similar to factor loadings; and the item difficulty, b, which is defined as the trait level at which persons have a .50 probability of getting the item correct (Embretson & Reise, 2000). The 3PLM includes these parameters as well as a third that accounts for "guessing" (Waller, 1989), c, which is the probability of a correct answer for a person with infinitely low ability. A number of items demonstrated substantial guessing parameters, which suggests that guessing is a concern. Thus, to account for items with large guessing parameters, we chose to proceed with the 3PLM for the purposes of psychometric review.
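The item parameters described above can be made concrete with a short sketch. The Python code below is only an illustration, not the authors' code (the analyses in this study used the R package "mirt"). It assumes the common logistic form P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))) with no additional scaling constant; the function name `p_correct` and the example values of a and b are hypothetical.

```python
import math

def p_correct(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PLM (2PLM when c = 0)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item parameters chosen for illustration only
a, b = 1.2, 0.5

p_2pl = p_correct(b, a, b)               # 2PLM evaluated at theta = b
p_3pl = p_correct(b, a, b, c=0.125)      # 3PLM at theta = b, c set to 1/8
p_low = p_correct(-10.0, a, b, c=0.125)  # very low trait level

print(round(p_2pl, 4), round(p_3pl, 4), round(p_low, 4))
```

Under this form, the probability at theta = b is .50 only when c = 0; with a nonzero guessing parameter it is (1 + c) / 2, and the probability approaches c as the trait level decreases.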
Although model-data fit was acceptable overall, some items exhibited extreme item parameters that warranted further review. In Sample 1, items B5_1, B5C1, and D2E4 exhibited discrimination parameters of 7.61, 6.99, and 28.84, respectively. Such high discrimination parameters suggest that responses to these items may be similarly influenced by factors other than cognitive ability (i.e., demonstrate local dependence). Closer evaluation of these items revealed that all three utilized the same shading progression strategy (Figure 2). For each item, the darkest shape was the correct answer, yet a large proportion of respondents chose the lightest shape. This pattern suggests that many respondents thought the problem followed a symmetry rule or repeated. The elimination of these three items resulted in 17 final items for Sample 1.
In Sample 2, item A3D4E1 showed a relatively high guessing parameter of 0.261. Because the Sandia Matrices include eight response options, we would expect guessing parameters to be approximately at or below .125. Thus, a high value suggests that participants with very low cognitive ability could guess the correct answer to item A3D4E1 at a rate greater than expected by chance. Closer review revealed that this item had only two competitive distractors, which were identical except for the size of the stimuli. Because some participants may have had difficulty discerning the differences in stimuli sizes for reasons other than intelligence level (e.g., size of electronic screen), we chose to eliminate this item from further analyses. Although other items in Sample 2 also showed somewhat high parameter estimates, we discerned no obvious content-related reasons. Eliminating item A3D4E1 resulted in 24 final items for Sample 2. Table 2 displays descriptive statistics for each sample, including coefficient alpha after removing problematic items. The unidimensional 3PLM showed acceptable model-data fit (see Table 3) and χ²/df ratio < 3 for all remaining items in both samples.

Dimensionality
Table 3 presents fit statistics for the three factor structures evaluated using CMIRT analyses. In both samples, Akaike's information criterion (AIC) and Bayesian information criterion (BIC) show larger values for the unidimensional model than either the visuospatial versus verbal-analytic model or the OR versus logic model. However, for both samples, AIC was lower for the visuospatial versus verbal-analytic model, and BIC was lower for the OR versus logic model. Moreover, RMSEA confidence intervals of all three models were nearly identical in Sample 1 and overlapped between the alternative two-factor models in Sample 2. Thus, although there is some evidence that a two-factor structure fits better than a one-factor structure, it is not clear which two-factor structure fits best. It is possible that the improved fit of a two-factor structure relative to a one-factor structure is attributable to the distinction between OR and logic item types as opposed to the underlying processing strategies.
Note. * Correct answer. x Incorrect, commonly selected answer.

Sex Differences
As noted above, one of the primary reasons a two-factor structure is a concern for the Sandia Matrices is that a verbal-analytic versus visuospatial distinction might suggest sex differences. To evaluate whether these factor structures might impact sex differences (i.e., whether any of the factors exhibited evidence of measurement bias or trait estimate differences), we conducted ME analyses. In all cases, the fully constrained (i.e., one group) model fit better than the models in which parameters were free to vary across genders (see Table 4). Because results did not point to a clear two-factor structure that could not be explained by simple differences in item types, nor was there evidence that the factors had meaningful consequences for measurement bias or score differences by gender, we chose to proceed using a unidimensional model for the remainder of our analyses.³ Finally, to determine whether there were sex differences in Sandia Matrices scores derived using the unidimensional 3PLM or raw scores, t-tests were performed using latent trait scores and proportion correct in both samples. Results are shown in Table 5. No significant sex differences were found in either sample, regardless of scoring approach.

Recommended Item Sets
Before determining our recommended item sets, we conducted multiple groups analysis to determine whether there were any meaningful differences in group means or item parameters between our two samples. We first omitted one item that persisted in showing an extreme discrimination parameter in the final unidimensional model for Sample 2 (Y_11, a = 5.24). Removal of this item resulted in a final item set of 26 items. Next, we compared the fit of a 2PLM and a 3PLM for our combined samples. AIC and BIC support fit of the 2PLM (AIC = 24412.62; BIC = 24692.72) relative to the 3PLM (AIC = 24423.52; BIC = 24843.66). Additionally, when estimated with the 3PLM, over one-fourth of items exhibited discrimination parameters over 4.0, which indicates overfitting. Given evidence of overfitting with the 3PLM and inconsistent support for either the 2PLM or 3PLM, we proceeded with the 2PLM for all remaining analyses.
To conduct multiple groups analysis, we compared a model in which group means and item parameters were allowed to freely vary between samples (i.e., fully freed baseline model) with the fully constrained 2PLM. AIC and BIC support fit of the constrained model (fit reported above) relative to the fully freed model (AIC = 24477.33; BIC = 25032.14). Thus, we proceeded with a model that utilized both samples as a single group (N = 1,614).
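As a rough consistency check on the information criteria reported above, the number of estimated parameters implied by each AIC and BIC pair can be recovered by arithmetic. The sketch below assumes the standard definitions AIC = -2 ln L + 2k and BIC = -2 ln L + k ln N, so that BIC - AIC = k(ln N - 2), and it assumes N = 1,614 for all three models. The model labels and the function name `implied_parameters` are hypothetical conveniences, not names from the original analysis.

```python
import math

N = 1614  # combined sample size reported in the text

# (AIC, BIC) pairs as reported in the text; the labels are illustrative
reported = {
    "2PLM, single group": (24412.62, 24692.72),
    "3PLM, single group": (24423.52, 24843.66),
    "2PLM, fully freed across samples": (24477.33, 25032.14),
}

def implied_parameters(aic, bic, n):
    """Solve BIC - AIC = k * (ln n - 2) for the parameter count k."""
    return (bic - aic) / (math.log(n) - 2.0)

implied = {
    label: implied_parameters(aic, bic, N)
    for label, (aic, bic) in reported.items()
}

for label, k in implied.items():
    print(f"{label}: implied k is about {k:.1f}")
```

Under these assumptions, the first two values come out very close to 52 and 78, which is what 26 items with two and three parameters each would give, and the fully freed model implies a larger parameter count, as expected for a model that frees parameters across samples.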
Items were selected such that each recommended 10-item set would reflect a range of difficulty (b) parameters and that the proportion of each item type would be similar between the two sets. Further, we avoided selecting items with particularly high discrimination (a) parameters (e.g., above 2.0) to avoid overly weighting any one item. Empirical reliability for the full 26 items was .70, and empirical reliability for both 10-item measures was above .95. Figure 3 illustrates the test information function (TIF) for the full 26 items, and Figure 4 illustrates TIFs for the two 10-item measures. Table 6 includes parameter estimates for all 26 items estimated across samples, an indication of item-set assignment for each of the 10-item measures, as well as mean raw scores and standard deviations for all item sets.
Additionally, we aimed to construct a CAT that researchers could use to efficiently assess intelligence in just a few items. CAT is an iterative assessment procedure whereby item locations are matched as closely as possible to respondent ability levels. The standard process in a CAT is to start by assuming an individual has average/moderate ability, present a single item, update the estimate of ability based upon the respondent's response and a given response model, select and present the next item that maximizes information at that ability level, and so on until the termination criterion has been reached (i.e., a set length or a set standard error of measurement). CATs provide maximum utility when there are many candidate items that can be matched precisely to any estimated ability level (i.e., items span a wide range of locations; Flaugher, 2000). All 26 items were included in the CAT item bank. All 26 item stimuli as well as R code for scoring the 10-item measures from participant responses and administering the CAT are included in the online supplementary materials.
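The CAT procedure described above can be sketched as a short loop. The Python code below is a simplified illustration and not the R code in the supplementary materials. It assumes a 2PLM, maximum-information item selection, and expected a posteriori (EAP) ability estimation on a fixed grid with a standard normal prior; the item bank, the stopping values, and all function names are hypothetical.

```python
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

GRID = [g / 10.0 for g in range(-40, 41)]  # theta grid from -4 to 4

def eap(responses, bank):
    """EAP estimate and posterior SD under a standard normal prior."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior density
        for idx, x in responses:
            p = p_2pl(t, *bank[idx])
            w *= p if x == 1 else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, max_items=10, se_target=0.4):
    """answer(idx) returns 1 (correct) or 0 (incorrect) for bank item idx."""
    theta, se, responses = 0.0, float("inf"), []  # start at average ability
    remaining = list(range(len(bank)))
    while remaining and len(responses) < max_items and se > se_target:
        # pick the unused item with maximum information at the current estimate
        idx = max(remaining, key=lambda i: information(theta, *bank[i]))
        remaining.remove(idx)
        responses.append((idx, answer(idx)))
        theta, se = eap(responses, bank)  # update the ability estimate
    return theta, se, responses

# Hypothetical bank of (a, b) pairs; NOT the estimates from Table 6
bank = [(1.0, -2.0), (1.3, -1.0), (1.6, -0.3), (1.4, 0.4), (1.2, 1.1), (1.5, 1.9)]

# Deterministic stand-in respondent: correct whenever item difficulty is below 0.5
theta_hat, se_hat, administered = run_cat(bank, lambda i: int(bank[i][1] < 0.5))
print(round(theta_hat, 2), round(se_hat, 2), [i for i, _ in administered])
```

In this sketch the loop stops at whichever comes first: the bank is exhausted, the maximum test length is reached, or the posterior standard deviation falls to the target. With a bank as small as this hypothetical one, the standard error target is unlikely to be met, which illustrates why a CAT benefits from many items spanning a wide range of locations.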

DISCUSSION
This study aimed to provide the first modern psychometric review of the Sandia Matrices as well as to recommend two 10-item measures and construct a CAT for use by researchers. Specifically, we reviewed two psychometric issues historically evaluated in the RPMs: dimensionality and sex differences. Present results suggest that the Sandia Matrices may show a two-factor structure, although it is unclear whether that two-factor structure is an artifact of item types or influenced by differences in the underlying cognitive processes required to solve the items. Notably, these results are consistent with prior research that suggests test artifacts are more likely to account for the two-dimensional structure of the RPMs than are differences in required cognitive strategies (Vigneau & Bors, 2008). Regardless, the primary concern for the influence of a two-dimensional structure is rooted in potential sex differences on visuospatial items. Our results do not show evidence of sex differences.
Nonetheless, results highlight other potential concerns. First, several items were removed prior to conducting key analyses due to evidence of extreme item parameters. In all cases, these items seemed to include a single competitive distractor. Even after eliminating these three problematic items, use of the 3PLM was warranted for additional item review due to evidence of substantial c (i.e., "guessing") parameters. Notably, items used in this study were pregenerated and had already been reviewed to ensure appropriate distractors (Matzen et al., 2010). Thus, we caution against using new software-generated problems without manually checking or manipulating distractors. Even using pregenerated and edited items without first conducting a thorough IRT-based psychometric review may yield misleading results.
³ To evaluate potential measurement bias at the item level, we also conducted differential item functioning (DIF) analyses (Meade & Lautenschlager, 2004). No more than one item demonstrated possible evidence of DIF in any sample (i.e., less than the 10% that would be expected due to Type I error using a liberal significance criterion of .10).
Note. Fully freed baseline model: item loadings, item thresholds, means, and trait correlations allowed to vary across genders. Partially constrained model: item loadings and item thresholds constrained; means and trait correlations allowed to vary. Fully constrained: item loadings, item thresholds, means, and trait correlations constrained (i.e., single group model). In Sample 1, 12 participants did not report gender, yielding a total sample size of 1,255 for gender analyses.
Here, we have recommended two compilations of items with relatively reasonable parameters. The provided R code allows researchers to administer the recommended item sets or CAT to as few as a single participant and still derive theta estimates using the IRT parameters provided here. We expect that these measures and corresponding review of item properties will substantially aid researchers in implementing the Sandia Matrices in their own studies.

Limitations and Future Directions
This study utilized participants recruited from multiple sources, including Amazon Mechanical Turk and an undergraduate participant pool. Although the diversity of sources bolsters confidence in our findings, these populations may have intelligence distributions that differ from the average adult in the United States. We encourage future research to explore characteristics of the Sandia Matrices in the broader population.
Additionally, the factor analytic approach used here is not necessarily appropriate for fully testing the types of cognitive strategies underlying the Sandia Matrices items. Given that the primary aim of investigating dimensionality in this study was to better understand potential sex differences, the limited evidence of sex differences, and the consistency of our approach with other studies investigating the dimensionality of the RPMs (see Waschl et al., 2016 for a review), we believe the analytic approach used here was sufficient for our purposes. Nonetheless, researchers interested in exploring cognitive strategies specifically should consider more advanced analysis approaches that were beyond the scope of this study (see Embretson et al., 1986; Mislevy & Verhelst, 1990).
Finally, additional studies might also consider how item types, including types of transformations and combinations, influence discrimination and location parameters to better inform construction of other Sandia Matrices item sets or use of the software in generating additional items. To fully supplant the RPMs with the Sandia Matrices, researchers will need to understand how to compile sets of Sandia Matrices items equivalent to both the Raven's Standard Progressive Matrices and the Raven's Advanced Progressive Matrices.

Conclusion
Although the RPMs are an extremely popular measure of intelligence, their proprietary status represents a limitation for many researchers. This study offers an initial IRT-based psychometric evaluation of Matzen et al.'s (2010) free alternative that shows no evidence of sex differences. We hope that the multiple, curated item sets recommended here will spur additional exploration of the Sandia Matrices as well as greater implementation of intelligence measurement in psychological research.