An Empirical Investigation of the Fairness of Multiple-Choice Items Relative to Constructed-Response Items on Tests of Students’ Mastery of Course Content

Although many instructors prefer multiple-choice (MC) items due to their convenience and objectivity, many others eschew their use due to concerns that they are less fair than constructed-response (CR) items at evaluating student mastery of course content. To address three common unfairness concerns, I analyzed performance on MC and CR items from tests within nine sections of five different biology courses I taught over a five-year period. In all nine sections, students' scores on MC items were highly correlated with their scores on CR items (overall r = 0.90), suggesting that MC and CR items quantified mastery of content in an essentially equivalent manner, at least to the extent that students' relative rankings depended very little on the type of test item. In addition, there was no evidence that any students were unfairly disadvantaged on MC items (relative to their performance on CR items) due to poor guessing abilities. Finally, there was no evidence that females were unfairly assessed by MC items, as they scored 4% higher on average than males on both MC and CR items. Overall, there was no evidence that MC items were any less fair than CR items testing within the same content domain.

construct validity (Lukhele, Thissen, & Wainer, 1994; Bacon, 2003). In that sense, a high positive correlation would be strong evidence that MC items are just as fair as CR items in terms of quantifying student mastery. A low correlation, in contrast, would suggest that either the MC items, the CR items, or both possessed inadequate construct validity.
Although students may guess the correct answer to any type of test item, the limited number of options available makes MC items more susceptible to guessing than are CR items (Zimmerman & Williams, 2003; Bar-Hillel, Budescu, & Attali, 2005; Bush, 2015; Campbell, 2015). In particular, unskilled test-writers often unintentionally provide cues that a test-wise student can use to discern the correct response without actually possessing mastery of the tested construct (Dolly & Williams, 1986; Bar-Hillel & Attali, 2002; Attali & Bar-Hillel, 2003). As a result, guessing can lower the discriminability of an item and the reliability of a test. Fortunately, a test-writer can easily learn to avoid cueing in their MC items once they recognize the problem (Haladyna, Downing, & Rodriguez, 2002; Attali & Bar-Hillel, 2003; Schroeder, Murphy, & Holme, 2012; Towns, 2014; Ibbett & Wheldon, 2016; Rodriguez & Albano, 2017). In terms of fairness, the practical issue is not so much that MC items are more susceptible to guessing than are CR items, but that some students may be consistently better guessers than other students. More specifically, students commonly believe that MC items favor test-wise students at the expense of students who know the content better but are not good at guessing (Hoffmann, 1962; Alker et al., 1969). The idea that students vary in test-wiseness is not controversial.
The unproven notion is that the good guessers will tend to be the students with low mastery, while the students who have mastered the content better will be poor guessers. This notion is at the heart of the second fairness issue regarding MC items addressed in the current study.
The third fairness issue is important from a sociological standpoint: MC tests may unfairly disadvantage certain groups, particularly those that are traditionally underrepresented in STEM areas (Bell & Hay, 1987; Ben-Shakhar & Sinai, 1991; Bond, 1995; Lissitz, Hou, & Slater, 2012). For instance, it has often been claimed that females' mastery, competency, or aptitude may be underestimated by a test that uses MC items compared to a test on the same set of constructs that uses CR items (DeMars, 2000; Simkin & Kuechler, 2005; Kuechler & Simkin, 2010). The available evidence addressing this claim is mixed, with variation in outcomes attributed to such factors as the stakes, setting, and subject matter of the tests (Lumsden & Scott, 1987; Feinberg, 1990; Bridgeman & Lewis, 1994; Becker & Johnston, 1999; Beller & Gafni, 2000; DeMars, 2000; Chan & Kennedy, 2002; Stanger-Hall, 2012). Clearly, more empirical data on the issue of gender differences in performance (particularly in classroom settings) would be of great value for teachers who use or are considering the use of MC items in their tests.
To address these three fairness issues, I examined five years of test-performance data from biology courses that I have taught at Roanoke College. Specifically, I analyzed students' relative performance on MC and CR items within courses to ask three main questions: (1) How strongly are students' scores on MC and CR items correlated within tests of the same set of constructs? (A low or negative correlation would suggest unfairness in terms of unequal construct validity.) (2) Are some students substantially worse at answering MC items than their performance on CR items would predict? (Such a pattern would suggest MC items are unfair to "bad guessers.") (3) Do females (or males) perform worse on MC items than their performance on CR items would predict? (Such a pattern would be evidence of gender-based unfairness.)

The Data Set
The sources of data for this study were nine lecture sections of five different biology courses (Table 1) taught by the author at Roanoke College, a selective liberal arts college of nearly 2,000 undergraduate students in southwestern Virginia, USA. These courses included introductory biology courses for majors, general ecology, and a general-education science course for non-science majors. The number of students per section ranged from 17 to 48, and 18 students completed two different courses in the set.
The full dataset included 257 sets of scores (one per student per lecture section) that were calculated from combined items across the tests administered in each course. These tests included a final exam in each course, plus three to five hourly exams for all courses, except for the 2015 sections of BIOL 205, which included one midterm exam and eight half-hour quizzes. (For simplicity, all exams and quizzes will be referred to as "tests.") In all, 239 different students were part of this study, with 18 students taking two separate courses. MC items constituted about one-third to just over one-half of the total points for the tests in a course (Table 1). On most tests, MC items were counted as two points each, but they were counted as three points on the quizzes and final exam for the 2015 sections of BIOL 205, and one point on the final exam for the 2012 section of BIOL 205. (These differences in points-per-item merely reflect differences in weighting on tests of different lengths, and they have no impact on the inferences of the analyses in this study.)
Every MC item consisted of five response options, including one correct answer (the key) and four distractors.
The CR items included the following types of questions: fill-in-the-blank; labeling, interpreting and/or drawing diagrams; mathematical problems (e.g., Hardy-Weinberg and Mendelian genetics scenarios); short-answer questions requiring one or two sentences; and essay questions requiring one or two paragraphs. The point values of the CR items varied depending on the length and complexity of the required responses. All test items were written and graded solely by the author of this paper, a fact that allowed for consistency in content coverage, writing style, and grading procedures.

Statistical Analyses
For each student in each course section, I calculated two standardized performance metrics: one for MC items and one for CR items. Within a course section, I calculated each student's total MC points across all tests, then divided this sum by the total possible MC points (Table 1) to obtain a proportional score. I standardized these proportions by subtracting the section's mean score for MC items from each student's score and dividing by the standard deviation. The same procedures were performed on the CR item scores. The means for both MC and CR scores were thus standardized to zero within each section, and students' scores were expressed as the number of standard deviations above or below the mean. This standardization allowed for meaningful comparisons of students' relative performances, as it factored out differences in mean scores between MC and CR items within and between courses.
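To make the procedure concrete, the following is a minimal sketch of the standardization step in Python with pandas. It is not the author's actual analysis code; the file name and column names (section, mc_points, mc_possible, cr_points, cr_possible) are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical layout: one row per student per course section, holding the
# points that student earned and the total points possible for each item type.
scores = pd.read_csv("test_scores.csv")  # assumed file, for illustration

# Proportional scores: points earned divided by total possible points (Table 1).
scores["mc_prop"] = scores["mc_points"] / scores["mc_possible"]
scores["cr_prop"] = scores["cr_points"] / scores["cr_possible"]

# Standardize within each course section: subtract the section mean and divide
# by the section standard deviation, so each section has mean 0 and SD 1.
def zscore(x):
    return (x - x.mean()) / x.std()

scores["mc_z"] = scores.groupby("section")["mc_prop"].transform(zscore)
scores["cr_z"] = scores.groupby("section")["cr_prop"].transform(zscore)
```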
For each student within each course section, I calculated a single metric (∆M-C, the student's standardized MC score minus his or her standardized CR score) to indicate relative performance on the two item types. If MC items disproportionately tend to trip up students who know the material well, then there would be an excess of students with lower-than-expected ∆M-C scores. In contrast, if MC and CR items are equivalent in how well they test mastery (as opposed to guessing ability), then the distribution of ∆M-C scores would be expected to follow a normal distribution with a mean of zero and with an absence of extreme (i.e., low-probability) outliers.
To test these predictions, I compared the distribution of ∆M-C scores with a normal distribution (having the same mean and standard deviation as the actual distribution) using the Shapiro-Wilk goodness-of-fit test. I also examined the scatterplot of standardized MC scores versus CR scores (Figure 1) for clusters of students scoring substantially higher or lower on MC items than their CR scores would predict.
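Continuing the hypothetical sketch above, the ∆M-C metric and the normality check might look like the following; the scipy Shapiro-Wilk call is standard, but the column names remain illustrative assumptions.

```python
from scipy import stats

# Relative performance: standardized MC score minus standardized CR score.
scores["delta_mc"] = scores["mc_z"] - scores["cr_z"]

# Shapiro-Wilk test: a small P value would indicate that the delta scores
# depart from a normal distribution (e.g., an excess of extreme guessers).
w_stat, p_value = stats.shapiro(scores["delta_mc"])
print(f"Shapiro-Wilk W = {w_stat:.3f}, P = {p_value:.4f}")
```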

Potential Gender Bias
To assess potential gender biases in relative performance on different types of test items, I ran analyses of variance (ANOVAs) on MC scores, CR scores, and ∆M-C scores using gender as the explanatory variable. I included students' academic standing (i.e., year) as a covariate, allowing for the fact that college experience is likely to explain a substantial amount of the variation in test scores. The academic standings of the students comprised 84 freshmen, 86 sophomores, 59 juniors, and 28 seniors, and the genders comprised 163 females and 94 males. As far as I was aware, the students' identified genders matched their biological genders.
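One way to run such ANOVAs is sketched below with statsmodels, again using the hypothetical data frame from the earlier sketches. Treating academic standing as a categorical factor is my assumption, and the exact model specification may differ from the one used in this study.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One ANOVA per response variable, with gender and academic standing (year)
# as explanatory variables; "gender" and "year" are assumed column names.
for response in ["mc_z", "cr_z", "delta_mc"]:
    model = smf.ols(f"{response} ~ C(gender) + C(year)", data=scores).fit()
    print(response)
    print(sm.stats.anova_lm(model, typ=2))
```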

Potential Validity Differences
The Pearson correlation coefficients between MC and CR scores were positive and highly statistically significant (Table 1). Across all nine sections together, the correlation between the 257 pairs of standardized MC scores and CR scores was r = 0.90 (P < 0.0001; Figure 1). In addition, the relative rankings of students within courses were strikingly consistent whether based on MC scores or CR scores, as shown by the clustering of points around straight lines with a slope of one in Figure 2. While there are a few points that deviate from the line, there are no points that indicate a reversal in rankings when scores are based on one type of question versus the other.
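For readers who wish to reproduce this kind of analysis on their own test data, a minimal sketch follows, reusing the hypothetical data frame from the Methods sketches; scipy's pearsonr returns the correlation coefficient and its P value.

```python
from scipy import stats

# Overall correlation between standardized MC and CR scores.
r_all, p_all = stats.pearsonr(scores["mc_z"], scores["cr_z"])
print(f"All sections: r = {r_all:.2f}, P = {p_all:.2g}")

# Correlation within each course section (analogous to Table 1).
for section, grp in scores.groupby("section"):
    r, p = stats.pearsonr(grp["mc_z"], grp["cr_z"])
    print(f"{section}: r = {r:.2f}, P = {p:.2g}")
```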

Discussion
To appreciate fully the implications and limitations of this study, it is important to acknowledge its retrospective nature. Specifically, none of the items on the tests was written with the intention to address the questions that are the focus of this paper. I designed each test to include different types of test items that together assessed mastery of as complete a sample of constructs as possible.

Potential Validity Differences
In this study, students' scores on MC items and CR items on tests within the same content domain were very similar. Specifically, across nine biology courses, the correlation coefficients between scores on MC items and CR items ranged from 0.81 to 0.94, with a mean of 0.90. The high correlation does not by itself address whether the items were effective at estimating mastery. That is, the correlations are not quantifications of construct validity. (The best tool to evaluate true mastery and quantify construct validity is a clinical interview of students, a process that is not feasible for most classes.) What the high correlation does indicate is that the two types of items were nearly equally effective at quantifying mastery, and thus they must at least be very similar in construct validity, whatever the true value of that validity may be.
In more concrete terms, equal validity of MC and CR items suggests that a student's grade would be very similar regardless of which type of items made up the tests. In this study, the relative rankings of students were indeed quite similar whether they were based solely on their scores on MC items or solely on their scores on CR items. Combined with the high positive correlations between scores on MC and CR items, the consistency in relative grade rankings suggests that a test can be equally fair whether it is made up entirely of MC items, CR items, or a combination of the two types of items. More to the point, this study revealed no indication that MC items were any less fair than CR items to students due to any concerns regarding inadequate construct validity.
Such consistency in students' relative rankings would not have been possible unless MC and CR items were similar in construct validity. In particular, I submit that if a teacher wants to include MC items on a test to take advantage of their efficiency and objectivity, then the results of the current study can be cited as support that doing so need not involve a sacrifice in fairness in terms of construct validity.

Potential Guessing Bias
Even a completely naïve test-taker is expected to answer a certain proportion of multiple-choice items correctly through random guessing. The potential for guessing may thus make an MC item less difficult than a CR item that tests the same construct. However, this fact does not necessarily make MC items less fair. They would only be unfair to the extent that guessing decouples students' performance from students' mastery in a way that is unique to MC items. Specifically, MC items would be unfair if students with low mastery tended to be good at guessing answers to MC items, while students with higher mastery tended to be worse at guessing when they did not know an answer. Such a phenomenon would be evidenced by a cluster of students with higher scores on MC items than their scores on CR items would predict (the good guessers) and a cluster of students with lower scores on MC items than their scores on CR items would predict (the bad guessers).
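As a rough, back-of-the-envelope illustration of this point (an idealized calculation that assumes purely blind guessing, not an analysis of the study data): with five response options per item, a student who has truly mastered a fraction p of the tested material and guesses at random on the rest has an expected proportional MC score of

```latex
E[\text{MC score}] = p + \frac{1 - p}{5},
```

so guessing alone (p = 0) yields an expected score of 20%, and the inflation of the MC score relative to an otherwise equivalent CR item shrinks as mastery increases. Under this idealized model, for instance, a mean mastery level of about 75% would produce roughly the five-percentage-point MC advantage reported below.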
Across the nine courses in this study, the mean score for MC items was five percentage points higher than the mean score for CR items. This difference in difficulty was likely to have been influenced by students with incomplete mastery being more likely to guess a correct response for MC items than to construct a correct written response for CR items. Nevertheless, this study found no evidence that MC items were unfair due to a decoupling of guessing ability from mastery of constructs. Specifically, there were no students with substantially higher (or lower) scores on MC items than their scores on CR items would predict (there were no clusters of scores in the upper-left or lower-right corner of the plot in Figure 1).

Potential Gender Bias
The question remains whether females are disadvantaged by the types of MC items that are typically asked to test classroom mastery: items that cover material that should be familiar to the test-takers, items that are not excessively difficult, and items that are scored without penalties to discourage guessing. The dataset from the biology courses that make up the current study can be used to address this question in an authentic manner because all of the items were designed to test mastery of content and constructs specifically covered in the course. Furthermore, there were no numeric penalties for guessing on any of the tests.
Across the nine biology courses in the current study, females scored an average of 4% higher than males on both MC and CR items. This result is consistent with the nearly universal pattern of females earning higher grades than males in classroom assessments (Halpern, 2004). More importantly in terms of the focal issue of this study, there was no difference between male and female students in terms of relative performance on MC items compared to CR items. Specifically, the mean ∆M-C score was essentially zero for both genders. Therefore, there was no hint of evidence that MC items were unfair to females (or to males) relative to CR items.
Finally, there was a consistent trend toward more-experienced students performing better on both MC and CR items. Specifically, seniors performed significantly better than juniors, sophomores, and freshmen on both types of items. Nevertheless, academic standing did not influence students' relative performance on MC and CR items. Thus, it appears that as students gained experience with college courses, their abilities on MC and CR items increased at the same rate, roughly five percentage points per academic year. It is also possible that the mean scores increased with academic standing as the poorer-performing students tended to drop out of the curriculum while the better-performing students remained through their senior year. Either way, the use of MC items was equally fair to students across the range of academic standing.

Conclusion
Recent studies have shown that multiple-choice items are efficient, reliable, and objective tools for assessing mastery of content, and they can even be used for testing higher-order cognitive skills. The results of the current study on test items from nine biology courses should help alleviate lingering concerns that multiple-choice items may be less fair to students than constructed-response items.
Specifically, the high positive correlations (mean r = 0.90) between students' scores on MC and CR items within courses suggest that the two types of items were essentially equally valid tests of mastery.
Second, there was no evidence that any of the 239 students received substantially higher (or lower) overall scores due to being better (or worse) guessers of MC answers than the assessment of their mastery based on CR items suggests they should have received. Third, there was no evidence that MC items unfairly disadvantaged females (or males) relative to their performance on CR items. These results should reassure teachers who use MC items on their tests that they are not unfairly judging student mastery, as well as encourage others who have been reluctant to take advantage of the benefits of MC items on their tests.
One important caveat is that not all MC items are created equal. For instance, MC items that are overly complex, poorly written, or lacking a single clearly correct response option will not only frustrate students but will also reduce the validity of the items. In addition, MC items that contain unintentional cues can be exploited by test-wise students, such that guessing is more likely to disconnect performance from mastery. Reviews have shown that tests written by teachers, and even commercial test banks, are often replete with MC-item-writing flaws that can make the items less valid and less reliable, and thus less able to function as fair assessments of student mastery (Haladyna & Downing, 1989; Haladyna, 2004; Frey et al., 2005; Haladyna & Rodriguez, 2013; Towns, 2014; Rodriguez & Albano, 2017; Scully, 2017).
Even though this study suggests that students' relative rankings on tests are likely to be very similar regardless of whether MC or CR items are used, I do not suggest that MC items should be the only assessment tool used in a class, or even on a single test. For one thing, even though MC items can test higher-level thinking, in actual practice they tend to be written in a manner that requires only the lower-level cognitive processes of recognition and recall (Tarrant et al., 2006; Momsen, Long, Wyse, & Ebert-May, 2010; Baig et al., 2014; Rush et al., 2016). The inclusion of CR items can help ensure that tests will include sufficient items that require higher-order cognitive processes. Moreover, CR items are required to test the vital skills of logically formulating and clearly writing arguments (Aiken, 1987).
Finally, students may spend less time preparing and study more superficially for MC tests because of the perception that they will merely need to recognize the correct answers provided on the test (Gustav,