The Advantages of Five-Option Multiple-Choice Items in Classroom Tests of Student Mastery

The effectiveness of multiple-choice (MC) items depends on the quality of the response options, particularly how well the incorrect options ("distractors") attract students who have incomplete knowledge. It is often contended that test-writers are unable to devise more than two plausible distractors for most MC items, and that the effort needed to do so is not worthwhile in terms of the items' psychometric qualities. To test these contentions, I analyzed students' performance on 545 MC items across six science courses that I have taught over the past decade. Each MC item contained four distractors, and the dataset included more than 19,000 individual responses. All four distractors were deemed plausible in one-third of the items, and three distractors were plausible in another third. Each additional plausible distractor led to an average increase of 13% in item difficulty. Moreover, an increase in plausible distractors led to a significant increase in the discriminability of the items, with a leveling off by the fourth distractor. These results suggest that, at least for teachers writing tests to assess mastery of course content, it may be worthwhile to eschew recent skepticism and continue to attempt to write MC items with three or four distractors.

Specifically, would decreasing the number of response options change the difficulty or discriminability of MC items? Although previous research on these questions provides limited and sometimes contradictory advice, it seems likely that as the number of response options is decreased, the difficulty and discriminability of MC items will also tend to decrease (Haladyna & Downing, 1989b; Haladyna et al., 2002; Rodriguez, 2005). Clearly, more empirical evidence from organically derived MC items (i.e., from real classroom tests, rather than experimentally altered items) is required in order to give the best advice to teachers regarding the number of distractors they should include in their MC items.
The considerations that go into choosing the number of distractors and the constraints on test-writers depend on the type of test and the goals of the test-writer. For instance, the number of response options that is ideal for a large standardized test may differ from the optimal number for smaller classroom exams. The current study was undertaken from the perspective of a teacher writing MC items to assess students' mastery of content covered in class. I analyzed data from more than 19,000 responses across 545 five-option MC items to address the following specific questions regarding the number of distractors: 1) How many different distractors were chosen per MC item? 2) When more than one distractor was selected for an MC item, with what frequency were the different distractors chosen? 3) How did the number of distractors chosen relate to the difficulty of the items? and 4) How did the number of distractors chosen relate to the effectiveness of the items, as measured by a discrimination index?

The Data Set
The analyses in this paper include testing data from six courses that I taught at Roanoke College from 2012 to 2019 (Table 1). These courses include three introductory courses for biology majors, two versions of a general ecology course, an introductory environmental science course, and a general-education course for non-science majors. In total, 545 multiple-choice items were analyzed from hourly, midterm, and final exams given in these courses (Table 1). Each of these MC items included one key and four distractors. For each MC item, I recorded the responses (viz., A, B, C, D, or E) from every student (499 total) who took the exams, for a total of 19,150 responses (minus a small handful of inadvertent non-responses).

Number of Plausible Distractors
The first question I addressed was the relative frequency of MC items having one, two, three, or four plausible distractors. A common cutoff for a distractor to be considered "plausible" is that at least 5% of test-takers must choose that response. Because my classes were small (ranging from 23 to 51 students), I considered a distractor to be plausible if at least one student chose it. Because many MC items on the tests were intended to be relatively easy, I also looked at the number of distractors chosen per item for the subset of items for which fewer than 80% of the students chose the correct response. In other words, I addressed the question of how many different distractors were chosen per item after excluding the 182 easiest items, because few students choose distractors for the easy items, often by design.
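To make the tallying concrete, here is a minimal sketch (in Python, for illustration only; it is not the workflow used in the study) of how plausible distractors can be counted for one item under the criteria above. The response counts are hypothetical.

```python
from typing import Dict

def count_plausible_distractors(counts: Dict[str, int], key: str) -> int:
    """Count distractors chosen by at least one student (the criterion used above)."""
    return sum(1 for option, n in counts.items() if option != key and n >= 1)

def proportion_correct(counts: Dict[str, int], key: str) -> float:
    """Proportion of the class that chose the key."""
    return counts[key] / sum(counts.values())

# Hypothetical item from a 23-student class, with key "C":
item = {"A": 2, "B": 0, "C": 17, "D": 3, "E": 1}
print(count_plausible_distractors(item, "C"))   # -> 3 plausible distractors
print(round(proportion_correct(item, "C"), 2))  # -> 0.74, below the 0.8 cutoff
```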

Difficulty and Discrimination Index
The "difficulty" of an item is traditionally defined (somewhat counter-intuitively) as the proportion of test-takers who chose the correct response, and it is often abbreviated as "p". Thus, for an item that all students answered correctly, p = 1, while for an item that no students answered correctly, p = 0.
The "discriminability" of an item can be quantified by how well students perform on the item, compared to their overall performance on a test. A highly effective item in this sense will be answered correctly by students who do relatively well on the test as a whole, and incorrectly by students who do relatively poorly overall. In this study, I quantified the discriminability of each item by a standard discrimination index (DI) that was calculated with a point-biserial regression (Towns, 2014).
Specifically, for each item, DI is equivalent to a Pearson product-moment correlation between whether each student answered the item correctly (1) or incorrectly (0) and the student's total score on all the items of an exam. The higher the correlation, the more discriminating the item, and (loosely speaking) the more likely it was that the item did a good job at assessing mastery. While a DI can theoretically range from -1 to 1, any DI > 0.2 generally indicates that the item was sufficiently discriminating, and an item with DI > 0.4 is considered excellent (Towns, 2014).
To explore the relationship between the number of plausible distractors and the difficulty and discriminability of MC items, I ran two linear regressions, one with difficulty (p) and one with DI as the response variable. The predictor variable for both regressions was the number of plausible distractors. I excluded from these regressions the 11 MC items for which no distractors were chosen by any student because such items are meaningless in terms of discriminability. All of the statistical analyses reported in this paper were performed using JMP IN v. 4.0.4 (SAS Institute, Cary, North Carolina, USA).
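For readers who want to reproduce these statistics on their own data, the following sketch shows the computations described above in Python with NumPy and SciPy (the study itself used JMP). The score matrix and distractor counts are simulated placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 score matrix: 40 students x 25 items (1 = correct).
rng = np.random.default_rng(0)
scores = (rng.random((40, 25)) < 0.7).astype(int)

totals = scores.sum(axis=1)  # each student's total score on the exam

# Difficulty (p): the proportion of test-takers answering each item correctly.
p = scores.mean(axis=0)

# Discrimination index (DI): the point-biserial (here, Pearson) correlation
# between each item's 0/1 score and the total exam score.
di = np.array([stats.pearsonr(scores[:, j], totals)[0]
               for j in range(scores.shape[1])])

# Linear regression of DI on the number of plausible distractors per item
# (the distractor counts below are simulated placeholders).
n_plausible = rng.integers(1, 5, size=25)
result = stats.linregress(n_plausible, di)
print(f"slope = {result.slope:.3f}, p-value = {result.pvalue:.3f}")
```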

Number of Plausible Distractors
The number of different distractors chosen per item ran the entire gamut (Figure 1). Two or fewer distractors were chosen for ~35% of the MC items, including ~2% for which every student chose the correct response. Nearly two-thirds of the items contained at least three plausible distractors, with nearly one-third containing four plausible distractors.

When the 182 "easy" MC items (with p ≥ 0.8) were excluded from consideration, the proportions of items with higher numbers of distractors chosen increased substantially (unfilled bars, Figure 1). For the 363 items with p < 0.8. only 15% had two or fewer plausible distractors, while nearly half of the items included four plausible distractors.
Although more than one distractor was chosen for most items, the frequency with which different distractors were chosen varied substantially (Figure 2). That is, only rarely were there multiple equally plausible distractors for an item. For instance, when there were two plausible distractors, the second-most plausible distractor was chosen only half as frequently as the most plausible distractor (Figure 2). When three or more distractors were chosen per item, the third-most plausible distractor was chosen by 16% of the students who chose an incorrect response. When all four distractors were chosen for an item, a single distractor accounted for nearly half of the incorrect responses. Nevertheless, even the least plausible distractor made up a substantial minority (10% on average) of the incorrect responses when all four distractors were chosen for an item (Figure 2).

Figure 2 (caption excerpt): For the MC items in which four different distractors were chosen by students in a class, the most plausible distractor was chosen 49% of the time, and the fourth-most plausible distractor was chosen 10% of the time when students chose an incorrect option.

Difficulty and Discrimination Index
The difficulties (p) of the MC items decreased steadily as the number of plausible distractors increased; on average, each additional plausible distractor corresponded to a 13% increase in item difficulty (Figure 3A). The discriminability of the items also rose with the number of plausible distractors: a linear regression estimates that for every plausible distractor, the DI increased by a mean of 0.04. However, the relationship was not strictly linear, as the effect of increasing the number of distractors lessened as more distractors were included (Figure 3B). In particular, there was a large jump (43%) in DI between having one and two plausible distractors, but much smaller jumps in DI from two-to-three (8%) or from three-to-four (4%) plausible distractors.

Number of Distractors per MC Item
This study demonstrated that it is not unrealistic to expect to be able to write multiple-choice items that have more than two plausible distractors for classroom exams. This finding runs counter to a growing contention that the challenge of devising more than two distractors is severe enough to preclude the use of MC items with more than three response options (Delgado & Prieto, 1998; Haladyna et al., 2002; Tarrant et al., 2009; Dehnad et al., 2014; Kilgour & Tayyaba, 2016). Each of the 545 MC items in my dataset contained four distractors (in addition to the key), and each distractor was intended to be a reasonable-sounding response that would tempt students who had incomplete knowledge of the construct tested by the item. One-third of the MC items ended up containing two or fewer plausible distractors, a plausible distractor being defined as one chosen by at least one student in a class. However, nearly one-third of the MC items contained three plausible distractors, and another third contained four plausible distractors.
In any class, there are concepts so fundamental that we expect all students to master them.
Furthermore, many classes target a median grade around the B-/C+ boundary (i.e., ~80%). Such a target requires quite a few "easy" items, and, by definition, very few distractors will be chosen for easy items. If we take these considerations into account and eliminate the 182 easiest items in this study (i.e., those for which the difficulty, p, was greater than or equal to 0.8), the conclusion that it is not prohibitively difficult to write multiple plausible distractors is even more strongly supported.
Specifically, 85% of the 363 "not easy" items had at least three plausible distractors, with four different distractors chosen for more than half of that subset.

The Role of Plausible Distractors in the Difficulty of MC Items
The difficulty of an MC item was very strongly associated with the number of plausible distractors that the item contained. However, it is not possible to discern a clear cause-and-effect relationship between the variables. Consider an item that addresses a relatively simple concept. Few students are likely to choose a distractor for a simple item, which means that fewer distractors are likely to have been chosen even once. At the same time, if the test-writer includes additional tempting distractors for a given item, then it is more likely that a student will stray from choosing the key. In the first scenario, the difficulty of the item can be seen as a cause of the number of distractors that are chosen. In the second, the difficulty can be seen as an effect of the number of plausible distractors. Neither conclusion is incorrect, as the difficulty and the number of plausible distractors are intimately connected. Regardless, it is hard to argue against the proposition that if the number of distractors were decreased, the key would be chosen more frequently (even by random chance alone), thereby making an item less difficult.
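A back-of-the-envelope calculation illustrates the random-chance point: a blind guesser chooses the key with probability 1/k on an item with k response options, so removing options mechanically raises the expected proportion correct.

```python
# Chance of a blind guesser choosing the key, as a function of option count.
for k in (5, 4, 3, 2):
    print(f"{k} options: chance of guessing the key = {1 / k:.0%}")
# 5 options: 20%; 4 options: 25%; 3 options: 33%; 2 options: 50%
```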
Despite this apparent truism, empirical studies have shown inconsistent results concerning the effect that the number of distractors has on the difficulty of an MC item. For instance, Haladyna and Downing (1993) reviewed four standardized MC tests and found no relationship between the number of plausible distractors and item difficulty. In an updated review, Haladyna et al. (2002) reported that five studies concluded that fewer distractors made for easier items, while two other studies found the opposite pattern. In a comprehensive meta-analysis of 56 independent trials from 27 studies, Rodriguez (2005) found that decreases in the number of response options in MC items led to significant decreases in difficulty.

The Role of Plausible Distractors in the Discriminability of MC Items
In the current study, the number of plausible distractors per item was also strongly correlated with the discriminability of the items. Unlike the relationship between the number of plausible distractors and item difficulty, the relationship with discriminability was not linear. Instead, the positive effect of the number of distractors on the discrimination index (DI) quickly leveled off. Specifically, there was a 43% increase in DI from one-to-two plausible distractors, an 8% increase from two-to-three plausible distractors, and only a 4% increase from three-to-four plausible distractors. Because the discriminability of items is of paramount importance in designing an effective and reliable test of mastery, a 4% increase may be well worth the effort needed to come up with a fourth plausible distractor. However, the pattern shown in this study suggests that going beyond four plausible distractors is likely to produce ever-diminishing gains in discriminability.

The pattern of increasing discriminability with a greater number of distractors is largely consistent with
other studies reported in the literature, though exceptions exist. For instance, in Haladyna and Downing's (1993) review, all four MC tests that examined this relationship found higher DI for items with greater numbers of plausible distractors. However, the updated review of Haladyna et al. (2002) reported one study with no effect of the number of distractors on DI, and another study that found that DI increased with a decrease in the number of response options. In his comprehensive meta-analysis, Rodriguez (2005) found that decreases in the number of response options almost always resulted in a significant decrease in the discriminability of MC items.
Inferences regarding the effect of the number of distractors on the discriminability of MC items depend heavily on the design of the study. Very often, studies that have found three-option items to be as discriminating as five-option items have employed the strategy of giving the five-option test first, then selectively removing the poorest-functioning distractors, then retesting students (sometimes the same ones) with the resultant three-option tests (Owen & Froman, 1987; Zoanetti et al., 2013).

Different Considerations for Standardized Tests versus Classroom Tests
The MC items analyzed in this study were written to assess students' mastery of content in my courses.
I did not design the items to test hypotheses regarding the optimal number of distractors (or any other hypotheses regarding the construction of MC items). Therefore, this study was not an experiment in which the numbers of distractors for given stems were deliberately altered in a replicated fashion to discern their effects on psychometric qualities. As such, the analyses are necessarily post-hoc.
Nevertheless, the number of MC items (545) and responses (>19,000) should make my inferences highly robust. Moreover, the organic origin of the MC items in the dataset lends authenticity to the inferences. In fact, several researchers have lamented the shortage of empirical studies on this type of authentic classroom data (Owen & Froman, 1987; Tarrant et al., 2009; Funk & Dickson, 2011). It should be stressed, however, that these inferences only strictly apply to similar testing situations, that is, testing of mastery of content delivered in a college course.
Classroom tests of course mastery differ in scope, character, and purpose from standardized tests in several ways that may influence guidance regarding the number of distractors that should be included in MC items. In short, there are reasons to consider including more distractors in classroom tests than may be optimal for standardized tests. Below, I summarize three of these reasons.
(1) Standardized tests are likely to cover a much wider range of content, and thus contain many more items, than a single classroom test. The most direct way to maximize the number of items that can be answered during a given amount of time is to minimize the number of response options per test item. In fact, a premium on efficiency is a main reason for the recent call for three-option items over the traditional four- or five-option items (Owen & Froman, 1987; Rodriguez, 2005; Nwadinigwe & Naibi, 2013; Dehnad et al., 2014; Schneid et al., 2014; Vegada et al., 2016). Because having fewer response options raises the risk of guessing, more items are required for a test with fewer response options to maintain the reliability of the test (Rodriguez, 2005; Royal & Stockdale, 2017); a brief sketch of this trade-off between test length and reliability follows this list. For instance, Zimmerman and Williams (2003) calculated that if a test included only items with three response options, it would need to include a minimum of 80 items to be sufficiently reliable. Except for comprehensive final exams, classroom tests are unlikely to include that many MC items. Similarly, except for very large lecture classes, tests are likely to contain a variety of constructed-response items in addition to a limited number of MC items. In my classes, it is quite uncommon for students to be pressed for time during tests due to having too many MC items to answer. Therefore, I have experienced no need to minimize the number of distractors per MC item for the sake of test-taking efficiency.
(2) Standardized tests are given to vast populations of test-takers who represent a wide range of knowledge and experiences. In contrast, a classroom is composed of a relatively small number of students, all of whom have ostensibly had equal exposure to the same material covered on a test in their class. Teachers know what content has been covered in the course, and they have a good sense of their students' common misunderstandings regarding various constructs. Teachers can take advantage of these misconceptions to construct multiple distractors that are likely to lure students who have incomplete mastery, thus leading to highly discriminating MC items. It is less practical to devise a set of multiple distractors that will tempt the multitude of diverse, anonymous takers of standardized tests.
In fact, there is no shortage of evidence of the use of flawed items and implausible distractors in a diversity of MC tests (Haladyna & Downing, 1993; Tarrant, Knierim, Hayes, & Ware, 2006; Tarrant et al., 2009; Kilgour & Tayyaba, 2016; Rush et al., 2016). Therefore, the rationale to minimize response options due to the difficulty of devising distractors may apply much more to standardized tests than to classroom tests.
(3) Not only are standardized tests given to multitudes of people, but they are used multiple times, year after year. Therefore, there are plenty of chances to identify and eliminate poorly functioning distractors so that only the most plausible distractors appear in revised versions of the items. Classroom teachers may reuse MC items from one class to another, but the data available to decide which distractors are implausible are much more restricted than for standardized tests. When writing new MC items, there is no way for a teacher to know for sure which reasonable-seeming distractors will end up luring no one. In addition, an unchosen distractor in one year may be a more attractive response in another year by chance alone. It would be self-defeating to eliminate potentially plausible distractors for the sole purpose of limiting the item to an arbitrary number of response options.
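Regarding reason (1) above: Zimmerman and Williams' (2003) exact calculation is not reproduced here, but the standard Spearman-Brown prophecy formula conveys the general trade-off, namely that a test whose items are individually less informative (e.g., more guessable) must be lengthened to reach a target reliability. This is a minimal sketch; the reliability values are made-up examples.

```python
def spearman_brown(rho: float, m: float) -> float:
    """Projected reliability when a test is lengthened by a factor of m."""
    return m * rho / (1 + (m - 1) * rho)

rho_40 = 0.70  # assumed reliability of a hypothetical 40-item test
for n_items in (40, 60, 80, 120):
    projected = spearman_brown(rho_40, n_items / 40)
    print(f"{n_items} items: projected reliability = {projected:.2f}")
# 40 items: 0.70; 60 items: 0.78; 80 items: 0.82; 120 items: 0.88
```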

Recommendations for Classroom Teachers
The target audience for this paper is teachers who use, or are considering using, multiple-choice items in their classroom testing. Fortunately, there are many highly accessible sources for guidance on the writing of high-quality MC items (e.g., Haladyna & Downing, 1989a; Haladyna et al., 2002; Frey et al., 2005; Moreno, Martinez, & Muniz, 2006; Towns, 2014; Gierl et al., 2017; Rodriguez & Albano, 2017).
I conclude this paper with several pieces of advice to teachers based on previously published guidelines, the quantitative results of this study on distractors, and my experience writing MC items for tests across a variety of courses: (1) Aim to include three or four distractors in your MC items. When teaching about a topic, keep track of misconceptions expressed by students, as these misconceptions can be the best fodder for MC-item distractors. The results of the current study show that it is not unreasonable to expect to come up with multiple plausible distractors for most MC items on content taught in a class. This study also shows that the more plausible distractors an item contains, the more discriminating the item is likely to be.
The improvement from three-to-four plausible distractors was not as large as from one-to-two, or from two-to-three, but it is worth the effort. The only distractor that is assured to be non-functional is the one that is never written.
(2) Most teachers include at least some easy items on each test. For instance, there are likely to be some concepts that are so fundamental that you expect all of your students to know the answer. For such items, most of your distractors will go unchosen by all the students. Unless you are content with half of your students failing a test, you will need to accept that some of your items will contain nominally implausible distractors. This does not necessarily mean that such items are flawed. The goals of boosting student performance and confidence can sometimes be more important than making sure that every item on a test is maximally discriminating and that all the distractors are chosen by a minimum percentage of your students for every MC item.
(3) Avoid unintentional cues in the text of your distractors that may help students guess the correct response option. Foremost, all of your response options should be homogeneous in content and grammatically consistent with the text of the stem. Failure on either of these fronts will ruin the appeal of a distractor. Also, avoid repeating a phrase from the stem in the correct response option (the key), and avoid making the key longer than the distractors. Failure on either of these fronts will make the key more obvious to guessers.
(4) One tempting way to increase the number of distractors in an MC item is to include "all of the above" (AOTA) or "none of the above" (NOTA) as a response option. However, most experts agree that these "inclusive" response options should only be used with caution, or never at all (Harasym, Leong, Violato, Brant, & Lorscheider, 1998). Among other problems, these inclusive options have the undesirable tendency to decrease the discriminability of an item. Therefore, if AOTA or NOTA is added to an item only to increase the number of distractors, it is more likely to make the item worse rather than better. Nevertheless, if the subject matter tested by an item suggests an organic use of an inclusive option, its use can be beneficial as long as the item is written very carefully to maximize clarity and minimize unintentional cuing (Frary, 1991; Rodriguez, 1997; Wise, 2020).
A final consideration regarding the use of an inclusive response option is that it should serve as the key with a frequency that is inversely proportional to the number of response options that the items on the test contain (Hansen & Lee, 1997; DiBattista, Sinnige-Egger, & Fortuna, 2014). For instance, if there are five response options in the MC items on a test, then AOTA (or NOTA) should be used as the key 20% of the time and as a distractor 80% of the time that it is used as a response option.
(5) Even more often than AOTA or NOTA, test experts caution against using complex response options, such as "Both A and B". While including options that are combinations of other options is an easy way to increase the number of distractors in an item, such an item is fraught with possibilities for ambiguity, misinterpretation, and frustration on the part of the test-takers. Thus, the increase in the number of distractors that results from combining response options would likely lead to an item being less, rather than more, discriminating.
(6) Humor is an important part of many teachers' style. Even on tests, teachers may be tempted to include an obviously incorrect but hopefully funny response option for an MC item. The jury is still out on whether this is an advisable practice (McMorris, Boothroyd, & Pietrangelo, 1997; Haladyna et al., 2002). Humor on an exam may ease the tension for some students, but other students may not appreciate the teacher making light of what to them is a high-stakes judgment by someone who holds all the power. With welcome increases in student diversity (e.g., students from a variety of cultures, or students at different positions on the autism spectrum), it is becoming more likely that some students will not understand the intended humor. Students who do not get the joke may feel that the joke is ultimately on them. From a practical standpoint, an intentionally ludicrous distractor will do nothing to increase the discriminability of a test item. Combined with the risk of upsetting some students, the lack of concrete benefits suggests that teachers forgo the humorous distractor and replace it with a distractor that has a better chance of being plausible.
(7) Once a solid list of distractors has been written for an MC item, you need to decide where in the order of response options the key should appear. A common piece of advice is to "balance the key", which means to avoid putting the correct response option in the same location too often (e.g., by making "C" the key for half of your items). My advice is to eliminate any subjectivity in deciding where to place the key by using a computerized randomization procedure. (A procedure for randomizing key locations using Excel is detailed in the Appendix.) After writing the items and deciding the order in which the items will appear on the test, adjust the key for each item to match the position suggested by your randomization procedure. Of course, there will be cases in which you will need to override the randomized position. Specifically, if there is a logical or numerical order to the