Fisking the Haladyna Rules #30: Use common errors of students

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Use typical errors of students to write your distractors.

In 1989, 2002 and 2004, this rule and Rule 29 (Make all distractors plausible) were distinct.  

Their 2013 book finally combines these two rules. It puts them together into one heading, preserving the wording of each and simply separating them with a semi-colon. They finally get it right. Their explanation contains something really good. Something really really good.

The most effective way to develop plausible distractors is to either obtain or know what typical learners will be thinking when the stem of the item is presented to them. We refer to this concept as a common error. Knowing common errors can come from a good understanding of teaching and learning for a specific grade level [or other listed methods].

Finally. A decade after the 2002 article and more than two decades after the 1989 article, near the end of the list, after the multi-part cluing rule, they get to the real meat. Unfortunately, all of that just shows that they have no clue how important this rule is. This is what matters most. Distractors are the defining feature of multiple choice and other selected response item types, and this finally gets near the core of what makes for a high quality distractor.

For multiple choice items to elicit high quality evidence, they must be able to offer credible affirmative evidence (i.e., that the test taker does have proficiency with the targeted cognition) and also be able to offer credible negative evidence (i.e., that the test taker lacks proficiency with the targeted cognition). These two sorts of evidence are built in two different ways.

Affirmative evidence comes from items that require a cognitive path that depends on the targeted cognition to produce a successful response. All that cluing stuff Haladyna et al. keep coming back to? That is about alternative paths to a correct response via test taking savvy instead of through use of the targeted cognition.

Negative evidence is harder to collect. Negative evidence, as Rule 30 implies and their 2016 book says more clearly, requires offering potential responses that test takers might actually arrive at if they misunderstand or misapply the targeted cognition—that is, legitimate results of authentic mistakes. Any other distractor is a waste of everyone’s time. Only a guesser would select it, and that does not tell anyone anything—other than, perhaps, that the test taker didn’t even try to work through the item. If a mistaken test taker cannot find their own result among the distractors (i.e., because distractors are wasted on some other basis), they are clued to try again. Rather than gathering evidence of their mistake (or mistaken understanding), the item gives them a second chance. Test takers who make other sorts of mistakes, ones that do have corresponding distractors, get no such clue or second chance. That is why it is important that distractors always and only be based on common test taker mistakes.

A substantive and valid meaning for “effective” distractors would be distractors that actually gather negative evidence of proficiency by giving the most common mistakes with the targeted cognition their own corresponding answer options. Now, if there is really only one mistake that test takers make with this particular piece of knowledge or skill, then no one should expect more than one effective distractor. But if there is a common mistake that test takers make with a problem and it is not a mistake with the targeted cognition, then it is not a good or effective distractor! Such a distractor would suggest that test takers lack proficiency with the targeted cognition when, in fact, what they lack is other knowledge and/or skills.

Yes, test takers who make other mistakes should be clued to correct those mistakes, because items should not collect information about other cognition. That is, the common mistakes that are relevant are only the ones in understanding or applying the targeted cognition, even if they are not the most common mistakes, overall.

Yeah, this is about item focus. Should an item purported to be aligned with some specific targeted cognition confuse information about other cognition with information about the targeted cognition? Of course not! Other sorts of mistakes should not prevent test takers from being successful, and other skills (e.g., test taking savvy) should not be enough to enable success.

Substantively, effective distractors capture evidence of the lack of targeted proficiency. Anything else is ineffective, regardless of how often test takers select it. And ineffective distractors undermine item quality and every validity claim about a test.

This is the hardest thing about writing high quality items. Developing items that lack alternative paths is hard, and made harder by inappropriate test prep that stresses shortcuts for the particular items on a test over authentic use of the targeted cognition. Developing a full set of distractors is even harder. As Haladyna et al. finally explained in 2013, it really benefits from knowing about teaching and learning of the targeted content. It is made harder because teachers and other educators are always trying to improve teaching and learning, meaning that the most common mistakes or misunderstandings can shift over time as educators address the most common ones they see.

The lure of substantively ineffective distractors that nonetheless masquerade as quantitatively effective distractors (i.e., by popularity) comes in the form of distractors based on other kinds of mistakes, rather than mistakes with the targeted cognition. These can be used to raise or reduce observed empirical item difficulty, and often will not be caught by item discrimination statistics. Haladyna et al. do not understand this, which is why even though Rules 29 and 30 start to get into the meat of what a good item is, even they fall short.
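
For anyone who wants to see what those item statistics actually are, here is a minimal sketch (my own illustration in Python, not anything from Haladyna et al.) of the two classical statistics in play: item difficulty (the p-value, or proportion correct) and point-biserial discrimination. The data and function names are hypothetical. The point is only that a popular off-target distractor moves the p-value just as surely as an on-target one does, while the discrimination statistic can look respectable either way.

# A minimal sketch (not from Haladyna et al.): classical item difficulty and
# point-biserial discrimination, computed from hypothetical 0/1 item scores
# and total test scores.
from statistics import mean, pstdev

def item_difficulty(item_scores):
    # Classical p-value: the proportion of test takers who answered correctly.
    return mean(item_scores)

def point_biserial(item_scores, total_scores):
    # Correlation between the 0/1 item score and the total test score.
    correct = [t for i, t in zip(item_scores, total_scores) if i == 1]
    incorrect = [t for i, t in zip(item_scores, total_scores) if i == 0]
    p = mean(item_scores)
    return (mean(correct) - mean(incorrect)) / pstdev(total_scores) * (p * (1 - p)) ** 0.5

# Hypothetical responses: 1 = correct, 0 = selected some distractor.
item_scores = [1, 0, 1, 1, 0, 1, 0, 1]
total_scores = [27, 14, 30, 22, 18, 25, 12, 29]
print(item_difficulty(item_scores), point_biserial(item_scores, total_scores))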

Thus, taken together, Rules 29 and 30 are perhaps the most important rule(s), and yet, as Haladyna et al. present them, they still are not good.

 

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #29: Make distractors plausible

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Make all distractors plausible.

This might be the most important principle in all of multiple choice (MC) item development, and that makes it incredibly important to all of large scale standardized assessment because of the dominance of MC items on standardized tests. But Haladyna et al. fail to explain what makes a distractor plausible in their 2002 article. Note that there is a different rule about basing distractors on common test taker mistakes (i.e., Rule 30), so it cannot be that.

Their 2004 book provides a brief explanation, but still separates test taker errors from plausibility. They write that distractors should “look like a right answer to those who lack this knowledge” (p. 120). My regular co-author and I call that shallow plausibility. That is, those who lack the desired proficiency cannot easily and quickly dismiss it as incorrect. This idea of shallow plausibility undermines (or subsumes) most of Haladyna et al.’s advice on cluing (be it part of Rule 28 or any other rule) because it entirely shifts the issue into a different frame. Like their 2002 article, their 2004 book appears to equate “plausible” with “effective” and suggests that these are evaluated by judging how many test takers select them.

But is that a decent standard to judge items and effectiveness? If you care about validity—about content validity, construct validity, or validity evidence from test content—then it clearly is not a decent standard.

Items aligned to easier assessment targets should be easier. Fewer test takers should select distractors for those items. Of course, that just begs the question of what “easier” means. Well, for assessment purposes, easier is not an intrinsic quality of the targeted cognition. Rather, it is about an interaction between the content, the teaching and learning of that content, and the item—all in test takers’ heads. When instruction improves (e.g., through better curriculum, better lesson plans or better pedagogy), measured content difficulty should drop. If some school, district or state does a better job of teaching some standard, the distractors don’t get less effective. Rather, more test takers are able to produce a successful response. Better teaching does not make items or distractors less effective simply because fewer test takers select an incorrect option.

This is simply a dumb way to think about distractor effectiveness. Truly dumb. The question is not whether distractors are selected by many test takers, but rather whether these distractors (as opposed to other potential distractors) are the ones that will be fairly selected by the most test takers. But to understand what that means, you’ll have to read tomorrow’s post.

But, frankly, this idea that distractors should be judged in a sort of popularity contest is what leads to the kinds of deception and minutiae that Haladyna et al. try to warn against in Rule 7 (Avoid trick items). If the best you can do when writing distractors is to try to deceive test takers, you are not trying to measure the targeted cognition at all. Dumb Rule 7 only exists because of this idea that items should be difficult, rather than that they should be fair.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #28: Avoid clues

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Avoid giving clues to the right answer, such as

a. Specific determiners including always, never, completely, and absolutely.

b. Clang associations, choices identical to or resembling words in the stem.

c. Grammatical inconsistencies that cue the test-taker to the correct choice.

d. Conspicuous correct choice.

e. Pairs or triplets of options that clue the test-taker to the correct choice.

f. Blatantly absurd, ridiculous options.

This rule is huge. As is, it has six parts. In their 2004 book, this rule has five parts, and in their 2016 book, this rule has four parts from this list of six and two parts from elsewhere. So, that prompts a question of how I should go about responding to this rule. Some previous rules seem like they should just be sub-parts of this one. But this is a fisking project, so I will address everything.

Item developers should not fall into patterns that give away the correct answers. That’s the real principle for their first sub-part, not just particular words. Always, never, completely, and absolutely should only be used in patterns that do not flag that option as correct or incorrect—just like all of the above and none of the above and countless other potential tells. This is not merely about their use in any particular item, but rather patterns in their use across item banks. It certainly should not lead to a prohibition that has nothing to do with how well these terms address content.

I have never understood Haladyna et al.’s advice on “clang association.” They seem to be saying that items should not repeat key words from the stem either in the correct answer option or in incorrect answer options. That does not make sense to me. Why not? They offer that this can simply be too big a clue to the correct answer option—which seems just to be part D of this rule—or it can be a sign of a “trick” item. But I already addressed their dumb Rule 7 about trick items. I do not believe in trick items. Moreover, if some word in a title or quote actually is often misunderstood or misleading, then that sounds like it is a good basis for a distractor. It should not be avoided.

Isn’t grammatical inconsistency just a repetition of Rule 23? See my response to that rule, from earlier in the month. This sub-rule is folded into option homogeneity in their 2013 book. Length is also included in their 2013 book, but as a separate subpart.

Conspicuous correct choice? I’m not really sure what that means. That sounds more like an issue with a lack of plausible distractors, which might explain why this sub-part is missing from their books.

Pairs and triplets? Just another example of not understanding homogeneity, Rule 23. I already addressed that.

Blatantly ridiculous options? Yeah, that’s again about plausible distractors. That is its own issue, and perhaps the most important single principle of multiple choice item construction, right up there with clarity. It is not just about cuing, nor is it appropriate to bury in some sub-part of a rule on cluing. So, this one gets its due attention in tomorrow’s post.

Where does that leave us? My colleagues and I worry about false-positive results. We worry about those alternative paths, including encouraging guessing with various tells. But this is not a good list of tells to worry about. It is not even Haladyna et al.’s complete list of tells, so what is this rule doing?

What is this rule doing? They report that 96% of the sources for their 2002 article support it, but is that any surprise? With six parts, a source needs to support only one of them for the rule to count as supported! Why aren’t these listed as six different rules so we can see how many sources mention each sub-part, how many support it and how many ignore it? Are we to believe that each of the 96% mentioned all six parts? Certainly not! So, what is going on here?

No, this is not a great rule. It misses the point, confusing symptoms and outcomes for the actual principle at stake. As with so many other rules, this one makes clear that this list is not about deep principles.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #27: Avoid NOT in choices

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Phrase choices positively; avoid negatives such as NOT.

This is yet another redundant rule. Rule 17 (Word the stem positively, avoid negatives such as NOT or EXCEPT. If negative words are used, use the word cautiously and always ensure that the word appears capitalized and boldface.) covers the same ground, and though their 2002 article offers some explanation for Rule 17, it offers nothing on this rule.

So, I feel the same way about this rule. That is, if using the word not or some other negating word, be sure to set it in bold and underlined. I do not understand why they offer an out for stems (i.e., bold and all caps), but not for answer options. Their 2004 book also offers that advice, and rewrites Rule 17 to avoid putting that in the rule itself. That seems superior to this 2002 version.

Thus, in their 2004 book, it is clear that these really are more guidelines or advice than they are rules. Their 1989 articles call them “rules” dozens of times—including in the title—and say that it is “a complete and authoritative set of guidelines for writing multiple-choice items” (p. 37). The 2002 article does not call them rules, leaning on the word “guidelines.” They end with wisdom from the 1951 edition of Lindquist’s (editor) handbook, Educational Measurement. They quote Ebel, rather than Lindquist’s own brilliant chapter.

Each item as it is being written presents new problems and new opportunities. Just as there can be no set formulas for producing a good story or a good painting, so there can be no set of rules that will guarantee the production of good test items. Principles can be established and suggestions offered, but it is the item writer’s judgment in the application (and occasional disregard) of these principles and suggestions that determines whether good items or mediocre ones will be produced. (p. 185)

Yes, this is a great quote. Yes, there is actual wisdom in there. But I do not buy for a second that Haladyna et al. believe this. Rather, it feels like too little/too late. They are 20 (of 22) pages in before they use the word principle, and this article offers rather (or very) little to help item developers to develop that critical professional judgment. Guidelines without deep and thoughtful explanations have no chance to be understood as true principles. Their approach to presenting these ideas invites them to be understood as rules. Including something like Rule 6 (Avoid opinion-based items) and claiming that it is supported unanimously—though it has just 26% support from their sources—and failing to offer any explanation for it is clearly not an effort to support professional judgment in the application of worthy principles. Offering all those numbers in their Table 2 (p. 314) without offering real explanation is about leaning into the misleading precision of numbers to bolster the seriousness and credibility of everything on their list. Their 2002 article claims 24 of these rules have “Unanimous Author Endorsements” and these 24 rules average mere mention by less than two-thirds of their sources. Why make such a claim if not to suggest that these are truly rules?

Yeah, I don’t like negatives in answer options, but I don’t think that I can defend a prohibition or even discouragement. Clarity and simplicity of language are good goals, and—as I wrote above—put negative or negating words in bold and underlined type to make sure that test takers don’t miss them. This was all addressed in Rule 17. So, there’s no new principle here in Rule 27, not from me and not from them.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #26: Avoid All of the above

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Avoid All-of-the-above.

This rule and Rule 25 (None-of-the-above should be used carefully) are the two most opposed rules (by their own sources) on the Haladyna lists, though the explicit opposition to this rule is half that of Rule 25. To be fair, 70% of their 2002 sources support this rule, though Haladyna et al.’s offered reasoning seems a bit weak.

Their all of the above analysis cites use of this answer option making items less difficult, but their analysis of Rule 25 (none of the above) expresses concern that it makes items more difficult. Were they simply reporting on the literature, these differing results would be just different results for different phrases. But as they are offering their views in their actual recommendations, guidelines or rules, it is not even clear why a phrase’s impact on item difficulty automatically makes it objectionable.

The basis for this rule seems to be that when all of the above is included as an answer option, it is far far far too likely to be the correct answer option (i.e., the key). That is not a reason to avoid it, but rather a reason to use it as a distractor more often. Test takers and teachers and test preparation tutors would quickly learn that it is no longer a dead giveaway—wisdom that I heard decades ago.

Their 2004 book suggests two ways to avoid all of the above. First, “ensure that there is one and only one correct answer options.” Yeah, duh. That might limit the nature of content that could be included, so I don’t favor that. Their other advice is to turn the simple multiple choice item into a multiple true-false (MTF) item. That is a much much better idea. Provided that the testing platform allows for MTF items, they should probably be used more often. Yes, they can take more time than simple multiple choice items, but they can delve deeper into various facets of an idea. Anything that helps selected response tests to assess more deeply is a very good thing.

So, what do I think of this rule? I think greater use of MTF items would be a positive change. Otherwise, I would rather all of the above be used far more often as a distractor than it be abandoned for use as a key.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #25: Use carefully None of the above

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: None-of-the-above should be used carefully.

First, I cannot tell you how much I hate this rule, nor how much it betrays the deep disrespect and disdain that so many have for the work of content development in large scale assessment. No one would suggest that point-biserial or IRT should be used carefully, because everyone assumes that psychometrics is always done carefully. What does this rule or guideline mean? Don’t be sloppy? Item developers should never be sloppy, whether they are using “None of the above” or not. They are professionals, and no professionals should be sloppy in their work.

Second, their 2002 sources and the empirical research are just split on this. There is no consensus.

Third, the 2002 article has an explanation that might be the most nuanced portion of the whole piece.

Given recent results and these arguments, NOTA [none of the above] should remain an option in the item-writer’s toolbox, as long as its use is appropriately considered. However, given the complexity of its effects, NOTA should generally be avoided by novice item writers.

Frankly, this kind of analysis should be applied to virtually their entire list, but it is nice to see it at least once. Of course “generally be avoided” is not actually actionable advice. It means that they can use it, but…I guess they should be careful, just like everyone else. Yeah, item development is hard.

Their none of the above analysis cites it making items more difficult, but their analysis of Rule 26 (all of the above) expresses concern that it makes items less difficult. Were they simply reporting on the literature, these differing results would be just different results for different phrases. But as they are offering their views in their actual recommendations, guidelines or rules, it is not even clear why a phrase’s impact on item difficulty automatically makes it objectionable. In fact, a plurality (48%) of their 2002 sources are against use of none of the above and only slightly fewer (44%) are fine with it. There is no consensus.

Last, their 2004 book says, “When none of the above is used, it should be the right answer an appropriate number of times.” No, I do not have any idea what that is supposed to mean. My frequent co-author suggests that they mean something like “should only be the key approximately 25% of the time (for 4-option items) or approximately 33% of the time (for 3-option items).” But they’ve never shown that kind of thinking about how to read, understand or analyze items, so I don’t think she’s right. Of course, her explanation has the benefit of giving some meaning to this rule—which otherwise lacks any.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #24: Choice length equal

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Keep the length of choices about equal.

First, this rule seems VERY redundant with Rule 23. Why isn’t this a part of Rule 23 (Keep choices homogeneous in content and grammatical structure)? Why not make it cover all three issues?

Second, all of my objections to Rule 23 apply to this rule. Go read that, if you care.

Third, Haladyna et al. violate this rule in their examples in their books. Heck, they even offer an example that is based upon the very idea that answer option length can vary. In their 2002 article, they offer an example in which the key is the shortest answer option by quite a bit, and is less than half as long as the longest answer option. Even they don’t buy this rule—even though 85% of their 2002 sources cite it. Doesn’t that undermine the credibility of their whole endeavor? Their 2004 examples for other rules routinely violate this rule, showing how meaningless it really is.

Example 5.5 (2004, p. 103)

According to American Film Institute, which is the greatest American film?

a. It Happened One Night

b. Citizen Kane

c. Gone with the Wind

d. Star Wars

This example violates Rule 21 (i.e., answer options in logical order), in addition to this Rule 24. Example 5.19 (p. 115) also violates this rule.

When an item fails to perform on a test, what is the most common cause?

a. *The item is faulty

b. Instruction was ineffective.

c. Student effort was inadequate.

d. The objective failed to match the item

In 1989, they pointed to research that showed that when the key is clearly the longest answer option, test takers are even more likely to select it. This fits what I have heard as a very practical guessing strategy: pick the answer that sounds more advanced or complicated. Sure. Pick the longest answer option. Now, if that is the problem, then item developers should try to make sure that there are also a bunch of distractors that are the longest answer options. The problem there is item developers who habitually write dumber sounding distractors; the problem is not that some answer options are longer than others.

So, is this rule even worse than Rule 23? Yes. Yes, it is.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #23: Choices homogeneous

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Keep choices homogeneous in content and grammatical structure.

Two-thirds of their 2002 sources support this rule, but the only empirical source they mention is a study by one of them that found this makes no difference. Perhaps more importantly, their only logic or reasoning is that when the answer options are not all parallel, it can clue one of them as the key. Their example in their 2002 article makes that obvious.

What reason best explains the phenomenon of levitation?

a. Principles of physics

b. Principles of biology

c. Principles of chemistry

d. Metaphysics

Putting aside magnetism and superconductors (i.e., physics), it’s not hard to see how answer D would draw disproportionate attention. Depending on the stimulus, D might actually be the correct answer. But the problem is not the lack of homogeneity! The problem is that just one of them sticks out, not that they are not all the same.

So, clearly D should be “Principles of metaphysics,” to match the others. But then there’s a redundancy with physics…but there’s a conventional wisdom among item developers on how to deal with that—one that Haladyna et al. do not ever mention. As I wrote for Rule 22, answer options should all be parallel, all be distinct, or come in pairs (when an even number of answer options).

a. Principles of astronomy

b. Principles of astrology

c. Principles of physics

d. Principles of metaphysics

Do any of those uniquely jump out? They are not homogeneous, as two of them are science and two of them are not. The same guidance works for grammar, length, voice, content, etc. Answer options really do not need to be homogeneous.

But here’s the real issue: There is a far far far more important rule for crafting distractors. Rule 29 is the most important rule: make all distractors plausible. If that requires violation of homogeneity, fine. Do it! That second set of answer options above is only good if each answer option is deeply plausible, and a shortcoming of homogeneity (e.g., as in creating pairs) is fine if it does not hurt plausibility. It is plausibility that matters, not homogeneity.

The real issue seems to be that so much of the Haladyna list is about undermining guessing strategies in a world in which test takers simply can recognize the best answer or not. It does not consider the cognitive paths that test takers might take, and almost never considers that the best distractors are the ones that represent the results of mistakes in understanding and/or application that test takers may make along the way. Perhaps they just assume too simplistic content?

So, no, I don’t buy this rule.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #22: Choices not overlapping

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Keep choices independent; choices should not be overlapping.

Less than one-third of their 2002 sources mention this rule at all, but they never have cited an empirical basis for this rule. It seems thin.

Their reasoning seems to be based upon cluing and multiple correct answers, but there are already rules on cluing and ensuring that each item has just one correct answer (i.e., is not multi-keyed). So, what does this rule add? Moreover, are those really inevitable results of overlapping answer options?

Any item aimed at identifying a set or range (e.g., which characters…, what are the symptoms of…, for what values of x….) would be made far easier—perhaps too easy—if those sets/ranges could not overlap. I can imagine an argument that these kinda turn into complex multiple choice (type K) items, and that was already addressed in Rule 9. So, that might be a better place to address that concern. But Haladyna et al. do not mention that concern in either article or either book. And overlapping ranges are simply not amenable to multiple select or multiple true-false item types. So, this issue doesn’t seem to create a need for this rule.

I simply cannot follow the logic suggesting that overlapping answer options would clue the correct answer option. If the answer options are:

a. Something

b. Some subset of A

c. Something else

d. Something else else

Does that suggest that the answer must be b? Must be a? Cannot be a or b? There is a general idea that answer options should all be the same in some way, all different in that way, or come in pairs (i.e., when there are four answer options) in that way. The idea is that no single answer option should just jump out at test takers. But Haladyna et al. do not share this conventional wisdom in their rules. To be fair, I’ve never been quite sure about this wisdom. But would this set of answer options clue anything?

a. Something

b. Subset of A

c. Something else

d. Subset of B

I think not.

Which leaves the question of multi-keyed items. But we already know that multi-keyed items are bad (i.e., Rule 19). Is there something wrong with overlapping answer options if they are not multi-keyed? I keep looking and I cannot find anything other than obscurity. That is, complex multiple choice items (type K) can be needlessly confusing. So, try to avoid that. But there are also times—particularly with math items—when attention to precision is part of the targeted cognition. Precision in thinking and in communication is valuable in every content area, but math focuses on it more than most others. Should there really be a ban on items that lean into this skill?

I would note that this is not one of those rules that says “avoid.” Now, one might interpret such rules as being less than complete bans, suggesting something less strict. This rule, however, does not even leave that arguable wiggle room.

When this rule is not actually an obstacle to getting at important content, it is merely redundant. At best, it is useless.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #21: Logical/numerical order

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Place choices in logical or numerical order.

I have always wondered what this means. Of course, when the answer options are all numbers, it is clear. But what if the answer options are points on a graph? What if they are names? What if they are phrases or sentences? Should they be ordered by length? Alphabetical? Does it matter? (Yeah, one—but only one—of their books says that answer options “should be presented order of length, short to long,” but…ummm….why!? Because it is prettier? Huh?)

Is there always a “logical” order? What would it even mean for an order to be “logical?” What if two people disagree about which order is more “logical”?

I hate this rule because use of the word “logical” suggests that there is a single right answer. Logic should not yield multiple answers. I mean, imagine that robot putting its hands to its head and repeating “Does not compute. Does not compute,” until its head explodes. There are important issues that are not matters of logic.

Moreover, this rule kinda seems to go against the previous rule about varying the location of correct answer options. If the incorrect answer options are all based on authentic test taker mistakes (i.e., Rule 30), and the correct answer’s location should vary, does that really leave much room to put the answers in a “logical or numerical” order? How should an item developer square these differing rules? Are some of them more important than others? For example, are the most important rules earlier on this list? That is, are these rules presented in that sort of logical order?

We do not think that the Haladyna Rules are nearly as useful as they are depicted to be. Over and over again, they beg the actual question, hiding behind simplistic or trite “guidelines” that duck the real issues. They beg the question (in the original meaning of the phrase) by failing to offer useful guidelines or rules for item developers and so very many of them beg the question (in the more recent meaning of the phrase) by not actually addressing the meat of the issue they pretend to address.

And last, why doesn’t this rule include “chronological”? If it says “numerical,” it could easily also say “chronological.” Could it be that Haladyna et al. are only thinking of math exams? That would be crazy, right?

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #20: Vary location of right answer

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Vary the location of the right answer according to the number of choices.

Yes. Totally. I mean, I think this could be written more clearly. I don’t really understand what “according to the number of choices” adds to this rule, but sure. Fine. I think that replacing that phrase with “randomly” might be better.

But “randomly” isn’t actually quite right. In our work, we have found that putting the correct answer option earlier in the list might lower the cognitive complexity of an item. That is, if a test taker finds a truly good candidate early, they might not have to work out all the other answer options all the way through. That is, they might be able to more quickly rule them out as being inferior to that earlier option. The hunt for the right answer might be cognitively more complex if they have to work harder to eliminate more answer options before they find a good one to go with.

Of course, if the correct answer option is always last or always later, that will reward guessing strategies—which is bad. The location of the correct answer option should be distributed equally across an entire form, just to fight that kind of construct-irrelevant strategy. We do not expect the careful work of picking just the right items whose cognitive complexity should be increased, though we might dream.
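
Here is a minimal sketch of what “distributed equally across an entire form” could look like in practice. It is my own illustration, assuming 4-option items; the function name and the seed are hypothetical.

# A minimal sketch (my illustration, not Haladyna et al.'s): give each key
# position roughly equal use across a form, then shuffle away any pattern.
import random

def balanced_key_positions(num_items, num_options=4, seed=None):
    rng = random.Random(seed)
    positions = [i % num_options for i in range(num_items)]  # near-equal counts
    rng.shuffle(positions)  # no run or pattern a test taker could exploit
    return positions

# Example: a 40-item form uses each of the four key positions ten times.
print(balanced_key_positions(40, seed=1))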

You see, even this seemingly simple rule might not be so simple. But Haladyna and colleagues clearly do not sufficiently dive into the contents of items or the cognition that items elicit to recognize that. Instead, they look at this most quantifiable and testable of ideas (i.e., how many distractors?) and revel in how easily quantified it is.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-334.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #19: One right answer

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriguez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Make sure that only one of these choices is the right answer.

Oh, I want to hate on this rule so much. It is so dumb. Obviously multiple choice items—the items that dominate large scale standardized tests—should have just one correct response. (And we are talking about the simple multiple choice item type, rather than multiple select (e.g., pick the correct two of the five answer options), matching, multiple true-false, etc.)

Unfortunately, I have seen too many items come to me from item writers or to external review with multiple correct answers. The technical jargon for this is “double keyed” or “triple keyed.” Occasionally, there is an item in which every answer option is actually a correct response: a quad-keyed item! Not good—though amazing.

Now, multi-keyed items are usually not the fault of the answer options. More often, the problem is in the stem and stimulus, I think. That is, the question can reasonably be interpreted in a number of ways, leading to different correct answer options. This sort of ambiguity can also be found in answer options, though I suspect that that is less common. I know of no studies of mid-process multi-keyed items that answer that question definitively.

This might be a good spot to get at a deep problem with this list. It is generally written in the second person, giving orders to item developers. My regular co-author and I far prefer a list that describes the traits of high quality items. That is, let’s all be clear about the goal. Let’s focus on what effective items look like.

Then, we can develop processes and procedures for how to achieve those goals. If we are going to address the actions of item developers, let’s try to provide actually helpful advice. In this case,  how might item developers make sure that only one of the answer options is correct? As is, this list pretends to offer advice on what to do, but instead it is usually kinda getting at item quality.

With RTD (Rigorous Test Development), we have approaches and techniques to accomplish this. We have a Pillar Practice that we call Radical Empathy. We have a rigorous procedure that we call Item Alignment Examination built on radical empathy. In short, test developers need to work through items from the perspective of a range of test takers, not just as themselves or as one mythologized typical test taker. RTD likely needs to develop more procedures just for catching multi-keyed items. This is hard work. Item development is incredibly challenging work.

These Haladyna lists simply do not recognize that. That is probably the most offensive thing about them. They lay out seemingly simple rules that barely scratch the surface of what it means to develop a high quality valid item (i.e., one that elicits evidence of the targeted cognition for the range of typical test takers), and because of these lists’ absolute dominance in the literature, they evangelize the idea that item development is fairly simple.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #18: Write as many plausible distractors as you can

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriquez (2002), the super-dominant list of item authoring guidelines.]

Writing the choices: Develop as many effective choices as you can, but research suggests three is adequate.

This rule is ridiculous. This is the rule that shows that these authors have no serious experience as item developers. They do not recognize that item developers simply do not have time to develop more (effective) distractors than they have to, and they appear to have no clue as to how difficult it is to write plausible/effective distractors. In fact, item developers should develop extra ideas for distractors, if they are available, because few will turn out to actually be plausible. (Moreover, the technical and contractual requirements of large scale standardized assessment generally set how many distractors are required.)

That is really the key point. They have no idea what it takes to develop a good distractor. They think that quantity is really a driving issue here.

And yet, they actually undermine the first half of the rule with its second half. If three is adequate, then why develop more, folks? Do they have that little respect for the time of professional content developers? The 2002 article claims that it is primarily aimed at classroom teachers, though also useful for large scale assessment development. Do they have that little respect for teachers’ time? Why waste time on developing even more distractors, especially considering how difficult it is?

The thing is, they acknowledge in their 2002 article that developing additional distractors can be challenging. “The effort of developing that fourth option (the third plausible distractor) is probably not worth it.” So, why do they suggest it? Why do they say, “as many…as you can”? Why do they say, “We support the current guideline”?

In fact, they mention that this is actually quite a well-researched question. There are countless studies on the optimal number of distractors. There are countless studies on how effective distractors are (i.e., how many test takers select them). It is a standard part of test development to review how attractive each distractor was in field testing (see the sketch below). And they summarize much of this literature by saying, “Overall, the modal number of effective distractors per item was one.” We have a shortage of effective distractors, even as items usually include three or more distractors. Perhaps the reason why so many studies show that two distractors are sufficient is the low quality of the second or third distractor. That is, it’s not a question of how many there are, but rather of how effective they are. Perhaps quality matters more than quantity.
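
To make that field-test review concrete, here is a minimal sketch in Python of a classical distractor analysis. Everything in it is my own illustration: the response data are hypothetical, and the 5% cutoff for a “functioning” distractor is a common convention in the distractor-effectiveness literature, not anything prescribed by Haladyna et al.

```python
from collections import Counter

def distractor_analysis(responses, key, options=("A", "B", "C", "D"), threshold=0.05):
    """Report each option's selection proportion and flag distractors that
    clear an (assumed) 5% 'functioning' threshold."""
    counts = Counter(responses)
    n = len(responses)
    report = {}
    for opt in options:
        p = counts.get(opt, 0) / n
        report[opt] = {
            "proportion": round(p, 3),
            "is_key": opt == key,
            "functioning": (opt != key) and (p >= threshold),
        }
    return report

# Hypothetical field-test responses for one item keyed "B".
item_responses = ["B"] * 62 + ["A"] * 30 + ["C"] * 4 + ["D"] * 2
print(distractor_analysis(item_responses, key="B"))
# Only "A" functions here; "C" and "D" attract almost no one, which is the
# pattern behind the "modal number of effective distractors per item was one" finding.
```

Run on real field-test data, an analysis like this typically shows one distractor doing nearly all of the work, which is exactly the shortage described above.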

Now, how many of the 14 rules that focus on distractors are about how to write effective distractors? How many really focus on how to gather evidence that test takers lack sufficient proficiency with the targeted cognition? Well, not enough. This one focuses on quantity, while merely waving a hand at effectiveness.

(And we can, for now, ignore issues with the literature’s idea of effectiveness of distractors, which seem to have rather little to do with the quality of the evidence they provide or the validity they contribute to items.)

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #17: Use positive, no negatives

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriquez (2002), the super-dominant list of item authoring guidelines.]

Writing the stem: Word the stem positively, avoid negatives such as NOT or EXCEPT. If negative words are used, use the word cautiously and always ensure that the word appears capitalized and boldface.

This rule seems pretty good, on the surface. It seems intuitive, but does it actually matter? (And never mind the oddity that the advice to use such a word “cautiously” only applies after the decision to use it has already been made, right?)

The 2002 article lays out the evidence, and the evidence does not support their contention. Roughly two-thirds of their sources support the rule, but roughly one-fifth explicitly argue against it. The empirical studies that Haladyna et al. cite do not support this rule. In fact, they cite a study by one of themselves (i.e., Downing) that found this rule makes no difference to item difficulty or item discrimination (two statistics sketched below).
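
For readers less familiar with those two statistics, here is a minimal, purely illustrative sketch of how classical item difficulty (proportion correct) and item discrimination (here, a point-biserial correlation against the rest-of-test score) are commonly computed. The data and the rest-score operationalization are my own assumptions, not details from the study Haladyna et al. cite.

```python
import statistics

def item_difficulty(item_scores):
    """Classical difficulty: proportion of test takers scoring the item correct (0/1)."""
    return sum(item_scores) / len(item_scores)

def item_discrimination(item_scores, rest_scores):
    """Point-biserial correlation between the item score and the rest-of-test score."""
    n = len(item_scores)
    mean_i, mean_r = statistics.mean(item_scores), statistics.mean(rest_scores)
    cov = sum((i - mean_i) * (r - mean_r)
              for i, r in zip(item_scores, rest_scores)) / (n - 1)
    return cov / (statistics.stdev(item_scores) * statistics.stdev(rest_scores))

# Hypothetical data: item scores (1 = correct) and each test taker's total on the other items.
item = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
rest = [38, 22, 35, 30, 25, 40, 20, 33, 37, 29]
print(round(item_difficulty(item), 2), round(item_discrimination(item, rest), 2))
```

Nothing here is specific to negative stems; it is simply the machinery in which the cited study found no difference between positively and negatively worded items.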

So, if it does not show up in the item statistics, then why push for this rule? “Under most circumstances, we suggest that a stem should be worded positively.” This lack of reasoning epitomizes the empty center of their whole endeavor. They endorse some received wisdom, but do nothing to explain why. Recall that in 1989, they called their list “a complete and authoritative set of guidelines for writing multiple-choice items”—in the paper’s abstract! They did not repeat that claim in 2002, but neither did they disclaim any of the rules they report from the literature.

So, why avoid negatives? I can think of a reason: stressed and/or hurried test takers might miss that key word (e.g., “not” or “never”) and therefore misunderstand what is being asked of them. This could lead test takers to provide an unsuccessful response, even though they could have provided a successful response if the stem was more clear. (Of course, there is no good reason to include a distractor that is based on test takers missing a negating word in the stem.)

Yes, clarity is essential. Rule 14 (Ensure that the directions in the stem are very clear) is their best rule.

So, if we suppose that the chance of skipping or missing that key negative word is the reason to avoid negative phrasing, is there something that could be done about that? For example, what if such words are bolded and underlined (i.e., something I usually oppose because it can look like garish overkill)? Might that draw sufficient attention to those words to ensure that they are not skipped? And if it would, why avoid negatively worded questions? What reasoning might be left?

It is curious that their 2004 and 2013 books omit mention of the studies cited in the 2002 article that suggest that negative words in the stem do not make a difference in how items function. It is almost as though they eventually realized that their argument is so weak that they are better off omitting the whole truth that they know. But that could only be done in bad faith, and we know that cannot be the case. Right?

Last, they acknowledge in their 2002 article that another scholar found that the impact of negative stems varied based upon the type of cognition that was targeted. For the life of me, I cannot figure out why they would mention this without explaining more. It would be useful to know when this advice might actually make a difference. In our own experience (mine and my closest colleagues’), we have seen attempts to target cognition that really does call for a negative stem, but the broad acceptance of this rule has made it impossible to get such items through item development processes.

In my view, the stem should be clear and hopefully succinct, but never at the expense of getting at the targeted cognition. If a negative stem does not hurt clarity, succinctness or alignment, I do not see a problem.

So, I would suggest bolding and underlining those negative words. But don’t use all caps—that’s just too much.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]

Fisking the Haladyna Rules #16: Avoid window dressing

[Each day in October, I analyze one of the 31 item writing rules from Haladyna, Downing and Rodriquez (2002), the super-dominant list of item authoring guidelines.]

Writing the stem: Avoid window dressing (excessive verbiage).

If you haven’t read my analysis of Rule 13 (Minimize the amount of reading in each item), please go back and read that. It applies here. But there is more, following a fantastic meme.

Yes, excessive verbiage is bad. After all, that’s what “excessive” means. So, this rule is somewhat tautological. I think that item developers should not make bad items. But that is not helpful advice.

The question is what counts as excessive. At this point, it is not surprising that the 2002 article makes no effort to explain this. Their 2004 book really offers no meaningful explanation for its version, “Make the stem as brief as possible.” Their 2013 book combines this rule with Rule 13, and does say quite a bit more. But even that is not particularly helpful.

Their example (2013, p. 98) is, “Which of the following represents the best position the vocational counselor can take in view of the very definite possibility of his being in error in his interpretations and prognoses?” Yes, that is clearly excessively wordy, but it is practically a straw man argument. Has anyone ever suggested that such a stem might be appropriate, or that such a question would be well written in any circumstance?

Stems should be clear. They should include all the information needed for the test taker to understand what is being asked of them. Extra adverbs, adjectives and degree modifiers should not be included (e.g., “very” and “definite” in the example above). Filler words and phrases that do not contribute meaning or information should not be included. Phrases and words that can be replaced with simpler, more common and shorter equivalents without a loss of meaning should be so replaced (e.g., replacing “represents the” with “is,” and “in view of” with “given” in the example above).

My usual co-author offers, “Ensure you use as much verbiage as needed to make the task clear, no more and no less.” This emphasizes that clarity is the guiding principle. Of course, it highlights the reality that once one has Rule 14 (Ensure that the directions in the stem are very clear), the verbiage rule really does not add very much, perhaps not anything at all.

If the explanation of this rule included how to recognize excessive verbiage, the rule would not seem tautological. I understand why a simply stated rule might require further explanation to really be understood, but the articles quite often do not do that, and they quite rarely do it well.

[Haladyna et al.’s exercise started with a pair of 1989 articles, and continued in a 2004 book and a 2013 book. But the 2002 list is the easiest and cheapest to read (see the linked article, which is freely downloadable) and it is the only version that includes a well formatted one-page version of the rules. Therefore, it is the central version that I am taking apart, rule by rule, pointing out how horrendously bad this list is and how little it helps actual item development. If we are going to have good standardized tests, the items need to be better, and this list’s place as the dominant item writing advice only makes that far less likely to happen.

Haladyna Lists and Explanations

  • Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Routledge.

  • Haladyna, T. M., & Rodriguez, M. C. (2013). Developing and validating test items. Routledge.

  • Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.

  • Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.

  • Haladyna, T. M., & Downing, S. M. (1989). Validity of a taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 51-78.

]