Expertise Matters: The Case Against Drive-By Item Review


There is perhaps nothing worse for test validity than people who lack real expertise with the alignment references and domain model (e.g., state learning standards) opining about the contents of an item. Those people are generally trained psychometricians, and despite what they think, they should not be participating in conversations about the contents of items. They can offer their feedback and let actual experts know about various suspicious patterns in the data. But they should then leave the room—or at least switch entirely to listening mode. Truly, they have nothing of value beyond that to offer for such discussions. 

It is simply a matter of expertise and respect. Psychometricians are not going to listen to classroom teachers’ views on whether Cohen’s kappa or QWK is preferable, and rightly so. Know your lane.

So, here is a test for anyone who feels the urge to opine on reading items: What is the appropriate grade and standard for the following four items? Assume that the relevant standards are based on the Common Core State Standards. Which of these items are acceptable, and to what grade level of which standards are they aligned? (It doesn’t matter whether you know the terms being referenced, and it doesn’t matter whether you can pick out the key.) 

An explanation follows the four items, and the imaginary passage about the decolonization history of Bakari is not included. Just focus on grade level and alignment.

Passage Title: "The Struggle for Sovereignty: Bakari's Path to Independence"

Item 1:

In lines 14-15, the author describes the colonial administrator's response to the uprising as "a minor disturbance in the provinces." This is an example of which type of figurative language?

  1. Dysphemism

  2. Litotes

  3. Metonymy

  4. Synecdoche

Item 2:

Which of the following lines from the passage contains an example of meiosis?

  1. "The crown's representatives grew increasingly anxious" (lines 27-28)

  2. "It wasn't the worst proposal the council had considered" (lines 63-64)

  3. "Those bureaucratic leeches in the capital drained our resources" (lines 76-77)

  4. "Every voice in Bakari rose against the occupation" (lines 101-102)

Item 3:

What kind of metaphorical language is catalexis?

  1. The substitution of an associated concept for the thing itself

  2. A deliberate understatement achieved through negating the opposite

  3. The use of a part to represent the whole or vice versa

  4. The replacement of a neutral term with a harsh or offensive one

Item 4:

The author's description of the independence movement as "a mere tremor before the earthquake" (line 125) serves primarily to:

  1. Emphasize how the early protests seemed insignificant compared to the massive uprising that followed

  2. Demonstrate the cyclical nature of colonial resistance movements throughout the region

  3. Highlight the geological instability that complicated infrastructure development

  4. Reveal the narrator's skepticism about the ultimate success of independence

OK. So what are the lessons for you, the reader?

I. I am messing with you. Items 1, 2 and 3 lack keys. The example in item 1 is actually meiosis. But who cares? None of the answer options for item 2 is meiosis; they are instead (in order) metonymy, litotes, dysphemism and synecdoche. But who cares? “Catalexis” is not a thing; I made it up. Those are actually definitions of metonymy, litotes, synecdoche, and dysphemism, respectively. But who cares? Those are all bad items. They are not aligned to any Common Core State Standard at any grade level.

II. Mastery or knowledge of terminology is simply not a part of modern reading standards. If you didn’t immediately recognize that items 1, 2 & 3 are inappropriate, then imagine all the other things that you do not understand about modern K-12 domain models. You likely are deeply expert in at least one area, but if you don’t know this about our reading and writing standards, you should not distract from substantive conversations among those who actually do understand the standards.

III. You should have immediately realized that these items must be about RI standards, even though they are about figurative language. The passage is clearly an informational passage and not a literary passage. (Well, unless you realized that Bakari is a fictional country or region, and therefore thought it might be literary. But that’s too much to expect anyone who is not an expert in decolonization movements to know.)

IV. Item 4 fits the contemporary emphasis on understanding the use of figurative language, rather than terminology. It’s a really bad item, because recognizing the key does not require reading the passage (i.e., it is not text dependent). But that wasn’t the point. If you’re in the RI 4 (or RL 4) anchor standards, you might have gotten as far as you can. Heck, perhaps it is L5, at the 4th or 5th grade level? Probably not. The metaphor is very simple, but it is usually the text that determines the grade level of a reading item. Stimulus complexity and text complexity can radically change the cognition required to apply what appears to be the same skill. If you thought you could determine the grade level of a reading item without examining the passage, you lack expertise not simply with the CCSS domain model, but with the content domain that CCSS models. It is not that you yourself lack reading skills. Of course you have high-level reading skills, and you might also have high-level math skills. But understanding what we teach, how we teach it and how that is reflected in the domain model (e.g., state learning standards) is quite different from simply having mastery of the KSAs themselves.

V. Yes, this was a deliberately hostile demonstration. Consider it a small taste of the condescension content experts endure when those without appropriate expertise (e.g., psychometricians) 'help' with substantive discussions during item review.

VI. If you did not ace this exercise, I hope you do not think that you are in a position to evaluate the output from automatic item generation tools. Yes, the automation of such things may well fall within your (perhaps considerable) expertise. But the evaluation of their efficacy clearly does not. And unless you think validity has absolutely no value, then you have no way to evaluate the efficiency of the tools. After all, cheaper or faster useless items are not more efficiently generated at all.

Obviously, this is not to say that psychometrics and psychometricians have nothing to offer in the test development process. Putting aside the common problem of forcing multidimensional domain models into unidimensional psychometric models (something that Prince Charming knew not to fall for), test design, development, administration and reporting is a collaborative endeavor that calls on the best from many disciplines and areas of expertise. And it works best when everyone respects the expertise of others and the limits of their own.