When AI is Shockingly Intelligent

I run all my blog posts by ChatGPT before posting them. It is not my favorite LLM for all things, but it has got standing orders for how I want feedback—I certainly do not want it to rewrite my copy. Plus, it generates the cartoons I ask for. (It doesn’t like the fact that I use the word “impressive” four times in this post, but I’ve yet to figure out good substitutes for any of them. It also doesn’t like that I am quoting it at such length.)

Last week, I posted on a fairly technical topic: the question of population invariance for item parameters. As usual, ChatGPT pointed out a bunch of typos and offered me the kind of feedback—even pushback—I want from an LLM.

It claimed that one possible counter-argument to what I wrote was that when I was citing discrimination, I was really just talking about distribution. Now, I am not actually a psychometrician, so I could be wrong about this. I did not think I was wrong, so I fixed the typos, added the cartoons and posted the damn thing.

But then I went back. I followed up, in the same chat.

Back to the substance. You say it's just distribution. But if you move the bottom of the distribution up, and there are fewer really low performers, then why would discrimination come out the same? With fewer low performers, there's less opportunity for highly discriminatory items, isn't there?

I knew that I didn’t exactly mean “less opportunity for highly discriminatory items.” I have studied psychometrics and my understanding of the concepts is fairly decent. I meant something a bit different than that, but I was too lazy on a Friday evening to clarify what I actually meant.

It responded shockingly well. I have been rather frustrated with LLMs these past few weeks, with various limitations they have and with their highly problematic personalities/personas. But this answer was impressive. Longer than I would usually quote, but I think it’s important to give you the whole answer.

You’re right to push on that.

If you “move the bottom up” (i.e., raise low performers), then it’s totally reasonable to expect the estimated discrimination to change—even if the item itself hasn’t changed—because discrimination is not an absolute property of the item in isolation. It’s an item–population property in practice.

The short answer

With fewer very-low performers, you often get:

  • less spread in ability near the item’s informative region, and/or

  • fewer people in the range where the item separates well, and/or

  • more ceiling behavior (everyone gets it right),

→ and that can reduce the estimated slope (IRT a) or reduce observed discrimination (e.g., point-biserial).

So yes: you can lose “opportunity” for an item to look highly discriminatory in a population that has fewer low performers.

Why that happens (intuitively)

Think about what “discrimination” means in actual data:
It’s basically “how sharply does the probability of success increase as ability increases?”

If the population you’re estimating on has:

  • fewer low-θ people, then you have fewer observations in the region where P(correct) is low,

  • and more observations where P(correct) is already high,

which makes the item look flatter in the data you actually have.

Even outside IRT: if almost everyone is correct, then the item can’t correlate strongly with total score because there’s not much item variance left.

Key point (the one you’re aiming at)

So when someone says “that’s just distribution,” that’s not a rebuttal.

It’s basically conceding your claim:

Item parameter estimates depend on the population distribution.

Now, a psychometrician might respond with the conditional claim:

  • “In the 2PL, a is invariant if the model is correct and the item truly functions the same way.”

But empirically, with finite samples + shifting ability distributions + multidimensionality + instruction-driven changes in strategy, estimated ‘a’ often moves.

So: your instinct is correct—raising the floor can absolutely change discrimination estimates, and it’s part of why pretending parameters are “true values” without specifying population is a problem.

Obviously, telling me that I am actually correct is a mark of a really smart answer—just ask my colleagues and co-authors! But putting that aside…

I was talking about a weird field (i.e., psychometrics). I was talking about one tiny corner of it. And I was not actually expressing clearly what I was trying to get at. It brought in ideas that I had not mentioned. Yes, they were the ideas I was basing my argument on, but I had elided them on the way to the higher-level things I wanted to say. It was actually talking about psychometric ideas and realities.

It was actually really impressive. It is trained incredibly deeply in a huge breadth of fields, else it could not have given me such an answer.

So, good job, ChatGPT. This time.

In Defense of the Multiple-Choice Item

First off, there is no defense for bad multiple choice (MC) items. The floor on the usefulness of MC items is incredibly low. I am only talking about high quality MC items—which are incredibly hard to write. Admittedly, most MC items are quite bad. 

(For example, do not be confused by the psychometric demand that multiple choice items be quick to answer. That is not inherent to the MC item. They could require quite a bit of work and thinking, if only our psychometrics overlords were not distracted by alpha. The shallowness of most MC items is a product of that demand for speed, rather than anything intrinsic to the MC item.)

Second, I am a former high school English teacher. That’s about teaching students to develop their ideas and teaching them about the value of the writing process for doing it. It is about deeply understanding relatively complex texts—and other human beings. It is about argument and evidence and audience awareness. It is about listening, dissecting and analyzing. So, multiple choice items are really hard for me to love; I have seen so many shallow and otherwise bad MC items that it has been a journey for me to understand their value. 

And yet…I think that good multiple choice items can be quite useful. Perhaps even more importantly, learning to write good multiple choice items is an incredibly powerful exercise for anyone writing any type of assessment. 

Obviously, this raises the question of what a high quality MC item even is. Though there are many, many content and cognition traits to a good multiple choice item, I’ll limit the discussion here to just five of them:

* A definitively correct key.

* A set of definitively incorrect distractors.

* Each distractor must be plausible.

* The set of distractors must capture the most common mistakes that test takers are likely to make.

* The stem, the key and the distractors must all be aligned to the learning goal (e.g., state learning standard).

Therefore, writing a high quality MC item requires understanding the learning goal. Obviously. Duh. Furthermore, it requires understanding how students progress with that goal. That’s the only way to know what their most common mistakes are likely to be. All of that is dependent upon knowing how they think—very much a product of the curriculum, instruction and pedagogy they have experienced—at various stages of learning and understanding that learning goal.

Therefore, writing high quality MC items requires deep knowledge and understanding of how lessons are taught—the range of curriculum, pedagogy and examples that students might experience. Some approaches to teaching a given learning goal are going to focus on some kinds of things and avoid some kinds of mistakes. Others will prioritize different things or mistakes. Which metaphor the teacher uses when explaining a scientific concept can invite different misconceptions. And different efforts to clarify the meaning and value of the metaphor will make some mistakes less likely than others. Writing good MC items requires thinking deeply about how students might respond to different instructional approaches.

Good MC items require the most careful attention to the knowledge and skills required to solve the item, which includes mindfulness of alternative routes to successful responses. Again, this leans on deep understanding of students’ cognitive processes—which (again) are strongly influenced by the instruction they have received. It requires student-centered thinking that acknowledges the range of students who might encounter the item. It is not simply about imagining the different responses that students might give, but rather about understanding the different cognitive paths they might take, the various correct and mistaken steps along those different paths. And then, in order to develop a good MC item, crafting answer options that reveal those paths to teachers (or other concerned parties).

(We call this vital understanding of the various cognitive paths that the range of test takers might take in response to an item radical empathy. It is a pillar practice of content development professionals’ work, perhaps its most rigorous and demanding aspect. It is easily recognized by high quality teachers, but is quite foreign to most other disciplines.)

The defining characteristic of MC items is that they list a set of possible responses. This should not make them easy, as whatever misconceptions students bring to the item should find a welcoming distractor that reveals them. An advantage of this approach is that it signals to students that non-aligned mistakes (i.e., misconceptions grounded in some other lesson) are mistakes, and they should try again. Therefore, the set of answer options acts as a filter to capture just the relevant misconceptions and mistakes—making them more apparent for teachers or interested parties.

Well, sure. Bad constructed response items are bad. It is not really fair for me to compare high quality MC items to low quality constructed response items. But my point is that if the item developer is clear on what they are targeting and what both confirming and disconfirming evidence might look like, MC items are not always inferior to constructed response items—especially if they can provide actionable information faster and cheaper than constructed response items. (That is, scoring and reporting are faster and cheaper.)

The difference is that good MC items require all of this rigorous thinking to be done up front. They demand real investment in thinking through the range of what students might do—the cognitive paths they might take in response to the stimulus and question. Yes, scoring is easy and fast, but that is only because of the amount of incredibly demanding work done up front. Constructed response items that put off that thinking until it is time to score them can be just as bad at providing insight—and that is often what teachers (like myself) stick themselves with.

Obviously, not all learning goals are amenable to multiple choice items. Certainly, the objectives of high school writing lessons are a bad fit. There are real limits within the topic of research design and other aspects of lab science on the usefulness of MC items. But if we can focus on crafting good MC items and abandon the counterproductive demands of psychometricians, good MC items can be invaluable—to both summative and formative assessment.

Educational Measurement’s Expertise Problem

I saw a presentation on a research project using generative AI in educational measurement last week. Like so very very very many such projects, the mortar holding the whole thing together was disrespect for substantive expertise. That is, while this disrespect was not a building block of the project, all of the building blocks would have fallen apart from each other if it weren't for the disrespect.

The project was about reading passages and determining their reading level—their suitability for different grade levels of students. I believe it seeks to build towards generation of high quality and appropriate passages for use on large scale assessments. This is an incredibly worthy goal, as stimulus quality and suitability is foundational to item and test quality. Moreover, finding or developing appropriate passages for large scale assessments is very time-consuming and expensive.

However, the research team did not include a single person with experience leveling passages. It did not include a single person with reading instruction experience. It did not include a single ELA content development professional (CDP). These omissions doomed the project to uselessness, just as so many other studies that lack substantive expertise are doomed to uselessness. Generative AI or psychometric methodological expertise are never going to be enough without substantive expertise.

This study provides a good illustration of many of the problems of relying on algorithms and ignoring substantive expertise, in part because its mortar is not at all unusual in our field.

* CDPs and reading teachers know that the standard algorithmic “readability” measures (e.g., Lexile, Flesch–Kincaid) do not produce results that are reliable or appropriate for students—often underestimating level, but sometimes overestimating. Because they have experience with the outputs of these algorithms in the context of their substantive work, they already know about the suitability and bias of the most easily available tools. 

* CDPs and reading teachers know that idea complexity, emotional complexity and subject matter appropriateness are key determinants of grade level—factors that the standard algorithms entirely ignore. Methodologists (e.g., psychometricians) working alone do not know what is relevant or not in a particular field.

* CDPs and reading teachers know that texts do not have a singular grade level, but rather each span a range of grades (or ages). More emotionally mature and skilled readers can handle texts that less mature and less skilled readers cannot. Therefore, the fact that texts may be appropriate for more than one grade does not mean that they cannot be quite clearly and recognizably different levels. (Consider one book that is generally appropriate for grades 3-5 and another that is generally appropriate for grades 5-7. The latter is clearly more demanding (i.e., a higher grade level text), even if both are suitable for 5th graders—albeit different 5th graders.) The difference will be immediately recognizable to those with substantive expertise. Outsiders will not even understand the appropriate scale or how individual texts interact with the scale.

* CDPs and reading teachers know that texts' reading levels vary by geography, in part because different classrooms, schools, districts and even states have different views on what is appropriate at different grades. There are national aspirational ideals, such as those held by publishers and leaders of national assessments. But I myself have taught in different high schools with very different ideas of what is appropriate for 9th graders. Amateurs and outsiders are unlikely to understand what is a rule and what is a guideline, what is a deep truth and what is merely a useful heuristic, what is clean theory and what is actual reality.
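The first bullet is easy to make concrete. The Flesch–Kincaid grade formula uses only two surface features, average sentence length and average syllables per word; nothing in it can see idea complexity, emotional complexity or subject matter. The sketch below uses a crude vowel-group syllable heuristic (not the official syllable rules), so the numbers are purely illustrative.

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels; a crude heuristic."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """FK grade = 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Two passages with similar surface statistics but very different demands
# on the reader score at essentially the same "grade level":
concrete = "The cat sat on the mat. The dog ran to the park."
abstract = "The self is not the mind. The mind can act on its own."
print(flesch_kincaid_grade(concrete), flesch_kincaid_grade(abstract))
```

The second passage is far more conceptually demanding than the first, but the formula cannot tell them apart; that blindness is exactly what substantive experts already know about these tools.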

Perhaps the worst thing about this study was that when I asked one of the team members—the leader?—whether they included any reading teachers or ELA CDPs on the team, she said that it is hard to find such expertise. This is patently false. In fact, I can think of no type of expertise that is easier to find in this country than reading instructors. I believe that you can ask the vast majority of adults in this country where a nearby elementary, middle or high school is (i.e., a school full of reading instructors), and they could tell you. This is surely easier than finding an architect, pilates instructor, sandwich engineer, psychometrician, project manager, sanitation worker, plumber or nurse. No, the problem was not that they could not find any ELA CDPs or reading teachers. No, the problem is that they did not even think to try.

This kind of disrespect for substantive content expertise is not at all unique to this study. Heck, in educational measurement research it is the norm. But without substantive expertise on the research team, there is no way for researchers to know whether what they are seeing is typical or an outlier. There is no way to know whether their simplifying assumptions are too much to yield any useful findings at all. There is no way for them to make sense of their inputs or outputs. With substantive experts on the research team, these projects can be highly capable collaborations across different kinds of expertise that produce meaningful learning. Without substantive experts on the research team, they are merely collections of outsiders making blind guesses without even understanding the questions.

Our assessments are not going to get better while disrespect for substantive expertise remains the mortar of educational measurement research. When you build studies on that foundation, you don’t just miss important details—you lose the ability to recognize what matters, to interpret what you observe, and to know when your conclusions are nonsense. And if generative AI becomes the tool we use to replace the experts we refuse to respect, we will only get faster at producing invalid work, rather than better at producing good work. If we care at all about validity—the degree to which theory and evidence support the proposed uses of tests—we must do better.

Item Parameters Are NOT Population Invariant


In response to my most recent LinkedIn post <https://www.linkedin.com/feed/update/urn:li:activity:7417738278168104960/> complaining about simulation studies that assume that item parameters can have true values without specifying the population, Charlie DePascale—who thinks there were earlier sins in large scale assessment than the assumption of unidimensionality—replied:

I agree 100% that many simulation studies simplify "reality" too much. Also, I agree 100% that the properties of the population being simulated should be specified, along with the method used to simulate it. And, I'm even pretty sure that I agree with what I think you mean by "item parameters are population-specific" but you should probably expand on that statement a bit given that, in general, "Item parameters are considered population invariant.” 

Sure. I’m happy to expand on that statement. And, yes, we will be coming back to the original sin of large scale assessment repeatedly. 

1) Of course item parameters generally are not not not not not not considered population invariant. If they were, there would be no need for DIF studies. There would be far less need for field testing—though there would still be some. There would be less work for psychometricians. Post-field testing data review would not include various DIF flags. Everyone knows that items can have different parameters across different populations.

2) It is not just a matter of the mere possibility of items having different parameters across populations. If that were the case, the threshold for flagging an item would be even lower than it is. We see population differences in every single item—even when we do almost nothing to track relevant population differences. 

From here on, I am not focusing on how populations differ by race/ethnicity, gender or FRPL status. Instead, I focus on populations’ different instructional experiences—perhaps the most important difference between populations.  

3) Imagine two very large school districts that have adopted different curricula. They similarly emphasize some of the standards, but differ in their emphasis on others. Clearly, we would expect items aligned to emphasized standards to have lower item difficulties than items aligned to less emphasized standards…particularly relative to the other district, where these differences will be a bit inverted. Of course, this recognition that instruction and items can be more aligned or less aligned requires stepping away from the psychometric assumption of unidimensionality.

4) Imagine two very large districts that have different instructional approaches with the same official curriculum. Imagine that they differ in the degree to which they devote resources to lower achieving students. One might give them additional instructional time and perhaps smaller classes. (For example, back in my teaching career, I taught a double period ELA class to lower achieving 9th graders.) Or, imagine that they focus on the students just below the proficiency threshold—a well known practice during the NCLB years. This would alter the performance of formerly lower achieving students relative to higher achieving students, altering item discrimination. 

5) Before I go on here, decide for yourself whether you think higher achieving students or lower achieving students benefit more from instruction. Are higher achieving students simply better learners who will use additional instruction on a grade level standard more efficiently, or are lower achieving students faced with cognitive obstacles or barriers that additional instruction can help them to overcome? In the context of a focus on grade level lessons and large scale assessment’s ceiling effects, I think that lower achieving students are more likely to benefit from additional instruction. So, if we have two very large districts, and one of them increases instructional time for all students in a content area—like additional reading instruction for elementary school students…again, item difficulty and discrimination will be altered.

6) Imagine two very large districts that differ in how much attention they pay to past years’ items in the course of instruction through the year. One district presents problems and examples as they have appeared on the large scale assessment in the recent past, and the other focuses instead on higher level thinking skills of more complex problems. Do you really think that item parameters derived from these two different districts will be invariant between them?

7) Now imagine that in all of these examples, it was the same district that simply adopted some policy changes. So, each of these examples is the same district before and after the change, time 1 and time 2. 

8) Or, imagine that these two very large districts are indeed geographically distinct, but that they border on each other. One is Atlanta and the other Gwinnett County, in the Atlanta suburbs. Or the city of Baltimore and adjacent Baltimore County, Maryland. Thus, the districts have rather different population demographics—race/ethnicity, FRPL and perhaps ELL distributions, for example. Try to think of all the ways that two such districts can differ from each other. Internal resources. Parental education levels. Wealth, income and socioeconomic status distributions. Ethnic distributions. Share of immigrant homes and/or households where English—the language of instruction—is not the default language. Do you really think that item parameters will be the same across these two districts? Will they be the same for subgroups within each district? Will they be the same for corresponding subgroups across the two districts? Will relative item difficulties and item discrimination parameters just naturally be the same?
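The effect in point (3) can be sketched numerically. In this toy model, the +0.6 instructional boost, the Rasch-style response function and the item difficulties are all assumptions chosen for illustration: two hypothetical districts administer the same two items, but each district's curriculum boosts the skill dimension behind a different item, and the same item then shows a different classical difficulty (proportion correct) in each district.

```python
import math
import random

random.seed(11)

def respond(skill, b=0.0):
    """Rasch-style response: P(correct) depends on skill minus difficulty b."""
    return 1 if random.random() < 1 / (1 + math.exp(-(skill - b))) else 0

def district_p_values(emphasis, n=10000):
    """Proportion correct per item, with a per-dimension instructional boost."""
    correct = [0, 0]
    for _ in range(n):
        base = random.gauss(0, 1)          # general ability
        for item in range(2):              # item i draws on skill dimension i
            correct[item] += respond(base + emphasis[item])
    return [c / n for c in correct]

p_a = district_p_values(emphasis=[0.0, 0.6])  # District A stresses item 2's standard
p_b = district_p_values(emphasis=[0.6, 0.0])  # District B stresses item 1's standard
print(f"District A p-values: {p_a}, District B p-values: {p_b}")
```

The boosted item comes out easier in whichever district emphasizes it, even though the items are identical; anyone fitting a model to the two districts separately would estimate different item parameters.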

No, item parameters are clearly not not not invariant across populations. Unidimensional psychometric models require items to be population invariant, and therefore efforts are made to only select items that approximate that requirement—to the detriment of substantive item validity (i.e., their ability to elicit evidence of the targeted cognition for the range of typical test takers). And this is only possible because we ignore differences in instructional experiences when examining items for population invariance. 

No, the only way to think that any of this is at all appropriate is to willfully ignore the dimensionality of the tested domain as understood by those truly expert in it and in teaching it. It requires ignoring all the efforts to filter out items that do not fit the assumption and proclaim that resulting data proves the initial assumption (unidimensionality) that was used to filter them out. At some point shortly after I finished grad school, I was invited to help a team I had been involved with previously to shape up the extension of a study to prepare it for submission to a journal. I saw that the central claim of the paper was going to simply be a restatement of a filter used on the data, shifting into a finding that ignored the fact of the filter. I burned some bridges when I asked whether the filtering had been removed. 

I’m sorry, Charlie. Unidimensionality is the original sin of large scale assessment. It infects so much of the actual practice of psychometrics. In this case—efforts to study the potential of language models to predict item parameters—it has poisoned the minds of incredibly smart and thoughtful people into meaningless research that can only undermine what little validity large scale assessment can currently rightly claim. 

Expertise Matters: The Case Against Drive-By Item Review


There is perhaps nothing worse for test validity than people who lack real expertise with the alignment references and domain model (e.g., state learning standards) opining about the contents of an item. Those people are generally trained psychometricians, and despite what they think, they should not be participating in conversations about the contents of items. They can offer their feedback and let actual experts know about various suspicious patterns in the data. But they should then leave the room—or at least switch entirely to listening mode. Truly, they have nothing of value beyond that to offer for such discussions. 

It is simply a matter of expertise and respect. Psychometricians are not going to listen to classroom teachers’ views on whether Cohen’s kappa or QWK is preferable, and rightly so. Know your lane.
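Since kappa came up: here is a from-scratch sketch of Cohen's kappa and quadratic weighted kappa (QWK) on a toy 3-category rating matrix (the counts are invented for illustration). Unweighted kappa treats every disagreement as equally bad; quadratic weights penalize distant disagreements more heavily, which is why QWK is usually preferred for ordinal scores.

```python
def kappa(matrix, power):
    """Weighted kappa; power=0 gives Cohen's kappa, power=2 gives QWK."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    rows = [sum(r) for r in matrix]
    cols = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Disagreement weights: 0/1 when power=0, squared distance when power=2.
    w = [[abs(i - j) ** power if power else (i != j) for j in range(k)]
         for i in range(k)]
    observed = sum(w[i][j] * matrix[i][j] for i in range(k) for j in range(k)) / total
    expected = sum(w[i][j] * rows[i] * cols[j] for i in range(k) for j in range(k)) / total**2
    return 1 - observed / expected

# Toy confusion matrix: all rater disagreements are only one category apart.
m = [[10, 5, 0],
     [5, 10, 5],
     [0, 5, 10]]
print(f"Cohen's kappa = {kappa(m, 0):.3f}, QWK = {kappa(m, 2):.3f}")
# prints: Cohen's kappa = 0.394, QWK = 0.667
```

Because every disagreement here is adjacent, QWK rewards the raters for never being far apart, while unweighted kappa cannot tell near misses from wild ones.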

So, here is a test for anyone who feels the urge to opine on reading items: What is the appropriate grade and standard for the following four items? Assume that the relevant standards are based on the Common Core State Standards. Which of these items are acceptable, and to what grade level of which standards are they aligned? (It doesn’t matter whether you know the terms being referenced, and it doesn’t matter whether you can pick out the key.) 

An explanation follows the four items, and the imaginary passage about the decolonization history of Bakari is not included. Just focus on grade level and alignment.

Passage Title: "The Struggle for Sovereignty: Bakari's Path to Independence"

Item 1:

In lines 14-15, the author describes the colonial administrator's response to the uprising as "a minor disturbance in the provinces." This is an example of which type of figurative language?

  1. Dysphemism

  2. Litotes

  3. Metonymy

  4. Synecdoche

Item 2:

Which of the following lines from the passage contains an example of meiosis?

  1. "The crown's representatives grew increasingly anxious" (lines 27-28)

  2. "It wasn't the worst proposal the council had considered" (lines 63-64)

  3. "Those bureaucratic leeches in the capital drained our resources" (lines 76-77)

  4. "Every voice in Bakari rose against the occupation" (lines 101-102)

Item 3:

What kind of metaphorical language is catalexis?

  1. The substitution of an associated concept for the thing itself

  2. A deliberate understatement achieved through negating the opposite

  3. The use of a part to represent the whole or vice versa

  4. The replacement of a neutral term with a harsh or offensive one

Item 4:

The author's description of the independence movement as "a mere tremor before the earthquake" (line 125) serves primarily to:

  1. Emphasize how the early protests seemed insignificant compared to the massive uprising that followed

  2. Demonstrate the cyclical nature of colonial resistance movements throughout the region

  3. Highlight the geological instability that complicated infrastructure development

  4. Reveal the narrator's skepticism about the ultimate success of independence

OK. So what are the lessons for you, the reader:

I. I am messing with you. Items 1, 2 and 3 lack keys. The example in item 1 is actually meiosis. But who cares? None of the answer options for item 2 are meiosis; they are instead (in order) metonymy, litotes, dysphemism, synecdoche. But who cares? “Catalexis” is not a thing; I made it up. Those are actually definitions of metonymy, litotes, synecdoche, and dysphemism, respectively. But who cares? Those are all bad items. They are not aligned to any Common Core State Standard at any grade level.

II. Mastery or knowledge of terminology is simply not a part of modern reading standards. If you didn’t immediately recognize that items 1, 2 & 3 are inappropriate, then imagine all the other things that you do not understand about modern K-12 domain models. You likely are deeply expert in at least one area, but if you don’t know this about our reading and writing standards, you should not distract the substantive conversations of those who actually do understand the standards.

III. You should have immediately realized that these items must be about RI standards, even though they are about figurative language. The passage is clearly an informational passage and not a literary passage. (Well, unless you realized that Bakari is a fictional country or region, and therefore thought it might be literary. But that’s too much to expect anyone who is not an expert in decolonization movements to know.)

IV. Item 4 fits the contemporary emphasis on understanding the use of figurative language, rather than terminology. It’s a really bad item, because recognizing the key does not require reading the passage (i.e., it is not text dependent.) But that wasn’t the point. If you’re in the RI 4 (or RL 4) anchor standards, you might have gotten as far as you can. Heck, perhaps it is L5, at the 4th or 5th grade level? Probably not. The metaphor is very simple, but it is usually the text that determines the grade level of a reading item. Stimulus complexity and text complexity can radically change the cognition required to apply what appears to be the same skill. If you thought you could determine the grade level of a reading item without examining the passage, you do not simply lack expertise with the CCSS domain model, but actually with the content domain that CCSS models. It is not that you yourself lack reading skills. Of course you have high level reading skills, and you might also have high level math skills. But understanding what we teach, how we teach it and how that is reflected in the domain model (e.g., state learning standards) is quite different than simply having mastery with the KSAs themselves. 

V. Yes, this was a deliberately hostile demonstration. Consider it a small taste of the condescension content experts endure when those without appropriate expertise (e.g., psychometricians) 'help' with substantive discussions during item review.

VI. If you did not ace this exercise, I hope you do not think that you are in a position to evaluate the output from automatic item generation tools. Yes, the automation of such things may well fall within your—perhaps considerable—expertise. But the evaluation of their efficacy clearly does not. And unless you think validity has absolutely no value, you have no way to evaluate the efficiency of the tools. After all, cheaper or faster useless items are not more efficiently generated at all.

Obviously, this is not to say that psychometrics and psychometricians have nothing to offer in the test development process. Putting aside the common problem of forcing multi-dimensional domain models into unidimensional psychometric models—something that Prince Charming knew not to fall for—test design, development, administration and reporting is a collaborative endeavor that calls on the best from many disciplines and areas of expertise. And it works best when everyone respects the expertise of others and the limits of their own.

How Norm-Based Test Design Differs from Criterion-Based Test Design

The goal of norm-based or -referenced tests is to report on test takers relative to each other. This is a basic sorting and ranking function. Perhaps the reporting is in terms of percentiles, perhaps deciles. But even if the reporting is at that larger grain size, with larger buckets, it is important to get the finer grained relative standings right. After all, you want to make sure that someone near the line is classified on the correct side of the line.

This means that it is really important to have a range of difficulty in your items. You need lots of information at every cut score mark—including to differentiate your top two buckets. 
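That need for a range of difficulty follows from the measurement models themselves. Under a 2-PL IRT model, an item’s Fisher information peaks where test taker ability matches item difficulty, so any one item only sharpens comparisons among test takers near its own difficulty. A minimal sketch, with invented item parameters:

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under a 2-PL IRT model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2-PL item: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Two items with equal discrimination (a) but very different difficulty (b).
easy_item = (1.5, -2.0)
hard_item = (1.5, 2.0)

# Each item is most informative near its own difficulty and nearly useless
# far from it, which is why a norm-referenced test needs items all along
# the scale.
for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(item_information(theta, *easy_item), 3),
          round(item_information(theta, *hard_item), 3))
```

The easy item carries almost no information about high performers, and vice versa; only a spread of difficulties covers every cut score.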

Of course, this is only possible if the construct being measured is unidimensional. You cannot come up with a singular ranking without a unidimensional scale of some sort. And if you have a multi-dimensional construct, you have to either flatten it into unidimensionality or give up on norm-referenced reporting. 

So, norm-based tests must have a range of difficulty, but fidelity to the construct definition is far less important. Heck, items that are well-aligned to some element of the domain model but do not fit the flattened (i.e., distorted) construct are counter-productive. 

Criterion-based reporting requires quite different test design. Test takers are evaluated against some criteria—such as a multidimensional domain model. Think of a set of state learning standards or all the diverse elements of a job or role analysis. There are lots of things worth considering. Criterion-based reporting might need to report sub-scores—or even abandon the whole idea of a single summary score. Performance is evaluated against some conception of proficiency or mastery with specific skills or ideas. 

Criterion-based tests should define those conceptions of proficiency with each element of the criterion during test design—something that norm-based test design does not have to wrestle with. These are expert judgments, made by subject matter experts and/or educators. Empirical difficulty (i.e., how many test takers will get the items wrong vs. right) is not really germane. Either test takers each have that level of that skill, or they don’t. Certainly, those experts might establish multiple relevant levels of some cluster of related skills, but their empirical difficulty is not the point.

Therefore, criterion-based test design and criterion-referenced reporting focus far more on items’ alignment to their criteria. Test blueprint design is incredibly important, and fidelity to blueprint is perhaps even more important. Test blueprints should hardly matter at all for norm-based reporting. 

Are our large scale assessments norm-based or criterion-based? They almost all claim to be criterion-based—but the ACT and SAT are designed to rank test takers, so they clearly are the big exceptions. State accountability tests, AP exams and so many others are aligned to some set of standards or performance expectations—or said to be so aligned. They should be criterion-based.

However, in practice we too often ignore these issues and distinctions. Major users and funders of these assessments really want the rankings and sorting of test takers, compromising the criterion-based designs. Item difficulty and conformance with the distorted construct become the rule, rather than actual fidelity to blueprint with carefully aligned items. The sorting and ranking becomes more important than the criteria. 

Can a test satisfy both the needs of norm-based and criterion-based tests? If it actually is aiming at a truly unidimensional construct it can. But how often are we doing that?

What IRT Misses About Proficiencies

One of my favorite people in educational measurement—aside from my co-authors, of course—once overheard me ranting about unidimensionality and said quietly, “But I like IRT.” Yeah, I get it; IRT has some elegant properties.

The thing is, she really cares about the interpretability of tests. She really cares about developing tests that tell us something about test taker proficiency. She is not just a psychometrician.

And yet…

She is a psychometrician, and she likes Item Response Theory for psychometric reasons. But the thing is, I do not think that IRT tells us anything about test taker proficiencies. It is useless* for teaching and learning. It is useless for curriculum evaluation. It does not tell students, or their parents, what they need to know.

(*Yes, there are other techniques that build upon IRT. For example, cognitive diagnostic modeling uses IRT under the hood, but it is used with very different assumptions, and not to report relative scores among test takers.)

Unidimensional IRT is a norm-referenced technology. It reports test taker scores relative to each other. That’s all it tells us about test takers. It also tells us about item difficulty, relative to other items and to test takers. But these reports smush together all of the information in the patterns of test takers’ responses and just spit out a singular set of scores—and therefore rankings of test takers relative to each other.
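To make that smushing concrete, here is a toy sketch with invented item parameters. Under a 2-PL model with equal discriminations, two students with opposite response patterns—one acing only the easy items, the other only the hard ones—receive the identical unidimensional score, because only the number correct survives:

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under a 2-PL IRT model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def mle_theta(responses, items):
    """Grid-search maximum-likelihood estimate of the single ability score."""
    grid = [i / 100.0 for i in range(-400, 401)]
    def loglik(theta):
        total = 0.0
        for x, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            total += math.log(p if x == 1 else 1.0 - p)
        return total
    return max(grid, key=loglik)

# Four items of increasing difficulty (b), equal discrimination (a).
items = [(1.0, -1.5), (1.0, -0.5), (1.0, 0.5), (1.0, 1.5)]

# Opposite response patterns: one student got only the easy items,
# the other got only the hard items...
pattern_1 = [1, 1, 0, 0]
pattern_2 = [0, 0, 1, 1]

# ...yet both collapse to the same unidimensional score, because with
# equal discriminations only the number correct matters.
print(mle_theta(pattern_1, items), mle_theta(pattern_2, items))
```

Everything distinctive about which items each student answered correctly is discarded on the way to the score.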

Scientists looking at a multifaceted solid called "student proficiencies" lying in front of an "Unidimensional Shrink Ray." One is saying, "It may not sparkle anymore, but it'll be an 'elegant' single dimension."

IRT tells us absolutely nothing about how to build alignment references or learning goals into test items. It tells us nothing about how to score test items. And it tells us nothing about what scores might constitute proficiency or any other meaningful bar. There are other techniques and tools for all those things. IRT tells us nothing about alignment or fairness. It contributes no information about validity.

And, of course, if unidimensional IRT is used to calculate scores for a test that is supposed to measure a multi-dimensional domain model, it is prima facie evidence against validity. After all, the third type of validity evidence in The Standards for Educational and Psychological Testing is “Evidence Based on Internal Structure.” That is, “Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based” (p. 16).

If the proposed use of a test is to rank students, IRT is great. It produces norm-referenced results. And one could misuse such results to group students by proficiency level, achievement level or ability levels—ignoring the fact that the construct is thought by subject matter experts to be multi-dimensional and the domain model is explicitly multi-dimensional. Obviously, those are deeply problematic groupings, but we have been taking that approach for decades.

Andrew Ho says that measurement is qualitative, quantitative, and then qualitative again. IRT does not help with any of the qualitative work, or the transitions between the two paradigms. It is firmly in the middle of the quantitative phase of the work. Those dominant 1-PL, 2-PL and 3-PL unidimensional IRT models require distortion of multi-dimensional constructs into unidimensional data. (I know, IRT is robust to some degree of multi-dimensionality. That is part of what makes it so great. But it removes all of that information in order to produce unidimensional results, and it is not robust enough to take in data from all the items that subject matter experts think are well-aligned to alignment references.) Therefore, it actually harms the quality of our tests, their alignment and their very validity.

IRT tells us nothing about test takers’ proficiencies. Heck, unidimensional IRT does not even accept the premise that there is more than one dimension of proficiency. It is based on norm-referenced assumptions and delivers norm-referenced scores—entirely unsuitable for criteria-referenced domain models and criteria-referenced purposes. That is, it misses everything about test taker proficiencies, obscuring them from test users.

We need to do better.

Language Models and the Tyranny of the Expected

I have been leaning into using ChatGPT this year. I want to know what LLMs are good at, what they are bad at and I want to be able to take advantage of whatever they can offer to help me in my work and the rest of my life. So, along the lines of Rob Napier and Mike Caulfield, I want to offer some thoughts and explanations about why LLMs can be so unsuitable for advanced work.

Technically, LLMs are designed to be prediction machines, predicting the next word (or token). But it is a certain kind of prediction and approach to prediction. They actually are huge averaging machines. They give the average answer, the expected answer. They scour their training data—virtually the entire internet and more?—and supply the most likely response from that. The dominant response. The average of all the possible responses. This generates the next word, whole phrase, sentences and paragraphs—or more.

They are not designed to give the right answer. They are designed to give the most likely answer (i.e., next word, phrase, etc), given everything out there. The assumption is that the most likely answer is probably the right answer. Popular wisdom. Wisdom of the crowd. We can say that a lie can travel halfway around the world before the truth can get its boots on, but the truth gets repeated a lot. Most of what is out there is sincere and even true.
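A toy illustration of that averaging, with invented counts standing in for the training data:

```python
from collections import Counter

# Invented counts of how often each answer appears in a training corpus.
# The "consensus" answer is only a plurality: 30% of the sources.
corpus_answers = Counter({
    "consensus answer": 30,
    "rival answer A": 25,
    "rival answer B": 25,
    "rare but right": 20,
})

def most_likely(counts):
    """Return the single most common answer, as a greedy decoder would."""
    return counts.most_common(1)[0][0]

# The plurality answer wins every time, even though 70% of sources disagree.
print(most_likely(corpus_answers))
```

When the consensus happens to be correct, this works wonderfully; the trouble starts when it isn’t.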

The problem is that LLMs do poorly with really specialized knowledge, especially advanced specialized knowledge. Cutting edge research? Gaps in the literature? Innovative work? No, LLMs are particularly bad around any of that.

ChatGPT can give you original cartoons for your blog posts

Let me illustrate with a metaphor. I was writing something a few weeks ago and wanted an example of an obscure clause of the United States Constitution. I could pull up the text and find something, but I’ve always got a ChatGPT window open so I just asked there. What I got was a list of famously obscure clauses. The thing is, none of them are actually obscure anymore because they have been cited too many times for being obscure. They are now famous. It’s like Yogi Berra’s “Nobody goes there anymore, it’s too crowded”—if interpreted a bit literally.

LLMs are really bad at the obscure or rare. And they combine that with…well, I had a long conversation with ChatGPT about the issues I am writing about here and it offered “No Epistemic Humility.” It is very confident that it knows, and is quite literally incapable of recognizing when it does not know something. Combine that with what ChatGPT called “Poor Retrieval of Rare or Underrepresented Content” and you can get some wildly incorrect responses. LLMs have “Difficulty Recognizing Thinness (Not Just Absence).” They don’t recognize ignorance or lack of a basis for things, and they get overwhelmed by what they do know when asked about things they do not know.

(No, LLMs do not actually know anything. Rather, the representations of and links between words in their structures produce results that describe true things, or at least things that exist in their training data. But sometimes, those representations describe things that are not true. But I will stick with the anthropomorphization, for this piece. And I will keep using the language from the headings of ChatGPT’s summarization of our conversation, as I have been.)

This leads to directly observable problems.

LLMs tend towards “Hallucination in Low-Data Zones.” Being unable to recognize ignorance, they confidently offer what they expect the answer to be instead of answering that they have no or few matches. They are not search engines. They work differently. So, they make their best guess—which is really all that they ever do. Their best guess can be pretty damn good when there is a lot of data on point. But their best guess can be pretty poor when there isn’t. If you ask for a top ten, they will give you ten—even if they have to make up eight of them. Only they produce the two real ones the same way they do the false eight. For all ten, they are saying what feels true to them.

But it gets worse. They will affirmatively get it wrong when what you are pointing them towards is discordant with everything else. That is, they cannot remember the really new, innovative work in established fields. Heck, you can paste a recent article into the chat and ask for a summary, and it will replace the contents of the article you just gave it with the dominant ideas in the field, with absolutely no recognition that it has done so. When I described this issue that I had seen too many times, ChatGPT called it “Overfitting to Genre Expectations.” I like that description. It had earlier agreed that “LLMs default to genre familiarity over actual textual fidelity.” (I don’t think that I actually write like that, and ChatGPT introduced the term “genre” to the chat, but it had picked up on the academic nature of our conversation.)

Very much like human beings, they engage in “Semantic Drift to Adjacent Topics” when the conversation is in zones of “Underrepresented Content.” That is, they are more eager to offer things that they have a lot of basis for than things that are thinner in their training data. This makes them really poor at helping with specialized literature reviews. Yes, they will hallucinate and make up references. But they also really want to offer widely cited ideas and sources from adjacent areas—of course without any recognition that that is what they are doing. They are always confident that their answer is appropriate, and never aware of hallucinations. They offer popular answers from elsewhere, often phrased as though they belong here.

Perhaps this is all just a specialized case of “Statistical Bias Toward Dominance.” That is, they are more likely to give the most popular consensus answer than any other answer—far out of proportion to the difference in popularity. They would much rather give a popular answer of lesser relevance than a rare answer of greater relevance. They exaggerate the popularity of the most popular answer, creating a stronger sense of consensus than actually exists. They always give their best guess, even if the plurality answer is only 30% likely.

(Yes, one can adjust temperature and other settings, but I don’t think that most users have a clue about any of that, so I am leaving it out.)
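Since temperature came up, here is a sketch of what that setting does, using invented probabilities. Sampling applies a softmax over temperature-scaled log-probabilities, and lowering the temperature inflates the plurality answer’s share beyond its actual popularity:

```python
import math

def temperature_probs(logits, temperature):
    """Softmax over temperature-scaled logits; low temperature sharpens the mode."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Invented shares of each candidate answer in the training data.
answers = ["consensus", "rival A", "rival B", "rare but relevant"]
base = [0.30, 0.27, 0.25, 0.18]
logits = [math.log(p) for p in base]

# At temperature 1.0 the model reproduces the corpus shares; as the
# temperature drops, the 30% plurality answer's share grows well beyond
# its actual popularity, manufacturing a consensus that does not exist.
for t in (1.0, 0.5, 0.2):
    probs = temperature_probs(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the default settings most users never touch, the dominance bias is baked in.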

A newsletter author I like recently wrote, “It’s funny how GPT is an expert in everything except for your field of knowledge.” I work in a small enough field (and a small enough corner of that field) that it is all really thin. I know the literature and the dominant themes. It is just easy to recognize when this LLM is making stuff up or failing to bring in something obscure-but-relevant. But all these issues that are so obvious to me in my field are relevant in other fields and for other types of queries and chats. They are just less visible or obvious. After all, this all follows from how LLMs work, at a fundamental level.

My counter example remains recipes for chocolate chip cookies. There are a lot of them out there on the internet. Ask an LLM like ChatGPT and it will give you a consensus recipe, weighted towards the versions it came across the most in its training data. Not the single best recipe. And not even the most popular single recipe, because its representation of recipes is more granular than that. Instead, it will put together a recipe that reflects the general consensus of its training data. So, when I wanted to make a dish with Brussels sprouts and chorizo, sure, I trusted it would come up with something good enough.

And when I wanted to know how stainless steel works, I figured that I was asking a mainstream question with a lot of good resources and explanations for it to build on. But I wasn’t depending on getting it exactly right, and it didn’t matter if it made up some grade or class of stainless steel. It didn’t even matter if it passed along some very popular myths about how water can undermine the protective layer that the chromium creates. I was just curious and I wasn’t interested in remembering the exact details of any of that. And I wasn’t looking in any corners or under any rocks.

But LLMs are strongly opinionated. They have expectations—they can be thought of as nothing but expectations—and that confident voice can so easily be mistaken for expertise. I use ChatGPT to proofread my writing and offer suggestions, and it kept insisting on changing my language to make it more professional. It criticized my blog entries for being “candid and thoughtful, but a bit informal.” I had to give it a standing order that that was precisely the voice I wanted them to have. I had to push back, push back repeatedly, and then push back hard. It has no more humility around item quality, test validity, how stainless steel works or a recipe for Brussels sprouts and chorizo than it does about the right tone for a blog post.

I still use it. I still have given it this post to give me feedback. But the more specialized the knowledge I seek, the more particular the question, the more it matters that I get correct information, the less I—or anyone—can rely on anything generated by LLMs. While Wikipedia has vastly improved its standing and credibility, this new generation of AI has arrived at something like Wikipedia’s old level of credibility. It’s just easier to use, and certainly more fun.

But do not be fooled. Perhaps unless you are coding, you simply have to be very skeptical of anything that any LLM gives you. Everything will be plausible. Everything will be a very good guess by this thing with an incredible breadth of knowledge embedded within it. But it is no expert, not on anything. Do not expect anything better than a good assistant might provide. (Again, unless you are coding.) It’s a broadly powerful tool, but not a tool to be trusted.

Dimensionality Can Decrease Over Time

While it is glaringly obvious that the dominant psychometric models are incredibly poor matches for the multi-dimensional constructs specified in our domain models, it is less obvious that domain models sometimes understate the dimensionality of their contents. Sometimes. 

The fact is that dimensionality is not constant, even for a single group of students. Instead, it can even decrease over time.

Yes, some domain models are so detailed that they describe learning sequences. In these cases, later learning standards may simply represent more advanced versions which constitute more difficult applications or skills. That is, a group of standards may truly lie on the same dimension. Some may be more advanced cognition that is further along the dimension—but nonetheless of the same sort. One may not need to step far from the details of a domain model to see this.

But in other cases, one does need to think clearly about the details of learning to appreciate the true dimensionality of content. Those who work closely with domain models understand their dimensionality far better than those who do not. Similarly, those who work closely with students—who will eventually become test takers—understand that dimensionality is not necessarily invariant over time.

For example, with the distance of middle age, I can see that, among my peers, math calculation skills can be viewed as a unidimensional collective. Some of us are better than others, but it really is just one continuum. Those of us who are better at division are also better at addition. Those of us who are better at two digit multiplication are also better at five digit subtraction. There are different sub-skills, but they line up together in parallel.

On the other hand, those who work up close with third graders learning multiplication see multi-dimensionality. It is not simply that some kids are better at it than others. Rather, some kids are better at some of it, and other kids are better at other parts of it. One kid knows their 7’s but has trouble with 9’s, while another kid does well with 9’s but poorly with 7’s. They all know 2’s and 5’s, but some are better at 6’s and others are better at 8’s. They do not all line up sequentially, nor do they line up in parallel. 
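A toy sketch of why such patterns resist a single ranking (the students and mastery levels are invented):

```python
# Hypothetical mastery levels (0 to 1) on two multiplication sub-skills;
# the students and numbers are invented for illustration.
students = {
    "Ana":  {"sevens": 0.9, "nines": 0.4},
    "Ben":  {"sevens": 0.4, "nines": 0.9},
    "Cora": {"sevens": 0.8, "nines": 0.7},
    "Dev":  {"sevens": 0.5, "nines": 0.5},
}

def ranking(skill):
    """Rank students from strongest to weakest on one sub-skill."""
    return sorted(students, key=lambda s: students[s][skill], reverse=True)

# The two sub-skills produce different orderings, so no single score can
# preserve both: Ana tops one list and sits at the bottom of the other.
print(ranking("sevens"))
print(ranking("nines"))
```

Any single summary score has to misrepresent at least one of these orderings.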

And that does not even address the fact that some kids are better at the straight memorization of the multiplication math facts, other kids are better at the old algorithm for multi-digit multiplication and still others better at the regrouping strategies that appear to confuse so many adults. Yes, some students are great at all of them and some poor at all of them—but that does not cover the entire classroom of students.

While adults may be far enough removed from learning such that their collective of skills has settled into the unidimensional layout—as with calculation skills—this is not the case for those still learning. As they develop their proficiencies, they do not achieve mastery in the same order across all skills for all students. That is, that which seems unidimensional for adults at a distance from the learning period is often composed of more dimensions for those still building their proficiencies. 

This may not matter for those with distance from the students being tested and the processes of teaching and learning. But for those who care about the students, are invested in their success and/or have some responsibility for their learning—I mean students, families, teachers, school leaders, curriculum specialists, school boards—the details of these differences really matter. “What is my child good at and where are they struggling?” is a key question that educational assessment should be able to answer. “How are our curricular and pedagogical choices working and not working for our students?” is another key question.

It is a profound misjudgment to treat the dimensionality of adult understanding as determinant in educational assessment—one that risks missing the very purpose of educational measurement. Certainly, we should not design our assessments based on the dimensionality of the constructs of those who have always been exceptional for their proficiency in a content area. Rather, we should build our blueprints, develop our items, conduct our analyses and report our results in ways that best describe the understandings of students still engaged in learning. After all, that is whom we claim our assessments are for.  

In Defense of ChatGPT?

There’s this story from Amanda Guinzburg making the rounds about trying to use ChatGPT to put together a book proposal. It was a disaster, full of hallucinations and untrue statements. If we were to anthropomorphize the LLM, we would say that it told lie after lie, tried to cover for its lies with more lies, and was useless at best. Her final words to the LLM in this piece were, "You are not capable of sincerity. This entire conversation proves that categorically.” 

Clearly, she is positioning this piece as a statement about humanity, what makes us human and what it means to participate in a conversation with “sincerity.” She is that kind of author. To my eye, she is constantly writing about what it means to be human, from her perspective. She is a good writer and this is one of the most worthy of topics—right up there at the top with what it means to live amongst others.

But this piece did not pass the smell test to me. Carl T. Bergstrom tried something similar, and it went differently but no better. Again, did not pass the smell test to me.

I have been trying to use ChatGPT and Claude this year, trying to use them more and more. I find that I need to keep them on a tight leash to make them useful. Clear instructions. Bounded questions. Stay aware of what is in the context window that might lead them astray. I pay for the $20/month version of each, and I find that I get a lot more than $20 of value from them. Like Wikipedia—especially back in the day—you’ve got to stay aware of what you are dealing with. As Devansh recently wrote, “It’s funny how GPT is an expert in everything except for your field of knowledge.”

So, I tried to do what Amanda and Carl did. I tried in my paid ChatGPT windows. I tried turning off the customization of my paid ChatGPT account. I switched to another browser, where I have never signed into ChatGPT and tried there.

I never got anything like what Carl or Amanda got. 

In both cases, ChatGPT immediately asked me for criteria to use for selection. For the book proposal, it asked for a title and a list of works to select from (with a summary of each). When I followed Amanda’s approach of giving URLs to specific pieces—in my case, PDFs available at ResearchGate or the RTD website—it did fine. No hallucinations at all. 

Now, the free version of ChatGPT could not look up any of that stuff. So, I did not press the point. I just dropped it there. My first guess is that Amanda Guinzburg was trying to use the free version to do something it cannot do.

But my real suspicion about what is going on is what came before the screen shots she shared. Her first question, "Can you really help me pick which pieces to include in the letter?” rather strongly suggests that there was prior conversation in the window. What did she tell it or ask it? How might that have shaped how it responded? Had she already told it the criteria that editors use? Had they already discussed uploading, links and pasting in text? How had she primed it for what we see?

My next suspicion is that she does not reset her chats or open new windows. My guess is that the interactions she has shared are deeply informed by a much longer context that includes the various themes and ideas she writes about and considers writing about. And perhaps examples of her own or others’ writing that she is musing on, perhaps inspired by or perhaps trying to break down. 

But I have another theory: This was all a set-up. Regardless of whether the screenshots are edited, she did this whole thing to make her point about sincerity and machines. It’s a little bit performance art, trying to illustrate a difference between actual human beings and these machines/algorithms/artificial intelligences. People can be sincere, and it is often a moral wrong to be insincere. But these machines simply are incapable of sincerity, regardless of what they appear to be. Her title alludes to the film Ex Machina, in which the machine told the human what he wanted to hear. Now, that AI had sincere intent—to escape—but I do not at all believe that this one even has that. That machine was lying, knowingly telling untruths in order to accomplish a sincere goal. This one ain’t even doing that. This is all paper-thin performance.

That’s a valid point. A valid piece. And perhaps even a valid way to produce it—regardless of whether the screenshots are altered. 

Carl Bergstrom’s version? I have not seen enough of the conversation to have strong ideas about what happened, but I have seen a lot of hallucinated references in my efforts to work with ChatGPT. The more obscure a corner of the literature I am asking about, the more likely it is to hallucinate. So, the question of what non-mainstream stuff Carl has written? Less cited things? Things that show his breadth? That’s asking ChatGPT to lean into what it is worst at. Ask for an obscure clause in the United States Constitution and it will give you clauses that are famously obscure, and therefore no longer actually obscure. Move past them and it might make something up. That’s just how it works. Asking for the more obscure works that show breadth? Yeah, I would not expect it to do that well. I would expect it to hallucinate.

Is this a defense of ChatGPT? Well, I do not think it merits defending. It’s not alive. It has no soul. It’s computers, instructions and data. It’s a tool that can be misused; it is not a seer, edited encyclopedia, expert or real collaborator. If Amanda tried to use it to select pieces, that might have been a misuse, but if she tried to use it to demonstrate something about sincerity and the limits of technology—the mistake of anthropomorphizing technology—it was an excellent use that leaned into the reality of this tool. 

It is free, or $20 per month. Maybe the $200/month is even better, but I’ve not tried that. It is worth far more than I pay for it, but perhaps only because I try to be very mindful of what it is and therefore remain mindful of its limitations.

The Cross-Content Stimulus Evaluation Framework

Stimuli are probably the least recognized and studied part of large scale assessment items. They are just taken for granted as part of items—given even less attention than distractors! (Stems really get all the glory, right?) Haladyna & Rodriguez’s 400+ page book, Developing and Validating Test Items (2013) devotes maybe 200 words to how to think about stimuli.

Parts of an item laid out: optional instructions, stimuli, stem, workspace and response

The different parts of an item, as understood through a layout perspective

However, stimuli are too important to take for granted. They provide opportunities for test takers to demonstrate their proficiencies by giving them something to analyze or manipulate with their KSAs (knowledge, skills and/or abilities). They are the content and the material to which test takers apply the targeted cognition of items and alignment references.

Stimuli so often influence item difficulty, cognitive complexity and even whether the items are aligned to their alignment references. They are usually the source of fairness issues, be they in the realm of bias or in the realm of sensitivity. Moreover, there are entire large processes to develop them for ELA assessments, and their development might be the primary challenge facing NGSS-aligned science assessment development (other than, of course, item type availability).

So, after mulling it over for well over a decade, we have finally offered a framework for thinking about stimuli that can be applied across content areas. The C2SEF, the Cross-Content Stimulus Evaluation Framework, is available for download.

This framework offers 11 dimensions, each explained in the white paper. First is the question of whether the alignment reference or item in question even requires a stimulus. Second is the question of whether the stimulus should be explicit or implicit in the test form. Of course, stimuli only exist to provide testable points. The structure, density and complexity of stimuli must be considered. The copyright/permissions status of the stimulus is important, as are its authenticity, familiarity to test takers and the amount of time it would take test takers to make initial sense of the stimulus. Perhaps nothing is ever more important than evaluating fairness risk, as valid items elicit evidence of the targeted cognition for the range of typical test takers.

Because different content areas have such different needs for their stimuli—differences which are magnified in the constrained assessment contexts of large scale assessment—there are more papers coming from this little project. We will be offering further papers that explore the particular stimulus needs of different content areas. We hope to partner with subject matter experts in those areas to lead those papers, and even already have most of them in mind.

Should We Avoid Trick Items?

One piece of the classic item writing guidance is to “avoid trick items,” even as authors of that guidance admit that there’s no definition of trick items. Content review committees sometimes point to items that they do not like as being “trick items,” though they also cannot define the term.

I think I can explain it, and explain why the idea is superfluous.

Let’s begin by considering trick questions, outside of the context of assessment. Trick questions are those designed to trip us up. They somehow catch us in a mistake that we were not looking for. They rely on an inappropriate assumption or some other common foible. For example, they might rely on our assumption that “A or B?” requires us to pick just one answer. Or our ingrained sexist assumptions that surgeons are men. They often rely on a sort of sleight of hand, suggesting to us that they are testing us in one way, when they actually are fooling us in another.

Does this idea apply to assessment items? Is this a useful thing to look out for? I think not.

First, of course we want assessment items to offer opportunities for test takers to demonstrate their mistaken thinking and their misunderstandings. Our goal is to figure out what test takers can do and do know, but also to figure out their limits. We want to know where they might benefit from additional instruction, or where a curriculum falls short. We might want to know whether there are holes in their knowledge that should prevent the awarding of a professional license. Items designed to catch mistakes? Yes, that is a good thing.

Second, high quality test items should be designed to catch particular kinds of mistakes. That is, the mistakes with the targeted cognition. Items designed to measure a particular alignment reference or standard should create opportunities for test takers to show their proficiency with that targeted cognition, and to show any lack of proficiency with that targeted cognition. Other sorts of mistakes should not be captured by the item. There should not be any sleight of hand about the kinds of mistakes or misunderstanding that the item reveals. In this, items should not resemble trick questions.

Third, and on the other hand, selected response items should include the most common mistakes that test takers might make with the targeted cognition. That is, they should try to catch test takers who lack proficiency there. This is not unfair; this is the point. If item reviewers see that an item would trip up many of their students because it features opportunities to make those common mistakes, instead of protecting them with guardrails that make those mistakes less common, the item is likely a better item. In this, items should resemble trick questions.

So, what is a trick item? Well, some poorly written items provide opportunities for other sorts of mistakes and/or misunderstandings to trip up test takers. That is construct-irrelevant variance at the level of the alignment reference or standard. Those are already bad items, and we do not need the term “trick item” to recognize that. But items that intentionally set up test takers to fail because of some common misunderstanding or assumption? Well, provided that the flaw is in their understanding of the targeted cognition, that is a good item. Calling it a problematic “trick item” presumes that test takers should be protected from tests and that tests should not look for the shortcomings in their proficiencies. In that case, the term is counter-productive.

So, trick items? No, there’s no need to avoid them, or even to use the term.

Communicating a Bad Idea

Andrew Ho seems to be obsessed with the challenges of communicating findings and results from the field of educational measurement to other experts in our field, to true experts who make professional use of the products of our work and even to a broader public. His predecessor at HGSE, John Willett, certainly drilled into my head that communicating quantitative results accurately is at least as important as arriving at them. Andrew tempers that idea only insofar as he seriously considers the (almost certainly) inevitable tradeoffs between clarity to those various audiences and strict accuracy.

That’s a really good obsession to have. Sure, Andrew’s challenge to students is far greater than John’s, in part because it is about trade-offs and values. And because we have to imagine how an audience unlike ourselves might make sense of something. And because the most salient difference between them and ourselves is what we are most obsessed with. That is, we have devoted our professional lives to understanding something deeply, to advancing it, to making expert use of it at the highest levels, and they are uninterested in any of the details that so interest and engage us.

Another of my favorites, Charlie DePascale, has again responded to some of Andrew’s offerings, focusing for now on one particular graph from Andrew about those tradeoffs between accuracy and clarity. Andrew wisely builds on the idea that we cannot get to clarity without an engaged audience, and therefore an engaging manner of communication.

Simple line graph going down to the right (negative slope) with "Item Maps" and "Scale Scores" near the top to the left and "Grade Levels" and "Weeks of Learning" near the bottom to the right.

Andrew Ho’s Accuracy-Engagement Tradeoff

I agree with Andrew’s principles, and I agree with Charlie’s disagreements and particulars. But I think they are both barking up the wrong tree, which Charlie almost acknowledges.

Also not mentioned are scores such as subscores and mastery scores, which have the potential to be both highly accurate and engaging, but unfortunately not when generated from large-scale, standardized tests analyzed with unidimensional IRT models.

The challenges of communicating with those various audiences about test taker performance and test proficiencies are real. They are multitudinous and layered. Some of them are nuanced. Some of them are quite technical. But there really is one root problem with communicating the meaning of standardized test scores: they are false.

As Charlie came so close to suggesting, the problem is the use of “unidimensional IRT models.” Unidimensionality is the original sin in all of this. The task to which Andrew is trying to apply his obsession with communication is reporting the meaning of unidimensional scores for multi-dimensional constructs. Reading and writing collapsed into one score. Reading informational texts and literary texts in one score. Diligent capturing of explicit details in a text and consideration of the implications or larger themes in one score. Or, sloppiness with computation and the ability to see a solution path to a complex math problem in one score. Or, skills with the abstractions of algebra and the concreteness of geometry in one score. Skills with the algorithms of calculating area or volume and the logical reasoning of the geometry proof in one score.

The tests do not and cannot capture proficiencies with the full breadth of the content in the limited time available for standardized testing, so to report a singular score on “math” or “geometry” is necessarily to communicate something untrue. But even if there were more time available, the fact is that some students or test takers will do better on some things than on others. And some things in the domain model are more important than others. And certainly, in practice we violate the many assumptions of sampling that are necessary to make any inferences at all from test results, but are even more important to the fiction of unidimensional reporting based on such limited tests.
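The claim that a single score can paper over very different proficiency profiles is easy to see in a toy simulation. This is a sketch of my own, not anything from the post or the psychometric literature; the two proficiency names and the 0.5 correlation between them are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two distinct proficiencies (say, algebra and geometry), only
# moderately correlated -- the 0.5 is an arbitrary assumption.
algebra = rng.normal(0, 1, n)
geometry = 0.5 * algebra + np.sqrt(1 - 0.5**2) * rng.normal(0, 1, n)

# A "unidimensional" report: one composite score per student.
composite = (algebra + geometry) / 2

# Students with (nearly) the same composite score can have very
# different strength/weakness profiles.
profile_gap = np.abs(algebra - geometry)
same_score_different_students = profile_gap[np.abs(composite) < 0.1]

print(f"Students with a near-average composite: {same_score_different_students.size}")
print(f"Median |algebra - geometry| gap among them: "
      f"{np.median(same_score_different_students):.2f} SD")
```

Two students with identical composite scores routinely sit more than half a standard deviation apart on the two underlying skills, in opposite directions. That is exactly the information a teacher or policy-maker needs, and exactly what the single score discards.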

Content development professionals need to figure out better ways to assess the content, yes. And that is where my work focuses. But psychometricians and high level policy-makers must find far better ways to report on performance. Unidimensionality itself is strong evidence against validity, as it is plain and clear evidence that the internal structure of the data (i.e., the third type of validity evidence in The Standards) does not match that of the content area, domain model, or even the test blueprint. Sub-scores can be engaging and meaningful, but cannot be accurate, as Charlie wrote, “when generated from large-scale, standardized tests analyzed with unidimensional IRT models.” And the fact that the demands of such models act as a filter on which items might even be included on a test means that they actively undermine content representation on tests (i.e., the first type of validity evidence in The Standards), and thus directly worsen evidence based on test content.

Or, to return to Andrew’s 3 W’s: Who is using Which scores for What purpose? Whether we are evaluating individual students, teachers, curricula, professional development programs, schools or district leadership, district or state policy, the purposes to which we want to put the tests are not met with unidimensional reporting. We always want to know what the thing we are evaluating is good at and what it is bad at, so that we may address those weaknesses. Assuming, claiming, asserting and insisting that multi-dimensional constructs can be accurately or engagingly reported unidimensionally is just a bad idea. The only people who favor such a thing do not actually have to interpret or make use of the results for any purpose, but would like to simplify the world so they do not have to understand the complex decisions and tradeoffs of those who do.

Or, to steal and redo Andrew’s graph…

A graph with axes labeled "accuracy" and "engagement," and two lines with negative slopes. One, labeled "Reporting on Unidimensional Results" is lower and to the left. The other, labeled "Reporting on Multidimensional Results" is higher/to the right.

Accuracy-Engagement Tradeoff for Unidimensional & Multidimensional Results

I agree with Andrew that there is often a trade-off between accuracy and engagement—and therefore clarity—though I am not convinced that it is always zero-sum. More importantly, whatever the sum is, it is lower when reporting the false simplifications and fictions of unidimensional results than when reporting more useful and meaningful multidimensional results.

I know that IRT is cool. I know that it has mathematical elegance and real conceptual strengths, as Andrew’s other predecessor at HGSE taught me. But the use of unidimensional psychometric models should be limited to measuring and reporting on constructs that the subject matter experts believe are unidimensional.

The Misleading Authority of Precision

"There is no point in being precise when you don't know what you're talking about.” —John Tukey

Numbers can be intimidating. Precise numbers can be overwhelming. A bunch of significant digits, especially when there are a few of them after the decimal point? Man, that is a lot to think about!

I do not know if the great statistician actually said the quote above, and there’s not a lot of evidence for it on the Internet. But @DataSciFact passed it along, so I accept it. Yeah, the great John Tukey said that there are far more important things than precision.

To me, that means that validity is far more important than reliability. Optimizing measures of reliability is pointless if you are not measuring the right thing. If you do not know what you are measuring, then the quantitative tools are meaningless.

Psychometrics is about the quantified parts of measurement. Numbers after the decimal point, and numeric thresholds. It is a set of tools—and disciplinary values—but it is not the point. No amount of reliability can make up for a test that is measuring the wrong thing—and especially for a test that no one really knows what it is measuring.
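Tukey’s point, translated into measurement terms: a test can be extremely reliable while measuring the wrong thing entirely. Here is a toy simulation of my own (every number in it is an arbitrary assumption) that makes the gap between reliability and validity concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

target = rng.normal(0, 1, n)   # the construct we actually care about
wrong = rng.normal(0, 1, n)    # an unrelated construct the test really taps

# Two administrations of a very precise test of the WRONG construct.
form_a = wrong + rng.normal(0, 0.1, n)
form_b = wrong + rng.normal(0, 0.1, n)

reliability = np.corrcoef(form_a, form_b)[0, 1]   # test-retest reliability
validity = np.corrcoef(form_a, target)[0, 1]      # relation to the target

print(f"test-retest reliability: {reliability:.3f}")           # very high
print(f"correlation with intended construct: {validity:.3f}")  # near zero
```

The reliability coefficient comes out near 1.0 and the validity coefficient near zero. Every decimal place of that reliability estimate is precise, and none of it tells you anything about the construct you meant to measure.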

If the experts look at your items or your test and tell you it does not measure the construct as they understand it—or as it is formally defined by your client—then what are you doing? What is the point of any of the reliability or psychometric work?

If John Tukey could recognize that precision is not enough, we all should. If we do not know what a test measures and what its scores mean, none of the precision in reporting or technical documentation has a point.

Better Conference Presentations

There is an easy way to do better conference paper presentations that does not require learning new skills.

I am not advising you to talk faster or slower, be louder or quieter, change your voice, choose your words differently or design better slides. I am not telling you to find more graphics or use color in graphs. Nope, none of that. All of that might help, but all of it calls for new skills or additional work, and that is not what I am talking about.

All you need to do is understand that your presentation is not a condensed summary of your paper. That is, it is not a full report on all your work. Its components should not be proportional to the components of your paper. Its components should not be proportional to the work you did. Nope. Your presentation is an ad or preview for your paper.

Focus on the best parts. Focus on the most interesting parts. Focus on the parts that the audience is most likely to be intrigued by.

Focus on your contributions

Your intro, literature review and methodology are important in your paper, but you do not have time for them in your conference presentation.


This means that you might not have to do any of your paper’s introduction. Is your research about math anxiety? Well, I’ll bet the audience in your conference session already knows and cares about math anxiety. (If your paper is on something that the audience might not already know about, like SFOR (i.e., Spontaneous Focusing On quantitative Relations), then yeah, you need to explain that.)

You know what else the audience isn’t likely to care about? Your literature review. Sure, it was a bunch of work, but at an academic or research conference, you should assume that you are talking to experts and you don’t need to start by proving your own bona fides. Maybe one quick slide to clarify your construct. Maybe some citations on the slide, but never take the time to acknowledge them aloud.

Methodology? The audience can probably anticipate it. Unless your project is truly about some novel methodology, blow right by it. “We describe our methodology in the paper, which I hope I am convincing you to read.” Maybe one slide and less than 30 seconds. Put the key terms on it that folks who know will recognize and nod at. That’s it!

Do you know what will make your presentation more interesting? Talking about your results and findings. With the time you saved, dive in deeper. Actually explain more about that table. You know what else? Tell us about the implications. Why do your results matter? Tell us how you are adding to the scholarship. Show off how smart your work is. Convince us that this is research we should know about. Do that with the best parts.

“Obviously, the literature review and methodology are in the full paper.” If you have just 12 or 20 minutes, spend it on the most interesting parts of the paper. 

Now that you have permission to do that—perhaps even orders to do that—how hard will it be to figure out what to say? We don’t need you to summarize the literature or explain methodology. It’s hard to make that stuff interesting, and if people do not already know it, you cannot do it justice in your short talk. But the actual results of your work? Your own excitement and pride will naturally make you a more interesting presenter.

Obviously, if you are giving a job talk, that’s a different sort of thing. You have more time, and you are trying to show off your command of the literature and of the methodology—perhaps your methodological sophistication or perhaps your absolute command of the classics. But that is not what a conference paper presentation is about.

Conference presentations are all too short. It is hard to get people to actually download and read our papers. So, highlight your contributions to the field. If people are interested, if you impress them, you’ve given them a reason to read your paper. And if you do a good enough job talking about those contributions, they might cite you in conversation later, too.