Epistemologies, Respect, and Physics Envy (Part V): The Standards and the standards

I had a conversation recently with a respected psychometrician. Someone truly smart and thoughtful, whom I personally respect for a variety of reasons. I take what they say and write quite seriously—even though I don't necessarily agree with all of it.

In the course of the conversation, he made a point about psychometrics and validity that required, as far as I could tell, simply setting aside what The Standards for Educational and Psychological Testing say on the matter. When I pressed him on it, he dismissed my reference to The Standards by saying—in a kidding-not-kidding sort of way—"I don't care what The Standards say."

I don't buy it. I know that he cares—just not enough to let it stop him. And not enough to reckon with the inconsistency between The Standards and the argument he wanted to make. So he made the joke rather than wrestling with what the document actually says. The humor was doing epistemological work by signaling that the authoritative, hard-won, cross-disciplinary consensus document was beneath the level of argument he was operating at. It was clearly a joke. It was also a deflection.

So I pushed further. He then dismissed The Standards as merely "a consensus document"—as though that fact diminishes its authority rather than constituting it. In word-oriented epistemologies, in law, in democratic theory, consensus is the source of authority. The Standards carry weight precisely because they represent serious, collaboratively arrived-at agreement over the course of generations across the constituencies of this multi-disciplinary field. That is not a weakness. That is what they are. They should not be ignored for being inconvenient, or dismissed for failing to feature one's own favorite ideas. (RTD, for example, adds item validity to the picture The Standards draw—but removes nothing from it in order to do so.)

Then he asked whether domain models can be falsifiable.

For a third time, I was stopped cold. The domain models given to assessment developers are the job. State legislatures ratify learning standards. Test development contracts call for assessments to align with them. And like The Standards, the standards themselves represent the product of generations of thinking—about curricular priorities and learning sequences. Modern standards can be traced through the remarkable work of the National Council of Teachers of Mathematics in the 1980s and 1990s and through the broader education standards movement that followed. This is serious work, done by people deeply committed to their fields and to their students.

Are they falsifiable? From a Bayesian perspective, it would take an extraordinary amount of evidence to do so—certainly astronomically more than can be generated by convenient but obviously inappropriate psychometric models. And any such evidence would have to address the actual concerns of the standards: the nature of the content and the appropriate learning sequences through it. None of that was on the table. Moreover, what would result from that kind of serious engagement would be iterative improvement, not disproof. That is how both The Standards and the standards develop over time.

Asking whether they are falsifiable is a category error. Falsifiability is a criterion native to one epistemological tradition (a powerful and important tradition) being applied as though it were universal. CCSS (the Common Core State Standards) is not a hypothesis. It is a social, legal, and professional agreement about what students should know and be able to do and how those learning goals hold together. The right question is not "Can it be falsified?" The right question is "Is our test properly grounded in it?" That is word-oriented epistemology's home territory. And it is, not coincidentally, the epistemological tradition that The Standards themselves are working in.

So, in a single conversation, the same move was made three times. The Standards: dismissed with a joke, and then dismissed again as merely "a consensus document." The learning standards: subjected to a criterion from a different epistemological tradition that was never designed to apply to them.

This is not a story about a bad actor. It is a story about a larger trap—about what happens when deep disciplinary formation leads someone to stop seeing the obligations that don't fit its framework. This psychometrician spoke from genuine expertise. But for decades, this is the view I have seen dominate educational measurement. And it stands in the way of high quality assessments—and of the respect from educators and the broader public that high quality assessments deserve.

This is not to say that The Standards and the standards act like laws of nature. They are not absolutely inviolate. Nature says that nothing can travel faster than the speed of light. But the 613 mitzvot in the Torah—commandments that have structured Jewish life for millennia—all yield to pikuach nefesh, to the preservation of life. There are hierarchies among them. There are millennia of accumulated debate about how to understand and implement these written words. Some communities have revised their relationship to them substantially. But none of those debates—not the Talmudic tradition, not Reform Judaism, not the interpretive traditions of Christianity—simply dismiss them. Even where traditions have departed from strict observance, they have done so through serious engagement, not jokes and deflection. The wrestling is the point.

The Standards and the standards work the same way. They have hierarchies—Andrew Ho has argued that the first source of validity evidence is listed first for a reason, and he is right. They require interpretation. They will be revised, and there are people doing that work on The Standards right now. They will listen to the entire professional community. That is how both bodies of work grow and improve. Not through dismissal, but through serious engagement by people who take them seriously enough to push back on them in principled, argued, evidence-based ways. 

But until that revision work is done—until the community has done the wrestling and arrived at a new agreement—we are all professionally obligated to respect both The Standards and the standards. They define the job. Without them, we would be flailing, each of us alone, trying to figure out what we are supposed to be doing. With them, we stand on the shoulders of those who came before us.

That is what escaping the trap looks like. Not abandoning your formation. Not pretending that all methods are equally appropriate for all questions. But recognizing that your framework is your framework—powerful and not universal—and that the standards and The Standards deserve grounded, serious engagement. The Standards and the standards merit deference from all of us because they are the product of multiple epistemologies and disciplines. They should guide us in transcending our own. 



Epistemologies, Respect, and Physics Envy (Part IV): -(PE)⁻¹

The term "physics envy" was coined by biologist Joel E. Cohen in a 1971 book review in Science: "Physics-envy is the curse of biology." The aspiration behind it is older—Auguste Comte imagined sociology as "social physics" in the nineteenth century—but Cohen gave it a name. He meant it sincerely. But even his original use is off the mark. "Physics envy" was deployed in the context of political contestation about epistemological debates within biology—a field arguing internally about whether to be more mathematical, more formal, more like physics. That is a professional and political struggle, not a psychological one. 

When someone applies the term to describe other disciplines (qualitative researchers, humanists, anyone working outside the quantitative tradition), it describes something that simply doesn't exist.

In practice, "physics envy" functions as a disparagement. It projects a hierarchy the speaker feels onto the psychology of the accused: you must want what I have, because what I have is the best thing. It asserts superiority while dressing it as diagnosis. And it is, in the laziest way, unfalsifiable: if you deny having physics envy, that is exactly what someone with physics envy would say.

It is also—and this is where the irony becomes acute—built on a Freudian concept. "Physics-envy" is constructed on the model of "penis envy." By 1971, that was not an innocent borrowing. Second-wave feminism was actively exposing penis envy as precisely the same move: a person in a position of cultural power projecting their own hierarchy onto the psychology of those below them. The person without the thing doesn't actually want the thing. They want the social power attached to it—and the person with the power finds it flattering to believe otherwise.

But the deeper irony runs further. Envy is not a concept from physics. It is not a concept from the natural sciences at all. It is psychological, humanistic, interpretive—the kind of thing the natural science epistemologies are supposedly beyond. To name what they were observing, the people deploying this term have to reach outside their own toolbox and borrow from a tradition they look down on. They borrow without understanding what they are taking—a term already discredited in the very field they were raiding. And the act of borrowing demonstrates, without their noticing, exactly what they are denying: that their own disciplines lacked the tools to address everything that matters. They needed words. They needed interpretation. The very move that asserts the hierarchy undermines the claim the hierarchy rests on.

That said (and I say this as someone whose joke for forty years has been that physics is just applied calculus), none of this is to disparage physics. Physics is so closely bound to mathematics that it is like a brother to my first disciplinary love. My joke is grounded in genuine respect for what both make possible, and for the extraordinary things physics has done. I was formed by mathematics and the natural sciences before I was formed by any other discipline, and that formation never fully resolves. When I developed grounded theory in my doctoral work (a multidisciplinary study with a sociological methodology, rooted in political science, informed by psychology, executed with applied philosophy in mind), I still worried about falsifiability. But falsifiability is an important concept native to a completely different epistemology from the one I was working in, one I carried without quite being able to set it down. That was not physics envy. It was the weight of a formation I was still working against. Deep respect is not the same thing as envy.

I do not think physics envy, as described, exists in any meaningful way. What I do see—clearly, consistently—is its negative reciprocal. -(PE)⁻¹.

Not envy upward. Contempt downward. Not aspiration toward physics. Dismissal of those working in different epistemological traditions, from people who believe their work resembles physics. Very commonly—not occasionally, not in isolated cases—people trained in quantitative disciplines looking down on those working in qualitative ones.

It is easy to see -(PE)⁻¹ all around us. The tyranny of numbers. The confusion of precision with accuracy, as though carrying a result to four decimal places makes it more true. The willingness of people without training in qualitative methodology to opine freely on qualitative work, and of people who have collaborated on one mixed-methods project to consider themselves expert. A fundamental disdain for non-quantitative epistemologies that allows them to think, at best, that there is barely any there, there. A disdain strong enough that people who do not work in research at all (policy people, administrators, funders) think themselves closer to obvious quantitative logic and look down on those working with more nuanced and subtle tools.

And yet, the smartest quantitative methodologists I know—Dan Koretz, Andrew Ho, Andrew Gelman—preach consistently about the limitations of their tools and the importance of taking the non-quantitative parts of the work seriously. They know better than to fall for -(PE)⁻¹, and they understand it is their job as educators to ensure that future generations of quantitative researchers don't fall for it either.

But the world is larger and harder to teach than even such wise methodologists can reach.

Epistemologies, Respect, and Physics Envy (Part III): The Written Word

Two of the most acclaimed geniuses of the twentieth century were physicists—practitioners of a field so dependent upon mathematics that entirely new branches of math have been invented to do the work that physics required.

Their genius was not actually the math.

Stephen Hawking realized early in his life that his physical disabilities would not allow him to lean into the mathematics the way his colleagues did. He could not write out equations and notations in the conventional way. Which was fine, because that was not his genius. What he brought was the ability to visualize the geometry of spacetime and interpret physical reality in ways no one else had managed. One of the greatest physicists of the twentieth century. His genius was interpretive mastery.

Albert Einstein—the reason we use the name "Einstein" to denote genius itself—was not responsible for getting all the mathematics right in his papers. His wife is said by some historians to have helped; a close friend supplied the Riemannian geometry required for general relativity. His genius lay elsewhere. Einstein's path to relativity ran through thought experiments—imagining riding alongside a light beam, imagining an elevator in free fall. He grasped the physics conceptually before he had the mathematics to express it. The interpretation came first. The numbers followed. They were true to each other.

Physics is demanding. It includes math. But the rigor is not just in the math. In much of science, the actual math is literally formulaic. You know which equation to use; you plug in the values; you get the answer. That part is not where the genius lives.

The rigor in natural science is in the inquiry. In figuring out what you even want to test—recognizing a phenomenon, hypothesizing about it, finding a way to quantify it, doing the quantitative work, and then getting back out of the numbers. Back to the real meaning, in terms of the phenomenon. The numbers are often the easiest part.

The hardest part? The bridges. First, from something qualitative and real to a quantified proxy. And then second, from those numbers back into the qualitative phenomenon. Doing those translations while remaining true to the source: true to the phenomenon when quantifying, and true to the numbers when interpreting back into the terms of the real world. Real rigor is in that faithfulness. Andrew Ho teaches that educational measurement must be qualitative, then quantitative, then qualitative again. He is right. And that sequence applies far beyond educational measurement.

Consider a structural engineer's analysis of a bridge design. The load calculations are exact. The strengths of the steel and concrete are what they are. The numbers are non-negotiable—and any interpretation that ignores them will get people killed. And yet those numbers do not design the bridge or determine how to balance its strength and its expected load. Do you add more rebar? Redesign the load distribution? Post a weight limit? Serious engineers can look at the same numbers and reach different defensible conclusions. The numbers constrain every possible answer, though they do not produce one.

Once we recognize that it is not numbers that make for rigor, we should be able to see that rigor exists in word-oriented epistemologies as well. The disciplined acts of careful interpretation—in law, in philosophy, in literary scholarship, in religious textual traditions—are held to the same standard. The rigor lies in fidelity to the text. What does this statute actually say? What does this passage actually mean? What does this argument actually establish? These are not soft questions. They have possible answers and wrong answers—answers the text itself can support or refute. A lawyer who ignores what the statute says is not interpreting freely; she is being unfaithful to her evidence. Scholars in these traditions are held to exacting standards: quote accurately, represent sources faithfully, do not put words in mouths, do not misrepresent arguments. These are not courtesy norms. They are the standards of rigor. We all know what it looks like when they are violated. Everyone who has written a literature review was taught the importance of reporting truthfully on what the literature says, and what it might imply. Those who are not faithful to their sources are hacks, regardless of the discipline or epistemology.

Good and rigorous work is hard. Hack work is easy, regardless of the field, discipline or epistemology.

Rigor in word-oriented epistemologies actually has a longer history than rigor in quantitative ones. The degree is called a Doctor of Philosophy. The natural sciences grew out of what was once called natural philosophy—traditions of rigorous argument, careful reasoning, and disciplined interpretation that predate quantification. Quantification was built on them. To fail to see rigor in word-oriented work because it lacks numbers or advanced math is to miss the forest for the trees.

I was a math and science person. STEM high school before STEM was even a thing. All the AP courses. County math team. Summers studying more math and more science. It then took four years of college and more years of serious effort in graduate school to trust that I could do good qualitative work. I have never been able to shake the epistemologies that originally seemed so natural to me, but I continue to work at it. The parallels across epistemologies are real—the demands of clear communication, strict thinking, and faithful engagement with evidence are recognizable everywhere—but seeing them requires work.

(Someone looking for the true soul of mathematics could do far worse than the CCSS standards for mathematical practice. And someone who understands word-oriented epistemologies can read those descriptions of fundamental mathematical thinking and see their own epistemology right there between the lines.)

Neither numbers nor advanced math makes something rigorous. Hack statistical work can be found everywhere. Rigor does not come from numbers. It comes from holding ourselves and our work to the demands of our methodologies, from honoring the demands of our epistemologies, and from taking seriously the obstacles and challenges, even when they make the work harder and keep us from simply and easily reaching desired conclusions. And in so many fields, it means honoring the words.

Epistemologies, Respect, and Physics Envy (Part II): Deference to Quant

There is a hierarchy in educational measurement. Quantitative work is seen as superior. This is so thoroughly naturalized that many of the people who benefit from it cannot see it as a hierarchy at all—they experience it as a simple fact about what rigorous evidence and reasoning looks like. This is an epistemology at work.

We can watch how it operates in practice. Psychometric flags (low point biserials, suspect discrimination indices) are supposed to function as alerts, prompts for human judgment about whether something is wrong with an item. A negative point biserial should not automatically exclude an item. But in practice, the flags function as verdicts. Low point biserial? The item simply cannot make the operational form. The flag overrides the judgment of the content experts who built the item, reviewed it, and know the domain. Cronbach's alpha has achieved something close to supreme authority: a single number that seems to settle questions that involve far more than internal consistency. The quantitative output doesn't inform the judgment; it replaces it. Some deny that this happens, while others acknowledge that it is standard practice.
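To make the alert-versus-verdict distinction concrete, here is a minimal sketch of a flag that routes an item to human review instead of dropping it. Everything named here is hypothetical: the 0.20 threshold, the function names, and the toy data are assumptions for illustration, not anyone's operational rule. The point biserial is computed as the plain Pearson correlation between a 0/1 item score and the total score.

```python
from math import sqrt

# Hypothetical cutoff, chosen only for illustration; real programs set their own.
REVIEW_THRESHOLD = 0.20


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and a total score.

    Assumes both lists have the same length and neither is constant
    (a constant list would make the denominator zero).
    """
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    return cov / sqrt(var_x * var_y)


def flag_item(item_id, item_scores, total_scores, threshold=REVIEW_THRESHOLD):
    """Return an alert for content experts; never an automatic exclusion."""
    r = point_biserial(item_scores, total_scores)
    status = "send to content review" if r < threshold else "no statistical flag"
    return {"item": item_id, "point_biserial": round(r, 3), "status": status}


# Toy data: the students who answered correctly have the lower total scores,
# so the point biserial is negative and the item is flagged for review.
result = flag_item("item_07", [1, 0, 1, 0, 1, 0], [10, 12, 9, 14, 11, 13])
print(result)
```

The design choice that matters is the return value: the function has no "exclude" status at all, so the decision stays with the people who know the content.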

Meanwhile, the people holding those quantitative outputs frequently feel entitled to opine about things about which they lack expertise. Where exactly is the line between fifth and sixth grade mathematics? What distinguishes two sixth grade standards from each other in ways that matter for item development? What does it mean for a student to have genuinely mastered a standard versus having picked up a surface procedure? These are not trivial questions. They require deep content knowledge, experience with how students think, and serious engagement with domain models. They are the core expertise of content development professionals. And yet many others don’t even see the tools and lenses required to build and hold that expertise, and therefore think that there is nothing wrong with offering their views as though they might be of similar standing. Their epistemological framework does not see this other expertise, and they don’t either. Certainly, it is not as meaningful, challenging or rigorous as their own.

They don't see it. And what you can't see, you can't respect.

This often extends to qualitative methodologies generally. There is a pervasive assumption (rarely stated, but deeply held) that qualitative work is not really disciplined. That it has no real methodology. That its findings are, at best, suggestive and, at worst, just opinion. The possibility that qualitative research might have rigorous standards, systematic procedures, and criteria for what counts as a valid finding simply does not register. Not because anyone has examined qualitative methodologies and their foundational epistemologies and found them wanting. Rather, the epistemologies are ignored and therefore the methodologies are viewed as foundationless.

Here is what makes this genuinely strange: it is not a symmetric problem. Qualitative researchers in this field are typically intimidated by quantitative work. They defer to it. They actually have experience with it from their schooling days, and they respect that there is something rigorous there. They rarely question whether their quantitative colleagues' work actually meets its own standards. They just assume it does.

But it often doesn't. Underpowered studies. Violated methodological assumptions. Too much dependence on linear regression. Ordinal data treated as interval data. Multiple comparisons not accounted for. Proxies standing in for constructs they don't actually measure. The quant work that commands all this deference frequently falls short of the very standards of rigor it claims.

The asymmetry is that quantitative practitioners assume their own rigor without interrogating it, demand that others respect it, and simultaneously refuse to recognize the possibility of rigor in other traditions. They don't look for it. They don't ask what it would look like. The question doesn't arise, because within their epistemological framework, the answer is already settled. All of this despite the fact—or perhaps because of it?—that they rarely have any experience or training in rigorous qualitative work from their school days. 

Every discipline—every epistemological tradition—has its own requirements for rigor. All of us should lean into those requirements, interrogate our own work against them, and resist the temptation to assume that producing the right kind of output means the work is actually good. And all of us should do the harder thing, too. We should learn to see what rigor looks like in traditions other than our own, and respect the expertise it takes to achieve it—especially when our colleagues in our collaborations come from different epistemologies. 

Good multi-disciplinary collaborative work requires doing what our epistemologies cannot do for us. It requires looking for—and allowing—other ideals of rigor. It requires humility about our own epistemologies and their limits. It requires respecting the capabilities of others, rather than expecting deference.

Epistemologies, Respect, and Physics Envy (Part I): Introduction

[First of a five-part series.]

Over at Hahvard, Andrew Ho teaches his students a wise mantra: "Measurement must be qualitative, then quantitative, then qualitative again." The idea is that educational measurement begins with qualitative work (i.e., theory, construct definition, instrumentation), becomes quantitative in the middle, and must return to qualitative at the end (i.e., communicating, interpreting, using results in context). He is challenging his students to understand that, though his course is about applied statistics, the statistical work is situated amongst serious qualitative work.

Obviously, he is right. Obviously, right? I mean…isn’t that obvious?

Educational measurement is not physics or chemistry. It is not some investigation of a natural world that exists independently of human beings and their agreements. Rather, it is a deeply collaborative, deeply social enterprise. Teams composed of people with very different kinds of expertise work together to define constructs, build instruments, gather evidence, interpret results, and communicate findings to audiences who will use them to make consequential decisions. At the roughest level, the work is first qualitative, then quantitative, and then qualitative again. (And within each of those three phases, it breaks down further.)

A Framework for Foundational Competencies in Educational Measurement

This is not new ground. The field's own recent Foundational Competencies in Educational Measurement says explicitly that "educational measurement is a collaborative endeavor that requires people with varied skill sets to work together to design, develop, administer, and evaluate instruments that satisfy specific uses." We made a related argument in our commentary on the FCEM that genuine collaboration requires not just communication skills but foundational dispositions (respect for the views of colleagues who bring different expertise and lenses, and genuine listening) that the document itself omitted.

That omission was not malicious. The task force was composed of thoughtful people who believe in the collaborative and qualitative dimensions of this work. And yet they reproduced, in a document explicitly designed to be broader than psychometrics, much of the narrowness they were trying to correct. That is not a criticism of individuals. It is an observation about how disciplinary training works: the lens your field gives you shapes what you can see—including what you can see about your own assumptions.

That gap—between acknowledging that qualitative work matters and actually treating it as having equal standing—is what this blog series is about.

To understand the gap, two things need to be clear: what an epistemology is, and what a discipline is. I will keep this brief, because these are not the interesting parts. They are the necessary setup.

We are often trained in disciplines. When done well, this training—often in college and/or graduate school—really shapes our thinking throughout our lives. We learn a set of tools. More importantly, we are given a set of lenses to direct our attention to some things and away from others. We are taught the types of questions we are supposed to ask, and what to do to answer them. This includes ideals of what counts as evidence, and most certainly what rigor looks like.

Therefore, different disciplines can look at the same topic and respond entirely differently. Consider the prosaic topic of a supermarket. It could be the topic of an architectural study. An economist might take a shot. Or a nutritionist. A labor relations specialist. A graphic designer. A psychologist. A sociologist. An urban planner. Or someone else. The same topic, but each discipline notices different things, asks different questions and produces different findings. None of them is wrong to use their lens. The supermarket is genuinely all of those things at once, and no single discipline captures it whole. To really understand supermarkets, you need many disciplines. And for these experts to work together, instead of simply ignoring each other in parallel, they must do something very difficult. They must not just acknowledge that the others exist, but actually respect the deep validity of each other's approaches.

(And to actually build a supermarket, you need an almost entirely different set of people to bring in their tools, their problem solving, their skill and craft and rigor. And materials.)

That difficulty runs deeper than it first appears. Underneath each discipline's tools and lenses lies something more foundational: an epistemology, a set of assumptions about the nature of knowledge and truth and how they are arrived at. Epistemologies are not the same as disciplines. A single epistemological tradition can underlie many disciplines. For example, the different natural sciences share a common epistemology. Economics may be quantitative, but it does not exist in the same epistemology as the natural sciences. Disciplines’ ideals of rigor are based in their epistemologies, as are their ideals of evidence and reasoning. A collaborative multi-disciplinary field like educational measurement draws on more than one. The result is that practitioners trained in different disciplines but sharing an epistemological tradition can, with just a little effort and goodwill, recognize each other's rigor, even when the specific methods look quite different. Across epistemological traditions, that recognition is much harder. The other person's rigor does not just look different. It often does not register as rigor at all. Neither disciplines nor epistemologies can transcend themselves. That is structural, not a failing.

And here is the crucial point: people are not their disciplines! People can recognize that their framework is their framework and mighty useful without being blinded to the valid existence of others, even those grounded in different epistemologies. That is, they can recognize that other legitimate ones exist and that the rigor their discipline cannot see is real nonetheless.

In fact, they must. If not, Andrew Ho’s insistence that educational measurement is first qualitative, then quantitative and finally qualitative again is just words—and we know that some epistemologies do not value words. 

Testing Time

Yesterday, I read NBC's version of the iReady attacks. Nuggets of truth, connected to the utterly implausible and even false, by amplifying whining and distortions by…who, exactly?

The lead source is a…tutor. Not a school teacher, not a trained and experienced professional educator with any classroom assessment experience—a tutor. Katelynn Petersen runs a private tutoring operation out of her home in Anchorage, teaches math to homeschoolers on the side, and was a project manager before that. iReady is a school-based platform. How many iReady-using students does she actually encounter? NBC led with her.

Then there's a speech therapist talking about math and reading curriculum and diagnostic assessments. Look, I went to speech therapists for years, and they really helped me. I respect speech therapists. But I would not ask them about core content instruction or assessment. That's simply not their domain.

Third source: an 8th grade student, whining dramatically. I love teenagers. I love their drama. I really do. But finding the real truth in the whining takes knowing them and listening deeply—not just laughing at their quips and calling it journalism.

What I've read across this feeding frenzy is clearly factually challenged and put together by people (e.g., Tyler Kingkade) who lack the knowledge to tell fact from fiction, and any apparent interest in doing enough investigation to do so. After all, you know, clicks.

And then there's this:

"Criticism of i-Ready is a frequent topic on Reddit and TikTok, where teachers describe how i-Ready’s larger benchmark assessments, which students take three times a year, eat up 40 hours of instruction time, or say that pressure related to the software is driving them to quit."

You'd think such a sentence would link to multiple complaints on TikTok. There's only one: Eric Glenn, a middle school math teacher, claiming that iReady's benchmark assessments take up 40 hours of instructional time. The problem is that they don't. 40 hours is roughly two months of math class. That's more than two weeks, three times per year. That's approximately one-quarter of total math instructional time. That's not even vaguely plausible. Maybe he's splitting it across reading and math—but half those numbers aren't even vaguely plausible, let alone accurate. What is Mr. Glenn even talking about? The CEO of the company that produces iReady says it is just 5% of instructional time—which would be less than 1/4 of Mr. Glenn’s fantastical claim.
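For anyone who wants to check that arithmetic, here is a rough sketch. It assumes a 50-minute math period, five days a week, over a 180-day school year; those figures are my assumptions for illustration, not numbers from NBC, Mr. Glenn or Curriculum Associates. It also assumes the CEO's 5% refers to that same pool of math time.

```python
# Back-of-the-envelope check of the "40 hours" claim.
# ASSUMPTIONS (not from the article): 50-minute periods, 5 days/week, 180 school days.
PERIOD_MINUTES = 50
DAYS_PER_WEEK = 5
SCHOOL_DAYS = 180

CLAIMED_HOURS = 40   # the TikTok claim quoted by NBC
CEO_SHARE = 0.05     # the CEO's figure, assumed to use the same base

annual_hours = PERIOD_MINUTES * SCHOOL_DAYS / 60        # math hours in a year
claimed_periods = CLAIMED_HOURS * 60 / PERIOD_MINUTES   # class periods consumed
claimed_weeks = claimed_periods / DAYS_PER_WEEK         # weeks of math class
claimed_share = CLAIMED_HOURS / annual_hours            # fraction of the year
ceo_hours = CEO_SHARE * annual_hours                    # hours implied by 5%
ceo_vs_claim = ceo_hours / CLAIMED_HOURS                # CEO figure relative to the claim

print(f"Annual math time: {annual_hours:.0f} hours")
print(f"Claim: {claimed_periods:.0f} periods, {claimed_weeks:.1f} weeks, {claimed_share:.0%} of the year")
print(f"CEO figure: {ceo_hours:.1f} hours, {ceo_vs_claim:.2f} of the claim")
```

Under those assumptions the claim works out to more than nine weeks of math class, a bit over a quarter of the year, and the CEO's figure is under a quarter of it. Different period lengths or calendars will shift the numbers, but not by enough to rescue the claim.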

Here's the thing: the complaint that standardized testing takes too much time is not new. And it has never had merit. It is usually a sign that people are not willing to engage with the actual merit, quality or usefulness of the tests in question. It is an attempt to disqualify them, regardless of their quality or merit. In this case, it is not even vaguely accurate.

Back when I was a high school English teacher, I gave reading quizzes twice a week:

  • Not announced in advance.

  • The five easiest questions on the reading I could think of.

  • Trade and grade.

  • Collect for me to enter into my gradebook later.

These quizzes had no direct instructional value, nor any diagnostic value. They were purely motivational—the carrot of an easy A for those who did the homework, and the stick of an F or D for those who didn't. Each one only took 5–10 minutes, twice a week. 50-minute periods, five days a week. That's 6% of available instructional time for quizzes with no direct instructional value whatsoever. And yet, no one ever complained about how much time they took. Not my cooperating teacher, not any AP or principal or other supervisor. Not any student, nor any parent. And they weren’t even the only formal in-class, on-demand assessments I used.
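The 6% figure is just the midpoint of the range those numbers give. A minimal sketch, using only the figures above (5 to 10 minutes, twice a week, in 50-minute periods five days a week):

```python
# Share of weekly class time spent on the reading quizzes described above.
PERIOD_MINUTES = 50
PERIODS_PER_WEEK = 5
QUIZZES_PER_WEEK = 2
QUIZ_MINUTES_LOW = 5
QUIZ_MINUTES_HIGH = 10

weekly_minutes = PERIOD_MINUTES * PERIODS_PER_WEEK              # total class minutes per week
share_low = QUIZZES_PER_WEEK * QUIZ_MINUTES_LOW / weekly_minutes
share_high = QUIZZES_PER_WEEK * QUIZ_MINUTES_HIGH / weekly_minutes
share_mid = (share_low + share_high) / 2                        # midpoint of the range

print(f"Quiz time: {share_low:.0%} to {share_high:.0%} of class time, midpoint {share_mid:.0%}")
```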

Kids take a lot of tests in school, most much longer than my reading quizzes. Even little quizzes are usually longer than that. There are chapter tests and unit tests. Semifinals and finals. Sit-down and on-demand performances for the purpose of evaluation. Some formative, some summative.

It always stuns me that people complain "testing" takes so much time, because they never consider reading quizzes or homework checks. They don't count chapter tests or anything that might be considered a pop quiz. Anything that teachers make is somehow exempt from the objection.

Look, I got very little information from those reading quizzes. I could already tell who was doing the reading and who wasn't. Sure, occasionally there was a surprise—some quiet student who actually was doing the work, just quiet for some other reason. But that was rare. Teachers can tell. I thought that they had indirect value (i.e., motivational carrots and sticks).

I have yet to see any large, formalized standardized assessment that actually takes that much time.

There may be legitimate complaints about iReady. Surely, there must be. But not all of them are about testing. The curriculum and instructional supports that Curriculum Associates also sells under the iReady brand may well be genuinely problematic. But concerns about centralized curriculum aren't concerns about assessment. Glomming them together, as NBC does, muddies both arguments.

The time complaint, though? There is not—and never has been—any merit to it.

The Sorting Default: How Educational Measurement Tilts a Question That Belongs to Democracy

Growing up in Virginia, I was proud of my state when I learned of Thomas Jefferson’s vision of universal education (*cough* *cough* for free children *cough*). Then I learned that it was very much a sorting or “meritocratic” system. Universal for a few years, and then the culling begins. Every few years, another selection process—which he called “raking the rubbish”—with the highest performing students continuing their education and the others falling out, eventually culminating at UVa! TJ's original plan was never implemented, but around 200 years later, I took tests in second and eighth grades, each of which opened doors for me that were closed to others.

This system is firmly grounded in one side of one of the most fundamental tensions in education and schooling systems design. To what degree are these systems intended i) to help sort and rank students (and identify high performers) and/or ii) to support universal efforts at human improvement through education? One view sees education through a meritocratic lens that inevitably uses sorting as both a means and ends of schooling. The other views education through a more universalistic lens that sees the means and ends of schooling as being the provision of learning and growth opportunities for each individual student. The former sees merit as a real concern and question to be minded and acted upon, and the other sees universal growth opportunities for children as the more compelling central concern. 

Note that this is not the question of whether schooling should serve the economic market/help prepare students for the world of work. This is not about balancing vocational goals and citizenship goals. Some sorting advocates and universalists can agree that schools should prepare students for economic participation in society, while other advocates from each view can agree that that is not an appropriate aim for schooling. Nor is this about the appropriate topics and subjects to address in schools. Each view can be aimed at a variety of goals or purposes of universal public education. This tension is even deeper than that, and perhaps orthogonal to it. 

Nonetheless, these two different goals are often in tension. In order to address both of them, we put together different programs in our schools. We spend money and time on sorting functions and universalistic development functions. We serve both views, but do so with a bricolage that can make our systems feel a bit incoherent. Perhaps we are serving more students this way, or perhaps it is merely accidental accumulation of programs. Certainly, it fails to serve either view coherently.

This tension clearly exists in educational assessment. Should assessments be designed to support meritocratic sorting, or should they be designed to support universalistic opportunities for students? Is relative standing more important, or is it more important to report on where students a) are thriving and b) need more support?

What disturbs me is that this tension is not always taken seriously by large scale assessment professionals. Instead, there is a strong tendency for some (i.e., far too many) to put a heavy thumb on the scale for the meritocratic sorting side, regardless of the demands of test users. There’s a clear bias in the field in favor of norm-referenced sorting, and underdevelopment of tools that can support universalistic provision of opportunities for students. This unwitting bias pushes potential clients, policymakers, communities and even our society towards one view of schooling, rather than responding to civically determined decisions about what we want for our schools. The field's unreflective bias in favor of the meritocratic sorting view of schooling effectively preempts decisions that belong to democratic processes—without the field even recognizing it is doing so.

The Most Insidious Obstacle to Well-Aligned Items

Items are the building blocks of assessments, and if they do not perform their designated function no purpose or inference from their test can be valid. That is, they must actually elicit evidence of the targeted cognition for the range of typical takers. 

The most insidious obstacle to the production and inclusion of well-aligned items on operational tests has nothing to do with psychometrics. Truly, even educational measurement's original sin (i.e., the a priori assumption of unidimensionality)—my go-to villain for over 15 years—is not as insidious as the idealized student/test taker. It is more insidious because only the idealized student/test taker is so often consciously imagined by item contributors throughout the entire item development process, despite being even less frequently appropriate than unidimensionality.

The idealized student/test taker is a convenient fiction. It is an easy fiction. This paragon of testing virtue is an incredibly attractive fiction. But it is very much a fiction, and it is an enormous obstacle to developing valid items.

What is the Idealized Student/Test Taker (IS/TT)?

The idealized student/test taker learned all the lessons. They were diligent in class and are diligent on tests. They attempt every item exactly as their teachers would want them to, applying the correct technique and knowledge. Their thinking is linear, and clear. They understand the item and what it is trying to communicate. They recognize what they are being asked to do, and they do it as item writers and other developers would want them to.

They do not cheat. They do not do anything—not even unintentionally—to undermine the quality of the inferences we make from their performance. They do not look for shortcuts. They do not apply savvy test-taking skills. They do not look for alternative approaches when they realize that they lack some called for knowledge or skills. They do not have strategies for guessing. They succeed properly as directed, and accept their own shortcomings when faced with challenges beyond their ability. 

The idealized student/test taker remembers everything they learned, but is not advantaged by a particular pedagogical approach or examples they encountered during their studies. They are not tied to how their own teachers asked questions or gave instructions, being equally able to understand other wordings or framings. 

The idealized student/test taker is neither advantaged nor disadvantaged in any way. They always work in good faith. They are beyond frustration or excitement. They are perfectly focused only on what the item gives them.

Therefore, the idealized student/test taker always provides a successful response when they possess proper proficiency with the targeted cognition and always fails to do so when they lack such proficiency.

What is Wrong with the Idealized Student/Test Taker?

The IS/TT is a fiction. They not only are a fiction because no such student or test taker actually exists, but even more dangerously because they promote the idea that there is one sort of “typical” test taker that item writers, content development professionals and other item reviewers should focus on. They suggest that there is a preferred or primary test taker whose views and understandings are most important in our considerations. 

But the fact is that test takers vary enormously. They vary in an enormous number of dimensions. They have different experiences and proclivities. They have had the benefit of a multitude of different instructional approaches and examples. And very few of them really want their test results to accurately reflect their proficiencies. Instead—and quite understandably—they generally want the highest score they can get. Some might be lazier. Some might be more devious. So, there are differences in motivation, but virtually across the board, they would prefer a higher score to a fairer score.

If told what an item is trying to target, they generally would use some other approach if they felt it would give them a better chance to get the points. Real test takers do not see test developers or the items themselves as collegial partners, but rather as obstacles to be overcome.

Moreover, there are many deep contradictions within the idea of the IS/TT. For example, we imagine that they can do every item as we would like them to while also imagining that when they lack the targeted skill, they do not employ strategies to work around its requirement. We also imagine that they lack any particular ethnic or cultural identity, and yet are fully comfortable with the cultural context of the item we are considering. It is not a coherent ideal, but rather a convenient collection of proclivities and behavior.

Why so Insidious?

I do not think that anyone would question anything in the previous section. It is all rather human and very obvious. 

And yet, it is so much easier to write, refine or review items with just the IS/TT in mind. It makes the process more direct and much less uncertain. It allows us to stay in our own perspectives, think about our own intentions. It is cleaner and simpler. 

It requires vastly more energy and work to consider the range of typical test takers. It is just hard to do the real work of radical empathy (i.e., applying cognitive empathy to a range of different types of test-taking personas) to envision how different sorts of test takers might respond to an item. Doing that requires us to go beyond our experiences and even beyond the experiences of people we know. Doing that well is rigorous and demanding professional practice that does not come naturally to many people. 

And so, everyone involved in developing items has a tendency to go back to the IS/TT. We all feel the urge. But items developed for the IS/TT will invite too many alternative paths by which real test takers can produce a successful response without using the targeted cognition. And they will trip up too many test takers who have appropriate proficiency with the targeted cognition but are subject to some other construct-irrelevant issue.

Because our tests are intended for a wide range of test takers, we must reject that urge. We must look deeper and think harder. 

The most important thing we can do when writing, refining or reviewing items is to put the idealized student/test taker aside and see the diversities of test takers who actually will attempt our items. This type of rigorous professional practice is necessary to produce valid items—items that elicit evidence of the targeted cognition for the range of typical test takers.


NCME Does Not Know What Classroom Assessment Is

I went to NCME’s Special Conference on Classroom Assessment a couple years ago. It was stunning how little discussion of classroom assessment there was.

I was in a classroom assessment session at NCME’s regular annual meeting this month. The great Susan Brookhart was there and she pointed out that it wasn’t really a discussion of classroom assessment, at all. Rather, it was about how district leaders, school leaders and teachers might make use of large scale assessment. There was a little hijacking of the session by the audience for a few minutes before the presenters took over again.

I do not point this out to say that there needs to be more room for classroom assessment at NCME. Maybe NCME should just focus on large scale assessment. My work is focused on large scale assessment; it is just my history as a student and a classroom teacher that makes the topic of interest to me. But it’s not my focus. Perhaps NCME does not need to make a place for classroom assessment. This is a very difficult question, and one that requires serious thought and consideration of NCME’s mission.

But it certainly is a problem that NCME doesn’t seem to understand what classroom assessment even is. That is clear, over and over again.

My co-authors and I came up with our demarcation between large scale assessment and classroom assessment when developing our own contribution to that Special Conference in Chicago. To us, classroom assessments are “assessments of any formality and for any use that the classroom teacher has authored—or of which the classroom teacher has the authority to customize/alter contents, presentation and/or scoring” (Hoffman, Glore, Harrison & Wine, 2024). By definition, classroom assessment is quite different from large scale standardized assessment.

Classroom assessment is most definitely not merely classroom use of large scale standardized assessment. Programming on classroom assessment requires having some grasp on what classroom assessment actually is.

Can NCME recognize this?

What Counts as Information in Educational Measurement?

We are finalizing our paper, Unidimensionality: The Original Sin of Educational Measurement, for next month’s conferences. An old idea occurred to me, and I am not sure whether I need to add it or not. Is this truly about unidimensionality, or is it something else?

I am concerned that we exclude items that actually provide incredibly useful practical and policy-relevant information because of a different meaning of “information.”

Test forms generally do not include items whose empirical difficulty does not fall between 0.3 and 0.8—though this range has expanded a bit in more recent years. That is, they do not include items that are so easy that almost everyone would get them right or so difficult that everyone would get them wrong. Such items are excluded from further consideration or inclusion just because they do not fit that range. 

Why do we do this? Well, we do this because—quite technically—these items do not provide a lot of information. Or rather, that’s the psychometric reason that overwhelms all other reasoning. But that psychometric assertion is false.

Such items offer invaluable information about whether some aggregated groups of test takers (e.g., a class, a school, a district or even a state) are doing exceptionally well or poorly on a particular alignment reference. That is, they tell us that a larger unit of analysis is doing an exceptionally good or poor job of teaching a particular standard. If we exclude those items, we will not have evidence of those greatest strengths and weaknesses of curricula, pedagogical approach, professional development, leadership focus, etc.

The psychometric concern is that these items do not provide so much information useful for making student-to-student comparisons. They do not help so much to sort or rank students—the goal of a norm-referenced test. However, if tests are intended to be criterion-referenced, such items provide invaluable information, both about individual students and about larger collections of students. 
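A small sketch may make the two meanings of “information” concrete. The response data below are entirely hypothetical (four made-up schools of 20 students each), and the 0.3 to 0.8 window is simply the range mentioned above. The item's variance, p(1 - p), is one simple stand-in for how much it can help rank students; the per-school proportion correct is the criterion-referenced signal.

```python
# Hypothetical illustration: one item, four made-up schools, 1 = correct, 0 = incorrect.
schools = {
    "A": [1] * 20,
    "B": [1] * 19 + [0] * 1,
    "C": [1] * 19 + [0] * 1,
    "D": [1] * 10 + [0] * 10,
}

def proportion_correct(responses):
    """Classical item difficulty (p-value) for a list of 0/1 responses."""
    return sum(responses) / len(responses)

def item_variance(p):
    """Variance of a 0/1 item; largest at p = 0.5, shrinking toward 0 or 1."""
    return p * (1.0 - p)

pooled = [r for responses in schools.values() for r in responses]
pooled_p = proportion_correct(pooled)
in_window = 0.3 <= pooled_p <= 0.8                   # the usual selection range
school_p = {name: proportion_correct(r) for name, r in schools.items()}
weakest = min(school_p, key=school_p.get)            # school struggling with this standard

print(f"Pooled p = {pooled_p:.2f}, variance = {item_variance(pooled_p):.4f}, in window: {in_window}")
print(f"By school: {school_p}; weakest: {weakest}")
```

In this made-up case the item is too easy overall to survive selection, yet it is the only thing that would show that one school is missing the standard.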

So, we exclude items because they do not help enough to sort students, even as we claim the tests are criterion-referenced. High-level test developers say that tests are designed to deliver meaningful aggregate results, but we exclude items because they do not help us to sort individual students against each other. 

Why do we do this? Because psychometric models benefit from it, not because it helps any important test use. It does not help to deliver actionable or meaningful criterion-referenced information, and it does not help to provide aggregate level reports on areas of success or failure. But it enforces the norm-referenced assumptions of so many psychometric models onto what are supposed to be criterion-referenced tests—rendering them essentially norm-referenced.

I have long hated this, but I’m not sure whether we need to add it to our explanation of the corrupting influence of the unidimensionality assumption in educational measurement. Obviously, the imposition of the technical requirements for norm-referenced assessment onto projects that are supposed to be criterion-referenced is inseparable from the assumption of unidimensionality. But I don’t know whether we should include it in our paper—which we will present to the Cognition and Assessment SIG and post at ResearchGate next month. 

What is “Model Fit” in Educational Measurement?

I think we need to rethink our use of the term “model fit.” Our field has been using it to refer to the question of whether test items fit the psychometric model selected for the test. But this puts the cart before the horse. It takes the model as a given, making model selection the single most important decision in test development. 

But shouldn’t we make sure that the model fits the construct? Shouldn’t the selection of the construct or the building of our construct definition (e.g., a set of state standards) be the most important step? Seriously. Shouldn’t the selection of the construct and the resulting construct definition drive everything else? 

I don’t know if that is more important than test use...but I’m not sure that it’s not. Maybe it is part of test use? Regardless, it certainly should inform model selection.

But we talk about "model fit," and calculate "model fit" and select items based on "model fit"—even when no work has been done to make sure that the model is actually a good fit for the construct. Most importantly, is the dimensionality of the model appropriate for the dimensionality of the construct? Is the structure that experts see in the construct reflected in the model used to measure it? After all, isn’t that the third source of validity evidence in our standards?

In fact—in actual practice—we select a (usually unidimensional) model and then exclude items that do not fit that model, regardless of whether they are good fits for the construct. Items do not get past post-field testing data review, or even if they do they never make it onto actual test forms. And this teaches content development professionals (CDPs) not to write the kinds of items that won’t ever be used—that don’t fit the model. Too many people mistakenly confuse an item not fitting a model for not fitting a construct. 

That’s assbackwards. That’s putting the cart before the horse. 

And what makes it worse is that we rarely actually choose a model. In large-scale educational measurement, the conversation goes something like… “Tell me what model we should use and tell me why it should be unidimensional IRT.” Is that always the extent of the conversation? No, we have to decide about how many parameters we want in the model. But the vast majority of the time, that’s how it goes—and it usually is just implicit.

This stands in contrast to other contexts in which people build statistical models. There, model fit statistics are used to select models and to refine models. Preliminary data analysis is used to guide modelers toward a family of models, and successive work is done to improve a model’s fit to the data—with care for parsimony and not to overfit. But that’s a different relationship to model selection, where model fit statistics are used quite differently.

So, I would love it if we could stop talking about “model fit” when we never do the work to make sure the model fits. Let’s call point biserial a type of “item fit statistic” or just use it for discrimination. And then, spend more time working to select a model that fits the construct. 
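For readers who have not computed one, here is a minimal sketch of the point biserial: the ordinary Pearson correlation between a 0/1 item score and a total score. The response data are hypothetical, and this version uses the plain total rather than any corrected (item-removed) total. Notice that nothing in it refers to the construct. It describes how an item relates to the rest of the test, which is why “item statistic” seems the more honest label.

```python
from math import sqrt

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a dichotomous (0/1) item and a total score."""
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 test takers")
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    cov = sum((i - mean_item) * (t - mean_total)
              for i, t in zip(item_scores, total_scores))
    ss_item = sum((i - mean_item) ** 2 for i in item_scores)
    ss_total = sum((t - mean_total) ** 2 for t in total_scores)
    if ss_item == 0 or ss_total == 0:
        # Everyone right, everyone wrong, or identical totals: undefined.
        raise ValueError("correlation is undefined when either score has no variance")
    return cov / sqrt(ss_item * ss_total)

# Hypothetical data: eight test takers, ordered by total score.
item = [0, 0, 0, 1, 1, 1, 1, 1]
totals = [3, 4, 5, 6, 7, 8, 9, 10]
r_pb = point_biserial(item, totals)
print(f"point biserial = {r_pb:.3f}")
```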

When AI is Shockingly Intelligent

I run all my blog posts by ChatGPT before posting them. It is not my favorite LLM for all things, but it has got standing orders for how I want feedback—I certainly do not want it to rewrite my copy. Plus, it generates the cartoons I ask for. (It doesn’t like the fact that I use the word “impressive” four times in this post, but I’ve yet to figure out good substitutes for any of them. It also doesn’t like that I am quoting it at such length.)

Last week, I posted on a fairly technical topic: the question of population invariance for item parameters. And it pointed out a bunch of typos and offered me the kind of feedback—even pushback—I want from an LLM.

It claimed that one possible counter-argument to what I wrote was that when I was citing discrimination, I was really just talking about distribution. Now, I am not actually a psychometrician, so I could be wrong about this. I did not think I was wrong, so I fixed the typos, added the cartoons and posted the damn thing.

But then I went back. I followed up, in the same chat.

Back to the substance. You say it's just distribution. But if you move the bottom of the distribution up, and there are fewer really low performers, then why would discrimination come out the same? With fewer low performers, there's less opportunity for highly discriminatory items, isn't there?

I knew that I didn’t exactly mean “less opportunity for highly discriminatory items.” I have studied psychometrics and my understanding of the concepts is fairly decent. I meant something a bit different than that, but I was too lazy on a Friday evening to clarify what I actually meant.

It responded shockingly well. I have been rather frustrated with LLMs these past few weeks, with various limitations they have and with their highly problematic personalities/personas. But this answer was impressive. Longer than I would usually quote, but I think it’s important to give you the whole answer.

You’re right to push on that.

If you “move the bottom up” (i.e., raise low performers), then it’s totally reasonable to expect the estimated discrimination to change even if the item itself hasn’t changed—because discrimination is not an absolute property of the item in isolation. It’s an item–population property in practice.

The short answer

With fewer very-low performers, you often get:

  • less spread in ability near the item’s informative region, and/or

  • fewer people in the range where the item separates well, and/or

  • more ceiling behavior (everyone gets it right),

→ and that can reduce the estimated slope (IRT a) or reduce observed discrimination (e.g., point-biserial).

So yes: you can lose “opportunity” for an item to look highly discriminatory in a population that has fewer low performers.

Why that happens (intuitively)

Think about what “discrimination” means in actual data:
It’s basically “how sharply does the probability of success increase as ability increases?”

If the population you’re estimating on has:

  • fewer low-θ people, then you have fewer observations in the region where P(correct) is low,

  • and more observations where P(correct) is already high,

which makes the item look flatter in the data you actually have.

Even outside IRT: if almost everyone is correct, then the item can’t correlate strongly with total score because there’s not much item variance left.

Key point (the one you’re aiming at)

So when someone says “that’s just distribution,” that’s not a rebuttal.

It’s basically conceding your claim:

Item parameter estimates depend on the population distribution.

Now, a psychometrician might respond with the conditional claim:

  • “In the 2PL, a is invariant if the model is correct and the item truly functions the same way.”

But empirically, with finite samples + shifting ability distributions + multidimensionality + instruction-driven changes in strategy, estimated ‘a’ often moves.

So: your instinct is correct—raising the floor can absolutely change discrimination estimates, and it’s part of why pretending parameters are “true values” without specifying population is a problem.

Obviously, telling me that I am actually correct is a mark of a really smart answer—just ask my colleagues and co-authors! But putting that aside…

I was talking about a weird field (i.e., psychometrics). I was talking about one tiny corner of it. And I was not actually expressing clearly what I was trying to get at. It brought in ideas that I had not mentioned. Yes, they were the ideas I was basing my argument on, but I elided them on the way to the higher level things I wanted to say. It was actually talking about psychometric ideas and realities.

It was actually really impressive. It is trained incredibly deeply in a huge breadth of fields, else it could not have given me such an answer.

So, good job, ChatGPT. This time.
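For the curious, the effect described in the quoted answer can be sketched without any real data. What follows is my own illustration, not anything ChatGPT produced: a single 2PL item with fixed, made-up parameters (a = 1.5, b = -1.5), evaluated over two hypothetical populations represented as evenly spaced ability grids. The item never changes. Only the population does.

```python
from math import exp, sqrt

def p_correct(theta, a=1.5, b=-1.5):
    """2PL probability of a correct response (a and b are illustrative values)."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

def expected_item_ability_correlation(thetas, a=1.5, b=-1.5):
    """Expected correlation between the 0/1 item score and ability in a population."""
    n = len(thetas)
    probs = [p_correct(t, a, b) for t in thetas]
    mean_theta = sum(thetas) / n
    mean_p = sum(probs) / n
    # Cov(U, theta) = E[P * theta] - E[P] * E[theta]
    cov = sum(p * t for p, t in zip(probs, thetas)) / n - mean_p * mean_theta
    var_theta = sum((t - mean_theta) ** 2 for t in thetas) / n
    var_item = mean_p * (1.0 - mean_p)   # variance of a 0/1 variable with mean mean_p
    return cov / sqrt(var_theta * var_item)

def ability_grid(low, high, step=0.1):
    """Evenly spaced abilities: a stand-in for a population (an assumption)."""
    count = int(round((high - low) / step)) + 1
    return [low + i * step for i in range(count)]

full_population = ability_grid(-3.0, 3.0)   # includes very low performers
raised_floor = ability_grid(-1.0, 3.0)      # same item, bottom of the distribution moved up

r_full = expected_item_ability_correlation(full_population)
r_raised = expected_item_ability_correlation(raised_floor)
print(f"full range: {r_full:.2f}; raised floor: {r_raised:.2f}")
```

With the floor raised, the same item looks much less discriminating by this classical index, even though its model parameters were never touched. Real populations are not uniform grids, so the exact numbers mean little. The direction of the change is the point.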

In Defense of the Multiple-Choice Item

First off, there is no defense for bad multiple choice (MC) items. The floor on the usefulness of MC items is incredibly low. I am only talking about high quality MC items—which are incredibly hard to write. Admittedly, most MC items are quite bad. 

(For example, do not be confused by the psychometric demand that multiple choice items be quick to answer. That is not inherent to the MC item. They could require quite a bit of work and thinking, if only our psychometrics overlords were not distracted by alpha. The shallowness of most MC items is a product of that demand for speed, rather than anything intrinsic to the MC item.)

Second, I am a former high school English teacher. That’s about teaching students to develop their ideas and teaching them about the value of the writing process for doing it. It is about deeply understanding relatively complex texts—and other human beings. It is about argument and evidence and audience awareness. It is about listening, dissecting and analyzing. So, multiple choice items are really hard for me to love; I have seen so many shallow and otherwise bad MC items that it has been a journey for me to understand their value. 

And yet…I think that good multiple choice items can be quite useful. Perhaps even more importantly, learning to write good multiple choice items is an incredibly powerful exercise for anyone writing any type of assessment. 

Obviously this (new school) begs the question of what a high quality MC item even is. Though there are many many content and cognition traits to a good multiple choice item, I’ll limit the discussion here to just five of them:

* A definitively correct key.

* A set of definitively incorrect distractors.

* Each distractor must be plausible.

* The set of distractors must capture the most common mistakes that test takers are likely to make.

* The stem, the key and the distractors must all be aligned to the learning goal (e.g., state learning standard).

Therefore, writing a high quality MC item requires understanding the learning goal. Obviously. Duh. Furthermore, it requires understanding how students progress with that goal. That’s the only way to know what their most common mistakes are likely to be. All of that is dependent upon knowing how they think—very much a product of the curriculum, instruction and pedagogy they have experienced—at various stages of learning and understanding that learning goal.

Therefore, writing high quality MC items requires deep knowledge and understanding of how lessons are taught—the range of curriculum, pedagogy and examples that students might experience. Some approaches to teaching a given learning goal are going to focus on some types of things and on avoiding some kinds of mistakes. Others will prioritize different things or mistakes. Which metaphor the teacher uses when explaining a scientific concept can invite different misconceptions. And different efforts to clarify the meaning and value of the metaphor will make some mistakes less likely than others. Writing good MC items requires thinking deeply about how students might respond to different instructional approaches.

Good MC items require the most careful attention to the knowledge and skills required to solve the item, which includes mindfulness of alternative routes to successful responses. Again, this leans on deep understanding of students’ cognitive processes—which (again) is strongly influenced by the instruction they have received. It requires student-centered thinking that acknowledges the range of students who might encounter the item. It is not simply about imagining the different responses that students might give, but rather about understanding the different cognitive paths they might take, the various correct and mistaken steps along those different paths. And then, in order to develop a good MC item, crafting answer options that reveal those paths to teachers (or other concerned parties). 

(We call this vital understanding of the various cognitive paths that the range of test takers might take in response to an item radical empathy. It is a pillar practice of content development professionals’ work, perhaps its most rigorous and demanding aspect. It is easily recognized by high quality teachers, but is quite foreign to most other disciplines.)

The defining characteristic of MC items is that they list a set of possible responses. This should not make them easy, as whatever misconceptions students bring to the item should find a welcoming distractor that reveals them. An advantage of this approach is that it signals to students that non-aligned mistakes (i.e., misconceptions grounded in some other lesson) are mistakes, and they should try again. Therefore, the set of answer options acts as a filter to capture just the relevant misconceptions and mistakes—making them more apparent for teachers or interested parties.

Well, sure. Bad constructed response items are bad. It is not really fair for me to compare high quality MC items to low quality constructed response items. But my point is that if the item developer is clear on what they are targeting and what both confirming and disconfirming evidence might look like, MC items are not always inferior to constructed response items—especially if they can provide actionable information faster and cheaper than constructed response items. (That is, scoring and reporting are faster and cheaper.)

The difference is that good MC items require all of this rigorous thinking to be done up front. They demand real investment in thinking through the range of what students might do—the cognitive paths they might take in response to the stimulus and question. Yes, scoring is easy and fast, but that is only because of the amount of incredibly demanding work done up front. Constructed response items that put off that thinking until it is time to score them can be just as bad at providing insight—often what teachers (like myself) stick themselves with.

Obviously, not all learning goals are amenable to multiple choice items. Certainly, the objectives of high school writing lessons are a bad fit. There are real limits within the topic of research design and other aspects of lab science on the usefulness of MC items. But if we can focus on crafting good MC items and abandoning the counterproductive demands of psychometricians, good MC items can be invaluable—both to summative and formative assessment. 

Educational Measurement’s Expertise Problem

I saw a presentation last week on a research project using generative AI in educational measurement. Like so very very very many such projects, the mortar holding the whole thing together was disrespect for substantive expertise. That is, while this disrespect was not a building block of the project, all of the building blocks would have fallen apart from each other if it weren't for the disrespect.

The project was about reading passages and determining their reading level—their suitability for different grade levels of students. I believe it seeks to build towards generation of high quality and appropriate passages for use on large scale assessments. This is an incredibly worthy goal, as stimulus quality and suitability is foundational to item and test quality. Moreover, finding or developing appropriate passages for large scale assessments is very time-consuming and expensive.

However, the research team did not include a single person with experience leveling passages. It did not include a single person with reading instruction experience. It did not include a single ELA content development professional (CDP). These omissions doomed the project to uselessness, just as so many other studies that lack substantive expertise are doomed to uselessness. Neither generative AI nor psychometric methodological expertise is ever going to be enough without substantive expertise.

This study provides a good illustration of many of the problems of relying on algorithms and ignoring substantive expertise, in part because its mortar is not at all unusual in our field.

* CDPs and reading teachers know that the standard algorithmic “readability” measures (e.g., Lexile, Flesch–Kincaid) do not produce results that are reliable or appropriate for students—often underestimating level, but sometimes overestimating. Because they have experience with the outputs of these algorithms in the context of their substantive work, they already know about the suitability and bias of the most easily available tools. 

* CDPs and reading teachers know that idea complexity, emotional complexity and subject matter appropriateness are key determinants of grade level—factors that the standard algorithms entirely ignore. Methodologists (e.g., psychometricians) working alone do not know what is relevant or not in a particular field.

* CDPs and reading teachers know that texts do not have a singular grade level, but rather each span a range of grades (or ages). More emotionally mature and skilled readers can handle texts that less mature and less skilled readers cannot. Therefore, the fact that texts may be appropriate for more than one grade does not mean that they cannot be quite clearly and recognizably at different levels. (Consider one book that is generally appropriate for grades 3-5 and another that is generally appropriate for grades 5-7. The latter is clearly more demanding (i.e., a higher grade level text), even if both are suitable for 5th graders—albeit different 5th graders.) The difference will be immediately recognizable for those with substantive expertise. Outsiders will not even understand the appropriate scale or how instances interact with the scale.

* CDPs and reading teachers know that texts' reading levels vary by geography, in part because different classrooms, schools, districts and even states have different views on what is appropriate at different grades. There are national aspirational ideas, such as at publishers and among leaders of national assessments. But I myself have taught in different high schools with very different ideas of what is appropriate for 9th graders. Amateurs and outsiders are unlikely to understand what is a rule and what is a guideline, what is a deep truth and what is merely a useful heuristic, what is clean theory and what is actual reality.
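To make the point about what the standard formulas ignore concrete, here is a minimal sketch of the published Flesch–Kincaid grade level formula. The function names and the vowel-run syllable counter are my own assumptions for illustration (real tools count syllables more carefully), and Lexile is proprietary, so it is not shown. The only inputs are counts of words, sentences and syllables.

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption): count runs of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level computed from surface counts alone."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fk_grade(text: str) -> float:
    """Score a text using simple regex tokenization (illustrative only)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_from_counts(len(words), len(sentences), syllables)

# Same words, same sentence count; one version is word salad.
original = "The dog ran home. The cat sat down."
scrambled = "Home dog the ran. Down sat cat the."

print(fk_grade(original), fk_grade(scrambled))
```

Both strings receive the same score, because the formula never looks at meaning, idea complexity, emotional complexity or subject matter. It sees only sentence length and word length.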

Perhaps the worst thing about this study was that when I asked one of the team members—the leader?—whether they included any reading teachers or ELA CDPs on the team, she said that it is hard to find such expertise. This is patently false. In fact, I can think of no type of expertise that is easier to find in this country than reading instructors. I believe that you can ask the vast majority of adults in this country where a nearby elementary, middle or high school is (i.e., a school full of reading instructors), and they could tell you. This is surely easier than finding an architect, pilates instructor, sandwich engineer, psychometrician, project manager, sanitation worker, plumber or nurse. No, the problem was not that they could not find any ELA CDPs or reading teachers. No, the problem is that they did not even think to try.

This kind of disrespect for substantive content expertise is not at all unique to this study. Heck, in educational measurement research it is the norm. But without substantive expertise on the research team, there is no way for researchers to know whether what they are seeing is typical or an outlier. There is no way to know whether their simplifying assumptions are too much to yield any useful findings at all. There is no way for them to make sense of their inputs or outputs. With substantive experts on the research team, they can be highly capable collaborations of different expertises that produce meaningful learning. Without substantive experts on the research team, they are merely collections of outsiders making blind guesses without even understanding the questions.

Our assessments are not going to get better while disrespect for substantive expertise remains the mortar of educational measurement research. When you build studies on that foundation, you don’t just miss important details—you lose the ability to recognize what matters, to interpret what you observe, and to know when your conclusions are nonsense. And if generative AI becomes the tool we use to replace the experts we refuse to respect, we will only get faster at producing invalid work, rather than better at producing good work. If we care at all about validity—the degree to which theory and evidence support the proposed uses of tests—we must do better.

Item Parameters Are NOT Population Invariant


In response to my most recent LinkedIn post <https://www.linkedin.com/feed/update/urn:li:activity:7417738278168104960/> complaining about simulation studies that assume that item parameters can have true values without specifying the population, Charlie DePascale—who thinks there were earlier sins in large scale assessment than the assumption of unidimensionality—replied:

I agree 100% that many simulation studies simplify "reality" too much. Also, I agree 100% that the properties of the population being simulated should be specified, along with the method used to simulate it. And, I'm even pretty sure that I agree with what I think you mean by "item parameters are population-specific" but you should probably expand on that statement a bit given that, in general, "Item parameters are considered population invariant.” 

Sure. I’m happy to expand on that statement. And, yes, we will be coming back to the original sin of large scale assessment repeatedly. 

1) Of course item parameters generally are NOT considered population invariant. If they were, there would be no need for DIF studies. There would be far less need for field testing—though there would still be some. There would be less work for psychometricians. Post-field testing data review would not include various DIF flags. Everyone knows that items can have different parameters across different populations.

2) It is not just a matter of the mere possibility of items having different parameters across populations. If that were the case, the threshold for flagging an item would be even lower than it is. We see population differences in every single item—even when we do almost nothing to track relevant population differences. 

From here on, I am not focusing on how populations differ by race/ethnicity, gender or FRPL status. Instead, I focus on populations’ different instructional experiences—perhaps the most important difference between populations.  

3) Imagine two very large school districts that have adopted different curricula. They similarly emphasize some of the standards, but differ in their emphasis on others. Clearly, we would expect items aligned to emphasized standards to have lower item difficulties than items aligned to less emphasized standards…particularly relative to the other district, where these differences will be a bit inverted. Of course, this recognition that instruction and items can be more aligned or less aligned requires stepping away from the psychometric assumption of unidimensionality.

4) Imagine two very large districts that have different instructional approaches with the same official curriculum. Imagine that they differ in the degree to which they devote resources to lower achieving students. One might give them additional instructional time and perhaps smaller classes. (For example, back in my teaching career, I taught a double period ELA class to lower achieving 9th graders.) Or, imagine that they focus on the students just below the proficiency threshold—a well known practice during the NCLB years. This would alter the performance of formerly lower achieving students relative to higher achieving students, altering item discrimination. 

5) Before I go on here, decide for yourself whether you think higher achieving students or lower achieving students benefit more from instruction. Are higher achieving students simply better learners who will use additional instruction on a grade level standard more efficiently, or are lower achieving students faced with cognitive obstacles or barriers that additional instruction can help them to overcome? In the context of a focus on grade level lessons and large scale assessment’s ceiling effects, I think that lower achieving students are more likely to benefit from additional instruction. So, if we have two very large districts, and one of them increases instructional time for all students in a content area—like additional reading instruction for elementary school students…again, item difficulty and discrimination will be altered.

6) Imagine two very large districts that differ in how much attention they pay to past years’ items in the course of instruction through the year. One district presents problems and examples as they have appeared on the large scale assessment in the recent past, and the other focuses instead on higher level thinking skills of more complex problems. Do you really think that item parameters derived from these two different districts will be invariant between them?

7) Now imagine that in all of these examples, it was the same district that simply adopted some policy changes. So, each of these examples is the same district before and after the change, time 1 and time 2. 

8) Or, imagine that these two very large districts are indeed geographically distinct, but that they border on each other. One is Atlanta and the other Gwinnett County, in the Atlanta suburbs. Or the city of Baltimore and adjacent Baltimore County, Maryland. Thus, the districts have rather different population demographics—race/ethnicity, FRPL and perhaps ELL distributions, for example. Try to think of all the ways that two such districts can differ from each other. Internal resources. Parental education levels. Wealth, income and socioeconomic status distributions. Ethnic distributions. Share of immigrant homes and/or households where English—the language of instruction—is not the default language. Do you really think that item parameters will be the same across these two districts? Will they be the same for subgroups within each district? Will they be the same for corresponding subgroups across the two districts? Will relative item difficulties and item discrimination parameters just naturally be the same?

No, item parameters are clearly NOT invariant across populations. Unidimensional psychometric models require items to be population invariant, and therefore efforts are made to only select items that approximate that requirement—to the detriment of substantive item validity (i.e., their ability to elicit evidence of the targeted cognition for the range of typical test takers). And this is only possible because we ignore differences in instructional experiences when examining items for population invariance.
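The curriculum-emphasis example can be sketched as a toy simulation. Everything here is an assumption made for illustration: the function name `simulate_p_values`, the logistic response model, a single general ability per student, and an instructional gain of 0.5 on whichever standard a district emphasizes. It is not an estimate of anything real. It only shows that the same two items can swap relative difficulty when the populations differ in instruction.

```python
import math
import random

def simulate_p_values(boost, n_students=20000, seed=0):
    """Return the proportion correct on each item in one simulated district.

    boost[k] is a hypothetical instructional gain on the standard that
    item k is aligned to. Each student has one general ability, and each
    item sees that ability plus the district's gain on its standard.
    """
    rng = random.Random(seed)
    correct = [0] * len(boost)
    for _ in range(n_students):
        general = rng.gauss(0.0, 1.0)
        for k, gain in enumerate(boost):
            p = 1.0 / (1.0 + math.exp(-(general + gain)))
            if rng.random() < p:
                correct[k] += 1
    return [c / n_students for c in correct]

# District A emphasizes the standard behind item 0; District B emphasizes
# the standard behind item 1. The items themselves are identical.
district_a = simulate_p_values(boost=[0.5, 0.0], seed=1)
district_b = simulate_p_values(boost=[0.0, 0.5], seed=2)

print("District A proportion correct:", district_a)
print("District B proportion correct:", district_b)
```

In this sketch, item 0 is the easier item in District A and the harder item in District B. Neither item has a "true" difficulty apart from the population that took it.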

No, the only way to think that any of this is at all appropriate is to willfully ignore the dimensionality of the tested domain as understood by those truly expert in it and in teaching it. It requires ignoring all the efforts to filter out items that do not fit the assumption and then proclaiming that the resulting data prove the initial assumption (unidimensionality) that was used to filter them out. At some point shortly after I finished grad school, I was invited to help a team I had been involved with previously to shape up the extension of a study to prepare it for submission to a journal. I saw that the central claim of the paper was going to simply be a restatement of a filter used on the data, shifting into a finding that ignored the fact of the filter. I burned some bridges when I asked whether the filtering had been removed.
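The general pattern of a filter turning into a finding is easy to reproduce with pure noise. The sketch below is a generic illustration, not a reconstruction of that study or of any psychometric procedure. The variable names, the sample sizes and the 0.2 threshold are all assumptions. It generates variables that are unrelated to an outcome, keeps only those that happen to correlate with it, and then "discovers" that the kept variables correlate with it.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
n_cases, n_variables, threshold = 50, 200, 0.2  # illustrative values

# The outcome and every candidate variable are independent noise.
outcome = [rng.gauss(0, 1) for _ in range(n_cases)]
noise = [[rng.gauss(0, 1) for _ in range(n_cases)] for _ in range(n_variables)]

all_r = [pearson(v, outcome) for v in noise]
kept_r = [r for r in all_r if r > threshold]  # the filter

mean_all = sum(all_r) / len(all_r)
mean_kept = sum(kept_r) / len(kept_r)

print("variables kept:", len(kept_r), "of", n_variables)
print("mean correlation before filtering:", mean_all)
print("mean correlation after filtering:", mean_kept)
```

The kept variables correlate with the outcome by construction, so reporting that correlation as a result says nothing about the world. It only restates the filter.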

I’m sorry, Charlie. Unidimensionality is the original sin of large scale assessment. It infects so much of the actual practice of psychometrics. In this case—efforts to study the potential of language models to predict item parameters—it has poisoned the minds of incredibly smart and thoughtful people into meaningless research that can only undermine what little validity large scale assessment can currently rightly claim.