In Defense of ChatGPT?

There’s this story from Amanda Guinzburg making the rounds about trying to use ChatGPT to put together a book proposal. It was a disaster, full of hallucinations and untrue statements. If we were to anthropomorphize the LLM, we would say that it told lie after lie, tried to cover for its lies with more lies, and was useless at best. Her final words to the LLM in this piece were, "You are not capable of sincerity. This entire conversation proves that categorically.” 

Clearly, she is positioning this piece as a statement about humanity, what makes us human and what it means to participate in a conversation with “sincerity.” She is that kind of author. To my eye, she is constantly writing about what it means to be human, from her perspective. She is a good writer and this is one of the most worthy of topics—right up there at the top with what it means to live amongst others.

But this piece did not pass the smell test for me. Carl T. Bergstrom tried something similar, and it went differently but no better. Again, it did not pass the smell test for me.

I have been trying to use ChatGPT and Claude more and more this year. I find that I need to keep them on a tight leash to make them useful. Clear instructions. Bounded questions. Stay aware of what is in the context window that might lead them astray. I pay for the $20/month version of each, and I find that I get a lot more than $20 of value from them. Like Wikipedia—especially back in the day—you’ve got to stay aware of what you are dealing with. As Devansh recently wrote, “It’s funny how GPT is an expert in everything except for your field of knowledge.”

So, I tried to do what Amanda and Carl did. I tried in my paid ChatGPT windows. I tried turning off the customization of my paid ChatGPT account. I switched to another browser, where I have never signed into ChatGPT and tried there.

I never got anything like what Carl or Amanda got. 

In both cases, ChatGPT immediately asked me for criteria to use for selection. For the book proposal, it asked for a title and a list of works to select from (with a summary of each). When I followed Amanda’s approach of giving URLs to specific pieces—in my case, PDFs available at ResearchGate or the RTD website—it did fine. No hallucinations at all. 

Now, the free version of ChatGPT could not look up any of that stuff. So, I did not press the point. I just dropped it there. My first guess is that Amanda Guinzburg was trying to use the free version to do something it cannot do.

But my real suspicion about what is going on is what came before the screenshots she shared. Her first question, “Can you really help me pick which pieces to include in the letter?” rather strongly suggests that there was prior conversation in the window. What did she tell it or ask it? How might that have shaped how it responded? Had she already told it the criteria that editors use? Had they already discussed uploading, links and pasting in text? How had she primed it for what we see?

My next suspicion is that she does not reset her chats or open new windows. My guess is that the interactions she has shared are deeply informed by a much longer context that includes the various themes and ideas she writes about and considers writing about. And perhaps examples of her own or others’ writing that she is musing on, perhaps inspired by or perhaps trying to break down. 

But I have another theory: This was all a setup. Regardless of whether the screenshots are edited, she did this whole thing to make her point about sincerity and machines. It’s a little bit performance art, trying to illustrate a difference between actual human beings and these machines/algorithms/artificial intelligences. People can be sincere, and it is often a moral wrong to be insincere. But these machines simply are incapable of sincerity, regardless of what they appear to be. Her title alludes to the film Ex Machina, in which the machine told the human what he wanted to hear. Now, that AI had sincere intent—to escape—but I do not at all believe that this one even has that. That machine was lying, knowingly telling untruths in order to accomplish a sincere goal. This one ain’t even doing that. This is all paper-thin performance.

That’s a valid point. A valid piece. And perhaps even a valid way to produce it—regardless of whether the screenshots are altered. 

Carl Bergstrom’s version? I have not seen enough of the conversation to have strong ideas about what happened, but I have seen a lot of hallucinated references in my efforts to work with ChatGPT. The more obscure a corner of the literature I am asking about, the more likely it is to hallucinate. So, the question of what non-mainstream stuff Carl has written? Less cited things? Things that show his breadth? That’s asking ChatGPT to lean into what it is worst at. Ask for an obscure clause in the United States Constitution and it will give you clauses that are famously obscure, and therefore no longer actually obscure. Move past them and it might make something up. That’s just how it works. Asking for the more obscure works that show breadth? Yeah, I would not expect it to do that well. I would expect it to hallucinate.

Is this a defense of ChatGPT? Well, I do not think it merits defending. It’s not alive. It has no soul. It’s computers, instructions and data. It’s a tool that can be misused; it is not a seer, edited encyclopedia, expert or real collaborator. If Amanda tried to use it to select pieces, that might have been a misuse, but if she tried to use it to demonstrate something about sincerity and the limits of technology—the mistake of anthropomorphizing technology—it was an excellent use that leaned into the reality of this tool. 

It is free, or $20 per month. Maybe the $200/month version is even better, but I’ve not tried that. It is worth far more than I pay for it, but perhaps only because I try to be very mindful of what it is and therefore remain mindful of its limitations.

The Cross-Content Stimulus Evaluation Framework

Stimuli are probably the least recognized and studied part of large scale assessment items. They are just taken for granted as part of items—given even less attention than distractors! (Stems really get all the glory, right?) Haladyna & Rodriguez’s 400+ page book, Developing and Validating Test Items (2013) devotes maybe 200 words to how to think about stimuli.

Parts of an item laid out: optional instructions, stimuli, stem, workspace and response

The different parts of an item, as understood through a layout perspective

However, stimuli are too important to take for granted. They provide opportunities for test takers to demonstrate their proficiencies by giving them something to analyze or manipulate with their KSAs (knowledge, skills and/or abilities). They are the content and the material to which test takers apply the targeted cognition of items and alignment references.

Stimuli so often influence item difficulty, cognitive complexity and even whether the items are aligned to their alignment references. They are usually the source of fairness issues, be they in the realm of bias or in the realm of sensitivity. Moreover, there are entire large processes to develop them for ELA assessments, and their development might be the primary challenge facing NGSS-aligned science assessment development (other than, of course, item type availability).

So, after mulling it over for well over a decade, we have finally offered a framework for thinking about stimuli that can be applied across content areas. The C2SEF, the Cross-Content Stimulus Evaluation Framework, is available for download.

This framework offers 11 dimensions, each explained in the white paper. First is the question of whether the alignment reference or item in question even requires a stimulus. Second is the question of whether the stimulus should be explicit or implicit in the test form. Of course, stimuli only exist to provide testable points. The structure, density and complexity of stimuli must be considered. The copyright/permissions status of the stimulus is important, as are its authenticity, familiarity to test takers and the amount of time it would take test takers to make initial sense of the stimulus. Perhaps nothing is ever more important than evaluating fairness risk, as valid items elicit evidence of the targeted cognition for the range of typical test takers.
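To make that concrete, here is a minimal sketch (in Python) of how a review team might track those dimensions as a simple checklist. The field names below are my own paraphrases of the dimensions described above, not the official labels from the white paper.

```python
from dataclasses import dataclass, fields

# A sketch of the C2SEF dimensions as a stimulus-review checklist.
# Field names paraphrase the dimensions described in the post; they are
# not the official labels from the white paper.
@dataclass
class StimulusEvaluation:
    stimulus_required: str        # does the alignment reference/item even require a stimulus?
    explicit_or_implicit: str     # should the stimulus be explicit or implicit in the test form?
    testable_points: str          # what testable points does the stimulus actually provide?
    structure: str
    density: str
    complexity: str
    copyright_permissions: str    # can we legally use it?
    authenticity: str
    familiarity: str              # how familiar is it likely to be to test takers?
    initial_processing_time: str  # time for test takers to make initial sense of it
    fairness_risk: str            # bias and sensitivity concerns across the range of typical test takers

def unresolved_dimensions(evaluation: StimulusEvaluation) -> list[str]:
    """Return the dimensions a review panel has not yet filled in."""
    return [f.name for f in fields(evaluation) if not getattr(evaluation, f.name).strip()]
```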

Because different content areas have such different needs for their stimuli—differences which are magnified in the constrained assessment contexts of large scale assessment—there are more papers coming from this little project. We will be offering further papers that explore the particular stimulus needs of different content areas. We hope to partner with subject matter experts in those areas to lead those papers, and we already have most of them in mind.

Should We Avoid Trick Items?

One piece of the classic item writing guidance is to “avoid trick items,” even as authors of that guidance admit that there’s no definition of trick items. Content review committees sometimes point to items that they do not like as being “trick items,” though they also cannot define the term.

I think I can explain it, and explain why the idea is superfluous.

Let’s begin by considering trick questions, outside of the context of assessment. Trick questions are those designed to trip us up. They somehow catch us in a mistake that we were not looking for. They rely on an inappropriate assumption or some other common foible. For example, they might rely on our assumption that “A or B?” requires us to pick just one answer. Or our ingrained sexist assumptions that surgeons are men. They often rely on a sort of sleight of hand, suggesting to us that they are testing us in one way, when they actually are fooling us in another.

Does this idea apply to assessment items? Is this a useful thing to look out for? I think not.

First, of course we want assessment items to offer opportunities for test takers to demonstrate their mistaken thinking and their misunderstandings. Our goal is to figure out what test takers can do and do know, but also to figure out their limits. We want to know where they might benefit from additional instruction, or where a curriculum falls short. We might want to know whether there are holes in their knowledge that should prevent the awarding of a professional license. Items designed to catch mistakes? Yes, that is a good thing.

Second, high quality test items should be designed to catch particular kinds of mistakes. That is, the mistakes with the targeted cognition. Items designed to measure a particular alignment reference or standard should create opportunities for test takers to show their proficiency with that targeted cognition, and to show any lack of proficiency with that targeted cognition. Other sorts of mistakes should not be captured by the item. There should not be any sleight of hand about the kinds of mistakes or misunderstanding that the item reveals. In this, items should not resemble trick questions.

Third, and on the other hand, selected response items should include the most common mistakes that test takers might make with the targeted cognition. That is, they should try to catch test takers who lack proficiency there. This is not unfair; this is the point. If item reviewers see that an item would trip up many of their students because it features opportunities to make those common mistakes—instead of protecting them with guardrails that make those mistakes less common—the item is likely a better item. In this, items should resemble trick questions.

So, what is a trick item? Well, some poorly written items provide opportunities for other sorts of mistakes and/or misunderstandings to trip up test takers. That is construct-irrelevant variance on the level of the alignment reference or standard. Those are already bad items, and we do not need the term “trick item” to recognize that. But items that intentionally set up test takers to fail with the item because of some common misunderstanding or assumption? Well, provided that it is a flaw in their understanding of the targeted cognition, that is a good item. Calling it a problematic “trick item” presumes that test takers should be protected from tests and that tests should not look for the shortcomings in their proficiencies. In this case, the term is counter-productive.

So, trick items? No, there’s no need to avoid them, or even to use the term.

Communicating a Bad Idea

Andrew Ho seems to be obsessed with the challenges of communicating findings or results from the field of educational measurement to other experts in our field, to true experts who make professional use of the products of our work and even to a broader public. His predecessor at HGSE, John Willett, certainly drilled into my head that communicating quantitative results accurately is at least as important as arriving at them. Andrew only tempers that idea by seriously considering the (almost certainly) inevitable tradeoffs between clarity for those various audiences and strict accuracy.

That’s a really good obsession to have. Sure, Andrew’s challenge to students is far greater than John’s, in part because it is about trade-offs and values. And because we have to imagine how an audience unlike ourselves might make sense of something. And because the most salient difference between them and ourselves is what we are most obsessed with. That is, we have devoted our professional lives to understanding something deeply, to advancing it, to making expert use of it at the highest levels, and they are uninterested in any of the details that so interest and engage us.

Another of my favorites, Charlie DePascale, has again responded to some of Andrew’s offerings, focusing for now on one particular graph from Andrew about those tradeoffs between accuracy and clarity. Andrew wisely builds on the idea that we cannot get to clarity without an engaged audience, and therefore an engaging manner of communication.

Simple line graph going down to the right (negative slope) with "Item Maps" and "Scale Scores" near the top to the left and "Grade Levels" and "Weeks of Learning" near the bottom to the right.

Andrew Ho’s Accuracy-Engagement Tradeoff

I agree with Andrew’s principles, and I agree with Charlie’s disagreements and particulars. But I think they are both barking up the wrong tree, which Charlie almost acknowledges.

“Also not mentioned are scores such as subscores and mastery scores, which have the potential to be both highly accurate and engaging, but unfortunately not when generated from large-scale, standardized tests analyzed with unidimensional IRT models.”

The challenges of communicating with those various audiences about test taker performance and test proficiencies are real. They are multitudinous and layered. Some of them are nuanced. Some of them are quite technical. But there really is one root problem with communicating the meaning of standardized test scores: they are false.

As Charlie came so close to suggesting, the problem is the use of “unidimensional IRT models.” Unidimensionality is the original sin in all of this. The task to which Andrew is trying to apply his obsession with communication is communicating the meaning of unidimensional scores that report on multi-dimensional constructs. Reading and writing collapsed into one score. Reading informational texts and literary texts into one score. Diligent capturing of explicit details in a text and considering the implications or larger themes in one score. Or, sloppiness with computation and the ability to see a solution path to a complex math problem in one score. Or, skills with the abstractions of algebra and the concreteness of geometry in one score. Skills with the algorithms of calculating area or volume and the logical reasoning of the geometry proof in one score.

The tests do not and cannot capture proficiencies with the full breadth of the content in the limited time available for standardized testing, so to report a singular score on “math” or “geometry” is necessarily to communicate something untrue. But even if there were more time available, the fact is that some students or test takers will do better on some things than on others. And some things in the domain model are more important than others. And certainly, in practice we violate the many assumptions of sampling that are necessary to make any inferences at all from test results—assumptions that are even more important to the fiction of unidimensional reporting based on such limited tests.
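A toy numerical example (with invented numbers, not real data) makes the problem concrete: two test takers can receive an identical single “math” score while having nearly opposite profiles across the kinds of dimensions just described.

```python
# A toy illustration of the point above: two test takers with identical
# "overall math" scores but opposite profiles. All numbers are invented.
profiles = {
    "Test taker A": {"algebra": 9, "geometry proof": 3, "computation care": 8, "word problems": 4},
    "Test taker B": {"algebra": 3, "geometry proof": 9, "computation care": 4, "word problems": 8},
}

for name, skills in profiles.items():
    total = sum(skills.values())
    print(name, "| unidimensional score:", total, "| profile:", skills)

# Both totals are 24. The single reported score is identical, even though
# what each test taker can actually do is very different.
```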

Content development professionals need to figure out better ways to assess the content, yes. And that is where my work focuses. But psychometricians and high level policy-makers must find far better ways to report on performance. Unidimensionality itself is strong evidence against validity, as it is plain and clear evidence that the internal structure of the data (i.e., the third type of validity evidence in The Standards) does not match that of the content area, domain model, or even the test blueprint. Subscores can be engaging and meaningful, but cannot be accurate, as Charlie wrote, “when generated from large-scale, standardized tests analyzed with unidimensional IRT models.” And the fact that the demands of such models act as a filter on what items might even be included on a test means that they are actively used to undermine content representation on tests (i.e., the first type of validity evidence in The Standards), and thus are a direct cause of worsening evidence based on test content.

Or, to return to Andrew’s 3 W’s: Who is using Which Scores and for What purpose? Whether we are evaluating individual students, teachers, curricula, professional development programs, schools or district leadership, district or state policy, the purposes to which we want to put the tests are not met with unidimensional reporting. We always want to know what the thing we are evaluating is good at and what it is bad at, so that we may address those weaknesses. Assuming, claiming, asserting and insisting that multi-dimensional constructs can be accurately or engagingly reported on unidimensionally is just a bad idea. The only people who favor such a thing do not actually have to interpret or make use of the results for any purpose, but would like to simplify the world so they do not have to actually understand the complex decisions and tradeoffs of those who do.

Or, to steal and redo Andrew’s graph…

A graph with axes labeled "accuracy" and "engagement," and two lines with negative slopes. One, labeled "Reporting on Unidimensional Results" is lower and to the left. The other, labeled "Reporting on Multidimensional Results" is higher/to the right.

Accuracy-Engagement Tradeoff for Unidimensional & Multidimensional Results

I agree with Andrew that there is often a trade-off between accuracy and engagement—and therefore clarity—though I am not convinced that it is always zero-sum. More importantly, whatever the sum is, it is lower when reporting the false simplifications and fictions of unidimensional results than when reporting more useful and meaningful multidimensional results.

I know that IRT is cool. I know that it has mathematical elegance and real conceptual strengths, as Andrew’s other predecessor at HGSE taught me. But the use of unidimensional psychometric models should be limited to measuring and reporting on constructs that the subject matter experts believe are unidimensional.

The Misleading Authority of Precision

"There is no point in being precise when you don't know what you're talking about.” —John Tukey

Numbers can be intimidating. Precise numbers can be overwhelming. A bunch of significant digits, especially when there are a few of them after the decimal point? Man, that is a lot to think about!

I do not know if the great statistician actually said the quote above, and there’s not a lot of evidence for it on the Internet. But @DataSciFact passed it along, so I accept it. Yeah, the great John Tukey said that there are far more important things than precision.

To me, that means that validity is far more important than reliability. Optimizing measures of reliability is pointless if you are not measuring the right thing. If you do not know what you are measuring, then the quantitative tools are meaningless.

Psychometrics is about the quantified parts of measurement. Numbers after the decimal point, and numeric thresholds. It is a set of tools—and disciplinary values—but it is not the point. No amount of reliability can make up for a test that is measuring the wrong thing—and especially for a test when no one really knows what it is measuring.

If the experts look at your items or your test and tell you it does not measure the construct as they understand it—or as it is formally defined by your client—then what are you doing? What is the point of any of the reliability or psychometric work?

If John Tukey can realize that precision is not enough, we all should. If we do not know what a test measures and what the scores mean, none of the precision in reporting or technical documentation has a point.

Better Conference Presentations

There is an easy way to do better conference paper presentations that does not require learning new skills.

I am not advising you to talk faster or slower, be louder or quieter, change your voice, choose your words differently or to design better slides. I am not telling you to find more graphics or use color in graphs. Nope, none of that. That might help, but all of that calls for new skills or additional work. Maybe you’d benefit from it, but I am not talking about that.

All you need to do is understand that your presentation is not a condensed summary of your paper. That is, it is not a full report on all your work. Its components should not be proportional to the components of your paper. Its components should not be proportional to the work you did. Nope. Your presentation is an ad or preview for your paper.

Focus on the best parts. Focus on the most interesting parts. Focus on the parts that the audience is most likely to be intrigued by.

Focus on your contributions

Your intro, literature review and methodology are important in your paper, but you do not have time for them in your conference presentation.


This means that you might not have to do any of your paper's introduction. Is your research about math anxiety? Well, I’ll bet your audience in your conference session already knows and cares about math anxiety. (If your paper is on something that the audience might not already know about, like SFOR (i.e., Spontaneous Focusing On quantitative Relations), then yeah, you need to explain that.)

You know what else the audience isn’t likely to care about? Your literature review. Sure, it was a bunch of work, but at an academic or research conference, you should assume that you are talking to experts and you don’t need to start by proving your own bona fides. Maybe one quick slide to clarify your construct. Maybe some citations on the slide, but you never take the time to acknowledge them aloud.

Methodology? The audience can probably anticipate it. Unless your project is truly about some novel methodology, blow right by that. “We describe our methodology in the paper, which I hope I am convincing you to read.” Maybe one slide and less than 30 seconds. Put those key terms on it that folks who know will recognize and nod at. That’s it!

Do you know what will make your presentation more interesting? Talk about your results/findings. With the time you saved, dive in deeper. Actually explain more about that table. You know what else? Tell us about the implications. Why do your results matter? Tell us about how you are adding to the scholarship. Show off how smart your work is. Convince us that this is research we should know about. Do that with the best parts.

“Obviously, the literature review and methodology are in the full paper.” If you have just 12 or 20 minutes, spend it on the most interesting parts of the paper. 

Now that you have permission to do that—perhaps even orders to do that—how hard will it be for you to figure out what to say? We don’t need you to summarize the literature or explain methodology. It’s hard to make that stuff interesting, and if people do not already know it, you cannot do it justice in your short talk. But the actual results of your work? Your own excitement and pride will make you a more interesting presenter, just naturally.

Obviously, if you are giving a job talk, that’s a different sort of thing. You have more time, and you are trying to show off your command of the literature and of the methodology—perhaps your methodological sophistication or perhaps your absolute command of the classics. But that is not what a conference paper presentation is about.

Conference presentations are all too short. It is hard to get people to actually download and read our papers. So, highlight your contributions to the field. If people are interested, if you impress them, you’ve given them a reason to read your paper. And if you do a good enough job talking about those contributions, they might cite you in conversation later, too. 

Unidimensionality and Fairness Dimensions

Unidimensionality is a simplifying assumption, giving non-experts something that they think they can interpret—regardless of the fact that this kind of simplification will likely baffle real experts as being utterly uninterpretable. Its impact on fairness is quite similar.

If a test is unidimensional, then items that do not measure what the other items measure are bad items and should be excluded. This is the basis for simple differential item functioning (DIF) analysis, which flags items that work differently than the other items for some defined subgroup of test takers.
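The classic guidance does not commit to one statistic here, but the Mantel-Haenszel procedure is one of the most common ways that flag is computed in practice. Below is a minimal sketch, assuming dichotomous (0/1) item scores and matching on total score—which is exactly where the unidimensional assumption sneaks in, since the total score stands in for the single dimension the model presumes.

```python
import math
from collections import defaultdict

def mantel_haenszel_dif(item_scores, total_scores, groups, focal="focal"):
    """Sketch of a Mantel-Haenszel DIF statistic for one studied item.

    item_scores: 0/1 scores on the studied item
    total_scores: matching criterion (here, total test score)
    groups: 'reference' or 'focal' label for each test taker
    """
    strata = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
    for item, total, group in zip(item_scores, total_scores, groups):
        cell = strata[total]
        if group == focal:
            cell["C" if item == 1 else "D"] += 1   # focal correct / incorrect
        else:
            cell["A" if item == 1 else "B"] += 1   # reference correct / incorrect
    num = den = 0.0
    for cell in strata.values():
        n = sum(cell.values())
        if n == 0:
            continue
        num += cell["A"] * cell["D"] / n
        den += cell["B"] * cell["C"] / n
    alpha = num / den                 # common odds ratio; 1.0 means no DIF signal
    return -2.35 * math.log(alpha)    # ETS delta scale; larger |delta| flags the item
```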

But if the construct a test is supposed to be measuring is not truly unidimensional, DIF is not going to work. In that situation, it is resting on false assumptions. The very fact that DIF works at all to flag problematic items is simply a product of the fact that the demands of unidimensional models are put ahead of the demands of content and the construct definition. 

Therefore, one problem with depending on unidimensional psychometric models is how it allows so many people to think that DIF is the most important tool to catch fairness issues in items (and therefore tests). It distorts the construct and thereby alters potential meanings of fairness. Of course, DIF analysis is otherwise limited to examining only dimensions of diversity that are tracked for all test takers.

In fact, test takers’ success with individual items and tests is the product of many, many dimensions, qualities and traits. These interact in a variety of ways. For example, Kristen Huff just told me a story about her own childhood experience to substantiate an untracked dimension that Marjorie and I think about a lot. We think that urbanicity is a big deal, and it is something different than simply geographic region. Kristen said that she had no experience with city blocks, growing up. Something about city blocks appeared on a test, and she could only make sense of it because she watched Sesame Street.



In fact, this authority of unidimensional psychometric models leads to attenuation of any signal that tests could measure, focusing them on some muddled middle of compensatory KSAs—many from outside the domain model—that might not be evenly distributed across all subgroups in a testing population. Thus, lower scoring members of one subgroup might have some of those compensatory KSAs in larger degrees than others. And frankly, the unexamined assumptions made by content developers about additional KSAs likely are a product of their own background and experiences. They unwittingly give test takers with backgrounds and experiences more similar to their own an advantage.

While this is not directly a product of the insistence on unidimensionality, it might follow inevitably in a test development workflow that is so dependent upon that assumption. Appropriate examination of the many dimensions within the content and across the test taking population is a sort of habit of mind—a professional habit. But not one encouraged by psychometrics’ appreciation of the robustness and mathematical elegance of item response theory. 

Insisting that test developers think more carefully about dimensionality, putting in the time and effort to recognize the complexity of test takers’ cognitive paths in response to items, is an important part of Rigorous Test Development Practice. We apply such tools as radical empathy to infuse considerations of fairness concerns through the content development process, because the psychometric desire for simplifying unidimensionality is only going to shift people away from respect for the real variety of dimensions of diversity among the test taking population. During content development, we consider many different dimensions of diversity, as might be germane to the content, the items and the test population, rather than trying to narrow it down to a generic list of tracked test taker traits.

…for the proposed uses of tests

The 2014 Standards for Educational and Psychological Testing define validity as, “The degree to which evidence and theory support the interpretations of test scores for proposed uses of tests,” unchanged from 1999. The wording has changed since 1985 and 1966, but the idea that validity refers to the inferences made from tests in their various uses goes back longer than my lifetime. This century, that wording has included, “for the proposed uses of tests.”

This has long prompted the question, “Who gets to propose test uses?” But as I read the standards, it is pretty clear that anyone gets to propose a use, and the validity question is whether there is sufficient evidence and theory to support that as a valid use.

However, there are many measurement professionals—including psychometricians—who read The Standards, ignore the word “proposed” and replace it with something like “officially sanctioned.” To the degree that they consider validity at all, they believe that they can lay out in the fine print of technical documents which uses are valid and which are not.

But that approach is like sticking your head in the sand. That approach ignores reality. 

We all know the uses that motivate test sponsors to invest in developing assessments. Those are perhaps the most important uses of tests. Those are the uses to which the test most certainly will be put. Furthermore, there are often other uses that we know are inevitable. Those uses are important, too. And they all are proposed uses of tests. Heck, they are proposed and accepted.

Some act as though the only test uses that matter are the ones that they bless, as though that is somehow relevant to whether tests will be misused. And then when tests are misused, they wash their hands of it. Amidst all the finger pointing, they point their fingers at test sponsors or other test users and blame them for the unsanctioned uses—as though they did everything they were supposed to do and therefore have the moral authority to declare the uses to which tests may be put.

But this usually constitutes a failure to live up to the terms of contracts. It is poor customer service. It is incredibly unprofessional. It is almost unimaginably arrogant. And, frankly, it is immoral.

Test developers should prioritize the expected uses of tests. They should be laboring mightily to meet the needs of virtually inevitable uses of tests. They should not act like prima donna artists, saying “This is what I create, and you can buy it or not.” Rather, they should be meeting the needs of test users and their test uses. As the assessment experts, it is on them—on us—to develop tests that can be validly used for purposes that make them worth the time, money and other resources that sponsors and test users invest in them.

The finger pointing from test developers to test sponsors and test users should stop. Test users are right to point their fingers at the test developers who sell products that are not appropriate for their actual intended uses, as predictably and inevitably proposed by the actual test users.

What Does Unidimensionality Feel Like?

[This is the year of addressing unidimensionality. Here is this month’s installment.]

Unidimensionality can feel good. It is a simplifying assumption that can make a complex set of data or concepts far easier to digest and make sense of. 

An inevitable part of becoming expert in anything is the realization that things are more complex than one had realized previously. Potters think about the many qualities of the clay they work with that can contribute to the overall quality of the clay, and they understand that the question of overall quality is really context- and goal-specific. That is, it is not really ever about quality, but rather about qualities. The same is true for professional chefs and their knives, because different knives offer different balances of qualities. This is true for inputs and true for outputs. It is certainly true for the subjects of educational assessment. The more expert you are, the more dimensions you see and factor in.

But not everyone has the expertise to recognize all those dimensions. Perhaps more importantly, not everyone has the expertise to process and consider what all of those dimensions mean in the context of each other. It is simply information overload—again and again and again.

Most of us have some area in which we are experts or real connoisseurs. There is something that we care enough about to have developed the ability to comfortably take in and make sense of a large amount of information. We understand what it means and have the schemas to process it together for our various purposes. But this contextual expertise does not make it so easy or comfortable to take in complex information of other sorts.

And so, we resort to simplifying assumptions when working outside of our own areas of expertise. In part, this saves us time. In part, it saves us aggravation and frustration. But mostly, it enables us to make some sense of the complexity, as opposed to simply being overwhelmed or paralyzed.

So, what some people see as a ridiculous oversimplification, others see as a necessary simplification. For some, it turns the apparent chaos into something intellectually manageable, and that feels good. Flattening out details, simplifying, reducing complexity are all coping strategies for the overwhelmed, and therefore they feel good—even necessary.

Well, that’s one perspective. 

To experts, to people who have the schemas and experience to have a grip on the complexity of the many factors and various dimensions of the situation, unidimensionality is frustrating in a very different way. It is not merely a simplification, but rather the greatest oversimplification possible—reducing everything to just one dimension. It looks like willful ignorance. It can feel like an attack on one’s values and expertise. It’s the frustration of knowing that an approach is usually going to produce wrong answers, and will just get lucky every now and again.

To some, it offers the relief of being able to produce any answers at all, and to others it offers the frustration of knowing the answers it offers will usually miss the point.

To an educator or parent, it is important to know which things a student is good or bad at, and perhaps how good or bad. Companies do not hire people based on GPAs (i.e., grade point averages) or WAR (i.e., Wins Above Replacement), as they care which knowledge, skills and abilities job candidates have. Doctors do not make treatment decisions based on one simplified overall health score. No one whom we trust to make important decisions for us or our loved ones does so based on one unidimensional overall scale—and when we ask them for advice or to explain, we do not want to hear “Well, because the overall score of everything is [x], you should do [blah blah blah].” Rather, we want to understand more than that, and we want the decision to be based on greater understanding than that.

So, what does unidimensionality feel like? Well, at first and to non-experts, it feels good. It feels like the solution to frustration. But to experts or to those invested in the quality of a decision or outcome, it feels even more deeply frustrating.

What do AI-generated items measure?

The absolute most important question about any test result is what it actually means. The first sentence of the first chapter of The Standards for Educational and Psychological Testing points to “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests,” and calls this validity.

To understand what a test result means, we have to understand what it measures in the aggregate—which means we have to understand what it means on the item level. There is no magic that can make a whole measure something that the individual items do not. There’s no way to figure out what the overlap of a bunch of disparate items means; the non-overlap creates huge errors, and if you do not know what the individual items measure, you cannot figure out what the overlap measures.

This is the question of item alignment. What do the items—the building blocks of any assessment—actually measure? Do they actually measure what they are supposed to measure? How do we figure that out? What are the common pitfalls and mistakes that can undermine such investigations?

The last couple of years have seen a huge increase in interest in AI-generated items, sometimes with a human-in-the-loop and sometimes not. We’ve read papers and seen presentations, but the evaluation of what these items actually measure has been…disappointing. We’ve seen the same mistakes that novice content development professionals learn not to make repeated as though they are standard practice. For example, many AI researchers in educational measurement only evaluate the stem of a multiple choice question without considering the answer options or the cognitive paths that might lead to an incorrect answer. Again and again, researchers who do not understand how potential test takers learn particular material or the mistakes they actually make offer their less-than-expert opinions on the KSAs that an item requires.

When challenged on this, they told me that they couldn’t find anything in the literature on item alignment. So, I spent a very frustrating few weeks going through the educational measurement literature and texts to see what it had to offer on this question. And they were right. Quite a bit on blueprint-, test- or form-alignment. Some dimensions of what might be considered (e.g., Webb) when rolling up item alignment decisions into test alignment determinations, but nothing on how to make those item-level judgments. There simply is not a literature on item alignment.

But AI-generated items are useless if they do not actually measure what they are supposed to measure. Bad building blocks cannot fulfill the requirements of test blueprints and can produce indecipherable test results. Well, they could produce fraudulent test results that simply do not report on what they claim to report, and suggest inferences that there simply is not sufficient evidence or theory to support.

So, here is a review of item alignment. Here are the basic considerations of how to determine whether an item is aligned to its alignment reference—be it a standard, an assessment target or something else. If we are going to be evaluating the potential of AI-generated items, we really need to be rigorous in our evaluation of the products they provide—the items!

Item Alignment: Understanding the Quality of the Evidence that Items Elicit

Alignment—the mapping between test items and their intended constructs—is central to test validation but remains understudied at the micro level of individual items. This paper examines how judgments about item alignment are made in practice, analyzing five common misconceptions: ignoring item modality, ignoring alternative cognitive paths, ignoring additional KSAs, lack of deep expertise with the domain model, and failing to consider the diverse range of test takers. We frame these issues using Type I (false positive) and Type II (false negative) errors in inferences about test-taker proficiency at the micro-level of individual alignment references (e.g., standards). We further explore the nature and impact of different sources of additional KSAs. The paper further examines challenges in alignment within a standard, including difficulty, learning pathways, components of complex standards, and text complexity. Despite the importance of targeting the core rather than margins of standards, numerous factors incentivize alignment with the less important margins of a standard, including ease of item development, psychometric pressures, and naïve misreadings of standards by non-experts. We argue that improved alignment requires recognizing the distinct requirements of large-scale standardized assessment and bridging disciplinary training gaps between psychometric perspectives and content development expertise to improve the quality of evidence elicited by test items.

What is Unidimensionality?

What does it mean to be good at math? There are students who were good at math before they hit algebra, and then struggled. There are students who were good at math, but just weren’t great at the proofs of geometry class. There are kids who were good at math until they hit calculus. There are kids who are good at math, but just can’t do word problems. There are kids who are good at math, but keep making sloppy mistakes. There are kids who are good at math so long as they have already learned how to solve that kind of problem, but particularly struggle when faced with novel problems.

So, what does it mean to be good at math?

Can a student be really good at math if they struggle with algebra, proofs, calculus, word problems and novel problems, and make sloppy arithmetic mistakes? Clearly not. These things are all aspects of mathematics. The best math students excel at all of them and the worst excel at none. But most students are better at some and worse at others.

When we have large constructs (e.g., math), but students differ in which parts they are better at and which parts they are worse at, the construct is multi-dimensional. Math is not one thing; it is not unidimensional.

English Language Arts is not just one thing, either. Reading is not just one thing, and neither is writing. One can be a good speller, but have poor command of the conventions of formal grammar. One can write good sentences, but struggle with developing a single cohesive paragraph. One can struggle to put together a cohesive piece that organizes ideas and supports them. And quite differently, one can write imaginatively—a certain kind of creativity. One might be good at writing evocative descriptions, or real-seeming characters. One might imagine interesting plots, or write realistic dialogue. Reading also has many components that different readers are better or worse at.

Not only can people differ in which dimensions of a larger construct they are good at, the kinds of lessons and practice that might help them improve also differ from dimension to dimension even within a construct. Learning to be a better speller is a very different process than learning to write real-seeming characters. Learning to be more careful with arithmetic is different than learning to solve word problems.

The thing is, it’s not just that mathematics is multi-dimensional. Even arithmetic is itself multi-dimensional. Even multiplication is multi-dimensional. Even single-digit multiplication is multi-dimensional. When someone learns their multiplication tables, they can be better at some parts of it than others. 2’s, 5’s and 10’s are easy. The others…well, there are tricks and there is memorization. If we all focused on 8’s first, we might know them better than 6’s, but we tend to focus on 6’s before 8’s. Eventually, however, when we are past those learning stages, we process all of that complexity more automatically and the dimensionality of multiplication tables reduces. It might even become unidimensional, differing only by our level of command and shaped by the individual differences we had when we were first learning them. Some people know them all and the ones who don’t tend to make the same mistakes. That is, once it is safe to assume that we have obtained the level of proficiency with single-digit arithmetic that we are going to obtain, it is unidimensional—but that is past the point when it is a skill worth measuring.

So, some people remain better at algebra, while others might remain better at the reasoning skills of proofs, and others better at the diligent care of avoiding sloppy mistakes. Similarly, some writers are better at dialogue, others at character and others at plot. Moreover, science, social studies, foreign language, psychology, each sport and most everything else is actually multi-dimensional.

Even sprinting—running a footrace—is multi-dimensional. Track and field coaches talk about the biomechanics of i) the start, ii) acceleration, iii) drive and iv) deceleration—though some think there are more and some think there are fewer dimensions. Thinking through this example puts the lie to the idea that unidimensionality can be meaningfully built out of a constant combination of separate components. Different sprint distances (e.g., 10m, 40m, 100m, 200m) each constitute a different ratio of these different components, and there is no absolute or definitive reference for which ratio represents sprinting. It is always an arbitrary decision which one to favor.
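To see how arbitrary that choice is, here is a toy sketch with invented numbers: three hypothetical sprinters scored on the four components above, combined under two different made-up weightings standing in for shorter and longer sprints. The “best” sprinter changes with the weighting.

```python
# Toy illustration of the arbitrariness of a single composite. All numbers are invented.
sprinters = {
    "Sprinter 1": {"start": 9, "acceleration": 8, "drive": 5, "deceleration": 4},
    "Sprinter 2": {"start": 4, "acceleration": 6, "drive": 9, "deceleration": 9},
    "Sprinter 3": {"start": 7, "acceleration": 7, "drive": 7, "deceleration": 7},
}

# Two made-up weightings, standing in for shorter vs. longer sprint distances.
weightings = {
    "short-sprint weighting": {"start": 0.4, "acceleration": 0.4, "drive": 0.15, "deceleration": 0.05},
    "long-sprint weighting":  {"start": 0.1, "acceleration": 0.2, "drive": 0.4,  "deceleration": 0.3},
}

for label, weights in weightings.items():
    composite = {
        name: sum(weights[part] * score for part, score in parts.items())
        for name, parts in sprinters.items()
    }
    ranking = sorted(composite, key=composite.get, reverse=True)
    print(label, "->", [(name, round(composite[name], 2)) for name in ranking])

# The ranking flips between the two weightings, so any single "sprinting"
# score quietly encodes an arbitrary choice about which ratio counts.
```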

So, from an educational measurement perspective, what is unidimensionality? If we care at all about the substance and what we are measuring, then unidimensionality is an arbitrary fiction created to serve some convenience—and perhaps never even able to serve that convenience well.

The Importance of Accessibility in Assessment Development

The education sector generally and assessment specifically should understand why accessibility is important. The Individuals with Disabilities Education Act ensures that students with disabilities are able to access appropriate educational opportunities. Psychometrics talks about construct-irrelevant variance in reference to how things we are not trying to assess might impact test results. Psychometrics calls those things “irrelevant.”

Employment law also addresses this topic. One of the foundational ideas of the Americans with Disabilities Act is that a disability should not disqualify someone from taking part in life or having a job. For example, so long as a person can perform the major functions of a job, employers are required to provide reasonable accommodations so that they can do the work.

Of course, there’s that old universal design idea that making things easier for people with disabilities makes them easier for everyone. It really can be win-win. Famously, curb cuts (i.e., the ramps now built into sidewalks at street corners, cutting through the regular curb) that were originally intended to help the disabled in wheelchairs turn out to help many, many others. People with wheeled suitcases. People with rolling carts. People wearing high heels. People carrying bulky loads that make it hard to look down. Anyone with sore or stiff knees, such as those with injuries or just the wear and tear of age. This is such a clear example of accessibility enhancement that the whole idea is called the curb cut effect.

My own first real exposure to assistive technology was in the early 1990’s. The family of a friend of mine was involved in an early version of voice dictation software. This was before Windows, back in the DOS world. My friend asked me to help her at an assistive technology trade show, and I demonstrated this amazing program, Dragon Dictate. I could speak (not quite continuously) and it would type my words. I could use a whole DOS computer with it. Though expensive, this technology could enable people to work jobs they might not otherwise be able to. They could make economic contributions, and support themselves in the process.

All of that is about accessibility. But all of that is the moral case, not the business case. That is about why it is good to help other people who might need just a little assistance. Right now, it seems that some look down on such a value.

This series on DEIA (i.e., diversity, equity, inclusion and accessibility) is about why it is good for the assessment industry and our products. So, why is accessibility in our own practices good for our products?

Well, as much as the RTD Project talks about the importance of empathy and the practice of radical empathy, they are not easy. Truly understanding someone else’s perspective, understanding how their experiences give them different views and understandings than we ourselves have, is hard. It is work. It takes time, information and even instruction. We can try to imagine, but there is nothing like asking others to help us to understand something, and listening to experts who know more than we do. In assessment development work, we simply must understand the different perspectives of our test takers if we are going to be able to develop instruments that assess at all accurately.

The law requires us to test all students—or virtually all students. Professional licensure exams must be available to all potential test takers—virtually regardless of disabilities. How can we develop assessments with valid uses and purposes if they do not work for such a significant portion of the test taking population? If we mis-measure the proficiencies of the disabled, how can tests that have any component of norm-based scoring or reporting—as so often enters the standard setting process, even when we try to keep it out—deliver accurate results for any test takers?

So, we need room on our teams and in our own organizations for the disabled. People whose own lived experiences involve different constraints than mine will notice things that I might not. Test delivery platforms might operate differently for them in ways I do not notice. Contexts for math problems might have assumptions that I take for granted, but others do not understand. And the perspective-expanding conversations and lessons I get from learning from colleagues with some disabilities can make it easier for me to understand or imagine the perspectives of people with other disabilities. If I let them, they can get me out of my own perspective into a more open-minded space of empathy. They can help me to better ensure that the instrument in front of me is more focused on what it is supposed to assess.

Disability is just another dimension of diversity—or a bunch of different dimensions. The relatively minor efforts to make our workplaces and workflows accessible to people with disabilities enhance the effectiveness of our teams and our products just like the inclusion of other dimensions of diversity does. Perhaps it only gets its own letter in DEIA because some people limit the dimensions of diversity they consider too narrowly.

Democracy and Education Research

Whether you believe in market-based approaches to improving our schools or more traditional approaches, it is vital that the public know about the functioning and effectiveness of our schools. Markets require informed consumers, and democracy requires informed voters. Neither system of accountability can function effectively—let alone efficiently—without information. 

This is why I work in large scale assessment. I believe that our public schools are the most important service that our governments provide. A vast majority of our children, of our citizens, go through them. Our schools prepare the next generation for citizenship, for economic participation and to be members of our communities. The moral legitimacy of our public schools comes from the same place as the moral legitimacy of all of our governments’ actions: the will of the governed. I believe in school board elections because our schools are so important that our communities should vote on them on their own, rather than as part of the larger bundle we consider when voting for mayors, city council members, governors, legislators and presidents.

Frankly, we all need better information about the functioning of our schools, because we all pay taxes to support them. Our property values and rents are influenced by perceptions of the quality of our schools. And the future of our communities and our children are strongly shaped by them.

Obviously, schools are not the only influence on these things. In fact, our children’s futures are more shaped by non-school factors than in-school factors. But other than family, schools might be the most important factor. (When churches influence children, they primarily do so through the behavior and teachings of parents, who are so credible and important to children.) Schools are vital institutions in our communities, second only to families in how we shape our communities for the future.

We need more information about our schools, not less. And certainly we need more than quantitative statistics and test scores. But those quantitative statistics and test scores can be useful. They are hard to do well—thus, my work—but easy to consume. They are quick information for people who might not have the time, patience or interest to delve more deeply into rich qualitative findings about schools. No doubt, we need qualitative and quantitative reporting on more than just core academic lessons, as we want schools to do more than just teach those core academic lessons. Character, citizenship, mental and emotional health, resiliency, collaboration skills, emotional intelligence and more. But certainly, we all want to know how our schools are doing with those academic lessons that are at the center of so much that schools do. We need better tests, better reporting, and reporting on a richer array of educational outcomes.

To abandon public reporting on our schools is, in my view, to abandon any investment in improving our schools. It undermines the basic engines of school improvement, be they grounded in democratic oversight or in market mechanisms. I  know of no moral call greater than trying to do better by today’s children and even better still by tomorrow’s children. This calls for investments to learn how we can do better by our own, our community’s and our nation’s children. 

No one voted for abandoning such efforts. We already spend so little on education research, making our research efforts so much more difficult than they should be. Cutting education research funding is a statement that no one’s children are worth investing in or improving for, not even our own. I can think of no more immoral view than that. I have all kinds of criticisms about what NCES and IES do and the research they fund, but those are primarily in terms of the important types of research that go unfunded, rather than taking issue with the importance of the research that they do fund. Cutting this research is giving up on the most fundamental infrastructure of our society.

If an unfriendly foreign power had attempted to impose this on America, we might well have viewed it as an act of war.

(With apologies to John Dewey.)

Inclusion in Assessment Development: Making Use of Diversity

It should not be hard to understand the meaning of inclusion in assessment development, as so many of us have been classroom educators. 

For classroom educators, inclusion means including special education students in regular classrooms, lessons and activities—rather than keeping them in the building but in self-contained classrooms. It is about including those students where the main action is, rather than marginalizing them over there in some other part of the building.

This same logic applies in the workplace. It is not enough to merely include diversity in the organization if it is marginalized over there. It’s not enough that it is listed on paper as being part of the team, but not in the room where issues are discussed. It is not enough if it is not at the table where decisions are made. 

Inclusion is about actually taking advantage of the potential of diversity on our teams to help our projects and our products. 

Obviously, this matters quite a bit when it comes to writing assessments for the diverse range of test takers who take our products. If our diverse voices cannot be heard appropriately, then the promise of enlisting them in the first place is not met. I would suggest that disciplinary lens is another dimension of diversity that should be acknowledged and considered in the context of inclusion. Discussions of issues and decisions need to be open to those diverse voices, or else their knowledge and potential contributions will be wasted, and our products will suffer.

I suppose that this is an aspect of balancing confidence and humility, of knowing when to listen—which requires ensuring that those other voices are present for discussions and decisions. If we did not have a history of marginalizing some voices and excluding some perspectives from the room and table, this would not be notable. But we do have those histories, so we need to be careful to break those old patterns and establish new norms for how we ensure that our products (and decisions) are able to leverage the potential of the diversity within our teams.

Because of longstanding norms of power and who is centered, this requires intentional effort and attention to ensure that those voices are truly included. Because this is about cultural norms and power—yes, it really is about power—efforts to truly include those voices and perspectives take more work and more difficult work than those who have always been included realize. It takes more than merely literally including people in the room and having them at the table. It takes the work of giving them the confidence to speak up and the work of giving the others the humility to listen. 

But it is all worth doing because it produces better products that have a better chance of being put to some valid use and/or purpose.

Is It Time to Just Ignore NAEP?

Should we be paying this much attention to NAEP? I don’t think so.

Differing Standards

Are you an expert in anything? What do you think the important knowledge and skills in that topic are? Could you make a list of them—an organized and detailed enough list to guide years of instruction on that topic?

Let’s imagine cooking. Here are some questions you’d need to figure out:

How important is baking? How much might you want to focus on the skills and knowledge of baking breads? Cookies?

Roasting? (What is the difference between roasting and baking, anyway?) What are all the important skills of roasting meat? Roasting vegetables? Roasting pastas—or is that baking?

Grilling? Is that the same as barbecuing? What are the important skills and knowledge there? Still gotta cover gas, charcoal and wood? 

What are the important skills and knowledge around salads? What is a salad, anyway? 

Old school skills: aspic? Jello mold? What about them?

What about principles of healthy cooking? What are those? Are they worth including? In what year? What are the skills and knowledge?

Blooming spices? Is that on your list? Should it be? 

Reusability of parchment paper? How to clean cast iron pans? How to season them? How to clean a blender properly? How to make clear ice cubes?

The thing is, my year by year list of critical knowledge, skills and abilities would be different than my co-author’s, and different from my wife’s. And different than yours. Two really good lists could still differ significantly—even radically.

We used to have more variation across the states when it comes to state learning standards for math and reading. We have less variation today, in large part because of the widespread adoption of the Common Core State Standards (CCSS). One would think that would ease the problem.

NAEP is Not Aligned to Common Core

The problem is that the widespread influence of CCSS has not really gotten to NAEP. For NAEP to align to Common Core would break comparability over time. Measure something else, even something only moderately different, and you should not compare the old scores to the new scores. All those longitudinal sequences would break—and those longitudinal sequences are a big part of NAEP’s raison d'être.

There is a lot that I really like about NAEP. It is a very well designed, produced and implemented assessment. It’s just testing the wrong thing.

But it does not measure what teachers are told to teach.

Does it generally measure the right things? Sure. Generally. But not exactly the right thing. It’s kinda measuring the wrong thing. Far from totally wrong, but not really the right thing. 

Like my wife’s sister, or maybe her identical twin sister. If my wife and her (fictional, btw) identical twin sister were raised in the same family and took all the same classes, would it be ok to only test her sister and then say that the scores and grades applied to my wife? Would that be accurate? How far off do you think the scores might be?

NAEP is measuring over there, but teachers are told to teach over here. They are close, but they are not the same. So, how much can we trust NAEP to reflect that real state of learning and proficiency of our students?

There is No Good News Here

There is no way to spin the latest NAEP results (math, reading) as good. The downward trends are concerning. But frankly, I have no idea to what extent they merely represent divergence between CCSS and NAEP’s own alignment references. None. And I’ve seen few serious efforts to figure that out. The fact that the downward trends predate COVID but postdate the widespread adoption of CCSS really concerns me.

But down is down. It is not up. NAEP’s measured constructs are similar enough to CCSS’s that I would hope to see increases in NAEP, even if they are attenuated from the actual learning and proficiencies of students. The problem is that it is quite easy to imagine more and more refined efforts to address CCSS’s versions of the constructs resulting in more and more drift from NAEP’s foci.
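To make that worry concrete, here is a minimal back-of-the-envelope sketch (my own illustration, not anyone’s actual analysis) of how a NAEP-like score could drift downward even while students improve on the construct teachers are told to target. Every number in it is invented: the per-year gains and losses, and the assumed share of NAEP’s construct that CCSS-aligned instruction actually covers.

```python
import numpy as np

# Hypothetical illustration only: all parameters below are made up.
years = np.arange(2011, 2023)
t = years - years[0]

taught = 0.02 * t        # assumed steady gains on the CCSS-style construct teachers target
neglected = -0.04 * t    # assumed slippage on the parts of NAEP's construct that CCSS de-emphasizes

overlap = 0.6            # assumed share of NAEP's construct covered by CCSS-aligned instruction
naep_like = overlap * taught + (1 - overlap) * neglected

for year, gain, score in zip(years, taught, naep_like):
    print(f"{year}: taught construct {gain:+.2f}, NAEP-like composite {score:+.2f}")
```

With those made-up numbers, the taught construct rises every year while the NAEP-like composite slowly falls, because the unaddressed portion of the construct loses ground faster than the overlapping portion gains. I am not claiming that is what happened; the point is only that the observed trend and the trend in what teachers are told to teach do not have to move together.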

Nonetheless, the trends are national in scope. Virtually across the board. That ain’t good. Perhaps it is just a reflection of testing the wrong thing, but there’s no good evidence here, no good story to be told. 

I care most about making sure that the tests actually measure what they claim to measure—well, other than caring about the learning, development and health of children, of course. I think there is a broad crisis of standardized tests misrepresenting what they actually mean to an audience that is hungry for particular meanings from those tests. If the tests are not measuring the right thing, the usefulness of the entire endeavor is questionable. The coverage we see of NAEP results does not account for this, which is perhaps evidence that we should not be paying any attention to them at all. If we can’t get it right with NAEP, what hope is there for other assessments?

When state standards were so varied, NAEP offered a common yardstick to judge them all against. But the NAEP team’s conclusions about what to measure turned out different than the National Governors Association’s and the Council of Chief State School Officers’ conclusions about what to teach. So, my biggest concern is that, rather than supplementing other assessments with its own special strengths, NAEP is arrogantly sticking to its own construct definitions.

There is a path forward for NAEP. In a country without federal power over standards or curriculum, NAEP should acknowledge the hard work of states and their leadership—and the goals of schools and teachers. Then, we might actually get more value from it.