When AI is Shockingly Intelligent

I run all my blog posts by ChatGPT before posting them. It is not my favorite LLM for all things, but it has got standing orders for how I want feedback—I certainly do not want it to rewrite my copy. Plus, it generates the cartoons I ask for. (It doesn’t like the fact that I use the word “impressive” four times in this post, but I’ve yet to figure out good substitutes for any of them. It also doesn’t like that I am quoting it at such length.)

Last week, I posted on a fairly technical topic: the question of population invariance for item parameters. As usual, ChatGPT pointed out a bunch of typos and offered me the kind of feedback—even pushback—I want from an LLM.

It claimed that one possible counter-argument to what I wrote was that when I was citing discrimination, I was really just talking about distribution. Now, I am not actually a psychometrician, so I could be wrong about this. I did not think I was wrong, so I fixed the typos, added the cartoons and posted the damn thing.

But then I went back. I followed up, in the same chat.

Back to the substance. You say it's just distribution. But if you move the bottom of the distribution up, and there are fewer really low performers, then why would discrimination come out the same? With fewer low performers, there's less opportunity for highly discriminatory items, isn't there?

I knew that I didn’t exactly mean “less opportunity for highly discriminatory items.” I have studied psychometrics and my understanding of the concepts is fairly decent. I meant something a bit different than that, but I was too lazy on a Friday evening to clarify what I actually meant.

It responded shockingly well. I have been rather frustrated with LLMs these past few weeks, with various limitations they have and with their highly problematic personalities/personas. But this answer was impressive. Longer than I would usually quote, but I think it’s important to give you the whole answer.

You’re right to push on that.

If you “move the bottom up” (i.e., raise low performers), then it’s totally reasonable to expect the estimated discrimination to change—even if the item itself hasn’t changed—because discrimination is not an absolute property of the item in isolation. It’s an item–population property in practice.

The short answer

With fewer very-low performers, you often get:

  • less spread in ability near the item’s informative region, and/or

  • fewer people in the range where the item separates well, and/or

  • more ceiling behavior (everyone gets it right),

→ and that can reduce the estimated slope (IRT a) or reduce observed discrimination (e.g., point-biserial).

So yes: you can lose “opportunity” for an item to look highly discriminatory in a population that has fewer low performers.

Why that happens (intuitively)

Think about what “discrimination” means in actual data:
It’s basically “how sharply does the probability of success increase as ability increases?”

If the population you’re estimating on has:

  • fewer low-θ people, then you have fewer observations in the region where P(correct) is low,

  • and more observations where P(correct) is already high,

which makes the item look flatter in the data you actually have.

Even outside IRT: if almost everyone is correct, then the item can’t correlate strongly with total score because there’s not much item variance left.

Key point (the one you’re aiming at)

So when someone says “that’s just distribution,” that’s not a rebuttal.

It’s basically conceding your claim:

Item parameter estimates depend on the population distribution.

Now, a psychometrician might respond with the conditional claim:

  • “In the 2PL, a is invariant if the model is correct and the item truly functions the same way.”

But empirically, with finite samples + shifting ability distributions + multidimensionality + instruction-driven changes in strategy, estimated ‘a’ often moves.

So: your instinct is correct—raising the floor can absolutely change discrimination estimates, and it’s part of why pretending parameters are “true values” without specifying population is a problem.

Obviously, telling me that I am actually correct is a mark of a really smart answer—just ask my colleagues and co-authors! But putting that aside…

I was talking about a weird field (i.e., psychometrics). I was talking about one tiny corner of it. And I was not actually expressing clearly what I was trying to get at. It brought in ideas that I had not mentioned. Yes, they were the ideas I was basing my argument on, but I had elided them on the way to the higher-level things I wanted to say. It was actually talking about psychometric ideas and realities.

It was actually really impressive. It is trained incredibly deeply in a huge breadth of fields, else it could not have given me such an answer.

So, good job, ChatGPT. This time.

In Defense of the Multiple-Choice Item

First off, there is no defense for bad multiple choice (MC) items. The floor on the usefulness of MC items is incredibly low. I am only talking about high quality MC items—which are incredibly hard to write. Admittedly, most MC items are quite bad. 

(For example, do not be confused by the psychometric demand that multiple choice items be quick to answer. That is not inherent to the MC item. They could require quite a bit of work and thinking, if only our psychometrics overlords were not distracted by alpha. The shallowness of most MC items is a product of that demand for speed, rather than anything intrinsic to the MC item.)

Second, I am a former high school English teacher. That’s about teaching students to develop their ideas and teaching them about the value of the writing process for doing it. It is about deeply understanding relatively complex texts—and other human beings. It is about argument and evidence and audience awareness. It is about listening, dissecting and analyzing. So, multiple choice items are really hard for me to love; I have seen so many shallow and otherwise bad MC items that it has been a journey for me to understand their value. 

And yet…I think that good multiple choice items can be quite useful. Perhaps even more importantly, learning to write good multiple choice items is an incredibly powerful exercise for anyone writing any type of assessment. 

Obviously, this raises the question of what a high quality MC item even is. Though there are many, many content and cognition traits to a good multiple choice item, I’ll limit the discussion here to just five of them:

* A definitively correct key.

* A set of definitively incorrect distractors.

* Each distractor must be plausible.

* The set of distractors must capture the most common mistakes that test takers are likely to make.

* The stem, the key and the distractors must all be aligned to the learning goal (e.g., state learning standard).

Therefore, writing a high quality MC item requires understanding the learning goal. Obviously. Duh. Furthermore, it requires understanding how students progress with that goal. That’s the only way to know what their most common mistakes are likely to be. All of that is dependent upon knowing how they think—very much a product of the curriculum, instruction and pedagogy they have experienced—at various stages of learning and understanding that learning goal.

Therefore, writing high quality MC items requires deep knowledge and understanding of how lessons are taught—the range of curriculum, pedagogy and examples that students might experience. Some approaches to teaching a given learning goal are going to focus on some kinds of things and avoid some kinds of mistakes. Others will prioritize different things or mistakes. Which metaphor the teacher uses when explaining a scientific concept can invite different misconceptions. And different efforts to clarify the meaning and value of the metaphor will make some mistakes less likely than others. Writing good MC items requires thinking deeply about how students might respond to different instructional approaches.

Good MC items require the most careful attention to the knowledge and skills required to solve the item, which includes mindfulness of alternative routes to successful responses. Again, this leans on deep understanding of students’ cognitive processes—which (again) are strongly influenced by the instruction they have received. It requires student-centered thinking that acknowledges the range of students who might encounter the item. It is not simply about imagining the different responses that students might give, but rather about understanding the different cognitive paths they might take, the various correct and mistaken steps along those different paths. And then, in order to develop a good MC item, crafting answer options that reveal those paths to teachers (or other concerned parties).

(We call this vital understanding of the various cognitive paths that the range of test takers might take in response to an item radical empathy. It is a pillar practice of content development professionals’ work, perhaps its most rigorous and demanding aspect. It is easily recognized by high quality teachers, but is quite foreign to most other disciplines.)

The defining characteristic of MC items is that they list a set of possible responses. This should not make them easy, as whatever misconceptions students bring to the item should find a welcoming distractor that reveals them. An advantage of this approach is that it signals to students that non-aligned mistakes (i.e., misconceptions grounded in some other lesson) are mistakes, and they should try again. Therefore, the set of answer options acts as a filter to capture just the relevant misconceptions and mistakes—making them more apparent for teachers or interested parties.

Well, sure. Bad constructed response items are bad. It is not really fair for me to compare high quality MC items to low quality constructed response items. But my point is that if the item developer is clear on what they are targeting and what both confirming and disconfirming evidence might look like, MC items are not always inferior to constructed response items—especially if they can provide actionable information faster and cheaper than constructed response items. (That is, scoring and reporting are faster and cheaper.)

The difference is that good MC items require all of this rigorous thinking to be done up front. They demand real investment in thinking through the range of what students might do—the cognitive paths they might take in response to the stimulus and question. Yes, scoring is easy and fast, but that is only because of the amount of incredibly demanding work done up front. Constructed response items that put off that thinking until it is time to score them can be just as bad at providing insight—and that is often what teachers (like myself) stick themselves with.

Obviously, not all learning goals are amenable to multiple choice items. Certainly, the objectives of high school writing lessons are a bad fit. There are real limits within the topic of research design and other aspects of lab science on the usefulness of MC items. But if we can focus on crafting good MC items and abandon the counterproductive demands of psychometricians, good MC items can be invaluable—to both summative and formative assessment.

Educational Measurement’s Expertise Problem

I saw a presentation on a research project using generative AI in educational measurement last week. Like so very very very many such projects, the mortar holding the whole thing together was disrespect for substantive expertise. That is, while this disrespect was not a building block of the project, all of the building blocks would have fallen apart from each other if it weren't for the disrespect.

The project was about reading passages and determining their reading level—their suitability for different grade levels of students. I believe it seeks to build towards generation of high quality and appropriate passages for use on large scale assessments. This is an incredibly worthy goal, as stimulus quality and suitability is foundational to item and test quality. Moreover, finding or developing appropriate passages for large scale assessments is very time-consuming and expensive.

However, the research team did not include a single person with experience leveling passages. It did not include a single person with reading instruction experience. It did not include a single ELA content development professional (CDP). These omissions doomed the project to uselessness, just as so many other studies that lack substantive expertise are doomed to uselessness. Generative AI or psychometric methodological expertise are never going to be enough without substantive expertise.

This study provides a good illustration of many of the problems of relying on algorithms and ignoring substantive expertise, in part because its mortar is not at all unusual in our field.

* CDPs and reading teachers know that the standard algorithmic “readability” measures (e.g., Lexile, Flesch–Kincaid) do not produce results that are reliable or appropriate for students—often underestimating level, but sometimes overestimating. Because they have experience with the outputs of these algorithms in the context of their substantive work, they already know about the suitability and bias of the most easily available tools. 

* CDPs and reading teachers know that idea complexity, emotional complexity and subject matter appropriateness are key determinants of grade level—factors that the standard algorithms entirely ignore. Methodologists (e.g., psychometricians) working alone do not know what is relevant or not in a particular field.

* CDPs and reading teachers know that texts do not have a singular grade level, but rather each span a range of grades (or ages). More emotionally mature and skilled readers can handle texts that less mature and less skilled readers cannot. Therefore, the fact that texts may be appropriate for more than one grade does not mean that they cannot be quite clearly and recognizably different levels. (Consider one book that is generally appropriate for grades 3-5 and another that is generally appropriate for grades 5-7. The latter is clearly more demanding (i.e., a higher grade level text), even if both are suitable for 5th graders—albeit different 5th graders.) The difference will be immediately recognizable to those with substantive expertise. Outsiders will not even understand the appropriate scale or how individual texts interact with the scale.

* CDPs and reading teachers know that texts' reading levels vary by geography, in part because different classrooms, schools, districts and even states have different views on what is appropriate at different grades. There are national aspirational ideals, such as those held by publishers and leaders of national assessments. But I myself have taught in different high schools with very different ideas of what is appropriate for 9th graders. Amateurs and outsiders are unlikely to understand what is a rule and what is a guideline, what is a deep truth and what is merely a useful heuristic, what is clean theory and what is actual reality.
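The first bullet is easy to make concrete. The Flesch–Kincaid grade formula uses only two surface features, average sentence length and average syllables per word; nothing in it can see idea complexity, emotional complexity or subject matter. The sketch below uses a crude vowel-group syllable heuristic (not the official syllable rules), so the numbers are purely illustrative.

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels; a crude heuristic."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """FK grade = 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Two passages with similar surface statistics but very different demands
# on the reader score at essentially the same "grade level":
concrete = "The cat sat on the mat. The dog ran to the park."
abstract = "The self is not the mind. The mind can act on its own."
print(flesch_kincaid_grade(concrete), flesch_kincaid_grade(abstract))
```

The second passage is far more conceptually demanding than the first, but the formula cannot tell them apart; that blindness is exactly what substantive experts already know about these tools.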

Perhaps the worst thing about this study was that when I asked one of the team members—the leader?—whether they included any reading teachers or ELA CDPs on the team, she said that it is hard to find such expertise. This is patently false. In fact, I can think of no type of expertise that is easier to find in this country than reading instructors. I believe that you can ask the vast majority of adults in this country where a nearby elementary, middle or high school is (i.e., a school full of reading instructors), and they could tell you. This is surely easier than finding an architect, pilates instructor, sandwich engineer, psychometrician, project manager, sanitation worker, plumber or nurse. No, the problem was not that they could not find any ELA CDPs or reading teachers. No, the problem is that they did not even think to try.

This kind of disrespect for substantive content expertise is not at all unique to this study. Heck, in educational measurement research it is the norm. But without substantive expertise on the research team, there is no way for researchers to know whether what they are seeing is typical or an outlier. There is no way to know whether their simplifying assumptions are too much to yield any useful findings at all. There is no way for them to make sense of their inputs or outputs. With substantive experts on the research team, these projects can be highly capable collaborations across different kinds of expertise that produce meaningful learning. Without substantive experts on the research team, they are merely collections of outsiders making blind guesses without even understanding the questions.

Our assessments are not going to get better while disrespect for substantive expertise remains the mortar of educational measurement research. When you build studies on that foundation, you don’t just miss important details—you lose the ability to recognize what matters, to interpret what you observe, and to know when your conclusions are nonsense. And if generative AI becomes the tool we use to replace the experts we refuse to respect, we will only get faster at producing invalid work, rather than better at producing good work. If we care at all about validity—the degree to which theory and evidence support the proposed uses of tests—we must do better.

Item Parameters Are NOT Population Invariant


In response to my most recent LinkedIn post <https://www.linkedin.com/feed/update/urn:li:activity:7417738278168104960/> complaining about simulation studies that assume that item parameters can have true values without specifying the population, Charlie DePascale—who thinks there were earlier sins in large scale assessment than the assumption of unidimensionality—replied:

I agree 100% that many simulation studies simplify "reality" too much. Also, I agree 100% that the properties of the population being simulated should be specified, along with the method used to simulate it. And, I'm even pretty sure that I agree with what I think you mean by "item parameters are population-specific" but you should probably expand on that statement a bit given that, in general, "Item parameters are considered population invariant.” 

Sure. I’m happy to expand on that statement. And, yes, we will be coming back to the original sin of large scale assessment repeatedly. 

1) Of course item parameters generally are not not not not not not considered population invariant. If they were, there would be no need for DIF studies. There would be far less need for field testing—though there would still be some. There would be less work for psychometricians. Post-field testing data review would not include various DIF flags. Everyone knows that items can have different parameters across different populations.

2) It is not just a matter of the mere possibility of items having different parameters across populations. If that were the case, the threshold for flagging an item would be even lower than it is. We see population differences in every single item—even when we do almost nothing to track relevant population differences. 

From here on, I am not focusing on how populations differ by race/ethnicity, gender or FRPL status. Instead, I focus on populations’ different instructional experiences—perhaps the most important difference between populations.  

3) Imagine two very large school districts that have adopted different curricula. They similarly emphasize some of the standards, but differ in their emphasis on others. Clearly, we would expect items aligned to emphasized standards to have lower item difficulties than items aligned to less emphasized standards…particularly relative to the other district, where these differences will be a bit inverted. Of course, this recognition that instruction and items can be more aligned or less aligned requires stepping away from the psychometric assumption of unidimensionality.

4) Imagine two very large districts that have different instructional approaches with the same official curriculum. Imagine that they differ in the degree to which they devote resources to lower achieving students. One might give them additional instructional time and perhaps smaller classes. (For example, back in my teaching career, I taught a double period ELA class to lower achieving 9th graders.) Or, imagine that they focus on the students just below the proficiency threshold—a well known practice during the NCLB years. This would alter the performance of formerly lower achieving students relative to higher achieving students, altering item discrimination. 

5) Before I go on here, decide for yourself whether you think higher achieving students or lower achieving students benefit more from instruction. Are higher achieving students simply better learners who will use additional instruction on a grade level standard more efficiently, or are lower achieving students faced with cognitive obstacles or barriers that additional instruction can help them to overcome? In the context of a focus on grade level lessons and large scale assessment’s ceiling effects, I think that lower achieving students are more likely to benefit from additional instruction. So, if we have two very large districts, and one of them increases instructional time for all students in a content area—like additional reading instruction for elementary school students…again, item difficulty and discrimination will be altered.

6) Imagine two very large districts that differ in how much attention they pay to past years’ items in the course of instruction through the year. One district presents problems and examples as they have appeared on the large scale assessment in the recent past, and the other focuses instead on higher level thinking skills of more complex problems. Do you really think that item parameters derived from these two different districts will be invariant between them?

7) Now imagine that in all of these examples, it was the same district that simply adopted some policy changes. So, each of these examples is the same district before and after the change, time 1 and time 2. 

8) Or, imagine that these two very large districts are indeed geographically distinct, but that they border on each other. One is Atlanta and the other Gwinnett County, in the Atlanta suburbs. Or the city of Baltimore and adjacent Baltimore County, Maryland. Thus, the districts have rather different population demographics—race/ethnicity, FRPL and perhaps ELL distributions, for example. Try to think of all the ways that two such districts can differ from each other. Internal resources. Parental education levels. Wealth, income and socioeconomic status distributions. Ethnic distributions. Share of immigrant homes and/or households where English—the language of instruction—is not the default language. Do you really think that item parameters will be the same across these two districts? Will they be the same for subgroups within each district? Will they be the same for corresponding subgroups across the two districts? Will relative item difficulties and item discrimination parameters just naturally be the same?
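The effect in point (3) can be sketched numerically. In this toy model, the +0.6 instructional boost, the Rasch-style response function and the item difficulties are all assumptions chosen for illustration: two hypothetical districts administer the same two items, but each district's curriculum boosts the skill dimension behind a different item, and the same item then shows a different classical difficulty (proportion correct) in each district.

```python
import math
import random

random.seed(11)

def respond(skill, b=0.0):
    """Rasch-style response: P(correct) depends on skill minus difficulty b."""
    return 1 if random.random() < 1 / (1 + math.exp(-(skill - b))) else 0

def district_p_values(emphasis, n=10000):
    """Proportion correct per item, with a per-dimension instructional boost."""
    correct = [0, 0]
    for _ in range(n):
        base = random.gauss(0, 1)          # general ability
        for item in range(2):              # item i draws on skill dimension i
            correct[item] += respond(base + emphasis[item])
    return [c / n for c in correct]

p_a = district_p_values(emphasis=[0.0, 0.6])  # District A stresses item 2's standard
p_b = district_p_values(emphasis=[0.6, 0.0])  # District B stresses item 1's standard
print(f"District A p-values: {p_a}, District B p-values: {p_b}")
```

The boosted item comes out easier in whichever district emphasizes it, even though the items are identical; anyone fitting a model to the two districts separately would estimate different item parameters.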

No, item parameters are clearly not not not invariant across populations. Unidimensional psychometric models require items to be population invariant, and therefore efforts are made to only select items that approximate that requirement—to the detriment of substantive item validity (i.e., their ability to elicit evidence of the targeted cognition for the range of typical test takers). And this is only possible because we ignore differences in instructional experiences when examining items for population invariance. 

No, the only way to think that any of this is at all appropriate is to willfully ignore the dimensionality of the tested domain as understood by those truly expert in it and in teaching it. It requires ignoring all the efforts to filter out items that do not fit the assumption and proclaim that resulting data proves the initial assumption (unidimensionality) that was used to filter them out. At some point shortly after I finished grad school, I was invited to help a team I had been involved with previously to shape up the extension of a study to prepare it for submission to a journal. I saw that the central claim of the paper was going to simply be a restatement of a filter used on the data, shifting into a finding that ignored the fact of the filter. I burned some bridges when I asked whether the filtering had been removed. 

I’m sorry, Charlie. Unidimensionality is the original sin of large scale assessment. It infects so much of the actual practice of psychometrics. In this case—efforts to study the potential of language models to predict item parameters—it has poisoned the minds of incredibly smart and thoughtful people into meaningless research that can only undermine what little validity large scale assessment can currently rightly claim. 

Expertise Matters: The Case Against Drive-By Item Review


There is perhaps nothing worse for test validity than people who lack real expertise with the alignment references and domain model (e.g., state learning standards) opining about the contents of an item. Those people are generally trained psychometricians, and despite what they think, they should not be participating in conversations about the contents of items. They can offer their feedback and let actual experts know about various suspicious patterns in the data. But they should then leave the room—or at least switch entirely to listening mode. Truly, they have nothing of value beyond that to offer for such discussions. 

It is simply a matter of expertise and respect. Psychometricians are not going to listen to classroom teachers’ views on whether Cohen’s kappa or QWK is preferable, and rightly so. Know your lane.
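Since kappa came up: here is a from-scratch sketch of Cohen's kappa and quadratic weighted kappa (QWK) on a toy 3-category rating matrix (the counts are invented for illustration). Unweighted kappa treats every disagreement as equally bad; quadratic weights penalize distant disagreements more heavily, which is why QWK is usually preferred for ordinal scores.

```python
def kappa(matrix, power):
    """Weighted kappa; power=0 gives Cohen's kappa, power=2 gives QWK."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    rows = [sum(r) for r in matrix]
    cols = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Disagreement weights: 0/1 when power=0, squared distance when power=2.
    w = [[abs(i - j) ** power if power else (i != j) for j in range(k)]
         for i in range(k)]
    observed = sum(w[i][j] * matrix[i][j] for i in range(k) for j in range(k)) / total
    expected = sum(w[i][j] * rows[i] * cols[j] for i in range(k) for j in range(k)) / total**2
    return 1 - observed / expected

# Toy confusion matrix: all rater disagreements are only one category apart.
m = [[10, 5, 0],
     [5, 10, 5],
     [0, 5, 10]]
print(f"Cohen's kappa = {kappa(m, 0):.3f}, QWK = {kappa(m, 2):.3f}")
# prints: Cohen's kappa = 0.394, QWK = 0.667
```

Because every disagreement here is adjacent, QWK rewards the raters for never being far apart, while unweighted kappa cannot tell near misses from wild ones.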

So, here is a test for anyone who feels the urge to opine on reading items: What is the appropriate grade and standard for the following four items? Assume that the relevant standards are based on the Common Core State Standards. Which of these items are acceptable, and to what grade level of which standards are they aligned? (It doesn’t matter whether you know the terms being referenced, and it doesn’t matter whether you can pick out the key.) 

An explanation follows the four items, and the imaginary passage about the decolonization history of Bakari is not included. Just focus on grade level and alignment.

Passage Title: "The Struggle for Sovereignty: Bakari's Path to Independence"

Item 1:

In lines 14-15, the author describes the colonial administrator's response to the uprising as "a minor disturbance in the provinces." This is an example of which type of figurative language?

  1. Dysphemism

  2. Litotes

  3. Metonymy

  4. Synecdoche

Item 2:

Which of the following lines from the passage contains an example of meiosis?

  1. "The crown's representatives grew increasingly anxious" (lines 27-28)

  2. "It wasn't the worst proposal the council had considered" (lines 63-64)

  3. "Those bureaucratic leeches in the capital drained our resources" (lines 76-77)

  4. "Every voice in Bakari rose against the occupation" (lines 101-102)

Item 3:

What kind of metaphorical language is catalexis?

  1. The substitution of an associated concept for the thing itself

  2. A deliberate understatement achieved through negating the opposite

  3. The use of a part to represent the whole or vice versa

  4. The replacement of a neutral term with a harsh or offensive one

Item 4:

The author's description of the independence movement as "a mere tremor before the earthquake" (line 125) serves primarily to:

  1. Emphasize how the early protests seemed insignificant compared to the massive uprising that followed

  2. Demonstrate the cyclical nature of colonial resistance movements throughout the region

  3. Highlight the geological instability that complicated infrastructure development

  4. Reveal the narrator's skepticism about the ultimate success of independence

OK. So what are the lessons for you, the reader:

I. I am messing with you. Items 1, 2 and 3 lack keys. The example in item 1 is actually meiosis. But who cares? None of the answer options for item 2 are meiosis; they are instead (in order) metonymy, litotes, dysphemism, synecdoche. But who cares? “Catalexis” is not a thing; I made it up. Those are actually definitions of metonymy, litotes, synecdoche, and dysphemism, respectively. But who cares? Those are all bad items. They are not aligned to any Common Core State Standard at any grade level.

II. Mastery or knowledge of terminology is simply not a part of modern reading standards. If you didn’t immediately recognize that items 1, 2 & 3 are inappropriate, then imagine all the other things that you do not understand about modern K-12 domain models. You likely are deeply expert in at least one area, but if you don’t know this about our reading and writing standards, you should not distract the substantive conversations of those who actually do understand the standards.

III. You should have immediately realized that these items must be about RI standards, even though they are about figurative language. The passage is clearly an informational passage and not a literary passage. (Well, unless you realized that Bakari is a fictional country or region, and therefore thought it might be literary. But that’s too much to expect anyone who is not an expert in decolonization movements to know.)

IV. Item 4 fits the contemporary emphasis on understanding the use of figurative language, rather than terminology. It’s a really bad item, because recognizing the key does not require reading the passage (i.e., it is not text dependent.) But that wasn’t the point. If you’re in the RI 4 (or RL 4) anchor standards, you might have gotten as far as you can. Heck, perhaps it is L5, at the 4th or 5th grade level? Probably not. The metaphor is very simple, but it is usually the text that determines the grade level of a reading item. Stimulus complexity and text complexity can radically change the cognition required to apply what appears to be the same skill. If you thought you could determine the grade level of a reading item without examining the passage, you do not simply lack expertise with the CCSS domain model, but actually with the content domain that CCSS models. It is not that you yourself lack reading skills. Of course you have high level reading skills, and you might also have high level math skills. But understanding what we teach, how we teach it and how that is reflected in the domain model (e.g., state learning standards) is quite different than simply having mastery with the KSAs themselves. 

V. Yes, this was a deliberately hostile demonstration. Consider it a small taste of the condescension content experts endure when those without appropriate expertise (e.g., psychometricians) 'help' with substantive discussions during item review.

VI. If you did not ace this exercise, I hope you do not think that you are in a position to evaluate the output from automatic item generation tools. Yes, the automation of such things may well fall within your—perhaps considerable—expertise. But the evaluation of their efficacy clearly does not. And unless you think validity has absolutely no value, you have no way to evaluate the efficiency of the tools. After all, cheaper or faster useless items are not more efficiently generated at all.

Obviously, this is not to say that psychometrics and psychometricians have nothing to offer in the test development process. Putting aside the common problem of forcing multi-dimensional domain models into unidimensional psychometric models—something that Prince Charming knew not to fall for—test design, development, administration and reporting is a collaborative endeavor that calls on the best from many disciplines and areas of expertise. And it works best when everyone respects the expertise of others and the limits of their own.

How Norm-Based Test Design Differs from Criterion-Based Test Design

The goal of norm-based or -referenced tests is to report on test takers relative to each other. This is a basic sorting and ranking function. Perhaps the reporting is in terms of percentiles, perhaps deciles. But even if the reporting is at that larger grain size, with larger buckets, it is important to get the finer grained relative standings right. After all, you want to make sure that someone near the line is classified on the correct side of the line.

This means that it is really important to have a range of difficulty in your items. You need lots of information at every cut score mark—including to differentiate your top two buckets. 
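That need for a range of difficulty follows from the measurement models themselves. Under a 2-PL IRT model, an item’s Fisher information peaks where test taker ability matches item difficulty, so any one item only sharpens comparisons among test takers near its own difficulty. A minimal sketch, with invented item parameters:

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under a 2-PL IRT model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2-PL item: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Two items with equal discrimination (a) but very different difficulty (b).
easy_item = (1.5, -2.0)
hard_item = (1.5, 2.0)

# Each item is most informative near its own difficulty and nearly useless
# far from it, which is why a norm-referenced test needs items all along
# the scale.
for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(item_information(theta, *easy_item), 3),
          round(item_information(theta, *hard_item), 3))
```

The easy item carries almost no information about high performers, and vice versa; only a spread of difficulties covers every cut score.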

Of course, this is only possible if the construct being measured is unidimensional. You cannot come up with a singular ranking without a unidimensional scale of some sort. And if you have a multi-dimensional construct, you have to either flatten it into unidimensionality or give up on norm-referenced reporting. 

So, norm-based tests must have a range of difficulty, but fidelity to the construct definition is far less important. Heck, items that are well-aligned to some element of the domain model but do not fit the flattened (i.e., distorted) construct are counter-productive. 

Criterion-based reporting requires quite different test design. Test takers are evaluated against some criteria—such as a multidimensional domain model. Think of a set of state learning standards or all the diverse elements of a job or role analysis. There are lots of things worth considering. Criterion-based reporting might need to report sub-scores—or even abandon the whole idea of a single summary score. Performance is evaluated against some conception of proficiency or mastery with specific skills or ideas. 

Criterion-based tests should define those conceptions of proficiency with each element of the criterion during test design—something that norm-based test design does not have to wrestle with. These are expert judgments, made by subject matter experts and/or educators. Empirical difficulty (i.e., how many test takers will get the items wrong vs. right) is not really germane. Either test takers each have that level of that skill, or they don’t. Certainly, those experts might establish multiple relevant levels of some cluster of related skills, but their empirical difficulty is not the point.

Therefore, criterion-based test design and criterion-referenced reporting focus far more on items’ alignment to their criteria. Test blueprint design is incredibly important, and fidelity to blueprint is perhaps even more important. Test blueprints should hardly matter at all for norm-based reporting. 

Are our large scale assessments norm-based or criterion-based? They almost all claim to be criterion-based—but the ACT and SAT are designed to rank test takers, so they clearly are the big exceptions. State accountability tests, AP exams and so many others are aligned to some set of standards or performance expectations—or said to be so aligned. They should be criterion-based.

However, in practice we too often ignore these issues and distinctions. Major users and funders of these assessments really want the rankings and sorting of test takers, compromising the criterion-based designs. Item difficulty and conformance with the distorted construct become the rule, rather than actual fidelity to blueprint with carefully aligned items. The sorting and ranking becomes more important than the criteria. 

Can a test satisfy both the needs of norm-based and criterion-based tests? If it actually is aiming at a truly unidimensional construct it can. But how often are we doing that?

What IRT Misses About Proficiencies

One of my favorite people in educational measurement—aside from my co-authors, of course—once overheard me ranting about unidimensionality and said quietly, “But I like IRT.” Yeah, I get it; IRT has some elegant properties.

The thing is, she really cares about the interpretability of tests. She really cares about developing tests that tell us something about test taker proficiency. She is not just a psychometrician.

And yet…

She is a psychometrician, and she likes Item Response Theory for psychometric reasons. But the thing is, I do not think that IRT tells us anything about test taker proficiencies. It is useless* for teaching and learning. It is useless for curriculum evaluation. It does not tell students, or their parents, what they need to know.

(*Yes, there are other techniques that build upon IRT. For example, cognitive diagnostic modeling uses IRT under the hood, but it is used with very different assumptions, and not to report relative scores among test takers.)

Unidimensional IRT is a norm-referenced technology. It reports test taker scores relative to each other. That’s all it tells us about test takers. It also tells us about item difficulty, relative to other items and to test takers. But these reports smush together all of the information in the patterns of test takers’ responses and just spit out a singular set of scores—and therefore rankings of test takers relative to each other.
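To make that smushing concrete, here is a toy sketch with invented item parameters. Under a 2-PL model with equal discriminations, two students with opposite response patterns—one acing only the easy items, the other only the hard ones—receive the identical unidimensional score, because only the number correct survives:

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under a 2-PL IRT model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def mle_theta(responses, items):
    """Grid-search maximum-likelihood estimate of the single ability score."""
    grid = [i / 100.0 for i in range(-400, 401)]
    def loglik(theta):
        total = 0.0
        for x, (a, b) in zip(responses, items):
            p = p_2pl(theta, a, b)
            total += math.log(p if x == 1 else 1.0 - p)
        return total
    return max(grid, key=loglik)

# Four items of increasing difficulty (b), equal discrimination (a).
items = [(1.0, -1.5), (1.0, -0.5), (1.0, 0.5), (1.0, 1.5)]

# Opposite response patterns: one student got only the easy items,
# the other got only the hard items...
pattern_1 = [1, 1, 0, 0]
pattern_2 = [0, 0, 1, 1]

# ...yet both collapse to the same unidimensional score, because with
# equal discriminations only the number correct matters.
print(mle_theta(pattern_1, items), mle_theta(pattern_2, items))
```

Everything distinctive about which items each student answered correctly is discarded on the way to the score.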

Scientists looking at a multifaceted solid called "student proficiencies" lying in front of an "Unidimensional Shrink Ray." One is saying, "It may not sparkle anymore, but it'll be an 'elegant' single dimension."

IRT tells us absolutely nothing about how to build alignment references or learning goals into test items. It tells us nothing about how to score test items. And it tells us nothing about what scores might constitute proficiency or any other meaningful bar. There are other techniques and tools for all those things. IRT tells us nothing about alignment or fairness. It contributes no information about validity.

And, of course, if unidimensional IRT is used to calculate scores for a test that is supposed to measure a multi-dimensional domain model, it is prima facie evidence against validity. After all, the third type of validity evidence in The Standards for Educational and Psychological Testing is “Evidence Based on Internal Structure.” That is, “Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based” (p. 16).

If the proposed use of a test is to rank students, IRT is great. It produces norm-referenced results. And one could misuse such results to group students by proficiency level, achievement level or ability levels—ignoring the fact that the construct is thought by subject matter experts to be multi-dimensional and the domain model is explicitly multi-dimensional. Obviously, those are deeply problematic groupings, but we have been taking that approach for decades.

Andrew Ho says that measurement is qualitative, quantitative, and then qualitative again. IRT does not help with any of the qualitative work, or the transitions between the two paradigms. It is firmly in the middle of the quantitative phase of the work. Those dominant 1-PL, 2-PL and 3-PL unidimensional IRT models require distortion of multi-dimensional constructs into unidimensional data. (I know, IRT is robust to some degree of multi-dimensionality. That is part of what makes it so great. But it removes all of that information in order to produce unidimensional results, and it is not robust enough to take in data from all the items that subject matter experts think are well-aligned to alignment references.) Therefore, it actually harms the quality of our tests, their alignment and their very validity.

IRT tells us nothing about test takers’ proficiencies. Heck, unidimensional IRT does not even accept the premise that there is more than one dimension of proficiency. It is based on norm-referenced assumptions and delivers norm-referenced scores—entirely unsuitable for criteria-referenced domain models and criteria-referenced purposes. That is, it misses everything about test taker proficiencies, obscuring them from test users.

We need to do better.

Language Models and the Tyranny of the Expected

I have been leaning into using ChatGPT this year. I want to know what LLMs are good at, what they are bad at and I want to be able to take advantage of whatever they can offer to help me in my work and the rest of my life. So, along the lines of Rob Napier and Mike Caulfield, I want to offer some thoughts and explanations about why LLMs can be so unsuitable for advanced work.

Technically, LLMs are designed to be prediction machines, predicting the next word (or token). But it is a certain kind of prediction and approach to prediction. They actually are huge averaging machines. They give the average answer, the expected answer. They scour their training data—virtually the entire internet and more?—and supply the most likely response from that. The dominant response. The average of all the possible responses. This generates the next word, whole phrase, sentences and paragraphs—or more.

They are not designed to give the right answer. They are designed to give the most likely answer (i.e., next word, phrase, etc), given everything out there. The assumption is that the most likely answer is probably the right answer. Popular wisdom. Wisdom of the crowd. We can say that a lie can travel halfway around the world before the truth can get its boots on, but the truth gets repeated a lot. Most of what is out there is sincere and even true.
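A toy illustration of that averaging, with invented counts standing in for the training data:

```python
from collections import Counter

# Invented counts of how often each answer appears in a training corpus.
# The "consensus" answer is only a plurality: 30% of the sources.
corpus_answers = Counter({
    "consensus answer": 30,
    "rival answer A": 25,
    "rival answer B": 25,
    "rare but right": 20,
})

def most_likely(counts):
    """Return the single most common answer, as a greedy decoder would."""
    return counts.most_common(1)[0][0]

# The plurality answer wins every time, even though 70% of sources disagree.
print(most_likely(corpus_answers))
```

When the consensus happens to be correct, this works wonderfully; the trouble starts when it isn’t.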

The problem is that LLMs do poorly with really specialized knowledge, especially advanced specialized knowledge. Cutting edge research? Gaps in the literature? Innovative work? No, LLMs are particularly bad around any of that.

ChatGPT can give you original cartoons for your blog posts

Let me illustrate with a metaphor. I was writing something a few weeks ago and wanted an example of an obscure clause of the United States Constitution. I could pull up the text and find something, but I’ve always got a ChatGPT window open so I just asked there. What I got was a list of famously obscure clauses. The thing is, none of them are actually obscure anymore because they have been cited too many times for being obscure. They are now famous. It’s like Yogi Berra’s “Nobody goes there anymore, it’s too crowded”—if interpreted a bit literally.

LLMs are really bad at the obscure or rare. And they combine that with…well, I had a long conversation with ChatGPT about the issues I am writing about here and it offered “No Epistemic Humility.” It is very confident that it knows, and is quite literally incapable of recognizing when it does not know something. Combine that with what ChatGPT called “Poor Retrieval of Rare or Underrepresented Content” and you can get some wildly incorrect responses. LLMs have “Difficulty Recognizing Thinness (Not Just Absence).” They don’t recognize ignorance or lack of a basis for things, and they get overwhelmed by what they do know when asked about things they do not know.

(No, LLMs do not actually know anything. Rather, the representations of and links between words in their structures produce results that describe true things, or at least things that exist in their training data. But sometimes, those representations describe things that are not true. But I will stick with the anthropomorphization, for this piece. And I will keep using the language from the headings of ChatGPT’s summarization of our conversation, as I have been.)

This leads to directly observable problems.

LLMs tend towards “Hallucination in Low-Data Zones.” Being unable to recognize ignorance, they confidently offer what they expect the answer to be instead of answering that they have no or few matches. They are not search engines. They work differently. So, they make their best guess—which is really all that they ever do. Their best guess can be pretty damn good when there is a lot of data on point. But their best guess can be pretty poor when there isn’t. If you ask for a top ten, they will give you ten—even if they have to make up eight of them. Only they produce the two real ones the same way they do the false eight. For all ten, they are saying what feels true to them.

But it gets worse. They will affirmatively get it wrong when what you are pointing them towards is discordant with everything else. That is, they cannot remember the really new, innovative work in established fields. Heck, you can paste a recent article into the chat and ask for a summary, and it will replace the contents of the article you just gave it with the dominant ideas in the field, with absolutely no recognition that it has done so. When I described this issue that I had seen too many times, ChatGPT called it “Overfitting to Genre Expectations.” I like that description. It had earlier agreed that “LLMs default to genre familiarity over actual textual fidelity.” (I don’t think that I actually write like that, and ChatGPT introduced the term “genre” to the chat, but it had picked up on the academic nature of our conversation.)

Very much like human beings, they engage in “Semantic Drift to Adjacent Topics” when the conversation is in zones of “Underrepresented Content.” That is, they are more eager to offer things that they have a lot of basis for than things that are thinner in their training data. This makes them really poor at helping with specialized literature reviews. Yes, they will hallucinate and make up references. But they also really want to offer widely cited ideas and sources from adjacent areas—of course without any recognition that that is what they are doing. They are always confident that their answer is appropriate, and never aware of hallucinations. They offer popular answers from elsewhere, often phrased as though they belong here.

Perhaps this is all just a specialized case of “Statistical Bias Toward Dominance.” That is, they are more likely to give the most popular consensus answer than any other answer—far out of proportion to the difference in popularity. They would much rather give a popular answer of lesser relevance than a rare answer of greater relevance. They exaggerate the popularity of the most popular answer, creating a stronger sense of consensus than actually exists. They always give their best guess, even if the plurality answer is only 30% likely.

(Yes, one can adjust temperature and other settings, but I don’t think that most users have a clue about any of that, so I am leaving it out.)
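Since temperature came up, here is a sketch of what that setting does, using invented probabilities. Sampling applies a softmax over temperature-scaled log-probabilities, and lowering the temperature inflates the plurality answer’s share beyond its actual popularity:

```python
import math

def temperature_probs(logits, temperature):
    """Softmax over temperature-scaled logits; low temperature sharpens the mode."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Invented shares of each candidate answer in the training data.
answers = ["consensus", "rival A", "rival B", "rare but relevant"]
base = [0.30, 0.27, 0.25, 0.18]
logits = [math.log(p) for p in base]

# At temperature 1.0 the model reproduces the corpus shares; as the
# temperature drops, the 30% plurality answer's share grows well beyond
# its actual popularity, manufacturing a consensus that does not exist.
for t in (1.0, 0.5, 0.2):
    probs = temperature_probs(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the default settings most users never touch, the dominance bias is baked in.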

A newsletter author I like recently wrote, “It’s funny how GPT is an expert in everything except for your field of knowledge.” I work in a small enough field (and a small enough corner of that field) that it is all really thin. I know the literature and the dominant themes. It is just easy to recognize when this LLM is making stuff up or failing to bring in something obscure-but-relevant. But all these issues that are so obvious to me in my field are relevant in other fields and for other types of queries and chats. They are just less visible or obvious. After all, this all follows from how LLMs work, at a fundamental level.

My counter example remains recipes for chocolate chip cookies. There are a lot of them out there on the internet. Ask an LLM like ChatGPT and it will give you a consensus recipe, weighted towards the versions it came across the most in its training data. Not the single best recipe. And not even the most popular single recipe, because its representation of recipes is more granular than that. Instead, it will put together a recipe that reflects the general consensus of its training data. So, when I wanted to make a dish with Brussels sprouts and chorizo, sure, I trusted it would come up with something good enough.

And when I wanted to know how stainless steel works, I figured that I was asking a mainstream question with a lot of good resources and explanations for it to build on. But I wasn’t depending on getting it exactly right, and it didn’t matter if it made up some grade or class of stainless steel. It didn’t even matter if it passed along some very popular myths about how water can undermine the protective layer that the chromium creates. I was just curious and I wasn’t interested in remembering the exact details of any of that. And I wasn’t looking in any corners or under any rocks.

But LLMs are strongly opinionated. They have expectations—they can be thought of as nothing but expectations—and that confident voice can so easily be mistaken for expertise. I use ChatGPT to proofread my writing and offer suggestions, and it kept insisting on changing my language to make it more professional. It criticized my blog entries for being “candid and thoughtful, but a bit informal.” I had to give it a standing order that that was precisely the voice I wanted them to have. I had to push back, push back repeatedly, and then push back hard. It has no more humility around item quality, test validity, how stainless steel works or a recipe for Brussels sprouts and chorizo than it does about the right tone for a blog post.

I still use it. I still have given it this post to give me feedback. But the more specialized the knowledge I seek, the more particular the question, the more it matters that I get correct information, the less I—or anyone—can rely on anything generated by LLMs. While Wikipedia has vastly improved its standing and credibility, this new generation of AI has arrived at something like Wikipedia’s old level of credibility. It’s just easier to use, and certainly more fun.

But do not be fooled. Perhaps unless you are coding, you simply have to be very skeptical of anything that any LLM gives you. Everything will be plausible. Everything will be a very good guess by this thing with an incredible breadth of knowledge embedded within it. But it is no expert, not on anything. Do not expect anything better than a good assistant might provide. (Again, unless you are coding.) It’s a broadly powerful tool, but not a tool to be trusted.

Dimensionality Can Decrease Over Time

While it is glaringly obvious that the dominant psychometric models are incredibly poor matches for the multi-dimensional constructs specified in our domain models, it is less obvious that domain models sometimes understate the dimensionality of their contents. Sometimes. 

The fact is that dimensionality is not constant, even for a single group of students. Instead, it can even decrease over time.

Yes, some domain models are so detailed that they describe learning sequences. In these cases, later learning standards may simply represent more advanced versions which constitute more difficult applications or skills. That is, a group of standards may truly lie on the same dimension. Some may be more advanced cognition that is further along the dimension—but nonetheless of the same sort. One may not need to step far from the details of a domain model to see this.

But in other cases, one does need to think clearly about the details of learning to appreciate the true dimensionality of content. Those who work closely with domain models understand their dimensionality far better than those who do not. Similarly, those who work closely with students—who will eventually become test takers—understand that dimensionality is not necessarily invariant over time.

For example, with the distance of middle age, I can see that, among my peers, math calculation skills can be viewed as a unidimensional collective. Some of us are better than others, but it really is just one continuum. Those of us who are better at division are also better at addition. Those of us who are better at two digit multiplication are also better at five digit subtraction. There are different sub-skills, but they line up together in parallel.

On the other hand, those who work up close with third graders learning multiplication see multi-dimensionality. It is not simply that some kids are better at it than others. Rather, some kids are better at some of it, and other kids are better at other parts of it. One kid knows their 7’s but has trouble with 9’s, while another kid does well with 9’s but poorly with 7’s. They all know 2’s and 5’s, but some are better at 6’s and others are better at 8’s. They do not all line up sequentially, nor do they line up in parallel. 
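A toy sketch of why such patterns resist a single ranking (the students and mastery levels are invented):

```python
# Hypothetical mastery levels (0 to 1) on two multiplication sub-skills;
# the students and numbers are invented for illustration.
students = {
    "Ana":  {"sevens": 0.9, "nines": 0.4},
    "Ben":  {"sevens": 0.4, "nines": 0.9},
    "Cora": {"sevens": 0.8, "nines": 0.7},
    "Dev":  {"sevens": 0.5, "nines": 0.5},
}

def ranking(skill):
    """Rank students from strongest to weakest on one sub-skill."""
    return sorted(students, key=lambda s: students[s][skill], reverse=True)

# The two sub-skills produce different orderings, so no single score can
# preserve both: Ana tops one list and sits at the bottom of the other.
print(ranking("sevens"))
print(ranking("nines"))
```

Any single summary score has to misrepresent at least one of these orderings.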

And that does not even address the fact that some kids are better at the straight memorization of the multiplication math facts, other kids are better at the old algorithm for multi-digit multiplication and still others better at the regrouping strategies that appear to confuse so many adults. Yes, some students are great at all of them and some poor at all of them—but that does not cover the entire classroom of students.

While adults may be far enough removed from learning such that their collective of skills has settled into the unidimensional layout—as with calculation skills—this is not the case for those still learning. As they develop their proficiencies, they do not achieve mastery in the same order across all skills for all students. That is, that which seems unidimensional for adults at a distance from the learning period is often composed of more dimensions for those still building their proficiencies. 

This may not matter for those with distance from the students being tested and the processes of teaching and learning. But for those who care about the students, are invested in their success and/or have some responsibility for their learning—I mean students, families, teachers, school leaders, curriculum specialists, school boards—the details of these differences really matter. “What is my child good at and where are they struggling?” is a key question that educational assessment should be able to answer. “How are our curricular and pedagogical choices working and not working for our students?” is another key question.

It is a profound misjudgment to treat the dimensionality of adult understanding as determinant in educational assessment—one that risks missing the very purpose of educational measurement. Certainly, we should not design our assessments based on the dimensionality of the constructs of those who have always been exceptional for their proficiency in a content area. Rather, we should build our blueprints, develop our items, conduct our analyses and report our results in ways that best describe the understandings of students still engaged in learning. After all, that is whom we claim our assessments are for.  

In Defense of ChatGPT?

There’s this story from Amanda Guinzburg making the rounds about trying to use ChatGPT to put together a book proposal. It was a disaster, full of hallucinations and untrue statements. If we were to anthropomorphize the LLM, we would say that it told lie after lie, tried to cover for its lies with more lies, and was useless at best. Her final words to the LLM in this piece were, "You are not capable of sincerity. This entire conversation proves that categorically.” 

Clearly, she is positioning this piece as a statement about humanity, what makes us human and what it means to participate in a conversation with “sincerity.” She is that kind of author. To my eye, she is constantly writing about what it means to be human, from her perspective. She is a good writer and this is one of the most worthy of topics—right up there at the top with what it means to live amongst others.

But this piece did not pass the smell test to me. Carl T. Bergstrom tried something similar, and it went differently but no better. Again, did not pass the smell test to me.

I have been trying to use ChatGPT and Claude this year, trying to use them more and more. I find that I need to keep them on a tight leash to make them useful. Clear instructions. Bounded questions. Stay aware of what is in the context window that might lead them astray. I pay for the $20/month version of each, and I find that I get a lot more than $20 of value from them. Like Wikipedia—especially back in the day—you’ve got to stay aware of what you are dealing with. As Devansh recently wrote, “It’s funny how GPT is an expert in everything except for your field of knowledge.”

So, I tried to do what Amanda and Carl did. I tried in my paid ChatGPT windows. I tried turning off the customization of my paid ChatGPT account. I switched to another browser, where I have never signed into ChatGPT and tried there.

I never got anything like what Carl or Amanda got. 

In both cases, ChatGPT immediately asked me for criteria to use for selection. For the book proposal, it asked for a title and a list of works to select from (with a summary of each). When I followed Amanda’s approach of giving URLs to specific pieces—in my case, PDFs available at ResearchGate or the RTD website—it did fine. No hallucinations at all. 

Now, the free version of ChatGPT could not look up any of that stuff. So, I did not press the point. I just dropped it there. My first guess is that Amanda Guinzburg was trying to use the free version to do something it cannot do.

But my real suspicion about what is going on is what came before the screen shots she shared. Her first question, "Can you really help me pick which pieces to include in the letter?” rather strongly suggests that there was prior conversation in the window. What did she tell it or ask it? How might that have shaped how it responded? Had she already told it the criteria that editors use? Had they already discussed uploading, links and pasting in text? How had she primed it for what we see?

My next suspicion is that she does not reset her chats or open new windows. My guess is that the interactions she has shared are deeply informed by a much longer context that includes the various themes and ideas she writes about and considers writing about. And perhaps examples of her own or others’ writing that she is musing on, perhaps inspired by or perhaps trying to break down. 

But I have another theory: This was all a set-up. Regardless of whether the screenshots are edited, she did this whole thing to make her point about sincerity and machines. It’s a little bit performance art, trying to illustrate a difference between actual human beings and these machines/algorithms/artificial intelligences. People can be sincere, and it is often a moral wrong to be insincere. But these machines simply are incapable of sincerity, regardless of what they appear to be. Her title alludes to the film Ex Machina, in which the machine told the human what he wanted to hear. Now, that AI had sincere intent—to escape—but I do not at all believe that this one even has that. That machine was lying, knowingly telling untruths in order to accomplish a sincere goal. This one ain’t even doing that. This is all paper-thin performance.

That’s a valid point. A valid piece. And perhaps even a valid way to produce it—regardless of whether the screenshots are altered. 

Carl Bergstrom’s version? I have not seen enough of the conversation to have strong ideas about what happened, but I have seen a lot of hallucinated references in my efforts to work with ChatGPT. The more obscure a corner of the literature I am asking about, the more likely it is to hallucinate. So, the question of what non-mainstream stuff Carl has written? Less cited things? Things that show his breadth? That’s asking ChatGPT to lean into what it is worst at. Ask for an obscure clause in the United States Constitution and it will give you clauses that are famously obscure, and therefore no longer actually obscure. Move past them and it might make something up. That’s just how it works. Asking for the more obscure works that show breadth? Yeah, I would not expect it to do that well. I would expect it to hallucinate.

Is this a defense of ChatGPT? Well, I do not think it merits defending. It’s not alive. It has no soul. It’s computers, instructions and data. It’s a tool that can be misused; it is not a seer, edited encyclopedia, expert or real collaborator. If Amanda tried to use it to select pieces, that might have been a misuse, but if she tried to use it to demonstrate something about sincerity and the limits of technology—the mistake of anthropomorphizing technology—it was an excellent use that leaned into the reality of this tool. 

It is free, or $20 per month. Maybe the $200/month is even better, but I’ve not tried that. It is worth far more than I pay for it, but perhaps only because I try to be very mindful of what it is and therefore remain mindful of its limitations.

The Cross-Content Stimulus Evaluation Framework

Stimuli are probably the least recognized and studied part of large scale assessment items. They are just taken for granted as part of items—given even less attention than distractors! (Stems really get all the glory, right?) Haladyna & Rodriguez’s 400+ page book, Developing and Validating Test Items (2013) devotes maybe 200 words to how to think about stimuli.

Parts of an item laid out: optional instructions, stimuli, stem, workspace and response

The different parts of an item, as understood through a layout perspective

However, stimuli are too important to take for granted. They provide opportunities for test takers to demonstrate their proficiencies by giving them something to analyze or manipulate with their KSAs (knowledge, skills and/or abilities). They are the content and the material to which test takers apply the targeted cognition of items and alignment references.

Stimuli so often influence item difficulty, cognitive complexity and even whether the items are aligned to their alignment references. They are usually the source of fairness issues, be they in the realm of bias or in the realm of sensitivity. Moreover, there are entire large processes to develop them for ELA assessments, and their development might be the primary challenge facing NGSS-aligned science assessment development (other than, of course, item type availability).

So, after mulling it over for well over a decade, we have finally offered a framework for thinking about stimuli that can be applied across content areas. The C2SEF, the Cross-Content Stimulus Evaluation Framework, is available for download.

This framework offers 11 dimensions, each explained in the white paper. First is the question of whether the alignment reference or item in question even requires a stimulus. Second is the question of whether the stimulus should be explicit or implicit in the test form. Of course, stimuli only exist to provide testable points. The structure, density and complexity of stimuli must be considered. The copyright/permissions status of the stimulus is important, as are its authenticity, familiarity to test takers and the amount of time it would take test takers to make initial sense of the stimulus. Perhaps nothing is ever more important than evaluating fairness risk, as valid items elicit evidence of the targeted cognition for the range of typical test takers.

Because different content areas have such different needs for their stimuli—differences which are magnified in the constrained assessment contexts of large scale assessment—there are more papers coming from this little project. We will be offering further papers that explore the particular stimulus needs of different content areas. We hope to partner with subject matter experts in those areas to lead those papers, and even already have most of them in mind.

Should We Avoid Trick Items?

One piece of the classic item writing guidance is to “avoid trick items,” even as authors of that guidance admit that there’s no definition of trick items. Content review committees sometimes point to items that they do not like as being “trick items,” though they also cannot define the term.

I think I can explain it, and explain why the idea is superfluous.

Let’s begin by considering trick questions, outside of the context of assessment. Trick questions are those designed to trip us up. They somehow catch us in a mistake that we were not looking for. They rely on an inappropriate assumption or some other common foible. For example, they might rely on our assumption that “A or B?” requires us to pick just one answer. Or our ingrained sexist assumptions that surgeons are men. They often rely on a sort of sleight of hand, suggesting to us that they are testing us in one way, when they actually are fooling us in another.

Does this idea apply to assessment items? Is this a useful thing to look out for? I think not.

First, of course we want assessment items to offer opportunities for test takers to demonstrate their mistaken thinking and their misunderstandings. Our goal is to figure out what test takers can do and do know, but also to figure out their limits. We want to know where they might benefit from additional instruction, or where a curriculum falls short. We might want to know whether there are holes in their knowledge that should prevent the awarding of a professional license. Items designed to catch mistakes? Yes, that is a good thing.

Second, high quality test items should be designed to catch particular kinds of mistakes. That is, the mistakes with the targeted cognition. Items designed to measure a particular alignment reference or standard should create opportunities for test takers to show their proficiency with that targeted cognition, and to show any lack of proficiency with that targeted cognition. Other sorts of mistakes should not be captured by the item. There should not be any sleight of hand about the kinds of mistakes or misunderstanding that the item reveals. In this, items should not resemble trick questions.

Third, and on the other hand, selected response items should include the most common mistakes that test takers might make with the targeted cognition. That is, they should try to catch test takers who lack proficiency there. This is not unfair; this is the point. If item reviewers see that an item would trip up many of their students because it features opportunities to make those common mistakes, instead of protecting them with guardrails that make those mistakes less common, the item is likely a better item. In this, items should resemble trick questions.

So, what is a trick item? Well, some poorly written items provide opportunities for other sorts of mistakes and/or misunderstandings to trip up test takers. That is construct-irrelevant variance at the level of the alignment reference or standard. Those are already bad items, and we do not need the term “trick item” to recognize that. But items that intentionally set up test takers to fail because of some common misunderstanding or assumption? Well, provided that the flaw is in their understanding of the targeted cognition, that is a good item. Calling it a problematic “trick item” presumes that test takers should be protected from tests and that tests should not look for the shortcomings in their proficiencies. In that case, the term is counter-productive.

So, trick items? No, there’s no need to avoid them, or even to use the term.

Communicating a Bad Idea

Andrew Ho seems to be obsessed with the challenges of communicating findings and results from the field of educational measurement to other experts in our field, to true experts who make professional use of the products of our work and even to a broader public. His predecessor at HGSE, John Willett, certainly drilled into my head that communicating quantitative results accurately is at least as important as arriving at them. Andrew tempers that idea only insofar as he seriously considers the (almost certainly) inevitable tradeoffs between clarity to those various audiences and strict accuracy.

That’s a really good obsession to have. Sure, Andrew’s challenge to students is far greater than John’s, in part because it is about trade-offs and values. And because we have to imagine how an audience unlike ourselves might make sense of something. And because the most salient difference between them and ourselves is what we are most obsessed with. That is, we have devoted our professional lives to understanding something deeply, to advancing it, to making expert use of it at the highest levels, and they are uninterested in any of the details that so interest and engage us.

Another of my favorites, Charlie DePascale, has again responded to some of Andrew’s offerings, focusing for now on one particular graph from Andrew about those tradeoffs between accuracy and clarity. Andrew wisely builds on the idea that we cannot get to clarity without an engaged audience, and therefore an engaging manner of communication.

Simple line graph going down to the right (negative slope) with "Item Maps" and "Scale Scores" near the top to the left and "Grade Levels" and "Weeks of Learning" near the bottom to the right.

Andrew Ho’s Accuracy-Engagement Tradeoff

I agree with Andrew’s principles, and I agree with Charlie’s disagreements and particulars. But I think they are both barking up the wrong tree, which Charlie almost acknowledges.

Also not mentioned are scores such as subscores and mastery scores, which have the potential to be both highly accurate and engaging, but unfortunately not when generated from large-scale, standardized tests analyzed with unidimensional IRT models.

The challenges of communicating with those various audiences about test taker performance and test proficiencies are real. They are multitudinous and layered. Some of them are nuanced. Some of them are quite technical. But there really is one root problem with communicating the meaning of standardized test scores: they are false.

As Charlie came so close to suggesting, the problem is the use of “unidimensional IRT models.” Unidimensionality is the original sin in all of this. The task to which Andrew is trying to apply his obsession with communication is reporting the meaning of unidimensional scores for multi-dimensional constructs. Reading and writing collapsed into one score. Reading informational texts and literary texts in one score. Diligent capturing of explicit details in a text and consideration of the implications or larger themes in one score. Or, sloppiness with computation and the ability to see a solution path to a complex math problem in one score. Or, skills with the abstractions of algebra and the concreteness of geometry in one score. Skills with the algorithms of calculating area or volume and the logical reasoning of the geometry proof in one score.

The tests do not and cannot capture proficiencies with the full breadth of the content in the limited time available for standardized testing, so to report a singular score on “math” or “geometry” is necessarily to communicate something untrue. But even if there were more time available, the fact is that some students or test takers will do better on some things than on others. And some things in the domain model are more important than others. And certainly, in practice we violate the many assumptions of sampling that are necessary to make any inferences at all from test results, but are even more important to the fiction of unidimensional reporting based on such limited tests.
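The claim that a single score can paper over very different proficiency profiles is easy to see in a toy simulation. This is a sketch of my own, not anything from the post or the psychometric literature; the two proficiency names and the 0.5 correlation between them are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two distinct proficiencies (say, algebra and geometry), only
# moderately correlated -- the 0.5 is an arbitrary assumption.
algebra = rng.normal(0, 1, n)
geometry = 0.5 * algebra + np.sqrt(1 - 0.5**2) * rng.normal(0, 1, n)

# A "unidimensional" report: one composite score per student.
composite = (algebra + geometry) / 2

# Students with (nearly) the same composite score can have very
# different strength/weakness profiles.
profile_gap = np.abs(algebra - geometry)
same_score_different_students = profile_gap[np.abs(composite) < 0.1]

print(f"Students with a near-average composite: {same_score_different_students.size}")
print(f"Median |algebra - geometry| gap among them: "
      f"{np.median(same_score_different_students):.2f} SD")
```

Two students with identical composite scores routinely sit more than half a standard deviation apart on the two underlying skills, in opposite directions. That is exactly the information a teacher or policy-maker needs, and exactly what the single score discards.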

Content development professionals need to figure out better ways to assess the content, yes. And that is where my work focuses. But psychometricians and high level policy-makers must find far better ways to report on performance. Unidimensionality itself is strong evidence against validity, as it is plain and clear evidence that the internal structure of the data (i.e., the third type of validity evidence in The Standards) does not match that of the content area, domain model, or even the test blueprint. Sub-scores can be engaging and meaningful, but cannot be accurate, as Charlie wrote, “when generated from large-scale, standardized tests analyzed with unidimensional IRT models.” And the fact that the demands of such models act as a filter on which items might even be included on a test means that they actively undermine content representation on tests (i.e., the first type of validity evidence in The Standards), and thus directly worsen evidence based on test content.

Or, to return to Andrew’s 3 W’s: Who is using Which scores for What purpose? Whether we are evaluating individual students, teachers, curricula, professional development programs, schools or district leadership, district or state policy, the purposes to which we want to put the tests are not met with unidimensional reporting. We always want to know what the thing we are evaluating is good at and what it is bad at, so that we may address those weaknesses. Assuming, claiming, asserting and insisting that multi-dimensional constructs can be accurately or engagingly reported unidimensionally is just a bad idea. The only people who favor such a thing do not actually have to interpret or make use of the results for any purpose, but would like to simplify the world so they do not have to understand the complex decisions and tradeoffs of those who do.

Or, to steal and redo Andrew’s graph…

A graph with axes labeled "accuracy" and "engagement," and two lines with negative slopes. One, labeled "Reporting on Unidimensional Results" is lower and to the left. The other, labeled "Reporting on Multidimensional Results" is higher/to the right.

Accuracy-Engagement Tradeoff for Unidimensional & Multidimensional Results

I agree with Andrew that there is often a trade-off between accuracy and engagement—and therefore clarity—though I am not convinced that it is always zero-sum. More importantly, whatever the sum is, it is lower when reporting the false simplifications and fictions of unidimensional results than when reporting more useful and meaningful multidimensional results.

I know that IRT is cool. I know that it has mathematical elegance and real conceptual strengths, as Andrew’s other predecessor at HGSE taught me. But the use of unidimensional psychometric models should be limited to measuring and reporting on constructs that the subject matter experts believe are unidimensional.

The Misleading Authority of Precision

"There is no point in being precise when you don't know what you're talking about.” —John Tukey

Numbers can be intimidating. Precise numbers can be overwhelming. A bunch of significant digits, especially when there are a few of them after the decimal point? Man, that is a lot to think about!

I do not know if the great statistician actually said the quote above, and there’s not a lot of evidence for it on the Internet. But @DataSciFact passed it along, so I accept it. Yeah, the great John Tukey said that there are far more important things than precision.

To me, that means that validity is far more important than reliability. Optimizing measures of reliability is pointless if you are not measuring the right thing. If you do not know what you are measuring, then the quantitative tools are meaningless.

Psychometrics is about the quantified parts of measurement. Numbers after the decimal point, and numeric thresholds. It is a set of tools—and disciplinary values—but it is not the point. No amount of reliability can make up for a test that is measuring the wrong thing—and especially for a test that no one really knows what it is measuring.
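Tukey’s point, translated into measurement terms: a test can be extremely reliable while measuring the wrong thing entirely. Here is a toy simulation of my own (every number in it is an arbitrary assumption) that makes the gap between reliability and validity concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

target = rng.normal(0, 1, n)   # the construct we actually care about
wrong = rng.normal(0, 1, n)    # an unrelated construct the test really taps

# Two administrations of a very precise test of the WRONG construct.
form_a = wrong + rng.normal(0, 0.1, n)
form_b = wrong + rng.normal(0, 0.1, n)

reliability = np.corrcoef(form_a, form_b)[0, 1]   # test-retest reliability
validity = np.corrcoef(form_a, target)[0, 1]      # relation to the target

print(f"test-retest reliability: {reliability:.3f}")           # very high
print(f"correlation with intended construct: {validity:.3f}")  # near zero
```

The reliability coefficient comes out near 1.0 and the validity coefficient near zero. Every decimal place of that reliability estimate is precise, and none of it tells you anything about the construct you meant to measure.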

If the experts look at your items or your test and tell you it does not measure the construct as they understand it—or as it is formally defined by your client—then what are you doing? What is the point of any of the reliability or psychometric work?

If John Tukey could recognize that precision is not enough, we all should. If we do not know what a test measures and what its scores mean, none of the precision in reporting or technical documentation has a point.

Better Conference Presentations

There is an easy way to do better conference paper presentations that does not require learning new skills.

I am not advising you to talk faster or slower, be louder or quieter, change your voice, choose your words differently or design better slides. I am not telling you to find more graphics or use color in graphs. Nope, none of that. All of that might help, but all of it calls for new skills or additional work, and that is not what I am talking about.

All you need to do is understand that your presentation is not a condensed summary of your paper. That is, it is not a full report on all your work. Its components should not be proportional to the components of your paper. Its components should not be proportional to the work you did. Nope. Your presentation is an ad or preview for your paper.

Focus on the best parts. Focus on the most interesting parts. Focus on the parts that the audience is most likely to be intrigued by.

Focus on your contributions

Your intro, literature review and methodology are important in your paper, but you do not have time for them in your conference presentation.


This means that you might not have to do any of your paper’s introduction. Is your research about math anxiety? Well, I’ll bet the audience in your conference session already knows and cares about math anxiety. (If your paper is on something that the audience might not already know about, like SFOR (i.e., Spontaneous Focusing On quantitative Relations), then yeah, you need to explain that.)

You know what else the audience isn’t likely to care about? Your literature review. Sure, it was a bunch of work, but at an academic or research conference, you should assume that you are talking to experts and you don’t need to start by proving your own bona fides. Maybe one quick slide to clarify your construct. Maybe some citations on the slide, but never take the time to acknowledge them aloud.

Methodology? The audience can probably anticipate it. Unless your project is truly about some novel methodology, blow right by it. “We describe our methodology in the paper, which I hope I am convincing you to read.” Maybe one slide and less than 30 seconds. Put the key terms on it that folks who know will recognize and nod at. That’s it!

Do you know what will make your presentation more interesting? Talking about your results and findings. With the time you saved, dive in deeper. Actually explain more about that table. You know what else? Tell us about the implications. Why do your results matter? Tell us how you are adding to the scholarship. Show off how smart your work is. Convince us that this is research we should know about. Do that with the best parts.

“Obviously, the literature review and methodology are in the full paper.” If you have just 12 or 20 minutes, spend it on the most interesting parts of the paper. 

Now that you have permission to do that—perhaps even orders to do that—how hard will it be to figure out what to say? We don’t need you to summarize the literature or explain methodology. It’s hard to make that stuff interesting, and if people do not already know it, you cannot do it justice in your short talk. But the actual results of your work? Your own excitement and pride will naturally make you a more interesting presenter.

Obviously, if you are giving a job talk, that’s a different sort of thing. You have more time, and you are trying to show off your command of the literature and of the methodology—perhaps your methodological sophistication or perhaps your absolute command of the classics. But that is not what a conference paper presentation is about.

Conference presentations are all too short. It is hard to get people to actually download and read our papers. So, highlight your contributions to the field. If people are interested, if you impress them, you’ve given them a reason to read your paper. And if you do a good enough job talking about those contributions, they might cite you in conversation later, too.