Dimensionality Can Decrease Over Time

While it is glaringly obvious that the dominant psychometric models are incredibly poor matches for the multi-dimensional constructs specified in our domain models, it is less obvious that domain models sometimes understate the dimensionality of their contents. Sometimes. 

The fact is that dimensionality is not constant, even for a single group of students. It can even decrease over time.

Yes, some domain models are so detailed that they describe learning sequences. In these cases, later learning standards may simply represent more advanced versions of earlier ones, constituting more difficult applications of the same skills. That is, a group of standards may truly lie on the same dimension. Some may reflect more advanced cognition that is further along the dimension—but nonetheless of the same sort. One need not step far from the details of a domain model to see this.

But in other cases, one does need to think clearly about the details of learning to appreciate the true dimensionality of content. Those who work closely with domain models understand their dimensionality far better than those who do not. Similarly, those who work closely with students—who will eventually become test takers—understand that dimensionality is not necessarily invariant over time.

For example, with the distance of middle age, I can see that, among my peers, math calculation skills can be viewed as a unidimensional collective. Some of us are better than others, but it really is just one continuum. Those of us who are better at division are also better at addition. Those of us who are better at two-digit multiplication are also better at five-digit subtraction. There are different sub-skills, but they line up together in parallel.

On the other hand, those who work up close with third graders learning multiplication see multi-dimensionality. It is not simply that some kids are better at it than others. Rather, some kids are better at some of it, and other kids are better at other parts of it. One kid knows their 7’s but has trouble with 9’s, while another kid does well with 9’s but poorly with 7’s. They all know 2’s and 5’s, but some are better at 6’s and others are better at 8’s. They do not all line up sequentially, nor do they line up in parallel. 

And that does not even address the fact that some kids are better at the straight memorization of the multiplication math facts, other kids are better at the old algorithm for multi-digit multiplication and still others are better at the regrouping strategies that appear to confuse so many adults. Yes, some students are great at all of them and some are poor at all of them—but that does not cover the entire classroom of students.

While adults may be far enough removed from learning that their collective of skills has settled into a unidimensional layout—as with calculation skills—this is not the case for those still learning. Students still developing their proficiencies do not achieve mastery of the various skills in the same order. That is, what seems unidimensional to adults at a distance from the learning period is often composed of more dimensions for those still building their proficiencies.

This may not matter for those with distance from the students being tested and the processes of teaching and learning. But for those who care about the students, are invested in their success and/or have some responsibility for their learning—I mean students, families, teachers, school leaders, curriculum specialists, school boards—the details of these differences really matter. “What is my child good at and where are they struggling?” is a key question that educational assessment should be able to answer. “How are our curricular and pedagogical choices working and not working for our students?” is another key question.

It is a profound misjudgment to treat the dimensionality of adult understanding as determinative in educational assessment—one that risks missing the very purpose of educational measurement. Certainly, we should not design our assessments around the dimensionality of the constructs as experienced by those who have always been exceptional in their proficiency with a content area. Rather, we should build our blueprints, develop our items, conduct our analyses and report our results in ways that best describe the understandings of students still engaged in learning. After all, that is whom we claim our assessments are for.

Communicating a Bad Idea

Andrew Ho seems to be obsessed with the challenges of communicating findings or results from the field of educational measurement to other experts in our field, to true experts who make professional use of the products of our work and even to a broader public. His predecessor at HGSE, John Willett, certainly drilled into my head that communicating quantitative results accurately is at least as important as arriving at them. Andrew only tempers that idea insofar as he seriously considers the (almost certainly) inevitable tradeoffs between clarity to those various audiences and strict accuracy.

That’s a really good obsession to have. Sure, Andrew’s challenge to students is far greater than John’s, in part because it is about trade-offs and values. And because we have to imagine how an audience unlike ourselves might make sense of something. And because the most salient difference between them and ourselves is what we are most obsessed with. That is, we have devoted our professional lives to understanding something deeply, to advancing it, to making expert use of it at the highest levels, and they are uninterested in any of the details that so interest and engage us.

Another of my favorites, Charlie DePascale, has again responded to some of Andrew’s offerings, focusing for now on one particular graph from Andrew about those tradeoffs between accuracy and clarity. Andrew wisely builds on the idea that we cannot get to clarity without an engaged audience, and therefore an engaging manner of communication.

[Image: a simple line graph sloping down to the right (negative slope), with “Item Maps” and “Scale Scores” near the upper left and “Grade Levels” and “Weeks of Learning” near the lower right.]

Andrew Ho’s Accuracy-Engagement Tradeoff

I agree with Andrew’s principles, and I agree with Charlie’s disagreements and particulars. But I think they are both barking up the wrong tree, which Charlie almost acknowledges.

Also not mentioned are scores such as subscores and mastery scores, which have the potential to be both highly accurate and engaging, but unfortunately not when generated from large-scale, standardized tests analyzed with unidimensional IRT models.

The challenges of communicating with those various audiences about test taker performance and test proficiencies are real. They are multitudinous and layered. Some of them are nuanced. Some of them are quite technical. But there really is one root problem with communicating the meaning of standardized test scores: they are false.

As Charlie came so close to suggesting, the problem is the use of “unidimensional IRT models.” Unidimensionality is the original sin in all of this. The task to which Andrew is applying his obsession with communication is explaining the meaning of unidimensional scores that report on multi-dimensional constructs. Reading and writing collapsed into one score. Reading informational texts and literary texts in one score. Diligent capturing of explicit details in a text and considering the implications or larger themes in one score. Or, sloppiness with computation and the ability to see a solution path to a complex math problem in one score. Or, skills with the abstractions of algebra and the concreteness of geometry in one score. Skills with the algorithms of calculating area or volume and the logical reasoning of the geometry proof in one score.

The tests do not and cannot capture proficiencies with the full breadth of the content in the limited time available for standardized testing, so to report a singular score on “math” or “geometry” is necessarily to communicate something untrue. But even if there were more time available, the fact is that some students or test takers will do better on some things than on others. And some things in the domain model are more important than others. And certainly, in practice we violate the many assumptions of sampling that are necessary to make any inferences at all from test results, assumptions that are even more important to the fiction of unidimensional reporting based on such limited tests.
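
To make the loss concrete, here is a minimal sketch with entirely made-up numbers (the items, subdomains and scores are hypothetical): two test takers with opposite strengths in algebra and geometry receive identical totals once the test is reported unidimensionally.

```python
# A minimal sketch with hypothetical numbers: two test takers with opposite
# strengths receive the same "math" score once the subdomains are collapsed.
import numpy as np

# Hypothetical item scores (1 = correct, 0 = incorrect) on a 10-item form:
# items 0-4 are algebra, items 5-9 are geometry.
taker_a = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])  # strong algebra, weak geometry
taker_b = np.array([0, 1, 0, 0, 1, 1, 1, 1, 1, 0])  # weak algebra, strong geometry

for name, scores in [("A", taker_a), ("B", taker_b)]:
    algebra, geometry = scores[:5].sum(), scores[5:].sum()
    print(f"Taker {name}: algebra {algebra}/5, geometry {geometry}/5, total {scores.sum()}/10")

# Both totals are 6/10. The single reported score erases exactly the
# difference a teacher or parent would most want to know about.
```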

Content development professionals need to figure out better ways to assess the content, yes. And that is where my work focuses. But psychometricians and high-level policy-makers must find far better ways to report on performance. Unidimensionality itself is strong evidence against validity, as it is plain and clear evidence that the internal structure of the data (i.e., the third type of validity evidence in The Standards) does not match that of the content area, domain model, or even the test blueprint. Sub-scores can be engaging and meaningful, but cannot be accurate, as Charlie wrote, “when generated from large-scale, standardized tests analyzed with unidimensional IRT models.” And the fact that the demands of such models act as a filter on what items might even be included on a test means that they are actively used to undermine content representation on tests (i.e., the first type of validity evidence in The Standards), and thus are a direct cause of weaker evidence based on test content.

Or, to return to Andrew’s 3 W’s: Who is using Which Scores and for What purpose? Whether we are evaluating individual students, teachers, curricula, professional development programs, schools or district leadership, district or state policy, the purposes to which we want to put the tests are not met with unidimensional reporting. We always want to know what the thing we are evaluating is good at and what it is bad at, so that we may address those weaknesses. Assuming, claiming, asserting and insisting that multi-dimensional constructs can be accurately or engagingly reported on unidimensionally is just a bad idea. The only people who favor such a thing do not actually have to interpret or make use of the results for any purpose, but would like to simplify the world so they do not have to actually understand the complex decisions and tradeoffs of those who do.

Or, to steal and redo Andrew’s graph…

[Image: a graph with axes labeled “accuracy” and “engagement” and two lines with negative slopes. The line labeled “Reporting on Unidimensional Results” sits lower and to the left; the line labeled “Reporting on Multidimensional Results” sits higher and to the right.]

Accuracy-Engagement Tradeoff for Unidimensional & Multidimensional Results

I agree with Andrew that there is often a trade-off between accuracy and engagement—and therefore clarity—though I am not convinced that it is always zero-sum. More importantly, whatever the sum is, it is lower when reporting the false simplifications and fictions of unidimensional results than when reporting the more useful and meaningful multidimensional results.

I know that IRT is cool. I know that it has mathematical elegance and real conceptual strengths, as Andrew’s other predecessor at HGSE taught me. But the use of unidimensional psychometric models should be limited to measuring and reporting on constructs that the subject matter experts believe are unidimensional.

Unidimensionality and Fairness Dimensions

Unidimensionality is a simplifying assumption, giving non-experts something that they think they can interpret—regardless of the fact that this kind of simplification will likely baffle real experts as being utterly uninterpretable. Its impact on fairness is quite similar.

If a test is unidimensional, then items that do not measure what the other items measure are bad items and should be excluded. This is the basis for simple differential item functioning (DIF) analysis: flagging items that work differently than the other items for some defined subgroup of test takers.

But if the construct a test is supposed to be measuring is not truly unidimensional, DIF is not going to work. In that situation, it is resting on false assumptions. The very fact that DIF works at all to flag problematic items is simply a product of the fact that the demands of unidimensional models are put ahead of the demands of content and the construct definition. 

Therefore, one problem with depending on unidimensional psychometric models is that it allows so many people to think that DIF is the most important tool for catching fairness issues in items (and therefore tests). It distorts the construct and thereby alters the potential meanings of fairness. Of course, DIF analysis is otherwise limited to examining only the dimensions of diversity that are tracked for all test takers.
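
For readers who have not run one, here is a minimal sketch of a Mantel-Haenszel check, one common DIF procedure, on simulated and entirely hypothetical data. Note where the unidimensional assumption does its work: the stratifying variable is a single rest-of-test score, so test takers are treated as comparable whenever that one number matches.

```python
# A minimal Mantel-Haenszel DIF sketch on hypothetical, simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)               # 0 = reference, 1 = focal (hypothetical)
theta = rng.normal(0, 1, n)                 # one overall trait, by assumption
# Hypothetical studied item: harder for the focal group at the same theta.
p_correct = 1 / (1 + np.exp(-(theta - 0.2 - 0.5 * group)))
item = rng.binomial(1, p_correct)
# Hypothetical rest-of-test score used as the stratifying (matching) variable.
rest_score = rng.binomial(20, 1 / (1 + np.exp(-theta)))

num, den = 0.0, 0.0
for s in np.unique(rest_score):
    mask = rest_score == s
    a = np.sum((group[mask] == 0) & (item[mask] == 1))  # reference correct
    b = np.sum((group[mask] == 0) & (item[mask] == 0))  # reference incorrect
    c = np.sum((group[mask] == 1) & (item[mask] == 1))  # focal correct
    d = np.sum((group[mask] == 1) & (item[mask] == 0))  # focal incorrect
    t = a + b + c + d
    if t > 0:
        num += a * d / t
        den += b * c / t

alpha_mh = num / den                        # common odds ratio across strata
mh_d_dif = -2.35 * np.log(alpha_mh)         # ETS delta scale; negative disfavors the focal group
print(f"MH odds ratio: {alpha_mh:.2f}, MH D-DIF: {mh_d_dif:.2f}")
```

Everything here conditions on a single matching score; if the construct is genuinely multidimensional, “matched” test takers may not be comparable at all, which is exactly the point above.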

In fact, test takers’ success with individual items and tests is the product of many, many dimensions, qualities and traits. These interact in a variety of ways. For example, Kristen Huff just told me a story about her own childhood experience to substantiate an untracked dimension that Marjorie and I think about a lot. We think that urbanicity is a big deal, and it is something different than simply geographic region. Kristen said that she had no experience with city blocks, growing up. Something about city blocks appeared on a test, and she could only make sense of it because she watched Sesame Street.

In fact, this authority of unidimensional psychometric models leads to attenuation of any signal that tests could measure, focusing them on some muddled middle of compensatory KSAs—many from outside the domain model—that might not be evenly distributed across all subgroups in a testing population. Thus, lower scoring members of one subgroup might have some of those compensatory KSAs in larger degrees than others. And frankly, the unexamined assumptions made by content developers about additional KSAs are likely a product of their own backgrounds and experiences. They unwittingly give test takers with backgrounds and experiences more similar to their own an advantage.

While this is not directly a product of the insistence on unidimensionality, it might follow inevitably in a test development workflow that is so dependent upon that assumption. Appropriate examination of the many dimensions within the content and across the test taking population is a sort of habit of mind—a professional habit. But not one encouraged by psychometrics’ appreciation of the robustness and mathematical elegance of item response theory. 

Insisting that test developers think more carefully about dimensionality, putting in the time and effort to recognize the complexity of test takers’ cognitive paths in response to items, is an important part of Rigorous Test Development Practice. We apply such tools as radical empathy to infuse considerations of fairness throughout the content development process, because the psychometric desire for simplifying unidimensionality is only going to shift people away from respect for the real variety of dimensions of diversity among the test-taking population. During content development, we consider so many different dimensions of diversity, as might be germane to the content, the items and the test population, rather than trying to narrow it down to a generic list of tracked test taker traits.

What Does Unidimensionality Feel Like?

[This is the year of addressing unidimensionality. Here is this month’s installment.]

Unidimensionality can feel good. It is a simplifying assumption that can make a complex set of data or concepts far easier to digest and make sense of. 

An inevitable part of becoming expert in anything is the realization that things are more complex than one had realized previously. Potters think about the many qualities of the clay they work with that contribute to its overall quality, and they understand that the question of overall quality is really context- and goal-specific. That is, it is not really ever about quality, but rather about qualities. The same is true for professional chefs and their knives, because different knives offer different balances of qualities. This is true for inputs and true for outputs. It is certainly true for the subjects of educational assessment. The more expert you are, the more dimensions you see and take into account.

But not everyone has the expertise to recognize all those dimensions. Perhaps more importantly, not everyone has the expertise to process and consider what all of those dimensions mean in the context of each other. It is simply information overload—again and again and again.

Most of us have some area in which we are expert or real connoisseurs. There is something that we care enough about to have developed the ability to comfortably take in and make sense of a large amount of information. We understand what it means and have the schemas to process it together for our various purposes. But this contextual expertise does not make it so easy or comfortable to take in complex information of other sorts.

And so, we resort to simplifying assumptions when working outside of our own areas of expertise. In part, this saves us time. In part, it saves us aggravation and frustration. But mostly, it enables us to make some sense of the complexity, as opposed to simply being overwhelmed or paralyzed.

So, what some people see as a ridiculous oversimplification, others see as a necessary simplification. For some, it turns the apparent chaos into something intellectually manageable, and that feels good. Flattening out details, simplifying, and reducing complexity are all coping strategies for the overwhelmed, and therefore they feel good—even necessary.

Well, that’s one perspective. 

To experts, to people who have the schemas and experience to have a grip on the complexity of the many factors and various dimensions of the situation, unidimensionality is frustrating in a very different way. It is not merely a simplification, but rather the greatest oversimplification possible—reducing everything to just one dimension. It looks like willful ignorance. It can feel like an attack on one’s values and expertise. It’s the frustration of knowing that an approach is usually going to produce wrong answers, and will just get lucky every now and again.

To some, it offers the relief of being able to produce any answers at all, and to others it offers the frustration of knowing the answers it offers will usually miss the point.

To an educator or parent, it is important to know which things a student is good or bad at, and perhaps how good or bad. Companies do not hire people based on GPAs (i.e., grade point averages) or WAR (i.e., Wins Above Replacement), because they care which knowledge, skills and abilities job candidates have. Doctors do not make treatment decisions based on one simplified overall health score. No one whom we trust to make important decisions for us or our loved ones does so based on one unidimensional overall scale—and when we ask them for advice or to explain, we do not want to hear “Well, because the overall score of everything is [x], you should do [blah blah blah].” Rather, we want to understand more than that, and we want the decision to be based on greater understanding than that.

So, what does unidimensionality feel like? Well, at first and to non-experts, it feels good. It feels like the solution to frustration. But to experts or to those invested in the quality of a decision or outcome, it feels even more deeply frustrating.

What is Unidimensionality?

What does it mean to be good at math? There are students who were good at math before they hit algebra, and then struggled. There are students who were good at math, but just weren’t great at the proofs of geometry class. There are kids who were good at math until they hit calculus. There are kids who are good at math, but just can’t do word problems. There are kids who are good at math, but keep making sloppy mistakes. There are kids who are good at math so long as they have already learned how to solve that kind of problem, but particularly struggle when faced with novel problems.

So, what does it mean to be good at math?

Can a student be really good at math if they struggle with algebra, proofs, calculus, word problems, and novel problems, and make sloppy arithmetic mistakes? Clearly not. These things are all aspects of mathematics. The best math students excel at all of them and the worst excel at none. But most students are better at some and worse at others.

When we have large constructs (e.g., math), but students differ in which parts they are better at and which parts they are worse at, the construct is multi-dimensional. Math is not one thing; it is not unidimensional.

English Language Arts is not just one thing, either. Reading is not just one thing, and neither is writing. One can be a good speller, but have poor command of the conventions of formal grammar. One can write good sentences, but struggle with developing a single cohesive paragraph. One can struggle to put together a cohesive piece that organizes ideas and the support for them. And quite differently, one can write imaginatively—a certain kind of creativity. One might be good at writing evocative descriptions, or real-seeming characters. One might imagine interesting plots, or write realistic dialogue. Reading also has many components that different readers are better or worse at.

Not only can people differ in which dimensions of a larger construct they are good at, the kinds of lessons and practice that might help them improve often differ from dimension to dimension even within a construct. Learning to be a better speller is a very different process than learning to write real-seeming characters. Learning to be more careful with arithmetic is different than learning to solve word problems.

The thing is, it’s not just that mathematics is multi-dimensional. Even arithmetic is itself multi-dimensional. Even multiplication is multi-dimensional. Even single-digit multiplication is multi-dimensional. When someone learns their multiplication tables, they can be better at some parts of it than others. 2’s, 5’s and 10’s are easy. The others…well, there are tricks and there is memorization. If we all focused on 8’s first, we might know them better than 6’s, but we tend to focus on 6’s before 8’s. Eventually, however, when we are past those learning stages, we process all of that complexity more automatically and the dimensionality of multiplication tables decreases. It might even become unidimensional, with adults differing only in their overall level of command rather than in the particular strengths and weaknesses they had when first learning the tables. Some people know them all, and the ones who don’t tend to make the same mistakes. That is, once it is safe to assume that we have obtained the level of proficiency with single-digit arithmetic that we are going to obtain, it is unidimensional—but that is past the point when it is a skill worth measuring.

So, some people remain better at algebra, while others might remain better at the reasoning skills of proofs, and others better at the diligent care of avoiding sloppy mistakes. Similarly, some writers are better at dialogue, others at character and others at plot. Moreover, science, social studies, foreign language, psychology, each sport and most everything else is actually multi-dimensional.

Even sprinting—running a footrace—is multi-dimensional. Track and field coaches talk about the biomechanics of i) the start, ii) acceleration, iii) drive and iv) deceleration—though some think there are more and some think there are fewer dimensions. Thinking through this example puts the lie to the idea that unidimensionality can be meaningfully built from a constant combination of separate components. Different sprint distances (e.g., 10m, 40m, 100m, 200m) each constitute a different ratio of these components, and there is no absolute or definitive reference for which ratio represents sprinting. It is always an arbitrary decision which one to favor.
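
A toy illustration of that arbitrariness, with made-up numbers: two hypothetical sprinters with opposite phase profiles trade places depending on which weighting of the phases you decide counts as “sprinting.”

```python
# A toy sketch with hypothetical numbers: which sprinter is "better" depends
# entirely on how the phase components are weighted into one composite.
import numpy as np

# Hypothetical phase abilities: [start, acceleration, drive, deceleration]
sprinter_x = np.array([9, 8, 5, 4])
sprinter_y = np.array([5, 6, 8, 9])

# Hypothetical phase weights for a short vs. a long sprint.
weights_40m  = np.array([0.4, 0.4, 0.15, 0.05])   # start-heavy
weights_200m = np.array([0.1, 0.2, 0.35, 0.35])   # endurance-heavy

for label, w in [("40m-style composite", weights_40m), ("200m-style composite", weights_200m)]:
    print(label, "X:", round(float(sprinter_x @ w), 2), "Y:", round(float(sprinter_y @ w), 2))

# X wins under the 40m weighting, Y wins under the 200m weighting. Neither
# composite is "sprinting ability"; each is one arbitrary collapse of a
# multidimensional profile onto a single number.
```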

So, from an educational measurement perspective, what is unidimensionality? If we care at all about the substance of what we are measuring, then unidimensionality is an arbitrary fiction created to serve some convenience—and perhaps never even able to serve that convenience well.

The Original Sin of Large Scale Educational Assessment

The Standards for Educational and Psychological Testing explain five “sources of validity evidence” on pages 13-21.

  • Evidence Based on Test Content

  • Evidence Based on Response Processes

  • Evidence Based on Internal Structure

  • Evidence Based on Relations to Other Variables

  • Evidence for Validity and Consequences of Testing 

Only one of these is really about even moderately sophisticated psychometrics: Evidence Based on Internal Structure. The others are either content based or rely on other sorts of statistical techniques. But evidence based on internal structure gets at some real issues in psychometrics. It is easy to understand, as it has the shortest explanation of the five potential sources of validity evidence. For example, the first of its three paragraphs says:

Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based. The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous, but that are also distinct from each other. For example, a measure of discomfort on a health survey might assess both physical and emotional health. The extent to which item interrelationships bear out the presumptions of the framework would be relevant to validity (p. 16).

And yet, the practice of developing, administering and reporting large scale standardized educational assessment seems to have mostly abandoned this form of validity evidence—the only form that really gets at psychometric issues. 

Straightforward examination of domain models (e.g., state learning standards) immediately reveals that these tests are supposed to measure multi-dimensional constructs. Those who know the constructs and content areas best are quite clear that these constructs (i.e., content areas) are multidimensional, with different students doing better in some areas and worse in others. They require an array of different sorts of lessons and ought to be measured with an array of different sorts of questions. 

I was taught that this kind of psychometric analysis is really about factor analysis of some sort: which items tend to load onto which factors—dimensions—followed by qualitative, content-based analysis to confirm that this is as it should be. Heck, the basic question of whether the hypothetical dimensionality of the construct is reflected in the empirical dimensionality of the instrument…well, I was taught that that is really important. And The Standards seem to say that, too.
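
As a minimal sketch of what that check might look like (the data, loadings and trait labels here are simulated and hypothetical, not from any real test), one can generate item responses from two latent traits and then inspect the eigenvalues of the inter-item correlation matrix. A unidimensional reporting model would predict one dominant eigenvalue; the simulated data say otherwise.

```python
# A crude internal-structure check on simulated, hypothetical data: ten items
# generated from TWO latent traits show two dominant eigenvalues, contradicting
# a unidimensional reporting model.
import numpy as np

rng = np.random.default_rng(1)
n_takers, n_items = 5000, 10
# Two moderately correlated traits (e.g., "procedural" and "reasoning" skill).
traits = rng.multivariate_normal([0, 0], [[1.0, 0.4], [0.4, 1.0]], size=n_takers)
loadings = np.zeros((n_items, 2))
loadings[:5, 0] = 1.2    # items 0-4 driven by trait 1
loadings[5:, 1] = 1.2    # items 5-9 driven by trait 2
difficulty = rng.uniform(-1, 1, n_items)

logits = traits @ loadings.T - difficulty
responses = rng.binomial(1, 1 / (1 + np.exp(-logits)))

corr = np.corrcoef(responses, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print("Largest eigenvalues:", np.round(eigenvalues[:4], 2))
# Two eigenvalues well above 1 suggest (at least) two dimensions. In practice
# one would use a proper factor-analytic or multidimensional IRT fit, plus a
# content review of which items cluster together, but even this crude check
# can show that the hypothesized and empirical structures disagree.
```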

But instead of ensuring that the dimensionality of the instrument matches the dimensionality of the domain model, the dominant mode in large scale educational assessment has an almost knee-jerk reliance on unidimensional models. Heck, items that fail to conform to this demand are discarded, as model fit statistics are the ultimate determinant of whether they can be included on a test (form). Such statistics are used to ensure that the dimensionality of the instruments does not match that of the construct. 

This use of such statistics combines with the use of unidimensional models to ensure that tests are not valid, by design. It ensures that domain models will be reread, reinterpreted and selected from only insofar as they can support the psychometric model. The tail wags the dog.

There are many issues with large scale assessment that cause educators, learners, parents and the public to view them as “the enemy,” as Steve Sireci observed in his 2020 NCME Presidential Address. But if I had to pick the single most important one, this would be it. Multiple choice items are problematic, but it quite often is possible to write good multiple choice items that i) reflect the content of the domain model, ii) prompt appropriate response processes, iii) combine for an internal structure that resembles that of the domain model, iv) combine to have appropriate relationships to other variables, and v) support appropriate inferences and consequences. But none of those are possible while insisting that items and tests are not allowed to match the structure of the domain model. This is not simply about ignoring the domain model, as some sort of neglect. Rather, this is active hostility that affirmatively bars using it as the primary reference for test development.

Looking for DIF or other violations of invariance that suggest fairness issues is not enough, so long as the structure of the domain model itself is barred from properly influencing test construction, as The Standards say it should.

To state this more plainly, this practice sets psychometric considerations as the main obstacle to developing valid tests—or tests that can be put to any valid use or purpose.