One of my favorite people in educational measurement—aside from my co-authors, of course—once overheard me ranting about unidimensionality and said quietly, “But I like IRT.” Yeah, I get it; IRT has some elegant properties.
The thing is, she really cares about the interpretability of tests. She really cares about developing tests that tell us something about test taker proficiency. She is not just a psychometrician.
And yet…
She is a psychometrician, and she likes Item Response Theory for psychometric reasons. But the thing is, I do not think that IRT tells us anything about test taker proficiencies. It is useless* for teaching and learning. It is useless for curriculum evaluation. It does not tell students, or their parents, what they need to know.
(*Yes, there are other techniques that build upon IRT. For example, cognitive diagnostic modeling uses IRT under the hood, but it is used with very different assumptions, and not to report relative scores among test takers.)
Unidimensional IRT is a norm-referenced technology. It reports test taker scores relative to each other. That’s all it tells us about test takers. It also tells us about item difficulty, relative to other items and to test takers. But these reports smush together all of the information in test takers’ response patterns and spit out a single set of scores, and therefore rankings of test takers relative to each other.
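To make concrete what smushing a response pattern into a single score looks like, here is a toy sketch of maximum-likelihood scoring under a Rasch-style 1-PL model. This is not any operational scoring procedure; the function names, item difficulties, and response patterns are all invented for illustration. Two test takers with very different response patterns, but the same number correct, receive exactly the same score.

```python
import math

def rasch_p(theta, b):
    """Rasch (1-PL) probability that a test taker with ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, difficulties):
    """Crude grid-search maximum-likelihood estimate of theta.
    Whatever structure the response pattern has, it is collapsed
    into this one number."""
    grid = [i / 100.0 for i in range(-400, 401)]
    def loglik(theta):
        return sum(math.log(rasch_p(theta, b)) if x == 1
                   else math.log(1.0 - rasch_p(theta, b))
                   for x, b in zip(responses, difficulties))
    return max(grid, key=loglik)

# Two hypothetical test takers: one got the two easiest items right,
# the other got the two hardest items right.
difficulties = [-1.0, 0.0, 1.0, 2.0]
print(estimate_theta([1, 1, 0, 0], difficulties))
print(estimate_theta([0, 0, 1, 1], difficulties))
```

Under the Rasch model the total number correct is a sufficient statistic for theta, which is precisely why the information about who got which items right disappears from the reported score.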
IRT tells us absolutely nothing about how to build alignment references or learning goals into test items. It tells us nothing about how to score test items. And it tells us nothing about what scores might constitute proficiency or any other meaningful bar. There are other techniques and tools for all of those things. IRT tells us nothing about alignment or fairness. It contributes no information about validity.
And, of course, if unidimensional IRT is used to calculate scores for a test that is supposed to measure a multi-dimensional domain model, that is prima facie evidence against validity. After all, the third type of validity evidence in the Standards for Educational and Psychological Testing is “Evidence Based on Internal Structure.” That is, “Analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based” (p. 16).
If the proposed use of a test is to rank students, IRT is great. It produces norm-referenced results. And one could misuse such results to group students by proficiency, achievement, or ability level, ignoring the fact that subject matter experts think the construct is multi-dimensional and the domain model is explicitly multi-dimensional. Obviously, those are deeply problematic groupings, but we have been taking that approach for decades.
Andrew Ho says that measurement is qualitative, then quantitative, and then qualitative again. IRT does not help with any of the qualitative work, or with the transitions between the two paradigms. It sits firmly in the middle, quantitative phase of the work. The dominant 1-PL, 2-PL, and 3-PL unidimensional IRT models require distorting multi-dimensional constructs into unidimensional data. (I know, IRT is robust to some degree of multi-dimensionality. That is part of what makes it so great. But it removes all of that information in order to produce unidimensional results, and it is not robust enough to take in data from all of the items that subject matter experts think are well aligned to alignment references.) Therefore, it actually harms the quality of our tests, their alignment, and their very validity.
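For readers who have not seen them, the three dominant models mentioned above differ only in how many item parameters they carry; the test taker side is always a single scalar theta. A minimal sketch, with parameter values invented for illustration:

```python
import math

def irt_3pl(theta, a=1.0, b=0.0, c=0.0):
    """3-PL item response function: discrimination a, difficulty b,
    and pseudo-guessing floor c. Setting c=0 gives the 2-PL; also
    fixing a=1 gives the 1-PL (Rasch-like) model. Note that the only
    test taker quantity here is the single scalar theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# The three model variants evaluated for one hypothetical item:
theta = 0.5
print(irt_3pl(theta))                  # 1-PL: a=1, c=0
print(irt_3pl(theta, a=1.4))           # 2-PL: adds discrimination
print(irt_3pl(theta, a=1.4, c=0.2))    # 3-PL: adds a guessing floor
```

Whichever variant is chosen, the right-hand side contains exactly one test taker quantity, which is the formal sense in which these models are unidimensional.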
IRT tells us nothing about test takers’ proficiencies. Heck, unidimensional IRT does not even accept the premise that there is more than one dimension of proficiency. It is based on norm-referenced assumptions and delivers norm-referenced scores, which are entirely unsuitable for criterion-referenced domain models and criterion-referenced purposes. That is, it misses everything about test taker proficiencies, obscuring them from test users.
We need to do better.