The goal of norm-based (or norm-referenced) tests is to report on test takers relative to each other. This is a basic sorting and ranking function. Perhaps the reporting is in percentiles, perhaps deciles. But even when the reporting uses those larger grain sizes or buckets, it is important to get the finer-grained relative standings right. After all, you want to make sure that someone near the line is classified on the correct side of the line.
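As a minimal sketch (with hypothetical, simulated scores), consider how decile reporting still hinges on fine-grained percentile ranks: two test takers with nearly identical scores near a bucket boundary can land in different deciles.

```python
# Minimal sketch with hypothetical simulated scores: coarse decile reporting
# still depends on fine-grained ranking near bucket boundaries.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(500, 100, size=10_000)    # hypothetical scaled scores

def percentile_rank(score, all_scores):
    """Percent of test takers scoring below this score."""
    return (all_scores < score).mean() * 100

boundary = np.percentile(scores, 70)          # score at the 70th percentile
for s in (boundary - 1, boundary + 1):        # two nearly identical scores
    pr = percentile_rank(s, scores)
    decile = min(int(pr // 10) + 1, 10)       # report in deciles 1..10
    print(f"score {s:.0f}: percentile rank {pr:.1f}, decile {decile}")
# The two scores differ by only two points but are reported in different deciles.
```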
This means that it is really important to have a range of difficulty in your items. You need lots of measurement information at every cut score, including the one that separates your top two buckets.
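One way to see this is through item response theory, where an item's Fisher information peaks at its own difficulty. The sketch below uses the standard two-parameter logistic (2PL) information function with hypothetical item difficulties and cut scores; it illustrates the principle rather than any particular test's design.

```python
# Sketch of 2PL item information, I(theta) = a^2 * P(theta) * (1 - P(theta)),
# with hypothetical parameters: a cut score with no items of nearby difficulty
# gets far less information, so classifications around it are shakier.
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

cut_scores = [-1.0, 0.0, 1.5]                  # hypothetical cuts on the theta scale
item_difficulties = [-1.2, -1.0, 0.0, 0.1]     # hypothetical item difficulties (b)

for cut in cut_scores:
    total = sum(item_information(cut, a=1.0, b=b) for b in item_difficulties)
    print(f"test information at cut {cut:+.1f}: {total:.2f}")
# Cuts near the items' difficulties (-1.0 and 0.0) get roughly twice the
# information of the +1.5 cut, which no item difficulty sits near.
```

The remedy, in a norm-referenced design, is simply to spread item difficulties across the whole scale.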
Of course, this is only possible if the construct being measured is unidimensional. You cannot come up with a single ranking without a unidimensional scale of some sort. And if you have a multidimensional construct, you have to either flatten it into unidimensionality or give up on norm-referenced reporting.
So, norm-based tests must have a range of difficulty, but fidelity to the construct definition is far less important. Heck, items that are well aligned to some element of the domain model but do not fit the flattened (i.e., distorted) construct are counterproductive.
Criterion-based reporting requires a quite different test design. Test takers are evaluated against some set of criteria, such as a multidimensional domain model. Think of a set of state learning standards or all the diverse elements of a job or role analysis. There are lots of things worth considering. Criterion-based reporting might need to report sub-scores, or even abandon the whole idea of a single summary score. Performance is evaluated against some conception of proficiency or mastery of specific skills or ideas.
Criterion-based tests should define those conceptions of proficiency for each element of the criterion during test design, something that norm-based test design does not have to wrestle with. These are expert judgments, made by subject matter experts and/or educators. Empirical difficulty (i.e., how many test takers will get an item wrong vs. right) is not really germane. Either a test taker has that level of that skill, or they don’t. Certainly, those experts might establish multiple relevant levels of some cluster of related skills, but the empirical difficulty of those levels is not the point.
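As a toy sketch (the strands and cut levels here are hypothetical), criterion-referenced reporting classifies each test taker against each criterion's expert-set proficiency level, with no reference to how other test takers performed:

```python
# Toy sketch with hypothetical strands and expert-set cut levels: each sub-score
# is classified against its own criterion, never against other test takers.
blueprint_cuts = {
    "number_sense": 0.75,     # proportion correct deemed "proficient"
    "geometry": 0.70,
    "data_analysis": 0.80,
}

def classify(strand_scores: dict[str, float]) -> dict[str, str]:
    """Report proficiency per strand of the blueprint, criterion by criterion."""
    return {
        strand: "proficient" if strand_scores[strand] >= cut else "not yet proficient"
        for strand, cut in blueprint_cuts.items()
    }

print(classify({"number_sense": 0.82, "geometry": 0.64, "data_analysis": 0.80}))
# {'number_sense': 'proficient', 'geometry': 'not yet proficient', 'data_analysis': 'proficient'}
```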
Therefore, criterion-based test design and criterion-referenced reporting focus far more on items’ alignment to their criteria. Test blueprint design is incredibly important, and fidelity to the blueprint is perhaps even more important. Test blueprints, by contrast, should hardly matter at all for norm-based reporting.
Are our large-scale assessments norm-based or criterion-based? They almost all claim to be criterion-based, though the ACT and SAT are designed to rank test takers, so they are clearly the big exceptions. State accountability tests, AP exams, and so many others are aligned to some set of standards or performance expectations, or are at least said to be so aligned. They should be criterion-based.
However, in practice we too often ignore these issues and distinctions. Major users and funders of these assessments really want the rankings and sorting of test takers, and that compromises the criterion-based designs. Item difficulty and conformance with the distorted construct become the rule, rather than actual fidelity to the blueprint with carefully aligned items. The sorting and ranking become more important than the criteria.
Can a test satisfy the needs of both norm-based and criterion-based testing? If it actually is aiming at a truly unidimensional construct, it can. But how often are we doing that?