Item Parameters Are NOT Population Invariant


In response to my most recent LinkedIn post <https://www.linkedin.com/feed/update/urn:li:activity:7417738278168104960/> complaining about simulation studies that assume that item parameters can have true values without specifying the population, Charlie DePascale—who thinks there were earlier sins in large scale assessment than the assumption of unidimensionality—replied:

I agree 100% that many simulation studies simplify "reality" too much. Also, I agree 100% that the properties of the population being simulated should be specified, along with the method used to simulate it. And, I'm even pretty sure that I agree with what I think you mean by "item parameters are population-specific" but you should probably expand on that statement a bit given that, in general, "Item parameters are considered population invariant."

Sure. I’m happy to expand on that statement. And, yes, we will be coming back to the original sin of large scale assessment repeatedly. 

1) Of course item parameters are not, not, not generally considered population invariant. If they were, there would be no need for DIF studies. There would be far less need for field testing, though there would still be some. There would be less work for psychometricians. Post-field-test data review would not include various DIF flags. Everyone knows that items can have different parameters across different populations.

2) It is not merely that items can have different parameters across populations. If that were all, the threshold for flagging an item would be even lower than it is. We see population differences in every single item, even when we do almost nothing to track the relevant population differences.

From here on, I am not focusing on how populations differ by race/ethnicity, gender or FRPL status. Instead, I focus on populations’ different instructional experiences—perhaps the most important difference between populations.  

3) Imagine two very large school districts that have adopted different curricula. They emphasize some of the standards similarly but differ in their emphasis on others. Clearly, within each district we would expect items aligned to the emphasized standards to have lower item difficulties than items aligned to the less emphasized standards, and in the other district those differences will be partly inverted. (There is a small simulation sketch of this after this list.) Of course, recognizing that instruction and items can be more or less aligned requires stepping away from the psychometric assumption of unidimensionality.

4) Imagine two very large districts that have the same official curriculum but different instructional approaches. Imagine that they differ in the degree to which they devote resources to lower-achieving students. One might give those students additional instructional time and perhaps smaller classes. (For example, back in my teaching career, I taught a double-period ELA class to lower-achieving 9th graders.) Or, imagine that they focus on the students just below the proficiency threshold, a well-known practice during the NCLB years. This would change the performance of formerly lower-achieving students relative to higher-achieving students, thereby altering item discrimination.

5) Before I go on, decide for yourself whether you think higher-achieving or lower-achieving students benefit more from instruction. Are higher-achieving students simply better learners who will use additional instruction on a grade-level standard more efficiently, or do lower-achieving students face cognitive obstacles or barriers that additional instruction can help them overcome? In the context of a focus on grade-level lessons and large scale assessment's ceiling effects, I think that lower-achieving students are more likely to benefit from additional instruction. So, if we have two very large districts and one of them increases instructional time for all students in a content area (say, additional reading instruction for elementary school students), then again item difficulty and discrimination will be altered. The sketch after this list plays out exactly this kind of scenario.

6) Imagine two very large districts that differ in how much attention they pay to past years' items over the course of the school year. One district presents problems and examples as they have appeared on the large scale assessment in the recent past, and the other focuses instead on the higher-level thinking skills required by more complex problems. Do you really think that item parameters derived from these two districts will be invariant between them?

7) Now imagine that in all of these examples, it was the same district that simply adopted some policy changes. So, each of these examples is the same district before and after the change, time 1 and time 2. 

8) Or, imagine that these two very large districts are indeed geographically distinct, but that they border each other. One is Atlanta and the other Gwinnett County, in the Atlanta suburbs. Or the city of Baltimore and adjacent Baltimore County, Maryland. Thus, the districts have rather different population demographics: race/ethnicity, FRPL and perhaps ELL distributions, for example. Try to think of all the ways that two such districts can differ from each other. Internal resources. Parental education levels. Wealth, income and socioeconomic status distributions. Ethnic distributions. Share of immigrant homes and/or households where English, the language of instruction, is not the default language. Do you really think that item parameters will be the same across these two districts? Will they be the same for subgroups within each district? Will they be the same for corresponding subgroups across the two districts? Will relative item difficulties and item discrimination parameters just naturally be the same?
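To make (3) through (5) concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption of mine, not data or a model from any district: a 2PL-style generating model, made-up emphasis and support effects, and classical proxies (proportion correct converted to a logit for difficulty, corrected item-total correlations for discrimination) in place of a full IRT calibration. Even this toy setup yields item difficulty and discrimination estimates that differ between the two districts.

```python
# Minimal sketch (illustrative assumptions throughout): two districts whose
# instructional experiences differ, compared on simple per-district item statistics.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 20_000, 10
a = np.full(n_items, 1.0)              # generating discriminations
b = np.linspace(-1.5, 1.5, n_items)    # generating difficulties

def simulate(theta, emphasis_boost):
    """Score matrix under a 2PL-style model, plus an item-specific boost for items
    aligned to standards the district emphasizes -- a deliberate departure from
    unidimensionality, since performance now depends on instruction, not theta alone."""
    logit = a * (theta[:, None] - b[None, :]) + emphasis_boost[None, :]
    p = 1.0 / (1.0 + np.exp(-logit))
    return (rng.random(p.shape) < p).astype(int)

# District A emphasizes the first five standards; District B the last five (item 3 above).
boost_A = np.array([0.6] * 5 + [0.0] * 5)
boost_B = np.array([0.0] * 5 + [0.6] * 5)

# District B also devotes extra support to lower-achieving students (items 4 and 5 above),
# which compresses the low end of its achievement distribution.
theta_A = rng.normal(0.0, 1.0, n_students)
theta_B = rng.normal(0.0, 1.0, n_students)
theta_B = np.where(theta_B < -0.5, theta_B + 0.4, theta_B)

X_A = simulate(theta_A, boost_A)
X_B = simulate(theta_B, boost_B)

def item_stats(X):
    """Classical proxies: -logit(proportion correct) tracks difficulty;
    the corrected item-total point-biserial tracks discrimination."""
    p = X.mean(axis=0)
    difficulty = -np.log(p / (1.0 - p))
    total = X.sum(axis=1)
    discrimination = np.array(
        [np.corrcoef(X[:, j], total - X[:, j])[0, 1] for j in range(X.shape[1])]
    )
    return difficulty, discrimination

diff_A, disc_A = item_stats(X_A)
diff_B, disc_B = item_stats(X_B)
for j in range(n_items):
    print(f"item {j}: difficulty A={diff_A[j]:+.2f}  B={diff_B[j]:+.2f}   "
          f"discrimination A={disc_A[j]:.2f}  B={disc_B[j]:.2f}")
```

The emphasized items come out easier in the district that emphasizes them, and the changed shape of District B's achievement distribution nudges the discrimination proxies as well. Nothing exotic is required; it falls directly out of letting instruction matter.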

No, item parameters are clearly not, not, not invariant across populations. Unidimensional psychometric models require items to be population invariant, and therefore efforts are made to select only items that approximate that requirement, to the detriment of substantive item validity (i.e., their ability to elicit evidence of the targeted cognition for the range of typical test takers). And this is possible only because we ignore differences in instructional experiences when examining items for population invariance.
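When items are examined for population invariance, a common workhorse is a Mantel-Haenszel DIF analysis that matches test takers on a total or rest score and then compares two groups item by item. Here is a bare-bones sketch of that statistic; the function and the framing are illustrative, not any program's operational flagging rule. The point is that the grouping variable is a choice: group by district or instructional program, as in the simulation above, rather than by a demographic category, and the "invariance" picture changes.

```python
# Bare-bones Mantel-Haenszel DIF sketch (illustrative, not an operational flagging rule).
import numpy as np

def mantel_haenszel_dif(item, matching_score, group):
    """Common odds ratio and ETS-style delta for one dichotomous item.
    item: 0/1 responses; matching_score: total or rest score used to match test takers;
    group: 0 = reference population, 1 = focal population (e.g., two districts)."""
    num, den = 0.0, 0.0
    for k in np.unique(matching_score):
        s = matching_score == k
        A = np.sum((group[s] == 0) & (item[s] == 1))   # reference, correct
        B = np.sum((group[s] == 0) & (item[s] == 0))   # reference, incorrect
        C = np.sum((group[s] == 1) & (item[s] == 1))   # focal, correct
        D = np.sum((group[s] == 1) & (item[s] == 0))   # focal, incorrect
        T = A + B + C + D
        num += A * D / T
        den += B * C / T
    alpha = num / den                  # > 1 favors the reference group at matched scores
    delta = -2.35 * np.log(alpha)      # ETS delta metric; larger |delta| means more DIF
    return alpha, delta
```

Run it on the simulated districts above, with group coded as district and matching_score set to each student's rest score, and the items one district emphasizes show odds ratios clearly away from 1; run it with a random split of the same students and they do not. Conditioning on total score does not make instructional differences go away; it only hides them when instructional experience is never used as the grouping variable.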

No, the only way to think that any of this is at all appropriate is to willfully ignore the dimensionality of the tested domain as understood by those truly expert in it and in teaching it. It requires ignoring all the efforts to filter out items that do not fit the assumption and then proclaiming that the resulting data prove the initial assumption (unidimensionality) that was used to filter them. At some point shortly after I finished grad school, I was invited to help a team I had previously been involved with shape up the extension of a study for submission to a journal. I saw that the central claim of the paper was going to be, simply, a restatement of a filter applied to the data, recast as a finding that ignored the fact of the filter. I burned some bridges when I asked whether the filtering had been removed.

I'm sorry, Charlie. Unidimensionality is the original sin of large scale assessment. It infects so much of the actual practice of psychometrics. In this case (efforts to study the potential of language models to predict item parameters), it has poisoned the thinking of incredibly smart and thoughtful people, drawing them into meaningless research that can only undermine what little validity large scale assessment can currently rightly claim.