Perhaps the Biggest Problem: Misunderstanding Bias and Error

"Bias" has a particular meaning in the field of measurement. Fortunately, this meaning is not that far off from our colloquial/everyday use of the term. In measurement, it means systematic error in a particular direction. This meaning highlights the fact that there is another class of error, the kind that is not systematic in a particular direction: "noise."

Unfortunately, too many people -- including too many psychometricians and other professionals in the field of measurement -- do not really recognize bias, and our use of measurement suffers for it.


Noise is endemic in any measurement. Measurements are always a little off -- maybe a little high, maybe a little low. I need 2 tablespoons of sugar, and maybe I grabbed 26 grams, rather than 25. Or maybe it was 24. Or 24.2. Or even 22.8.

The more careful we are, and the better our instruments, the better we can do at reducing noise. But we cannot eliminate it. Our measurements will always have some random error component.

In cooking, if we are careful and actually use the right tools, the noise is too small to matter. 

In other applications, the noise can matter more. In educational measurement, where we make important decisions for and about children, the noise can be very important. We know this, and we have statistical tools to help us recognize it, to help us quantify it, and to help us think about how to reduce it.

The primary way to deal with noise is to make longer tests. Seriously. And this works. Because noise is -- by definition -- random, in the long run it will cancel out. Tests with more items (i.e., questions) actually lead to less noise in the final score. In this case, more is better. In this case, adding more noisy items leads to less overall noise, because random errors in opposite directions cancel each other out.
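To make the "longer tests" point concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the true score of 0.7, the per-item noise of 0.2, the normal distribution of that noise, and the helper names `observed_score` and `score_spread` are all made up, and real test scoring is more complicated than averaging items.

```python
import random
import statistics

def observed_score(true_score, n_items, noise_sd, rng):
    """Average of n_items noisy readings of the same true score."""
    readings = [true_score + rng.gauss(0, noise_sd) for _ in range(n_items)]
    return statistics.fmean(readings)

def score_spread(n_items, trials=2000, true_score=0.7, noise_sd=0.2, seed=1):
    """Standard deviation of the observed score across repeated testings.

    The numbers are hypothetical; only the pattern across test lengths matters.
    """
    rng = random.Random(seed)
    scores = [observed_score(true_score, n_items, noise_sd, rng)
              for _ in range(trials)]
    return statistics.stdev(scores)

# Same test taker, same noise per item, three different test lengths.
for n_items in (10, 40, 160):
    print(n_items, "items -> spread of scores:", round(score_spread(n_items), 3))
```

Under these assumptions, the spread of the final score shrinks roughly with the square root of the number of items, so quadrupling the test length cuts the noise about in half.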

This is not how bias works. 


At the annual NCME (National Council on Measurement in Education) meeting in Washington, DC this month, I had a whole bunch of interesting conversations. These usually happened immediately after a session ended, as I spoke with someone I'd never met before about something we heard from one of the presenters.

Prof. Mark Reckase gave a presentation that focused generally on the differences between educational measurement and psychological measurement. In this speech, he mentioned the Hippocratic Oath, the idea that doctors pledge to, "First, do no harm." It would be wonderful if we had that professional ethical standard in educational measurement. After the session, I spoke with another attendee about this. I was saying that if we were to live by that standard, sometimes we just wouldn't test, we wouldn't use a measurement, we might not sell/license a test to some customers. But he didn't understand.

I tried to give an example, speaking of the problem of gender and racial/ethnic bias in job interviews. Unfortunately, our candidate screening processes tend to perpetuate the makeup of our companies. People are more likely to see their own positive traits in other people who look like them and who have similar backgrounds. It is just harder for people to see positive traits in people who look different and come from different backgrounds than in those who are already similar to them. Thus, an argument could be made that in-person interviews might do more harm than good, and if we lived by the "first, do no harm" standard, perhaps we should just skip them entirely -- even though they are a well-established practice. That is, the fact that we have always done them might run right up against the "first, do no harm" standard.

This other gentleman insisted -- over and over again -- that the answer was to do more interviews. That if there was error in the interview process, doing more interviews would compensate for it.

He was confusing random error/noise with bias. He had in his head that bias is just a form of error, and the answer is to let the error cancel out.

But bias does not work this way. 


Imagine that you have a measuring cup that is off. Imagine that it is just too small, by 10%.

Each time you use it, there will be a random component to the error. You won't get exactly 0.9 cups every time. You will get a little more or a little less than 0.9 cups. The random error/noise will be in addition to the bias. Therefore, carefully measuring out 32 "cups" to get 2 gallons will lead to a really good chance that the noise has cancelled out (for the most part), but you'll still be around 10% short. 
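The measuring cup example can be sketched as a quick simulation. The 0.9-cup size and the 32 pours come from the example above; the amount of pour-to-pour noise (`noise_sd=0.03`), its normal shape, and the helper name `pour_cups` are assumptions made up for illustration.

```python
import random
import statistics

def pour_cups(n_cups, rng, cup_size=0.9, noise_sd=0.03):
    """Total volume from n_cups pours with a cup that really holds cup_size.

    cup_size is the bias (10% too small); noise_sd is an assumed
    random wobble on each individual pour.
    """
    return sum(rng.gauss(cup_size, noise_sd) for _ in range(n_cups))

rng = random.Random(7)

# Measure out 32 "cups" (meant to be 2 gallons), many times over.
totals = [pour_cups(32, rng) for _ in range(2000)]

mean_total = statistics.fmean(totals)
shortfall = 1 - mean_total / 32              # share missing on average
relative_noise = statistics.stdev(totals) / 32

print("average total (true cups):", round(mean_total, 2))
print("average shortfall:", round(shortfall, 3))
print("leftover noise, relative to 32 cups:", round(relative_noise, 4))
```

Under these assumptions, the leftover noise comes out at well under 1% of the total, while the shortfall stays at about 10% no matter how many cups are poured.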

If a bunch of items are individually biased a little bit against girls, then using 32 of them won't fix that problem. It will produce a score that is just as biased against girls.

The answer that the measurement industry appears to use is to add a bunch of items that are biased against boys to the item pool. The thinking seems to be that these biases will cancel out. They want to turn bias into noise, and think that they can make it cancel out. 

And they do the same for other forms of bias, too. They think that they can just make the bias cancel out.

Unfortunately, this doesn't work for individual test takers. Even if the strategy were sound, it is not applied for individual test takers. A balanced item pool is one thing, but test takers don't take item pools. Test takers take individual forms, and I do not know anyone who examines individual test forms (or adaptively generated forms) for gender bias, or urbanicity bias, or racial/ethnic bias, or SES bias, or any other bias.
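Here is a rough sketch of the gap between a balanced pool and the forms people actually take. It is a toy model, not a description of how any real testing program assembles forms: the pool size of 200, the form length of 32, purely random form assembly, and the idea that every item tilts by exactly one unit in one direction are all assumptions for illustration.

```python
import random

POOL_SIZE = 200     # hypothetical item pool
FORM_LENGTH = 32    # hypothetical form length

# +1 = item tilts against girls, -1 = item tilts against boys.
# The pool as a whole is perfectly balanced.
pool = [+1] * (POOL_SIZE // 2) + [-1] * (POOL_SIZE // 2)

rng = random.Random(3)

# Draw many forms at random from the pool and add up each form's tilt.
form_tilts = [sum(rng.sample(pool, FORM_LENGTH)) for _ in range(5000)]

balanced_share = sum(tilt == 0 for tilt in form_tilts) / len(form_tilts)
worst_tilt = max(abs(tilt) for tilt in form_tilts)

print("net tilt of the whole pool:", sum(pool))
print("share of forms that are exactly balanced:", round(balanced_share, 3))
print("largest net tilt on a single form:", worst_tilt)
```

In this toy setup the pool sums to zero, but only a minority of the randomly drawn forms do. Each test taker gets whatever tilt their own form happens to have.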

Validity problems cannot be turned into noise. Construct underrepresentation -- a huge problem in educational measurement -- cannot be turned into noise. Dumbing down of content for the sake of our testing technology cannot be turned into noise. Lowering the cognitive complexity of items to accommodate time limits on our tests cannot be turned into noise.

These are all biases in our tests. But too often we forget that not all error is noise.