Vertical Scales and Unexamined Assumptions about Unidimensionality

Just this week, Chalkbeat’s Matt Barnum asked about the meaning of NAEP’s apparent use of a single scale to report all of its test results. This topic, vertical scaling, reveals the problems with unexamined assumptions about unidimensionality, and this example makes them easy to see.

What is a Vertical Scale?

While the same set of grades is reused across grade levels (e.g., either the A-F system or the 100-point scale), this is not always done with reporting on standardized tests. Though people understand that a student who just earned a B+ in 10th grade knows much more than a student who just earned an A- in 5th grade, some people want to highlight that there is a longer continuum across the grades. They even want to compare performance of students (or collections of students) across grades. That is where vertical scaling comes in.

With vertical scaling, we do not have to reset our understanding of the reporting scale for each grade. Instead, the scale just keeps going up. So, the average 2nd grader might score in the 140’s, an average 3rd grader in the 160’s, an average 4th grader in the 190’s, and so on all the way up to the average 11th grader in the 620’s. It’s a VERY long scale, with lots of overlap between grades.

There are generally defensible techniques for doing this — though they rely on problematic assumptions. Vertical scales are very important to support various policy goals and evaluation approaches. More simply, though, they support more kinds of comparisons — even comparisons of how much a single child learned one year vs. another year or how much two children in two different grades learned.

The key to vertical scaling is the use of anchor items. Anchor items allow the linking of two tests: across multiple forms of a test, across different years, across different grades. A handful of items reused on each test can act as a kind of splice that enables comparisons across tests. That is, comparisons of items across tests. So, if test developers quantify the performances of test takers on those anchor items on each test, they can use them as a common baseline to link performances across all the items on each test to each other, regardless of which test the items are on.

In the context of vertical linking, they take some of the harder items on the lower test and some of the easier items on the higher test and make sure they are all on both tests. (They do not have to be the easier/harder items, but I think the logic works better when they are.) These shared anchor items provide the psychometric bridge to create a single reporting scale for both tests. Do that with all the gaps between each grade and you can get single scales for the entire span of K-12 education.
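To make the splice concrete, here is a minimal sketch of one simple way it could be done. It assumes that item difficulties have already been estimated separately for each test, each on its own arbitrary scale (as in a Rasch-type model), and it uses a mean-mean shift, which is only one of several linking methods and not necessarily what any particular testing program uses. The item names, the difficulty values and both function names are hypothetical.

```python
from statistics import mean


def linking_constant(lower_form, upper_form):
    """Shift that moves upper-form difficulties onto the lower form's scale.

    Both arguments map item IDs to difficulty estimates. Items that appear
    in both dicts are treated as the anchor items.
    """
    anchors = sorted(set(lower_form) & set(upper_form))
    if not anchors:
        raise ValueError("the two forms share no anchor items")
    lower_mean = mean(lower_form[item] for item in anchors)
    upper_mean = mean(upper_form[item] for item in anchors)
    return lower_mean - upper_mean


def rescale(upper_form, constant):
    """Express every upper-form item on the lower form's scale."""
    return {item: b + constant for item, b in upper_form.items()}


# Hypothetical, made-up difficulty estimates. Each test was calibrated on
# its own, so the same anchor item gets a different number on each test.
grade3 = {"g3_easy": -1.0, "anchor_1": 1.0, "anchor_2": 1.5}
grade4 = {"anchor_1": -1.0, "anchor_2": -0.5, "g4_hard": 0.5}

shift = linking_constant(grade3, grade4)
grade4_on_grade3_scale = rescale(grade4, shift)
print(shift)
print(grade4_on_grade3_scale["g4_hard"])
```

Note that the whole procedure rests on a single number computed from a handful of shared items, and that it treats difficulty as one quantity that means the same thing on both tests.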

Unfortunately, I don’t buy it.

Unidimensionality’s Basic Falsehood

Unidimensionality is the idea that whatever this is that we are measuring, we really are measuring just one thing. That is, if this is a math test, we are measuring math. We can basically treat each item as contributing equally to the score because each item measures one unit of math. We can summarize performance with a single score on this 3rd grade math test because 3rd grade math is just this single homogeneous thing.

The problem is that 3rd grade math is not a single homogeneous thing. 3rd grade math is MANY things. Common Core has 5 different domains in 3rd grade math, comprising 25 different Math Content Standards. If one counts all the individually broken down subparts of CCSS’s 3rd grade math standards, one gets 33. Of course, there are also the eight Standards for Mathematical Practice.

How can we report 3rd grade math as a single score when it has all those different parts? We know the parts are different because the content experts tell us that. We know that different kids have trouble with different parts. We know that they are different grain sizes, even just between the Standards for Mathematical Practice and the Content Standards.

The Reporting Compromise and Its Unexamined Assumption

There is such utility in reporting performance unidimensionally that we simply have to find a compromise. Now, this is a compromise that we have all long been comfortable with. After all, we accept report cards that give students a single grade for math, a single grade for science, and a single grade for each course they take. We accept that in test reporting as well.

The compromise is to acknowledge that there are different standards, so the reported score is a composite score. 4 parts this domain, 3 parts that domain, 6 parts this other domain. It is like a teacher who says that grades in their class are made up of:

  • 30% homework

  • 30% projects

  • 20% tests

  • 20% class participation

Because standardized test reporting impacts so many thousands or millions of students, those composites should be designed very carefully. They should properly weigh the different elements of the entire content domain because different weightings will yield different results. Different weightings will encourage teachers to focus on different parts of the curriculum. Different weightings will favor or disfavor different students, different teachers, different schools and different instructional approaches.
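A small sketch can show how much the weighting alone can matter. The subdomain names, the two students’ subscores and both blueprints below are invented for illustration, and the weighted average is an assumed, simplified stand-in for how an operational test actually builds its composite.

```python
def composite(subscores, blueprint):
    """Weighted average of subdomain scores.

    `blueprint` maps each subdomain to its number of "parts" in the composite.
    """
    total_parts = sum(blueprint.values())
    weighted_sum = sum(subscores[domain] * parts for domain, parts in blueprint.items())
    return weighted_sum / total_parts


# Hypothetical percent-correct subscores for two students.
student_a = {"operations": 90, "fractions": 50, "geometry": 60}
student_b = {"operations": 60, "fractions": 80, "geometry": 70}

# Two hypothetical blueprints covering the same three subdomains.
blueprint_x = {"operations": 6, "fractions": 3, "geometry": 4}
blueprint_y = {"operations": 3, "fractions": 6, "geometry": 4}

for name, blueprint in (("X", blueprint_x), ("Y", blueprint_y)):
    a = composite(student_a, blueprint)
    b = composite(student_b, blueprint)
    leader = "A" if a > b else "B"
    print(f"blueprint {name}: A={a:.1f}, B={b:.1f}, higher score: student {leader}")
```

With these made-up numbers, student A comes out ahead under blueprint X and student B comes out ahead under blueprint Y, even though neither student’s knowledge changed.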

Thank god, the developers and sponsors of standardized tests know that the weightings matter. They try to be thoughtful about them. However, they may not be thoughtful enough. They may be too driven by convenience and too accommodating of the limitations on the tests (and of how those limitations drive the weightings). But no one takes designing a test blueprint lightly. Nonetheless, there is always something arbitrary about the weightings, as there is no definitively correct answer and there are so many factors that influence blueprint design that have nothing to do with the content domain itself (e.g., item type limitations, scoring budgets, seat time limitations, etc.).

Unfortunately, the real unexamined assumption is that the items themselves actually measure what they purport to measure. There is very little work on making sure that items do not individually produce false positive or false negative results. That is, whether students can solve them without using the targeted standard or might fail to solve them for reasons other than lack of proficiency with the targeted standard.

This lack of care with item validity (i.e., items that elicit evidence of the targeted cognition for the range of typical test takers) undermines the thoughtful work of designing the composite that a test’s blueprint promises. If the items don’t measure what they purport to measure, the elements of the composite are not properly weighted. Some elements might not even be represented at all!

This leads to scores whose meanings are uninterpretable, unless we just accept that the blueprint and details of the composite’s weights do not really matter. After all, 3rd grade math really is just one thing, right?

Problematically Assuming Unidimensionality for Vertical Scaling

Vertical scaling necessarily assumes unidimensionality. It has to. Even if the composite were crafted incredibly wisely and each item actually were perfectly valid, successive grades would have different composites. Some subdomains are more important in 3rd grade math and others more important in 4th grade math. Eventually, lower level content is taken for granted so that higher level content can be focused on. For example, while arithmetic is always important, the importance of integer addition on tests fades as more advanced arithmetic is covered and eventually the importance of arithmetic fades and algebra and other more advanced topics gain focus.

  • If the composite changes, what does it even mean to link scores between them?

  • If we acknowledge that the summative score is made up of different subdomains, how many anchor items do we need to link the subdomains across grades?

  • If a new subdomain appears at some grade, what does it do to the very idea of linking scores across grades?

The only way to resolve these (and other) issues is to hand wave them away and assume unidimensionality.

Back to NAEP’s (facially) Vertical Scale

The National Assessment of Educational Progress (“the nation’s report card”!!) makes no such claim. It does not claim to be a vertical scale. It does not claim that 4th grade scores can be compared to 8th or 12th grade scores. It does not claim that a two-point increase in 8th grade means the same thing as a two-point increase in 4th grade. It does not claim that high enough performance on the 8th grade test would mean more advanced average proficiency than a very low performance on the 12th grade test.

Not at all. It is not a vertical scale. But the three grades are reported in a way that looks like it might be a vertical scale.

But here is how we know it could never be a vertical scale: You cannot anchor items between two levels so far apart. If the items on the 4th and 8th grade tests each actually represent appropriate grade-level standards, we should not expect that any decent number of 8th graders would get the 4th grade items incorrect. Nor should we expect sufficient 4th graders to get any 8th grade items correct. Certainly not enough to splice the two tests’ scales together.

This is not about how smart the 4th graders are. Rather, it is that they simply have not been exposed to the 8th grade content, yet. Any signal (i.e., information about 8th grade math skills) in that data would be overwhelmed by noise (e.g., test taking savvy). Similarly, 8th graders who get 4th grade items incorrect might be far more likely to do so because they misread the item, rushed or were sloppy than because they lack the content expertise. Again, the noise of construct-irrelevant factors would overwhelm any signal of some 8th graders’ lack of proficiency with 4th grade content.

You simply cannot link tests that are so far apart because you cannot ask these students the same kinds of questions.

The Point?

Well, I see two important takeaways.

First, I find Matt’s question disturbing because he works for a very good education-specific news site and his beat includes both education policy and education research. Among scholars I respect, he is well thought of. No question, he knows a lot for an education journalist.

And yet, even Matt did not understand this. I’ve no idea how many times he has reported on NAEP scores, and the use of testing has been one of the dominant themes in education policy for decades. If Matt does not understand this, then what does that say about the rest of the media? What does this say about our elected leaders, about parents and about voters?

Second, whenever I challenge psychometricians about their assumptions of unidimensionality, they retort that their methods are robust to some amount of multi-dimensionality. They report that their statistical methods do not break down when faced with data that is not strictly unidimensional. Of course, I accept that. But that does not mean that the results they report mean at all what they think they do. Validity is about “interpretations of test scores for the proposed uses of tests” (The Standards for Educational and Psychological Testing, 2014, p. 11). Even if the statistics yield a result, the use and acceptance of vertical scales (even if only the suggestion of a vertical scale with NAEP) shows how little consideration psychometrics gives to validity.

I suppose that there’s a third takeaway, though it is less far-reaching. Matt’s question about NAEP scores has long since been addressed. In 2012, David Thissen wrote about the question of NAEP and vertical scales. “The conclusion of this essay will be that evidence can and should be assembled to support, and make more precise, interpretations of the first kind (“one year’s growth”), while interpretations of the second kind (cross-group comparisons across four-year spans) should be discouraged.” This work was done under contract with the publishers of NAEP, and yet they have taken up neither of his suggestions. They should do better.

Excellence is Multi-Dimensional

My high school experience back in the 1980’s was a bit odd, in quite a few ways. For one, it was an almost brand new school when I got there. It was a new public exam/magnet school and, for various reasons, the district decided to just let in one class at a time. So, the first year, there were just freshmen. The second year, that first class rose to be sophomores and my class joined. It wasn’t until its fourth year that the school had seniors, and that first class was the top class their entire high school careers.

I was on a competitive team from my freshman year, and there were two real stars in the class above me, but they took very different paths with very different strengths. One, Peter, was rock steady, always doing what he could do, without mistakes. The other, James, was more mercurial, with more brilliant moments mixed in with too frequent mistakes.

Now, both of them were excellent. But one was steady at a high level, and the other had more variation from meet to meet. Sometimes James exceeded Peter, but sometimes James fell short.

Throughout our high school years, Peter raised his level. He remained consistent, not making mistakes. But he did that at a higher level of performance each year. Through those years, he nearly caught up to James’s peaks. Similarly, James also improved. But for James, improvement had to mean addressing those mistakes. Through those years, he nearly caught up to Peter’s consistency.

Back then, I thought that I was more like James. I wanted to be more like James. I wanted to reach those heights, and I did not yet realize that James and Peter were converging. I saw them embodying two contrasting archetypes. And I certainly did not appreciate the value of consistency or of simply not making mistakes.

It was not until late in college that I really started to appreciate that James was not better than Peter. I did not understand the value of reliability, particularly when that reliability comes with a high level of performance. Yes, I still see value in moments of peak brilliance, but I value consistency far more than I used to.

Consistently avoiding mistakes that you are capable of avoiding individually requires a kind of focus that I did not have as a teenager. While I have gotten better, it is still sometimes hard for me. Whatever the reasons, it does not come easy to me in any domain.

As an adult, I see incredible value in avoiding downsides, potholes and mistakes. I see reliable contributions from colleagues, reliable friends and reliable recipes. The staples of our lives, of our work, of our pantries are so under-appreciated. That they deliver every day, and that we can count on them, makes everything else so very much easier.

This was true on my high school math team. The most thoughtful football analysts say it is true of running backs, too. It is an under-appreciated kind of excellence.

Who Makes Decisions about Goals and Resources?

Recently, someone tweeted to me, “I have lots of faith in teachers to implement learning properly. I have less faith in schools and admins to set the proper goals and resource appropriately.”

We are in an era of decreasing trust of teachers and schools. Of course, we are in an era of increasing distrust of all institutions, so this shouldn’t be so shocking. And while trust in teachers remains quite high, it has declined a little bit in recent years. Teachers now trail only nurses and medical doctors, but they used to rank higher. (They are still far ahead of police officers, judges and bankers. Local office holders are, on net, a little distrusted, and members of Congress are very much distrusted.)

Nonetheless, it is quite striking that someone would distrust schools and administrators to “set the proper goals and resource appropriately.” These simply are not the jobs of teachers or school administrators.

Educational goals are laid out in state learning standards. These state standards are developed by educational professionals, researchers and policy-makers, and then customized for various states. Finally, these customized standards are ratified and endorsed by state legislatures. For example, Florida customized the Common Core State Standards and the Next Generation Science Standards and calls them the Sunshine State Standards.

Educational goals are not set by individual teachers, individual schools, districts or their administrators. Educational goals are set by state legislators.

Educational resources are similarly out of the hands of schools and educational administrators. States are the primary determiner of educational resources — again, through acts of state legislatures. Local municipalities also contribute to educational resources through local government budgets. Again, it is local elected officials who make these decisions. In some areas, the school district has the authority to levy taxes, instead of the general local government. But this is done through elected school boards. In none of these cases are schools or administrators responsible for these decisions. In all of these cases, it is elected officials.

Of course, the federal government contributes ~10% of school resources. Here, it is Congress that decides. Again, elected officials.

To be fair to all of those legislative bodies, their acts usually have to be signed by an executive. Thus, it is not the legislatures alone who set standards or set resource levels. But they are all elected officials.

Now, where I live, we actually vote on the town budget every year. My local town government does not have the power to set budgets. Rather, its elected officials put together a budget for the citizens of the town to vote on. Occasionally, a town budget somewhere does not pass, and the town government must put forth a new proposal for citizens to vote on. This American Life recently did a piece on a contentious effort of citizens to radically alter a school budget. But nowhere in any of this do schools or school administrators set budgets.

It is incredible that people distrust teachers and administrators to do things that they’ve not been responsible for in generations.

Better Tests, Not Lesser Tests

Standardized tests and the uses to which they have been put have a very troubled history, and in many ways that is still true today. One very common response to this situation has been attempts to marginalize or eliminate standardized tests, or at least any meaning that may provide a foundation for decision-making.

And yet, teachers should still be accountable to principals, parents and students, schools accountable to communities and school boards, and school districts accountable to communities and various levels of governmental oversight. 

There has been this idea that standardized tests are responsible for bad decisions that have used them as justification. This idea persists, even though poor school funding and marginalization (within our schools!) of low performing students and populations go back as far as any concept of schooling has existed.

There has been this idea that if we can protect students from the evil tests which come from those unknown strangers, then we will ensure that those who know and love them best will do right by them. And I agree that that is the best case. That is what I want teachers, schools and school districts to do.

But actual history shows us that that is often not the case. We have too often settled for unacceptably low performance from some children and expected even less from others. Too often, educators and policy-makers have been blinded by the soft bigotry of low expectations. Dumbing down assessments so every kid will score well on them does a disservice to the very populations and communities that our educational systems have so long failed to do right by.

Now, I am the first one to assail the quality of our standardized assessments. They really do need to be changed. But the answer cannot be to make them so easy that they are incapable of providing any meaningful information. That perpetuates the false sense of complacency that this child is being well-served and that community is having its educational needs met. It lowers the bar on what we can expect from our schools, and I find that entirely unacceptable.

On the other hand, simply making tests more difficult is no better an answer. It is trivial to accomplish, but it too fails to provide useful information. 

Rather, each state has standards that define what students should learn in each grade. Some set of standards has been endorsed through our democratic processes in each state (i.e., by state legislatures and signed onto by governors). We know what the children should be learning, at least academically. Our standardized tests must do a better job of assessing those goals, so that parents, communities, school boards and other levels of government can make appropriate decisions about how to better support our children.

While these academically-focused standardized tests should not be the only basis that policy-makers use to make decisions about our schools, nor the only basis by which community members evaluate their schools, both groups deserve better information about this core function of schools, not lesser.

Copyright, Fair Use and Plagiarism in Assessment Development: Part II

Last week I wrote about plagiarism and how the concept applies in assessment development. Plagiarism is about using others’ ideas or words without giving appropriate credit (i.e., citation). But appropriate varies from context to context, and in assessment development the only things that are generally credited are excerpts from previously published work. That is, quite extended quotations — which are credited to their original creators or copyright holders.

This week, I address copyright infringement and fair use, in the context of assessment development.

What Is Copyright?

Copyright is mentioned in the US Constitution, which says that Congress shall have the power to “secur[e] for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” That covers copyright and patents, respectively.

Copyright only protects expression, and not ideas. So, it protects this blog post as I wrote it, but not the ideas or organization of the ideas. It protects my words. Copyright protects visual media, as well. That’s photographs, illustrations, paintings, drawings, film/video, etc. Again, it protects the actual exact thing, not the ideas behind it. You can rip off a plot, without violating copyright.

Copyright does NOT protect physical things. It does not protect inventions (i.e., that’s patent law). It does not protect designs for clothing or handbags, even though their look is part of the point. Should it? Well, it doesn’t. That’s how it works. It protects computer software because…because it does. Because software is written in a programming language, and back in the day lawyers convinced courts that that was the best way to think about computer software.

Copyright allows the copyright holder — usually the creator, unless they have transferred their copyright to someone else — to decide how the work may be used. It’s up to them. And it lasts for a limited amount of time, though that period keeps getting extended longer and longer. It is supposed to expire, eventually.

What About Fair Use?

Fair use is the big exception to copyright. The copyright holder gets to control how and in what conditions the work can be used, except for fair use. People talk about parody and satire, but what they really are talking about is a particular kind of fair use.

Fair use is not just an opinion. It is a technical term, and the law (i.e., Section 107 of the Copyright Act) defines a four-factor test to determine whether something is fair use.

  1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

  2. the nature of the copyrighted work;

  3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

  4. the effect of the use upon the potential market for or value of the copyrighted work.

These four factors are all relevant. Each must be examined and then the results are weighed out.

Factor 1: Purpose and Character of the Use

Some people falsely think that if they are not making money when they violate copyright then it is ok, that it is fair use. But they are generally wrong. Yes, that is relevant, but it is not enough. It is just one part of this one factor.

As one weighs out the four factors, it does matter if the goal of the enterprise that is using the work is to make money. That is, use by a commercial entity in the furtherance of its goals and operations (even if the reproduction is not actually being sold) is commercial use. Even if the use does not make money itself, it can still weigh against the use. On the other extreme, non-profit educational use is the best case for fair use. A for-profit educational organization is close to that extreme, but not all the way there. A non-profit that is not educational is not all the way there. A random book publisher or restaurant? Those are just straight commercial uses.

Quoting a work to comment upon it, like in a book review or various forms of scholarship? That leans heavily into fair use. Some might even consider that an educational use.

Regardless of the context, when use of a prior work transforms it in some way, fair use is a more likely conclusion. This is usually the point of parody and satire. The more they transform the original work, the more likely they are to be seen as fair use. Summarizing a work may include all of its ideas, but it entirely transforms the expression of those ideas. How transformative a use is can be a matter for debate, but there’s no question that greater transformation is more likely to be fair use.

Factor 2: Nature of the Copyrighted Work

Not only does the use matter, but so does the nature of the original work being copied.

Creators should generally have control over first publishing of their work, so courts are more protective of unpublished work than published work. That weighs against a finding of fair use.

Copyright is generally understood to be more focused on encouraging creativity than other sorts of endeavors, so fiction and other literary work is generally more protected than work aimed at being informative. This blog post? Informative. The 5pm newscast on your local TV station? Informative. That hit movie or novel? Literary.

Where does documentary film fit in that? It is a certain kind of creativity to record, edit and put that together. It is supposed to be entertaining, in addition to being informative. So, the courts might give medium weight to that, rather than maximal or minimal.

Once again, however, this is just one of four factors, and they all must be weighed against each other. There is no condition of any factor that guarantees a particular final ruling on the question of fair use.

Factor 3: Amount and Substantiality of the Portion Used

This is likely the easiest factor to understand, of the four.

If you are using a small piece of the original, then it is more likely to be fair use. The more you use, the less likely it is to be fair use. You copying the whole thing? That’s gonna weigh heavily against you.

Note that this is not about the absolute quantity used, but rather the relative quantity. Copying all of a 4-line poem is using the whole thing, while copying four lines of a 100-line poem is only using a small piece. 200 words from a 20-page short story is very different than 200 words from a long novel.

Of course, this poses a problem: how do you define the original work? Is it five minutes from an episode of a television show (i.e., nearly 25%)? Or, is it five minutes from a whole television series (i.e., less than 1%)? Obviously, the copyright holder would want to claim the former, and the person seeking fair use would claim the latter. It can be quite a challenge to figure out how to evaluate even a single factor, sometimes.
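The arithmetic behind this factor is simple, which is exactly why the choice of denominator matters so much. The sketch below is only an illustration: the function name is made up, and the 22-minute episode length and 100-episode series length are assumptions chosen to match the rough percentages above, not facts about any particular show.

```python
def fraction_used(amount_used, work_total):
    """Share of the original work that was copied (same units for both)."""
    if work_total <= 0:
        raise ValueError("the original work must have a positive length")
    return amount_used / work_total


# The two poems: the same four lines, very different shares.
whole_short_poem = fraction_used(4, 4)
excerpt_of_long_poem = fraction_used(4, 100)

# The television clip. Episode and series lengths are assumed values.
clip_minutes = 5
episode_minutes = 22
episodes_in_series = 100

share_of_episode = fraction_used(clip_minutes, episode_minutes)
share_of_series = fraction_used(clip_minutes, episode_minutes * episodes_in_series)

for label, share in (
    ("all of a 4-line poem", whole_short_poem),
    ("4 lines of a 100-line poem", excerpt_of_long_poem),
    ("5 minutes of one episode", share_of_episode),
    ("5 minutes of the whole series", share_of_series),
):
    print(f"{label}: {share:.2%}")
```

The same five minutes is a large share or a tiny one depending entirely on what is counted as the work, and that is a legal judgment, not a calculation.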

Factor 4: Effect of the Use Upon the Potential Market

This factor ought to be quite easy to understand, but it is far too often ignored.

If the use would tend to decrease the market for the original work, that counts against fair use. So, a teacher who makes copies of a story to distribute to their class, so that the school does not have to pay for the book in which the story appears? Sure, that’s an educational use, but it directly harms the market for the original work. Without making those copies, the school would have to pay for the book.

In theory, some uses may encourage the market for the original work. For example, movie reviews bring free publicity for films. Now, this (like the photocopied story) is hard to disentangle from the purpose and character of the use. But it can be pretty easy to recognize when the use is primarily to avoid having to pay for the original.

On the other hand, when a work is out of print and unavailable on the market, there is less likely to be an impact. That doesn’t mean it has to be available in the form you would prefer (e.g., streaming). If it is only available in some other form (e.g., BluRay or DVD), that’s still available.

Applying the Four Factors to Test Development and Publishing

The assessment industry has settled on some patterns of practice in building stimuli for items that can be examined through these four factors to determine whether they are fair use. These stimuli may include reading passages or other works that could be under copyright.

  1. While professional licensure exams certainly have a commercial purpose, K12 assessments are intended for educational and public policy use. Test development vendors that are non-profit organizations are, therefore, engaged in non-profit educational work. For-profit test developers are not quite as well off, in this regard. But all of them tilt at least a little towards fair use.

  2. Original works may be more creative (e.g., poetry or short stories) or more informative (e.g., journalism or scholarship). If they are previously unpublished, they are always commissioned by the test developers, so the developers own the copyright. But those other works, if the copyright has not expired (i.e., they are not yet in the public domain), are quite often in the more creative realms. That argues against fair use.

  3. Tests may include whole poems or articles, or may be limited to excerpts from larger works. It really runs the whole range. In some cases, this argues for a determination of fair use, and in others it argues against. Most commonly, though, they are excerpts.

  4. It is not likely that any use on a large scale assessment would lessen the demand for the work in the market. That argues for fair use.

Taken together, this very straightforward analysis suggests that most use of potentially copyrighted works would pass for fair use. An excerpt used by a non-profit company to put out a product for educational use is usually going to be fair use. A for-profit company using an entire poem or an original photograph is far less likely to be considered fair use, regardless of its impact on the market for the work.

Additional Considerations

Fair use does not matter if the user (e.g., a test developer or client) is willing to pay a licensing fee agreeable to the copyright holder, and works that are in the public domain have no copyright claims (by definition). This is why most test developers go with three options.

  • Try to find public domain works

  • Get permission for existing works (i.e., pay a licensing fee)

  • Commission works that they can own the copyright to.

Even more broadly than that, many test developers make use of ideas that they find elsewhere. But ideas are not copyrighted. Summaries, reimaginings, simplifications and adaptations are so transformative as to not even constitute use of the original expression, and therefore do not raise questions of fair use vs. copyright violation.

*********************

This entry does not address any of those other intellectual property areas (e.g., trademark, trade dress, trade secrets, patents), as they do not apply at all to questions of test content, though they certainly are interesting in their application elsewhere, even in the context of assessment development organizations.

Copyright, Fair Use and Plagiarism in Assessment Development: Part I

Large scale assessments do not exist in a vacuum and often rely on using the work of others — even feature the work of others. This can lead to concerns about copyright infringement and plagiarism.

Copyright is a legal construct. It is mentioned in the US Constitution. Article 1, Section 8 says, “The Congress shall have Power” to do a whole bunch of things, and Clause 8 lists, “To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;” But copyright is the focus of the next blog post. Today, I am focusing on plagiarism.

Plagiarism is not a legal issue. There are no hard and fast rules on plagiarism because there is no central authority who gets to decide on such things. But generally, plagiarism is inappropriately using someone else’s work — usually without proper attribution.

In the previous paragraph, “inappropriately” and “proper” are carrying a lot of weight. No one can give a general definition of plagiarism that is more specific than that because standards for plagiarism are contextual. That is, they vary from one context to another.

Example #1: Shakespeare

The great Broadway musical and film West Side Story rips off Shakespeare’s Romeo and Juliet. We all know that. Two warring groups. A teen from each meet at a party, immediately fall in love, find a way to be together, even to marry in secret. He kills one of her cousins, but she marries him anyway. Through tragic misunderstanding and mistakes, they die in the end. Well, he dies. In West Side Story, she lives. Everyone knows that West Side Story is Romeo and Juliet.

So, is that plagiarism? It’s never credited. The story is changed, but it is barely changed. The setting is updated and it’s turned into a musical, but the plot and main characters are basically the same.

There is this book from the 1990’s by Jane Smiley, A Thousand Acres. It’s King Lear on a 20th century American farm. The same plot and characters, whose first initials match their analogues in King Lear.

There’s this new play, Fat Ham. It’s Hamlet. Well, it’s not quite Hamlet. There are a LOT of changes. But the basic setup is Hamlet. And the main character slips into some of Shakespeare’s lines from time to time — from Hamlet, of course.

None of that is plagiarism. We are ok with movies, books, plays and television reusing old ideas, be they famous or more obscure. We do not expect them to be attributed to the original author, and certainly do not require it. Often, knowing about the original work can deepen our appreciation of the new work. It is not the changes from the original plots or settings that keep them from being plagiarism. It is just the expectation that in this context ideas are reused all the time, even with rather little change.

In this context, it’s just how it works. Heck, Shakespeare himself plagiarized almost everything. Sure, he added a twist here and there (brilliant alterations to make for a better story), but he based his work on the work of others without ever giving any credit for it. Scholars have basically figured out what books he owned because there are particular details in different plays that make clear whose versions he based his own work upon.

Example #2: Academic Writing

Academic writing is “the worst” (Miranda, 2015). The expectation is that you have to attribute every idea that was not originally yours to the proper source. That is, plagiarism is “Using the ideas, data or language of another without specific and proper acknowledgement” (Vice Provost for Student Affairs, 2020). Like, if I wanted to mention chocolate chip cookies, I should credit Ruth Wakefield, who invented them in the 1930’s.

Whether I am using someone else’s words or someone else’s ideas, in academic writing I have to credit them. “Specific and proper acknowledgment,” says the Vice Provost of Teachers College.

Now, even in academic writing, as careful as it is about plagiarism and credit, there is room for judgment. No, no one would expect you to cite the inventor of the chocolate chip cookie. And these days, no one cites Watson and Crick for their 1953 discovery of the structure of DNA, either. Ironically, if a work is important enough (foundational enough), it transcends the need for credit. Watson and Crick (1953) have only been cited 16,000 times, even though vastly more work builds on their ideas. And though I knew, when I wanted to use the phrase “the worst,” that what I heard in my head was from Lin-Manuel Miranda’s use of it in Hamilton, I didn’t really need to cite him.

But if I were to talk about the need to consider the details of things and understand how it feels to be in them, and also to consider the big picture using the idea of the balcony and the dance floor, I would have to cite Heifetz. This video is all about that, and particularly ripping off how he used that metaphor, its purpose and context and lessons. There is nothing wrong with writing about this, even writing that much, but they really should have cited Heifetz. That video does not meet academic standards for citation, but it is not from an academic institution.

Example #3: Blogs

There is no rule or expectation for blogs. Blogs are a little microcosm of the world, in this regard. Some blogs cite sources more, and some cite them not at all. With blogs, links often serve as citations.

Both academia and link-heavy blogs are about the conversation and the connection of ideas. There are other reasons (e.g., credibility) to cite and link, but a big part of it is just to continue to be a part of a larger and ongoing conversation. But that’s voluntary, when it comes to blogs.

Most blogs are not trying that hard to give credit. Their authors want to feel more ownership of their ideas. And adding all those citations and/or links actually makes it harder to read. Academic writing is really hard to read, and one of the contributors to that difficulty is all of those citations. Blogs want to be more accessible than that.

Plagiarism and Assessment Development

Plagiarism is about failing to give appropriate credit for using someone else’s words and/or ideas. And what constitutes appropriate credit varies by context. What is appropriate in the context of large scale assessments can be observed by looking at what large scale assessments have historically done.

Assessments do not credit originators for ideas, not generally. Excerpts from previously published works are generally credited to their authors and/or their copyright holders. Generally. And that’s it.

That is how it works.

One might argue that it should work otherwise. One might argue that it should be more like academic standards for plagiarism. I would respond that the audience (i.e., the test takers) are likely not prepared for academic levels of citations and that they certainly do not expect it. But should there be more citation than there has been? Well, some might think that, but it is just their view in the context of established expectations that are quite different. They can try to convince people, but there is no authority they can cite that makes it mandatory or even appropriate.

They might argue that to do otherwise is copyright infringement. In fact, citation is no protection against copyright infringement and what they have a problem with is likely not copyright infringement, in the first place. But that’s for Part II.

Why Not Speak Up?

One of our colleagues pointed out to us last year that while humility (one of our core principles) is important, it is also important to recognize that a lack of confidence can be a problem. Appropriate confidence in one’s own expertise is critical to successful collaborative work, too.

Thus, we dove back into thinking and came up with Expertise, Confidence & Humility. We are pretty happy with it, but we did not dive into the messier aspects of why people might not speak up when they should. We focused on the importance of appropriate confidence in one’s expertise, but not on reasons why someone with such confidence still might not speak up.

In that piece, we acknowledged that there are gender issues here around internalized and externally imposed societal expectations, but we did not address the social expectations that the expert in question might have for others, which are often well grounded.

As we mentioned, one of us has had too much experience not being expected to have something worthy to contribute. If this happens enough (being faced with others not having confidence in your expertise), it obviously gets more and more difficult to speak up. Why bother when you know that no one will listen, anyway? This is not a matter of confidence in one’s own expertise, but rather of confidence in others’ lack of respect. Yes, this is a real problem.

Unfortunately, it gets worse. Members of less powerful or prestigious groups (e.g., women, members of underrepresented minorities) can face real negative consequences for speaking up. For not knowing their place. For being — for lack of a better word — uppity.

And even worse, being right can make these consequences even more severe. People who resent the uppity voice will very much want to reinforce their own dominance, perhaps making sure to point out when that voice is wrong. But if that voice is not wrong, their need to reinforce dominance will seek other outlets — and perhaps require even more substantial efforts to enforce a desired hierarchy.

We don’t have an answer to this. We know that within our own teams, such dynamics should never happen. That violates the norms we try to establish and maintain. And we hope that the larger organizations in which our teams work are similarly disapproving of such attitudes.

But we know that this horrific dynamic exists broadly, and even within our own organizations and teams, there are people who are nervous to speak up because they have learned these problematic lessons elsewhere.

Of course, all of us still have to be mindful of when we are relatively more expert or relatively less expert in a room. But one’s ability and readiness to offer contributions is complicated by doubt about whether others will listen, and fear of backlash simply for speaking.

We wish we had an answer to this, but we do not.

Innovations and Citations

As an academic researcher and a dissertation coach, I am very familiar with the importance of citations. When explaining this to people, I say that what makes scholarship different is participation in what I call “the scholarly conversation.” That is, by positioning this research in the context of what came before, the scholar credits those who came before, demonstrates understanding of what came before and frames their new offering as they wish it to be framed for their readers.

My wife is an attorney and often a litigator. When filing briefs and motions with a court, she has to play a similar citation game. That is, the credibility of her arguments is increased because she demonstrates her understanding of what courts, legislatures and regulators have decided in the past, and she similarly frames her argument as she wishes her readers (i.e., a judge and their clerks) to understand it.

Both of these uses of citations are intrinsically conservative. They look back in time for wisdom and authority from those who came before.

In matters of law, this makes a great deal of sense. We do not want the law to change frequently. Stability and predictability of the law and of the outcomes of judicial decisions are generally good things. In matters of scholarship, it makes a great deal of sense, as well. It allows scholars to build on a huge amount of previous (often complex and subtle) work of others without having to review all of it in depth. Rather, the scholar tells the reader where they can find support for a point or full explanation of something or others’ evidence for an idea.

However, conservatism is at odds with innovation. Innovation looks forward to something new, while conservatism looks backward for the wisdom of the ages. To get around this, innovators often have to claim that they are restoring the true wisdom of the past that has been missed or misunderstood. I just had to do this, to some degree. Efforts to expand civil and legal rights to new groups of people in this country often have to attempt a similar strategy (i.e., no, you’ve underread the 14th Amendment. It actually suggests that we should…).

The problem with this conservative approach is that it stifles innovation. This may be a good thing. We do not want radical changes in contract law. We do not want to change what basically works and is widely depended upon. But it depends on the assumption that things do basically work and that when they could bear to be improved upon, there is relevant wisdom in the ancients that we can depend upon.

This is a challenge to the Rigorous Test Development (RTD) project. We have always said that content development is a black box that researchers and the literature have ignored. There simply is very little work on what the practice of assessment development is about, and virtually nothing about content development. ECD (Evidence Centered Design) really stops short of content development, or rather, it works around it, looking at test design and psychometrics, but not item development.

So, I wonder about assumptions about the value of citation-based arguments and the values they are based in. If things are not, in fact, on the right track, then are such arguments an obstacle to necessary improvement?

Reading Skills, Making History and the 3rd Amendment

(Many do not understand this, but Common Core did not just include standards for Math and English classes. The Literacy standards are for English class, but even on the title page it is clear they are also for “History/Social Studies, Science, and Technical” classes. Anywhere that one reads to learn information.)

Common Core’s (CCSS’s) sixth reading standard says, “Assess how point of view or purpose shapes the content and style of a text.” This skill is applied to reading literary works and to reading for information, and even to speaking and writing. Instruction in this skill begins in kindergarten, and awareness of the author and the fact that authors have a point of view begins in third grade. By high school exit, students should be able to, “Determine an author’s point of view or purpose in a text in which the rhetoric is particularly effective, analyzing how style and content contribute to the power, persuasiveness, or beauty of the text.”

The Third Amendment to the Constitution of the United States says, “No Soldier shall, in time of peace be quartered in any house, without the consent of the Owner, nor in time of war, but in a manner to be prescribed by law.” The Bill of Rights only has ten amendments, and two proposed amendments were rejected. This issue of quartering soldiers was clearly very important to our Founding Fathers. The 9th Amendment says this is not an exhaustive list and there are other rights (“The enumeration in the Constitution, of certain rights, shall not be construed to deny or disparage others retained by the people.”), but they listed a whole bunch of particular rights that they wanted to specifically enumerate.

Why did the quartering of troops make that list when things like marriage or travel did not? Heck, voting is not even listed — not enumerated.

That important sixth CCSS standard is quite relevant here. As adults we should be able to consider the point of view and purpose of the authors of the Bill of Rights when thinking about what was included and what was not.

The political leaders of the new states and nation had just gone through quite a trauma. They had particular grievances with their old king (e.g., “Quartering large bodies of armed troops among us,” “He has plundered our seas, ravaged our Coasts, burnt our towns, and destroyed the lives of our people.”) and wanted to make sure that the new central government did not repeat those offenses. They listed the issues that were on their minds, because they had just gone through something. They did not wish those offenses repeated.

And they knew there were others (see the Ninth Amendment), but they needed to make sure about that recent stuff not happening again.

******************************

There is a great quote of ambiguous meaning, “Well-behaved women seldom make history.” Many people take this to mean that women should not be concerned with being well behaved, as behaving well would prevent them from great accomplishments. That is a wonderful interpretation, and historian Laurel Thatcher Ulrich applauds the sentiment.

However, when Dr. Ulrich first wrote those words, she was lamenting the difficulty of finding historical records of the lives of women who were not remarked upon for their misbehavior — unlike, say, “witches.”

It simply is difficult to find the concerns of well behaved women in all those written documents. Historians know that. Women had less access to power, to education, to quill and paper. Their letters were far fewer and their direct participation in matters of state was virtually nil.

******************************

Taken together, this should not be hard to understand.

Our Constitution and laws were focused on the concerns of men and they were focused on the recent offenses of their former colonial overlords. They certainly did not see any need to protect — or even address — the concerns and habits of the women around them who were engaged in the normal lives of society’s women.

Should we take the absence of a right to reproductive freedom by free white women in our founding documents as a sign that it did not exist? As a sign that pre-quickening abortion was rare or socially unacceptable? Or should we take it merely as an indicator that the issue was not threatened or on the minds of the holders of political power?

That’s not actually a hard question at all.

Constitutional Reasoning

I try very hard to understand the views of those I disagree with. I really look to understand the values, reasoning and priorities of my opponents and rivals. I particularly try to think about whether what I might view as a compromise, they might view as offending them in exactly the same ways as the original.

Now, I admit that one big reason I do this is that I was taught from a very early age that the best way to win is to know the other side’s arguments better than they do; I was raised by an attorney. But nonetheless, I look for what they think, and I look for inconsistencies. I look for holes. I look for bullshit at key points.

Now, one would have to be a bit of a legal nerd to know that, no, the Bill of Rights originally only provided protection from the federal government. Individual states were free to violate all of those rights, unless they themselves offered similar protections. (Off the top of my head, I believe that the US Bill of Rights was based on Virginia’s Bill of Rights.) And one would have to be a bit of a legal nerd to know that it was the Civil War Amendments that incorporated the federal Bill of Rights, making it apply to each state, as well.

For this reason and others, the Civil War and Reconstruction are often referred to as our nation’s Second Founding.

It is therefore quite suspect when anyone claims that the laws and norms of the late 19th century are insufficient historical grounding for making sense of the meaning of the Bill of Rights. If the question is what states may or must do, to go back 50-100 years further, when the Bill of Rights did not apply to the states, is willful blindness. It is intellectually insensible (that’s a typo, but it works for me!). It is an exercise in motivated reasoning that is, to use a legal term, risible.

I’ve never understood the idea that some of the Bill of Rights could be incorporated, but not others. I can follow sensible reasoning that says that the Second Amendment provides protection against state governments, and that therefore state National Guards and the like cannot serve that “well regulated militia” function mentioned in the amendment. I can see a sensible line of reasoning that says the amendment protects an individual right to bear arms. But that individual right is not at all for individual protection. Rather, the amendment makes explicitly clear that this right is to bear arms for collective action and community protection. To suggest an individual right to individual protection is…entirely ungrounded in the text or tradition. It is, once again to use a legal term, risible.

That means laughable. It means so lacking in sense and reason as to just be laughable. It should be laughed out of the room, out of the courtroom.

New York State Rifle & Pistol Association Inc. v. Bruen is not the case that will bother me most from this Supreme Court term. I do not think that it is the case that will do the most damage to our society. It is not even the case that offends me the most, even though Kennedy and Dobbs have yet to be announced. But it might be the most intellectually dishonest case from this term. I do not simply mean misguided or confused. I mean flat out dishonest.

Is There Anything More Important Than Trust?

Trust has always been a huge part of my practice. As an educator, as a leader, as someone who thinks about learning and leadership development, trust is a mainline in my thinking.

I probably learned this from my doctoral advisor, Prof. Ellie Drago-Severson. Her genuine trust and presence in the room (as a teacher and as a staff developer) creates trust like nothing I have ever seen. So much of her teaching depends on learners admitting vulnerability and mistakes, and this is only possible if there is trust in the group.

This is not to say that I did not think trust was important before I met Ellie. Rather, my many years of work with her and under her direction raised the importance of trust, in my thinking.

Like Ellie, the technical skills that I lay out and teach really just serve as examples or exemplars of deep values and ideas in practice. These are techniques that show how frameworks of thinking can be used and lay out what that would look like. This means that I am really trying to influence the thinking of those I am working with. I am hoping to plant ideas deeply and nurture them into influence on how they work, on how they think about their work.

Now, some people are more open to this kind of deep learning and some are more resistant. Ellie’s great gift was her ability to move the resistant towards being more open. I know that I originally held a lot of her ideas at arm’s length, but between their brilliance and her own brilliant ability to build trust, I came to appreciate them deeply.

I’ve been thinking particularly about trust and trust-building this week. I am always concerned about these things in my teaching and coaching. This week, though, I am thinking about how trust is built by leaders. Managers and direct supervisors can (and should) work on building trust through their direct relationships with their team members. However, many leaders are not primarily direct supervisors. On larger teams, and in whole organizations, the leaders have rather little direct contact with most of the people they lead. However they might have built trust with those they worked with more closely in the past, they need to find new ways to build trust from a larger group who will never have that kind of sustained direct contact with them.

So, that is what I am thinking about today: How does a leader build trust with people without depending on the direct interpersonal relationship?

 

So Many Cognitive Paths…

I was just looking at a simple math problem with a group of people and we came up with four different ways to solve this multiple choice item. This is not the actual item, but it was very similar to this:

Jeremy goes to the store to buy a jar of peanut butter and a jar of jelly. If he starts with $15.85, the peanut butter costs $2.95 and the jelly costs $3.65, how much money does he have when he leaves the store with his purchases?

This really was a very simple item. Sure, the prices (and the starting amount) were expressed in both dollars and cents. But they were all multiples of five cents.

Pretty simple, right? But when we each went through the item ourselves, we naturally and authentically did it in four different ways.

  • One of us was lazy, and just rounded all the numbers and did the math in their head. They added the two purchases (~$7) and subtracted that from the original ~$16, leaving around $9. Only one answer option was around $9, so they picked that.

  • Another one of us did the straightforward two-step math problem on a piece of paper. $15.85 - $2.95 = $12.90, and then $12.90 - $3.65 = $9.25.

  • A third person tried to do the two step math problem, but in their head. They doubted their mental math skills, so they pulled out a piece of paper to double check — and it was good that they did. They ended up with $9.25.

  • The last person stuck with addition. They added the two items to get $6.60, and added that to each answer option until they found an answer option that yielded a total of $15.85.
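
As a rough sketch, those four paths can be written out in a few lines of Python. The answer options below are made up for illustration, since the options on the actual item are not shown here, and the function names are just hypothetical labels for each approach. The second and third paths are the same arithmetic (one on paper, one in the head), so they share a function.

```python
# Amounts are in cents so the arithmetic is exact (no floating-point drift).
START, PEANUT_BUTTER, JELLY = 1585, 295, 365

# Hypothetical answer options: the options on the real item are not given.
OPTIONS = [725, 925, 1220, 2245]


def to_dollars(cents):
    """Round an amount in cents to the nearest whole dollar."""
    return round(cents / 100)


def estimate_path():
    """Path 1: round everything to whole dollars, then pick the closest option."""
    rough = (to_dollars(START) - (to_dollars(PEANUT_BUTTER) + to_dollars(JELLY))) * 100
    return min(OPTIONS, key=lambda option: abs(option - rough))


def subtract_path():
    """Paths 2 and 3: two subtractions, one purchase at a time."""
    after_peanut_butter = START - PEANUT_BUTTER
    return after_peanut_butter - JELLY


def add_back_path():
    """Path 4: total the purchases, then add that to each option until one hits the start."""
    spent = PEANUT_BUTTER + JELLY
    return next(option for option in OPTIONS if option + spent == START)


for path in (estimate_path, subtract_path, add_back_path):
    print(f"{path.__name__}: ${path() / 100:.2f}")
```

With these assumed options, all three functions land on $9.25, even though the intermediate work (and the places where a test taker could slip) is different in each.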

Four very capable and very educated professionals addressed a single simple math problem, but had four different strategies. And that was for a VERY simple problem.

Imagine how many different cognitive paths more complex problems might prompt. Imagine how many different ways including more context for a problem could lead test takers to different obstacles, distractions and even mistakes. To help with that, we have put together a set of a few dozen personas to try to think about as hypothetical test takers who might approach an item differently than you.

The Importance of Humility in the Work

Way way back in the day, when my collaborator and I started our Rigorous Test Development project, we came up with six core principles that we thought were essential to the work. Though we might phrase things differently today (e.g., three of those core principles can be found in our mantra, valid items elicit evidence of the targeted cognition for the range of typical test takers), we really cannot argue with our final core principle at all: Test development requires approaching the work with humility.

We have said “approaching the work with humility” and “engaging with humility” and all kinds of minor differences in phrasing, but this idea of humility remains central to the RTD approach. It’s the starting point for so much: for collaboration, for individual learning, for organizational learning. It’s one of our original six principles, born of her observations about the major stumbling blocks to creating higher quality assessments.

Humility in the work begins with recognizing the limits of one’s expertise. That includes:

  • Recognizing that your own expertise DOES have limits.

  • Recognizing the expertise of others.

  • Recognizing the areas of your own expertise (perhaps even fairly fine grained), both relative to each other and relative to the level of expertise of others.

  • Speaking from your expertise and listening when beyond your areas of expertise.

  • Recognizing the difference between interest and expertise.

Now, if one wishes to expand one’s areas of expertise, that’s awesome. But it does not happen simply because one wants it to.

  • Listen when those with expertise you desire speak and think hard about what is behind what they said (values, goals, priorities, knowledge, principles, etc.).

  • Ask questions of those with expertise, both to get them to be more explicit and to signal that you want to better understand what they are saying (e.g., “Does that mean…”).

  • When one ventures to speak beyond one’s present expertise, own it explicitly (e.g., “I’m not sure about this…” or “Maybe…”).

Now, there is CLEARLY a gender component here. Men often feel more free to speak outside of their expertise. Men often disregard women’s actual expertise (i.e., "mansplaining" is actually a very meaningful term). And too often, women have learned to downplay their own expertise or even to be unaware of it. I try hard to combat all three of these, both in myself and in the women I work with. But we are all products of our American culture, so that gender crap is going to be present too often, including in me. This is part of why I encourage people I work with (including women) to interrupt me. I know that some people really have trouble with outspoken women, but I really hope that I am not one of them. I don’t think I am. But I am aware that I am a product of American culture, so there’s always that possibility.

I hope that I model questioning. I hope that I model encouraging others to speak. I hope I model elevating the voices and views of people I am working with. I know that I learn a ton by working with others that have expertise that I lack. For example, I have been working with the Next Generation Science Standards quite a bit this year, and my knowledge of NGSS has deepened (and I am seeing the more subtle issues within NGSS more clearly) by working with people who have far greater NGSS expertise than me. We are building stuff together, and I know that I am incredibly dependent upon them as partners to help me to become more expert here, dependent upon them in ways that they likely don’t appreciate. Yeah, I’m still me. I still talk too much. But I run my thinking by them, because I am looking for correction and redirection. I cannot be confident in what I am saying unless I expose it to them for criticism. I know that I’ve got expertise and skills that they lack, and I also know that they have expertise, skills and experience that **I** lack. Thus, we are learning from each other and approach our work together with that stance/goal in mind.

So, yes, there is a place for arrogance in the work. (If there weren’t, how could I do it?) But even that can be done with humility.

So, one aspect of approaching the work with humility is maintaining an open-minded learning stance, even if one simultaneously holds a critical stance. This can be reconciled by leading with questions and actually listening to other people’s answers. Of course, one must be sure to apply that critical stance to one’s own ideas and contributions.

When people do not approach this work (or, likely, any work) with humility, the loudest and most confident voices dominate, rather than the most knowledgeable. Issues or objections get prioritized over each other, rather than reconciled in some best (or least bad) compromise. Often, the whole remains far less than the sum of its parts.

I cannot say that humility ever came naturally to me. My own entry point into this principle, and my most reliable reminder of this stance, is my interest in learning from others. I am always looking for what I can learn from others and try to invite opportunities to do so. No, that is not the same thing as approaching the work with humility, but it is the easiest facet for me. The rest, I more consciously work on.

Terrorizing School Boards

There has been some hullabaloo around the National School Board Association’s (NSBA) recent letter to the Biden administration about protecting school boards and school board members from threats of violence. I have found the backlash against this to be so unreasonable as to be clearly bad faith.

The question at and is whether NSBA’s use of the term “domestic terrorism” was over the line, was inappropriate and/or perhaps entirely misleading.

First, let us just agree that terrorism is not just about Muslim extremists. It is not just about events that happen in faraway lands. Terrorism is the use of violence (and even credible threats of violence) to achieve political ends. The idea that violence can be a form of addressing political concerns is not a new one. Henry Kissinger spoke of “war [as] a continuation of political activity by other means,” an idea rightly credited to Carl von Clausewitz. Today, when it is asymmetric warfare, we generally call it terrorism.

While there have been many heated (though still non-violent) debates at school board meetings this year, there have also been many threats of violence. The fact that most disagreements, and even yelling, shouting and interrupting meetings, do not constitute threats of (or actual) violence does not at all undermine the fact that such things have happened.

Some people have made those who agree with them look bad, as most do not resort to violence or threats of violence. But blaming those who cite reality for being inflammatory or dishonest, simply for citing reality, cannot be taken as a good faith objection.

I don’t need to address the question of what to call intentional efforts to disrupt school board meetings to recognize that some are resorting to violence (and threats of violence) to achieve political ends that they have not been able to further at the ballot box.

That is domestic terrorism. And there is nothing wrong with calling it out as such.

The Most Common Mistaken Approach to Item Alignment

The biggest mistake that people make when thinking about item alignment is focusing on the charged task, rather than thinking about the clarity of the observable evidence that an item generates or about the cognitive path that test takers might follow. 

True item alignment is about the quality of the evidence that the item generates. Does the item present strong evidence that the successful test taker does indeed have proficiency with the Targeted Cognition (i.e., the KSAs revealed by a close reading of the standard)? Does the item present strong evidence that the unsuccessful test taker lacks proficiency with the Targeted Cognition?

Mistaken thinking about item alignment tends to reduce those questions to mere questions of relevance – rather than focusing on the clarity of the evidence. That is, some look at what test takers are asked to do or think about (i.e., the task), and consider whether it is relevant to the standard, or how relevant. From this perspective, items that suggest tasks which depend on cognition that is closer to what is described in the standard and/or that are more dependent on that cognition are deemed to be more strongly aligned. However, items that make use of cognition that is merely related to what is described in the standard cannot provide clear affirmative and/or negative evidence of test takers’ proficiency with the Targeted Cognition. Simply put, relevance is not enough.

This mistaken approach often accepts items that test takers get wrong for reasons other than the Targeted Cognition as aligned. It also often accepts items to which test takers can respond correctly without the Targeted Cognition as aligned. It allows for the mere possibility that the test taker used the Targeted Cognition or misapplied the Targeted Cognition to count as alignment, in spite of the ambiguity of such evidence.

(With multiple choice items, quite often, this is exacerbated by focusing only on whether the Key is accurate, whether the distractors are incorrect and/or – at best – whether they capture the kinds of mistakes that test takers might make. However, these mistakes do not have to be mistakes in or misunderstandings of the cognition described in the standard itself. Thus, these items frequently max out at level 3 (i.e., Task Alignment).)

The RTD Alignment Scale addresses this mistake head on.