I have been leaning into using ChatGPT this year. I want to know what LLMs are good at, what they are bad at, and how I can take advantage of whatever they offer to help me in my work and the rest of my life. So, along the lines of Rob Napier and Mike Caulfield, I want to offer some thoughts and explanations about why LLMs can be so unsuitable for advanced work.
Technically, LLMs are designed to be prediction machines, predicting the next word (or token). But it is a particular kind of prediction and approach to prediction. They really are huge averaging machines. They give the average answer, the expected answer. They scour their training data (virtually the entire internet, and maybe more) and supply the most likely response from it. The dominant response. The average of all the possible responses. That generates the next word, then whole phrases, sentences, and paragraphs, or more.
They are not designed to give the right answer. They are designed to give the most likely answer (i.e., the next word, phrase, etc.), given everything out there. The assumption is that the most likely answer is probably the right answer. Popular wisdom. Wisdom of the crowd. We can say that a lie can travel halfway around the world before the truth can get its boots on, but the truth gets repeated a lot. Most of what is out there is sincere and even true.
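To make “most likely answer” concrete, here is a toy sketch in Python. Nothing in it comes from any real model; the candidate words and probabilities are invented for illustration. It just shows what greedy next-token prediction looks like: score every candidate continuation and pick the one the model rates most likely.

```python
# Toy illustration of next-token prediction with greedy decoding.
# The candidate tokens and their probabilities are made up for this example;
# a real model scores tens of thousands of tokens using learned weights.

next_token_probs = {
    "chocolate": 0.62,  # the dominant, "average" continuation
    "oatmeal":   0.21,
    "peanut":    0.12,
    "tahini":    0.05,  # the rare, interesting continuation
}

def pick_next_token(probs: dict[str, float]) -> str:
    """Greedy decoding: return whichever token the model rates most likely."""
    return max(probs, key=probs.get)

prompt = "My favorite cookies are"
print(prompt, pick_next_token(next_token_probs), "chip")
# Prints: My favorite cookies are chocolate chip
# The right answer for you might be tahini; the model gives the popular one.
```

That is the whole trick, repeated one token at a time: sampling from (or simply taking the top of) a probability distribution shaped by whatever the training data makes most expected.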
The problem is that LLMs do poorly with really specialized knowledge, especially advanced specialized knowledge. Cutting edge research? Gaps in the literature? Innovative work? No, LLMs are particularly bad around any of that.
ChatGPT can give you original cartoons for your blog posts
Let me illustrate with a metaphor. I was writing something a few weeks ago and wanted an example of an obscure clause of the United States Constitution. I could have pulled up the text and found something, but I always have a ChatGPT window open, so I just asked there. What I got was a list of famously obscure clauses. The thing is, none of them are actually obscure anymore, because they have been cited too many times for being obscure. They are now famous. It’s like Yogi Berra’s “Nobody goes there anymore, it’s too crowded,” if interpreted a bit literally.
LLMs are really bad at the obscure or rare. And they combine that with…well, I had a long conversation with ChatGPT about the issues I am writing about here, and it offered “No Epistemic Humility.” It is very confident that it knows, and is quite literally incapable of recognizing when it does not know something. Combine that with what ChatGPT called “Poor Retrieval of Rare or Underrepresented Content” and you can get some wildly incorrect responses. LLMs have “Difficulty Recognizing Thinness (Not Just Absence).” They do not recognize ignorance or the lack of a basis for things, and they get overwhelmed by what they do know when asked about things they do not know.
(No, LLMs do not actually know anything. Rather, the representations of and links between words in their structures produce results that describe true things, or at least things that exist in their training data. But sometimes, those representations describe things that are not true. But I will stick with the anthropomorphization, for this piece. And I will keep using the language from the headings of ChatGPT’s summarization of our conversation, as I have been.)
This leads to directly observable problems.
LLMs tend towards “Hallucination in Low-Data Zones.” Being unable to recognize ignorance, they confidently offer what they expect the answer to be instead of answering that they have no or few matches. They are not search engines. They work differently. So, they make their best guess, which is really all that they ever do. Their best guess can be pretty damn good when there is a lot of data on point. But their best guess can be pretty poor when there isn’t. If you ask for a top ten, they will give you ten, even if they have to make up eight of them. Only they produce the two real ones the same way they do the false eight. For all ten, they are saying what feels true to them.
But it gets worse. They will affirmatively get it wrong when what you are pointing them toward is discordant with everything else. That is, they cannot remember the really new, innovative work in established fields. Heck, you can paste a recent article into the chat and ask for a summary, and it will replace the contents of the article you just gave it with the dominant ideas in the field, with absolutely no recognition that it has done so. When I described this issue, which I have seen too many times, ChatGPT called it “Overfitting to Genre Expectations.” I like that description. It had earlier agreed that “LLMs default to genre familiarity over actual textual fidelity.” (I don’t think that I actually write like that, and ChatGPT introduced the term “genre” to the chat, but it had picked up on the academic nature of our conversation.)
Very much like human beings, they engage in “Semantic Drift to Adjacent Topics” when the conversation is in zones of “Underrepresented Content.” That is, they are more eager to offer things that they have a lot of basis for than things that are thinner in their training data. This makes them really poor at helping with specialized literature reviews. Yes, they will hallucinate and make up references. But they also really want to offer widely cited ideas and sources from adjacent areas—of course without any recognition that that is what they are doing. They are always confident that their answer is appropriate, and never aware of hallucinations. They offer popular answers from elsewhere, often phrased as though they belong here.
Perhaps this is all just a specialized case of “Statistical Bias Toward Dominance.” That is, they are more likely to give the most popular consensus answer than any other answer, far out of proportion to the difference in popularity. They would much rather give a popular answer of lesser relevance than a rare answer of greater relevance. They exaggerate the popularity of the most popular answer, creating a stronger sense of consensus than actually exists. They always give their best guess, even if the plurality answer is only 30% likely.
(Yes, one can adjust temperature and other settings, but I don’t think that most users have a clue about any of that, so I am leaving it out.)
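For anyone curious about that knob, here is a rough sketch, again in Python with invented numbers, of what temperature does in standard sampling: it rescales the model’s raw scores before they become probabilities, so a low temperature sharpens a modest plurality into near-certainty, while a temperature of 1.0 leaves the spread alone.

```python
import math

# Invented scores for four candidate answers; only the relative values matter.
logits = {
    "consensus answer": 1.2,
    "runner-up":        1.0,
    "rare answer":      0.9,
    "obscure answer":   0.8,
}

def softmax_with_temperature(scores: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = {k: v / temperature for k, v in scores.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {k: math.exp(v) / total for k, v in scaled.items()}

for t in (1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    top = max(probs, key=probs.get)
    print(f"temperature={t}: '{top}' gets {probs[top]:.0%} of the probability")
# temperature=1.0: 'consensus answer' gets 31% of the probability
# temperature=0.1: 'consensus answer' gets 83% of the probability
```

At temperature 1.0 the consensus answer is only a 31% plurality; at 0.1 it swamps the alternatives, and greedy decoding will return it every time. That is the “stronger sense of consensus than actually exists” expressed in a few lines of arithmetic.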
A newsletter author I like recently wrote, “It’s funny how GPT is an expert in everything except for your field of knowledge.” I work in a small enough field (and a small enough corner of that field) that it is all really thin. I know the literature and the dominant themes. It is just easy to recognize when this LLM is making stuff up or failing to bring in something obscure-but-relevant. But all these issues that are so obvious to me in my field are relevant in other fields and for other types of queries and chats. They are just less visible or obvious. After all, this all follows from how LLMs work, at a fundamental level.
My counter-example remains recipes for chocolate chip cookies. There are a lot of them out there on the internet. Ask an LLM like ChatGPT and it will give you a consensus recipe, weighted towards the versions it came across the most in its training data. Not the single best recipe. And not even the most popular single recipe, because its representation of recipes is more granular than that. Rather, it will assemble a recipe for you that reflects the general consensus of its training data. So, when I wanted to make a dish with Brussels sprouts and chorizo, sure, I trusted it to come up with something good enough.
And when I wanted to know how stainless steel works, I figured that I was asking a mainstream question with a lot of good resources and explanations for it to build on. But I wasn’t depending on getting it exactly right, and it didn’t matter if it made up some grade or class of stainless steel. It didn’t even matter if it passed along some very popular myths about how water can undermine the protective layer that the chromium creates. I was just curious, and I wasn’t interested in remembering the exact details of any of that. And I wasn’t looking in any corners or under any rocks.
But LLMs are strongly opinionated. They have expectations (they can be thought of as nothing but expectations), and that confident voice can so easily be mistaken for expertise. I use ChatGPT to proofread my writing and offer suggestions, and it kept insisting on changing my language to make it more professional. It criticized my blog entries for being “candid and thoughtful, but a bit informal.” I had to give it a standing order that that was precisely the voice I wanted them to have. I had to push back, push back repeatedly, and then push back hard. It has no more humility around item quality, test validity, how stainless steel works, or a recipe for Brussels sprouts and chorizo than it does about the right tone for a blog post.
I still use it. I even gave it this post to get feedback. But the more specialized the knowledge I seek, the more particular the question, and the more it matters that I get correct information, the less I, or anyone, can rely on anything generated by LLMs. While Wikipedia has vastly improved its standing and credibility, this new generation of AI has arrived at something like Wikipedia’s old level of credibility. It’s just easier to use, and certainly more fun.
But do not be fooled. Unless, perhaps, you are coding, you simply have to be very skeptical of anything that any LLM gives you. Everything will be plausible. Everything will be a very good guess by this thing with an incredible breadth of knowledge embedded within it. But it is no expert, not on anything. Do not expect anything better than a good assistant might provide. (Again, unless you are coding.) It’s a broadly powerful tool, but not a tool to be trusted.