LLMs Need Better Executive Function
- Rob Dearborn
- December 1, 2025
- 11:28 pm
In the past several weeks we’ve gotten GPT-5.1, Gemini 3, and Opus 4.5. They’re incredible machines. Their benchmarks are superhuman and climbing. They can whip up RNA explainer simulations faster than you can consume them. They have EQ and souls! They’re smarter, speedier, cheaper.
…and yet:
- I don’t feel closer to having an AI I can trust with real administrivia from the past week: paying out weekly high-scorer prizes in my fantasy football league, booking a good spot for my son’s birthday party, or safely freeing up disk space on my wife’s laptop.
- Doom looping remains a daily experience when pairing with AI in production codebases.
- Models continue to tell us what they think we want to hear.
- Anthropic is still putting out (quite handy) stuff like this.
- Inference is sub-1% of world GDP.
Earlier this year a consensus formed that AI needs better continual learning to make the METR chart -> GDPval -> global prosperity go up. I’ve come to believe that something even more foundational is missing: better executive function.
Let’s define executive function as the ability to set goals, plan how to achieve them, then drive until completion without wandering off. It’s some combination of sustained focus, accurate self-assessment, high conscientiousness, and inhibition. Effective agents require it. Current models don’t have it. Attention is all you need, but the AIs have ADHD.
To start, the ability to stay focused on tasks until completion is literally what METR’s headline benchmark measures. By that measure, relative to humans, who can sometimes work for whole lifetimes on things, frontier models are not doing so well. The current best offers a 50% chance of completing a 2.75-hour task successfully. Want 80%? Slide down to 30 minutes. 99%? Not plotted. Sure, progress is on an exponential (or is that a sigmoid? 🙂), but month-plus reliability is trending toward next decade.
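To make “not plotted” concrete, here’s a back-of-envelope extrapolation. It assumes METR’s logistic-in-log-time functional form and takes the two numbers above at face value; the slope comes from those two points and nothing else.

```python
import math

# Fit a logistic in log2(task minutes) to the two points quoted above
# (50% at 2.75 h, 80% at 30 min), then extrapolate. The functional form
# follows METR's time-horizon methodology; the slope is an assumption
# derived solely from these two numbers.
h50 = 2.75 * 60   # minutes at 50% success
t80 = 30.0        # minutes at 80% success

def logit(p):
    return math.log(p / (1 - p))

# p(t) = sigmoid(-beta * (log2(t) - log2(h50)))
beta = -logit(0.80) / (math.log2(t80) - math.log2(h50))

def horizon(p):
    """Task length (minutes) at which success probability is p."""
    return h50 * 2 ** (-logit(p) / beta)

print(f"80% horizon: {horizon(0.80):.0f} min")     # ~30, sanity check
print(f"99% horizon: {horizon(0.99) * 60:.0f} s")  # well under a minute
```

Under those assumptions the 99% horizon lands around half a minute. Swap in your own fit and the number moves; the shape of the conclusion doesn’t.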
Beyond METR, most other interesting-and-unsaturated benchmarks reflect at some level an inability to stay on track. You can see it in ARC and Sudoku-Bench and when LLMs play Towers of Hanoi. You can see it in Terminal-Bench and computer use broadly. You can see it in anything requiring multiple hops over long context. You can see it in c-bench, which I built to gauge models’ capacities for bearing down and grinding through the kinds of tedium that we’d all love to automate.
Most of all, I spend my professional day-to-day playing LLM unreliability whack-a-mole. Tell a smart human “do x” and they will either do x or tell you why they can’t. Tell a model “do x” and you’ll build a Rube Goldberg machine of prompt engineering, context engineering, workflows, harnesses, retries, evals, and on and on. Then you’ll live in constant fear of your LLM saying work is done when it isn’t, or producing plausible nonsense, or producing obvious nonsense. Then your fears will be realized again and again. “AI engineering”, it turns out, is mostly just bandaging models’ intrinsic lack of executive function.
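If that sounds abstract, here’s the smallest bandage in the pile, sketched in a few lines. `call_model` and `looks_done` are hypothetical stand-ins for whatever client and independent verifier you’ve cobbled together, not a real API.

```python
import time

# Wrap the model call in verification and retries, because "done" from the
# model can't be trusted to mean done. Purely illustrative plumbing.
def call_with_retries(prompt, call_model, looks_done, max_attempts=3):
    last = None
    for attempt in range(max_attempts):
        last = call_model(prompt)
        if looks_done(last):         # independent check; never trust self-report
            return last
        time.sleep(2 ** attempt)     # back off before another pass
        prompt += "\n\nThe previous attempt was incomplete. Try again."
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last!r}")
```

Multiply by every task, tool, and failure mode, and you have modern AI engineering.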
Autoregressive attention is most obviously to blame. LLMs have ~instant recall over absurd Harry-Potter-series-length token sequences and zero recall beyond them. Once context falls out of mind, work falls off target. There’s a subtler problem as well: everything — context, plan, reasoning, execution — competes in a single channel. Serial attention means token dilution, aka context rot. As sequence length grows, the marginal influence of important tokens approaches zero. Errors compound. Progress separates agents from their motivations. Models are, depending on how you look at them, either stateless or juggling all states concurrently.
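You can see the dilution in a toy softmax, no transformer required: one “important” key holds a fixed score advantage over n distractor keys, and its attention weight still washes out as n grows. The numbers are illustrative, not measurements from any real model.

```python
import numpy as np

# One key with a fixed score advantage vs. n zero-scored distractors:
# how much softmax weight does the important key keep?
def weight_on_important(n_distractors, score_gap=2.0):
    scores = np.concatenate(([score_gap], np.zeros(n_distractors)))
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights[0]

for n in (10, 1_000, 100_000):
    print(f"{n:>7} distractors -> weight {weight_on_important(n):.5f}")
# ~0.42 at 10, ~0.007 at 1,000, ~0.00007 at 100,000
```

Real attention has many heads and learned scores, but the arithmetic pressure is the same.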
Also at fault is that LLMs are galaxy brains with superhuman factual recall. They’re too quick to recall outdated information (e.g., stale APIs) like a seasoned employee who wants to pattern match all situations to ones they’ve seen before. They’re too quick to recall superfluous information (e.g., in error handling) like a booksmart new hire overeager to show off how much they’ve retained from school. They go off on tangents, then trip over themselves and forget where they were going in the first place.
Meditating on LLMs as galaxy brains reveals more fundamental limitations of current training recipes too. Natural intelligence, the kind we’re trying to replicate, is deductive and extrapolative. Artificial intelligence produced by current paradigms is inductive and interpolative. This is a problem and we’re not on track to resolve it. To elaborate:
- I think of human learning as additive construction with a 3D printer. With harsh environments and scarce data, nature has evolved us to learn only as needed and as efficiently as possible. We started as automata and layered on generalizable executive functioning and meta-learning early and centrally. Book learning is a latecomer. Most of us don’t know most facts because we have no need to.
- Current AI, by contrast, is developed like a block of marble: we start with a giant mass, then subtract and polish. LLMs by definition are mostly evolved by their pretraining, i.e., trying to memorize the internet. They are fundamentally internet compressions. We take these shoggoth blobs and, with post-training, give them a usable veneer.
- It’s become gospel that RL (first RLHF, now RLVR) will take our pretrained galaxy brains and transform their infinite priors into coherent agents. We’ll create environments for as many tasks as we can. We’ll rubric-ify the squishy ones and make them verifiable. We’ll bolt on ad hoc harnesses and tools. RL will approach 100% of training flops, and eventually we’ll either have brute-forced human intelligence or be delighted when it falls out, as so much has before. I’m not so optimistic.
- RL upweights and connects applicable facts and functions that are already present, but with current techniques and relative scales it doesn’t much absorb new information or alter model structure. We are not on course to meaningfully change this: RL approaching 100% of training flops != RL approaching 100% of training bits (see the back-of-envelope sketch after this list). With current techniques, RL will remain mere post-training for the foreseeable future. We are not on a path to produce intelligence with learning and agency at its core.
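To put rough numbers on that flops-versus-bits asymmetry: everything below is an order-of-magnitude guess, not a figure for any particular model, but even with generous RL assumptions the supervision signal is minuscule next to pretraining.

```python
# Order-of-magnitude guesses only; none of these are real model stats.
pretrain_tokens  = 15e12   # ~15T tokens, frontier-scale corpus
bits_per_token   = 1.0     # rough information content per token
rl_episodes      = 10e6    # a generous number of RL rollouts
bits_per_episode = 10.0    # a scalar or rubric reward carries few bits

pretrain_bits = pretrain_tokens * bits_per_token
rl_bits       = rl_episodes * bits_per_episode

print(f"pretraining bits: {pretrain_bits:.1e}")
print(f"RL reward bits:   {rl_bits:.1e}")
print(f"ratio:            {pretrain_bits / rl_bits:.0f}x")  # ~150,000x
```

You can quibble with every constant and still end up orders of magnitude apart.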
Above all, I hope we can all recognize the problem here and prioritize solving it. If you’re at NeurIPS this week, I hope you’ll argue about this. If you’re launching gigawatt training runs or deploying trillions in capital, I hope you’ll push, in whatever way is most natural, to get your models’ executive function to improve. If you read this and go on to develop some unorthodox intelligence that’s more human-shaped and human-useful in ways I completely don’t see coming, I’ll be thrilled.
More prescriptively: I personally expect EF gains from training slimmed-down models, especially ones that retrieve from and write to scalable external memory natively (i.e., from pretraining onwards; disclosure: my hobby horse is modernizing RETRO, more on this coming soon). Let them avoid computing over all context all the time and instead flow in and out of attention frames. Let them hoover up new facts and functions and call on them when needed. Let them decouple attention, reasoning, and output cycles and scale each independently. I imagine a whirring continuous inner loop, augmented by cross attention and asynchronously emitting actions, as a good forge for emergent executive function.
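For the shape of what I mean, here’s a structural sketch. Every name in it is hypothetical and the bodies are elided; it’s the control flow I’m gesturing at, not a real architecture or API.

```python
# Hypothetical shape only: names are invented, bodies are elided.
class ExternalMemory:
    """Scalable store the model reads from and writes to natively."""
    def retrieve(self, query):    # nearest-neighbor chunks, RETRO-style
        ...
    def write(self, key, value):  # new facts and functions, absorbed on the fly
        ...

def inner_loop(model, memory, goal, emit_action):
    state = model.frame(goal)                        # small working set, not full history
    while not model.judges_done(state):
        frame = memory.retrieve(model.query(state))  # attention frame flows in
        state = model.reason(state, frame)           # reasoning cycle cross-attends to it
        memory.write(*model.distill(state))          # distilled notes flow back out
        if (action := model.maybe_act(state)) is not None:
            emit_action(action)                      # actions emitted asynchronously
```

Squint and you can see attention, reasoning, and output as separately scalable cycles rather than one serial stream.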
Finally, I’m hungry for out-there training regimes. We need RL to provide the majority of training bits. We further need executive function and meta-learning to be the gradient of least resistance. Toward these ends I’m encouraged by early work on simulated gym and game environments, and generally am bullish on returning to AIs learning from games (WoW-bench, anyone?). I’ve also changed my mind on the virtue of unconventional JEPA / program synthesis / etc. work, and I’m increasingly open to the possibility that we may need to go even crazier still. I’m not ready to throw out pretraining yet, but I saw this meme in passing and it’s stuck with me since.
Look, maybe I’m wrong. Maybe we really will componentize all economically valuable work. Maybe I’m underaccepting of alien intelligences, however dysmorphic, and fixated on birds while we’re building airplanes. Maybe I’m just impatient — o3 was 7.5 months ago! Maybe. But I don’t think so, and I don’t think I’m alone.
It feels like a shift is underway. I’m confident there’s no bubble that’s bursting, but there is healthy reconsideration and reorientation going down. Folks are increasingly willing to say out loud (mostly to Dwarkesh) that scaling autoregression may not be it. Ilya put it nicely last week:
This is one of the very confusing things about the models right now. How to reconcile the fact that they are doing so well on evals? You look at the evals and you go, “Those are pretty hard evals.” They are doing so well. But the economic impact seems to be dramatically behind. It’s very difficult to make sense of, how can the model, on the one hand, do these amazing things, and then on the other hand, repeat itself twice in some situation?
We don’t have all the answers yet, but I see good reasons for optimism. One can squint at Claude and Claude Code — subagent scaling, native compaction, craftful training — and catch glimmers of what I advocate above. Titans want cognitive cores and very tiny models. Investors, much to other investors’ unease, want to park money in weird labs.
We don’t have all the answers yet, but would it be any fun if we did?
Hit us, Ilya, one more time. “So it’s back to the age of research again, just with big computers.”