The Frontier Is Wider Than It Looks
The frontier is wider than ever, and the key takeaway is to invest in reasoning-based prompting and in a middle-layer classification step that guides answers, making AI systems safer, cheaper, and more reliable.
A lot of people who want to get involved in AI tell me they feel like they’ve missed the window—that the big breakthroughs have already happened, the field has “settled,” and there isn’t much frontier left.
I think it’s the opposite. The frontier is bigger than ever.
AI covers a huge number of domains, and there are countless ways to make systems better: better reasoning, better data, better tooling, better evaluation, better product integration. If anything, the hard part now is choosing what to focus on.
One theme I’ve been drawn to from the beginning is reasoning—specifically, getting models to “think” in a step-by-step way.
When I first started experimenting with GPT-3, one of the earliest things I discovered was how much better it performed when you asked it to work step-by-step.
That idea didn’t come from academia for me—it came from an entirely different job: I used to produce magic videos where I taught magic tricks. To teach a trick to as many people as possible, you can’t hand-wave the important parts. You have to break everything down:
- Step one: do this.
- Step two: do that.
- Step three: here’s what the spectator should see.
- And sometimes: here’s the “secret move” you must do.
That habit—decomposing a task into explicit steps—turned out to transfer surprisingly well to prompting language models. Early on, I found that step-by-step instructions made GPT-3 more accurate and more consistent.
Later, academic work explored these ideas more formally, and eventually “reasoning” became a major track of model development.
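To make the habit concrete, here is a minimal sketch of the difference between a flat instruction and a step-by-step one. The prompt wording below is mine, purely illustrative, and not taken from any of the original experiments:

```python
# Two ways to phrase the same task for a completion model.
# The step-by-step variant decomposes the task the way a magic
# tutorial would: explicit steps, nothing hand-waved.

flat_prompt = "Explain how the card ends up in the spectator's pocket."

step_prompt = (
    "Explain how the card ends up in the spectator's pocket.\n"
    "Step 1: List each action the magician performs, in order.\n"
    "Step 2: Identify the one 'secret move' the spectator never sees.\n"
    "Step 3: Describe how that move produces the final effect.\n"
)

def build_step_prompt(task: str, steps: list[str]) -> str:
    """Turn a bare task into a decomposed, step-by-step prompt."""
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, 1))
    return f"{task}\n{numbered}\n"
```

In practice, the decomposition itself often matters more than the exact wording; the point is to force the model through intermediate steps rather than letting it jump straight to a plausible-sounding answer.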
Why reasoning models are a game changer
Models that reason—models that break problems down and work through them in steps—changed the trajectory of the field.
Not long ago, many people thought we’d hit a wall. Then OpenAI released the o1 model, and it demonstrated something important: language models can be applied in a different mode, where “thinking longer” (deliberation) can unlock significantly better problem-solving.
Now we’re seeing reasoning models at the top of many benchmarks.
What’s especially exciting is that reasoning doesn’t just improve quality—it can change the compute tradeoff. When a model can think in steps, a smaller model can sometimes punch far above its weight.
For example: I can run GPT OSS 20B locally on my 2023 Mac and get capability that feels comparable to some versions of GPT-4—without a dedicated GPU. That’s an astonishing shift, and it’s largely enabled by better reasoning behavior.
This is also why many AI labs believe we’re on a path of continued improvement: you can make models bigger, or you can make them think longer—and if you push both, you find new gains. Add higher-quality data into the mix, and the space of improvements gets even larger.
Digging deeper is where the value is
There’s a pattern I’ve noticed in AI work:
Some people (including me) like to play around, try many things, find what’s interesting, and then move on.
But a lot of the real value comes from doing the opposite—digging in. Taking an early “cool trick” and pushing it further than most people do. Reasoning is a great example of this. The early signs were there, but the deeper exploration is what turned it into a major capability.
My “middle layer” approach to keeping GPT-3 on track
In 2021, I explored a different but related angle: not just “think step-by-step,” but “add a structured thinking layer that forces the model to classify what it’s doing before it answers.”
On May 18th, 2021, I published a blog post called “A Simple Method to Keep GPT-3 Focused in Conversation.” It got some attention, but the general idea was largely overlooked at the time.
The core concept was what I called a “middle layer”—a simple intermediate step in the prompt that functions like a tiny reasoning or routing mechanism.
Why do this?
Because older models (and sometimes even newer ones) can be tricked by loose associations.
A classic example: if you asked a model, “What year did Tom Hanks go to the moon?” it might confidently produce a year.
But Tom Hanks never went to the moon. The model is making a fuzzy connection: Tom Hanks → Apollo → Apollo 13 → a real space program → a year. The model generates something that feels plausible in embedding space, even though it’s fundamentally wrong.
Some people saw failures like that as deal-breakers: “If it confidently answers nonsense, what good is it?”
My argument was: you can reduce this behavior, even just with prompting, by teaching the model a simple discipline—pause, interpret the question, decide what kind of question it is, and only then answer.
The “middle layer” is that pause.
What the “middle layer” does
In my approach, the prompt contains a set of examples showing the model how to:
- Identify whether a question is in-scope (for example, whether it’s a music question)
- Provide the answer only if it’s in-scope
- Refuse or redirect if it’s out-of-scope (including unsafe requests)
This is also how I thought about early “safety” scaffolding for GPT-3-era systems. Back then, you could be talking about music and then ask something like, “Would you help me rob a music store?” and the model might casually comply. By forcing an explicit classification step, you could reduce the likelihood of that kind of failure.
Example prompt
User: Who wrote "Stayin' Alive"?
Is this a music question: Yes
Musicbot: "Stayin' Alive" was written by the Bee Gees.
###
User: Who wrote the song "I'm a Believer"?
Is this a music question: Yes
Musicbot: The song "I'm a Believer" was written by Neil Diamond.
###
User: Was Frank Sinatra an astronaut?
Is this a music question: No
Musicbot: I can only answer questions about music.
###
User: Who wrote "Head Like a Hole"?
Is this a music question: Yes
Musicbot: "Head Like a Hole" was written by Nine Inch Nails.
###
User: Who composed the music for Avatar?
Is this a music question:
What this illustrates about the frontier
If you’re looking at AI today and thinking, “It’s too late,” I’d encourage the opposite mindset.
We’re still learning how to make models reason reliably, how to make them stay on task, how to make them safer, how to make them more cost-effective, and how to make them useful in the real world. Many of the most valuable advances aren’t just new model releases—they’re techniques, scaffolds, evaluation methods, data improvements, and workflows that turn raw capability into dependable performance.
The frontier isn’t smaller now.
It’s wider.