The Uneven AI Frontier: Why Capabilities Arrive Jagged
Capabilities often arrive in messy, frame-by-frame forms rather than polished breakthroughs, so valuable insights come from imperfect experiments that hint at real potential.
The frontier is uneven. It isn’t just that capabilities arrive in fits and starts—it’s that you don’t always recognize what you’re looking at when it’s in front of you.
Sometimes you notice a capability, but it’s not as clean as you’d want. It doesn’t come packaged in the “right” form. It feels a bit hacky, or too expensive, or not quite what people mean when they use the term. And because of that, it’s easy to overlook it, or to leave it unmentioned. Then later, you realize it was more significant than it seemed at the time.
While working on the GPT-4 release—specifically the vision side—I ran a bunch of experiments. I threw complex Rube Goldberg machine images at it to see if it could reason about what was happening. I tried physics-like scenarios, psychology-test-style prompts, and other cases where you’re asking a model to do more than just identify objects.
One experiment in particular stuck with me: I started feeding it sequences of images.
The setup was simple. I’d take a short video—say someone doing an exercise movement—extract the individual frames, then feed those frames to the model as a set and ask it what was going on. And GPT-4 was actually quite good at it. It could interpret the differences from frame to frame and infer the action. It wasn’t just describing one image at a time; it was tracking change.
Before GPT-4 launched, I even put together a very simple “video decoder” program using GPT-4 itself. It would take an MP4, extract frames, send them to the model, and get back a coherent description of what was happening in the clip.
It worked. The big catch was cost: sending frame after frame to the model was slow and expensive, and long clips took forever. But from a practical standpoint, it was still “processing video” in the way most people care about: you give it a clip and it tells you what’s happening.
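The pipeline was roughly: decode the clip into frames, sample a manageable subset, and hand the frames to the vision model alongside a question. Here’s a minimal sketch of that idea. The details are assumptions, not the original code: I’m assuming frames have already been decoded to JPEG bytes (e.g. by ffmpeg), picking an arbitrary frame budget, and mirroring the OpenAI multi-image chat message shape, which may differ from what was actually used at the time.

```python
import base64

def sample_frames(frames, max_frames=8):
    """Pick an evenly spaced subset so long clips stay affordable."""
    if len(frames) <= max_frames:
        return list(frames)
    stride = len(frames) / max_frames
    return [frames[int(i * stride)] for i in range(max_frames)]

def build_video_prompt(jpeg_frames, question="What is happening in this clip?"):
    """Pack sampled frames into a single multi-image chat message.

    The payload layout here follows the OpenAI vision message format;
    treat it as illustrative rather than the exact request that was sent.
    """
    content = [{"type": "text", "text": question}]
    for jpg in sample_frames(jpeg_frames):
        b64 = base64.b64encode(jpg).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return [{"role": "user", "content": content}]
```

The sampling step is the whole cost story: a 30-second clip at 30 fps is 900 frames, but a handful of evenly spaced frames is usually enough for the model to track what changes between them.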
I remember posting the idea in the research Slack. The reaction was basically: neat, amusing—but not “true” video understanding. Because technically, it wasn’t consuming a native video format or doing anything special with a temporal video representation. It was “just” an image model looking at a series of frames.
At the time, I suggested that maybe we should mention it as a capability at launch. The decision was no. Part of the reason was that there were rumors Google was working on a true video model, and the thinking was: if someone is about to announce real video understanding, we don’t want to stake a claim on something that looks like a workaround.
Then later, Google launched and announced video understanding.
And it turned out they were doing exactly the same thing I had done: extract frames from the video, send them to the image model, and have it interpret what’s going on between them.
I’ll admit that was disappointing in two different ways.
First, because we’d been able to do it earlier with GPT-4, and we were in a position where we could have announced it as a real capability. In hindsight it felt like a missed opportunity—especially watching the attention their announcement got.
Second, I was disappointed they didn’t have something better. I’d hoped that if they were going to make a big “video understanding” splash, it would involve a more sophisticated method than “sample frames and caption them intelligently.” Seeing the media run with it as if it were a fundamentally new breakthrough gave me a bit of FOMO, because I knew how unglamorous the underlying approach actually was—and I knew we’d already done it.
To be fair, I wasn’t completely quiet about it. When GPT-4 launched and we talked about vision understanding, I tweeted about it being able to break down video clips by analyzing frames. So when people started paying attention to Google’s model, I could at least point back and say: look, GPT-4 could do this a long time ago. That made me feel a little better.
But the bigger lesson for me isn’t “we should have gotten more credit.”
It’s that this is how the frontier often feels from the inside: important things show up in messy forms first. They look like prototypes or hacks or “not the real version.” They’re expensive, slow, hard to demo, hard to productize. And because they don’t fit the narrative of how a capability is “supposed” to look, people discount them—including the people closest to them.
And video wasn’t the only example.
There were other areas I think about in retrospect—especially some of my early experiments around reasoning prompts: feeding prompts in a way that encouraged the model to break problems down, having it “talk to itself,” structuring intermediate steps. I sometimes wish I’d explored that more deeply. Not necessarily because it would have been one more thing on the scoreboard, but because it felt like there was something there.
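For flavor, the kind of prompt structuring I mean looked something like the sketch below. The wording is hypothetical (the original experiments’ prompts aren’t recorded here); the shape is the now-familiar one of asking for numbered intermediate steps before a final answer.

```python
def reasoning_prompt(question):
    """Wrap a question in a template that nudges the model to show its work.

    Purely illustrative -- the phrasing is mine, not the prompts actually
    used in those early experiments.
    """
    return (
        "Solve the problem below. First reason it out, numbering each "
        "intermediate step, then give the final answer on its own line "
        "prefixed with 'Answer:'.\n\n"
        f"Problem: {question}"
    )
```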
The constraint was just attention. There were so many other areas to look into, so many experiments to run, and only so much time. That’s part of the job: deciding what to chase, what to set aside, and what to ignore because it doesn’t look “real” enough yet.
And sometimes the thing you ignore turns out to be the thing everyone talks about later.