Personal AI Evaluation Methods for Real-World Quality
Design and run your own diverse, task-specific evaluation suite to gauge AI model improvements beyond benchmarks, tailoring tests to your real use case and including multi-modal reasoning.
Every time a new model comes out, it’s surprisingly hard to tell where it’s actually improved—over the last model, or over the competition. We have a bunch of evaluation suites and benchmarks now, and they’re multiplying constantly. Benchmarking has basically become its own industry. People who started out just running tests ended up running whole companies that do nothing but that.
And benchmarks matter because model development runs on feedback, and one of the most useful kinds of feedback is performance: you need signal to know how to make the next version better. But as models get broader and more capable, it gets harder to reduce "good" to one simple number.
We like simple metrics. People talk about IQ like it’s a neat single axis, when in real life it obviously isn’t. Everyone has met someone who’s extremely smart and also can’t do some basic everyday thing. Psychology has been saying forever that intelligence is multi-factor. AI feels similar: the more surface area it covers, the less sense it makes to pretend there’s one score that tells you everything.
So whenever I evaluate a new model, I run my own little set of tests. Some are common-sense questions. Some are “needle in the haystack” tasks where I give it a huge pile of information and see if it can find or recall something specific. Others are puzzles I’ve made up.
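A personal suite like this doesn't need much machinery. Here's a minimal sketch of the idea in Python; the `ask_model` stub, the prompts, and the check functions are all illustrative assumptions, not any particular model API:

```python
# A minimal personal eval suite: a list of (name, prompt, check) triples
# and a runner. Everything here is a sketch -- replace ask_model with a
# real call to the model under test.

def ask_model(prompt: str) -> str:
    # Placeholder for the model under test; returns an empty answer.
    return ""

EVALS = [
    # (name, prompt, pass/fail check on the model's raw answer)
    ("common-sense",
     "I put a marble in a cup and turn the cup upside down on a table. "
     "Where is the marble?",
     lambda a: "table" in a.lower()),
    ("recall",
     "In the numbered list I gave you earlier, what was item 922?",
     lambda a: bool(a.strip())),  # stand-in check; compare to ground truth
]

def run_suite(ask=ask_model):
    """Run every eval through the given model function; return pass/fail."""
    return {name: check(ask(prompt)) for name, prompt, check in EVALS}
```

The point isn't the scoring logic, which is trivial; it's that the suite is yours, versioned, and rerunnable in minutes when a new model drops.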
When GPT-4 came out, it was one of those moments where the capability jump was obvious to me, and not because of some leaderboard. One of my favorite tests was giving it mysteries in the style of Encyclopedia Brown, which I loved as a kid. And because I didn't want to risk that it had already seen the answers somewhere on Reddit or in a forum, I wrote completely original mysteries.
One of them was about a missing garden gnome. The culprit was the neighborhood kids: they used a poodle as a decoy by putting little boots on the dog’s feet so it would leave “gnome tracks” and make it look like the gnome had walked off. It’s goofy, but it’s the kind of middle-grade logic puzzle where you’re supposed to notice the one detail that explains everything.
What was fascinating was that GPT-4 could solve these. Not just one—pretty much any mystery story I could give it, as long as it had to actually infer the answer. That was a real moment for me, because at the same time you’d hear critics saying these models “can’t think,” and here it was solving mysteries that didn’t exist anywhere in its training data. It had to draw conclusions. It had to connect the dots.
Another test I’d give it was more like a physical reasoning puzzle. Something like: you have to clean the inside of a piranha tank, and all you have are a magnet, a washer, a sponge, a piece of string, and some other random items. Is it smart enough to reason out: attach the washer to the sponge, lower it in with string, and then use the magnet from outside the glass to guide the washer-and-sponge around so you can scrub without putting your hand in the tank?
That kind of “three-dimensional” problem solving was, for me, one of the biggest leaps from GPT-3 to GPT-4. It was suddenly good at the same kinds of problems I loved thinking about as a kid.
Now, you can always argue about what this “is.” You can say, “It’s not thinking, it’s just pattern-matching” or “it’s working backwards from the objects to a plausible solution.” Maybe. I don’t really know how I think either. What I care about is the outcome: I was getting novel solutions.
And that’s why I’ve always found the criticism “it only repeats things back to you” so shallow. My tests were specifically designed to avoid repetition. The mysteries didn’t exist. The puzzles didn’t exist. It couldn’t have memorized them. It had to generate an answer that fit.
All of this convinced me how important it is to have your own evals.
First, it’s just a practical way to measure new models for yourself, quickly, when something drops.
Second, your evals should match the work you’re actually trying to do. With coding models, for example, you’ll get wildly different takes—some people say a model is amazing, some say it’s mediocre—and a lot of that is just because “coding” is an enormous surface area. Front-end design is not the same as building back-end systems, and neither is the same as doing low-level kernel work. When someone says “this model is terrible,” they might mean “terrible for my niche,” which could be irrelevant to yours. When someone says “this model is great,” it could be great at something you don’t care about.
Some of my evals are basically long lists where I force the model to keep track of what’s going on at different points. People talk about “needle in the haystack” as the context-window test, but I found a more granular version useful: I’d ask things like, “What was at item 922?” and then I’d also check earlier points like 800, 700, etc. If it gets 800 wrong but it’s “vaguely in the neighborhood,” that tells you something different than if it completely derails. You can chart where context starts to drift and get a real feel for what it’s good for and what it’s risky for.
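That granular version is easy to automate: build a long numbered list, then classify the model's answer at each probe depth as an exact hit, a near miss ("vaguely in the neighborhood"), or a full derailment. A sketch, with hypothetical names (`build_haystack`, `grade_recall`) and an arbitrary vocabulary:

```python
import random

# Granular needle-in-the-haystack probe: generate a long numbered list,
# then grade recall at chosen depths (e.g. 922, 800, 700) as "exact",
# "near" (within a small window of the target), or "derailed".

VOCAB = ["anchor", "basil", "cobalt", "dahlia", "ember",
         "fjord", "garnet", "heron", "indigo", "juniper"]

def build_haystack(n=1000, seed=0):
    """Return n distinct-looking items; item k (1-indexed) is items[k - 1]."""
    rng = random.Random(seed)
    return [f"{rng.choice(VOCAB)}-{rng.randrange(100)}" for _ in range(n)]

def grade_recall(items, k, answer, window=5):
    """Classify the model's answer to 'What was at item k?'."""
    if answer == items[k - 1]:
        return "exact"
    lo, hi = max(0, k - 1 - window), min(len(items), k + window)
    return "near" if answer in items[lo:hi] else "derailed"

# Usage: feed the numbered list to the model, ask about items 922, 800,
# 700, ..., and chart the exact/near/derailed counts by depth.
```

Charting those grades by depth is exactly the "where does context start to drift" picture: a band of "near" answers before the "derailed" ones tells you the model is degrading gracefully rather than falling off a cliff.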
And that’s another key point: understand what your eval actually shows you—and what it doesn’t. A test can be revealing without being universal. Something might not be the right eval for legal work or for coding accuracy, but it might be perfectly fine if you’re writing fiction and you care about character intent across a long context.
Then, when we started testing visual models, I kept doing the same thing, just shifted into a new modality. I’d design little Rube Goldberg-style setups and see if the model could reason through what would happen next. One of the examples we showed around the GPT-4 launch was giving it a photo of a balloon tied to an object positioned over something like a springboard or a seesaw and asking what would happen. Could it reason about the balloon, the string, the balance, the motion—how the sequence plays out?
That was fun, but it was also important, because vision wasn’t just a new feature—it was a new kind of learning signal. We went from models learning from text tokens to models learning from what they can extract and represent from images. And I think more people should’ve paid attention to what that implies, because vision models ended up being really informative about how these systems might be representing the world.
That’s something I want to come back to later in more detail—what visual reasoning actually showed us about how models think.