What AI Benchmarks Don't Tell Us
0x41434f
For months now, I’ve been trying to make sense of something that’s been bothering me. On the surface, the AI industry looks unstoppable. There are flashy demos, soaring valuations, and a new benchmark achievement every other week. The language around it is supercharged: smarter, faster, more human. But the deeper I go, especially while building with LLMs myself, the more I’ve begun to question whether the progress we’re measuring is the progress that actually matters.
That doubt has only grown stronger after reading about recent AI scandals that all seem to follow a similar pattern. First, there’s a pitch about automation. Then, it turns out the AI was powered by humans the whole time.
Nate, a startup that raised over $50 million for a so-called AI shopping assistant, was recently exposed by the Department of Justice. Its "universal checkout" feature, marketed as AI-powered, was really being handled by workers in the Philippines. The app's automation rate was reportedly zero. Then came Presto, a drive-thru ordering startup that claimed to use AI to take food orders, but more than 70 percent of those were actually routed to offshore call centers. EvenUp, a legal tech unicorn, painted a picture of AI scaling legal work. But internal accounts revealed humans working 16-hour shifts just to make up for what the AI couldn’t do. And finally, 11x, an AI sales startup backed by a16z and Benchmark, was accused of inflating ARR, listing customers who never signed up, and hiding massive churn behind annualized projections that didn’t reflect reality.
These aren’t just cautionary tales. They are symptoms of something deeper. A gap between what we claim AI is doing, and what it’s actually doing under the hood.
To me, the root problem is how we evaluate these systems. Every release is benchmarked. Every model gets a leaderboard spot. But I’m starting to think these scores don’t mean much. Benchmarks like MMLU, FrontierMath, and GSM8K dominate model comparisons, yet the industry now knows they are heavily contaminated. These tests are freely available online. LLMs trained on massive web data have likely seen them before. So the models memorize the questions and regurgitate the answers. That’s not reasoning. That’s not understanding. It’s cheating with a straight face.
When researchers tried to probe GPT-4’s reasoning, they found its performance dropped on coding problems published after its training cutoff. That means its high scores likely came from exposure during training, not from general skill. When tested on MMLU, it could even recall specific incorrect multiple-choice options, another sign of memorization. And OpenAI’s o3 model scored impressively on FrontierMath, a test it quietly helped fund and design. Even though OpenAI claimed the problems were held out from training, the optics are bad. And optics matter when you're asking the public to trust evaluations.
Meta got caught too. They tuned an experimental version of their Llama 4 Maverick model to be unusually chatty on LMArena, a benchmark where humans vote on outputs. The extra charm helped the model score higher than expected. The problem isn’t just cherry-picking. It's that the whole benchmarking ecosystem is broken. Models are trained on the test. Test authors quietly collaborate with labs. Leaderboards shift based on tone, not substance.
And the kicker? None of these benchmarks test what actually matters to most people using AI at work.
Acing a long-division benchmark is not the same as provisioning accounts in an Identity Provider (IdP). A math puzzle is not the same as routing a ticket or writing a compliance memo. Enterprise tasks are messy, context-dependent, and deeply integrated into business processes. We need evaluations that reflect that. And we don't have them.
Even in security, most benchmark tasks feel like capture-the-flag (CTF) puzzles: fun in a tournament, but useless for actual cybersecurity incidents. I’ve seen it firsthand. I’ve worked jobs where securing real systems meant scripting IAM rules, not solving riddles. None of that shows up in the metrics these models are optimized for.
All this has shaped how I approach building QuickDesk. It’s a tool that uses LLMs to help employees handle internal support tasks: resetting passwords, requesting access, surfacing policies. I believe AI can help here, but I’ve also seen what happens when we pretend it works better than it does.
So I take a different approach.
First, I design around failure. If the LLM agent doesn’t know something, it should say so. If the query is too complex, it should escalate to a human. If it retrieves the wrong data, it should let the user correct it. I’m not trying to create an illusion of intelligence. I’m trying to build trust over time.
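To make that concrete, here is a minimal sketch of what failure-aware routing can look like. This is illustrative Python, not QuickDesk’s actual code: the confidence threshold, the escalation keywords, and the `draft_answer` stub are placeholder assumptions standing in for a real retriever and LLM call.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    ANSWERED = auto()
    DONT_KNOW = auto()
    ESCALATED = auto()


@dataclass
class Retrieval:
    document: str
    score: float  # retriever similarity score, assumed to be in [0, 1]


@dataclass
class AgentReply:
    outcome: Outcome
    message: str
    sources: list[Retrieval]


# Illustrative values; in practice these would be tuned against an
# internal evaluation set, not hard-coded.
MIN_RETRIEVAL_SCORE = 0.65
ESCALATION_KEYWORDS = {"terminate", "legal hold", "payroll exception"}


def draft_answer(query: str, sources: list[Retrieval]) -> str:
    # Placeholder for the real LLM call; it cites its sources so the
    # user can spot and correct a bad retrieval.
    cited = "; ".join(s.document for s in sources)
    return f"Based on {cited}: ..."


def handle_request(query: str, retrievals: list[Retrieval]) -> AgentReply:
    """Route a support query, preferring honest failure over a confident guess."""
    # 1. Escalate requests the agent should not handle on its own.
    if any(kw in query.lower() for kw in ESCALATION_KEYWORDS):
        return AgentReply(
            Outcome.ESCALATED,
            "This one needs a human. I've opened a ticket for the support team.",
            [],
        )

    # 2. Admit ignorance when retrieval confidence is low instead of guessing.
    confident = [r for r in retrievals if r.score >= MIN_RETRIEVAL_SCORE]
    if not confident:
        return AgentReply(
            Outcome.DONT_KNOW,
            "I couldn't find a policy that answers this. Want me to escalate it?",
            [],
        )

    # 3. Answer, but surface the sources so the user can push back.
    return AgentReply(Outcome.ANSWERED, draft_answer(query, confident), confident)


if __name__ == "__main__":
    reply = handle_request(
        "How do I reset my VPN password?",
        [Retrieval("IT KB: VPN password reset", 0.82)],
    )
    print(reply.outcome.name, "-", reply.message)
```

The specific numbers don’t matter. What matters is that “I don’t know” and “escalate to a human” are first-class outcomes the system is designed to return, not error states it stumbles into.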
Second, I set expectations clearly. QuickDesk is helpful, but not perfect. It makes mistakes. Sometimes the retrieval fails. Sometimes the task execution breaks. I tell users this upfront. I'd rather underpromise and keep improving than overpromise and burn bridges.
Third, I think seriously about evaluation. QuickDesk doesn’t need to ace WebArena. It needs to handle five types of HR questions, two access workflows, and three security policies reliably in a specific company. That’s the benchmark. Custom evals, grounded in real company data, built with buy-in from stakeholders. That’s the path forward.
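To show what I mean by custom evals, here is a small, hypothetical harness. The cases, the graders, and `stub_agent` are made up for illustration; in practice the cases would come from a company’s own tickets and policies, and the pass criteria from the stakeholders who own them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    category: str                 # e.g. "hr_question", "access_workflow", "security_policy"
    prompt: str
    check: Callable[[str], bool]  # grader grounded in the company's own data


# A few illustrative cases. Real ones would be drawn from actual tickets and
# policies, with pass criteria agreed on by HR, IT, and security.
CASES = [
    EvalCase("hr_question", "How many vacation days do new hires get?",
             lambda out: "15" in out),
    EvalCase("access_workflow", "Request read access to the finance dashboard.",
             lambda out: "manager approval" in out.lower()),
    EvalCase("security_policy", "Can I install a personal VPN on my work laptop?",
             lambda out: "not permitted" in out.lower()),
]


def run_evals(agent: Callable[[str], str]) -> dict[str, float]:
    """Return the pass rate per task category for a given agent."""
    results: dict[str, list[bool]] = defaultdict(list)
    for case in CASES:
        results[case.category].append(case.check(agent(case.prompt)))
    return {cat: sum(passed) / len(passed) for cat, passed in results.items()}


if __name__ == "__main__":
    def stub_agent(prompt: str) -> str:
        # Stand-in for the real assistant being evaluated.
        return "New hires get 15 vacation days; dashboard access needs manager approval."

    for category, rate in run_evals(stub_agent).items():
        print(f"{category}: {rate:.0%} passed")
```

A report like this, broken out by the workflows a company actually cares about, tells you far more than a leaderboard score ever will.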
This is also why I believe verticalized LLMs are the future, not generalist ones. When you narrow the scope, you get better data. And when you get better data, you can build smaller models fine-tuned on real tasks. That’s where enterprise AI wins. Not in chasing leaderboard points, but in solving actual problems that matter to real people at work.
That’s also why I’m skeptical of this idea that LLMs are about to spark a scientific revolution. Hugging Face co-founder Thomas Wolf’s essay hit a nerve with me. He said current AI models are like “yes-men on servers,” trained to echo what they’ve seen, not challenge it. He’s right. The models are excellent students, but they are not researchers. They don’t ask new questions. They don’t reframe old assumptions. They interpolate between known facts, but they don’t push past them. And when they do make something new, we usually can’t tell whether it’s insight or noise.
We’re building models that ace tests, but fail the real world. We’re training models on human knowledge, then wondering why they don’t invent something beyond it. We’re chasing better benchmarks when we should be building better products.
So yes, the demos are exciting. But I’ve learned to look at what’s happening behind the curtain. If the system works only because ten people in the Philippines are manually routing requests, that’s not AI. That’s just repackaged outsourcing. If a model gets a perfect score on a test it saw during training, that’s not intelligence. That’s a magic trick with a good stage crew.
And if a company promises to replace entire departments with AI in six months, ask to see the logs.
This industry doesn’t need more hype. It needs more humility. It needs builders who are honest about what AI can do, and even more honest about what it can’t.
That’s what I’m trying to be. And that’s the bar I’m holding QuickDesk to.
We don’t need perfect models. We need honest ones. We don’t need leaderboard champions. We need systems that help people do their jobs and admit when they can’t.
That’s what real progress looks like to me.