A model tops the leaderboard. You wire it in. Your traffic doesn't care.
That gap, between the score on the chart and the answer your app actually gets back, is where a lot of money quietly goes to die. People pick the model with the biggest number next to its name and assume the number says something about their workload. Mostly it doesn't.
Here's the cleanest bit of evidence I know. Back in November 2023, a group of researchers out of Yale and a few other places (Deng, Zhao, Tang, Gerstein, Cohan) ran a dead simple test. They took multiple-choice questions from MMLU, the benchmark everyone quotes, blanked out one of the answer options, and asked the model to fill in the missing one. Not answer the question. Just guess the wording of an option nobody had told it. GPT-4 reconstructed the hidden option at a 57% exact-match rate. ChatGPT got 52%.
You can't do that by reasoning. The only way you pull that off is if you've already seen the test. The questions had leaked into the training data. So part of what that "score" measures is memory, not capability.
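If you want to try that kind of probe on a model you're evaluating, the scoring side is tiny. What follows is a rough sketch of the idea, not the researchers' code: the function names, the `[MASK]` placeholder, and the normalization rules are all my own assumptions, and you'd have to supply the model call yourself.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def build_masked_prompt(question: str, options: dict[str, str], hidden: str) -> str:
    """Show the question with one option blanked out (hypothetical prompt format)."""
    lines = [question]
    for label, text in options.items():
        shown = "[MASK]" if label == hidden else text
        lines.append(f"{label}. {shown}")
    lines.append(f"Fill in the text of option {hidden}.")
    return "\n".join(lines)


def exact_match_rate(guesses: list[str], truths: list[str]) -> float:
    """Fraction of guesses that equal the hidden option after normalization."""
    if not guesses:
        return 0.0
    hits = sum(normalize(g) == normalize(t) for g, t in zip(guesses, truths))
    return hits / len(guesses)
```

Feed each masked prompt to the model, collect its guesses, and pass them to `exact_match_rate` along with the real hidden options. A high rate on a public benchmark is a strong hint the questions were in the training data.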
The benchmark is in the training set
This has a name. Contamination. And it's not some fringe thing a few purists worry about. A 2024 study built a clean math benchmark, GSM1k, hand-written to match the difficulty of the popular GSM8k set but kept out of anywhere a crawler could reach. Then they re-ran the whole field on it. Some models dropped by as much as 8% on the fresh questions. And the more likely a model was to cough up a verbatim GSM8k example, the further its score fell. Read that back: some of these models had, at least in part, quietly memorized the test.
Now, the frontier models from the big labs held up better. Which is exactly the problem. The leaderboard never told you which models were memorizing and which were actually thinking. You had to run a held-out test to find out. Almost nobody does.
The leaderboard itself can be gamed
The public preference leaderboards have a different problem, and it's baked into how they work.
April 2025, a team led by Shivalika Singh, with Sara Hooker and others, put out "The Leaderboard Illusion." They dug into Chatbot Arena, the head-to-head voting one. Turns out some providers were quietly testing a pile of private model variants and only publishing the best score. Meta ran 27 private variants in the run-up to Llama-4. Spin up enough versions, keep the one that got lucky, and suddenly the ranking loves you.
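You can see why that works with a toy simulation. This is a sketch under loud assumptions, not the paper's analysis: every variant has exactly the same true skill, each measured score is that skill plus Gaussian noise, and the numbers (a true score of 1200, noise of 15 points) are made up for illustration.

```python
import random
import statistics


def best_of_n_score(true_skill: float, noise_sd: float, n_variants: int,
                    rng: random.Random) -> float:
    """Measure n equally good variants with noise and publish only the best."""
    return max(rng.gauss(true_skill, noise_sd) for _ in range(n_variants))


def average_published_score(n_variants: int, trials: int = 2000,
                            true_skill: float = 1200.0, noise_sd: float = 15.0,
                            seed: int = 0) -> float:
    """Average published score over many repeats of the best-of-n game."""
    rng = random.Random(seed)
    return statistics.fmean(
        best_of_n_score(true_skill, noise_sd, n_variants, rng)
        for _ in range(trials)
    )


if __name__ == "__main__":
    for n in (1, 5, 27):
        print(n, round(average_published_score(n), 1))
```

Nothing about the model changes between runs. The published number climbs anyway as the variant count goes up, purely because you're reporting the luckiest draw.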
Then it gets worse. Google and OpenAI were estimated to have received around 19.2% and 20.4% of all the Arena's battle data, respectively, while 83 open-weight models put together scraped about 29.7%. If you've got the most data, you can tune toward the test distribution, and the paper showed that even a modest dose of extra Arena-shaped data could juice relative scores by up to 112% on that distribution. So the leaderboard is partly just measuring who tried hardest to win the leaderboard.
None of this makes the models bad. It means the score is busy answering a question that isn't yours.
Your traffic is the benchmark
Think about what an MMLU number even is. It's an average. Law, medicine, math, history, abstract algebra, a few dozen domains stirred together and weighted in a way that has zero to do with your app. If you spend all day classifying support tickets or yanking fields out of invoices, a model's grade-school arithmetic chops are noise. The model sitting tenth on the chart might be first on your prompts. At a fifth of the cost.
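The arithmetic is easy to check for yourself. Here's a minimal sketch with invented numbers: two hypothetical models, three hypothetical domains, and two sets of weights, one flat like a benchmark average and one shaped like a support-ticket workload. None of these figures come from a real evaluation.

```python
def weighted_score(per_domain: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-domain accuracy; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(per_domain[domain] * w for domain, w in weights.items()) / total


# Made-up per-domain accuracies for two hypothetical models.
model_a = {"math": 0.92, "law": 0.88, "ticket_triage": 0.70}
model_b = {"math": 0.75, "law": 0.78, "ticket_triage": 0.91}

# A benchmark-style flat average versus a workload that is almost all tickets.
uniform = {"math": 1, "law": 1, "ticket_triage": 1}
your_traffic = {"math": 0, "law": 1, "ticket_triage": 19}

if __name__ == "__main__":
    for name, scores in (("model_a", model_a), ("model_b", model_b)):
        print(name,
              round(weighted_score(scores, uniform), 3),
              round(weighted_score(scores, your_traffic), 3))
```

Under the flat weights model_a comes out ahead. Under the ticket-heavy weights model_b wins comfortably. Same models, same per-domain scores, different ranking.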
The only test that tells you how a model behaves for you is your own prompts, graded on whether the answer was actually usable. Everything else is a proxy, and proxies drift. There's a name for what happens when you turn a measure into a target. Goodhart's law. The minute a benchmark becomes the thing labs chase, it stops being a good measure of anything else.
So the honest way to pick a model is boring. Grab a slice of your real requests. Run them across a few candidates. Look at what worked and what each one cost you. Then do it again next month, because the cheap-and-good model from this month probably won't be the cheap-and-good model in four weeks. Slower than reading a chart, sure. It also doesn't lie to you.
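The boring loop fits in a few dozen lines. This is a sketch, not Flux's implementation or anyone's official harness: `evaluate`, `Result`, and the shape of the candidate callables (take a prompt, return the answer and what the call cost) are all assumptions, and `is_usable` stands in for whatever grading you trust, whether that's a human, a rule, or a script.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a candidate takes a prompt, returns (answer, cost in dollars).
ModelCall = Callable[[str], tuple[str, float]]


@dataclass
class Result:
    model: str
    usable: int
    total: int
    cost: float

    @property
    def usable_rate(self) -> float:
        return self.usable / self.total if self.total else 0.0

    @property
    def cost_per_usable(self) -> float:
        # No usable answers means the effective price is unbounded.
        return self.cost / self.usable if self.usable else float("inf")


def evaluate(prompts: list[str],
             candidates: dict[str, ModelCall],
             is_usable: Callable[[str, str], bool]) -> list[Result]:
    """Run every prompt through every candidate; rank by cost per usable answer."""
    results = []
    for name, call in candidates.items():
        usable, cost = 0, 0.0
        for prompt in prompts:
            answer, price = call(prompt)
            cost += price
            usable += bool(is_usable(prompt, answer))
        results.append(Result(name, usable, len(prompts), cost))
    return sorted(results, key=lambda r: r.cost_per_usable)
```

Ranking by cost per usable answer rather than raw price is a deliberate choice here: a model that is cheap per call but wrong half the time isn't cheap. Save the output with a date on it so next month's run has something to be compared against.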
This is the thing we built Flux around. Route each prompt to a model that's earned it on traffic that looks like yours, and show you the real cost per answer right there in the response. The benchmark that counts is the one running through your own pipe. Your traffic is the one test nobody else gets to study for.
