The cheapest model on the price sheet is often the most expensive thing in your whole stack. Sounds backwards. It isn't.
Here's the mechanism. You pick a model on per-token price, because that's the number everyone puts in front of you. $1.74 per million input tokens instead of $5. Looks like a 3x win, easy call, move on. Then you wire it into an agent, hand it a real task, and it gets the task slightly wrong. The agent reads the error, re-sends the whole context, asks again. And again. A single coding task on SWE-bench can fan out into up to 50 model calls and 49 tool calls before it lands, with some trajectories running past 40 turns. Every turn is a fresh charge. So that clean per-token price you optimized for just got multiplied by a number you never once looked at.
This multiplier is not a thought experiment. One analysis of agent loops laid the arithmetic out: a naive 10-step loop that keeps re-feeding accumulating context costs about 43.3x more than the same work done in a single pass. On input tokens alone the reported gap is 472,500 versus 9,000, which is about 52x. And that is before anything actually breaks. Now factor in reliability. Chain ten steps at 95% success each and your end-to-end success rate is roughly 60%, because 0.95 to the tenth power is about 0.599 and nobody does that multiplication in their head. The failures kick off retries, and by the same account those retries burn about 40% more tokens on their own.
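Both effects are easy to check for yourself. Here's a minimal sketch, assuming every step re-sends the full context, the context grows by a fixed amount per step, and steps succeed or fail independently. The function names and token counts are made up for illustration; they are not the figures from the analysis cited above.

```python
def loop_input_tokens(base_context: int, added_per_step: int, steps: int) -> int:
    """Total input tokens when every step re-sends the full, growing context."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # the whole context is billed again on each call
        context += added_per_step   # tool output and replies get appended for next time
    return total


def chain_success_rate(per_step_success: float, steps: int) -> float:
    """Chance that every step in a chain succeeds, assuming independent steps."""
    return per_step_success ** steps


# Illustrative numbers only (hypothetical, not taken from the cited analysis).
naive = loop_input_tokens(base_context=2_000, added_per_step=1_500, steps=10)
single_pass = 2_000 + 1_500 * 10    # the same material, sent once
print(naive, single_pass, round(naive / single_pass, 1))  # 87500 17000 5.1
print(round(chain_success_rate(0.95, 10), 3))             # 0.599
```

Even with these modest made-up sizes the loop bills about five times the input of a single pass. The growth is quadratic in the number of steps, so longer loops get worse fast.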
The retry tax is invisible until it isn't
The nasty part is you don't see it coming. The same writeup describes an incident where an API format change pushed an agent's token rate to 200x baseline, burning around $50 in 40 minutes before anyone caught it. Fifty bucks is nothing. Fifty bucks every 40 minutes, quietly, across a fleet of agents nobody is staring at? That's a budget meeting with your name on the calendar invite.
The unit of cost that actually matters is not the token. It's the finished task. A model that needs three tries to get there has at least tripled its real price, and that's before you count the tool calls in between and the human who eventually has to step in when the loop just gives up. Same task. Wildly different bill. The only thing that changed was how many times the model had to be called back.
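One way to put a number on "cost per finished task" is an expected-value calculation. This is a rough sketch under stated assumptions: every attempt costs the same, attempts succeed independently at the same rate, and a person finishes the job after a fixed number of failures. The function name and all the inputs are hypothetical.

```python
def expected_cost_per_finished_task(
    cost_per_attempt: float,
    success_rate: float,
    max_attempts: int,
    human_fallback_cost: float,
) -> float:
    """Expected spend to get one task finished, by the model or by a person.

    Assumes independent attempts with a constant success rate, and that a
    human takes over after max_attempts consecutive failures.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    fail = 1.0 - success_rate
    # Attempt k+1 only happens if the first k attempts all failed.
    expected_attempts = sum(fail ** k for k in range(max_attempts))
    p_all_fail = fail ** max_attempts
    return cost_per_attempt * expected_attempts + p_all_fail * human_fallback_cost


# Hypothetical inputs: $0.10 per attempt, 50% success, 3 tries, $20 of human time.
print(round(expected_cost_per_finished_task(0.10, 0.5, 3, 20.0), 3))
```

In this made-up case the human fallback term dominates the result, which is the part a per-token price never shows you.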
Per-token price is a decoy
Run the comparison the honest way and the cheap model stops looking cheap. Take a real pairing from this spring. GPT-5.5 shipped April 23, 2026 at roughly $5 input and $30 output per million tokens, and scored 58.6% on SWE-Bench Pro, the harder benchmark that tracks how often a model actually resolves a real GitHub issue start to finish. DeepSeek V4-Pro runs about $1.74 input and $3.48 output, roughly a ninth of the output price.
On per-token math the call writes itself. Except it depends entirely on the job. For summarizing a doc or pulling a field out of some JSON, the gap in success rate is basically nil and the cheap model wins going away. Now hand both a gnarly multi-file refactor. Say the expensive one lands it first try, while the cheap one loops four times and still hands you back something a human has to finish. On that job the "expensive" model was the one that saved you money. The per-token sticker told you nothing about which job you were sitting in.
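To make the two jobs concrete, here's a sketch that prices them with the per-million-token rates quoted above. Everything else is an assumption for illustration: the token counts, the number of calls, the $25 of human cleanup, and the simplification that every retry is the same size as the first call.

```python
# USD per million tokens as (input, output), taken from the prices quoted above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "DeepSeek V4-Pro": (1.74, 3.48),
}


def cost_per_call(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one model call at list price."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def task_cost(model: str, calls: int, input_tokens: int, output_tokens: int,
              human_cost: float = 0.0) -> float:
    """Cost of a task that took `calls` equally sized calls plus any human cleanup."""
    return calls * cost_per_call(model, input_tokens, output_tokens) + human_cost


# Hypothetical job sizes and outcomes, chosen to mirror the scenario in the text.
easy = {"input_tokens": 2_000, "output_tokens": 200}
hard = {"input_tokens": 50_000, "output_tokens": 5_000}

for label, model, calls, job, human in [
    ("easy", "GPT-5.5", 1, easy, 0.0),
    ("easy", "DeepSeek V4-Pro", 1, easy, 0.0),
    ("hard", "GPT-5.5", 1, hard, 0.0),
    ("hard", "DeepSeek V4-Pro", 4, hard, 25.0),  # four loops, then a person finishes
]:
    print(f"{label:5} {model:16} ${task_cost(model, calls, human_cost=human, **job):.4f}")
```

With these assumed sizes, four calls to the cheaper model already cost slightly more than one call to the pricier one, and the human cleanup swamps both. On the easy job the cheaper model is about a quarter of the cost.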
That's the trap, the whole thing. The price sheet ranks models by a number that only means anything when the answer is right. Nobody bills you for wrong answers at a discount. You pay full freight for the wrong answer, then you pay again for the retry, and plenty of the time the retry misses too and the work falls back on a person anyway.
What you'd actually measure
Be honest about cost and you'd track dollars per task that worked. Not dollars per million tokens. That means knowing, per request, which model answered, what it cost, and whether the result held up. Most teams can't see a single one of those three. The model's buried behind an SDK. The cost arrives as one fat aggregate line at month-end. And "did it work" lives in somebody's head, or in a downstream error nobody bothers tracing back to its source.
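If you did have those three fields on every request, the metric itself is a few lines. A minimal sketch, assuming a hypothetical `RequestRecord` log entry and a `succeeded` flag you'd have to define yourself (tests passed, no human rework, whatever "held up" means for your workload):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One logged model call. All field names here are hypothetical."""
    task_id: str
    model: str
    cost_usd: float
    succeeded: bool  # did this call's result hold up?


def dollars_per_successful_task(records: list[RequestRecord]) -> dict[str, float]:
    """Per model: total spend divided by the number of distinct tasks it finished."""
    spend: dict[str, float] = defaultdict(float)
    finished: dict[str, set[str]] = defaultdict(set)
    for r in records:
        spend[r.model] += r.cost_usd      # failed calls still count toward spend
        if r.succeeded:
            finished[r.model].add(r.task_id)
    return {
        model: (spend[model] / len(finished[model]) if finished[model] else float("inf"))
        for model in spend
    }


# Made-up log: the cheap model retries task t1 twice and never finishes t2.
log = [
    RequestRecord("t1", "cheap", 0.10, False),
    RequestRecord("t1", "cheap", 0.10, False),
    RequestRecord("t1", "cheap", 0.10, True),
    RequestRecord("t2", "cheap", 0.10, False),
    RequestRecord("t1", "strong", 0.40, True),
]
print(dollars_per_successful_task(log))
```

In this made-up log the two models land at roughly the same dollars per finished task, even though one call costs a quarter of the other.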
So you optimize the one number you can actually see, the per-token price, and you push it in precisely the wrong direction. You send a hard problem to a cheap model, it loops, the bill climbs, and your dashboard cheerfully reports that you cut costs.
The fix is matching the model to the job, request by request, so the easy stuff goes cheap and the hard stuff goes to whatever nails it on the first try. And it cuts both ways. Firing your most expensive model at a one-line prompt is its own flavor of waste. This is what we built Flux to do: right-size each prompt, and show you which model answered and what that answer cost, so the thing you're optimizing is the answer instead of the token.
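As a rough illustration of the idea, and not a description of how Flux works, here's a toy router. The difficulty heuristic, the thresholds, and the model names are all hypothetical; a real router would lean on measured success rates per task type rather than prompt length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def estimate_difficulty(prompt: str, files_touched: int) -> str:
    """Crude, hypothetical heuristic: multi-file or very long prompts count as hard."""
    if files_touched > 1 or len(prompt) > 4_000:
        return "hard"
    return "easy"


def route(prompt: str, files_touched: int = 0,
          cheap_model: str = "cheap-model",
          strong_model: str = "strong-model") -> Route:
    """Send easy requests to the cheap model and hard ones to the strong model."""
    if estimate_difficulty(prompt, files_touched) == "hard":
        return Route(strong_model, "multi-file or long-context task")
    return Route(cheap_model, "short single-step task")


print(route("Extract the 'email' field from this JSON."))
print(route("Refactor the auth module across these files.", files_touched=6))
```

The part that matters is the return value: each decision records which model was picked and why, so you can later check it against what the answer cost and whether it held up.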
A thousand cheap tries that never land cost you more than the one call that did. The price sheet won't tell you which of those you're buying. Your bill will.
