Here's a number that should annoy you.
GPT-5.5 charges $30 per million output tokens. DeepSeek's V4-Flash charges $0.28 per million output tokens. That's not a typo, and it's not me sloppily rounding. One of those models costs roughly 107 times as much as the other for the thing you actually pay the most for, which is the text the model writes back. The GPT-5.5 pricing page and DeepSeek's published rates are both public. Go look for yourself.
Now think about your last hundred API calls. A few were genuinely hard. Most of them, if you're being honest with yourself, were not. "Summarize this email." "Pull the three line items out of this invoice and tell me which one looks padded." A decent cheap model nails both of those, and you would never spot the difference in the output. Not once.
So that's the Ferrari problem. You bought the fastest car on the lot because someday, maybe, you'd need it on the track. And now you're idling it through a school zone to go buy milk.
The real spread
Here are the prices. All from the providers' own pages, all current as of June 2026.
GPT-5.5, released April 23, 2026: $5 per million input tokens, $30 per million output. Claude Opus 4.8: $5 input, $25 output. These are the flagship rates. They're also the defaults a lot of teams quietly leave wired into production, because nobody wants to be the person who downgraded the model and then has to explain it when something breaks.
Against that, DeepSeek V4-Flash runs $0.14 input and $0.28 output. On output, the expensive line, that's the 107x gap against GPT-5.5 and an 89x gap against Opus 4.8. Input is gentler, a mere 36x against GPT-5.5.
And you don't even have to leave Anthropic to feel this. Opus 4.8 is $25 per million output. Claude Haiku 4.5 is $5. Same company, same tokenizer family, same API, and you're paying five times the price for the bigger brain. Here's the part that gets me: Anthropic's own pricing page tells you flat out to "Choose Haiku for simple tasks." They literally baked the cheaper model into their own cost advice. The defaults just don't listen.
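If you want to check the multiples yourself, they are plain division on the per-million prices quoted above. Here is a minimal sketch in Python. The dictionary keys are just labels I made up for this post, not official API model IDs, and Haiku only has an output price because that is the only one quoted here.

```python
# USD per million tokens, copied from the figures quoted in this post.
# The keys are illustrative labels, not official API model identifiers.
PRICES = {
    "gpt-5.5": {"input": 5.00, "output": 30.00},
    "claude-opus-4.8": {"input": 5.00, "output": 25.00},
    "claude-haiku-4.5": {"output": 5.00},  # only the output price is quoted above
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
}


def ratio(expensive: str, cheap: str, kind: str = "output") -> float:
    """How many times more the expensive model costs per token of the given kind."""
    return PRICES[expensive][kind] / PRICES[cheap][kind]


pairs = [
    ("gpt-5.5", "deepseek-v4-flash", "output"),
    ("claude-opus-4.8", "deepseek-v4-flash", "output"),
    ("gpt-5.5", "deepseek-v4-flash", "input"),
    ("claude-opus-4.8", "claude-haiku-4.5", "output"),
]

for expensive, cheap, kind in pairs:
    print(f"{expensive} vs {cheap} ({kind}): {ratio(expensive, cheap, kind):.1f}x")
```

Swap in whatever prices are current when you read this. The prices will move, but the division stays the same.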
"But my prompts are special"
Sure, some of them are. The real question is what share.
And the cleanest evidence for that comes straight out of the routing research. Back in July 2024, LMSYS published RouteLLM, a router that decides per query whether a prompt goes to a strong model or a cheap one. With LLM-judge data augmentation layered on top of Chatbot Arena preference data, it hit 95% of GPT-4's quality on MT-Bench while sending only 14% of queries to GPT-4. The other 86% went to a much cheaper model, and the benchmark score dropped by only about 5%. Even on Arena data alone, no augmentation, the router only needed to send 26% of queries to GPT-4 to hold that same quality bar.
Sit with that for a second. On a benchmark built on purpose to be hard, the strong model earned its keep on a minority of queries. Your real traffic is almost certainly tilted further in the direction that helps you. Production is mostly classification, extraction, formatting, short rewrites. It is not a stack of adversarial reasoning puzzles, no matter how it feels on the bad days. The annoying part is that the few prompts that genuinely do need the big model are buried in the pile with everything else, and from the client's point of view they all look identical.
Where the money actually leaks
This is the part that stings. The waste is not on the hard prompts. Paying Opus rates for a genuinely hard reasoning task is completely fine. That's the job. That's what it's for.
The waste is structural. It's the default doing exactly what you told it to. You picked one model, wired it into the client months ago, and now every single call pays the premium whether it earned it or not. The corner-shop runs and the track days bill at the same rate, because the system has no clue which is which. A 107x gap barely touches you on the thin slice of prompts that truly need the expensive model. It hammers you on all the rest, multiplied across every request, every day, until someone eventually notices and asks why the bill looks like that.
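To put a rough number on that, here is a back-of-the-envelope sketch. Everything about the workload is an assumption: 100 million output tokens a month is made up, and the 14% strong-model share is borrowed from the RouteLLM result above purely as an illustration, not a prediction for your traffic. It also counts output tokens only and ignores input cost.

```python
# USD per million output tokens, from the prices quoted earlier in this post.
FLAGSHIP_OUTPUT = 30.00  # GPT-5.5
CHEAP_OUTPUT = 0.28      # DeepSeek V4-Flash


def monthly_output_cost(tokens_millions: float, strong_share: float) -> float:
    """Output-token bill when `strong_share` of tokens go to the flagship
    and the rest go to the cheap model. Input tokens are ignored."""
    blended_rate = strong_share * FLAGSHIP_OUTPUT + (1 - strong_share) * CHEAP_OUTPUT
    return tokens_millions * blended_rate


TOKENS_M = 100       # hypothetical workload: 100M output tokens per month
STRONG_SHARE = 0.14  # illustrative, borrowed from the RouteLLM figure

everything_on_flagship = monthly_output_cost(TOKENS_M, 1.0)
routed = monthly_output_cost(TOKENS_M, STRONG_SHARE)
saved = 1 - routed / everything_on_flagship

print(f"all flagship: ${everything_on_flagship:,.2f}")
print(f"routed:       ${routed:,.2f}")
print(f"saved:        {saved:.1%}")
```

Under those assumptions the hard prompts still account for almost all of the routed bill. The easy 86% costs close to nothing once it stops paying flagship rates.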
And here's the thing people forget: the cheap option isn't a toy anymore. That's what actually changed. Five years ago "use the cheaper model" was code for "accept worse answers." Not now. A sub-fifty-cent-per-million model handles the routine load at quality you cannot pick apart from the flagship on those tasks. The price gap is still enormous. For simple work, the quality gap has basically closed.
So the move was never "use the cheap model." It's "stop paying the premium on calls that don't need it." Match each prompt to a model that fits it. Send the hard ones up, the easy ones down, and let the bill follow the actual work instead of some decision you made once and forgot about.
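The shape of that idea fits in a few lines. This is a toy sketch, not how RouteLLM or Flux actually decide. The keyword scorer is a deliberately crude stand-in for a real learned difficulty model, and the names (`make_router`, `toy_score`, "flagship", "budget") are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    model: str    # which tier the prompt was sent to
    score: float  # the difficulty score that drove the decision


def make_router(
    score_fn: Callable[[str], float],
    threshold: float,
    strong: str,
    cheap: str,
) -> Callable[[str], Route]:
    """Build a router: prompts scoring at or above `threshold` go to `strong`,
    everything else goes to `cheap`."""

    def route(prompt: str) -> Route:
        score = score_fn(prompt)
        model = strong if score >= threshold else cheap
        return Route(model=model, score=score)

    return route


# Crude stand-in scorer (an assumption for illustration only):
# count a few "this might be hard" keywords and squash into [0, 1].
HARD_HINTS = ("prove", "derive", "debug", "step by step")


def toy_score(prompt: str) -> float:
    text = prompt.lower()
    hits = sum(hint in text for hint in HARD_HINTS)
    return min(1.0, hits / 2)


route = make_router(toy_score, threshold=0.5, strong="flagship", cheap="budget")

for prompt in ("Summarize this email.", "Prove the bound and derive the constant."):
    decision = route(prompt)
    print(f"{decision.model:9s} score={decision.score:.1f}  {prompt}")
```

The threshold is the knob. Raise it and more traffic goes down to the cheap tier. Lower it and you buy back quality on borderline prompts at flagship prices.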
Which is most of why Flux exists, honestly. One key, one endpoint. The router looks at each request and picks a model that fits it, instead of charging Ferrari rates for a milk run. You get one bill, plus a header on every response telling you which model answered and what it cost. So you can actually see it.
Because the worst thing about paying 100x more than you need to? It doesn't feel like anything. No error. No alert. The response comes back fine, every single time. You just get a bigger number at the end of the month and no idea which calls put it there.
