Nobody wrote it into a job description. But sometime in the last two years, a good chunk of your best engineering time stopped going into the product and started going into pipe. Wiring up a model. Re-wiring it when the provider shipped a new version. Juggling three different API formats because no single vendor does every job. Re-tuning a prompt that worked fine on Monday and quietly stopped working by Thursday.
That's plumbing. It keeps the water running. It doesn't ship features.
And the first integration was never the expensive part. It's everything after.
The prompt you tuned is a dependency you don't control
Tune a prompt against a specific model and you've just built a hidden dependency on that model's exact behavior. The vendor can change it underneath you, silently, and your "code" breaks without a single line of your code changing. Most teams learn this the hard way.
It's not a hunch. Somebody measured it. A November 2023 paper, "(Why) Is My Prompt Getting Worse?", put two dated versions of the same OpenAI model side by side, gpt-3.5-turbo-0301 and gpt-3.5-turbo-0613, on a real classification task. One prompt lost 9.6% accuracy after the update. A different prompt on the exact same task gained 5.1%. Same model name, same task, opposite directions, and the only thing that decided which way you went was how you'd happened to word things. Across every update they looked at, 55% didn't move performance consistently in either direction. Which is a polite way of saying you can't predict what an update does to your app without re-testing the whole thing.
So the update lands. Your output format drifts. A trailing period vanishes off a one-word answer and breaks a parser three calls downstream. JSON comes back valid but reshaped, and your hash-based dedup quietly starts treating identical answers as different ones. Nobody gets paged. Nothing errored. The request returned HTTP 200, the latency looked normal, and you find out three days later when a customer emails to say the results are wrong.
That's the worst kind of bug. It doesn't announce itself. It just sits there and waits.
And the model itself won't hold still
You can't pin your way out of this forever either, because eventually the models get retired.
It's already happened, more than once. Developers reported GPT-4o drifting in early 2025, outputs shifting in ways nobody got a heads-up about. And when a model gets fully retired, the request doesn't always fail loudly. Sometimes it keeps returning 200 while quietly falling back to a newer model that behaves differently. Which is, you guessed it, the silent-regression problem wearing a different hat.
You don't have to take a practitioner's word for any of this. After the GPT-5 launch in August 2025, Sam Altman said out loud that "suddenly deprecating old models that users depended on in their workflows was a mistake," and walked it back, restoring access to the older model. That's the vendor, in plain language, telling you the thing your app leans on is not a thing you control.
So every model you've hard-wired is really a maintenance ticket with a fuse already lit.
Now multiply that by every provider you touch
One model is a part-time plumbing job. The bill gets serious the moment you go multi-provider, and most serious teams do, because no single vendor is the best one (or the cheapest, or even the one that's up) for every task.
Every provider brings its own request shape, its own tool-calling format, its own streaming quirks, its own auth and rate-limit behavior. Hide all of that behind one clean layer and you're now owning routing logic, failover handling, and observability across every provider you touch. Building it once is the cheap part. The thing is alive. Every provider update, every new model, every deprecation is a change request against code that has nothing to do with the product you actually sell.
There's a reason gateways that sit in front of a pile of providers and soak up the retries and the cost tracking exist at all. The normalization work is real, it never ends, and nobody wants to own it forever. A whole product category showing up is just the market admitting out loud that hand-rolling this is a tax.
The bill nobody puts on a dashboard
So count it up. The integration is a one-time cost you can budget for, fine. But the re-integration on every model change, the prompt re-tuning after silent updates, the three-day-late regression hunts, the per-provider quirks, the deprecation migrations that land on the vendor's schedule and never yours? That part is recurring. And it lands squarely on the people you hired to build the product.
It never shows up on a cost dashboard, because it isn't billed in dollars. It's billed in your senior engineers' time, and they don't get it back.
This is the piece we built Flux to take off your plate. One endpoint that speaks both OpenAI and Anthropic, one key, and the routing across providers happens behind it, so the wiring and re-wiring stops being your team's problem. The models will keep moving. Somebody has to keep up with that. It just doesn't have to be the engineer you wanted shipping product this quarter.
