A chatbot cost Air Canada $812.
That number is a joke. The real bill came after. Back in February 2024, Canada's Civil Resolution Tribunal ruled that Air Canada had to pay damages to a grieving passenger named Jake Moffatt. His mistake was trusting the airline's support chatbot, which told him he could buy a full-price ticket and claim a bereavement discount later. That's not the policy. The actual policy says you have to ask for the discount before you fly. The bot just invented the rest.
Then it got better. Air Canada argued, apparently with a straight face, that the chatbot was its own legal entity, responsible for its own actions. Tribunal member Christopher Rivers wasn't having it. The chatbot "is still just a part of Air Canada's website," he wrote, and the company owns everything on it. So Air Canada paid.
The $812 is the receipt. The headlines, the case law, the years of lawyers citing you as the cautionary tale: that's the real cost, and there's no line item for it.
Here's the bit nobody warns you about when you swap a strong model for a cheaper one to trim the API line. The failure doesn't surface where you're watching. Latency, cost-per-call, the dashboard you spent a week building: green, green, green. The quality drop happens out in the wild, in the one conversation you'll never read, with the one customer who's already screenshotting it.
The cliff is invisible from your side
You don't get a warning when a model gets worse. And sometimes a model gets worse without anyone changing a thing.
Researchers at Stanford and UC Berkeley nailed this down in 2023. They ran the same tasks against the March and June versions of GPT-4 and GPT-3.5. A few months apart, same name, same API. On identifying prime numbers, GPT-4's accuracy fell from 84% to 51%. And the-decoder, covering the same study, reported that GPT-4's share of directly executable code answers collapsed from 52% to 10%. Same name on the box. Same price. A different machine humming away underneath, and no changelog that meant anything to the people building on it.
OpenAI pushed back on how the study was framed. Fine, take the point. It survives anyway. Model behavior drifts, and you don't get to pick when. If a model you never touched can quietly halve its hit rate on you, think about what happens when you go and trade down to something cheaper and weaker on purpose, then walk off and stop looking.
You don't feel the cliff. Your customer does.
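If you want to feel the cliff before your customer does, one option is a small fixed set of questions with known answers that you rerun on a schedule and compare against a recorded baseline. What follows is a minimal sketch of that idea, not anything the Stanford and Berkeley team published. The golden set, the tolerance, and the two stand-in "models" are all made up for illustration; in practice `ask` would wrap whatever API call you actually make.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical golden set: prompts with answers you already know.
GOLDEN_SET = [
    ("Is 17 a prime number? Answer yes or no.", "yes"),
    ("Is 21 a prime number? Answer yes or no.", "no"),
    ("Is 97 a prime number? Answer yes or no.", "yes"),
    ("Is 91 a prime number? Answer yes or no.", "no"),
]


def accuracy(ask: Callable[[str], str], cases: Sequence[Tuple[str, str]]) -> float:
    """Fraction of cases where the reply starts with the expected word."""
    hits = sum(
        1
        for prompt, expected in cases
        if ask(prompt).strip().lower().startswith(expected)
    )
    return hits / len(cases)


def has_drifted(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when accuracy fell more than `tolerance` below the baseline."""
    return (baseline - current) > tolerance


# Stand-ins for real API calls (assumptions, for the demo only).
def careful_model(prompt: str) -> str:
    n = int(prompt.split()[1])  # the number is the second word of the prompt
    return "Yes" if all(n % d for d in range(2, n)) else "No"


def lazy_model(prompt: str) -> str:
    return "Yes"  # says everything is prime


baseline = accuracy(careful_model, GOLDEN_SET)
current = accuracy(lazy_model, GOLDEN_SET)
print(f"baseline={baseline:.2f} current={current:.2f} "
      f"drifted={has_drifted(baseline, current)}")
```

A set this small only catches a collapse, not a slow slide. The point is that the check runs on your schedule, against questions you chose, instead of waiting for a screenshot.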
What a bad answer actually costs
DPD learned this in January 2024. A customer named Ashley Beauchamp couldn't get the delivery firm's chatbot to do anything useful, so he started poking at it to see how far it would bend. He told it to swear at him. It did. He asked it to write a poem about how useless DPD's chatbot was. It wrote one. He told it to slag off its own employer, and it called DPD "the worst delivery firm in the world," slow, unreliable, never recommend them. Every one of those was a thing he asked for, and the bot just rolled over and did it, no guardrail anywhere. He posted the screenshots. The thread hit 1.3 million views. DPD pulled the AI offline and blamed a system update.
That's the part that should keep you up. A cheap, unguarded bot wired to a public brand account will do whatever the person typing tells it to, and there is always someone bored enough on a Tuesday to find out where the edges are. The tokens you saved putting that bot in production? They'd vanish into the rounding on a monthly invoice. You swapped a rounding error for a brand-name punchline, and you didn't even know you'd made the trade until it was on Twitter.
That's the whole problem in one line. The savings are tiny and they land in your column. The damage is huge and it lands in your customer's, on a clock you don't get to set. You tuned the number you could see and left the one you couldn't wide open.
A floor, not a coin flip
The answer isn't "always reach for the most expensive model." That's just the same mistake pointed the other way, and most of your traffic doesn't need a frontier model anyway. The answer is making sure whatever handles a given request is actually good enough for that request. No request quietly sliding off the bottom of the range to save a fraction of a cent.
Cheap is fine. Cheap-and-unwatched is the trap. What separates the two is a floor. A guarantee that nothing gets routed below the bar for the job in front of it, even when a cheaper option is sitting right there, looking gorgeous on the cost report.
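To make the idea of a floor concrete, here is a rough sketch of routing with one: pick the cheapest model whose measured score for the task clears the bar, and refuse to route at all if nothing does. This is an illustration under loud assumptions, not a description of how any real router works. The model names, prices, task labels, and quality scores are invented, and the scores are assumed to come from your own evaluations.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Model:
    name: str
    cost_per_1k_tokens: float   # made-up prices
    quality: Dict[str, float]   # task -> score in [0, 1] from your own evals


# Hypothetical floors: the minimum acceptable score per kind of request.
FLOORS = {"smalltalk": 0.60, "policy_question": 0.95}

# Hypothetical catalog.
CATALOG = [
    Model("small", 0.10, {"smalltalk": 0.90, "policy_question": 0.70}),
    Model("medium", 0.50, {"smalltalk": 0.95, "policy_question": 0.96}),
    Model("large", 3.00, {"smalltalk": 0.97, "policy_question": 0.99}),
]


def route(task: str, catalog: List[Model] = CATALOG,
          floors: Dict[str, float] = FLOORS) -> Model:
    """Return the cheapest model that meets the floor for `task`.

    Fails closed: an unknown task raises KeyError, and a task no model
    can serve raises LookupError, rather than quietly falling back to
    whatever is cheapest.
    """
    floor = floors[task]
    eligible = [m for m in catalog if m.quality.get(task, 0.0) >= floor]
    if not eligible:
        raise LookupError(f"no model meets the floor {floor} for {task!r}")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


print(route("smalltalk").name)
print(route("policy_question").name)
```

With these numbers, small talk goes to the cheapest model and the policy question skips it, because 0.70 is under the 0.95 bar. The design choice that matters is the failure mode: when nothing qualifies, the router raises instead of downgrading.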
This is the problem we built Flux to chew on. Today it hands you one key across 30+ models and picks a sensible one per request, instead of making you hard-code a choice and cross your fingers. Every call tells you which model answered and what it cost, right in the response headers, so the routing isn't a black box you're praying to. The quality floor, the hard guarantee that nothing drops below the bar for the job, is the piece we're building next. The point of the whole thing is simple. Saving money shouldn't quietly mean shipping a worse answer to the person paying you.
Air Canada's receipt said $812. They can cover that all day. What they couldn't buy back was every future customer who read the story before they ever opened the app.
