Rate limits and retries

How FluxRouter signals rate limits with HTTP 429, how to retry with exponential backoff, how to keep retries idempotent at your layer, and how to be a good API client.

FluxRouter applies rate limits to keep the gateway fast and fair for everyone. When you exceed a limit, you get an HTTP 429 Too Many Requests response. This page explains how to handle that correctly and how to be a well-behaved client.

What does an HTTP 429 mean?

A 429 Too Many Requests means you sent requests faster than your current limit allows. It is a temporary, expected condition, not an error in your code. The right response is to wait briefly and try the request again, not to treat it as a hard failure. If a Retry-After header is present on the response, honor it: wait at least that many seconds before retrying.
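
As a rough illustration of honoring Retry-After, the sketch below turns the header value into a number of seconds to wait. The helper name `retry_after_seconds` is hypothetical and not part of any FluxRouter SDK. It accepts a plain number of seconds, as described above, and also the HTTP-date form that the HTTP specification permits for this header; whether FluxRouter ever sends a date is an assumption made here for safety.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Return how many seconds to wait for a Retry-After value, or None."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "2"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable: let the caller use its own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is nothing left to wait for.
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("2"))   # 2.0
print(retry_after_seconds(None))  # None
```

Returning None when the header is missing or unreadable lets the caller fall back to its own backoff delay instead of failing.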

How should I retry?

Retry with exponential backoff and jitter. The idea is to wait a little after the first failure, then progressively longer after each subsequent one, with a small random offset so many clients do not retry in lockstep.

python
import random
import time

import httpx


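def call_with_retry(client, payload, max_attempts=5, max_delay=8.0):
    delay = 0.5  # initial backoff, in seconds
    for attempt in range(max_attempts):
        response = client.post("/v1/chat/completions", json=payload)
        if response.status_code != 429:
            response.raise_for_status()  # other errors are raised, not retried
            return response
        if attempt == max_attempts - 1:
            break  # out of attempts, so do not sleep again

        # Prefer the server's Retry-After (in seconds) over our own backoff.
        retry_after = response.headers.get("Retry-After")
        try:
            wait = float(retry_after) if retry_after else delay
        except ValueError:
            wait = delay  # header was not a plain number of seconds
        wait += random.uniform(0, 0.25)  # jitter
        time.sleep(wait)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped

    raise RuntimeError("Rate limited after retries")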

Cap the number of attempts and the maximum delay so a request cannot retry forever. Many SDKs (including the OpenAI and Anthropic clients) have built-in retry with backoff; if you use one, you often get this behavior for free.
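
To make the two caps concrete, here is a minimal sketch of a backoff schedule on its own, separate from any HTTP call. The function name `backoff_delays` and its default values are illustrative assumptions, not FluxRouter recommendations.

```python
import random


def backoff_delays(base=0.5, factor=2.0, max_delay=8.0, max_attempts=5, jitter=0.25):
    """Yield the wait before each retry: exponential, capped, plus jitter."""
    delay = base
    # max_attempts tries means at most max_attempts - 1 waits between them.
    for _ in range(max_attempts - 1):
        yield min(delay, max_delay) + random.uniform(0, jitter)
        delay *= factor


# With jitter disabled the schedule is deterministic.
print(list(backoff_delays(jitter=0)))                 # [0.5, 1.0, 2.0, 4.0]
print(list(backoff_delays(max_delay=1.5, jitter=0)))  # [0.5, 1.0, 1.5, 1.5]
```

The second call shows the delay cap taking effect: once the exponential value passes `max_delay`, every later wait stays at the cap.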

What is idempotency and why does it matter here?

When you retry, you risk sending the same request twice if the first one actually succeeded but the response was lost. Make retries safe at your own layer:

  • Deduplicate at your application. Track which units of work have completed so a retry of an already-finished job does not double-process it.
  • Use a stable key per unit of work. Tie each logical request to an id you generate, and check that id before acting on a response, so a duplicate delivery is a no-op.
  • Make side effects safe to repeat. If a successful call writes to your database or sends a message, guard that write so running it twice does not cause harm.

Designing for idempotency means a retry is always safe, which is what lets you retry confidently on a 429.
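
One way to picture the points above is a small record of completed work, keyed by an id you generate. This is a sketch only: the `CompletedWork` class is hypothetical and keeps its state in memory, whereas a real application would likely need a durable store shared across processes.

```python
import uuid


class CompletedWork:
    """In-memory record of finished units of work (illustrative only)."""

    def __init__(self):
        self._results = {}

    def run_once(self, work_id, action):
        # If this unit of work already finished, return the saved result
        # instead of repeating the side effect.
        if work_id in self._results:
            return self._results[work_id]
        result = action()
        self._results[work_id] = result
        return result


store = CompletedWork()
work_id = str(uuid.uuid4())  # generate once per logical request, reuse on retries
sent = []  # stands in for a side effect such as sending a message

store.run_once(work_id, lambda: sent.append("welcome email") or "sent")
store.run_once(work_id, lambda: sent.append("welcome email") or "sent")  # duplicate: no-op
print(len(sent))  # 1
```

The important design choice is that the id is created before the first attempt and reused for every retry, so a duplicate delivery maps to the same key.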

How do I avoid getting rate limited?

  • Smooth your request rate. Spread bursts out instead of firing everything at once. A steady stream is friendlier than spikes.
  • Limit concurrency. Cap how many requests you have in flight at the same time rather than launching unbounded parallel calls.
  • Back off on the first 429. Treat the first rate-limit response as a signal to slow down, not to retry harder.
  • Batch where you can. Combine work into fewer, larger requests when it makes sense for your use case.
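
As an example of limiting concurrency, the sketch below caps the number of in-flight calls with an `asyncio.Semaphore`. The `run_limited` helper and the `fake_request` stand-in are hypothetical; in real code the job would be an actual API call, and the right limit depends on your account's rate limit.

```python
import asyncio


async def run_limited(jobs, max_in_flight=4):
    """Run coroutine functions with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(job):
        async with semaphore:  # wait here while the limit is reached
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


stats = {"in_flight": 0, "peak": 0}


async def fake_request():
    # Stand-in for a real API call; records how many run concurrently.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return "ok"


results = asyncio.run(run_limited([fake_request] * 20, max_in_flight=4))
print(len(results), stats["peak"])  # 20 4
```

All 20 jobs complete, but never more than four are running at the same moment.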

What about other transient errors?

The same backoff-with-jitter approach applies to other transient failures, such as temporary network errors or a brief upstream hiccup. Retry those a small number of times. For errors that are clearly your fault (for example a 400 from a malformed request), do not retry; fix the request instead. Keep the X-Flux-Request-Id from the response (see Transparency headers) so support can trace any request that keeps failing.
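
A small sketch of that decision might look like the following. The set of status codes treated as transient is an assumption chosen for illustration, and both helper names are hypothetical. Only the X-Flux-Request-Id header name comes from this page.

```python
# Assumption: these statuses are treated as transient. Adjust to your needs.
RETRYABLE_STATUSES = {429, 502, 503, 504}


def should_retry(status_code):
    """Return True for responses worth retrying with backoff."""
    return status_code in RETRYABLE_STATUSES


def describe_failure(status_code, headers):
    """Build a log line that keeps the request id for support."""
    request_id = headers.get("X-Flux-Request-Id", "unknown")
    action = "retry" if should_retry(status_code) else "do not retry"
    return f"status={status_code} request_id={request_id} action={action}"


print(describe_failure(400, {"X-Flux-Request-Id": "abc"}))
# status=400 request_id=abc action=do not retry
```

The plain dict lookup here is case-sensitive; header objects from HTTP libraries such as httpx match names case-insensitively.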