Latency and timeouts

Why FluxRouter requests feel slow or time out, and how to fix it with client timeout settings, streaming, and retries.

Slow responses and timeout errors are usually client configuration, not the gateway. The fixes are timeouts, streaming, and sensible retries.

Why does my request time out?

Symptom: Your client raises a timeout error before any response arrives.

Cause: The client read timeout is shorter than the time the model needs to generate. Long or reasoning-heavy completions can exceed default timeouts.

Fix:

  • Increase your client read timeout. Many SDKs default to 30 to 60 seconds, which is too short for long generations.

    python
    from openai import OpenAI
    
    client = OpenAI(
        api_key="sk-...",
        base_url="https://api.fluxrouter.ai/v1",
        timeout=120.0,  # seconds
    )
    
    ts
    import OpenAI from "openai";
    
    const client = new OpenAI({
      apiKey: process.env.FLUX_API_KEY,
      baseURL: "https://api.fluxrouter.ai/v1",
      timeout: 120_000, // milliseconds
    });
    
  • Set the timeout to comfortably exceed your longest expected response.

Why does the first token take so long?

Symptom: There is a long pause before any output, especially for large prompts.

Cause: Time to first token grows with prompt size and model. Without streaming, you wait for the entire response before seeing anything.

Fix:

  • Turn on streaming (stream: true). You receive tokens as they are generated, so the perceived latency drops sharply and you avoid hitting a single large-response timeout.
  • Trim unnecessary context from your prompt. Smaller inputs start faster.

Why are responses slower than I expect?

Symptom: Latency is higher than a direct call to a small model.

Cause: Latency tracks the model handling the request and the output length. Longer outputs and heavier models take more time.

Fix:

  • If you need speed over depth, pin a faster tier instead of flux-auto (for example flux-fast). See Models.
  • Lower max_tokens if you do not need long output.
  • Stream so users see progress immediately.

How should I handle retries?

Symptom: Intermittent 429 or 5xx responses interrupt your workflow.

Cause: Transient rate limiting or upstream hiccups.

Fix:

  • Retry transient 429 and 5xx responses with exponential backoff and jitter (for example 1s, 2s, 4s).
  • Do not retry 401 or 402. Those are auth and spend-ceiling errors that retrying will not fix. See the Error reference.
  • Cap retries (3 to 5 attempts) so a hard failure surfaces instead of looping.
  • Keep your retry timeout aligned with your read timeout so a slow-but-succeeding request is not cancelled mid-flight.