Run batch and high-volume jobs on Flux

Run high-volume, cost-sensitive batch work through FluxRouter. Use flux-auto or pin a cheap lane, watch your spend ceiling, and handle concurrency, retries, and idempotency.

Batch work (classifying thousands of records, summarizing a backlog, tagging or enriching rows, generating embeddings-adjacent text) is high-volume and cost-sensitive. The cost of each request is multiplied by your row count, so the per-request model choice matters more here than anywhere else. FluxRouter is OpenAI-compatible, so you send the same chat-completions requests in a loop: set the base URL to https://api.fluxrouter.ai/v1, authenticate with your Flux key (sk-...), and set the model to flux-auto.

Choosing a model for batch work

You have two sensible options:

  • Use flux-auto when the items vary in difficulty. FluxRouter right-sizes each one, so easy rows land on a cheap model and only the hard rows reach a stronger one. The batch pays a blended rate.
  • Pin a cheap lane when the work is uniformly simple and you want predictable per-row cost. Set the model to flux-fast to keep every request on a lightweight, low-cost model. This is the cheapest and most predictable option for repetitive, simple jobs.

Pinning a single model also makes cost estimation trivial: you know the per-token rate up front and multiply by your token volume. See Models for the aliases, and Routing and pricing for the lane rates (Express Lane starts at $1 / 1M tokens).

Minimal batch loop

A batch job is the same single request, run over many inputs, with bounded concurrency.

python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="sk-...",                          # your Flux key
    base_url="https://api.fluxrouter.ai/v1",   # the one line you change
)

# Cap concurrency so you don't hammer the API or your own budget.
semaphore = asyncio.Semaphore(10)

async def classify(item):
    async with semaphore:
        response = await client.chat.completions.create(
            model="flux-fast",  # pin a cheap lane for uniform, simple work
            messages=[
                {"role": "system", "content": "Classify the sentiment as positive, negative, or neutral."},
                {"role": "user", "content": item["text"]},
            ],
        )
        return {"id": item["id"], "label": response.choices[0].message.content.strip()}

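async def main(items):
    # gather returns results in input order, so results[i] matches items[i].
    return await asyncio.gather(*(classify(i) for i in items))

if __name__ == "__main__":
    # Illustrative sample rows; replace with your own data source.
    sample = [{"id": 1, "text": "I love this."}, {"id": 2, "text": "Not great."}]
    print(asyncio.run(main(sample)))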

Practical tips

Bound your concurrency

Fire requests concurrently to get throughput, but cap how many are in flight at once (the semaphore above). Unbounded concurrency on a large batch spikes your spend and is more likely to hit rate limits. Start conservative and raise the limit only if you are not seeing 429s.

Retry with backoff on 429

A high-volume job will occasionally hit rate limits (429) or transient errors. Do not drop those items. Retry them with exponential backoff and a cap on attempts:

python
import asyncio
import random

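async def with_retries(coro_factory, max_attempts=5):
    # coro_factory is a zero-argument callable that returns a fresh awaitable
    # on every call; a coroutine object itself can only be awaited once.
    for attempt in range(max_attempts):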
        try:
            return await coro_factory()
        except Exception as e:
            status = getattr(e, "status_code", None)
            # Retry on rate limits and transient 5xx; re-raise anything else.
            if status not in (429, 500, 502, 503) or attempt == max_attempts - 1:
                raise
            backoff = (2 ** attempt) + random.random()
            await asyncio.sleep(backoff)

Wrap each request in with_retries(...). The jitter (random.random()) keeps retries from synchronizing into a thundering herd.
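The wrapping step can be sketched as follows. This is a minimal illustration, not a definitive implementation: FakeRateLimit and flaky_classify are hypothetical stand-ins for an SDK error and an API call, and the helper is a compact variant of the one above with an assumed base_delay parameter so the demo finishes quickly. The point it shows is that the helper receives a callable, not a coroutine, so each attempt starts a fresh request.

```python
import asyncio
import random

class FakeRateLimit(Exception):
    """Hypothetical stand-in for an SDK error that exposes status_code."""
    status_code = 429

async def with_retries(coro_factory, max_attempts=5, base_delay=1.0):
    # Compact variant of the retry helper; base_delay is an added assumption.
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception as e:
            status = getattr(e, "status_code", None)
            if status not in (429, 500, 502, 503) or attempt == max_attempts - 1:
                raise
            # Exponential backoff plus jitter, scaled by base_delay.
            await asyncio.sleep(base_delay * (2 ** attempt + random.random()))

calls = {"count": 0}

async def flaky_classify(text):
    # Hypothetical call: fails twice with a 429, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimit()
    return f"label-for:{text}"

async def demo():
    # The lambda builds a new coroutine on every attempt.
    return await with_retries(lambda: flaky_classify("great product"), base_delay=0.01)

result = asyncio.run(demo())
```

With the real client, the same shape would be with_retries(lambda: client.chat.completions.create(...)), placed inside the semaphore block so retries also count against the concurrency cap.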

Make retries idempotent at your layer

LLM calls are not naturally idempotent: retrying re-runs the model and may produce different text. Track which input ids you have already completed (a set, a database column, a results file) and skip them on a re-run. That way a crashed or restarted batch resumes instead of double-processing rows and double-paying for them.
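One way to sketch the resume logic is an append-only results file that doubles as the record of completed ids. The file layout (one JSON object per line), the function names, and the synchronous process callback are all assumptions made for illustration; a database column or any other durable store works on the same principle.

```python
import json
import os

def load_done_ids(path):
    """Return the set of ids already present in the results file."""
    done = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    done.add(json.loads(line)["id"])
    return done

def run_batch(items, path, process):
    """Process only items whose id is not yet recorded; return how many ran."""
    done = load_done_ids(path)
    processed = 0
    with open(path, "a", encoding="utf-8") as out:
        for item in items:
            if item["id"] in done:
                continue  # finished in an earlier run: skip it, don't pay again
            label = process(item)  # hypothetical callback that calls the model
            out.write(json.dumps({"id": item["id"], "label": label}) + "\n")
            out.flush()  # hand each finished row to the OS before moving on
            processed += 1
    return processed
```

Calling run_batch a second time with the same items and path processes nothing, and calling it with extra items processes only the new ones.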

Watch your spend ceiling

Batch jobs are where surprise bills happen, because one mistake is multiplied by every row. Before a large run, estimate cost: rows times tokens-per-row times the lane rate. During the run, watch usage so a misconfigured prompt or a runaway loop does not blow past your budget. Plans set a monthly spend ceiling; pay-as-you-go bills usage directly. See Routing and pricing for how usage is priced. Billing and spend-limit details will live under Billing.
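The pre-run estimate is simple arithmetic, sketched here with assumed inputs: 200,000 rows, 350 tokens per row (prompt plus completion), and a flat rate of $1 per 1M tokens, matching the Express Lane starting price. The row and token counts are illustrative, and the sketch assumes one blended rate for input and output tokens; check Routing and pricing for the actual rates.

```python
def estimate_cost_usd(rows, tokens_per_row, usd_per_million_tokens):
    """Estimate batch cost as rows x tokens-per-row x lane rate."""
    total_tokens = rows * tokens_per_row
    return total_tokens * usd_per_million_tokens / 1_000_000

# 200,000 rows x 350 tokens = 70,000,000 tokens; at $1 / 1M tokens that is $70.
estimate = estimate_cost_usd(rows=200_000, tokens_per_row=350, usd_per_million_tokens=1.0)
print(f"Estimated cost: ${estimate:.2f}")
```

Comparing this figure against your budget before launching makes a mis-sized prompt visible while it is still cheap to fix.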

Confirm the model and cost per request

Each response carries X-Flux-Model (what served the request) and, on non-streaming responses, X-Flux-Cost-Usd (what that request cost). Logging these during a batch lets you verify you are actually landing on the cheap lane and tally exact spend. See Routing and pricing.
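A running tally of these headers might look like the sketch below. FluxTally and the model names are hypothetical, and the code assumes X-Flux-Cost-Usd arrives as a plain decimal string. It takes any mapping of header names to values; with the OpenAI Python SDK, the with_raw_response variant of a call is one way to reach the response headers.

```python
from collections import Counter
from decimal import Decimal

class FluxTally:
    """Accumulates per-model request counts and total reported cost."""

    def __init__(self):
        self.requests_by_model = Counter()
        self.total_usd = Decimal("0")

    def record(self, headers):
        # headers: a mapping of response header names to string values.
        model = headers.get("X-Flux-Model")
        if model is not None:
            self.requests_by_model[model] += 1
        cost = headers.get("X-Flux-Cost-Usd")
        if cost is not None:  # the cost header is absent on streaming responses
            self.total_usd += Decimal(cost)  # assumes a plain decimal string

tally = FluxTally()
tally.record({"X-Flux-Model": "example-small-model", "X-Flux-Cost-Usd": "0.00012"})
tally.record({"X-Flux-Model": "example-small-model", "X-Flux-Cost-Usd": "0.00015"})
tally.record({"X-Flux-Model": "example-large-model"})  # e.g. a streaming response
```

Decimal avoids floating-point drift when summing many small amounts. If requests_by_model shows a stronger model than expected on a pinned run, that is a signal to check the model name being sent.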

Next steps