Run batch and high-volume jobs on Flux
Run high-volume, cost-sensitive batch work through FluxRouter. Use flux-auto or pin a cheap lane, watch your spend ceiling, and handle concurrency, retries, and idempotency.
Batch work (classifying thousands of records, summarizing a backlog, tagging or enriching rows, generating embeddings-adjacent text) is high-volume and cost-sensitive. The cost of every request is multiplied by your row count, so the per-request model choice matters more here than anywhere else. FluxRouter is OpenAI-compatible, so you send the same chat-completions requests in a loop with three settings: base URL https://api.fluxrouter.ai/v1, your Flux key (sk-...), and model flux-auto.
Choosing a model for batch work
You have two sensible options:
- Use flux-auto when the items vary in difficulty. FluxRouter right-sizes each one, so easy rows land on a cheap model and only the hard rows reach a stronger one. The batch pays a blended rate.
- Pin a cheap lane when the work is uniformly simple and you want predictable per-row cost. Set the model to flux-fast to keep every request on a lightweight, low-cost model. This is the cheapest and most deterministic option for repetitive, simple jobs.
Pinning a single model also makes cost estimation trivial: you know the per-token rate up front and multiply by your token volume. See Models for the aliases, and Routing and pricing for the lane rates (Express Lane starts at $1 / 1M tokens).
Minimal batch loop
A batch job is the same single request, run over many inputs, with bounded concurrency.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key="sk-...",  # your Flux key
    base_url="https://api.fluxrouter.ai/v1",  # the one line you change
)

# Cap concurrency so you don't hammer the API or your own budget.
semaphore = asyncio.Semaphore(10)


async def classify(item):
    async with semaphore:
        response = await client.chat.completions.create(
            model="flux-fast",  # pin a cheap lane for uniform, simple work
            messages=[
                {"role": "system", "content": "Classify the sentiment as positive, negative, or neutral."},
                {"role": "user", "content": item["text"]},
            ],
        )
    return {"id": item["id"], "label": response.choices[0].message.content.strip()}


async def main(items):
    results = await asyncio.gather(*(classify(i) for i in items))
    return results
Practical tips
Bound your concurrency
Fire requests concurrently to get throughput, but cap how many are in flight at once (the semaphore above). Unbounded concurrency on a large batch spikes your spend and is more likely to hit rate limits. Start conservative and raise the limit only if you are not seeing 429s.
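To make the limit easy to tune between runs, the semaphore can be wrapped in a small helper. This is a minimal sketch, not part of FluxRouter or the openai SDK: run_bounded and its worker and limit parameters are hypothetical names, and worker stands for any coroutine function such as classify.

```python
import asyncio


async def run_bounded(items, worker, limit=10):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # return_exceptions=True keeps one failed row from discarding the results
    # of the others; failures come back in place as exception objects.
    return await asyncio.gather(
        *(guarded(item) for item in items), return_exceptions=True
    )
```

Results come back in input order, so a caller can pair each entry with its item and re-queue the ones that are exceptions. Raising limit is then a one-argument change once a run completes without 429s.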
Retry with backoff on 429
A high-volume job will occasionally hit rate limits (429) or transient errors. Do not drop those items. Retry them with exponential backoff and a cap on attempts:
import asyncio
import random


async def with_retries(coro_factory, max_attempts=5):
    """Await coro_factory(), retrying rate limits and transient 5xx errors."""
    for attempt in range(max_attempts):
        try:
            # Call the factory on every attempt to get a fresh coroutine.
            return await coro_factory()
        except Exception as e:
            status = getattr(e, "status_code", None)
            # Retry on rate limits and transient 5xx; re-raise anything else.
            if status not in (429, 500, 502, 503) or attempt == max_attempts - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter.
            backoff = (2 ** attempt) + random.random()
            await asyncio.sleep(backoff)
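Wrap each request in with_retries(...), passing a zero-argument callable that creates the request rather than an already-created coroutine, because a coroutine object cannot be awaited a second time. The jitter (random.random()) keeps retries from synchronizing into a thundering herd.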
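As an illustration of the wrapping step, the sketch below retries a stubbed request that fails twice with a 429 before succeeding. Everything in it is an assumption made for the demo: FakeRateLimit and make_flaky_classify are hypothetical stand-ins for the SDK error and the real classify call, and with_retries is repeated with a hypothetical base_delay parameter so the block runs on its own and finishes quickly.

```python
import asyncio
import random


class FakeRateLimit(Exception):
    """Stand-in for an SDK error; only the status_code attribute matters."""
    status_code = 429


async def with_retries(coro_factory, max_attempts=5, base_delay=1.0):
    # Same logic as the helper above; base_delay is a hypothetical knob
    # added here so the demo does not wait whole seconds between attempts.
    for attempt in range(max_attempts):
        try:
            return await coro_factory()
        except Exception as e:
            status = getattr(e, "status_code", None)
            if status not in (429, 500, 502, 503) or attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt + random.random()))


def make_flaky_classify(failures):
    """Return a stub request that raises a 429 `failures` times, then succeeds."""
    calls = {"count": 0}

    async def classify(item):
        calls["count"] += 1
        if calls["count"] <= failures:
            raise FakeRateLimit("rate limited")
        return {"id": item["id"], "label": "neutral"}

    return classify, calls


async def demo():
    classify, calls = make_flaky_classify(failures=2)
    item = {"id": 1, "text": "It was fine."}
    # Pass a callable, not a coroutine: each attempt needs a fresh coroutine.
    result = await with_retries(lambda: classify(item), base_delay=0.01)
    return result, calls["count"]
```

Running asyncio.run(demo()) makes three calls to the stub: two that raise and one that returns the label.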
Make retries idempotent at your layer
LLM calls are not naturally idempotent: retrying re-runs the model and may produce different text. Track which input ids you have already completed (a set, a database column, a results file) and skip them on a re-run. That way a crashed or restarted batch resumes instead of double-processing rows and double-paying for them.
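One way to track completed ids is an append-only results file with one JSON object per line. The sketch below assumes that layout; the helper names (load_completed_ids, append_result, pending) and the file format are illustrative choices, not something FluxRouter prescribes.

```python
import json
import os


def load_completed_ids(path):
    """Return the set of ids already written to the results file."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                # A line truncated by a crash counts as not completed.
                continue
    return done


def append_result(path, result):
    """Append one finished row as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")


def pending(items, done):
    """Keep only the items whose id has not been completed yet."""
    return [item for item in items if item["id"] not in done]
```

At startup, call load_completed_ids once, send only pending(items, done) to the model, and call append_result as each row finishes. A restarted job then pays only for the rows that never produced a result.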
Watch your spend ceiling
Batch jobs are where surprise bills happen, because one mistake is multiplied by every row. Before a large run, estimate cost: rows times tokens-per-row times the lane rate. During the run, watch usage so a misconfigured prompt or a runaway loop does not blow past your budget. Plans set a monthly spend ceiling; pay-as-you-go bills usage directly. See Routing and pricing for how usage is priced. Billing and spend-limit details will live under Billing.
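The estimate is simple arithmetic and worth scripting before a large run. The sketch below assumes a single flat rate per million tokens covering both input and output, which is a simplification; the row count and tokens-per-row figures are made up for illustration, and the $1 rate is the Express Lane starting rate quoted on this page.

```python
def estimate_batch_cost(rows, tokens_per_row, usd_per_million_tokens):
    """Rough cost in USD: rows x tokens-per-row x lane rate."""
    total_tokens = rows * tokens_per_row
    return total_tokens / 1_000_000 * usd_per_million_tokens


def fits_budget(estimated_usd, budget_usd, safety_margin=0.2):
    """True if the estimate plus a safety margin stays within the budget."""
    return estimated_usd * (1 + safety_margin) <= budget_usd


# Illustrative numbers: 200,000 rows at about 400 tokens each, $1 / 1M tokens.
estimate = estimate_batch_cost(200_000, 400, 1.0)  # 80M tokens -> 80.0 USD
```

With these numbers, fits_budget(estimate, 100) is true and fits_budget(estimate, 90) is false, because the 20% margin brings the figure to $96. The safety_margin parameter is a hypothetical cushion for prompts that run longer than the average row.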
Confirm the model and cost per request
Each response carries X-Flux-Model (what served the request) and, on non-streaming responses, X-Flux-Cost-Usd (what that request cost). Logging these during a batch lets you verify you are actually landing on the cheap lane and tally exact spend. See Routing and pricing.
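A small tally over those headers could look like the sketch below. It takes any mapping of header names to values, so it does not depend on a particular client; with the openai Python SDK, the headers should be reachable through the with_raw_response accessor (for example raw = await client.chat.completions.with_raw_response.create(...), then raw.headers and raw.parse()). The function name, the tally layout, and the assumption that X-Flux-Cost-Usd is a plain decimal string are all illustrative.

```python
from decimal import Decimal


def new_tally():
    """Empty running totals for one batch run."""
    return {"requests_by_model": {}, "total_usd": Decimal("0")}


def record_flux_headers(headers, tally):
    """Count which model served a request and add its cost to the total."""
    # Header names are case-insensitive, so normalize before looking up.
    normalized = {name.lower(): value for name, value in headers.items()}
    model = normalized.get("x-flux-model", "unknown")
    by_model = tally["requests_by_model"]
    by_model[model] = by_model.get(model, 0) + 1
    # The cost header is absent on streaming responses, so it is optional.
    cost = normalized.get("x-flux-cost-usd")
    if cost is not None:
        # Assumption: the value parses as a decimal number, e.g. "0.000120".
        tally["total_usd"] += Decimal(cost)
    return tally
```

Decimal avoids the rounding drift that float addition accumulates over thousands of very small amounts. If requests_by_model shows anything other than the lane you pinned, stop the run and check the model setting.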