Context windows and long inputs
Why models differ in how much input they accept, how flux-auto considers fit, and how to handle very long prompts including pinning a long-context model.
A context window is the maximum amount of text a model can take in and reason over at once, counted in tokens. It covers everything you send (system prompt, conversation history, documents) plus the room the model needs to generate its reply. This page explains how that affects routing and what to do with very long inputs.
What is a context window?
Every model has a fixed context window. Some models accept only a few thousand tokens; others accept hundreds of thousands. A token is roughly three quarters of a word in English, so a long document can be tens of thousands of tokens on its own. If your input plus the expected output exceeds a model's window, that model cannot serve the request.
Does flux-auto consider context size?
Yes. When you send flux-auto, Flux considers whether a model can fit your request when it routes. You do not have to know each model's limit to get a working result for a reasonably sized prompt. Send the request and read the transparency headers to see which model answered.
How do I handle very long inputs?
If you are sending large documents or long histories, a few practices help:
- Trim what you send. Remove parts of the conversation or document that are not needed for the answer. Smaller inputs are cheaper and route to more models.
- Summarize earlier turns. For long chats, replace old turns with a short summary instead of resending everything verbatim.
- Split the work. Break a very large job into chunks, process each, then combine the results. This keeps each request inside a comfortable window.
- Pin a long-context model. When you genuinely need to send a very large input in one request, pin a model with a large context window using a
flux-pinned-*alias, so you are guaranteed a model that can hold it. See flux-auto vs pinning a model and the per-model limits in Models.
How do I know a model's context limit?
The Models page lists each model and its context window. Use it to choose a flux-pinned-* id when you have a hard requirement for a specific size. The authoritative list of aliases is always GET /v1/models.
What happens if my input is too long?
If a request is larger than a model can accept, that request will not succeed against that model. The fix is to reduce the input using the practices above, or pin a model with a large enough window. Trimming and summarizing also lower your token cost, since you pay per token.