Batch Inference
A processing mode where you submit a batch of requests and the provider returns results over minutes or hours instead of seconds. In exchange, you pay fifty percent less per token. Anthropic and OpenAI both offer it. For any work that is not urgent, batch is the cheaper default.
Batch inference is the AI equivalent of overnight mail. You send a file of requests, the provider processes them when capacity is free, and you get results back within twenty-four hours at half the price. For any workload that does not need to respond to a live user, this is found money.
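To make that concrete, here is a minimal sketch of the submission side using the Message Batches endpoint in Anthropic's Python SDK. The model name, prompts, and custom IDs are placeholders, not anything this site actually runs.

```python
# Minimal sketch: submit a batch of requests through the Anthropic Python SDK.
# Assumes ANTHROPIC_API_KEY is set in the environment; the model, prompts, and
# custom_ids below are placeholders.
import anthropic

client = anthropic.Anthropic()

# Each request carries a custom_id so results can be matched back to inputs
# when they come back, in any order, within the 24-hour window.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-3-5-haiku-20241022",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": f"Summarize document {i}."}
                ],
            },
        }
        for i in range(100)
    ]
)

print(batch.id, batch.processing_status)  # e.g. "in_progress"
```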
The Simple Version
If you can wait, you save fifty percent. If you cannot wait, you pay full price. Simple tradeoff. Most enterprise workloads can wait. Most consumer workloads cannot.
Why It Matters
Operators building internal tools, analytics pipelines, content generation queues, or any workflow where the user is not staring at a loading spinner should default to batch. Billing drops by half, capacity is easier to come by (batch queues run under separate, looser limits than the synchronous API), and the engineering overhead is minimal. Using synchronous inference for async workloads is one of the most common ways AI budgets get wasted.
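The claim that the engineering overhead is minimal mostly comes down to a poll-and-collect loop. A sketch of the retrieval side, again against Anthropic's Python SDK, with a hypothetical batch ID and an arbitrary one-minute poll interval:

```python
# Minimal sketch: wait for a batch to finish, then collect results.
# batch_id is hypothetical; it comes from the submission step above.
import time

import anthropic

client = anthropic.Anthropic()
batch_id = "msgbatch_..."  # placeholder

# Poll until processing ends. The work is non-urgent by definition,
# so a one-minute sleep is plenty.
while client.messages.batches.retrieve(batch_id).processing_status != "ended":
    time.sleep(60)

# results() streams one entry per request, keyed by the custom_id you supplied.
for entry in client.messages.batches.results(batch_id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text)
    else:
        print(entry.custom_id, "did not succeed:", entry.result.type)
```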
How It’s Used on This Site
TWO’s digest pipeline runs synchronously because the approval flow is live. The Scrolls research phase, which runs when a Scroll is commissioned, will use batch; the research is not user-facing, so nobody is waiting on the results. Half the cost, same output.