Offline batch inference

Offline batch inference is the process of running models on large, static datasets to generate predictions in batches, rather than one at a time in real-time (online inference). It’s called "offline" because it doesn’t happen interactively; instead, it’s done as a bulk processing job.

By contrast, online inference means that the model only makes predictions on demand, for example, when a client requests a prediction.

Key benefits of offline batch inference:

Precomputing predictions reduces the load on real-time systems
More flexibility to use complex models that would be too slow for real-time inference.
Supports post-processing and validation of predictions before using them in production.

You may want to use offline batch inference in the following cases:

Your data doesn’t change often, so you don’t need real-time predictions.
You have a large dataset to process, and the predictions can be stored and reused later.
Your model is too big or slow for real-time predictions but works fine if run in advance.
You want to validate or review predictions before serving them to users (e.g., for quality or compliance checks).