Bento Inference Platform
Full control without the complexity. Self-host anywhere. Serve any model. Optimize for performance.
BentoML Open-Source
The most flexible way to serve AI/ML models and custom inference pipelines in production
Expert how-tos, deep-dive guides, and real-world stories from the Bento team to help you build and scale AI at blazing speed.
Benchmark and optimize LLM inference performance with SLO constraints across frameworks like vLLM and SGLang.
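As a rough illustration of what such a benchmark measures, the sketch below times time-to-first-token (TTFT) and end-to-end latency against an OpenAI-compatible streaming endpoint, which both vLLM and SGLang can expose. It is not the article's benchmark harness; the endpoint URL and model name are placeholders.

```python
# Minimal latency sketch (illustrative only): measure TTFT and total latency
# for one streaming request to an OpenAI-compatible server (e.g. vLLM or SGLang).
import json
import time

import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder: your local server
MODEL = "my-model"  # placeholder model name


def measure_request(prompt: str) -> dict:
    """Send one streaming chat request and record TTFT and end-to-end latency."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": 256,
    }
    start = time.perf_counter()
    ttft = None
    chunks = 0
    with requests.post(ENDPOINT, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # OpenAI-compatible servers stream Server-Sent Events: "data: {...}"
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            if chunk["choices"][0]["delta"].get("content"):
                if ttft is None:
                    ttft = time.perf_counter() - start
                chunks += 1
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "chunks": chunks}


if __name__ == "__main__":
    result = measure_request("Summarize the benefits of batched LLM inference.")
    print(result)  # compare ttft_s / total_s against your SLO targets
```

A real benchmark would sweep concurrency levels and request rates, then check the resulting latency percentiles against the SLOs; this sketch only shows the per-request measurement.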