Calculating GPU memory for serving LLMs
If you're planning to self-host an LLM, one of the first things you'll need to figure out is how much GPU memory (VRAM) it requires. This depends mainly on the model’s size and the precision used during inference.
- Model size (number of parameters). Larger models need more memory. Models with tens or hundreds of billions of parameters usually require high-end GPUs like NVIDIA H100 or H200.
- Bit precision. The precision used (e.g., FP16, FP8, INT8) affects memory consumption. Lower-precision formats can significantly reduce the memory footprint, but may reduce accuracy. See LLM quantization for details.
A rough formula to estimate how much memory is needed to load an LLM is:
Memory (GB) = P × (Q / 8) × (1 + Overhead)
- P: Number of parameters (in billions)
- Q: Bit precision (e.g., 16 for FP16, 8 for FP8/INT8); dividing by 8 converts bits to bytes
- Overhead (%): Additional temporary memory used during inference (e.g., KV cache, activation buffers, framework workspace). A typical range is 10–30%.
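The formula above can be sketched as a small helper (a minimal sketch of the same estimate, not a substitute for benchmarking):

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 0.2) -> float:
    """Estimate the VRAM (GB) needed to load and run an LLM.

    params_billion: parameter count in billions (P)
    bits: precision in bits (Q); dividing by 8 converts bits to bytes
    overhead: fractional headroom for KV cache, activations, and buffers
    """
    return params_billion * (bits / 8) * (1 + overhead)

# An 8B-parameter model at FP16 with 20% overhead:
print(estimate_vram_gb(8, 16))  # 19.2
```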
Worked example: an 8B-parameter model served at FP16 (Q = 16) with 20% overhead:

Memory (GB) = P × (Q / 8) × (1 + Overhead)
= 8 × (16 / 8) × (1 + 0.2) = 19.20 GB

So roughly 19.2 GB of VRAM is required, which fits on a single 24 GB GPU such as an NVIDIA L4 or A10G.
This estimate covers model weights plus the specified overhead. Actual usage varies with batch size, sequence length, and inference framework. Always benchmark on your target hardware.
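To see why batch size and sequence length matter, the KV cache portion of the overhead can be sketched separately. The formula below is the standard per-token KV accounting (two tensors, K and V, per layer); the layer/head/dimension numbers in the example are illustrative assumptions for an 8B-class model, not any specific checkpoint:

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """Rough KV cache size: 2 tensors (K and V) per layer, each holding
    batch_size * num_kv_heads * seq_len * head_dim elements."""
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / 1e9

# Illustrative 8B-class config (assumed values): 32 layers, 8 KV heads,
# head_dim 128, FP16 (2 bytes/element), 4096-token context, batch of 4:
print(round(kv_cache_gb(32, 8, 128, 4096, 4), 2))  # 2.15
```

Doubling either the batch size or the context length doubles this figure, which is why long-context, high-concurrency serving can blow past a flat overhead percentage.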
Note: Not all GPUs support all precision formats natively. A100 and other Ampere GPUs support INT8 but do not support FP8 in hardware; native FP8 requires Hopper, Ada, or newer architectures. If your inference stack relies on FP8 kernels, make sure your GPU supports them. Some 4-bit models use INT4 quantization, while native FP4 support relies on newer architectures and software stacks.
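As a sketch, the FP8 check above can be gated on CUDA compute capability: Ampere is sm_80/sm_86, Ada Lovelace is sm_89, and Hopper is sm_90, so the sm_89 threshold used here is the assumption. At runtime you can obtain the `(major, minor)` pair from `torch.cuda.get_device_capability()`:

```python
def supports_native_fp8(major: int, minor: int) -> bool:
    """True if a CUDA compute capability implies hardware FP8 support
    (Ada Lovelace sm_89, Hopper sm_90, and newer)."""
    return (major, minor) >= (8, 9)

# A100 (Ampere, sm_80) lacks native FP8; H100 (Hopper, sm_90) has it:
print(supports_native_fp8(8, 0))  # False
print(supports_native_fp8(9, 0))  # True
```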