Planning your deployment
Before you run an LLM in production, you need to make a few key decisions. These early choices shape your infrastructure needs, your costs, and how well the model performs for your use case.
📄️ Serverless vs. self-hosted LLM inference
Understand the differences between serverless LLM APIs and self-hosted LLM deployments.
📄️ Choosing the right model
Select the right models for your use case.
📄️ Choosing the right GPU
Select the right NVIDIA or AMD GPUs (e.g., L4, A100, H100, B200, MI250X, MI300X, MI350X) for LLM inference.
📄️ Calculating GPU memory for serving LLMs
Learn how to calculate GPU memory for serving LLMs.
📄️ Choosing the right inference framework
Learn what LLM inference frameworks do, why raw model execution is not enough for production, and how to choose the right inference framework for your use case.
📄️ Bring Your Own Cloud (BYOC)
Bring Your Own Cloud (BYOC) is a deployment model where vendors run software in your cloud, combining managed orchestration with complete data control.
📄️ On-prem LLM deployments
On-prem LLMs are large language models deployed within an organization’s own infrastructure, such as private data centers or air-gapped environments. This pattern offers full control over data, models, performance, and cost.