Inference optimization
📄️ Key metrics for LLM inference
Measure key metrics like latency and throughput to optimize LLM inference performance.
📄️ Static, dynamic and continuous batching
Optimize LLM inference with static, dynamic, and continuous batching for better GPU utilization.
📄️ PagedAttention
Improve LLM memory efficiency with block-based KV cache storage via PagedAttention.
📄️ Speculative decoding
Speculative decoding accelerates LLM inference by having a draft model propose tokens that the target model then verifies.
📄️ Prefill-decode disaggregation
Disaggregate prefill and decode for better parallel execution, resource allocation, and scaling.
📄️ KV cache utilization-aware load balancing
Route LLM requests based on KV cache utilization to increase cache hits and reduce latency.
📄️ Prefix caching
Prefix caching speeds up LLM inference by reusing the KV cache of shared prompt prefixes across requests.
📄️ Data, tensor, pipeline, expert and hybrid parallelisms
Understand the differences between data, tensor, pipeline, expert and hybrid parallelisms.
📄️ Offline batch inference
Run predictions at scale with offline batch inference for efficient, non-real-time processing.