September 4, 2024 • Written By Rick Zhou
In our previous benchmarking blog post, we compared the performance of different inference backends using two key metrics: Time to First Token and Token Generation Rate. We intentionally did not tune the inference configurations, such as GPU memory utilization, maximum number of sequences, and paged KV cache block size, to implicitly measure the performance and ease-of-use of each backend, highlighting their practicality in real-world applications.
In this blog post, the BentoML engineering team shifts focus to the impact of performance tuning, specifically examining how tuning inference configurations can significantly enhance the serving performance of large language models (LLMs) using TensorRT-LLM (TRT-LLM). By adjusting key parameters such as the maximum batch size and context chunking, we demonstrate the substantial improvements that can be achieved.
This post serves as a comprehensive guide for optimizing TRT-LLM settings, offering practical insights and detailed steps to help you achieve superior performance. Specifically, it covers the key metrics we measured, the build-time and runtime parameters we tuned, and how we served and benchmarked the tuned model on BentoCloud.
Similar to the previous blog post, we evaluated TensorRT-LLM serving performance with two key metrics: Time to First Token (TTFT), the time from when a request is sent until the first output token is received, and Token Generation Rate, the number of output tokens produced per second once generation has started.
We compared the performance of TRT-LLM serving Llama-3 8B before and after configuration tuning.
The hardware specifications are as follows:
Here are the performance comparisons before and after tuning:
Now, let's explore how we enhanced performance by tuning various build-time and runtime parameters to maximize efficiency and throughput.
Build-time parameters
| Name | Fine-tuned Value | Notes |
| --- | --- | --- |
| max_batch_size | 2048 | Maximum number of input sequences to pass through the engine concurrently. |
| max_num_tokens | 2048 | Maximum number of input tokens to be processed concurrently in one pass. |
| gemm_plugin | bfloat16 | Use NVIDIA cuBLASLt to perform GEMM (General Matrix Multiply) operations. |
| multiple_profiles | enable | Enables multiple TensorRT optimization profiles. |
- max_batch_size and max_num_tokens: Before this commit, the default values for max_batch_size and max_num_tokens were set to 1 and unspecified, respectively, which significantly underutilized the engine's ability to process requests in parallel. After the commit, the defaults were updated to max_batch_size = 256 and max_num_tokens = 8192. Our experiments revealed that setting max_batch_size to a relatively large value, such as 2048, maximizes throughput by fully leveraging in-flight sequence batching, while max_num_tokens should be limited to 2048 to keep GPU memory usage within bounds.
- gemm_plugin: The GEMM plugin uses NVIDIA cuBLASLt to perform GEMM (General Matrix Multiply) operations. For FP16 and BF16, we recommend enabling it for better performance and lower GPU memory usage; for FP8, we recommend disabling it.
- multiple_profiles: This option enables multiple TensorRT optimization profiles in the built engines. More optimization profiles give TensorRT greater opportunity to select the most efficient kernels, improving performance at the cost of longer engine build time. An example build command with these settings is shown after this list.
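To make these settings concrete, here is an illustrative `trtllm-build` invocation that applies the fine-tuned values above. The checkpoint and output paths are placeholders, and the flag names should be checked against `trtllm-build --help` for the TensorRT-LLM version you are using.

```bash
# Illustrative engine build with the tuned values; paths are placeholders.
trtllm-build \
    --checkpoint_dir ./llama-3-8b-ckpt \
    --output_dir ./llama-3-8b-engine \
    --gemm_plugin bfloat16 \
    --max_batch_size 2048 \
    --max_num_tokens 2048 \
    --multiple_profiles enable
```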
Run-time parameters

| Name | Fine-tuned Value | Notes |
| --- | --- | --- |
| enable_chunked_context | True | Turn on context chunking for higher throughput. |
| postprocessing_instance_count | 8 | Allow more concurrency in the de-tokenization process. |
- enable_chunked_context: In the original setup, all prefill tokens (i.e., prompt tokens) were processed in one go. With this feature, the context is divided into several smaller chunks, which allows more tokens to be batched together during the generation phase and is expected to increase overall throughput. Additionally, chunking the context removes limitations on input length. A configuration sketch is shown after this list.
- postprocessing_instance_count: Setting postprocessing_instance_count to a value greater than 1 enables concurrent de-tokenization, which speeds up the conversion of output tokens to output strings.
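As a rough sketch, these runtime values are applied to the Triton model repository, for example with the `fill_template.py` helper from the tensorrtllm_backend project. The variable names below follow the inflight_batcher_llm templates at the time of writing; treat the exact keys and paths as assumptions to verify against your copy of the templates, and note that other template variables (engine path, tokenizer directory, batching settings) must be filled in as well.

```bash
# Sketch only: fill in the tuned runtime values in the Triton config templates.
# Other required template variables are omitted here for brevity.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "enable_chunked_context:True"
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt \
    "postprocessing_instance_count:8"
```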
In our tests, we served LLMs with TensorRT-LLM on BentoCloud. Refer to the project BentoTRTLLM for detailed instructions.

The main steps include:
TRT-LLM Model Compilation
Start by compiling your TRT-LLM model. This step involves converting your model into a format optimized for TensorRT. Refer to Figure 3 for a visual guide on the compilation process.
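For Llama 3 8B, the compilation flow is typically a two-step process: convert the Hugging Face checkpoint into TRT-LLM format, then build the engine. The script path, model directory, and flags below are a sketch based on the TensorRT-LLM Llama example and should be verified against the version you are using.

```bash
# Step 1 (sketch): convert the Hugging Face checkpoint to TRT-LLM format.
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./Meta-Llama-3-8B-Instruct \
    --output_dir ./llama-3-8b-ckpt \
    --dtype bfloat16

# Step 2: build the engine with trtllm-build, using the tuned flags shown earlier.
trtllm-build --checkpoint_dir ./llama-3-8b-ckpt --output_dir ./llama-3-8b-engine \
    --gemm_plugin bfloat16 --max_batch_size 2048 --max_num_tokens 2048 --multiple_profiles enable
```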
Serving Models with Triton Inference Server
Utilize the trtllm-backend to serve TensorRT-LLM models using the Triton Inference Server. This backend is specifically designed to handle TensorRT-LLM models efficiently, ensuring optimal performance during inference.
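Once the compiled engine and configs are in place in a Triton model repository, starting the server is a single command. A minimal sketch, with the repository path as a placeholder (the tensorrtllm_backend project also provides a launch_triton_server.py helper that wraps this for multi-GPU runs):

```bash
# Serve the prepared TRT-LLM model repository with Triton (path is a placeholder).
tritonserver --model-repository=/path/to/triton_model_repo
```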
Defining and Running a BentoML Service
Define your BentoML Service by specifying the model and the API endpoints. This involves creating a service file where you set up the model, load the compiled TensorRT-LLM model, and define the functions that will handle incoming requests. You can run the BentoML Service locally to test model serving.
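A minimal sketch of what such a Service can look like, assuming a Triton server is already listening on localhost and its ensemble model exposes text_input/text_output fields (these names depend on your model repository; the actual BentoTRTLLM project also manages the Triton process and streams tokens, which is omitted here):

```python
import bentoml
import httpx

# Assumed endpoint and field names; adjust to match your Triton model repository.
TRITON_GENERATE_URL = "http://localhost:8000/v2/models/ensemble/generate"


@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class TRTLLMService:
    """Thin BentoML wrapper that forwards generation requests to Triton."""

    @bentoml.api
    async def generate(self, prompt: str, max_tokens: int = 256) -> str:
        payload = {"text_input": prompt, "max_tokens": max_tokens, "stream": False}
        async with httpx.AsyncClient(timeout=300) as client:
            resp = await client.post(TRITON_GENERATE_URL, json=payload)
            resp.raise_for_status()
            # The ensemble is assumed to return the generated text under "text_output".
            return resp.json()["text_output"]
```

You can then run `bentoml serve .` in the project directory to test the Service locally before deploying.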
Deploying with BentoCloud
Deploy your BentoML Service to BentoCloud. This step involves packaging your Service, configuring deployment settings, and launching it on the BentoCloud platform. BentoCloud provides a scalable and reliable environment to host your model, ensuring it can handle production-level traffic.
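For reference, deploying from the project directory typically comes down to two CLI calls; the API token is a placeholder, and additional deployment options (instance type, scaling) can be set via flags or a config file:

```bash
# Authenticate with BentoCloud, then deploy the Service in the current directory.
bentoml cloud login --api-token <your-api-token>
bentoml deploy .
```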
To accurately assess the performance of LLM backends, we created a custom benchmark script. This script simulates real-world scenarios by varying user loads and sending generation requests under different levels of concurrency.
Our benchmark client can spawn up to the target number of users within 20 seconds, after which it stress tests the LLM backend by sending concurrent generation requests with randomly selected prompts. We tested with 10, 50, and 100 concurrent users to evaluate the system under varying loads.
Each stress test ran for 5 minutes, during which time we collected inference metrics every 5 seconds. This duration was sufficient to observe potential performance degradation, resource utilization bottlenecks, or other issues that might not be evident in shorter tests.
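To illustrate the approach, below is a simplified sketch, not the actual client: the endpoint URL and payload are placeholders, each streamed chunk is naively counted as one token, and the periodic 5-second metric collection is omitted.

```python
# Simplified load-test sketch; endpoint, payload, and token counting are assumptions.
import asyncio
import random
import time

import httpx

ENDPOINT = "http://localhost:3000/generate"  # hypothetical streaming endpoint
PROMPTS = ["Explain KV caching.", "Write a haiku about GPUs."]  # sample prompts
RAMP_UP_S, DURATION_S = 20, 300

results: list[tuple[float, float]] = []  # (ttft_seconds, tokens_per_second)


async def user(user_id: int, num_users: int, stop_at: float) -> None:
    # Stagger start-up so all users are active within RAMP_UP_S seconds.
    await asyncio.sleep(RAMP_UP_S * user_id / num_users)
    async with httpx.AsyncClient(timeout=None) as client:
        while time.time() < stop_at:
            start, first_token_at, n_tokens = time.time(), None, 0
            payload = {"prompt": random.choice(PROMPTS)}
            async with client.stream("POST", ENDPOINT, json=payload) as resp:
                async for _chunk in resp.aiter_text():
                    if first_token_at is None:
                        first_token_at = time.time()  # time to first token
                    n_tokens += 1  # naive: assumes one token per streamed chunk
            end = time.time()
            if first_token_at is not None and end > first_token_at:
                results.append((first_token_at - start, n_tokens / (end - first_token_at)))


async def main(num_users: int) -> None:
    stop_at = time.time() + RAMP_UP_S + DURATION_S
    await asyncio.gather(*(user(i, num_users, stop_at) for i in range(num_users)))
    if results:
        ttfts, rates = zip(*results)
        print(f"users={num_users} mean TTFT={sum(ttfts) / len(ttfts):.3f}s "
              f"mean rate={sum(rates) / len(rates):.1f} tokens/s")


if __name__ == "__main__":
    asyncio.run(main(num_users=10))  # the post also tests 50 and 100 concurrent users
```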
For more information, see the source code of our benchmark client.
Check out the following resources to learn more: