
Running large reasoning models no longer means relying on third-party APIs. With open-source models like DeepSeek-R1 and gpt-oss, you can now self-host powerful reasoning models and build your private inference API.
Unlike closed-source APIs that operate as black boxes, open-source models let you customize inference logic, serve them efficiently with frameworks like vLLM, and apply advanced techniques such as prefill–decode disaggregation for optimized performance. The result: more control, flexibility, and lower cost.
In this post, you’ll learn how to self-host gpt-oss using vLLM and BentoML. We will deploy it to BentoCloud, our fully managed inference platform with fast autoscaling, LLM-specific observability, and production-grade security built in.
vLLM is a fast and efficient open-source library designed for LLM inference and serving. Developed by researchers at UC Berkeley, vLLM stands out for its high-performance capabilities in handling LLMs. It features advanced techniques such as continuous batching, speculative decoding, disaggregated prefilling, and automatic prefix caching.
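To give a quick taste of the engine before wiring it into BentoML, here is a minimal sketch of vLLM's offline Python API. It uses the small facebook/opt-125m model purely as a stand-in; the rest of this post serves gpt-oss through the vllm serve CLI instead.

from vllm import LLM, SamplingParams

# Load a small demo model with automatic prefix caching enabled.
# Swap in any vLLM-supported model ID on a capable GPU.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)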
gpt-oss is an open-source reasoning model developed by OpenAI, available in two main variants: gpt-oss-120b and gpt-oss-20b. Trained with reinforcement learning and insights from OpenAI’s frontier models such as o3, gpt-oss performs well on complex reasoning tasks while remaining practical for general use cases.
I suggest you create a virtual environment to keep your dependencies organized:
python -m venv vllm-gpt-oss
source vllm-gpt-oss/bin/activate
Next, clone the BentoVLLM repo and install the dependencies:
git clone https://github.com/bentoml/BentoVLLM.git
cd BentoVLLM/gpt-oss-20b
pip install -r requirements.txt
Everything you need is in the cloned repo. Before deploying to the cloud, let’s walk through the key parts of the code.
First, specify the model and GPU. In this example, we’re using gpt-oss-20b with a single NVIDIA H100 GPU, but you can switch to any LLM supported by vLLM.
import pydantic
import bentoml

# Use Pydantic to validate data
class BentoArgs(pydantic.BaseModel):
    name: str = 'gpt-oss-20b'
    gpu_type: str = 'nvidia-h100-80gb'  # GPU type on BentoCloud
    tp: int = 1  # One GPU here for tensor parallelism
    model_id: str = 'openai/gpt-oss-20b'
    port: int = 8000
    # Other optional fields omitted for brevity

# Make the args injectable
bento_args = bentoml.use_arguments(BentoArgs)
Here we use template arguments from BentoML to inject dynamic and validated parameters at serve, build, and deploy time. These arguments can be referenced just like normal Python variables.
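For instance (a minimal sketch; extra_cli_args is a hypothetical variable, not part of the repo), the validated fields behave like plain attributes:

# bento_args is a validated BentoArgs instance, so its fields are ordinary attributes.
print(bento_args.name)      # 'gpt-oss-20b'
print(bento_args.gpu_type)  # 'nvidia-h100-80gb'

# --tensor-parallel-size is a standard vLLM CLI flag, used here purely for illustration.
extra_cli_args = ['--tensor-parallel-size', str(bento_args.tp)]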
BentoML allows you to package your code, dependencies, and model references into a unified Bento artifact, which simplifies deployment across different environments.
Here’s how you define the runtime environment for your model using a Bento image:
image = bentoml.images.Image(python_version="3.12") \
    .system_packages("curl", "git") \
    .requirements_file("requirements.txt")
For AMD GPU support, you can customize the image setup as follows:
image.base_image = 'rocm/vllm:rocm6.4.1_vllm_0.10.1_20250909'
# Disable locking of Python packages for AMD GPUs to exclude nvidia-* dependencies
image.lock_python_packages = False
# The GPU device is accessible by group 992
image.run('groupadd -g 992 -o rocm && usermod -aG rocm bentoml && usermod -aG render bentoml')
# Remove the vllm and torch deps to reuse the pre-installed ones in the base image
image.run('uv pip uninstall vllm torch torchvision torchaudio triton')
Set up a vLLM server that listens on port 8000 and serves requests to our model.
class LLM:
    # Download the model weights from HF, skipping large checkpoints
    hf_model = bentoml.models.HuggingFaceModel(
        bento_args.model_id.lower(),
        exclude=[".pth", ".pt", "original/**/*"],
    )

    def __command__(self) -> list[str]:
        return [
            'vllm', 'serve', self.hf_model,
            '--port', str(bento_args.port),
            # ...extra CLI args
            '--served-model-name', bento_args.model_id,
        ]
Inside the class, we use the HuggingFaceModel API to load model weights from Hugging Face. Traditionally, model loading can be time-consuming, especially for large models.
BentoML optimizes this process by preloading models during image build instead of at service startup. The downloaded weights are cached and mounted directly into the container at runtime, which reduces cold start latency and accelerates scaling on BentoCloud. This mechanism ensures faster deployments and smoother autoscaling for large models like gpt-oss. Learn more about LLM cold starts.
Next, use the @bentoml.service decorator to wrap the class, marking it as a BentoML Service. This starts vLLM as the serving backend within the Bento.
@bentoml.service(
    name=bento_args.name,
    image=bento_args.image,
    traffic={'timeout': 300},
    resources={'gpu': bento_args.tp, 'gpu_type': bento_args.gpu_type},
)
class LLM:
    ...
This basic setup is all you need to get your gpt-oss model running with vLLM and BentoML. For more advanced configurations like KV cache tuning, refer to the complete source code. BentoML supports full customization to tailor inference for your use case.
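As one example, you could extend the __command__ method above with standard vLLM CLI flags that govern KV cache and memory behavior. This is a hedged sketch: the flag values below are illustrative, not the repo’s defaults.

def __command__(self) -> list[str]:
    return [
        'vllm', 'serve', self.hf_model,
        '--port', str(bento_args.port),
        '--served-model-name', bento_args.model_id,
        # Illustrative tuning flags (standard vLLM options; adjust for your GPU):
        '--gpu-memory-utilization', '0.90',  # VRAM fraction reserved for weights + KV cache
        '--max-model-len', '32768',          # cap context length to bound KV cache size
        '--kv-cache-dtype', 'auto',          # or 'fp8' on supported hardware to shrink the cache
    ]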
BentoCloud provides fast and scalable infrastructure for building and scaling AI applications.
Install BentoML and log in to BentoCloud from the CLI. If you don’t have an account yet, sign up here for free.
pip install bentoml
bentoml cloud login
Create a secret to store your HF token and reference it when running the deployment command:
bentoml secret create huggingface HF_TOKEN=$HF_TOKEN
bentoml deploy --secret huggingface
Once the Deployment is up and running, go to its details page, which displays the exposed OpenAI-compatible API. As you interact with it, you’ll see real-time inference metrics such as tokens per second and Time to First Token (TTFT).
You can call the endpoint directly with any OpenAI client. Just set the base_url to your BentoCloud Deployment URL.
from openai import OpenAI

client = OpenAI(base_url='your_deployment_endpoint', api_key='na')

# Use the following func to get the available models
# client.models.list()

chat_completion = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {
            "role": "user",
            "content": "Who are you? Please respond in pirate speak!"
        }
    ],
    stream=True,
)

for chunk in chat_completion:
    # Extract and print the content of the model's reply
    print(chunk.choices[0].delta.content or "", end="")
By default, each Deployment runs a single replica. To optimize cost, set minimum replicas to 0 so it scales down during idle time and scales up automatically when new requests arrive.
bentoml deployment update deployment_name --scaling-min 0 --scaling-max 3
If you have a powerful GPU available, you can serve gpt-oss locally using BentoML.
For better performance on NVIDIA GPUs, we recommend installing the FlashInfer library before serving:
pip install flashinfer-python --extra-index-url https://flashinfer.ai/whl/cu124/torch2.6
export HF_TOKEN=<your-api-key>
bentoml serve
The server will start at http://localhost:3000. You can send requests directly to your local vLLM + BentoML service.
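For example, the same OpenAI client pattern shown earlier works against the local server. This sketch assumes the OpenAI-compatible routes are exposed under /v1, as in the BentoVLLM examples:

from openai import OpenAI

# Point the client at the local BentoML server instead of the BentoCloud endpoint.
client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize what vLLM does in two sentences."}],
)
print(response.choices[0].message.content)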
To package your project for deployment or portability, build a Bento and containerize it into a Docker image:
bentoml build
bentoml containerize
This lets you run the same gpt-oss setup on any infrastructure.
To go deeper into LLM inference, model serving, and deployment best practices, check out the BentoML documentation and blog.
BentoCloud is a fully managed inference platform for deploying and scaling AI models in production. It provides features like GPU autoscaling, built-in observability, Sandboxes, and Codespaces for fast and cost-efficient inference.
You can deploy gpt-oss or any other model with one command, without any Kubernetes or infrastructure setup.
BentoCloud is part of the Bento Inference Platform, which also provides on-prem and BYOC (Bring Your Own Cloud) solutions.
BentoCloud is ideal for AI teams that want to prototype quickly without infrastructure overhead. For teams needing more control or strict data governance, contact us to explore Bento On-Prem or BYOC.
vLLM is one of the fastest open-source inference engines for LLMs. It supports continuous batching, prefix caching, and prefill–decode disaggregation, all of which improve throughput and reduce latency when serving gpt-oss at scale.
vLLM is a high-performance inference engine optimized for server-side deployment and large-scale workloads. It supports distributed GPU setups, OpenAI-compatible APIs, and advanced batching techniques.
Ollama, on the other hand, is a local model runner designed for desktops and small servers. It’s great for lightweight experimentation but not ideal for production-scale serving or multi-GPU deployments.
If you need speed, scalability, and control in production environments, vLLM + BentoML is the better choice.
gpt-oss cannot generate images on its own; it is a language model focused on reasoning and text generation, not image synthesis.
If you want to generate images, you can pair gpt-oss with a text-to-image model such as Stable Diffusion or FLUX. BentoML supports deploying these multimodal pipelines together, so you can combine reasoning and generation in one API.
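As a purely illustrative sketch of how two BentoML Services can be composed with bentoml.depends: the class and method names below are hypothetical, and the model calls are stubbed out with placeholders.

import bentoml

@bentoml.service()
class ImageGenerator:
    @bentoml.api
    def txt2img(self, prompt: str) -> str:
        # Placeholder: a real Service would run a diffusion pipeline here.
        return f"<image generated for: {prompt}>"

@bentoml.service()
class ReasonThenDraw:
    # Declare a dependency so both Services deploy and scale together.
    image_generator = bentoml.depends(ImageGenerator)

    @bentoml.api
    def run(self, prompt: str) -> str:
        # Placeholder: a real Service would first call gpt-oss to refine the prompt.
        refined_prompt = prompt
        return self.image_generator.txt2img(refined_prompt)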