
Running large reasoning models no longer means relying on third-party APIs. With open-source models like DeepSeek-R1 and gpt-oss, you can now self-host powerful reasoning models and build your private inference API.
Unlike closed-source APIs that operate as black boxes, open-source models let you customize inference logic, serve them efficiently with frameworks like vLLM, and apply advanced techniques such as prefill–decode disaggregation for optimized performance. The result: more control, flexibility, and lower cost.
In this post, you’ll learn how to self-host gpt-oss using vLLM and BentoML. We will deploy it to BentoCloud, our fully managed inference platform with fast autoscaling, LLM-specific observability, and production-grade security built in.
vLLM is a fast and efficient open-source library designed for LLM inference and serving. Developed by researchers at UC Berkeley, vLLM stands out for its high-performance capabilities in handling LLMs. It features advanced techniques such as continuous batching, speculative decoding, disaggregated prefilling, and automatic prefix caching.
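To give a quick taste of the engine before wiring it into BentoML, here is a minimal sketch of vLLM's offline Python API. It uses the small facebook/opt-125m model purely as a stand-in; the rest of this post serves gpt-oss through the vllm serve CLI instead.

from vllm import LLM, SamplingParams

# Load a small demo model with automatic prefix caching enabled.
# Swap in any vLLM-supported model ID on a capable GPU.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)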
gpt-oss is an open-source reasoning model developed by OpenAI, available in two main variants: gpt-oss-120b and gpt-oss-20b. Trained with reinforcement learning and insights from OpenAI’s frontier models such as o3, gpt-oss performs well on complex reasoning tasks while remaining practical for general use cases.
I suggest you create a virtual environment to keep your dependencies organized:
python -m venv vllm-gpt-oss
source vllm-gpt-oss/bin/activate
Next, clone the BentoVLLM repo and install the dependencies:
git clone https://github.com/bentoml/BentoVLLM.git
cd BentoVLLM/gpt-oss-20b
pip install -r requirements.txt
Everything you need is in the cloned repo. Before deploying to the cloud, let’s walk through the key parts of the code.
First, specify the model and GPU. In this example, we’re using gpt-oss-20b with a single NVIDIA H100 GPU, but you can switch to any LLM supported by vLLM.
import pydantic
import bentoml

# Use Pydantic to validate data
class BentoArgs(pydantic.BaseModel):
    name: str = 'gpt-oss-20b'
    gpu_type: str = 'nvidia-h100-80gb'  # GPU type on BentoCloud
    tp: int = 1  # One GPU here for tensor parallelism
    model_id: str = 'openai/gpt-oss-20b'
    port: int = 8000
    # Other optional fields omitted for brevity

# Make the args injectable
bento_args = bentoml.use_arguments(BentoArgs)
Here we use template arguments from BentoML to inject dynamic and validated parameters at serve, build, and deploy time. These arguments can be referenced just like normal Python variables.
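For instance (a minimal sketch; extra_cli_args is a hypothetical variable, not part of the repo), the validated fields behave like plain attributes:

# bento_args is a validated BentoArgs instance, so its fields are ordinary attributes.
print(bento_args.name)      # 'gpt-oss-20b'
print(bento_args.gpu_type)  # 'nvidia-h100-80gb'

# --tensor-parallel-size is a standard vLLM CLI flag, used here purely for illustration.
extra_cli_args = ['--tensor-parallel-size', str(bento_args.tp)]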
BentoML allows you to package your code, dependencies, and model references into a unified Bento artifact, which simplifies deployment across different environments.
Here’s how you define the runtime environment for your model using a Bento image:
image = bentoml.images.Image(python_version="3.12") \
    .system_packages("curl", "git") \
    .requirements_file("requirements.txt")
For AMD GPU support, you can customize the image setup as follows:
image.base_image = 'rocm/vllm:rocm6.4.1_vllm_0.10.1_20250909'
# Disable locking of Python packages for AMD GPUs to exclude nvidia-* dependencies
image.lock_python_packages = False
# The GPU device is accessible by group 992
image.run('groupadd -g 992 -o rocm && usermod -aG rocm bentoml && usermod -aG render bentoml')
# Remove the vllm and torch deps to reuse the pre-installed ones in the base image
image.run('uv pip uninstall vllm torch torchvision torchaudio triton')
Set up a vLLM server that listens on port 8000 and serves requests to our model.
class LLM:
    # Download the model weights from HF, skipping large checkpoints
    hf_model = bentoml.models.HuggingFaceModel(
        bento_args.model_id.lower(),
        exclude=[".pth", ".pt", "original/**/*"],
    )

    def __command__(self) -> list[str]:
        return [
            'vllm', 'serve', self.hf_model,
            '--port', str(bento_args.port),
            # ...extra CLI args
            '--served-model-name', bento_args.model_id,
        ]
Inside the class, we use the HuggingFaceModel API to load model weights from Hugging Face. Traditionally, model loading can be time-consuming, especially for large models.
BentoML optimizes this process by preloading models during image build instead of at service startup. The downloaded weights are cached and mounted directly into the container at runtime, which reduces cold start latency and accelerates scaling on BentoCloud. This mechanism ensures faster deployments and smoother autoscaling for large models like gpt-oss. Learn more about LLM cold starts.
Next, use the @bentoml.service decorator to wrap the class, marking it as a BentoML Service. This starts vLLM as the serving backend within the Bento.
@bentoml.service(
    name=bento_args.name,
    image=bento_args.image,
    traffic={'timeout': 300},
    resources={'gpu': bento_args.tp, 'gpu_type': bento_args.gpu_type},
)
class LLM:
    ...
This basic setup is all you need to get your gpt-oss model running with vLLM and BentoML. For more advanced configurations like KV cache tuning, refer to the complete source code. BentoML supports full customization to tailor inference for your use case.
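As one example, you could extend the __command__ method above with standard vLLM CLI flags that govern KV cache and memory behavior. This is a hedged sketch: the flag values below are illustrative, not the repo’s defaults.

def __command__(self) -> list[str]:
    return [
        'vllm', 'serve', self.hf_model,
        '--port', str(bento_args.port),
        '--served-model-name', bento_args.model_id,
        # Illustrative tuning flags (standard vLLM options; adjust for your GPU):
        '--gpu-memory-utilization', '0.90',  # VRAM fraction reserved for weights + KV cache
        '--max-model-len', '32768',          # cap context length to bound KV cache size
        '--kv-cache-dtype', 'auto',          # or 'fp8' on supported hardware to shrink the cache
    ]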
BentoCloud provides fast and scalable infrastructure for building and scaling AI applications.
Install BentoML and log in to BentoCloud from the CLI. If you don’t have an account yet, sign up here for free.
pip install bentoml
bentoml cloud login
Create a secret to store your HF token and reference it when running the deployment command:
bentoml secret create huggingface HF_TOKEN=$HF_TOKEN
bentoml deploy --secret huggingface
Once the Deployment is up and running, go to its details page, which displays the exposed OpenAI-compatible API. As you interact with it, you’ll see real-time inference metrics such as tokens per second and Time to First Token (TTFT).
You can call the endpoint directly with any OpenAI client. Just set the base_url to your BentoCloud Deployment URL.
from openai import OpenAI

client = OpenAI(base_url='your_deployment_endpoint', api_key='na')

# Use the following func to get the available models
# client.models.list()

chat_completion = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {
            "role": "user",
            "content": "Who are you? Please respond in pirate speak!"
        }
    ],
    stream=True,
)

for chunk in chat_completion:
    # Extract and print the content of the model's reply
    print(chunk.choices[0].delta.content or "", end="")
By default, each Deployment runs a single replica. To optimize cost, set minimum replicas to 0 so it scales down during idle time and scales up automatically when new requests arrive.
bentoml deployment update deployment_name --scaling-min 0 --scaling-max 3
If you have a powerful GPU available, you can serve gpt-oss locally using BentoML.
For better performance on NVIDIA GPUs, we recommend installing the FlashInfer library before serving:
pip install flashinfer-python --extra-index-url https://flashinfer.ai/whl/cu124/torch2.6
export HF_TOKEN=<your-api-key>
bentoml serve
The server will start at http://localhost:3000. You can send requests directly to your local vLLM + BentoML service.
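For example, the same OpenAI client pattern shown earlier works against the local server. This sketch assumes the OpenAI-compatible routes are exposed under /v1, as in the BentoVLLM examples:

from openai import OpenAI

# Point the client at the local BentoML server instead of the BentoCloud endpoint.
client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize what vLLM does in two sentences."}],
)
print(response.choices[0].message.content)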
To package your project for deployment or portability, build a Bento and containerize it into a Docker image:
bentoml build
bentoml containerize
This lets you run the same gpt-oss setup on any infrastructure.
To go deeper into LLM inference, model serving, and deployment best practices, check out the BentoML documentation and blog.
BentoCloud is a fully managed inference platform for deploying and scaling AI models in production. It provides features like GPU autoscaling, built-in observability, Sandboxes, and Codespaces for fast and cost-efficient inference.
You can deploy gpt-oss or any other model with one command, without any Kubernetes or infrastructure setup.
BentoCloud is part of the Bento Inference Platform, which also provides on-prem and BYOC (Bring Your Own Cloud) solutions.
BentoCloud is ideal for AI teams that want to prototype quickly without infrastructure overhead. For teams needing more control or strict data governance, contact us to explore Bento On-Prem or BYOC.
vLLM is one of the fastest open-source inference engines for LLMs. It supports continuous batching, prefix caching, and prefill–decode disaggregation, all of which improve throughput and reduce latency when serving gpt-oss at scale.
vLLM is a high-performance inference engine optimized for server-side deployment and large-scale workloads. It supports distributed GPU setups, OpenAI-compatible APIs, and advanced batching techniques.
Ollama, on the other hand, is a local model runner designed for desktops and small servers. It’s great for lightweight experimentation but not ideal for production-scale serving or multi-GPU deployments.
If you need speed, scalability, and control in production environments, vLLM + BentoML is the better choice.
gpt-oss cannot generate images on its own; it is a language model focused on reasoning and text generation, not image synthesis.
If you want to generate images, you can pair gpt-oss with a text-to-image model such as Stable Diffusion or FLUX. BentoML supports deploying these multimodal pipelines together, so you can combine reasoning and generation in one API.
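As a purely illustrative sketch of how two BentoML Services can be composed with bentoml.depends: the class and method names below are hypothetical, and the model calls are stubbed out with placeholders.

import bentoml

@bentoml.service()
class ImageGenerator:
    @bentoml.api
    def txt2img(self, prompt: str) -> str:
        # Placeholder: a real Service would run a diffusion pipeline here.
        return f"<image generated for: {prompt}>"

@bentoml.service()
class ReasonThenDraw:
    # Declare a dependency so both Services deploy and scale together.
    image_generator = bentoml.depends(ImageGenerator)

    @bentoml.api
    def run(self, prompt: str) -> str:
        # Placeholder: a real Service would first call gpt-oss to refine the prompt.
        refined_prompt = prompt
        return self.image_generator.txt2img(refined_prompt)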