Empower your teams to ship and maintain AI inference the same way they ship software: fully automated, reliably scalable, and always secure.
Choose any cloud or on-prem GPU vendor with zero lock-in
Fast autoscaling from 0 to 100s of GPUs on demand
Built-in observability & cost analytics for full visibility
Simplify deployments and eliminate ops overhead
Build scalable AI systems with any model, using our open-source AI serving framework.
Instant local testing, interactive debugging, and composable pipelines make it easy to go from dev to prod in minutes, not days.
import uuid
from typing import Annotated, AsyncGenerator, Optional

import bentoml
from annotated_types import Ge, Le

# MODEL_ID, MAX_TOKENS, SYSTEM_PROMPT, PROMPT_TEMPLATE and the openai_endpoints
# decorator are defined at module level in the full example project.

@openai_endpoints(
    model_id=MODEL_ID,
    default_chat_completion_parameters=dict(stop=["<|eot_id|>"]),
)
@bentoml.service(
    name="bentovllm-llama3.1-405b-instruct-awq-service",
    traffic={
        "timeout": 1200,
        "concurrency": 256,  # Matches the default max_num_seqs in the vLLM engine
    },
    resources={
        "gpu": 4,
        "gpu_type": "nvidia-a100-80gb",
    },
)
class VLLM:
    def __init__(self) -> None:
        from transformers import AutoTokenizer
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        # Spin up a vLLM async engine sharded across the 4 GPUs requested above.
        ENGINE_ARGS = AsyncEngineArgs(
            model=MODEL_ID,
            max_model_len=MAX_TOKENS,
            enable_prefix_caching=True,
            tensor_parallel_size=4,
        )
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)

        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        self.stop_token_ids = [
            tokenizer.eos_token_id,
            tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors in plain English",
        system_prompt: Optional[str] = SYSTEM_PROMPT,
        max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        SAMPLING_PARAM = SamplingParams(
            max_tokens=max_tokens,
            stop_token_ids=self.stop_token_ids,
        )

        if system_prompt is None:
            system_prompt = SYSTEM_PROMPT
        prompt = PROMPT_TEMPLATE.format(user_prompt=prompt, system_prompt=system_prompt)

        # Stream tokens back to the caller as they are produced by the engine.
        stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)

        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)
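Before deploying, the same service can be exercised locally. Below is a minimal sketch, assuming the service above is started on the default port with `bentoml serve .`; the client URL and the streamed-call usage are illustrative:

import bentoml

# Call the locally running service started with `bentoml serve .`.
# SyncHTTPClient exposes each @bentoml.api method as a client method;
# the URL and port below are BentoML defaults and may differ in your setup.
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    # The generate endpoint streams text chunks, so iterate over the response.
    for chunk in client.generate(
        prompt="Explain superconductors in plain English",
        max_tokens=1024,
    ):
        print(chunk, end="", flush=True)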
Burst from 1 to 100s of GPUs across regions with Bento’s adaptive autoscaling and ultra-fast cold starts—never over‑provision or miss a traffic spike.
bentoml deploy .

🍱 Built bento "vllm:7ftwkpztah74bdwk"
✅ Pushed Bento "vllm:7ftwkpztah74bdwk"
✅ Created deployment "vllm:7ftwkpztah74bdwk" in cluster "gcp-us-central-1"
💻 View Dashboard: https://ss-org-1.cloud.bentoml.com/deployments/vllm-t1y6
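For teams that prefer to manage deployments from code rather than the CLI, here is a minimal sketch using BentoML's deployment API; the deployment name and scaling bounds are illustrative, and exact argument names may vary by version:

import bentoml

# Create a BentoCloud deployment from the local project directory.
# scaling_min=0 allows scale-to-zero when idle; scaling_max caps burst capacity.
bentoml.deployment.create(
    bento=".",
    name="vllm-llama3-1-405b-instruct-awq",  # illustrative name
    scaling_min=0,
    scaling_max=10,
)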
Auto-generated REST API, Python client, and OpenAI-compatible endpoints with built-in auth, SLA monitoring, and cost analytics.
curl -s -X POST \
  'https://bentovllm-llama3-1-405b-instruct-awq-service.mt-guc1.bentoml.ai/generate' \
  -H 'Content-Type: application/json' \
  -d '{
    "max_tokens": 4096,
    "prompt": "Explain superconductors in plain English",
    "system_prompt": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don'"'"'t know the answer to a question, please don'"'"'t share false information."
  }'
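The same deployment can also be called through its OpenAI-compatible endpoint. A brief sketch with the OpenAI Python SDK follows; the /v1 base path, API key handling, and model name are assumptions to adjust to your deployment:

from openai import OpenAI

# Point the standard OpenAI client at the deployment's OpenAI-compatible route.
client = OpenAI(
    base_url="https://bentovllm-llama3-1-405b-instruct-awq-service.mt-guc1.bentoml.ai/v1",
    api_key="na",  # replace with a BentoCloud API token when auth is enabled
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # illustrative model id
    messages=[{"role": "user", "content": "Explain superconductors in plain English"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)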
Self-host the Bento Inference Platform in your VPC or on-prem to retain full control over your AI stack, with a flexible choice of models, clouds, and accelerators.
Deploy anywhere: AWS, Azure, GCP, or On-premise
SOC 2 Type II report, ISO 27001 / HIPAA control mappings
Keep IAM, KMS, and audit logs under your control
No data ever leaves your environment