Empower your teams to ship and maintain AI inference the same way they ship software: fully automated, reliably scalable, and always secure.
Choose any cloud or on-prem GPU vendor with zero lock-in
Fast autoscaling from 0 to 100s of GPUs on demand
Built-in observability & cost analytics for full visibility
Simplify deployments and eliminate ops overhead
Build scalable AI systems with any model, using our open-source AI serving framework.
Instant local testing, interactive debugging, and composable pipelines make it easy to go from dev to prod in minutes, not days.
import uuid
from typing import Annotated, AsyncGenerator, Optional

import bentoml
from annotated_types import Ge, Le

# MODEL_ID, MAX_TOKENS, SYSTEM_PROMPT, PROMPT_TEMPLATE and the openai_endpoints
# decorator are defined at module level in the full example project.

@openai_endpoints(
    model_id=MODEL_ID,
    default_chat_completion_parameters=dict(stop=["<|eot_id|>"]),
)
@bentoml.service(
    name="bentovllm-llama3.1-405b-instruct-awq-service",
    traffic={
        "timeout": 1200,
        "concurrency": 256,  # Matches the default max_num_seqs in the vLLM engine
    },
    resources={
        "gpu": 4,
        "gpu_type": "nvidia-a100-80gb",
    },
)
class VLLM:
    def __init__(self) -> None:
        from transformers import AutoTokenizer
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        # Spin up a vLLM async engine sharded across the 4 GPUs requested above.
        ENGINE_ARGS = AsyncEngineArgs(
            model=MODEL_ID,
            max_model_len=MAX_TOKENS,
            enable_prefix_caching=True,
            tensor_parallel_size=4,
        )
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)

        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        self.stop_token_ids = [
            tokenizer.eos_token_id,
            tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors in plain English",
        system_prompt: Optional[str] = SYSTEM_PROMPT,
        max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        SAMPLING_PARAM = SamplingParams(
            max_tokens=max_tokens,
            stop_token_ids=self.stop_token_ids,
        )

        if system_prompt is None:
            system_prompt = SYSTEM_PROMPT
        prompt = PROMPT_TEMPLATE.format(user_prompt=prompt, system_prompt=system_prompt)

        # Stream tokens back to the caller as they are produced by the engine.
        stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)

        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)
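Before deploying, the same service can be exercised locally. Below is a minimal sketch, assuming the service above is started on the default port with `bentoml serve .`; the client URL and the streamed-call usage are illustrative:

import bentoml

# Call the locally running service started with `bentoml serve .`.
# SyncHTTPClient exposes each @bentoml.api method as a client method;
# the URL and port below are BentoML defaults and may differ in your setup.
with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    # The generate endpoint streams text chunks, so iterate over the response.
    for chunk in client.generate(
        prompt="Explain superconductors in plain English",
        max_tokens=1024,
    ):
        print(chunk, end="", flush=True)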
Burst from 1 to 100s of GPUs across regions with Bento’s adaptive autoscaling and ultra-fast cold starts—never over‑provision or miss a traffic spike.
bentoml deploy .

🍱 Built bento "vllm:7ftwkpztah74bdwk"
✅ Pushed Bento "vllm:7ftwkpztah74bdwk"
✅ Created deployment "vllm:7ftwkpztah74bdwk" in cluster "gcp-us-central-1"
💻 View Dashboard: https://ss-org-1.cloud.bentoml.com/deployments/vllm-t1y6
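For teams that prefer to manage deployments from code rather than the CLI, here is a minimal sketch using BentoML's deployment API; the deployment name and scaling bounds are illustrative, and exact argument names may vary by version:

import bentoml

# Create a BentoCloud deployment from the local project directory.
# scaling_min=0 allows scale-to-zero when idle; scaling_max caps burst capacity.
bentoml.deployment.create(
    bento=".",
    name="vllm-llama3-1-405b-instruct-awq",  # illustrative name
    scaling_min=0,
    scaling_max=10,
)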
Auto-generated REST API, Python client, and OpenAI-compatible endpoints with built-in auth, SLA monitoring, and cost analytics.
curl -s -X POST \
  'https://bentovllm-llama3-1-405b-instruct-awq-service.mt-guc1.bentoml.ai/generate' \
  -H 'Content-Type: application/json' \
  -d '{
    "max_tokens": 4096,
    "prompt": "Explain superconductors in plain English",
    "system_prompt": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don'"'"'t know the answer to a question, please don'"'"'t share false information."
  }'
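The same deployment can also be called through its OpenAI-compatible endpoint. A brief sketch with the OpenAI Python SDK follows; the /v1 base path, API key handling, and model name are assumptions to adjust to your deployment:

from openai import OpenAI

# Point the standard OpenAI client at the deployment's OpenAI-compatible route.
client = OpenAI(
    base_url="https://bentovllm-llama3-1-405b-instruct-awq-service.mt-guc1.bentoml.ai/v1",
    api_key="na",  # replace with a BentoCloud API token when auth is enabled
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # illustrative model id
    messages=[{"role": "user", "content": "Explain superconductors in plain English"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)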
Self-host the Bento Inference Platform in your VPC or on-prem to retain full control over your AI stack, with a flexible choice of models, clouds, and accelerators.
Deploy anywhere: AWS, Azure, GCP, or On-premise
SOC 2 Type II report, ISO 27001 / HIPAA control mappings
Keep IAM, KMS, and audit logs under your control
No data ever leaves your environment