Over the past year, multimodal AI has moved from buzzword to baseline. Models no longer stop at text; they now interpret images, audio, video, and even user interfaces, fusing perception with reasoning. The latest wave, from GLM-4.5V to Qwen3-VL, pushes open-source multimodality into new territory once dominated by proprietary systems like GPT-5 and Gemini-2.5-Pro.
Compared with closed-source APIs, open-source models remain the top choice for developers and enterprises seeking control, privacy, and customization. They allow teams to fine-tune, self-host, and integrate multimodal capabilities directly into their products without vendor lock-in.
In this blog post, we’ll introduce some of the most popular open-source multimodal models available today. Since the world of multimodal AI is broad, we’ll focus on vision language models (VLMs), which are designed to understand and process both visual and text information. Along the way, we’ll also answer some frequently asked questions about VLMs.
Gemma 3 is a family of lightweight, state-of-the-art open models developed by Google, built on the same research behind Gemini 2.0. It supports advanced text, image, and short video understanding, with strong reasoning capabilities across tasks and languages.
Available in 1B, 4B, 12B, and 27B sizes, Gemma 3 offers flexibility for a range of hardware, from laptops to cloud clusters. With a 128K-token context window (32K for 1B), it can handle long-form input for more complex tasks.
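To get a quick feel for what this looks like in code, here is a minimal sketch of single-image inference with Hugging Face Transformers. It assumes a recent Transformers version with Gemma 3 support; the model ID, image URL, and prompt are placeholders you would swap for your own.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Placeholder model ID; pick the Gemma 3 size that fits your hardware.
model_id = "google/gemma-3-4b-it"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A chat-style message mixing an image (by URL) with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "Summarize what this chart shows."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)
```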
Key features:
Points to be cautious about:
GLM-4.5V is the latest open-source multimodal model developed by Z.ai, the team behind the GLM family of LLMs. With 106B parameters (12B active), it achieves state-of-the-art performance on 42 public vision language benchmarks, surpassing models like Gemma-3-27B and Qwen2.5-VL-72B.
GLM-4.5V also comes with a switchable Thinking Mode. This dual-mode design allows the model to trade speed for depth when tackling complex, multi-step visual reasoning tasks.
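Below is a rough sketch of how you might toggle that behavior when GLM-4.5V is served behind an OpenAI-compatible endpoint (for example with vLLM or SGLang). The base URL, the registered model name, and the chat_template_kwargs flag are assumptions that depend on your serving setup, so check your inference engine’s documentation for the exact switch.

```python
from openai import OpenAI

# Assumed local OpenAI-compatible endpoint; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",  # model name as registered by your server
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/floorplan.png"}},
                {"type": "text", "text": "How many rooms have windows, and how did you count them?"},
            ],
        }
    ],
    # Assumed flag: many serving engines expose the thinking switch
    # through chat template kwargs; verify the exact name for your setup.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    max_tokens=1024,
)
print(response.choices[0].message.content)
```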
Key features:
Points to be cautious about:
Qwen3-VL is the latest and most capable VLM in Alibaba’s Qwen series, representing a major leap over its predecessor, Qwen2.5-VL. It delivers stronger multimodal reasoning, agentic capabilities, and long-context comprehension.
Two main editions are currently available: Qwen3-VL-235B-A22B and Qwen3-VL-30B-A3B. Both provide Instruct and Thinking variants and official FP8 versions for efficient inference.
The flagship Qwen3-VL-235B-A22B-Instruct rivals top-tier proprietary models such as Gemini-2.5-Pro and GPT-5 across multimodal benchmarks covering general Q&A, 2D/3D grounding, video understanding, OCR, and document comprehension. In text-only tasks, it performs on par with or surpasses frontier models such as DeepSeek-V3-0324 and Claude-Opus-4 on leading benchmarks including MMLU, AIME25, and LiveBench1125.
Key features:
For more practical examples and use cases, explore the official Qwen3-VL cookbooks.
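As a small local-inference illustration, here is a sketch using vLLM’s offline LLM.chat API. It assumes your vLLM build supports Qwen3-VL and that the FP8 checkpoint name below matches the one published on Hugging Face; treat both, along with the sample image URL, as assumptions to verify.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint name; verify the exact FP8 repo ID on Hugging Face.
llm = LLM(model="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8")

# OpenAI-style chat messages; the image is passed by URL.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/invoice.png"}},
            {"type": "text", "text": "Extract the invoice number, date, and total amount."},
        ],
    }
]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=512, temperature=0.2))
print(outputs[0].outputs[0].text)
```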
Molmo is a family of open-source VLMs developed by the Allen Institute for AI. Available in 1B, 7B, and 72B parameter sizes, Molmo models deliver state-of-the-art performance for their class. According to benchmarks, they perform on par with proprietary models like GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet.
The key to Molmo’s performance lies in its unique training data, PixMo. This highly curated dataset consists of 1 million image-text pairs and includes two main types of data:
Interestingly, the Molmo researchers took an innovative approach to data collection: they asked annotators to describe images out loud for 60 to 90 seconds, covering everything visible, including the spatial positions of objects and the relationships among them. Speaking turned out to be far more efficient than the traditional method of writing captions by hand. In total, the team collected high-quality audio descriptions for 712K images sampled from 50 high-level topics.
Key features:
Points to be cautious about:
Pixtral is a 12-billion-parameter open-source model developed by Mistral, marking the company’s first foray into multimodal capabilities. Designed to understand both images and text, it is released with open weights under the Apache 2.0 license.
Pixtral is pre-trained on a large-scale dataset of interleaved image and text documents and then instruction-tuned, which makes it capable of multi-turn, multi-image conversations. Unlike many earlier open-source multimodal models, Pixtral maintains excellent text benchmark performance while excelling at multimodal tasks.
Key features:
Outstanding instruction-following capability: Benchmark results indicate that Pixtral 12B significantly outperforms other open-source multimodal models like Qwen2-VL 7B, LLaVA-OneVision 7B, and Phi-3.5 Vision on instruction-following tasks. Mistral has created new benchmarks, MM-IF-Eval and MM-MT-Bench, to further assess performance in multimodal contexts, and Pixtral excels on these as well. Both benchmarks are expected to be open-sourced for the community in the near future.
Multi-image processing: Pixtral can handle multiple images in a single input, processing them at their native resolution. The model supports a context window of 128,000 tokens and can ingest images with varied sizes and aspect ratios.
Points to be cautious about:
To deploy Pixtral 12B with OpenLLM, you can run `openllm serve pixtral:12b`.
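Once the server is up, you can exercise Pixtral’s multi-image support through the OpenAI-compatible API that OpenLLM exposes. The sketch below is only illustrative: the port, model name, and image URLs are assumptions, so match them to what openllm serve reports on startup.

```python
from openai import OpenAI

# OpenLLM exposes an OpenAI-compatible endpoint; adjust host/port to your setup.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

response = client.chat.completions.create(
    model="pixtral:12b",  # use the model name reported by your server
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/after.png"}},
                {"type": "text", "text": "Compare these two screenshots and list what changed."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```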
This is probably the first question you should ask yourself. Also, think about the type of data your application needs to process. If your use case only requires text, an LLM is often sufficient. However, if you need to analyze both text and images, a VLM is a reasonable choice.
If you choose a VLM, be aware that certain models may compromise their text-only performance to excel in multimodal tasks. This is why some model developers emphasize that their new models, such as NVLM and Pixtral, do not sacrifice text performance for multimodal capabilities.
For other modalities, note that some models specialize in particular areas, such as document processing or audio analysis. These are better suited for multimodal scenarios that go beyond text and images.
Consider the following factors to ensure optimal performance and usability:
VLMs often require significant computational resources due to their large size. Top-performing open-source models, like some of those mentioned above, can exceed 70 billion parameters, which means you need high-performance GPUs to run them, especially for real-time applications.
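As a rough back-of-the-envelope check, you can estimate the memory needed just for the weights by multiplying the parameter count by bytes per parameter. The snippet below uses that simple rule of thumb; keep in mind it ignores activation memory, the KV cache, and framework overhead, which add more on top.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Estimate GPU memory (GB) needed just to hold the model weights."""
    return params_billions * 1e9 * bytes_per_param / 1024**3

# A 72B model in FP16/BF16 (2 bytes per parameter) vs. 4-bit quantization (~0.5 bytes).
print(f"72B @ FP16 : {weight_memory_gb(72, 2.0):.0f} GB")   # ~134 GB
print(f"72B @ 4-bit: {weight_memory_gb(72, 0.5):.0f} GB")   # ~34 GB
```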
If you are looking for a solution that simplifies this process, you can try BentoCloud. It seamlessly integrates cutting-edge AI infrastructure into enterprises’ private cloud environments, and its cloud-agnostic approach allows AI teams to select the cloud regions with the most competitive GPU rates. With BentoCloud offloading the infrastructure burden, you can focus on building the core features of your application with your VLM.
Not all model serving and deployment frameworks are designed to handle multimodal inputs, such as text, images, and videos. To leverage the full potential of your VLM, ensure your serving and deployment framework can accommodate and process multiple data types simultaneously.
BentoML supports a wide range of data types, including text, images, audio, and documents. You can easily integrate it into your existing ML workflow without building a custom pipeline to handle multimodal inputs.
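For instance, a minimal BentoML service that accepts an image and a prompt in one API call might look like the sketch below. The service name and the stubbed inference call are illustrative placeholders you would replace with your actual VLM.

```python
import bentoml
from PIL import Image


@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class VLMService:
    def __init__(self) -> None:
        # Load your VLM here (e.g., via Transformers or vLLM).
        # Kept as a stub so the sketch stays model-agnostic.
        self.model = None

    @bentoml.api
    def describe(self, image: Image.Image, prompt: str = "Describe this image.") -> str:
        # Replace this stub with a real inference call to your model.
        width, height = image.size
        return f"Received a {width}x{height} image with prompt: {prompt!r}"
```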
VLMs are often used in demanding applications such as:
In these use cases, traffic can spike unpredictably based on user behavior, so your deployment framework should support fast scaling during peak hours. BentoML provides easy building blocks for creating scalable APIs, allowing you to deploy and run any VLM on BentoCloud. Its autoscaling feature ensures you only pay for the resources you use.
Each benchmark serves a specific purpose and can highlight different capabilities of models. Here are five popular benchmarks for VLMs:
One thing to note is that you should always treat benchmarks with caution. They are important, but by no means the only reference for choosing the right model for your use case.
Over the past year, we’ve seen a wave of powerful open-source VLMs emerge. Is this a coincidence, or are LLMs moving towards multimodal capabilities as a trend? It may be too early to say for sure. What remains unchanged is the need for robust solutions to quickly and securely deploy these models into production at scale.
If you have questions about productionizing VLMs, check out the following resources: