October 11, 2024 • Written By Sherlock Xu
You wake up in the morning, and there is another new AI model making headlines. It’s not surprising anymore, right? These days, it feels like a new model drops every other day, each promising to be more powerful than the last.
Take Llama 3.2 Vision as an example, the first multimodal models in Meta’s open-source Llama series. They push the boundaries beyond text understanding to include images as well. But don’t get it twisted. Multimodal AI is about more than just images and text. These models can process multiple types of information, from images and audio to video and text. And it’s not just open-source AI: proprietary models like GPT-4 have already expanded their capabilities by integrating these modalities.
Compared with proprietary models, open-source models remain a favorite for those looking for more secure, affordable, and customizable solutions. In this blog post, we’re introducing some of the most popular open-source multimodal models available today.
Since the world of multimodal AI is broad, we will focus on vision language models (VLMs), which are designed to understand and process both visual and text information. Along the way, we will also explore some FAQs about VLMs.
Llama 3.2 Vision, developed by Meta, is a collection of multimodal LLMs designed to process both text and images. Available in 11B and 90B parameter sizes, Llama 3.2 Vision outperforms many open-source and proprietary models on image-text tasks.
To support image input, Meta integrates a pre-trained image encoder into the language model using adapters, which connect image data to the text-processing layers. This allows the models to handle both image and text inputs simultaneously.
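To see what this combined image-and-text interface looks like in practice, here is a minimal sketch of local inference through Hugging Face Transformers. It assumes transformers 4.45 or later and accepted access to the gated weights; the image URL and prompt are placeholders.

```python
# Minimal sketch: image + text inference with Llama 3.2 Vision via Hugging Face
# Transformers. The image URL and generation settings are placeholders.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image; any RGB image works.
image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# The processor merges the chat-formatted prompt with the image tensors.
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```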
Key features:
Points to be cautious about:
To deploy Llama 3.2 Vision, check out our blog post or simply run `openllm serve llama3.2:11b-vision` with OpenLLM.
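Once the server is running, you can query it with the standard OpenAI client. The sketch below assumes the server is listening locally on OpenLLM’s default port (3000); the model name and image URL are placeholders to adjust for your deployment.

```python
# Rough sketch: sending an image + text prompt to a locally running OpenLLM
# server through its OpenAI-compatible endpoint. Port, model name, and image
# URL are assumptions; adjust them to match your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```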
NVLM is a family of multimodal LLMs developed by NVIDIA, representing a frontier-class approach to VLMs. It achieves state-of-the-art results in tasks that require a deep understanding of both text and images. The first public iteration, NVLM 1.0, rivals top proprietary models like GPT-4o, as well as open-access models like Llama 3-V 405B.
Key features:
Distinct architectures: The NVLM 1.0 family consists of three unique architectures, NVLM-D (decoder-only), NVLM-X (cross-attention-based), and NVLM-H (a hybrid of the two), designed for different use cases.
Powerful image reasoning: NVLM 1.0 surpasses many proprietary and open-source models in tasks such as OCR, multimodal reasoning, and high-resolution image handling. It also demonstrates strong scene understanding: in a sample image provided by NVIDIA, it identifies potential risks and suggests actions based on the visual input.
Improved text-only performance: NVIDIA researchers observed that while open multimodal LLMs often achieve strong results on vision language tasks, their performance tends to degrade on text-only tasks. To address this, they developed “production-grade multimodality” for the NVLM models, enabling them to excel at both vision language and text-only tasks (average text accuracy even increased by 4.3 points after multimodal training).
Points to be cautious about:
Molmo is a family of open-source VLMs developed by the Allen Institute for AI. Available in 1B, 7B, and 72B parameter sizes, Molmo models deliver state-of-the-art performance for their class. According to benchmarks, they perform on a par with proprietary models like GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet.
The key to Molmo’s performance lies in its unique training data, PixMo. This highly curated dataset consists of 1 million image-text pairs and includes two main types of data:
Interestingly, Molmo researchers used an innovative approach to data collection: they asked annotators to provide spoken descriptions of images within 60 to 90 seconds, covering everything visible, including the spatial positioning of and relationships among objects. This turned out to be far more efficient than the traditional approach of having annotators write captions down. Overall, they collected high-quality audio descriptions for 712k images sampled from 50 high-level topics.
Key features:
Points to be cautious about:
Qwen2-VL, developed by the Qwen team at Alibaba Cloud, is the latest iteration of the VLMs in the Qwen series. It goes beyond basic recognition of objects like plants and landmarks to understand complex relationships among multiple objects in a scene. It can also identify handwritten text and multiple languages within images.
Qwen2-VL also extends its capabilities to video content, supporting video summarization, question answering, and real-time conversations around videos.
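As a rough illustration of a single-image request through the Hugging Face integration, the sketch below asks the model to transcribe handwritten text. The model size, image URL, and prompt are placeholders, and the helper package qwen-vl-utils is assumed to be installed.

```python
# Minimal sketch: asking Qwen2-VL to read handwritten text in an image via
# Hugging Face Transformers. Model size, image URL, and prompt are placeholders.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/handwritten-note.jpg"},
            {"type": "text", "text": "Transcribe the handwritten text in this image."},
        ],
    }
]

# Build the prompt and extract the visual inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens before decoding so only the answer is printed.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```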
Key features:
Points to be cautious about:
You can find more information about its limitations in the model’s GitHub repository.
Pixtral is a 12-billion-parameter open-source model developed by Mistral, marking the company's first foray into multimodal capabilities. Designed to understand both images and text, Pixtral is released with open weights under the Apache 2.0 license.
Pixtral is an instruction-tuned model pre-trained on a large-scale dataset of interleaved image and text documents, which makes it capable of multi-turn, multi-image conversations. Unlike many previous open-source models, Pixtral maintains excellent text benchmark performance while excelling in multimodal tasks.
Key features:
Outstanding instruction-following capability: Benchmark results indicate that Pixtral 12B significantly outperforms other open-source multimodal models like Qwen2-VL 7B, LLaVA-OneVision 7B, and Phi-3.5 Vision on instruction-following tasks. Mistral has created new benchmarks, MM-IF-Eval and MM-MT-Bench, to further assess performance in multimodal contexts, where Pixtral also excels. These benchmarks are expected to be open-sourced for the community in the near future.
Multi-image processing: Pixtral can handle multiple images in a single input, processing them at their native resolution. The model supports a context window of 128,000 tokens and can ingest images with varied sizes and aspect ratios.
Points to be cautious about:
To deploy Pixtral 12B, you can run `openllm serve pixtral:12b` with OpenLLM.
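Because Pixtral accepts multiple images in a single prompt, a request to the served model can interleave several image URLs with text. Below is a rough sketch going through the OpenAI-compatible endpoint that OpenLLM exposes; the port, model name, and image URLs are placeholders.

```python
# Rough sketch: a multi-image prompt sent to a served Pixtral model through an
# OpenAI-compatible API. Port, model name, and image URLs are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

response = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two charts and summarize the key differences."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart-q1.png"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart-q2.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```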
This is probably the first question you should ask yourself. Also, think about the type of data your application needs to process. If your use case only requires text, an LLM is often sufficient. However, if you need to analyze both text and images, a VLM is a reasonable choice.
If you choose a VLM, be aware that certain models may compromise their text-only performance to excel in multimodal tasks. This is why some model developers emphasize that their new models, such as NVLM and Pixtral, do not sacrifice text performance for multimodal capabilities.
For other modalities, note that different models may be specialized for particular fields, such as document processing or audio analysis. These are better suited for multimodal scenarios beyond just text and images.
Consider the following factors to ensure optimal performance and usability:
VLMs often require significant computational resources due to their large size. Top-performing open-source models like the ones mentioned above can exceed 70 billion parameters. This means you need high-performance GPUs to run them, especially for real-time applications.
If you are looking for a solution that simplifies this process, you can try BentoCloud. It seamlessly integrates cutting-edge AI infrastructure into enterprises’ private cloud environments. Its cloud-agnostic approach allows AI teams to select the cloud regions with the most competitive GPU rates. As BentoCloud offloads the infrastructure burdens, you can focus on building the core functionality of your application with your VLM.
Not all model serving and deployment frameworks are designed to handle multimodal inputs, such as text, images, and videos. To leverage the full potential of your VLM, ensure your serving and deployment framework can accommodate and process multiple data types simultaneously.
BentoML supports a wide range of data types, such as text, images, audio, and documents. You can easily integrate it with your existing ML workflow without building a custom pipeline for handling multimodal inputs.
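As a minimal sketch of what this looks like in practice, the service below accepts an image and a text prompt in a single request. The service name, resource settings, and the inference call itself are placeholders you would replace with your own VLM.

```python
# Minimal sketch of a BentoML service that takes an image and a text prompt in
# one request. The model loading and inference call are placeholders.
from __future__ import annotations

import bentoml
from PIL.Image import Image


@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class VLMService:
    def __init__(self) -> None:
        # Load your VLM here (e.g. a Transformers pipeline or an inference engine).
        self.model = ...

    @bentoml.api
    def describe(self, image: Image, prompt: str = "Describe this image.") -> str:
        # Placeholder: run the image and prompt through the loaded model.
        return f"Model response for prompt: {prompt}"
```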
VLMs are often used in demanding applications such as:
In these use cases, traffic can spike unpredictably based on user behavior, so your deployment framework should support fast scaling during peak hours. BentoML provides easy building blocks to create scalable APIs, allowing you to deploy and run any VLM on BentoCloud. Its autoscaling feature ensures you only pay for the resources you use.
Each benchmark serves a specific purpose and can highlight different capabilities of models. Here are five popular benchmarks for VLMs:
One thing to note is that you should always treat benchmarks with caution. They are important, but by no means the only reference for choosing the right model for your use case.
Over the past month, we’ve seen a wave of powerful open-source VLMs emerge. Is this a coincidence, or are LLMs moving towards multimodal capabilities as a trend? It may be too early to say for sure. What remains unchanged is the need for robust solutions to quickly and securely deploy these models into production at scale.
If you have questions about productionizing VLMs, check out the following resources: