A Guide to Open-Source Embedding Models

October 25, 2024 • Written By Sherlock Xu

If you are building an AI-powered system for semantic search, recommendation engines, or information retrieval, you’re likely familiar with embedding models. These models transform text, images, and other data types into vectors that capture semantic meaning, helping systems understand and retrieve relevant content based on similarity.

In this blog post, we’ll explore some of the top open-source embedding models and answer common questions about them.

NV-Embed-v2

NV-Embed-v2 is the latest release in NVIDIA’s family of generalist embedding models. It delivers state-of-the-art performance across a wide variety of tasks, ranking No. 1 on the MTEB leaderboard with an impressive score of 72.31 across 56 tasks spanning retrieval, classification, clustering, semantic textual similarity (STS), and more. It’s worth mentioning that its predecessor, NV-Embed-v1, also held the top spot on the same leaderboard.

Why choose it:

  • Novel design: The model uses latent-attention pooling, enabling the LLM to attend to specific latent vectors for better embedding quality. It also uses a two-stage instruction tuning approach that makes it versatile across both retrieval and non-retrieval tasks.
  • Retrieval excellence: NV-Embed-v2 holds the No. 1 rank in the retrieval sub-category, scoring 62.65 across 15 retrieval tasks, which are essential for applications like retrieval-augmented generation (RAG).
  • Negative mining: NV-Embed-v2 introduces hard-negative mining techniques that filter out false negatives more effectively, improving the quality of contrastive training.

Points to be cautious about:

  • Non-commercial license: NV-Embed-v2 is released under the CC-BY-NC-4.0 license, which prohibits commercial usage.
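
If the license fits your use case, trying the model takes only a few lines. Below is a minimal sketch using Hugging Face transformers; the encode helper and instruction prefix follow the patterns shown on the model card, but treat the exact signature, prefixes, and example strings as assumptions to verify against the current card. Note that this is a multi-billion-parameter model and needs a capable GPU.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel

# NV-Embed-v2 ships custom modeling code, so trust_remote_code is required
model = AutoModel.from_pretrained("nvidia/NV-Embed-v2", trust_remote_code=True)

# Retrieval models in this family expect an instruction prefix on queries,
# while passages are embedded without one
query_prefix = (
    "Instruct: Given a question, retrieve passages that answer the question\nQuery: "
)
queries = ["How do vector databases index embeddings?"]
passages = ["Vector databases build approximate nearest neighbor indexes over vectors."]

with torch.no_grad():
    query_emb = model.encode(queries, instruction=query_prefix, max_length=4096)
    passage_emb = model.encode(passages, instruction="", max_length=4096)

# L2-normalize, then score with cosine similarity
query_emb = F.normalize(query_emb, p=2, dim=1)
passage_emb = F.normalize(passage_emb, p=2, dim=1)
print(query_emb @ passage_emb.T)
```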

BGE-M3

BGE (BAAI General Embedding) models are a family of text embedding models developed by the Beijing Academy of Artificial Intelligence (BAAI). One of the most popular versions in the series is BGE-M3. It stands out for its multi-functionality, multi-linguality, and multi-granularity, the three capabilities that give “M3” its name.

Why choose it:

  • Multi-functionality: BGE-M3 can simultaneously perform the three common retrieval functionalities of embedding models: dense, multi-vector, and sparse retrieval (see the sketch after this list).
  • Multi-linguality: BGE-M3 supports more than 100 working languages. It learns a common semantic space for different languages. This enables both multilingual retrieval within each language and crosslingual retrieval between different languages.
  • Multi-granularity: BGE-M3 is able to process inputs of different granularities, from short sentences to long documents of up to 8192 tokens.
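
To make the three retrieval modes concrete, here is a minimal sketch using BAAI’s FlagEmbedding package; the flags and output keys follow the project’s README, but double-check them against the release you install.

```python
from FlagEmbedding import BGEM3FlagModel

# use_fp16 speeds up inference with a minor precision trade-off
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = ["What is BGE-M3?", "BGE-M3 is a multi-functional embedding model."]

# Request all three representations in one call
output = model.encode(
    sentences,
    return_dense=True,         # dense vectors for semantic similarity
    return_sparse=True,        # token-level weights for lexical matching
    return_colbert_vecs=True,  # multi-vector (ColBERT-style) representations
)

print(output["dense_vecs"].shape)    # (2, 1024)
print(output["lexical_weights"][0])  # {token_id: weight, ...}
print(output["colbert_vecs"][0].shape)
```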

Points to be cautious about:

  • Generalizability needs further testing: While BGE-M3 performs well on benchmark datasets, the researchers note that more testing is needed to confirm its effectiveness on real-world datasets.
  • Computational demand for long documents: Although BGE-M3 handles inputs up to 8192 tokens, processing very lengthy documents may pose challenges in terms of computational resources and efficiency.
  • Performance across languages: The researchers claim multi-lingual support, but they also acknowledge that performance may vary across different language families and linguistic features.

BGE-M3 is just one part of the broader BGE family. If you're looking for English-only alternatives, you may want to explore bge-base-en-v1.5 or bge-en-icl.

all-mpnet-base-v2

MPNet is a novel pre-training method for language understanding tasks. It addresses limitations of masked language modeling (MLM) in BERT and permuted language modeling (PLM) in XLNet. In the MPNet family, all-mpnet-base-v2 is one of the most popular embedding models. It is designed specifically for sentence and short paragraph encoding. The original developers use a contrastive learning objective: given a sentence from a paired dataset, the model predicts which of a set of randomly sampled sentences is the correct pair.

all-mpnet-base-v2 is also a sentence-transformers model: it maps sentences and paragraphs into a 768-dimensional dense vector space, making it well suited for clustering, semantic search, and other NLP tasks.
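
Getting embeddings out of it takes only a few lines. Here is a minimal sketch with the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "The weather is lovely today.",
    "It's sunny and warm outside.",
    "I left my laptop at the office.",
]

# encode() returns one 768-dimensional vector per sentence
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Cosine similarity: the first two sentences should score highest
print(util.cos_sim(embeddings[0], embeddings[1:]))
```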

To date, all-mpnet-base-v2 is one of the most downloaded embedding models on Hugging Face.

Why choose it:

  • Extensive training: The model is trained on over 1 billion sentence pairs to help it capture fine-grained semantic relationships.
  • Fine-tuning: The model is very adaptable and can be further fine-tuned to optimize performance for specific tasks. As of this writing, it has 149 fine-tuned versions on Hugging Face.
  • Flexible licensing: The model is released under the Apache 2.0 license. This means it supports both personal and commercial use in accordance with the license terms.

Points to be cautious about:

  • Input length limitations: By default, the model truncates inputs longer than 384 word pieces, which may lead to a loss of context in longer text.
  • Moderate performance: Compared to other models of similar size, all-mpnet-base-v2 may not perform as well on certain tasks, and it does not rank particularly high on the MTEB leaderboard across a range of benchmarks.

If you want to try all-mpnet-base-v2, I also recommend all-MiniLM-L6-v2, a smaller and faster sibling. Both are sentence-transformers models and easy to set up. For those new to AI models, they make an excellent starting point for exploring embeddings.

gte-multilingual-base

gte-multilingual-base is the latest model in Alibaba Group’s GTE (General Text Embedding) family. It stands out for its strong performance in multilingual retrieval tasks and comprehensive representation evaluations. With 305 million parameters, this model balances high-quality embeddings with efficient resource usage.

Why choose it:

  • Multilingual support: The model covers more than 70 languages, delivering reliable multilingual performance.
  • Elastic dense embedding: gte-multilingual-base supports elastic dense representations, letting you truncate the output dimension to optimize storage and improve efficiency in downstream tasks (see the sketch after this list).
  • Encoder architecture: Built on an encoder-only transformer architecture, gte-multilingual-base is smaller and more resource-efficient than decoder-only models like gte-Qwen2-1.5B-instruct, delivering a 10x increase in inference speed.
  • Sparse vectors: In addition to dense representations, it can also generate sparse vectors.
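
To illustrate the elastic dense embeddings mentioned above, here is a sketch using sentence-transformers, whose truncate_dim option keeps only the first N dimensions of each vector; treat the chosen dimension as an illustrative assumption and verify the usage against the model card.

```python
from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the leading dimensions of each embedding,
# trading a little accuracy for smaller storage and faster search
model = SentenceTransformer(
    "Alibaba-NLP/gte-multilingual-base",
    trust_remote_code=True,  # the model ships custom modeling code
    truncate_dim=256,        # elastic output: 256 instead of the full 768
)

embeddings = model.encode(["Bonjour le monde", "Hello world"])
print(embeddings.shape)  # (2, 256)
```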

Points to be cautious about:

  • Inconsistent performance across languages: In the paper, the researchers note that performance may vary for certain languages, likely due to the limited data for those languages during contrastive pre-training.

Other recommended gte models:

  • gte-Qwen2-7B-instruct: A top-ranking model on the MTEB leaderboard
  • gte-large-en-v1.5: A model optimized for English, with a max sequence length of 8192

Nomic Embed Vision

Nomic Embed Vision is Nomic’s latest model for multimodal embeddings, available in two versions: v1 and v1.5. They are fully compatible with the corresponding versions of Nomic Embed Text, which are also popular embedding models. All Nomic Embed models with the same version have compatible latent spaces and can be used for multimodal tasks. In short, you can use the Nomic Embed models to:

  • Embed both image and text data
  • Perform unimodal semantic search within image and text datasets
  • Perform multimodal semantic search across image and text datasets
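
As an illustration of cross-modal search, here is a hedged sketch combining the v1.5 text and vision models; the "search_query: " prefix and CLS-token pooling follow the model cards, but verify both against the current documentation.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
from sentence_transformers import SentenceTransformer

# Text side: Nomic Embed Text expects a task prefix such as "search_query: "
text_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_emb = text_model.encode(
    ["search_query: a photo of a golden retriever"], normalize_embeddings=True
)

# Vision side: v1.5 shares a latent space with the v1.5 text model
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True
)

image = Image.open("dog.jpg")  # placeholder image path
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    img_emb = vision_model(**inputs).last_hidden_state[:, 0]  # CLS token
img_emb = F.normalize(img_emb, p=2, dim=1)

# Cosine similarity between the text query and the image
print(torch.from_numpy(text_emb) @ img_emb.T)
```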

Why choose it:

  • High performance: Both v1 and v1.5 outperform models like OpenAI CLIP ViT B/16 on ImageNet zero-shot, Datacomp, and MTEB benchmarks.
  • Multimodal latent space: Nomic Embed Vision’s embeddings are fully compatible with Nomic Embed Text, supporting multimodal semantic search across image and text datasets.
  • Compact and efficient architecture: With only 92 million parameters, Nomic Embed Vision pairs well with the 137M-parameter Nomic Embed Text model for high-volume applications.
  • Accessible and auditable: The training code and instructions are open-sourced, allowing researchers to reproduce or customize the model.

Points to be cautious about:

  • Licensing for production use: Nomic Embed Vision is released under the CC-BY-NC-4.0 license, meaning it’s non-commercial. However, as new models are released, Nomic plans to re-license older models under Apache-2.0.

What are the common use cases of embedding models?

Embedding models are important tools that convert text, images, or other data into vector representations, capturing their underlying semantics and structure. This makes them essential in a wide range of AI applications. To name a few:

  • Semantic search: Embeddings allow you to retrieve semantically similar items by encoding content (text, images, etc.) into a vector space where similar items sit close to each other. In search engines, this helps users easily find relevant content (see the sketch after this list).
  • Information retrieval: Embeddings enable AI models to search through large databases for documents or responses relevant to a given query. A typical use case is RAG, where retrieved data helps improve real-time content generation.
  • Clustering and classification: By grouping similar data points in vector space, embeddings make it easy to classify and organize content. For example, you can group customer reviews by sentiment or documents by topic.
  • Recommendation systems: Embeddings help recommendation engines understand user preferences based on the semantic similarities between user interests. This makes it possible to provide more personalized recommendations.
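
To make the semantic search idea concrete, here is a small end-to-end sketch with sentence-transformers (the model choice and corpus are arbitrary placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "How to reset a forgotten password",
    "Best hiking trails near Seattle",
    "Troubleshooting login failures",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("I can't sign in to my account", convert_to_tensor=True)

# Retrieve the top-2 semantically closest documents by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```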

What should I consider when deploying embedding models?

When deploying embedding models, consider these key factors:

  • Performance and accuracy: Choose a model suited to your specific tasks, like retrieval, clustering, or classification. Review benchmarks like MTEB to ensure the model meets the desired accuracy and performance for your use case.
  • Low latency and fast scaling: Real-time applications, such as search engines or chatbots, require fast, low-latency embeddings. If you have diverse traffic patterns, fast autoscaling (especially fast cold starts) is also important. BentoML provides standardized abstractions to build scalable APIs (see the sketch after this list) and lets you run any embedding model on BentoCloud, an inference platform that provides fast and scalable infrastructure for model inference and advanced AI applications.
  • Integration for complicated AI systems: Embedding models can be powerful components within compound AI solutions. A simple example is combining an embedding model with an LLM for a RAG system. BentoML offers a suite of toolkits that simplify building and scaling such AI systems, including multi-model chains, distributed orchestration, and multi-GPU serving.
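
As a sketch of what this can look like, the minimal BentoML service below wraps an embedding model behind an HTTP API; it follows BentoML 1.2+ conventions, so adapt it to the version you use.

```python
import bentoml

@bentoml.service(resources={"cpu": "2"})
class EmbeddingService:
    def __init__(self) -> None:
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

    @bentoml.api
    def embed(self, sentences: list[str]) -> list[list[float]]:
        # Returns one 768-dimensional vector per input sentence
        return self.model.encode(sentences).tolist()
```

You can serve this locally with the bentoml CLI and deploy the same code to BentoCloud when you need autoscaling.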

How can I improve the quality of embeddings?

Improving the quality of embeddings can enhance the performance of tasks like search, classification, and clustering. Common strategies include:

  1. Fine-tune on domain-specific data. Start by fine-tuning your embedding model on data that closely resembles your target domain. This can greatly improve relevance and semantic accuracy, and it is particularly effective for specialized industries like legal, medical, or e-commerce.
  2. Use contrastive learning. This is probably one of the most mentioned techniques in discussions of embedding models. It means training the model to differentiate between similar (positive) and dissimilar (negative) pairs of samples, which helps it capture subtle semantic differences (see the sketch after this list).
  3. Experiment with different embedding dimensions. Different dimensions can impact both quality and resource usage. Lower dimensions may simplify and speed up computations but could lose detail, while higher dimensions often capture richer information at the cost of more storage.
  4. Use multimodal embedding training. For applications that span text, images, or other data types, I highly recommend training the model with multimodal data. This can improve embedding quality by enabling the model to capture cross-modal relationships.
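
For the first two points, here is a minimal sketch of contrastive fine-tuning with sentence-transformers. MultipleNegativesRankingLoss treats the other pairs in each batch as negatives; the training pairs below are made-up placeholders, and a real run needs far more data.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Hypothetical domain-specific (anchor, positive) pairs; real fine-tuning
# needs thousands of such examples
train_examples = [
    InputExample(texts=["how to file a claim", "steps for submitting an insurance claim"]),
    InputExample(texts=["premium due date", "when is my insurance payment due"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other pair in the batch acts as a negative
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("all-mpnet-base-v2-finetuned")
```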

Final thoughts

It is never easy to choose the right embedding model with so many options on the market. I hope this guide has provided clarity on some of the top open-source embedding models. For each model listed, there are often many great variants worth exploring. My best advice is to take advantage of the flexibility of open-source models by fine-tuning them with your own data, which can significantly improve embedding accuracy for your specific needs. Finally, remember that choosing the right deployment tool is just as crucial; it can make all the difference in achieving smooth, scalable, and efficient performance.

If you have questions about embedding models, check out the following resources: