Requirements: When an enterprise builds a RAG knowledge base, choosing an appropriate embedding model is critical: embedding quality determines retrieval accuracy, which in turn determines how reliable the large model's output will be. Commonly used models: bge, m3e, nomic-embed-text, BCEmbedding (NetEase Youdao).
Why do you need an embedding model?
Computers can only perform numerical operations and cannot directly understand non-numerical data such as natural language text, images, and audio. We therefore need to "vectorize" this data, transforming it into numerical form that computers can process, that is, mapping it into mathematical vector representations. This is usually done with embedding models, which can effectively capture the semantic information and internal structure of the data.
The role of an embedding model is not only to convert discrete data (such as words, image patches, or audio fragments) into continuous, dense vectors, but also to preserve the semantic relationships between data points in the vector space. In natural language processing, for example, embedding models generate word vectors so that semantically similar words end up closer together in the vector space. This efficient representation allows computers to perform complex calculations and analysis on the vectors, and thereby handle complex data such as text, images, or sound.
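As a concrete illustration, here is a minimal sketch using the sentence-transformers library with a checkpoint from the bge family mentioned above (the specific model name and sentences are illustrative assumptions): it embeds three sentences and shows that the semantically similar pair scores higher under cosine similarity.

```python
from sentence_transformers import SentenceTransformer, util

# Load an embedding model (bge is one of the families listed above;
# any sentence-transformers-compatible checkpoint works the same way).
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

sentences = [
    "The cat sits on the mat.",
    "A kitten is resting on the rug.",
    "The stock market fell sharply today.",
]

# Map each sentence to a dense vector.
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity: semantically close sentences score higher.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high (similar meaning)
print(util.cos_sim(embeddings[0], embeddings[2]))  # low  (unrelated topic)
```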
Through the vectorization provided by embedding models, computers can not only process large-scale data efficiently, but also achieve stronger performance and generalization across tasks such as classification, retrieval, and generation.
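Retrieval in a RAG pipeline, for instance, reduces to a nearest-neighbor search over these vectors. A minimal sketch (the checkpoint and toy corpus are illustrative assumptions, not a prescribed setup):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# A toy corpus standing in for a real knowledge base.
corpus = [
    "Embedding models map text to dense vectors.",
    "RAG retrieves relevant passages before generation.",
    "GPUs accelerate matrix multiplication.",
]
corpus_vecs = model.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    q = model.encode(query, normalize_embeddings=True)
    scores = corpus_vecs @ q          # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]     # indices of the k highest scores
    return [corpus[i] for i in top]

print(retrieve("How does retrieval-augmented generation work?"))
```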
Embedding model evaluation
Judging the quality of an embedding model requires a clear set of criteria; MTEB and C-MTEB are the benchmarks commonly used for this.
MTEB
Hugging Face hosts MTEB (Massive Text Embedding Benchmark), a widely recognized industry standard that can serve as a reference. It covers 8 embedding task types, 58 datasets, and 112 languages, making it the most comprehensive text embedding benchmark to date.
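The benchmark is also distributed as the mteb Python package, so you can score your own model locally. A minimal sketch, assuming the package's long-standing MTEB interface and the Banking77Classification task name (the API has evolved across versions, so check the current docs):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any sentence-transformers-compatible model can be evaluated.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Run a single task here; omitting `tasks` runs the full benchmark.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/bge-small-en-v1.5")
```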
C-MTEB
C-MTEB is the most comprehensive Chinese semantic vector evaluation benchmark, covering 6 categories of evaluation tasks (retrieval, re-ranking, sentence similarity, pair classification, classification, and clustering) across 35 datasets.
Note that many C-MTEB addresses circulating online are outdated; refer to the official repository for the current paper, code, and leaderboard.
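C-MTEB tasks are distributed through the same mteb framework, so a Chinese model can be scored in the same way. A sketch, assuming the T2Retrieval task name used by C-MTEB and a Chinese bge checkpoint (verify both against the current task list):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# bge-base-zh is one of the Chinese checkpoints in the bge family above.
model = SentenceTransformer("BAAI/bge-base-zh-v1.5")

# T2Retrieval is one of the C-MTEB retrieval datasets (name assumed here;
# see the C-MTEB task list for the full set of 35 datasets).
evaluation = MTEB(tasks=["T2Retrieval"])
results = evaluation.run(model, output_folder="results/bge-base-zh-v1.5")
```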