
AI (11) Selecting an embedding model

Posted on 2025-3-14 23:01:35
Requirements: When an enterprise builds a RAG knowledge base, it is important to choose an appropriate embedding model, because the quality of the embeddings determines retrieval accuracy and, indirectly, the reliability of the large model's output. Commonly used models include bge, m3e, nomic-embed-text, and BCEmbedding (NetEase Youdao).

Why do you need an embedding model?

Computers can only operate on numbers and cannot directly understand non-numerical data such as natural language text, images, and audio. We therefore need to "vectorize" such data, transforming it into a numerical form that computers can process, that is, mapping it to mathematical vector representations. This is usually done with an embedding model, which captures the semantic information and internal structure of the data.

The value of an embedding model is that it not only converts discrete data (such as words, image patches, or audio clips) into continuous, relatively low-dimensional vectors, but also preserves the semantic relationships between the data in vector space. In natural language processing, for example, an embedding model produces word vectors in which semantically similar words lie closer together. This representation lets computers perform calculations and analysis directly on the vectors and thus better understand and process complex data such as text, images, or sound.

Through the vectorization provided by embedding models, computers can not only process large-scale data efficiently, but also achieve stronger performance and generalization across tasks such as classification, retrieval, and generation. A minimal sketch of this vectorization step is shown below.
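The sketch below uses the sentence-transformers library to embed three sentences and compare their similarities; the model ID BAAI/bge-small-en-v1.5 is just one possible choice, not something prescribed by this post, and any of the embedding models mentioned here could be substituted.

```python
# Minimal vectorization sketch: embed sentences and compare them in vector space.
# Assumes `pip install sentence-transformers`; the model ID is an arbitrary choice.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

sentences = [
    "How do I build a RAG knowledge base?",
    "Steps for constructing a retrieval-augmented generation knowledge base",
    "The weather is nice today",
]

# normalize_embeddings=True makes the dot product equal to cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)

print("similar pair:  ", float(embeddings[0] @ embeddings[1]))  # noticeably higher
print("unrelated pair:", float(embeddings[0] @ embeddings[2]))  # noticeably lower
```

Semantically related sentences should score much higher than the unrelated pair, which is exactly the property a RAG retriever relies on.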

Embedding model evaluation

To judge the quality of an embedding model, you need a clear set of criteria. MTEB and C-MTEB are the benchmarks commonly used for this.

MTEB

Hugging Face hosts MTEB (Massive Text Embedding Benchmark), a widely recognized evaluation standard in the industry that can be used as a reference. It covers 8 embedding task types across 58 datasets and 112 languages, making it the most comprehensive text embedding benchmark to date.



Leaderboard: (link visible after login)
GitHub address: (link visible after login)
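For reference, here is a minimal sketch of scoring a model on a single MTEB task with the mteb Python package; the task name and output folder are arbitrary examples, and the package API has changed between versions, so treat this as an illustration rather than the canonical invocation.

```python
# Hedged sketch of running one MTEB task; API details may differ by mteb version.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")     # model under evaluation
evaluation = MTEB(tasks=["Banking77Classification"])      # one illustrative task
results = evaluation.run(model, output_folder="results/bge-small-en-v1.5")
print(results)
```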



C-MTEB

C-MTEB is the most comprehensive Chinese semantic vector evaluation benchmark, covering 6 categories of evaluation task (retrieval, reranking, sentence similarity, pair classification, classification, and clustering) across 35 datasets.

C-MTEB paper: (link visible after login)
Code and leaderboard: (link visible after login) (many addresses circulating online are outdated)





Original poster | Posted on 2025-3-17 08:55:55
Arctic Embed 2.0

Snowflake has announced the release of Arctic Embed L 2.0 and Arctic Embed M 2.0, the next iteration of its embedding models, now with support for multilingual search. Announcement: (link visible after login)

Model download

Arctic Embed L 2.0: (link visible after login)
Arctic Embed M 2.0: (link visible after login)
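A rough usage sketch via sentence-transformers follows; the Hugging Face model ID and the "query" prompt name are based on my reading of the model card and should be verified against the links above.

```python
# Hedged sketch: retrieval-style usage of Arctic Embed L 2.0 (model ID assumed).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

queries = ["what is an embedding model"]
documents = ["An embedding model maps text to dense vectors for semantic search."]

# Queries are encoded with the model's "query" prompt; documents are encoded as-is.
query_emb = model.encode(queries, prompt_name="query", normalize_embeddings=True)
doc_emb = model.encode(documents, normalize_embeddings=True)

print(query_emb @ doc_emb.T)   # cosine similarity between query and document
```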

Original poster | Posted on 2025-3-17 16:30:21
BCEmbedding is a library of bilingual and cross-lingual semantic representation models developed by NetEase Youdao. It includes two base models: EmbeddingModel, which generates semantic vectors and plays a central role in semantic search and Q&A, and RerankerModel, which refines search results by reranking them by semantic relevance. A short usage sketch follows the links below.

GitHub: (link visible after login)

EmbeddingModel: (link visible after login)
RerankerModel: (link visible after login)
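The sketch below follows my reading of the BCEmbedding README; the model IDs (maidalun1020/bce-embedding-base_v1 and maidalun1020/bce-reranker-base_v1) are assumptions and should be checked against the repository above.

```python
# Hedged sketch of the two BCEmbedding base models (model IDs assumed).
from BCEmbedding import EmbeddingModel, RerankerModel

# EmbeddingModel: turn text into semantic vectors for retrieval / Q&A.
embedder = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")
embeddings = embedder.encode(["sentence one", "sentence two"])

# RerankerModel: re-score query/passage pairs to refine the retrieved ranking.
reranker = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
scores = reranker.compute_score([
    ["query text", "a candidate passage about the query"],
    ["query text", "an unrelated candidate passage"],
])

print(embeddings.shape, scores)
```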

Original poster | Posted on 2025-3-18 10:07:55
Model name | Version           | Organization/Individual                            | Address                    | Embedding dim | Max input length
gte        | gte-large-zh      | Alibaba DAMO Academy                               | (link visible after login) | 1024          | 512
bge        | bge-large-zh-v1.5 | Beijing Academy of Artificial Intelligence (BAAI)  | (link visible after login) | 1024          | 512
m3e        | m3e-base          | Moka                                               | (link visible after login) | 768           | 512
tao8k      | tao8k             | developed and open-sourced on Hugging Face by amu  | (link visible after login) | 1024          | 512
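As a sanity check of the embedding-dimension and max-input-length columns, the sketch below reads the Hugging Face configs of two of the models; the model IDs BAAI/bge-large-zh-v1.5 and moka-ai/m3e-base are assumptions, so adjust them to whatever you actually download.

```python
# Hedged sketch: read embedding dimension and max input length from model configs.
from transformers import AutoConfig, AutoTokenizer

for model_id in ["BAAI/bge-large-zh-v1.5", "moka-ai/m3e-base"]:
    config = AutoConfig.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(
        model_id,
        "embedding dim:", config.hidden_size,             # expect 1024 / 768
        "max input tokens:", tokenizer.model_max_length,  # expect 512
    )
```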
