Requirements: When an enterprise builds a RAG knowledge base, choosing an appropriate embedding model is critical: embedding quality determines retrieval accuracy, which in turn determines how reliable the large model's output will be. Commonly used models: bge, m3e, nomic-embed-text, BCEmbedding (NetEase Youdao).
Why do you need an embedding model?
Computers can only perform numerical operations and cannot directly understand non-numerical data such as natural language text, images, and audio. We therefore need to "vectorize" this data, transforming it into numerical form that computers can process, that is, mapping it into mathematical vector representations. This is usually done with embedding models, which can effectively capture the semantic information and internal structure of the data.
The role of an embedding model is not only to convert discrete data (such as words, image patches, or audio fragments) into continuous, dense vectors, but also to preserve the semantic relationships between data points in the vector space. In natural language processing, for example, embedding models generate word vectors so that semantically similar words end up closer together in the vector space. This efficient representation allows computers to perform complex calculations and analysis on the vectors, and thereby handle complex data such as text, images, or sound.
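As a concrete illustration, here is a minimal sketch using the sentence-transformers library with a checkpoint from the bge family mentioned above (the specific model name and sentences are illustrative assumptions): it embeds three sentences and shows that the semantically similar pair scores higher under cosine similarity.

```python
from sentence_transformers import SentenceTransformer, util

# Load an embedding model (bge is one of the families listed above;
# any sentence-transformers-compatible checkpoint works the same way).
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

sentences = [
    "The cat sits on the mat.",
    "A kitten is resting on the rug.",
    "The stock market fell sharply today.",
]

# Map each sentence to a dense vector.
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity: semantically close sentences score higher.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high (similar meaning)
print(util.cos_sim(embeddings[0], embeddings[2]))  # low  (unrelated topic)
```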
Through the vectorization provided by embedding models, computers can not only process large-scale data efficiently, but also achieve stronger performance and generalization across tasks such as classification, retrieval, and generation.
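Retrieval in a RAG pipeline, for instance, reduces to a nearest-neighbor search over these vectors. A minimal sketch (the checkpoint and toy corpus are illustrative assumptions, not a prescribed setup):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# A toy corpus standing in for a real knowledge base.
corpus = [
    "Embedding models map text to dense vectors.",
    "RAG retrieves relevant passages before generation.",
    "GPUs accelerate matrix multiplication.",
]
corpus_vecs = model.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    q = model.encode(query, normalize_embeddings=True)
    scores = corpus_vecs @ q          # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]     # indices of the k highest scores
    return [corpus[i] for i in top]

print(retrieve("How does retrieval-augmented generation work?"))
```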
Embedding model evaluation
Judging the quality of an embedding model requires a clear set of criteria; MTEB and C-MTEB are the benchmarks commonly used for this.
MTEB
Hugging Face hosts MTEB (Massive Text Embedding Benchmark), a widely recognized industry standard that can serve as a reference. It covers 8 embedding task types, 58 datasets, and 112 languages, making it the most comprehensive text embedding benchmark to date.
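The benchmark is also distributed as the mteb Python package, so you can score your own model locally. A minimal sketch, assuming the package's long-standing MTEB interface and the Banking77Classification task name (the API has evolved across versions, so check the current docs):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any sentence-transformers-compatible model can be evaluated.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Run a single task here; omitting `tasks` runs the full benchmark.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/bge-small-en-v1.5")
```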
C-MTEB
C-MTEB is the most comprehensive Chinese semantic vector evaluation benchmark, covering 6 categories of evaluation tasks (retrieval, re-ranking, sentence similarity, pair classification, classification, and clustering) across 35 datasets.
Note that many C-MTEB addresses circulating online are outdated; refer to the official repository for the current paper, code, and leaderboard.
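C-MTEB tasks are distributed through the same mteb framework, so a Chinese model can be scored in the same way. A sketch, assuming the T2Retrieval task name used by C-MTEB and a Chinese bge checkpoint (verify both against the current task list):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# bge-base-zh is one of the Chinese checkpoints in the bge family above.
model = SentenceTransformer("BAAI/bge-base-zh-v1.5")

# T2Retrieval is one of the C-MTEB retrieval datasets (name assumed here;
# see the C-MTEB task list for the full set of 35 datasets).
evaluation = MTEB(tasks=["T2Retrieval"])
results = evaluation.run(model, output_folder="results/bge-base-zh-v1.5")
```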