【AI】(14) A brief introduction to open source vector databases

Little scum · Posted on 3/25/2025 11:29:25 AM

Requirements: In the previous post we worked through how to choose an embedding model. Once a model has converted data into vectors, those vectors need to be stored somewhere. There are many vector databases to choose from, such as LanceDB, Astra DB, Pinecone, Chroma, Weaviate, Qdrant, Milvus, Zilliz, PGVector, Redis, Elasticsearch, FAISS, and SQL Server 2025.

What is a vector database?

A vector database is an organized collection of vector embeddings that can be created, read, updated, and deleted at any time. Vector embeddings represent blocks of data, such as text or images, as numeric values. In other words, a vector database is a database system designed to store and retrieve high-dimensional vectors. It finds the stored vectors closest to a query vector by calculating the similarity between vectors (for example cosine similarity or Euclidean distance). This technique is often used for embedding-based data, such as feature representations of text, images, audio, or video.
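To make the similarity calculation concrete, here is a minimal sketch in plain Python of the two metrics mentioned above, plus a brute-force search that scores every stored vector. The three-dimensional embeddings and the item names are made up for illustration; real embeddings usually have hundreds or thousands of dimensions, and real vector databases use indexes rather than scanning everything.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, vectors, top_k=2):
    """Brute-force search: rank every stored key by cosine similarity to the query."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical 3-dimensional embeddings, invented for this example.
embeddings = {
    "running shoe": [0.9, 0.1, 0.0],
    "hiking boot": [0.7, 0.3, 0.1],
    "coffee mug": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]

print(nearest(query, embeddings))
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

The two shoe-like items rank ahead of the mug because their vectors point in nearly the same direction as the query.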

A vector database is a collection of data stored in mathematical form. Vector databases make it easier for machine learning models to remember previous inputs, enabling machine learning to be used to support use cases such as search, recommendation, and text generation. Data can be identified based on similarity metrics rather than exact matches, allowing computer models to understand the context of the data.

When a customer visits a shoe store, the salesperson may recommend shoes similar to the pair the customer likes. Similarly, an e-commerce store may recommend similar items under headings such as "Customers also bought...". Vector databases enable machine learning models to identify similar objects, just as a salesperson can find similar shoes and an e-commerce store can recommend related products. (In fact, e-commerce stores may use such machine learning models to do exactly this.)

In conclusion, vector databases enable computer programs to make comparisons, identify relationships, and understand context. This makes it possible to create advanced artificial intelligence (AI) programs such as large language models (LLMs).

Chroma

Chroma is an efficient, Python-based, open-source database for large-scale similarity search, designed in particular for high-dimensional data. Several deployment options are available: embedded (serverless, running inside the application), self-hosted client-server, and a cloud-native distributed SaaS offering.
It works well for prototyping and can also be used in production. Because its default data storage is ephemeral, Chroma is well suited to rapid prototyping in scripts. With a simple setup, users can create collections and reuse them later, which makes it easy to add more data over time. Chroma can also load and save data automatically: when the client starts, it loads the user's data, and when the client closes, the data is saved. This simplifies data management considerably and has made Chroma popular during the prototyping and development phases.
Chroma received a seed round of funding in May 2022, followed by a second round of $18 million.

Pros: Chroma offers clients for more than a dozen programming languages, can stand up vector storage quickly, and was the first vector database on the market to offer an embedded mode by default. It is relatively developer-friendly and easy to integrate.
Cons: The feature set is relatively simple, which can be limiting for applications that need more complex functionality. Only CPU compute is supported, which may limit performance gains in workloads that need significant compute resources.
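The load-on-start, save-on-close behaviour described for Chroma can be pictured with a toy store. This is only a sketch of the pattern, assuming a JSON file as the storage format: `TinyStore` and its `get_or_create_collection` method are hypothetical names written for this example, not Chroma's API or its on-disk format.

```python
import json
import os
import tempfile

class TinyStore:
    """Hypothetical toy store: loads collections on open, saves them on close."""

    def __init__(self, path):
        self.path = path
        self.collections = {}

    def __enter__(self):
        # On start-up, load whatever an earlier session saved.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                self.collections = json.load(f)
        return self

    def __exit__(self, exc_type, exc, tb):
        # On close, write everything back to disk.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.collections, f)
        return False

    def get_or_create_collection(self, name):
        """Reuse an existing collection or create an empty one."""
        return self.collections.setdefault(name, {})

path = os.path.join(tempfile.mkdtemp(), "store.json")

# First session: create a collection and add one vector.
with TinyStore(path) as store:
    docs = store.get_or_create_collection("docs")
    docs["a"] = [0.1, 0.2]

# Second session: the collection is loaded again and can be extended.
with TinyStore(path) as store:
    docs = store.get_or_create_collection("docs")
    docs["b"] = [0.3, 0.4]
    reloaded_ids = sorted(docs)

print(reloaded_ids)
```

The second session sees the vector written by the first one without any explicit load or save call, which is the convenience the description above refers to.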

LanceDB

LanceDB is an open-source vector database designed for multimodal AI data. It stores, manages, queries, and retrieves embeddings of large-scale multimodal data. Its core is written in Rust and built on Lance, a columnar data format optimized for fast random access and for managing AI datasets such as vectors, documents, and images. It suits AI applications that need to process high-dimensional vector data, such as image recognition, natural language processing, and recommendation systems. LanceDB is available in two modes: embedded and as a cloud-hosted service.

Pros: LanceDB eliminates the need to manage servers, which lowers operations and maintenance costs for developers and improves development efficiency. It is optimized for multimodal data and supports data types such as images, text, and audio, which makes the database more efficient when handling complex data. It provides a friendly API and visualization tools, so developers can integrate and use the database easily.
Cons: It was only launched in 2023, so it is a very new database and is not yet mature in terms of features or community.

PGVector

PGVector is a PostgreSQL extension that adds vector storage and query capabilities. It is implemented in C, provides several vector data types and algorithms, and can efficiently store and query AI embeddings represented as vectors. PGVector supports both exact and approximate nearest neighbor search, giving quick access to similar data points in high-dimensional space. It also supports several distance measures, such as L2 distance, inner product, and cosine distance. It is a good fit when vector search is not the core of the system, or when a project needs to launch quickly in its early stages.
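The difference between exact and approximate nearest neighbor search can be sketched in plain Python. The approximate variant below is loosely modeled on inverted-file indexing (vectors are grouped into lists around centroids, and only the closest lists are scanned); it is a toy illustration, not PGVector's implementation. The vectors and the hand-picked centroids are assumptions made for the example.

```python
import math

def l2(a, b):
    """L2 (Euclidean) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_search(query, vectors, k):
    """Exact search: compare the query with every stored vector."""
    return sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))[:k]

def build_lists(vectors, centroids):
    """Assign each vector index to the list of its closest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        closest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[closest].append(i)
    return lists

def approx_search(query, vectors, centroids, lists, k, probes=1):
    """Approximate search: scan only the `probes` lists closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

# Made-up 2-dimensional data; centroids are fixed by hand to keep the sketch short.
vectors = [[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.5, 0.4]]
centroids = [[0.0, 0.0], [1.0, 1.0]]
lists = build_lists(vectors, centroids)
query = [0.6, 0.6]

exact = exact_search(query, vectors, k=2)
fast = approx_search(query, vectors, centroids, lists, k=2, probes=1)
thorough = approx_search(query, vectors, centroids, lists, k=2, probes=2)
print(exact, fast, thorough)
```

With one probe the search misses the true nearest vector (index 4), because it sits in the list that was not scanned; with two probes every list is scanned and the result matches the exact search. That is the usual trade-off: fewer probes means less work but lower recall.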

Pros: PGVector integrates seamlessly into existing PostgreSQL databases, so users can start using vector search without migrating their data. Because it is a PostgreSQL extension, PGVector inherits the reliability and robustness that come from PostgreSQL's long history of development and optimization, while adding vector processing on top.
Cons: Compared with dedicated vector databases, its performance and resource utilization are somewhat less optimized.

Qdrant

Qdrant is an open-source vector database and cloud-hosted service launched in 2021 and designed for next-generation AI applications. It provides convenient APIs to store, search, and manage points (that is, vectors) with an attached payload, which is used for extended filtering. Several index types, including payload indexes, full-text indexes, and vector indexes, allow it to handle high-dimensional data efficiently. Qdrant also uses a custom HNSW algorithm for fast and accurate search and allows results to be filtered by the payload attached to each vector. These features make Qdrant useful for neural-network or semantic matching, faceted search, and similar applications. Qdrant's strength lies in semantic search and similarity matching, which make it straightforward to implement scenarios such as image, voice, and video search, as well as recommendation systems.
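The idea of points that carry a payload, with search results filtered by that payload, can be sketched as follows. This is a brute-force toy in plain Python: the `Point` class, the `search` function, and its `must` parameter are hypothetical names chosen for this example and are not Qdrant's client API, and no HNSW index is involved.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Point:
    """A vector with an id and an arbitrary payload used for filtering."""
    id: int
    vector: list
    payload: dict = field(default_factory=dict)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(points, query, must=None, limit=3):
    """Return ids of the most similar points whose payload matches every key in `must`."""
    must = must or {}
    matching = [
        p for p in points
        if all(p.payload.get(key) == value for key, value in must.items())
    ]
    matching.sort(key=lambda p: cosine(query, p.vector), reverse=True)
    return [p.id for p in matching[:limit]]

# Made-up points: two images and one video.
points = [
    Point(1, [0.9, 0.1], {"type": "image"}),
    Point(2, [0.8, 0.2], {"type": "video"}),
    Point(3, [0.1, 0.9], {"type": "image"}),
]
query = [1.0, 0.0]

print(search(points, query))
print(search(points, query, must={"type": "image"}))
```

Without a filter the video (id 2) ranks second; with the payload filter it is excluded before ranking, so only the two images are returned.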

Pros: Excellent documentation helps developers get up and running with Docker easily. It is built entirely in Rust and offers APIs through Rust, Python, and Golang clients, which are among the most popular languages with backend developers today. Qdrant supports various optimization strategies, such as index optimization and query optimization. It also supports distributed deployment and horizontal scaling for large-scale data processing.
Cons: The project is relatively new and has not yet had much time to prove itself. As business volume grows, it can only scale horizontally at the service level, and only static sharding is supported. According to a report by Zilliz, query efficiency may suffer as the number of unstructured data elements stored in the database grows large.

Milvus/Zilliz Cloud

Milvus is an open-source, purpose-built vector database released in 2019. It builds on well-known vector search libraries and algorithms such as FAISS, Annoy, and HNSW, and is optimized for scenarios that require fast similarity search. Zilliz Cloud is a cloud-native vector database service based on Milvus that aims to make management and scaling more convenient and performant. In short, Zilliz Cloud is the commercial, cloud-hosted version of Milvus, a business model that has proven successful in the database field.

Pros: Because it has been part of the vector database ecosystem for a long time, the database is very mature and offers a large number of algorithms. Many vector indexing options are available, and it is built from the ground up in Golang for extreme scalability. As of 2023, it was the only mainstream vendor offering a viable DiskANN implementation, which is said to be the most efficient on-disk vector index.
Cons: Milvus goes all out on scalability. It scales through a combination of proxies, load balancers, message brokers, Kafka, and Kubernetes, which makes the whole system very complex and resource-intensive. Its client-side APIs, such as the Python one, are also less readable and intuitive than those of newer databases like Weaviate and Qdrant, which tend to focus more on the developer experience. Milvus is built around the idea of streaming data into vector indexes for massive scalability, so in many cases it is overkill when the amount of data is not very large. For large datasets that are more static and updated infrequently, alternatives like Qdrant or Weaviate may be cheaper and faster to get running in production.

Other

Redis
Pinecone
Weaviate
FAISS
Elasticsearch
SQL Server
