How NVIDIA GPU Acceleration Supercharged Milvus Vector Database
In today’s data-driven world, quickly and accurately searching through vast amounts of unstructured data is crucial for powering cutting-edge AI applications. From generative AI and similarity search to recommendation engines and virtual drug discovery, vector databases have emerged as the backbone technology enabling these advanced capabilities. However, the insatiable demand for real-time indexing and high throughput has continued to push the boundaries of what’s possible with traditional CPU-based solutions.
- Real-time indexing: Vector databases often need to ingest and index new vector data continuously and at a high velocity. Real-time indexing capabilities are essential to keep the database up-to-date with the latest data without creating bottlenecks or backlogs.
- High throughput: Many applications that leverage vector databases, such as recommendation systems, semantic search engines and anomaly detection, require real-time or near-real-time query processing. High throughput enables vector databases to handle a large volume of incoming queries concurrently, delivering low-latency responses to end users or services.
At the heart of vector databases lies a core set of vector operations, such as similarity calculations and matrix operations, which are highly parallelizable and computationally intensive. With their massively parallel architecture comprising thousands of cores capable of executing numerous threads simultaneously, GPUs are an ideal computational engine for accelerating these operations.
Harnessing GPU Acceleration
To address these challenges, NVIDIA developed the CUDA-Accelerated Graph Index for Vector Retrieval (CAGRA), a GPU-accelerated framework that leverages the high-performance capabilities of GPUs to deliver exceptional throughput for vector database workloads.
At NVIDIA GTC 2024, Zilliz and NVIDIA unveiled Milvus 2.4, the world’s first vector database accelerated by powerful GPU indexing and search capabilities. Milvus is an open source vector database system built for large-scale vector similarity search and AI workloads. Initially created by Zilliz, an innovator in the world of unstructured data management and vector database technology, Milvus made its debut in 2019. To encourage widespread community engagement and adoption, it has been hosted by the Linux Foundation since 2020.
Milvus 2.4 harnesses NVIDIA GPUs’ massively parallel computing power and the new CAGRA from the RAPIDS cuVS library. This GPU acceleration enables significant performance gains in Milvus: Benchmarks demonstrate up to 50x faster vector search performance than state-of-the-art CPU-based indexes like Hierarchical Navigable Small Worlds (HNSW).
Exploring the Milvus 2.4 Architecture
Milvus is designed for cloud native environments and follows a modular design philosophy. It separates the system into various components and layers involved in handling client requests, processing data and managing storage and retrieval of vector data. This modular design enables Milvus to update or upgrade the implementation of specific modules without changing their interfaces. This modularity makes it relatively easy to incorporate GPU acceleration support into Milvus.

The Milvus 2.4 architecture
The modular architecture includes components such as the Coordinator, Access Layer, Message Queue, Worker Node and Storage layers. The Worker Node is further subdivided into Data Nodes, Query Nodes and Index Nodes. The Index Nodes are responsible for building indexes, while the Query Nodes handle query execution.
To leverage the benefits of GPU acceleration, CAGRA is integrated into Milvus’ Index and Query Nodes. This integration enables offloading computationally intensive tasks, such as index building and query processing, to GPUs, taking advantage of their parallel processing capabilities.
Within the Index Nodes, CAGRA support has been incorporated into the index-building algorithms, allowing efficient construction and management of high-dimensional vector indexes on GPU hardware. This acceleration significantly reduces the time and resources required for indexing large-scale vector data sets.
Similarly, CAGRA is utilized in the Query Nodes to accelerate the execution of complex vector similarity searches. By leveraging GPU processing power, Milvus can perform high-dimensional distance calculations and similarity searches at unprecedented speeds, resulting in faster query response times and improved overall throughput.
Evaluating Performance
For this evaluation, we utilized three publicly available instance types on AWS:
- m6id.2xlarge: This instance type is powered by the Intel Xeon 8375C CPU.
- g4dn.2xlarge: This GPU-accelerated instance is equipped with an NVIDIA T4 GPU.
- g5.2xlarge: This instance type features the NVIDIA A10G GPU.
By leveraging these diverse instance types, we aimed to evaluate the performance and efficiency of Milvus with CAGRA integration across different hardware configurations. The m6id.2xlarge instance served as a baseline for CPU-based performance, while the g4dn.2xlarge and g5.2xlarge instances allowed us to assess the benefits of GPU acceleration using the NVIDIA T4 and A10G GPUs, respectively.
| Instance | Processor | vCPU | GPU | Memory | GPU mem | Hourly Price |
|---|---|---|---|---|---|---|
| m6id.2xlarge | Intel Xeon 8375C | 8 | – | 32G | – | $0.4746 |
| g4dn.2xlarge | T4 | 8 | 1 | 32G | 16G | $0.752 |
| g5.2xlarge | A10G | 8 | 1 | 32G | 24G | $1.212 |
Evaluation environments, AWS
We used two publicly available vector data sets from VectorDBBench:
- OpenAI-500K-1536-dim: This data set consists of 500,000 vectors, each with a dimensionality of 1,536. It is derived from the OpenAI language model.
- Cohere-1M-768-dim: This data set contains 1 million vectors, each with a dimensionality of 768. It is generated from the Cohere language model.
These data sets were specifically chosen to evaluate the performance and scalability of Milvus with CAGRA integration under different data volumes and vector dimensionalities. The OpenAI-500K-1536-dim data set allows for assessing the system’s performance with a moderately large data set of extremely high-dimensional vectors. In contrast, the Cohere-1M-768-dim data set tests the system’s ability to handle larger volumes of moderately high-dimensional vectors.
Index Building Time
We compared the index-building time between Milvus with the CAGRA GPU acceleration framework and the standard Milvus implementation using the HNSW index on CPUs.

For the Cohere-1M-768-dim data set, the index building times are:
- CPU (HNSW): 454 seconds
- T4 GPU (CAGRA): 66 seconds
- A10G GPU (CAGRA): 42 seconds
For the OpenAI-500K-1536-dim dataset, the index building times are:
- CPU (HNSW): 359 seconds
- T4 GPU (CAGRA): 45 seconds
- A10G GPU (CAGRA): 22 seconds
The results clearly show that CAGRA, the GPU-accelerated framework, significantly outperforms CPU-based HNSW index building, with the A10G GPU being the fastest across both data sets. The GPU acceleration provided by CAGRA reduces the index building time by up to an order of magnitude compared to the CPU implementation, demonstrating the benefits of leveraging GPU parallelism for computationally intensive vector operations like index construction.
Throughput
We also compared the performance between Milvus with the CAGRA GPU acceleration framework and the standard Milvus implementation using the HNSW index on CPUs. The metric we evaluated is queries per second (QPS), which measures the throughput of query execution.
We varied the batch size, representing the number of queries processed concurrently, from 1 to 100 during the evaluation process. This wide range of batch sizes allowed us to conduct a realistic and thorough evaluation, assessing the performance under different query workload scenarios.
Evaluating Throughput
The charts show that:
- For a batch size of 1, the T4 is 6.4x to 6.7x faster than the CPU, and the A10G is 8.3x to 9x faster.
- When the batch size increases to 10, the performance improvement is more significant: T4 is 16.8x to 18.7x faster, and A100 is 25.8x to 29.9x faster.
- With a batch size of 100, the performance gain continues to grow: T4 is 21.9x to 23.3x faster, and A100 is 48.9x to 49.2x faster.
The results demonstrate substantial performance gains from leveraging GPU acceleration for vector database queries, particularly for larger batch sizes and higher-dimensional data. Milvus with CAGRA unlocks the parallel processing capabilities of GPUs, enabling significant throughput improvements and making it well-suited for demanding vector database workloads.
Blazing New Trails
The benchmarks indicate that integration of NVIDIA’s CAGRA GPU acceleration framework into Milvus 2.4 represents a major achievement in vector databases. GPUs’ massively parallel computing power significantly boosted performance for vector indexing and search operations, improving real-time, high-throughput vector data processing.
The Milvus 2.4 collaboration between Zilliz and NVIDIA exemplifies the power of open innovation and community-driven development by bringing GPU acceleration to vector databases.
The open source Milvus 2.4 is available now, and enterprises looking for a fully managed vector database service can look forward to GPU acceleration coming to Zilliz Cloud later this year. Zilliz Cloud provides a seamless experience for deploying and scaling Milvus on major cloud providers like AWS, Google Cloud Platform and Azure without operational overhead.