KIOXIA AiSAQ Achieves 4.8 Billion High-Dimensional Vector Search on a Single Server, with 7.8x Index Build Time Acceleration via GPUs

KIOXIA Achieves 4.8B RAG Vector Search on a Single Server

Abstract
As enterprises push RAG systems to ingest ever-growing knowledge sources, two barriers dominate: scaling cost and index build time. Traditional ANNS solutions depend on DRAM for index search, which makes cost scale prohibitively, and require days to build large indexes, which makes billion-scale RAG impractical.

Our key achievements

Massive 4.8 Billion RAG Search Using Only a Single Node:
57 ms P95 latency, 200 QPS throughput at 90% recall@10
End-to-End Index Build Time Drastically Reduced by 7.8x with GPU Acceleration:
CPU: 31 days → 4 x H100 GPUs: 4 days
All-in-Storage ANN: Truly Independent of DRAM Capacity Constraints:
KIOXIA AiSAQ™ technology stores all data structures on SSDs, enabling up to 7x lower cost vs. DiskANN with minimal DRAM footprint

These results mark a turning point for enterprise RAG deployments: enterprises can now achieve scalable, cost-effective¹ and performant vector search at the 5-billion scale and beyond.

Introduction
As large language models (LLMs) take on more knowledge-intensive tasks, ensuring factual accuracy and relevance has become increasingly important. RAG (Retrieval-Augmented Generation) enables LLMs to use external knowledge sources to verify and ground their outputs. Such sources can include accurate, up-to-date information that was not part of the dataset used to train the LLM, or massive amounts of a company’s private information in enterprise RAG applications. The need for high-scale RAG continues to grow, and it is further intensified by Gen-AI developments such as agentic AI and reasoning models, which require more data to verify their responses. However, many current ANNS (Approximate Nearest Neighbor Search) solutions used in RAG systems either are based entirely on DRAM (e.g. HNSW) or include DRAM-based data structures (e.g. DiskANN). This increases the cost of RAG at high scale, which limits the practical size of RAG datasets and their effectiveness at grounding outputs.

KIOXIA developed AiSAQ software, an open-source, all-in-storage ANN search solution that enables high scaling for RAG and semantic search applications. AiSAQ was recently integrated into Milvus, a leading open-source vector database with a large application development community.

The AiSAQ index is based on the Vamana graph (the same graph used for DiskANN). Graph index build is a computationally intensive task, and as dataset size scales, index build time becomes prolonged and may affect index freshness. To accelerate index build, NVIDIA developed cuVS, a GPU-accelerated library for vector search.

This blog describes RAG at high scales, based on AiSAQ all-in-storage ANNS and NVIDIA cuVS GPU-accelerated index build. The solution is demonstrated by benchmarking AiSAQ index build and search performance for a 4.8 billion, high-dimensionality dataset, and showing how it cost-effectively exceeds RAG application requirements when deployed in a production-grade Milvus vector database.

AiSAQ (All-in-Storage ANNS with Product Quantization)
ANNS is a core component in a RAG system that searches through external information sources (after their data is represented as vector embeddings) to find the pieces of information most relevant to the user’s query. These information elements are then fed as context along with the original query to the LLM. There are different types of ANNS solutions that can be categorized based on the media used to store their data structures, as depicted in Figure 1.

Figure 1: ANNS Solutions and Search Media

DRAM-based ANNS solutions like HNSW store all data structures in DRAM. Hybrid solutions like DiskANN store most of the data-structures in SSD, while keeping a quantized version of the embedded vectors (PQ - Product Quantized vectors) in DRAM. As dataset size increases, the cost of the DRAM data structures dominates the solution’s cost, and in turn limits scale.

AiSAQ is an all-in-storage ANNS solution, with data-structures residing in SSD - resulting in near-zero DRAM consumption. Innovative algorithms are used to optimize the arrangement of the data-structures to minimize SSD access during search, thereby achieving low search latency. Therefore, AiSAQ enables economic scaling of vector databases, not limited by DRAM.

AiSAQ has two configurations: a performance configuration (AiSAQ-P), which provides low latency and high throughput - catering to online semantic search applications, and a scalability configuration (AiSAQ-S) which trades off performance for higher scale - catering to RAG and offline semantic search applications. AiSAQ provides the flexibility to choose either configuration, or to tune to the optimal point in between maximal scale and maximal performance, based on the application’s requirements.

AiSAQ-P keeps all data-structures on SSD. Neighbor PQ vectors of each node are collocated inline with the node’s index entry, and visiting a node requires only a single SSD read (similarly to DiskANN). This effective SSD access scheme achieves low latency and high throughput. Collocating the neighboring PQ vectors inline in the index data-structure results in duplications of the PQ vectors, and therefore the amplification of the SSD footprint, which facilitates a trade-off between search performance and cost. This trade-off can be controlled by tuning the number of inline vectors collocated in the index.

For ANNS with DRAM components, the dataset size that can be served in a query node is limited by the server’s DRAM capacity (typically around a few TB). In contrast, an all-in-storage ANNS is limited by the storage capacity, which can be two or more orders of magnitude larger. Therefore, compared with DiskANN, AiSAQ-P enables significantly larger vertical scaling for online semantic search applications.
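To make the DRAM limit concrete, the back-of-envelope sketch below estimates the DRAM a hybrid index would need just to hold one PQ vector per dataset vector. It borrows, purely for illustration, the 256-byte PQ vector size, the 4.8 billion vector count and the 768 GiB server DRAM from the benchmark later in this post; a real DiskANN deployment may choose a different PQ size, so treat the result as an order-of-magnitude indication only.

```python
# Rough DRAM estimate for keeping one PQ vector per dataset vector in memory.
# Assumption: 256-byte PQ vectors and 4.8 billion vectors (benchmark values).

def pq_dram_bytes(num_vectors: int, pq_bytes: int) -> int:
    """Bytes needed to hold every PQ vector in DRAM."""
    return num_vectors * pq_bytes

need = pq_dram_bytes(4_800_000_000, 256)
server_dram = 768 * 2**30            # DRAM of the benchmark server, in bytes

print(f"{need / 1e12:.2f} TB")       # 1.23 TB
print(need > server_dram)            # True
```

Under these assumptions the PQ vectors alone would exceed the DRAM of the benchmark server, before any other in-memory structure is counted.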

AiSAQ-S keeps the PQ vectors in a separate data structure without duplications. This significantly reduces the SSD footprint and the cost. Visiting a node requires multiple SSD accesses to read the neighbor PQ vectors. To reduce this overhead, AiSAQ further optimizes the SSD access scheme:

PQ vectors rearrangement: the PQ vectors data structure is rearranged such that neighboring vectors in the graph have a higher chance to be located adjacently in the SSD, which improves locality and reduces SSD accesses
PQ vector cache in DRAM: uses a small cache to store the latest accessed PQ data to further leverage the locality of the rearrangement scheme in SSD, and reduce accesses
Multiple entry points: candidate entry points are computed during index build, and selected during search to reduce average number of graph hops
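The PQ vector cache described above can be pictured with a minimal sketch. The code below is not the AiSAQ implementation: it assumes a least-recently-used (LRU) eviction policy, reads "2MB" as 2 MiB, and uses a hypothetical `read_from_ssd` callback in place of a real SSD read. It only illustrates how a small cache can absorb repeated accesses to the same PQ data.

```python
from collections import OrderedDict

class PQCache:
    """Tiny LRU cache keyed by node ID; values are PQ codes (bytes)."""

    def __init__(self, capacity_bytes, pq_bytes):
        self.max_entries = capacity_bytes // pq_bytes
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, node_id, read_from_ssd):
        if node_id in self._data:
            self._data.move_to_end(node_id)   # mark as most recently used
            self.hits += 1
            return self._data[node_id]
        self.misses += 1
        code = read_from_ssd(node_id)         # hypothetical SSD read callback
        self._data[node_id] = code
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)    # evict least recently used
        return code

cache = PQCache(capacity_bytes=2 * 1024 * 1024, pq_bytes=256)
fake_ssd = lambda node_id: bytes(256)         # stand-in for a real SSD read
for node in [1, 2, 1, 3, 1]:
    cache.get(node, fake_ssd)
print(cache.max_entries, cache.hits, cache.misses)  # 8192 2 3
```

With 256-byte PQ vectors, a cache of this size holds 8192 entries, so its benefit depends on how well the rearrangement keeps graph neighbors close together.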

This reduced SSD footprint is manifested in reduced search media cost (the combined cost of SSD and DRAM used for ANNS). AiSAQ-S reduces search media cost by 4x vs. DiskANN when TLC SSDs are used, and by 7x for QLC SSDs¹. In other words, for the same dollar budget, AiSAQ supports up to a 7x larger dataset compared with DiskANN.

System Optimization for RAG
In the RAG application pipeline, queries are embedded and sent to an ANNS server, which finds the indices of the embedded vectors closest to the query vector. Next, the corresponding information elements, such as text paragraphs, are retrieved and fed to the LLM for inference in GPU rack servers, as depicted in Figure 2.

Figure 2: RAG Application Pipeline

The latency in the pipeline is dominated by LLM inference latency, which may be up to a few seconds. Therefore, the latency requirements for the ANNS phase can be relaxed to about 100 ms without impacting user-experience. AiSAQ-S leverages this relaxed latency to provide higher scale while meeting or exceeding latency requirements. From a throughput perspective, providing throughput of several hundreds of QPS (Queries per Second) is adequate for driving multi-GPU LLM inferencing racks.

Integration to Milvus
Vector databases are the framework used in practice to deploy ANNS applications. Milvus is a leading open-source vector database designed to manage large-scale datasets, and is a popular solution chosen by many enterprises to deploy their Gen-AI applications.

Milvus recently integrated AiSAQ as one of its selectable indexes. The integrated solution provides a production-grade, fully featured vehicle, supporting the complete index life cycle - including index search and build, index update and hybrid search (filtering).

Cost-Effective Solution for High Scale RAG
The demand for high scale RAG grows as enterprises wish to index, search and reason over increasingly larger volumes of information in multi-modal formats. However, as dataset sizes increase, more sophisticated and efficient schemes must be deployed to address the index search and index build challenges prevalent at high scale. We describe RAG at high scales, based on AiSAQ all-in-storage ANNS and NVIDIA cuVS GPU-accelerated index build.

Global index and GPU index build
Vamana Graph-based indexing empirically demonstrates computational complexity that grows logarithmically with dataset size. Theoretically, using a single monolithic index would provide the most efficient search performance. However, there are practical considerations that limit the index size, including the substantial DRAM required for index build, complexity of index updates, system limitations on segment size in LSM (Log Structured Merge) vector DBs such as Milvus, etc. Therefore, in practice, large datasets are sharded into multiple smaller segments.

In the basic scheme, search is run on each segment to find the top-k vectors closest to the query, followed by an all-reduce phase which finds the top-k vectors across all segments. However, this scheme of blindly searching across all segments is inefficient in the sense that search complexity scales linearly with the number of segments. Assuming a fixed maximal segment size, search complexity scales linearly with dataset size.
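The basic scheme can be sketched in a few lines. In this illustrative example a brute-force scan stands in for the per-segment ANN index, and the segment contents are toy data; the point is only that every segment is searched, so the work grows with the number of segments.

```python
import heapq

def search_segment(segment, query, k):
    """Brute-force top-k in one segment (stand-in for a per-segment ANN index)."""
    scored = ((sum((a - b) ** 2 for a, b in zip(vec, query)), vid)
              for vid, vec in segment)
    return heapq.nsmallest(k, scored)

def blind_sharded_search(segments, query, k):
    """Search every segment, then reduce the partial results to a global top-k."""
    partial = [hit for seg in segments for hit in search_segment(seg, query, k)]
    return heapq.nsmallest(k, partial)

segments = [
    [(0, (0.0, 0.0)), (1, (5.0, 5.0))],
    [(2, (1.0, 0.0)), (3, (9.0, 9.0))],
    [(4, (0.0, 2.0)), (5, (7.0, 7.0))],
]
top = blind_sharded_search(segments, query=(0.0, 0.0), k=2)
print([vid for _, vid in top])  # [0, 2]
```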

Global index is a hybrid cluster and graph ANNS scheme. It draws on the coarse search phase from IVF-Flat’s clustering algorithm to reduce the search space to a subset of clusters (nProbe clusters) and uses AiSAQ as an economically and computationally efficient graph-based algorithm for in-cluster search. For each cluster an independent AiSAQ index is built. The index is based on the same Vamana graph used by DiskANN and is followed by a short post-processing phase (including PQ vector rearrangement and multi-entry point computation). Index build was accelerated using NVIDIA cuVS, leveraging cuVS support for Vamana index build. cuVS was additionally used to accelerate the generation of other index elements employing k-means clustering, such as PQ vector generation, and clustering for entry point generation.
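The two-stage idea can be illustrated with a small sketch. It is a simplification under stated assumptions: the centroids and clusters are toy data, the function and parameter names are hypothetical, and a brute-force scan replaces the per-cluster AiSAQ graph search. It shows the routing logic only, not the Milvus or AiSAQ implementation.

```python
import heapq

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def global_index_search(centroids, clusters, query, k, nprobe):
    """Coarse step: pick the nprobe closest centroids.
    Fine step: search only those clusters, then merge to a global top-k."""
    probe = heapq.nsmallest(nprobe, range(len(centroids)),
                            key=lambda c: sq_dist(centroids[c], query))
    hits = []
    for c in probe:
        # Brute force stands in for the per-cluster graph index.
        hits.extend((sq_dist(vec, query), vid) for vid, vec in clusters[c])
    return probe, heapq.nsmallest(k, hits)

centroids = [(0.0, 0.0), (10.0, 10.0), (-10.0, 10.0)]
clusters = [
    [(0, (0.5, 0.0)), (1, (0.0, 1.0))],
    [(2, (9.0, 10.0)), (3, (10.0, 11.0))],
    [(4, (-9.0, 10.0)), (5, (-10.0, 9.0))],
]
probe, top = global_index_search(centroids, clusters, (0.2, 0.1), k=2, nprobe=1)
print(probe, [vid for _, vid in top])  # [0] [0, 1]
```

Only `nprobe` clusters are visited per query, so the search cost depends on `nprobe` rather than on the total number of clusters; the price is that a true neighbor sitting in an unprobed cluster is missed.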

Global index was implemented within the KIOXIA development branch of Milvus as an extension of the Cluster Compaction functionality in Milvus. Additionally, AiSAQ GPU-based index build was implemented using cuVS, and other settings and features were implemented to enhance Milvus support for high scale datasets. KIOXIA intends to contribute these enhancements to the Milvus open-source repository.

Benchmark
In this section we demonstrate the solution by benchmarking AiSAQ index build and search performance for a 5 billion scale, high-dimensionality dataset, and show that it exceeds RAG application requirements when deployed on a single node in a production-grade Milvus vector DB.

Setup
Our first challenge was the availability of a dataset: publicly available high-dimensionality vector datasets at over a billion scale are rare. Our approach was to generate a baseline dataset at the billion scale and synthetically expand it to a 5-billion-scale dataset.

The baseline dataset was generated using the Falcon text corpus² (truncated to 960M paragraphs). NVIDIA’s llama-3.2 embedding model³ was used to embed the text paragraphs into 1024-dimension vectors, where each dimension is represented as a 4-byte full-precision element (4 KiB per vector).

We’ve synthetically expanded the baseline 960M/1024D dataset to a 4.8B/1024D dataset by adding random noise to randomly selected subsets of the dimensions. Specifically, we randomly select half of the dimensions and add a constant value to the selected dimensions. The value is about 5% of the vector’s L2 norm per dimension (equaling 2E-3 for a normalized 1024-dimension vector), and it is added with a random polarity. Each vector in the baseline dataset was expanded accordingly into four synthetic vectors to generate the new dataset.
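A minimal sketch of this expansion step follows. It is one reading of the description rather than the code used for the benchmark: it assumes the constant equals 5% of the vector's L2 norm divided by the square root of the dimension (which gives roughly 2E-3 for a unit-norm 1024-dimension vector), and that the polarity is drawn independently for each selected dimension. The function name and parameters are hypothetical.

```python
import math
import random

def expand(vec, copies=4, rel=0.05, rng=None):
    """Return `copies` noisy variants of `vec`.

    Each variant perturbs a random half of the dimensions by a constant
    step with random sign. Assumption: step = rel * ||vec|| / sqrt(dim).
    """
    rng = rng or random.Random()
    dim = len(vec)
    step = rel * math.sqrt(sum(x * x for x in vec)) / math.sqrt(dim)
    variants = []
    for _ in range(copies):
        out = list(vec)
        for i in rng.sample(range(dim), dim // 2):
            out[i] += step if rng.random() < 0.5 else -step
        variants.append(out)
    return variants

dim = 1024
base = [1 / math.sqrt(dim)] * dim       # toy unit-norm vector
variants = expand(base, rng=random.Random(0))
step = 0.05 / math.sqrt(dim)
print(len(variants), f"{step:.2e}")     # 4 1.56e-03
```

Keeping the original vector alongside its four variants turns 960M vectors into 4.8B.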

Index build and search were run on a single Milvus node which served both as the Query node for index search, and Data node for index build. A Dell PowerEdge™ R760xa server was used as the Milvus node with the following specifications:

GPU: 4 x NVIDIA H100 NVL 96G
CPU: Dual socket Intel® Xeon® Gold 6548Y+ (32 cores each @ 2.5GHz)
Storage: 4 x KIOXIA CD8P Series 15.36 TB (PCIe® 5.0, data center SSDs) in a RAID0 configuration
Memory: 768 GiB DRAM

An HP ProLiant DX360 Gen10 server served as the client machine and was connected to the Milvus node via 100 Gb/s Ethernet.

The Falcon 4.8B dataset was clustered to 320 segments with 15M vectors per segment on average, using Cluster Compaction (resulting segment size of about 60GB).
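The segment figures can be checked with simple arithmetic. The sketch below counts raw vector data only (1024 dimensions at 4 bytes each) and ignores graph and PQ overhead, which is an assumption about what the quoted segment size covers.

```python
NUM_VECTORS = 4_800_000_000
NUM_SEGMENTS = 320
BYTES_PER_VECTOR = 1024 * 4          # 1024 dimensions x 4 bytes

vectors_per_segment = NUM_VECTORS // NUM_SEGMENTS
segment_bytes = vectors_per_segment * BYTES_PER_VECTOR

print(vectors_per_segment)                 # 15000000
print(f"{segment_bytes / 1e9:.1f} GB")     # 61.4 GB
```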

AiSAQ-S was used for in-cluster search with the following parameters:

R = 64
Lbuild = 256
PQ vector size = 256B
DiskPQ size = 1024B
Inline vectors= 0
Dynamic cache = 2MB
Lsearch = 10, 20, … , 250

Milvus VectorDBBench was used as the benchmarking tool.

GPU Index Build and Data Ingestion
AiSAQ index build can be separated into two phases: DiskANN index build, followed by an AiSAQ post-processing phase (which includes PQ vector rearrangement and multi-entry point computation).

DiskANN index build is predominantly centered around building the Vamana graph. In our setup, 320 indexes need to be built, one for each of the ~15M vector segments. Building this index on the Data node using CPU takes 28.4 days. Such a prolonged index build time impacts the freshness of the index and search complexity during updates.

We used NVIDIA cuVS to accelerate AiSAQ index build, leveraging cuVS support for Vamana graph build. The cuVS-based k-means implementation was additionally used to accelerate the generation of PQ vectors and of multiple entry points for AiSAQ.

By utilizing GPUs, index build can be massively accelerated. We have extended the Milvus implementation to support index build with multiple GPUs in parallel. When using all four H100 GPUs in the Milvus node, indexing time was accelerated by 20x, resulting in AiSAQ index build time of 1.4 days (refer to Figure 3(b)). AiSAQ post-processing overhead (on top of DiskANN’s index build) was also measured, and accounts for ~3% of total index build time.

Figure 3: GPU Accelerated Index Build (4.8 billion dataset)

AiSAQ index build is only one phase that a new dataset goes through when ingested into the Milvus vector database. The complete end-to-end index build process includes the following phases:

Upload: the dataset is uploaded from the client machine (as Parquet files)
Import: conversion of the dataset into temporary segments
Cluster: generation of clustered segments (using Cluster Compaction)
AiSAQ Index Build: an AiSAQ index is built for each of the clustered segments
Load: The indexes are loaded⁴ to Milvus’ local SSDs for search

We benchmarked the end-to-end index build process, as shown in Table 1. Using a CPU-based approach, ingestion is dominated by AiSAQ index build time and totals 31.0 days. Accelerating AiSAQ index build with four GPUs reduces the end-to-end index build time by 7.8x, lowering it to 4.0 days (refer to Figure 3(a)).
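The relationship between the two speedup figures can be worked out from the numbers quoted in the text. The sketch below uses only the rounded values reported above (28.4 and 1.4 days for index build, 31.0 and 4.0 days end to end), so the derived figures are approximate.

```python
cpu_index_days, cpu_total_days = 28.4, 31.0
gpu_index_days, gpu_total_days = 1.4, 4.0

index_speedup = cpu_index_days / gpu_index_days   # index build phase only
e2e_speedup = cpu_total_days / gpu_total_days     # whole ingestion pipeline
other_cpu = cpu_total_days - cpu_index_days       # upload, import, cluster, load
other_gpu = gpu_total_days - gpu_index_days

print(f"{index_speedup:.1f}x {e2e_speedup:.2f}x {other_cpu:.1f} {other_gpu:.1f}")
# 20.3x 7.75x 2.6 2.6
```

The non-index phases take about 2.6 days in both cases. Once index build is accelerated, those phases account for roughly 65% of the end-to-end time, which is why the end-to-end gain (7.8x) is smaller than the index build gain (20x).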

Table 1: End-to-End Index Build of Falcon 4.8B/1024D

Index Search
Index search was benchmarked using Milvus VectorDBBench. This tool generates multiple query streams that are sent concurrently to the query nodes to emulate production settings. We tested multiple concurrency values, and for each value we varied the Lsearch parameter to measure performance as a function of recall.
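For readers less familiar with the metric, recall@10 is the fraction of the true 10 nearest neighbors that the index returns in its top 10. A minimal sketch with toy IDs (not benchmark data, and not VectorDBBench code) is:

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Fraction of the true top-k neighbors that appear in the retrieved top-k."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

truth = list(range(10))                   # exact top-10 neighbor IDs
found = [0, 1, 2, 3, 4, 5, 6, 7, 8, 42]   # ANN result with one miss
print(recall_at_k(found, truth))          # 0.9
```

Raising Lsearch widens the candidate list explored in the graph, which generally raises recall at the cost of latency.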

Global index enables tuning via a pruning factor (equivalent to IVF-Flat’s nProbe) to reduce the search space to the subset of clusters closest to the query. AiSAQ graph-based ANNS then searches each of the remaining clusters for its top-k results, and an all-reduce phase merges them into the final top-k.

The hybrid cluster and graph scheme of Global index significantly improves search performance compared with the basic blind sharding scheme, which searches through all segments. As depicted in Figure 4, search latency is reduced by 4.1x in a typical setting⁵ when comparing Global index with blind sharding, enabling the index to meet the sub-100 ms requirement of a RAG application.

Figure 4: P95 Search Latency of Global Index vs. Blind Sharding

Index search results are depicted in Figure 5. We used nProbe=14, which prunes over 95% of the 320 clusters, and achieved a P95 latency of 57 ms and a throughput of 200 QPS at 90% recall@10 on a single Milvus server with 9 concurrent query streams. This exceeds RAG application requirements and leaves performance headroom for further expansion of scale.
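Two quick sanity checks on these figures are sketched below. The first computes the pruned fraction directly. The second applies Little's law under the assumption that the 9 query streams run in a closed loop, each issuing its next query as soon as the previous one returns; that is an assumption about the load generator, not a documented property, so the implied mean latency is indicative only.

```python
nprobe, num_clusters = 14, 320
pruned = 1 - nprobe / num_clusters
print(f"{pruned:.1%}")                    # 95.6%

streams, qps = 9, 200
# Little's law: concurrency = throughput x mean latency
mean_latency_ms = streams / qps * 1000
print(f"{mean_latency_ms:.0f} ms")        # 45 ms
```

An implied mean latency of about 45 ms is consistent with the reported 57 ms P95.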

Figure 5: P95 Latency (left) and throughput (right) as function of Recall@10 for 4.8B/1024D

Going forward, we plan to expand our benchmarks to higher scales, leveraging the cost-effectiveness of SSDs, the large storage capacity available in query nodes for vertical scaling, and the significant acceleration provided by GPUs for index build at scale. We are also exploring more sophisticated approaches to high-scale synthetic dataset generation, which will be discussed in future blog posts.

Summary
In this blog we reviewed RAG applications and the drivers for dataset scale. While ANNS solutions with DRAM components run into scalability bottlenecks, KIOXIA AiSAQ offers a scalable ANNS that meets or exceeds RAG application requirements.

AiSAQ is integrated in Milvus, a leading open-source vector DB, which provides a ready-to-use vehicle for application developers of RAG, online and offline semantic search applications.

As RAG dataset size increases, advanced schemes must be deployed to address the index search and index build challenges prevalent at high scale. KIOXIA provides cost-effective RAG based on AiSAQ all-in-storage ANNS, and NVIDIA cuVS GPU-accelerated index build. NVIDIA cuVS was extensively used to accelerate multiple phases of the AiSAQ index build process, including Vamana graph generation, PQ vectors generation and multi-entry point computation.

An advanced Global index scheme based on hybrid cluster and graph algorithms was used to increase search efficiency at high scale. The solution was benchmarked for a 4.8 billion, high dimensionality dataset on a single Milvus server and demonstrated to exceed RAG application requirements.

Enabling economic scaling of RAG datasets improves LLM accuracy and expands the scale of private data available for search and reasoning in enterprise RAG deployments. KIOXIA continues to expand AiSAQ to provide the Gen-AI industry solutions to economically meet the growing demand for higher scales.

Authors: Assaf Sella, David Dorham, Gili Buzaglo, Shimon Tsalmon, Yedidia Kaplan, Miki Schnarch

Notes:

¹ Source: KIOXIA engineering team. Cost analysis is based on assumptions (as of Q1 2025) regarding price per gigabyte for DRAM, TLC and QLC SSDs, and presented as a relative reference in percentage

² Falcon corpus: https://huggingface.co/datasets/tiiuae/falcon-refinedweb

³ Embedding model: llama-3.2-nvembed-1b-v2

⁴ MinIO was used as the object store for datasets, segments, and indexes. In our setup, MinIO is deployed on the Milvus node

⁵ The Falcon 4.8B dataset was sharded into 60 segments with 80M vectors per segment (large segments were used to optimize blind sharding performance). Search is run on each segment, followed by an all-reduce phase. Performance was benchmarked with 9 concurrent query streams

Trademarks:

NVIDIA is a trademark and/or registered trademark of NVIDIA Corporation in the U.S. and other countries.

PCIe is a registered trademark of PCI-SIG.

Microsoft is a trademark of the Microsoft group of companies

Dell and PowerEdge are trademarks of Dell Technologies or its subsidiaries.

Intel and Xeon are trademarks of Intel Corporation or its subsidiaries.

Disclaimers:

Definition of SSD capacity: Kioxia Corporation defines a kilobyte (KB) as 1,000 bytes, a megabyte (MB) as 1,000,000 bytes, a gigabyte (GB) as 1,000,000,000 bytes, a terabyte (TB) as 1,000,000,000,000 bytes, and a kibibyte (KiB) is 1,024 bytes. A computer operating system, however, reports storage capacity using powers of 2 for the definition of 1GB = 2^30 bytes = 1,073,741,824 bytes and 1TB = 2^40 bytes = 1,099,511,627,776 bytes and therefore shows less storage capacity. Available storage capacity (including examples of various media files) will vary based on file size, formatting, settings, software and operating system, and/or pre-installed software applications, or media content. Actual formatted capacity may vary.

Disclaimer
The views and opinions expressed in this blog are those of the author(s) and do not necessarily reflect those of KIOXIA America, Inc.
