
Introduction to JVector

As data-driven applications grow in scale and complexity, finding relationships between vectors efficiently is essential. JVector offers a high-performance Java solution for vector indexing and similarity search, utilising graph-based algorithms, product quantization, and a disk-aware design to deliver low-latency, high-recall results.

This article introduces JVector, explains its concepts and demonstrates how to build and query vector indexes effectively in Java.

1. Basic Concepts and Core APIs

JVector is a fast and compact Java library built for vector similarity search. It leverages advanced graph-based algorithms inspired by DiskANN to deliver quick and accurate nearest-neighbor searches across large, high-dimensional datasets. Designed for low-latency retrieval and high recall, it fits naturally into use cases such as recommendation systems, semantic search, and AI-driven data retrieval.
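For context, the baseline that an approximate nearest-neighbor index replaces is an exact scan over every vector. The sketch below is plain Java and does not use JVector; the class and method names are made up for illustration. It scores every vector against the query, which is accurate but grows linearly with the dataset size.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Exact (brute-force) nearest-neighbor search; illustrative only, not part of JVector. */
final class BruteForceSearch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the positions of the k vectors closest to the query, nearest first. */
    static List<Integer> topK(List<float[]> vectors, float[] query, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++) {
            ids.add(i);
        }
        // Every vector is scored, so the cost grows linearly with the dataset size.
        ids.sort(Comparator.comparingDouble(id -> squaredDistance(vectors.get(id), query)));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    public static void main(String[] args) {
        List<float[]> vectors = List.of(
                new float[] {0f, 0f},
                new float[] {1f, 1f},
                new float[] {5f, 5f},
                new float[] {0.9f, 1.2f});
        System.out.println(topK(vectors, new float[] {1f, 1f}, 2)); // prints [1, 3]
    }
}
```

A graph index trades a small amount of accuracy for visiting only a fraction of the vectors per query.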

To optimize performance, JVector applies product quantization to compress vectors so that the compressed copies can stay in memory during queries. This, along with its disk-aware design, minimizes disk access. Its parallel and incremental architecture allows index building and querying to happen at the same time and scales across multiple CPU cores. The API is designed to integrate into enterprise Java systems, and the library already powers DataStax Astra DB.
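To make the idea of product quantization concrete, here is a toy sketch in plain Java. It is not JVector's implementation: the codebooks are hard-coded, hypothetical values (a real quantizer learns them from the data, for example with k-means), and the class name is invented. Each 4-dimensional vector is split into two 2-dimensional subvectors, and each subvector is replaced by the index of its nearest centroid.

```java
import java.util.Arrays;

/** Toy product quantizer: 4-dimensional vectors, 2 subspaces, 2 centroids per subspace. */
final class ToyProductQuantizer {

    // Hypothetical codebooks; CODEBOOKS[s][c] is centroid c of subspace s.
    static final float[][][] CODEBOOKS = {
        { {0f, 0f}, {1f, 1f} },
        { {0f, 1f}, {1f, 0f} }
    };
    static final int SUB_DIM = 2;

    /** Encodes a vector as one centroid index per subspace. */
    static byte[] encode(float[] vector) {
        byte[] codes = new byte[CODEBOOKS.length];
        for (int s = 0; s < CODEBOOKS.length; s++) {
            float best = Float.MAX_VALUE;
            for (int c = 0; c < CODEBOOKS[s].length; c++) {
                float dist = 0f;
                for (int d = 0; d < SUB_DIM; d++) {
                    float diff = vector[s * SUB_DIM + d] - CODEBOOKS[s][c][d];
                    dist += diff * diff;
                }
                if (dist < best) {
                    best = dist;
                    codes[s] = (byte) c;
                }
            }
        }
        return codes;
    }

    /** Rebuilds an approximation of the original vector from its codes. */
    static float[] decode(byte[] codes) {
        float[] approx = new float[codes.length * SUB_DIM];
        for (int s = 0; s < codes.length; s++) {
            System.arraycopy(CODEBOOKS[s][codes[s]], 0, approx, s * SUB_DIM, SUB_DIM);
        }
        return approx;
    }

    public static void main(String[] args) {
        float[] original = {0.9f, 1.1f, 0.8f, 0.1f};
        byte[] codes = encode(original);
        System.out.println(Arrays.toString(codes));         // prints [1, 1]
        System.out.println(Arrays.toString(decode(codes))); // prints [1.0, 1.0, 1.0, 0.0]
    }
}
```

In this toy setup the 16-byte vector shrinks to 2 bytes of codes, at the cost of some reconstruction error. That trade-off is what lets compressed vectors stay in memory.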

The main components of JVector’s API are straightforward. Index creation starts with GraphIndexBuilder, which constructs the vector index. Vectors are supplied through the RandomAccessVectorValues interface, allowing JVector to read and organize them efficiently. For querying, the GraphSearcher class serves as the core search engine, returning results as SearchResult objects ordered by similarity. These searchers are reusable, and the GraphSearcher.Builder offers configurable options for repeated queries.

2. Project setup – Maven configuration

Below is the dependency entry to add to your pom.xml; it pulls the JVector library from Maven Central. Check Maven Central for the project's latest stable or release candidate version and use that in place of the one shown.

    <dependency>
      <groupId>io.github.jbellis</groupId>
      <artifactId>jvector</artifactId>
      <!-- Replace with the latest published version -->
      <version>4.0.0-rc.2</version>
    </dependency>

3. Building and Persisting an In-Memory Vector Index

This section demonstrates a complete Java example that shows how to load word embeddings from a dataset, build an in-memory vector index using JVector, persist it to disk, and validate the index by reloading it. The example uses a subset of the GloVe (Global Vectors for Word Representation) dataset.

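import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// JVector imports; package names may differ between releases
import io.github.jbellis.jvector.disk.ReaderSupplier;
import io.github.jbellis.jvector.disk.ReaderSupplierFactory;
import io.github.jbellis.jvector.graph.GraphIndex;
import io.github.jbellis.jvector.graph.GraphIndexBuilder;
import io.github.jbellis.jvector.graph.ListRandomAccessVectorValues;
import io.github.jbellis.jvector.graph.OnHeapGraphIndex;
import io.github.jbellis.jvector.graph.RandomAccessVectorValues;
import io.github.jbellis.jvector.graph.disk.OnDiskGraphIndex;
import io.github.jbellis.jvector.graph.similarity.BuildScoreProvider;
import io.github.jbellis.jvector.vector.VectorSimilarityFunction;
import io.github.jbellis.jvector.vector.VectorizationProvider;
import io.github.jbellis.jvector.vector.types.VectorFloat;
import io.github.jbellis.jvector.vector.types.VectorTypeSupport;

public class InMemoryExample {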

    private static Path indexPath;
    private static Map<String, VectorFloat<?>> datasetVectors;
    private static final VectorTypeSupport VECTOR_TYPE_SUPPORT = VectorizationProvider.getInstance()
            .getVectorTypeSupport();

    public static void main(String[] args) throws Exception {
        // Adjust this to the path of your GloVe file (e.g., glove.6B.50d.txt)
        String glovePath = "src/main/resources/glove.6B.50d.txt";

        System.out.println("Loading GloVe dataset...");
        datasetVectors = loadGlove6B50dDataSet(glovePath, 1000); // Load first 1000 vectors
        System.out.println("Loaded " + datasetVectors.size() + " vectors.");

        indexPath = Files.createTempFile("sample", ".inline");

        System.out.println("Building index...");
        persistIndex(new ArrayList<>(datasetVectors.values()), indexPath);
        System.out.println("Index persisted at: " + indexPath.toAbsolutePath());

        // Validate that index can be read back
        try (ReaderSupplier readerSupplier = ReaderSupplierFactory.open(indexPath)) {
            GraphIndex index = OnDiskGraphIndex.load(readerSupplier);
            System.out.println("Successfully loaded index: " + index.getClass().getSimpleName());
        }
    }

    /**
     * Builds and persists the JVector index to disk.
     */
    public static void persistIndex(List<VectorFloat<?>> baseVectors, Path indexPath) throws IOException {
        int dimension = baseVectors.get(0).length();
        RandomAccessVectorValues vectorValues = new ListRandomAccessVectorValues(baseVectors, dimension);

        BuildScoreProvider scoreProvider
                = BuildScoreProvider.randomAccessScoreProvider(vectorValues, VectorSimilarityFunction.EUCLIDEAN);

        try (GraphIndexBuilder builder
                = new GraphIndexBuilder(scoreProvider, dimension, 16, 100, 1.2f, 1.2f, true)) {
            OnHeapGraphIndex index = builder.build(vectorValues);
            OnDiskGraphIndex.write(index, vectorValues, indexPath);
        }
    }

    /**
     * Loads the first N vectors from the GloVe 6B 50d dataset.
     */
    public static Map<String, VectorFloat<?>> loadGlove6B50dDataSet(int limit) throws IOException {
        String filePath = "src/main/resources/glove.6B.50d.txt";
        return loadGlove6B50dDataSet(filePath, limit);
    }

    public static Map<String, VectorFloat<?>> loadGlove6B50dDataSet(String filePath, int limit) throws IOException {
        Map<String, VectorFloat<?>> dataset = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            String line;
            int count = 0;
            while ((line = reader.readLine()) != null && count < limit) {
                String[] parts = line.split(" ");
                if (parts.length < 51) {
                    continue; // skip invalid lines
                }
                String word = parts[0];
                VectorFloat<?> vector = VECTOR_TYPE_SUPPORT.createFloatVector(50);
                for (int i = 0; i < 50; i++) {
                    vector.set(i, Float.parseFloat(parts[i + 1]));
                }
                dataset.put(word, vector);
                count++;
            }
        }
        return dataset;
    }

}

The class begins by defining three main fields. The indexPath holds a temporary file path where the generated index is stored, while datasetVectors is a Map containing words and their corresponding vector representations. The VECTOR_TYPE_SUPPORT field, obtained from JVector’s VectorizationProvider, creates compatible vector objects.

The main method coordinates the entire process, from loading data to validating the index. It first loads the GloVe dataset using loadGlove6B50dDataSet, then calls persistIndex to transform the vectors into a searchable graph structure. The resulting index is stored in a temporary file and then reloaded using OnDiskGraphIndex.load to confirm that it was written successfully.

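The persistIndex() method handles the actual index construction. It reads the vector dimension from the first vector, wraps the list in a ListRandomAccessVectorValues, and creates a BuildScoreProvider that scores candidates with the Euclidean similarity function. The GraphIndexBuilder arguments control the trade-off between speed and accuracy: 16 is the maximum graph degree, 100 is the search depth used during construction, the two 1.2f values are the degree overflow and neighbor diversity factors, and true enables the hierarchical index. Finally, the method writes the completed index to disk using OnDiskGraphIndex.write().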

The loadGlove6B50dDataSet() method loads word embeddings from the GloVe dataset, where each line contains a word followed by its 50-dimensional vector. It splits each line on spaces, skips lines with fewer than 51 fields, and builds a VectorFloat using JVector's vector factory. The limit parameter caps the number of entries; main passes 1000 to keep the run quick. The resulting Map<String, VectorFloat<?>> is a LinkedHashMap, so it preserves the order in which the words were read, and it associates each word with the vector that will be indexed.
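The parsing step can be sketched in isolation with plain Java arrays instead of VectorFloat. The class name GloveLineParser and its parse method are hypothetical helpers, not part of JVector or the article's source. Unlike the loader above, this variant also rejects lines with non-numeric fields or with extra fields instead of letting a NumberFormatException escape.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Optional;

/** Parses one GloVe-style line ("word v1 v2 ... vN"); illustrative helper only. */
final class GloveLineParser {

    /** Returns the word and its vector, or an empty Optional if the line is malformed. */
    static Optional<Map.Entry<String, float[]>> parse(String line, int dimension) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != dimension + 1) {
            return Optional.empty(); // wrong number of fields
        }
        float[] vector = new float[dimension];
        try {
            for (int i = 0; i < dimension; i++) {
                vector[i] = Float.parseFloat(parts[i + 1]);
            }
        } catch (NumberFormatException e) {
            return Optional.empty(); // a component was not a number
        }
        return Optional.of(Map.entry(parts[0], vector));
    }

    public static void main(String[] args) {
        parse("computer 0.1 -0.2 0.3", 3).ifPresent(e ->
                System.out.println(e.getKey() + " " + Arrays.toString(e.getValue())));
        // prints: computer [0.1, -0.2, 0.3]
        System.out.println(parse("broken 0.1 oops 0.3", 3).isPresent()); // prints false
    }
}
```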

4. Searching for Similar Vectors Using the Disk Index

Once the JVector index has been built and stored on disk, the next logical step is to perform similarity searches. This involves loading the index from disk, creating a searcher, and querying it with a target vector. JVector provides efficient methods to find the nearest neighbors based on similarity metrics like cosine or Euclidean distance.

The following code demonstrates how to load the disk-based index and perform a simple similarity search on the vectors.

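import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// JVector imports; package names may differ between releases
import io.github.jbellis.jvector.disk.ReaderSupplier;
import io.github.jbellis.jvector.disk.ReaderSupplierFactory;
import io.github.jbellis.jvector.graph.GraphIndex;
import io.github.jbellis.jvector.graph.GraphIndexBuilder;
import io.github.jbellis.jvector.graph.GraphSearcher;
import io.github.jbellis.jvector.graph.ListRandomAccessVectorValues;
import io.github.jbellis.jvector.graph.OnHeapGraphIndex;
import io.github.jbellis.jvector.graph.RandomAccessVectorValues;
import io.github.jbellis.jvector.graph.SearchResult;
import io.github.jbellis.jvector.graph.SearchResult.NodeScore;
import io.github.jbellis.jvector.graph.disk.OnDiskGraphIndex;
import io.github.jbellis.jvector.graph.similarity.BuildScoreProvider;
import io.github.jbellis.jvector.util.Bits;
import io.github.jbellis.jvector.vector.VectorSimilarityFunction;
import io.github.jbellis.jvector.vector.VectorizationProvider;
import io.github.jbellis.jvector.vector.types.VectorFloat;
import io.github.jbellis.jvector.vector.types.VectorTypeSupport;

public class InMemoryExample {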

    private static Path indexPath;
    private static Map<String, VectorFloat<?>> datasetVectors;
    private static final VectorTypeSupport VECTOR_TYPE_SUPPORT = VectorizationProvider.getInstance()
            .getVectorTypeSupport();

    public static void main(String[] args) throws Exception {
        String glovePath = "src/main/resources/glove.6B.50d.txt";

        System.out.println("Loading GloVe dataset...");
        datasetVectors = loadGlove6B50dDataSet(glovePath, 1000);
        System.out.println("Loaded " + datasetVectors.size() + " vectors.");

        indexPath = Files.createTempFile("sample", ".inline");

        System.out.println("Building index...");
        persistIndex(new ArrayList<>(datasetVectors.values()), indexPath);
        System.out.println("Index persisted at: " + indexPath.toAbsolutePath());

        // Perform similarity search
        searchSimilarVectors(indexPath);
    }

    /**
     * Builds and persists the JVector index to disk.
     */
    public static void persistIndex(List<VectorFloat<?>> baseVectors, Path indexPath) throws IOException {
        int dimension = baseVectors.get(0).length();
        RandomAccessVectorValues vectorValues = new ListRandomAccessVectorValues(baseVectors, dimension);
        BuildScoreProvider scoreProvider
                = BuildScoreProvider.randomAccessScoreProvider(vectorValues, VectorSimilarityFunction.EUCLIDEAN);

        try (GraphIndexBuilder builder
                = new GraphIndexBuilder(scoreProvider, dimension, 16, 100, 1.2f, 1.2f, true)) {
            OnHeapGraphIndex index = builder.build(vectorValues);
            OnDiskGraphIndex.write(index, vectorValues, indexPath);
        }
    }

    /**
     * Loads and searches for similar vectors using the persisted index.
     */
    public static void searchSimilarVectors(Path indexPath) throws IOException {
        VectorFloat<?> queryVector = datasetVectors.get("computer");
        if (queryVector == null) {
            System.out.println("Query word not found in dataset.");
            return;
        }

        ArrayList<VectorFloat<?>> vectorsList = new ArrayList<>(datasetVectors.values());

        try (ReaderSupplier readerSupplier = ReaderSupplierFactory.open(indexPath)) {
            GraphIndex index = OnDiskGraphIndex.load(readerSupplier);

            SearchResult result = GraphSearcher.search(
                    queryVector,
                    5, // Top 5 similar vectors
                    new ListRandomAccessVectorValues(vectorsList, vectorsList.get(0).length()),
                    VectorSimilarityFunction.EUCLIDEAN,
                    index,
                    Bits.ALL
            );

            System.out.println("\nTop 5 similar vectors to: computer");

            int count = result.getNodes().length;
            NodeScore[] scores = result.getNodes();

            for (int i = 0; i < count; i++) {
                System.out.println("Node ID: " + scores[i].node + " | Similarity Score: " + scores[i].score);
            }
            System.out.println("\nTotal nodes visited during search: " + result.getExpandedCount());
        }
    }

    /**
     * Loads the first N vectors from the GloVe 6B 50d dataset.
     */
    public static Map<String, VectorFloat<?>> loadGlove6B50dDataSet(int limit) throws IOException {
        String filePath = "src/main/resources/glove.6B.50d.txt";
        return loadGlove6B50dDataSet(filePath, limit);
    }

    public static Map<String, VectorFloat<?>> loadGlove6B50dDataSet(String filePath, int limit) throws IOException {
        Map<String, VectorFloat<?>> dataset = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            String line;
            int count = 0;
            while ((line = reader.readLine()) != null && count < limit) {
                String[] parts = line.split(" ");
                if (parts.length < 51) {
                    continue;
                }
                String word = parts[0];
                VectorFloat<?> vector = VECTOR_TYPE_SUPPORT.createFloatVector(50);
                for (int i = 0; i < 50; i++) {
                    vector.set(i, Float.parseFloat(parts[i + 1]));
                }
                dataset.put(word, vector);
                count++;
            }
        }
        return dataset;
    }
}

This section adds a new method, searchSimilarVectors(), that demonstrates how to query the index for similar embeddings. The method first retrieves the embedding vector for the word "computer" from the dataset. It then loads the index from disk using OnDiskGraphIndex.load() and calls GraphSearcher.search().

This method performs an approximate nearest neighbor (ANN) search with the provided query vector and returns the 5 nearest vectors under the Euclidean similarity function. The ListRandomAccessVectorValues argument supplies the original vectors for scoring, and Bits.ALL marks every node in the graph as an acceptable result. The search returns a single SearchResult whose getNodes() array holds one NodeScore per match, each with a node ID and a similarity score that quantifies how close that vector is to the query.
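The node IDs in the result are integers, not words. Assuming node IDs follow the order in which the vectors were passed to the builder (here, the insertion order of the LinkedHashMap), a word can be recovered by position. The sketch below uses a small stand-in dataset and a hypothetical helper class, NodeIdLookup, to show the idea in plain Java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Maps node IDs back to words by position; assumes IDs follow insertion order. */
final class NodeIdLookup {

    /** Returns the keys in iteration order, so position i holds the word for node ID i. */
    static List<String> wordsByOrdinal(Map<String, ?> dataset) {
        return new ArrayList<>(dataset.keySet());
    }

    public static void main(String[] args) {
        // Stand-in for datasetVectors; the values here are made up.
        Map<String, float[]> dataset = new LinkedHashMap<>();
        dataset.put("the", new float[] {0.1f});
        dataset.put("computer", new float[] {0.2f});
        dataset.put("software", new float[] {0.3f});

        List<String> words = wordsByOrdinal(dataset);
        int nodeId = 1; // for example, scores[i].node from a search result
        System.out.println("Node ID " + nodeId + " -> " + words.get(nodeId));
        // prints: Node ID 1 -> computer
    }
}
```

This only works if the map preserves insertion order, which is why the loader uses a LinkedHashMap rather than a HashMap.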

Sample Output

Node ID: 951 | Similarity Score: 1.0
Node ID: 732 | Similarity Score: 0.08945871
Node ID: 925 | Similarity Score: 0.06841564
Node ID: 933 | Similarity Score: 0.05566362
Node ID: 622 | Similarity Score: 0.05308513

Total nodes visited during search: 13

This output shows that node 951 is the query vector itself: “computer” is part of the indexed dataset, so it is returned first with the maximum similarity score of 1.0. The remaining nodes (732, 925, 933, 622) are its nearest neighbors in vector space, that is, the words whose embeddings are closest to “computer”.
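The scores are similarities, not raw distances. As far as I know, JVector's Euclidean function converts a squared distance d² into a score of 1 / (1 + d²), so identical vectors score 1.0 and scores fall toward 0 as vectors move apart. The sketch below assumes that formula. It is a plain-Java illustration with an invented class name, not a call into the library.

```java
import java.util.Locale;

/** Illustrates the assumed score formula 1 / (1 + squaredDistance). */
final class EuclideanScore {

    /** Similarity score for two vectors of equal length under the assumed formula. */
    static float score(float[] a, float[] b) {
        float squared = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            squared += diff * diff;
        }
        return 1f / (1f + squared);
    }

    /** Inverts the formula: recovers the squared distance from a score. */
    static double squaredDistanceFromScore(double score) {
        return 1.0 / score - 1.0;
    }

    public static void main(String[] args) {
        float[] v = {0.5f, -1.0f, 2.0f};
        System.out.println(score(v, v)); // prints 1.0

        // Second score from the sample output above
        double squared = squaredDistanceFromScore(0.08945871);
        System.out.println(String.format(Locale.ROOT, "%.2f", squared));            // prints 10.18
        System.out.println(String.format(Locale.ROOT, "%.2f", Math.sqrt(squared))); // prints 3.19
    }
}
```

Under this assumption, the second result in the sample output lies at a Euclidean distance of roughly 3.19 from the query.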

The line Total nodes visited during search: 13 prints the value of result.getExpandedCount(): the search expanded 13 graph nodes before returning the top 5 matches, a small fraction of the 1000 indexed vectors. This illustrates how the graph index finds near neighbors without scanning the whole dataset.
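Because the search is approximate, a common sanity check is recall: the fraction of the true nearest neighbors that the index actually returned. The sketch below is a plain-Java illustration with an invented class name. The "exact" IDs are hypothetical ground truth made up for the example, not measured results. In practice they would come from a brute-force scan over the same vectors.

```java
import java.util.HashSet;
import java.util.Set;

/** Computes recall of an approximate result list against exact ground truth. */
final class RecallAtK {

    /** Fraction of exact neighbor IDs that also appear in the approximate result. */
    static double recall(int[] approximate, int[] exact) {
        Set<Integer> truth = new HashSet<>();
        for (int id : exact) {
            truth.add(id);
        }
        int hits = 0;
        for (int id : approximate) {
            if (truth.contains(id)) {
                hits++;
            }
        }
        return exact.length == 0 ? 1.0 : (double) hits / exact.length;
    }

    public static void main(String[] args) {
        int[] approximate = {951, 732, 925, 933, 622}; // node IDs from the sample output
        int[] exact = {951, 732, 925, 933, 640};       // hypothetical ground truth
        System.out.println(recall(approximate, exact)); // prints 0.8
    }
}
```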

5. Conclusion

In summary, JVector provides a powerful and efficient solution for building and searching vector indexes in Java. By combining graph-based indexing, product quantization, and disk-aware optimization, it enables fast and accurate similarity searches across large datasets. Its simple API, scalability, and seamless integration make it ideal for AI, recommendation, and search applications that rely on vector representations.

6. Download the Source Code

This concludes our intro to JVector.

Download
You can download the full source code of this example here: jvector intro

Omozegie Aziegbe

Omos Aziegbe is a technical writer and web/application developer with a BSc in Computer Science and Software Engineering from the University of Bedfordshire. Specializing in Java enterprise applications with the Jakarta EE framework, Omos also works with HTML5, CSS, and JavaScript for web development. As a freelance web developer, Omos combines technical expertise with research and writing on topics such as software engineering, programming, web application development, computer science, and technology.