Building Semantic Caching with Spring AI
Traditional caching relies on exact key matching. This works well for deterministic inputs but fails for natural language queries, where users may ask the same question in different ways. Semantic caching solves this problem with embeddings and vector similarity search: instead of matching exact strings, we compare the meaning of queries. If a new query is semantically similar to a previous one, we return the cached response instead of calling the LLM again. This article walks through a semantic cache built with Spring AI.
1. What is Semantic Caching?
Semantic caching is a technique that stores and reuses responses based on the meaning of a query rather than its exact text, so similar or contextually related requests can be served from the cache without reprocessing. It is especially useful in AI-driven systems, where an embedding model converts each query into a vector and vector similarity matches incoming queries against previously answered ones. This reduces latency and cost while keeping answers relevant. In Spring AI, the pattern is typically implemented with an embedding model and a vector database.
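To make "matching by meaning" concrete, here is a minimal sketch of the comparison a vector store performs, assuming cosine similarity as the metric. The class name, the three-dimensional vectors, and the 0.92 threshold are illustrative assumptions; real embeddings have hundreds of dimensions and come from an embedding model.

```java
class CosineSimilarityDemo {

    // Cosine similarity: dot(a, b) / (|a| * |b|). A value near 1.0 means the vectors point the same way.
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a zero vector has no direction, so treat it as dissimilar
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // A cached entry is reused only when the similarity reaches the threshold.
    public static boolean isCacheHit(double[] query, double[] cached, double threshold) {
        return cosine(query, cached) >= threshold;
    }

    public static void main(String[] args) {
        // Hypothetical, hand-written "embeddings" used purely for illustration
        double[] cachedQuestion = {0.9, 0.1, 0.0};
        double[] paraphrase = {0.8, 0.2, 0.0};
        double[] unrelated = {0.0, 0.1, 0.9};

        System.out.println(isCacheHit(paraphrase, cachedQuestion, 0.92)); // true
        System.out.println(isCacheHit(unrelated, cachedQuestion, 0.92));  // false
    }
}
```

The paraphrase vector scores roughly 0.99 against the cached question and clears the threshold, while the unrelated vector scores close to 0 and is rejected.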
2. Code Example
2.1 Add Dependencies (pom.xml)
To follow this example, you need a Spring Boot project (with spring-boot-starter-web for the REST controller) and the following Spring AI dependencies in your pom.xml. The artifact IDs below are the Spring AI 1.x starter names, and the versions are assumed to be managed by the Spring AI BOM:
<dependencies>
    <!-- Spring AI OpenAI Starter -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-openai</artifactId>
    </dependency>
    <!-- Redis Vector Store Starter -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-vector-store-redis</artifactId>
    </dependency>
</dependencies>
2.1.1 Running Redis Stack Using Docker
This implementation relies on Redis Stack to support vector similarity search via RediSearch. The easiest way to start a compatible Redis instance is by using Docker.
docker run -d \
  --name redis-stack \
  -p 6379:6379 \
  redis/redis-stack:latest
Once started, Redis Stack will be available at localhost:6379.
2.2 Configure application.yml
In application.yml, we configure the text-embedding-3-small model with reduced dimensions (512 instead of the default 1536) to optimize Redis storage and similarity search performance with little loss of accuracy. To enable OpenAI integration, generate an API key from the OpenAI API Keys page, store it as an environment variable, and reference it using ${OPENAI_API_KEY} to keep credentials out of source control.
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      embedding:
        options:
          model: text-embedding-3-small
          dimensions: 512
    vectorstore:
      redis:
        index-name: semantic-cache
        initialize-schema: true
  data:
    redis:
      host: localhost
      port: 6379
app:
  cache:
    threshold: 0.92
2.2.1 Explanation
The application.yml snippet provides the core infrastructure settings for a Spring AI semantic caching system, divided into three critical functional blocks. First, the spring.ai.openai section configures the EmbeddingModel bean; by selecting text-embedding-3-small with dimensions: 512, the system utilizes a high-efficiency Matryoshka-style embedding that truncates the standard 1536-dimensional vector down to a more compact size, significantly reducing storage costs and search latency in Redis without a substantial loss in semantic accuracy. Second, the vectorstore.redis and data.redis blocks establish the connection to a Redis Stack instance (running at localhost:6379) and enable initialize-schema: true, which ensures that the necessary RediSearch index (named semantic-cache) is automatically created upon application startup if it does not already exist. Finally, the custom app.cache.threshold property defines a similarity boundary of 0.92; this value acts as a strict mathematical filter during vector searches, where a score closer to 1.0 requires a near-perfect conceptual match to trigger a cache hit, effectively balancing the system’s ability to reuse answers for paraphrased questions while avoiding the retrieval of irrelevant or incorrect information.
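The dimension reduction described above happens on the OpenAI side when the dimensions option is set. As a conceptual sketch only, shortening a Matryoshka-style embedding amounts to keeping the leading components and rescaling the result to unit length. The class and method names are hypothetical, and the byte counts assume each component is stored as a 4-byte FLOAT32 value, which depends on how the index is configured.

```java
import java.util.Arrays;

class EmbeddingTruncationDemo {

    // Keep the first `dims` components, then rescale so the shortened vector has length 1.
    public static double[] truncateAndNormalize(double[] embedding, int dims) {
        if (dims <= 0 || dims > embedding.length) {
            throw new IllegalArgumentException("dims must be between 1 and " + embedding.length);
        }
        double[] shortened = Arrays.copyOf(embedding, dims);
        double sumOfSquares = 0.0;
        for (double v : shortened) {
            sumOfSquares += v * v;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm == 0.0) {
            return shortened; // nothing to rescale
        }
        for (int i = 0; i < shortened.length; i++) {
            shortened[i] /= norm;
        }
        return shortened;
    }

    // Raw vector payload per entry, assuming 4-byte FLOAT32 components (an assumption).
    public static int float32Bytes(int dims) {
        return dims * Float.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(truncateAndNormalize(new double[] {3.0, 4.0, 12.0}, 2)));
        System.out.println(float32Bytes(1536)); // 6144
        System.out.println(float32Bytes(512));  // 2048
    }
}
```

Under that storage assumption, each cached question needs 2048 bytes of vector data instead of 6144, a threefold reduction before any index overhead.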
2.3 Implementing Semantic Cache Service
The following code uses Spring AI 1.x API patterns for the VectorStore and SearchRequest.
// SemanticCacheService.java
package com.example.ai.service;

import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.Map;

@Service
public class SemanticCacheService {

    private final ChatModel chatModel;
    private final VectorStore vectorStore;

    // Minimum similarity score a cached question needs to count as a hit
    @Value("${app.cache.threshold}")
    private double threshold;

    public SemanticCacheService(ChatModel chatModel, VectorStore vectorStore) {
        this.chatModel = chatModel;
        this.vectorStore = vectorStore;
    }

    public String getAnswer(String query) {
        // Look up the single closest cached question at or above the threshold
        SearchRequest searchRequest = SearchRequest.builder()
                .query(query)
                .topK(1)
                .similarityThreshold(threshold)
                .build();
        List<Document> results = vectorStore.similaritySearch(searchRequest);

        // Cache hit: return the answer stored in the matched document's metadata
        if (!results.isEmpty()) {
            return "[CACHE HIT]: " + results.get(0).getMetadata().get("answer");
        }

        // Cache miss: query the LLM
        String response = chatModel.call(query);

        // Store the new question-answer pair for future lookups
        storeManualEmbedding(query, response);
        return response;
    }

    // The question is the text that gets embedded; the answer is carried as metadata
    public void storeManualEmbedding(String question, String answer) {
        Document doc = new Document(question, Map.of("answer", answer));
        vectorStore.add(List.of(doc));
    }
}
2.3.1 Code Explanation
The SemanticCacheService is a Spring Boot @Service designed to reduce LLM costs and latency by implementing a “cache-aside” pattern based on vector similarity. It uses the ChatModel to generate new responses and the VectorStore to store and search existing query-answer pairs. When getAnswer is called, the service builds a SearchRequest with a fluent builder, configuring topK(1) to find the single best match and applying a similarityThreshold (injected from application properties) so that only a cached query close enough in meaning to the current one qualifies. The vectorStore.similaritySearch call converts the input string into an embedding and runs a similarity calculation in the vector database; if a document is found, the service bypasses the LLM and returns the stored answer from the document’s metadata, labeled as a [CACHE HIT]. On a cache miss, chatModel.call(query) fetches a response from the AI provider, which is then immediately persisted via storeManualEmbedding. This storage method wraps the question and its generated answer in a Document object, where the question is the text to be vectorized and the answer is stored as metadata, so that subsequent similar questions are served directly from the cache.
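The flow described above can also be traced without Spring, Redis, or an API key. The following is a simplified in-memory sketch, not the Spring AI implementation: the class name is hypothetical, the embedder and LLM are plain functions standing in for the EmbeddingModel and ChatModel, and the two-dimensional vectors are hard-coded assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class InMemorySemanticCache {

    private record Entry(double[] vector, String answer) {}

    private final List<Entry> entries = new ArrayList<>();
    private final Function<String, double[]> embedder; // stand-in for an embedding model
    private final Function<String, String> llm;        // stand-in for a chat model
    private final double threshold;

    public InMemorySemanticCache(Function<String, double[]> embedder,
                                 Function<String, String> llm,
                                 double threshold) {
        this.embedder = embedder;
        this.llm = llm;
        this.threshold = threshold;
    }

    public String getAnswer(String query) {
        double[] queryVector = embedder.apply(query);

        // Equivalent of topK(1): find the single most similar cached entry
        Entry best = null;
        double bestScore = -1.0;
        for (Entry entry : entries) {
            double score = cosine(queryVector, entry.vector());
            if (score > bestScore) {
                bestScore = score;
                best = entry;
            }
        }

        // Equivalent of similarityThreshold: reuse the answer only for a close match
        if (best != null && bestScore >= threshold) {
            return "[CACHE HIT]: " + best.answer();
        }

        // Cache miss: call the LLM stand-in and store the result for next time
        String response = llm.apply(query);
        entries.add(new Entry(queryVector, response));
        return response;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hard-coded vectors play the role of embeddings (illustrative values only)
        Map<String, double[]> fakeEmbeddings = Map.of(
                "How do I reset my password?", new double[] {1.0, 0.0},
                "Where is the password change button?", new double[] {0.95, 0.10},
                "What is the weather today?", new double[] {0.0, 1.0});

        InMemorySemanticCache cache = new InMemorySemanticCache(
                fakeEmbeddings::get, q -> "LLM answer for: " + q, 0.92);

        System.out.println(cache.getAnswer("How do I reset my password?"));          // miss
        System.out.println(cache.getAnswer("Where is the password change button?")); // hit
        System.out.println(cache.getAnswer("What is the weather today?"));           // miss
    }
}
```

The three calls in main exercise a first-time miss, a semantic hit on a paraphrase, and a threshold rejection for an unrelated question. The linear scan is only acceptable for a toy example; a vector database replaces it with an indexed search.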
2.4 Exposing the Endpoints
Create a controller to allow manual priming and querying of the cache.
// CacheController.java
package com.example.ai.controller;

import com.example.ai.service.SemanticCacheService;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/cache")
public class CacheController {

    private final SemanticCacheService service;

    public CacheController(SemanticCacheService service) {
        this.service = service;
    }

    @PostMapping("/prime")
    public String prime(@RequestParam String q, @RequestParam String a) {
        service.storeManualEmbedding(q, a);
        return "Embedding stored with 512 dimensions!";
    }

    @GetMapping("/ask")
    public String ask(@RequestParam String q) {
        return service.getAnswer(q);
    }
}
2.4.1 Code Explanation
The CacheController class acts as the RESTful interface for the semantic caching system, exposing two specific endpoints to manage and query the AI cache via the SemanticCacheService. It is annotated with @RestController and mapped to the base path /api/cache, using constructor-based dependency injection to access the underlying service logic. The /prime endpoint, handled by a @PostMapping, allows developers or administrators to manually “warm” the cache by providing a question (q) and a corresponding answer (a) as request parameters; this triggers the storeManualEmbedding method, which vectorizes the question into a 512-dimension embedding and stores it in the vector database for future retrieval. The /ask endpoint, handled by a @GetMapping, serves as the primary entry point for users, taking a natural language query and passing it to the getAnswer method. This method performs a semantic search to determine if a similar intent already exists in the cache to return an immediate result; otherwise, it dynamically generates an answer using the LLM and caches it, ensuring the system becomes more efficient with every unique request it processes.
2.5 Testing the Implementation
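Testing a semantic cache requires verifying three distinct states: a Cache Miss (first-time query), a Semantic Cache Hit (similar but not identical query), and a Threshold Rejection (unrelated query). The cURL commands below prime the cache and then exercise a semantic hit. A first-time or unrelated query sent to /ask follows the miss path instead and returns the LLM response without the [CACHE HIT] prefix.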
2.5.1 Testing with cURL
Open your terminal and execute the following commands in sequence to observe how the semantic cache behaves across cache priming and semantic retrieval.
2.5.1.1 Manually prime the semantic cache
This request stores a question–answer pair in Redis as a 512-dimension embedding.
curl -X POST "http://localhost:8080/api/cache/prime" \
  --data-urlencode "q=How do I reset my password?" \
  --data-urlencode "a=Go to Settings > Security > Reset."
If everything works correctly, the embedding is stored in the vector store and the endpoint responds with:
Embedding stored with 512 dimensions!
2.5.1.2 Ask a semantically similar question
Although the wording is different, the intent matches the cached question, so the system performs a vector similarity search and retrieves the cached response without invoking the LLM.
curl "http://localhost:8080/api/cache/ask?q=Where+is+the+password+change+button?"
The system performs a semantic similarity check and returns the following response.
[CACHE HIT]: Go to Settings > Security > Reset.
This confirms that the cache is matched based on semantic meaning rather than exact text, demonstrating how Spring AI converts the query into an embedding, applies the configured similarity threshold, and returns the cached answer when a close match is found.
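The same two requests can be issued from Java, which avoids hand-encoding query strings. This is a minimal sketch using the JDK's built-in HttpClient; the class name and helper methods are hypothetical, and the base URL assumes the application runs locally on port 8080.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class CacheSmokeClient {

    // URL-encode a parameter value (spaces become '+', '?' becomes %3F, and so on)
    public static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Build the GET URI for the /ask endpoint
    public static URI askUri(String baseUrl, String question) {
        return URI.create(baseUrl + "/api/cache/ask?q=" + encode(question));
    }

    // Build the form-encoded body for the /prime endpoint
    public static String primeBody(String question, String answer) {
        return "q=" + encode(question) + "&a=" + encode(answer);
    }

    public static void main(String[] args) throws InterruptedException {
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:8080"; // assumed default
        HttpClient client = HttpClient.newHttpClient();
        try {
            // Step 1: prime the cache with a known question-answer pair
            HttpRequest prime = HttpRequest.newBuilder(URI.create(baseUrl + "/api/cache/prime"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(
                            primeBody("How do I reset my password?", "Go to Settings > Security > Reset.")))
                    .build();
            System.out.println(client.send(prime, HttpResponse.BodyHandlers.ofString()).body());

            // Step 2: ask a paraphrased question and print whatever the service returns
            HttpRequest ask = HttpRequest.newBuilder(askUri(baseUrl, "Where is the password change button?"))
                    .GET()
                    .build();
            System.out.println(client.send(ask, HttpResponse.BodyHandlers.ofString()).body());
        } catch (IOException e) {
            System.out.println("Application not reachable at " + baseUrl + ": " + e.getMessage());
        }
    }
}
```

Whether the second request reports a cache hit depends on the embedding model and the configured threshold, so the client prints the raw response rather than asserting on it.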
3. Conclusion
Semantic caching is a powerful optimization technique for AI-driven applications. By combining Spring AI, embeddings, and vector similarity search, we can reduce LLM calls, lower latency, and cut operational costs. Spring AI keeps the implementation clean and idiomatic for Spring developers, enabling a cache that matches on meaning rather than exact keys with minimal code. The same pattern can be extended for production-grade systems by swapping in other vector databases such as Pinecone, Weaviate, or PGVector.




