Real-Time Streaming with Spring AI ChatClient

Streaming responses in Spring AI's ChatClient deliver AI-generated content progressively, so clients receive partial output as it is produced instead of waiting for the entire response. Users see text as it arrives, long responses stay manageable, and perceived latency drops. This article explains how ChatClient streaming works and how to integrate it into real-time applications.

1. What is Spring AI ChatClient?

Spring AI ChatClient is a fluent API for communicating with AI chat models from Spring applications. It supports both a synchronous and a reactive streaming programming model, and lets developers build prompts from user and system messages that guide the model's output. ChatClient abstracts over the underlying model, so options such as the model name, prompt templates, and response handling can be customized in one place. It can be auto-configured by Spring Boot or created programmatically, and then injected into services and controllers. In short, ChatClient provides a unified interface for sending queries to AI models (such as OpenAI's GPT models) and receiving responses either in full or progressively via streaming.

2. Code Example

2.1 Maven Dependencies

Add the following Maven dependencies to your pom.xml to enable Spring AI ChatClient with streaming support:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
    <version>latest__jar__version</version>
</dependency>

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-webflux</artifactId>
</dependency>

The first dependency is the Spring AI OpenAI starter, which brings in ChatClient and the OpenAI chat model. It is named spring-ai-starter-model-openai as of Spring AI 1.0; earlier milestone releases used spring-ai-openai-spring-boot-starter. The Spring Boot WebFlux starter provides the reactive types and Server-Sent Events support used for streaming responses.

2.2 Create a Bean Class

Define a configuration class that exposes a ChatClient bean for dependency injection. The Spring AI starter auto-configures a ChatClient.Builder backed by the OpenAI chat model, so the bean method only needs to build the client. The API key and model are supplied through application.properties, for example spring.ai.openai.api-key=${OPENAI_API_KEY} and spring.ai.openai.chat.options.model=gpt-4.

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ChatClientConfig {

    // The builder is auto-configured by the Spring AI starter with the OpenAI chat model
    @Bean
    public ChatClient chatClient(ChatClient.Builder chatClientBuilder) {
        // Build and return the ChatClient that services and controllers will inject
        return chatClientBuilder.build();
    }
}

The injected builder is already wired to the OpenAI chat model, and the method returns the ChatClient built from it. Because the key is resolved from the OPENAI_API_KEY environment variable through the property placeholder, it never appears in source code, and the model choice stays in configuration rather than in Java.

2.2.1 Obtain and Configure the API Key

To obtain an API key, create an account on the [OpenAI Platform](https://platform.openai.com). Once logged in, open the API Keys section of the dashboard and click “Create secret key” to generate a new key. Copy the key and store it securely. Next, set it as an environment variable on your development machine or server. For example, in a Unix/Linux/macOS terminal, run export OPENAI_API_KEY="your_actual_api_key_here". Environment variables are read when a process starts, so restart the Spring Boot application after setting the variable. With the key in place, ChatClient can call the OpenAI model for both streaming and non-streaming requests.
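A missing or blank key usually surfaces only when the first request fails, so it can help to check it at startup. The following is a minimal JDK-only sketch of such a check; the class name ApiKeyCheck and both helper methods are hypothetical and not part of Spring AI.

```java
import java.util.Map;

// Hypothetical helper, not part of Spring AI: validates the key before the app uses it.
class ApiKeyCheck {

    // Returns the trimmed key, or throws if the variable is missing or blank.
    static String requireApiKey(Map<String, String> env) {
        String key = env.get("OPENAI_API_KEY");
        if (key == null || key.isBlank()) {
            throw new IllegalStateException("OPENAI_API_KEY is not set");
        }
        return key.trim();
    }

    // Masks everything except the last four characters so the key can be logged safely.
    static String mask(String key) {
        int visible = Math.min(4, key.length());
        return "*".repeat(key.length() - visible) + key.substring(key.length() - visible);
    }

    public static void main(String[] args) {
        try {
            String key = requireApiKey(System.getenv());
            System.out.println("Using API key " + mask(key));
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Passing the environment in as a Map keeps the check easy to test without touching the real process environment.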

2.3 Create a Controller Class

The following controller exposes one non-streaming and three streaming endpoints built on ChatClient. ChatClient's stream() call returns the response as a Flux of text chunks as the model produces them. The words and JSON endpoints are derived from that single stream; their splitting and wrapping logic is an illustrative choice, not a built-in ChatClient mode.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

@RestController
public class ChatController {

    private final ChatClient chatClient;

    public ChatController(ChatClient chatClient) {
        this.chatClient = chatClient;
    }

    // Non-streaming endpoint: returns the full completion as a single string.
    // call() blocks, so it runs on a bounded-elastic thread instead of the event loop.
    @GetMapping("/chat/nonstream")
    public Mono<String> nonStreamingResponse(@RequestParam String prompt) {
        return Mono.fromCallable(() -> chatClient.prompt().user(prompt).call().content())
                .subscribeOn(Schedulers.boundedElastic());
    }

    // Streaming words endpoint: re-splits the chunk stream on whitespace
    @GetMapping(value = "/chat/stream/words", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> streamWords(@RequestParam String prompt) {
        return Flux.defer(() -> {
            StringBuilder pending = new StringBuilder(); // holds a word split across chunks
            return streamChunks(prompt)
                    .concatMapIterable(chunk -> completeWords(pending.append(chunk)))
                    // After the model finishes, emit whatever word is still buffered
                    .concatWith(Mono.fromSupplier(() -> pending.toString().strip())
                            .filter(word -> !word.isEmpty()));
        });
    }

    // Streaming chunks endpoint: forwards each chunk exactly as the model emits it
    @GetMapping(value = "/chat/stream/chunks", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> streamChunks(@RequestParam String prompt) {
        return chatClient.prompt().user(prompt).stream().content();
    }

    // Streaming JSON endpoint: wraps each chunk in a {"content": "..."} object
    @GetMapping(value = "/chat/stream/json", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<Map<String, String>> streamJson(@RequestParam String prompt) {
        return streamChunks(prompt).map(chunk -> Map.of("content", chunk));
    }

    // Removes and returns every whitespace-terminated word currently in the buffer
    private static List<String> completeWords(StringBuilder pending) {
        List<String> words = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < pending.length(); i++) {
            if (Character.isWhitespace(pending.charAt(i))) {
                if (i > start) {
                    words.add(pending.substring(start, i));
                }
                start = i + 1;
            }
        }
        pending.delete(0, start); // keep only the unfinished trailing word
        return words;
    }
}

This controller handles chat requests in two ways: the non-streaming endpoint returns the full response once it is complete, while the streaming endpoints push partial output to the client as Server-Sent Events. The three streaming endpoints show how the same underlying stream can be delivered as words, raw chunks, or JSON objects.
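On the wire, each element of a stream served as text/event-stream travels as a Server-Sent Event: one or more data: lines followed by a blank line. The sketch below reproduces that framing in plain Java to show what a client such as cURL receives. SseFraming and toEvent are hypothetical names used for illustration only; in the application, Spring WebFlux does this framing itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of SSE framing; Spring WebFlux does this internally.
class SseFraming {

    // Frames one payload as an event: a "data:" line per payload line, then a blank line.
    static String toEvent(String payload) {
        return Arrays.stream(payload.split("\n", -1))
                .map(line -> "data:" + line + "\n")
                .collect(Collectors.joining()) + "\n";
    }

    public static void main(String[] args) {
        // Assumed sample chunks, standing in for what a model might emit.
        List<String> chunks = List.of("Streaming", " responses", " allow partial results");
        for (String chunk : chunks) {
            System.out.print(toEvent(chunk));
        }
    }
}
```

Because every event ends with a blank line, a client can process each chunk as soon as its terminating blank line arrives, without waiting for the connection to close.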

2.4 Create a Main Class

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class SpringAiChatApplication {

    public static void main(String[] args) {
        SpringApplication.run(SpringAiChatApplication.class, args);
    }
}

2.5 Code Run and Output

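Run the application with ./mvnw spring-boot:run and call the endpoints from a browser or cURL. For the streaming endpoints, pass -N to cURL so that it does not buffer the output. Each Server-Sent Event arrives as a data: line; the samples below show only the payload text.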

curl "http://localhost:8080/chat/nonstream?prompt=What%20are%20streaming%20responses%3F"
-- Output
Streaming responses allow partial results to be sent to the client as they are generated, improving latency and user experience by showing real-time data...


curl "http://localhost:8080/chat/stream/words?prompt=Explain%20streaming%20responses"
-- Output (Streaming Words outputs words incrementally)
Streaming responses allow partial results to be sent ...


curl "http://localhost:8080/chat/stream/chunks?prompt=Explain%20streaming%20responses"
-- Output (Streaming Chunks outputs phrases or sentence parts, chunk by chunk)
Streaming responses allow partial results. They improve latency and user experience by showing real-time data ...


curl "http://localhost:8080/chat/stream/json?prompt=Explain%20streaming%20responses"
-- Output (Streaming JSON outputs JSON-formatted partial completions)
{"content":"Streaming responses allow partial results"}
{"content":"to be sent as they are generated,"}
{"content":"improving latency and user experience."}
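The prompts in the cURL commands above are percent-encoded by hand (%20 for a space, %3F for a question mark). A small sketch of producing such URLs in Java follows; PromptUrl, encodePrompt, and chatUrl are hypothetical names introduced for this example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building the request URLs used in this section.
class PromptUrl {

    // URLEncoder produces form encoding, where a space becomes '+'; swap it for %20.
    // A literal '+' in the prompt is already encoded as %2B, so the swap is safe.
    static String encodePrompt(String prompt) {
        return URLEncoder.encode(prompt, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Appends the encoded prompt as the "prompt" query parameter.
    static String chatUrl(String baseUrl, String path, String prompt) {
        return baseUrl + path + "?prompt=" + encodePrompt(prompt);
    }

    public static void main(String[] args) {
        System.out.println(chatUrl("http://localhost:8080", "/chat/nonstream",
                "What are streaming responses?"));
        // prints http://localhost:8080/chat/nonstream?prompt=What%20are%20streaming%20responses%3F
    }
}
```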

2.6 Consuming Streaming Responses Programmatically

While calling the REST endpoints directly (through a browser or cURL) demonstrates the progressive streaming behavior, it doesn’t show how to consume these responses programmatically. The following example illustrates how this can be achieved using Spring WebClient.

import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

public class ChatClientConsumer {

    private final WebClient webClient;

    public ChatClientConsumer() {
        this.webClient = WebClient.create("http://localhost:8080");
    }

    // Non-streaming example (waits for full response)
    public void getNonStreamingResponse(String prompt) {
        String response = webClient.get()
                .uri(uriBuilder -> uriBuilder.path("/chat/nonstream")
                        .queryParam("prompt", prompt)
                        .build())
                .retrieve()
                .bodyToMono(String.class)
                .block();

        System.out.println("Full Response:\n" + response);
    }

    // Streaming example (receives data word by word)
    public void getWordStream(String prompt) {
        Flux<String> responseStream = webClient.get()
                .uri(uriBuilder -> uriBuilder.path("/chat/stream/words")
                        .queryParam("prompt", prompt)
                        .build())
                .retrieve()
                .bodyToFlux(String.class);

        // blockLast() keeps this standalone program alive until the stream completes
        responseStream.doOnNext(word -> System.out.print(word + " ")).blockLast();
    }

    // Streaming example (receives data chunk by chunk)
    public void getChunkStream(String prompt) {
        Flux<String> responseStream = webClient.get()
                .uri(uriBuilder -> uriBuilder.path("/chat/stream/chunks")
                        .queryParam("prompt", prompt)
                        .build())
                .retrieve()
                .bodyToFlux(String.class);

        responseStream.doOnNext(chunk -> System.out.print(chunk)).blockLast();
    }

    // Streaming example (receives JSON fragments)
    public void getJsonStream(String prompt) {
        Flux<String> responseStream = webClient.get()
                .uri(uriBuilder -> uriBuilder.path("/chat/stream/json")
                        .queryParam("prompt", prompt)
                        .build())
                .retrieve()
                .bodyToFlux(String.class);

        responseStream.doOnNext(jsonFragment -> System.out.println("Received: " + jsonFragment)).blockLast();
    }

    public static void main(String[] args) {
        ChatClientConsumer consumer = new ChatClientConsumer();

        System.out.println("=== Non-Streaming Response ===");
        consumer.getNonStreamingResponse("Explain streaming responses");

        System.out.println("\n\n=== Streaming Words ===");
        consumer.getWordStream("Explain streaming responses");

        System.out.println("\n\n=== Streaming Chunks ===");
        consumer.getChunkStream("Explain streaming responses");

        System.out.println("\n\n=== Streaming JSON ===");
        consumer.getJsonStream("Explain streaming responses");
    }
}

The ChatClientConsumer class consumes both the non-streaming and the streaming endpoints using Spring WebClient. It creates a WebClient pointing at the local server (http://localhost:8080) and provides one method per endpoint. The getNonStreamingResponse() method calls /chat/nonstream and blocks until the full response is received, then prints it. The getWordStream(), getChunkStream(), and getJsonStream() methods call /chat/stream/words, /chat/stream/chunks, and /chat/stream/json respectively. Each obtains a Flux<String> that emits one element per Server-Sent Event, and doOnNext() prints every word, chunk, or JSON fragment as soon as it arrives.

Because this is a standalone program, each streaming method ends with blockLast(). A bare subscribe() returns immediately, so main() could exit before any data arrives and the three streams would interleave. Inside a reactive application you would instead return the Flux or subscribe without blocking. The main() method runs all four examples in sequence, making the difference between full-response handling and real-time streaming easy to observe.
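For clients that do not have WebFlux on the classpath, the same streaming endpoints can be read with the JDK's built-in java.net.http.HttpClient. The following is a rough sketch under that assumption; SseLineClient, payloadOf, joinPayloads, and printStream are hypothetical names, and the parsing handles only data: lines and ignores other Server-Sent Event fields.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical JDK-only consumer; handles "data:" lines and ignores other SSE fields.
class SseLineClient {

    // Extracts the payload of a "data:" line; blank separators and comments yield empty.
    static Optional<String> payloadOf(String line) {
        if (!line.startsWith("data:")) {
            return Optional.empty();
        }
        String value = line.substring(5);
        // The SSE format allows one optional space after the colon.
        return Optional.of(value.startsWith(" ") ? value.substring(1) : value);
    }

    // Joins the payloads of all data lines with single spaces.
    static String joinPayloads(Stream<String> lines) {
        return lines.map(SseLineClient::payloadOf)
                .flatMap(Optional::stream)
                .collect(Collectors.joining(" "));
    }

    // Reads a live endpoint and prints each payload as soon as its line arrives.
    static void printStream(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "text/event-stream")
                .GET()
                .build();
        HttpResponse<Stream<String>> response =
                client.send(request, HttpResponse.BodyHandlers.ofLines());
        try (Stream<String> lines = response.body()) {
            lines.map(SseLineClient::payloadOf)
                    .flatMap(Optional::stream)
                    .forEach(payload -> System.out.print(payload + " "));
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Pass a full URL such as the /chat/stream/words endpoint with an encoded prompt.
            printStream(args[0]);
        } else {
            // No URL given: run the parser over canned lines instead of the network.
            System.out.println(joinPayloads(
                    Stream.of("data:Streaming", "", "data: responses", "", ":comment")));
        }
    }
}
```

BodyHandlers.ofLines() exposes the response body as a lazily populated stream of lines, so output appears progressively rather than after the connection closes.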

3. Conclusion

Streaming responses make AI-driven applications more interactive and responsive, because users see output as it is generated. As the words, chunks, and JSON endpoints show, developers can shape the streamed output to fit their UI/UX needs. The reactive programming model built on Project Reactor integrates naturally with non-blocking frameworks such as WebFlux. The example code provides a foundation for adopting the streaming API and building dynamic chat experiences.

Yatin Batra

An experienced full-stack engineer well versed in Core Java, Spring/Spring Boot, MVC, Security, AOP, Frontend (Angular & React), and cloud technologies (such as AWS, GCP, Jenkins, Docker, K8s).