Follow the Trail: Supercharging vLLM with OpenTelemetry Distributed Tracing
Ever struggled to pinpoint why your LLM inference serving isn’t as fast as it should be? What if you could trace every request and find the bottleneck with precision? Now you can!
In the ever-evolving landscape of machine learning, efficient inference serving for large language models (LLMs) is critical. The latest integration of OpenTelemetry distributed tracing into vLLM gives you advanced tools to monitor and improve its performance. Combined with solutions such as Instana, it offers a holistic view of distributed systems, providing detailed insights into request flows and helping organizations identify bottlenecks before they impact users.
What’s in it for You?
By reading this blog, you will learn about:
- Improved Performance Insights: Track requests with distributed tracing to identify bottlenecks and optimize response times.
- Simplified Debugging: Benefit from enhanced, structured logging that makes it quicker and more straightforward to pinpoint issues, conduct root cause analysis, and resolve problems.
- Cross-Service Tracing Integration: Explore how extending OpenTelemetry distributed tracing to other microservices can provide a comprehensive performance overview and simplify troubleshooting across your system.
Recently, I had the opportunity to contribute to the vLLM project, an open-source endeavor from UC Berkeley, which aims to optimize the serving of LLMs. My contribution focused on integrating OpenTelemetry distributed tracing, significantly enhancing the observability and performance monitoring of the system.
What is vLLM?
vLLM is an innovative open-source project designed to streamline the inference serving of large language models. Developed by UC Berkeley, it addresses the challenges of scalability and efficiency, making it a vital tool for organizations leveraging LLMs in their applications.
The Need for Better Observability
Observability is crucial for understanding the performance and behavior of complex systems, especially in enterprise environments and MLOps workflows. In the context of AI observability and vLLM, it helps developers and engineers monitor the performance, detect issues, and optimize the system effectively. Before the recent enhancements, vLLM had basic logging and monitoring capabilities, but there was room for improvement in providing detailed and actionable insights, in particular, adding support for distributed tracing.
Integrating OpenTelemetry
OpenTelemetry is a powerful observability framework that provides standardized tools for collecting, processing, and exporting telemetry data such as traces, metrics, and logs. By integrating OpenTelemetry into vLLM, we aimed to enhance its observability capabilities, allowing users to gain deeper insights into the system’s performance and identify bottlenecks more efficiently.
Benefits of the Integration
vLLM includes logging and metrics, but troubleshooting individual requests end to end can be challenging, as it requires correlating logs across multiple services. Distributed tracing addresses this by offering a comprehensive view of request flows through various system components.
Here’s how it works: Distributed tracing assigns a unique trace ID to each request, which is propagated through all involved services. Each service then generates spans (units of work) that are linked to the trace ID. These spans capture detailed information about the processing of the request, including timestamps and metadata. By integrating vLLM with OpenTelemetry, we enable the collection and correlation of these spans across different services.
As a result, we can track each request from start to finish, capturing all related spans and creating a complete timeline of its journey through the system. This level of visibility helps us understand how different parts of the system interact and pinpoint where delays or bottlenecks occur, ultimately simplifying troubleshooting and performance optimization.
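To make the relationship between trace IDs and spans concrete, here is a minimal sketch using only the Python standard library. It is a toy model, not the OpenTelemetry API: the `ToySpan` class and the `start_trace` and `start_child` helpers are hypothetical names introduced purely for illustration.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToySpan:
    """A simplified span: one unit of work that belongs to a trace."""
    name: str
    service: str
    trace_id: str                     # shared by every span of one request
    parent_id: Optional[str] = None   # None marks the root span
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self) -> None:
        """Record the end timestamp of this unit of work."""
        self.end = time.monotonic()


def start_trace(name: str, service: str) -> ToySpan:
    """Create the root span and assign a fresh trace ID to the request."""
    return ToySpan(name, service, trace_id=secrets.token_hex(16))


def start_child(parent: ToySpan, name: str, service: str) -> ToySpan:
    """Create a span in another service that inherits the parent's trace ID."""
    return ToySpan(name, service, trace_id=parent.trace_id,
                   parent_id=parent.span_id)


# One request flowing through a client and an inference server.
root = start_trace("client-request", service="client")
child = start_child(root, "llm-inference", service="inference-server")
child.finish()
root.finish()

# Every span carries the same trace ID, so a collector can group them
# into a single timeline for the request.
for span in (root, child):
    print(span.service, span.name, span.trace_id, span.parent_id)
```

In a real deployment the OpenTelemetry SDK generates these identifiers and timestamps; the sketch only shows why a shared trace ID plus parent links are enough to rebuild a request's journey.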
Step-by-Step Walkthrough
We’ll showcase the integration of vLLM with OpenTelemetry through three progressively complex scenarios. Each scenario builds on the previous one, demonstrating how tracing evolves from a simple setup to a fully distributed system. This progression not only highlights the growing complexity but also underscores the increasing value of distributed tracing in managing and optimizing system performance as your architecture scales.
We start with vLLM as a library and demonstrate its ability to export traces to Jaeger, an open source distributed tracing platform.
Then, we run vLLM as a server and demonstrate correlating spans of vLLM with those coming from the client.
Last, we build a mini LLM serving system including multiple language models, each served by a different instance of vLLM, a router that routes requests to the right vLLM instance based on the model name, and a client application. For this scenario, we’ll use IBM Instana as the trace collector and web UI rather than Jaeger. Instana, like Jaeger, also supports the OpenTelemetry protocol, and provides enterprise-level observability capabilities.
vLLM as a library
Imagine we’re developing a small application that uses vLLM as a library to handle LLM inference. While the application is simple, we notice inconsistent response times for different requests. Some requests take longer due to internal processing delays, but pinpointing where these delays occur is challenging. By instrumenting vLLM with OpenTelemetry and exporting traces to Jaeger, we gain a clear view of how latency is distributed within the library itself. This scenario demonstrates how tracing at the library level can help identify hidden bottlenecks in straightforward applications.
In this first scenario, we’ll set up vLLM as a library and configure it to export traces to a trace collector. We’ll use Jaeger, which supports the OpenTelemetry protocol, making it a great choice for collecting, visualizing and analyzing the traces.
First, let’s start Jaeger in a Docker container:
docker run --rm --name jaeger \
    -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
    -p 6831:6831/udp \
    -p 6832:6832/udp \
    -p 5778:5778 \
    -p 16686:16686 \
    -p 4317:4317 \
    -p 4318:4318 \
    -p 14250:14250 \
    -p 14268:14268 \
    -p 14269:14269 \
    -p 9411:9411 \
    jaegertracing/all-in-one:1.57
Next, extract the Jaeger IP address so that vLLM can send traces to it:
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
echo $OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
Set the OpenTelemetry service name and configure the exporter for insecure communication. Note that insecure mode is intended for development and testing only; it is not recommended for production environments due to security concerns:
export OTEL_SERVICE_NAME="vllm-server"
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
The environment variables OTEL_SERVICE_NAME and OTEL_EXPORTER_OTLP_TRACES_INSECURE are used by the OpenTelemetry client inside vLLM to set the service name and specify the connection mode.
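Because a missing or malformed variable tends to fail silently (no traces show up), a small preflight check can save time. The following is a sketch under stated assumptions: `check_tracing_env` is a hypothetical helper, not part of vLLM or OpenTelemetry, and it only checks the variables set in this walkthrough.

```python
from urllib.parse import urlparse

# Variables this walkthrough sets before starting vLLM.
EXPECTED_VARS = ("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", "OTEL_SERVICE_NAME")


def check_tracing_env(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    for name in EXPECTED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    endpoint = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", "")
    if endpoint:
        parsed = urlparse(endpoint)
        try:
            has_port = parsed.port is not None
        except ValueError:  # raised when the port is not a valid number
            has_port = False
        if not parsed.hostname or not has_port:
            problems.append(f"endpoint {endpoint!r} lacks a valid host or port")

    if env.get("OTEL_EXPORTER_OTLP_TRACES_INSECURE", "").lower() == "true":
        problems.append("warning: insecure export is enabled (development only)")
    return problems
```

Calling `check_tracing_env(os.environ)` (after `import os`) before launching the application prints nothing to act on when the list is empty, and otherwise names the variable to fix.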
Now, install the necessary vLLM and OpenTelemetry packages in our Python virtual environment:
pip install \
    'vllm==0.5.4' \
    'opentelemetry-sdk>=1.26.0,<1.27.0' \
    'opentelemetry-api>=1.26.0,<1.27.0' \
    'opentelemetry-exporter-otlp>=1.26.0,<1.27.0' \
    'opentelemetry-semantic-conventions-ai>=0.4.1,<0.5.0'
With everything set up, we can now run the following Python code to use vLLM as a library and generate some text. Save this code as inference_with_tracing.py:
import os

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(
    model="facebook/opt-125m",
    # Set the OpenTelemetry endpoint from the environment variable.
    otlp_traces_endpoint=os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"],
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Run the script using:
python inference_with_tracing.py
Once the script has finished, open Jaeger’s web UI in your browser (http://localhost:16686 with the port mapping above) to explore the traces.
Here, we’ll find one trace per request. Each trace includes attributes such as request metadata, latencies, and the number of processed tokens. These attributes help to diagnose issues like long queue times, providing valuable insights for optimization.
vLLM as a Server
As our application scales, we transition from using vLLM as a library to deploying it as a server. Now, multiple clients are sending requests to the server, and we need to ensure that the end-to-end latency remains low. However, we start noticing that some requests take significantly longer to process than others. By correlating spans from both the client and server using OpenTelemetry, we can trace the path of each request across the system. This scenario highlights how distributed tracing helps us understand how latency is introduced at various stages of the request’s journey, allowing us to optimize both client and server performance.
Step 1: Run vLLM Server
First, let’s start vLLM as a server and configure it to export traces to Jaeger, the OpenTelemetry trace collector:
vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
Step 2: Custom Client Application
With the vLLM server running and ready to receive requests, use the following Python code to create a client that sends requests to the server. Save this code as client.py:
import requests
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (BatchSpanProcessor,
                                            ConsoleSpanExporter)
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import (
    TraceContextTextMapPropagator)

# Export spans both to the OTLP collector and to the console.
trace_provider = TracerProvider()
set_tracer_provider(trace_provider)
trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace_provider.get_tracer("dummy-client")

vllm_url = "http://localhost:8000/v1/completions"

with tracer.start_as_current_span("client-span", kind=SpanKind.CLIENT) as span:
    prompt = "San Francisco is a"
    span.set_attribute("prompt", prompt)

    # Inject the current trace context into the HTTP headers.
    headers = {}
    TraceContextTextMapPropagator().inject(headers)

    payload = {
        "model": "facebook/opt-125m",
        "prompt": prompt,
        "max_tokens": 10,
        "best_of": 20,
        "n": 3,
        # Send a JSON boolean rather than the string "true".
        "use_beam_search": True,
        "temperature": 0.0,
    }
    response = requests.post(vllm_url, headers=headers, json=payload)
Set the environment variables and run the client:
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
export OTEL_SERVICE_NAME="client-service"
python client.py
The client adds trace context to the HTTP headers of the requests sent to vLLM, following the W3C Trace Context specification (https://www.w3.org/TR/trace-context/).
The client also exports its own spans to the same trace collector, allowing us to explore them as well.
vLLM extracts the trace context from the request and applies it to the traces it exports. This enables us to correlate the traces exported by both vLLM and the client, providing a comprehensive view of each request’s journey.
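The trace context travels in a `traceparent` header with four dash-separated hexadecimal fields: version, trace ID, parent span ID, and flags. The sketch below builds and parses such a header with the standard library only, to show what the propagator puts on the wire. The helper names `make_traceparent` and `parse_traceparent` are hypothetical, and the parser handles only the version `00` layout; in practice the OpenTelemetry propagator does this for you.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting malformed values."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are invalid according to the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("trace-id and parent-id must not be all zeros")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(header)
print(parse_traceparent(header))
```

The server side reads the trace ID from this header and attaches its own spans to it, which is what lets the collector stitch client and server spans into one trace.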
Step 3: Explore Correlated Traces
We can now open Jaeger’s web UI in our browser to explore the correlated traces. Each trace represents a request, showing spans from both the client and the vLLM server. This correlation offers valuable insights into the end-to-end request flow and helps identify performance bottlenecks or issues.
vLLM as a Mini Serving System
In this final scenario, we’ll build a mini LLM serving system with multiple language models, each served by a different vLLM instance. A router application dynamically directs requests to the appropriate vLLM instance based on the model name, while a client application sends these requests. As the system grows more complex, managing latency becomes a greater challenge. Requests may experience delays due to routing overhead, load on specific vLLM instances, or inefficiencies in the client application.
To address these issues, we’ll use IBM Instana as our trace collector and web UI. Like Jaeger, Instana supports the OpenTelemetry protocol and allows us to monitor how latency is distributed across the entire microservices-based system. This scenario showcases how comprehensive tracing across all components enables us to identify and address latency issues, ensuring a smooth and efficient user experience. Using IBM Instana provides a complete end-to-end OpenTelemetry solution for enterprise customers.
Step 1: Creating a Mapping File for the Router
First, we create a mapping file for the router application that maps each model name to its corresponding service address. Save this file as model_map.yaml:
generation:
  facebook/opt-125m: opt-125m:8033
  gpt2: gpt2:8033
Refer to the text-generation-router repository for more details.
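Conceptually, the router's job for each request is a lookup from model name to backend address. The sketch below mirrors the mapping above as a Python dictionary to illustrate that idea; it is not the router's actual implementation, and `resolve_backend` is a hypothetical helper.

```python
# The same mapping as model_map.yaml, written as a Python dictionary.
MODEL_MAP = {
    "generation": {
        "facebook/opt-125m": "opt-125m:8033",
        "gpt2": "gpt2:8033",
    }
}


def resolve_backend(model_id, model_map=MODEL_MAP):
    """Return the host:port of the service that serves the given model."""
    try:
        return model_map["generation"][model_id]
    except KeyError:
        raise LookupError(f"no backend configured for model {model_id!r}") from None


print(resolve_backend("gpt2"))
print(resolve_backend("facebook/opt-125m"))
```

The host names in the mapping (`opt-125m`, `gpt2`) match the service names in the Docker Compose file, which is how the router reaches each vLLM instance on the Compose network.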
Step 2: Start the serving system with docker compose
Since the router we’ll use supports only the TGI API, we’ll run vLLM with a TGI adapter.
The Docker Compose file defines three services: the router application and two vLLM instances running the gpt2 and facebook/opt-125m models. Save the file as docker-compose.yaml:
services:
  router:
    image: quay.io/wxpe/text-gen-router:main.87b9dfd
    environment:
      - OTEL_SERVICE_NAME=vllm-router
    ports:
      - "8033:8033"
    volumes:
      - ./model_map.yaml:/app/model_map.yaml
    command: [
      "fmaas-router",
      "--grpc-port", "8033",
      "--otlp-endpoint", "${INSTANA_AGENT_ADDRESS}",
      "--model-map-config", "/app/model_map.yaml"
    ]
  opt-125m:
    image: quay.io/opendatahub/vllm:fast-ibm-0cd0aad
    environment:
      - HF_HUB_OFFLINE=0
      - OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
      - OTEL_SERVICE_NAME=vllm-opt-125m
    command: [
      "--model", "facebook/opt-125m",
      "--otlp-traces-endpoint", "${INSTANA_AGENT_ADDRESS}",
      "--grpc-port", "8033",
      "--gpu-memory-utilization", "0.45",
    ]
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  gpt2:
    image: quay.io/opendatahub/vllm:fast-ibm-0cd0aad
    environment:
      - HF_HUB_OFFLINE=0
      - OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
      - OTEL_SERVICE_NAME=vllm-gpt2
    command: [
      "--model", "gpt2",
      "--otlp-traces-endpoint", "${INSTANA_AGENT_ADDRESS}",
      "--grpc-port", "8033",
      "--gpu-memory-utilization", "0.45",
    ]
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
Export the INSTANA_AGENT_ADDRESS environment variable and execute the following command to start the serving system:
export INSTANA_AGENT_ADDRESS="<INSTANA-AGENT-ADDRESS>"
docker compose -f docker-compose.yaml up
Step 3: Run a Custom Client Application
Now that our setup is ready, we can run a custom client application that sends requests to the router, which then forwards them to the corresponding vLLM instance. Save the following code as grpc_client.py:
import grpc
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (BatchSpanProcessor,
                                            ConsoleSpanExporter)
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import (
    TraceContextTextMapPropagator)

from pb import generation_pb2, generation_pb2_grpc

# Export spans both to the OTLP collector and to the console.
trace_provider = TracerProvider()
set_tracer_provider(trace_provider)
trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace_provider.get_tracer("dummy-client")

router_address = "localhost:8033"

with grpc.insecure_channel(router_address) as channel:
    stub = generation_pb2_grpc.GenerationServiceStub(channel)
    for model_id in ["gpt2", "facebook/opt-125m"]:
        # The span describes an outgoing call, so its kind is CLIENT.
        with tracer.start_as_current_span("client-span",
                                          kind=SpanKind.CLIENT) as span:
            prompt = "San Francisco is a"
            span.set_attribute("prompt", prompt)

            # Inject the current context into the gRPC metadata
            headers = {}
            TraceContextTextMapPropagator().inject(headers)
            metadata = list(headers.items())

            reqs = [generation_pb2.GenerationRequest(text=prompt)]
            req = generation_pb2.BatchedGenerationRequest(
                model_id=model_id,
                requests=reqs,
                params=generation_pb2.Parameters(
                    sampling=generation_pb2.SamplingParameters(temperature=0.0),
                    stopping=generation_pb2.StoppingCriteria(max_new_tokens=10)))
            response = stub.Generate(req, metadata=metadata)

# Make sure all buffered spans are exported before the script exits.
trace_provider.force_flush()
Run the client application with:
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
export OTEL_SERVICE_NAME=client-service
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=$INSTANA_AGENT_ADDRESS
python grpc_client.py
Note: The client relies on the protocol buffer classes generated from the “.proto” files. These classes can be found in the repository accompanying this blog.
Step 4: Inspect Traces with Instana
With the setup complete, let’s inspect our traces in the Instana web UI.
As in the previous example, the client exports traces to the trace collector.
In Instana, we can visualize the entire request flow. Each trace will show the time spent in each component, helping us identify bottlenecks and optimize performance.
By breaking down the total latency into the duration spent in each component, we can spot bottlenecks and determine which components benefit most from optimizations.
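The breakdown can be sketched as follows: each span's own ("self") time is its duration minus the time spent in its direct children, summed per service. This is an illustrative calculation, not how Instana computes its views; the span records and their millisecond timings are made-up values, `self_time_by_service` is a hypothetical helper, and the subtraction assumes that a span's children do not overlap in time.

```python
from collections import defaultdict


def self_time_by_service(spans):
    """Attribute each span's duration, minus its direct children's, to its service."""
    # Total duration of the direct children of each span.
    child_time = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_time[span["parent_id"]] += span["end"] - span["start"]

    totals = defaultdict(float)
    for span in spans:
        duration = span["end"] - span["start"]
        totals[span["service"]] += duration - child_time[span["span_id"]]
    return dict(totals)


# Made-up spans for one request: client -> router -> vLLM instance (times in ms).
spans = [
    {"span_id": "a", "parent_id": None, "service": "client-service",
     "start": 0.0, "end": 120.0},
    {"span_id": "b", "parent_id": "a", "service": "vllm-router",
     "start": 5.0, "end": 115.0},
    {"span_id": "c", "parent_id": "b", "service": "vllm-gpt2",
     "start": 10.0, "end": 110.0},
]

for service, millis in self_time_by_service(spans).items():
    print(f"{service}: {millis:.1f} ms")
```

With these sample values, most of the 120 ms is spent in the vLLM instance, while the client and the router each account for about 10 ms of overhead, so the model server would be the first place to look.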
Conclusion
Integrating OpenTelemetry with vLLM, together with solutions such as Jaeger and IBM Instana, represents a significant step forward in the observability and overall performance of the system. With comprehensive distributed tracing, developers gain actionable insights and deep visibility into latency issues, which helps keep the architecture transparent, traceable, and optimized as it grows.
I’m thrilled to have been a part of this project and look forward to seeing how these enhancements will benefit the community. If you’re interested in learning more or contributing to vLLM, check out the project on GitHub.
Thank you for reading! I welcome your feedback and questions in the comments.
