Large Language Model Observability: The Breakdown

The LLM stack brings a different set of metrics than your team usually tracks. In this Makers episode, co-host Janakiram MSV identifies the new "golden signals."

Mar 28th, 2024 1:46pm by Alex Williams


Getting the most out of a large language model is the point of LLM observability.

“In the last 12 months or so, there is a new stack that has evolved,” noted Janakiram MSV, an independent analyst and frequent contributor to The New Stack, who joined me as co-host for this episode of The New Stack Makers.

“And that is the LLM stack, which has multiple pieces of the puzzle like the large language model, the vector databases, the embedding model, the retrieval systems, the reranker models, and it’s a whole new ecosystem. So making sure that we are monitoring the golden signals that come out of this new stack and making sure that we are getting what we want out of the system is primarily the objective of LLM observability.”

But what is the goal?

“For folks familiar with DevOps- and SRE-based metrics, they already know what infrastructure observability is,” MSV said. “The goal of any observability mechanism is to make sure we have insights into a system. So in infrastructure observability, we look at four golden signals, which are called MELT: metrics, events, logs, and traces.

“Now, if I am a systems administrator or an Ops guy, I am responsible for measuring these four metrics and keeping an eye on them to ensure my systems are delivering the uptime, which is 99.9%,” or whatever the service-level agreement is.

He continued, “Very similar to this, the LLM also has certain metrics entirely different from what we have been tracking for infrastructure, which we will do a deep dive on.”

MSV detailed the critical aspects of LLM observability. He broke it down by starting with the overall GenAI stack, which has several sub-topics, including:

GPU.
CPU.
Storage and vector database.
Model serving and model usage.
The chains and agents in the application.

Other topics covered in this episode included hallucinations, spans and traces, relevance, retrieval models, latency, usage, monitoring, and user feedback.

First, MSV said, it is important to examine the overall stack that an enterprise may use on-premises to understand LLM observability.

Accelerated computing sits at the bottom layer, which contains high-end CPUs and GPUs. Monitoring usage at this layer means determining whether the infrastructure resources are oversubscribed or undersubscribed.
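As a rough illustration of that oversubscribed-versus-undersubscribed check, the sketch below classifies a window of utilization samples by their average. The function name `classify_utilization` and the thresholds (85% and 30%) are assumptions made for this example, not values from the episode; the samples would come from whatever CPU or GPU telemetry a team already collects.

```python
from statistics import mean


def classify_utilization(samples, high=85.0, low=30.0):
    """Classify a window of utilization percentages (0-100).

    `high` and `low` are hypothetical thresholds; tune them per workload.
    """
    if not samples:
        raise ValueError("need at least one utilization sample")
    avg = mean(samples)
    if avg >= high:
        return "oversubscribed"
    if avg <= low:
        return "undersubscribed"
    return "ok"


if __name__ == "__main__":
    # Made-up sample windows, purely for demonstration.
    print(classify_utilization([92.0, 88.5, 95.0]))
    print(classify_utilization([12.0, 8.0, 20.0]))
```

Averaging over a window rather than reacting to a single reading keeps a short spike from being reported as sustained oversubscription.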

“And we already have enough mechanisms to track that,” MSV said. “So that is the first layer. The second layer is the storage layer, which will be your model catalog or the model garden. Now, this needs to be in sync with an external model provider like Hugging Face, because that’s where you’re going to pull the models from.”

It is important to check for the most up-to-date models, MSV said.

“And then there is the vector database,” he said. “The vector database contains the embeddings and the vectors of your ground truth. Keeping that always highly available is very critical. So you need to treat that the way you treat your Postgres or MySQL, or any other database, and ensure uptime of the vector database.”
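One simple way to put a number on vector database uptime is to record periodic health probes and compare the success ratio to a target. The sketch below assumes the probe results are already available as a list of booleans; the function names and the 99.9% default target (borrowed from the SLA figure MSV mentions earlier) are illustrative assumptions, not a prescribed method.

```python
def availability(probe_results):
    """Return the fraction of successful health probes (True = healthy)."""
    if not probe_results:
        raise ValueError("need at least one probe result")
    return sum(1 for ok in probe_results if ok) / len(probe_results)


def meets_target(probe_results, target=0.999):
    """Check measured availability against a hypothetical uptime target."""
    return availability(probe_results) >= target


if __name__ == "__main__":
    # 1,000 made-up probes with 2 failures.
    probes = [True] * 998 + [False] * 2
    print(availability(probes), meets_target(probes))
```

How the probes are produced (a ping, a test query, a client health endpoint) depends on the database in use and is left out here.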

The inference engine sits at the third layer, combining the model-serving environment and the API server.
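Latency is one of the signals discussed in the episode, and the inference layer is a natural place to measure it. The sketch below times each call to a model-serving function and reports a nearest-rank percentile over the recorded latencies. `fake_generate` is a hypothetical stand-in for a real inference call, and the helper names are assumptions made for this example.

```python
import math
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct in (0, 100]."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def fake_generate(prompt):
    # Hypothetical stand-in for a request to the inference engine.
    return prompt.upper()


if __name__ == "__main__":
    latencies = []
    for prompt in ["hello", "what is observability?", "summarize this"]:
        _, elapsed = timed_call(fake_generate, prompt)
        latencies.append(elapsed)
    print("p95 latency (s):", percentile(latencies, 95))
```

Tracking a high percentile rather than the mean surfaces the slow requests that users actually notice.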

MSV went into considerable depth in this episode. We concluded by looking at the companies in the LLM observability space, including Arize.ai (Phoenix), Datadog, Dynatrace, LangChain (LangSmith), New Relic, SigNoz, and TruEra.

We’ll explore these different companies in an upcoming episode of The New Stack Makers.

Alex Williams is founder and publisher of The New Stack. He's a longtime technology journalist who did stints at TechCrunch, SiliconAngle and what is now known as ReadWrite. Alex has been a journalist since the late 1980s, starting at the...