The Advent of Automated Observability

AI may never be a cure-all for observability, but it can certainly be a valuable companion.

Mar 20th, 2024 10:00am by Ozan Unlu

The cost of downtime is well documented, impacting everything from revenue to productivity to compliance to brand reputation. Over the past year, there have been several examples of major airlines experiencing technical glitches in their customer-facing check-in and electronic ticketing systems, resulting in thousands of canceled and delayed flights. This past April, online discount brokerage Robinhood was slammed with a $10M fine for outages in 2020.

When we look at the headlines, we often see coverage of outages at bigger companies. Often, their response breaks down into two components: increased monitoring and troubleshooting.

Monitoring means identifying metrics that are indicative of whether you’re meeting your service level objectives (SLOs), and then relying on human-defined alerting thresholds to fire when metrics are trending outside of expected behavior.
Troubleshooting means that when an alert fires, you have to sift through logs looking for a “needle in the haystack” to determine the root cause of the issue. Often, this means relying on “institutional knowledge” — who knows our systems the best, has seen this issue before and knows how to solve it?
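
To make the monitoring half concrete, here is a minimal sketch of a human-defined alerting threshold. The rule class, the metric name, the 400 ms limit and the latency readings are all hypothetical and chosen only for illustration; real monitoring tools express the same idea in their own configuration formats.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """A human-defined alert rule: fire when a metric exceeds a fixed limit."""
    metric: str
    limit: float


def evaluate(rule: ThresholdRule, samples: list[float]) -> list[int]:
    """Return the indexes of samples that breach the rule's fixed limit."""
    return [i for i, value in enumerate(samples) if value > rule.limit]


# Hypothetical p95 latency readings in milliseconds, one per minute.
latency_ms = [120, 130, 125, 480, 510, 135]
rule = ThresholdRule(metric="p95_latency_ms", limit=400)
breaches = evaluate(rule, latency_ms)
print(breaches)  # indexes of the readings above 400 ms
```

The rule only fires on the condition someone thought to write down. A failure that never pushes this particular metric past 400 ms goes unnoticed.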

Monitoring and troubleshooting, as outlined above, are reactive. You’re dedicating significant engineering hours to manual tasks. Plus, you have incomplete coverage of anomalies because you’re only alerting on known behaviors. As a byproduct of both, you might experience slow resolutions, dependent entirely on (a) whether or not you caught the issue and (b) whether or not you can locate the relevant log data.

There’s a significant problem with this approach. Many of the events that can occur in a production environment are so rare that “predicting” them is impractical in the traditional sense. In the course of day-to-day life, too, certain unavoidable incidents and events with a lasting business impact can be impossible to predict. For example, prior to 2020, could anyone have foreseen a once-in-a-lifetime pandemic that would result in a major hit to the U.S. economy?

The long tail of potential errors in application development is analogous to this, and it’s the reason why, in 2024, it’s still so hard to foresee and prevent production outages. In a production environment, many specific issues may happen only once and never recur, while other types of degradation may occur much more regularly, even daily. It’s impossible to completely understand and predict all the ways things could go wrong in an application development context.

Larger organizations that have built sophisticated observability practices might be able to thrive under these conditions. But what about small and even mid-market organizations that have limited operations resources? And where observability is just one of their many responsibilities? Superior performance (speed and reliability) is critical for anyone who builds revenue-generating software, no matter how big or small.

AI as an Observability “Copilot”

As we noted above, in a production environment, many causes of production outages may only happen once. Smaller teams likely don’t have the resources or foresight to predict every scenario that can cause a system to fail. This is exactly the kind of scenario where AI can help maximize monitoring coverage.

More specifically, AI can be used to baseline data sets and detect anomalies. In this use case, AI algorithms can recognize normal activity across different timeframes — from months to weeks, even down to individual days — and flag when an abnormality crops up. In this way, AI can be valuable in providing proactive signals when an issue may be brewing — without requiring the user to define alert conditions. It can even detect “unknown unknowns,” so engineers don’t have to attempt to predict the future in the form of specific indicators or thresholds.
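
As a rough illustration of baselining, the sketch below flags any sample that sits far outside the mean of a trailing window, measured in standard deviations. This is a simple statistical stand-in, not the algorithm any particular observability product uses; the function name, the window size, the z-score limit and the traffic numbers are assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, z_limit=3.0):
    """Flag samples that deviate sharply from the trailing window's baseline."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any change at all is unusual.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


# Hypothetical requests-per-minute readings with one sudden drop.
requests_per_min = [100, 102, 98, 101, 99, 100, 103, 40, 101, 100]
anomalies = find_anomalies(requests_per_min)
print(anomalies)  # index of the sudden drop
```

Nobody had to decide in advance that 40 requests per minute is too low. The baseline is learned from recent data, so the same function would flag an unexpected spike just as readily. A production system would also need to account for daily and weekly seasonality, which this sketch ignores.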

Another area where AI can help is as a troubleshooting copilot. AI can be used to interpret the log data correlated to an alert. Then generative AI can summarize the behavior and recommend a path to resolution in conversational text. When an anomaly is detected, AI can:

Analyze the contents of the logs contributing to the anomaly
Communicate the severity of the issue and what it’s impacting
Summarize the negative behavior in conversational text
Provide a recommendation on how to resolve the issue

In this way, AI can help organizations move through the troubleshooting process more quickly. It’s almost as if a colleague has investigated the issue for you. The combination is powerful: AI predicts and recommends, and professionals decide on remediation.
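
The first steps of that troubleshooting flow can be sketched without any model at all: collect the log lines from the anomaly window, group similar errors, and assemble the evidence into a prompt. Everything here is hypothetical, including the log format, the function names and the prompt wording, and the call to a generative model is deliberately left out because it depends on whichever service a team uses.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Replace digit runs with a placeholder so similar log lines group together."""
    return re.sub(r"\d+", "<n>", line)


def top_error_patterns(log_lines: list[str], top: int = 3) -> list[tuple[str, int]]:
    """Count the most frequent ERROR patterns among lines in the anomaly window."""
    errors = (normalize(line) for line in log_lines if line.startswith("ERROR"))
    return Counter(errors).most_common(top)


def build_prompt(patterns: list[tuple[str, int]]) -> str:
    """Assemble a plain-text prompt asking a generative model for a summary and next step."""
    evidence = "\n".join(f"{count}x {pattern}" for pattern, count in patterns)
    return (
        "These log patterns were correlated with an anomaly:\n"
        f"{evidence}\n"
        "Summarize the behavior, rate its severity, and recommend a fix."
    )


# Hypothetical log lines captured during the anomaly window.
window_logs = [
    "ERROR db timeout after 3000 ms on shard 4",
    "ERROR db timeout after 3100 ms on shard 7",
    "INFO request served in 12 ms",
    "ERROR payment declined for order 991",
]
patterns = top_error_patterns(window_logs)
prompt = build_prompt(patterns)
print(prompt)
```

Grouping before summarizing keeps the prompt short and puts the most frequent failure first, which is roughly what a colleague would do when skimming the same logs.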

Today, AI is disrupting many industries — from marketing to retail to legal and more. The common theme across these use case scenarios is that AI is automating a lot of the “heavy lifting,” freeing human beings to focus on their core tasks. Observability is no different, as IT and operations teams will always have more pressing concerns than “build this thing in case something happens.” AI may never be a cure-all for observability, but it can certainly be a valuable companion. It’s “on call” 24/7, so you don’t need to be; it can build and refine alerts on your behalf, and it can locate the data you need to deliver a better user experience for your customers.

Ozan Unlu is the CEO and Founder of Edge Delta, an edge observability platform. Previously he served as a Senior Solutions Architect at Sumo Logic; a Software Development Lead and Program Manager at Microsoft; and a Data Engineer at Boeing....