Generative AI Incidents Hit Different
Generative AI (GenAI) cloud services are unique in their intense demands on hardware — as well as the computational resources running on top of it. Yet despite the need for reliability, there’s been almost no research on it, or on how cloud GenAI incidents are being managed.
So seven Microsoft researchers (including two based in China) teamed up with three more researchers from China-based universities and two from the University of Illinois Urbana-Champaign and published what they call “a comprehensive study of incidents from GenAI cloud services” — all taken from within Microsoft and exploring “symptoms, root causes, and mitigation strategies.”
Their conclusion? From an infrastructure standpoint, these services truly are different. “Like any large-scale cloud service, failures are inevitable in cloud-based GenAI services,” the paper begins.
But GenAI is unique, and “understanding the characteristics of these incidents — including detection, triage, diagnosis, and mitigation — is crucial for enhancing the quality of GenAI cloud services.”
Four Years After GPT-3
Using data from Microsoft’s Incident Management system over roughly four years, they analyzed GenAI cloud production incidents for their “general characteristics,” including their impact on availability and other quality-of-service issues (which include, among other things, “generated content quality issues”). And in their “Empirical Study of Production Incidents in Generative AI Cloud Services,” they ultimately found two crucial ways GenAI incidents differ from other cloud incidents:
- They take longer to mitigate.
- They’re primarily caused by infrastructure.
Their general analysis started with June 2020 (the release date of GPT-3) and extended through February 2024. Their paper notes that Microsoft hosts OpenAI’s voluminous training infrastructure as well as its public-facing APIs. (A graph shows spikes after the introduction of GPT-3.5, ChatGPT and GPT-4.) Microsoft also hosts services like Azure OpenAI.

Using nearly four years of real-world incidents to identify the actual challenges faced in production systems, they hoped to uncover ways to improve the reliability of large-scale GenAI cloud services. And fortunately, Microsoft’s system captured the root causes and mitigation steps (along with discussions by the engineers involved), which the researchers saw as “enabling a comprehensive and comparative analysis of GenAI cloud service incidents…” Significantly, they were also able to painstakingly classify which incidents were not GenAI-related, giving them a baseline for comparison. Microsoft’s system also captured whether the incident’s severity was high, medium or low.
The researchers focused on “significant”/high-severity incidents with detailed root cause descriptions, “facilitating an insightful qualitative analysis.” But besides the usual focus on reliability, GenAI services also face their own unique issues — including “response quality degradation” (like inappropriate output from simple prompts or “the generation of invalid content, where the model couldn’t understand the user’s prompt”) and end-user privacy considerations (as well as their own unique kind of performance issues).
Even the filters for harmful content can malfunction, either generating false alarms or allowing actual harmful content to slip through. There can be network issues, storage issues and even problems with actual computing resources. But there are also unique kinds of “deployment failures” like problems with the availability of large language models (LLMs), or even with the APIs for selecting a model or setting parameters (as well as the APIs for uploading or downloading data).
The researchers dubbed these “GenAI Incidents.”
Findings
The issues broke into three clear categories:
- Performance degradation: 49.8%
- Deployment failure: 35.7%
- Invalid inference: 14.5%
But importantly, GenAI cloud services had a much higher rate of incidents detected by humans (rather than automated monitors):
- GenAI cloud services: 38.3%
- Other services: 13.7%
Their paper notes that 45.9% of GenAI cloud services “are still under development or in the preview stage” (with 54.1% in “General Availability”). Human-reported incidents also had to be reassigned to a more appropriate team more often than automatically detected ones. (The researchers note one possible reason is “the interdependency on other services. Resolving an incident might exceed the capabilities of a single team, and collaborative efforts across different service domains are needed.”)
For this and other reasons, human-reported incidents needed 72% more time to mitigate. (Automated reports, after all, often come with suggested troubleshooting guides.)
The report suggests service providers should enhance observability “to detect and diagnose issues more effectively…. Automatic monitors and trouble-shooting guides can significantly boost the mitigation process, and reduce the Time to Mitigation for GenAI incidents.” But there is another problem: automated monitoring of GenAI services currently seems to have a higher false alarm rate than human-detected incidents (11.0% vs. 6.6%), with both of these numbers higher than what non-GenAI services experienced (3.8% and 4.8%).

But all GenAI incidents also took longer to mitigate than non-GenAI incidents, the researchers found, suggesting that GenAI incidents are more complex (with their “vast and interconnected layers of infrastructure, dependencies, and configurations… A single symptom can stem from multiple root causes, thus complicating the debugging”). They suggest one obvious solution: observability with more granular insights into what’s causing the incidents, from automation tools or agents. But another suggestion is Infrastructure as Code (IaC) practices, “to manage complex GenAI cloud infra more effectively.”
They were actually able to quantify how this is playing out in the real world.
“GenAI cloud systems require 2.5x more infrastructure fixes, 3.0x more code changes, and 3.0x as many configuration updates compared to non-GenAI services.”
Confronting Complexity
They also suggest service providers should supply users of GenAI services with better support and documentation to help them “navigate the complexities of GenAI service integration and management.” Developers could help lower the number of GenAI incidents by implementing stricter input validation processes and “dynamic” rate-limiting strategies “that adapt to real-time conditions.”
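As a rough illustration of what stricter input validation and “dynamic” rate limiting might look like on the developer side, here is a minimal Python sketch. None of it comes from the paper: the `AdaptiveTokenBucket` class, the `validate_prompt` function, the idea of slowing the refill rate when observed latency climbs, and every threshold shown are hypothetical choices made for this example.

```python
import time


def validate_prompt(prompt, max_chars=4000):
    """Reject empty or oversized prompts before they reach the model.

    max_chars is an illustrative limit, not a value from the paper.
    """
    return isinstance(prompt, str) and 0 < len(prompt.strip()) and len(prompt) <= max_chars


class AdaptiveTokenBucket:
    """Token bucket whose refill rate drops when observed latency rises."""

    def __init__(self, base_rate, capacity, target_latency, clock=time.monotonic):
        self.base_rate = base_rate            # tokens per second under normal load
        self.rate = base_rate                 # current (possibly reduced) refill rate
        self.capacity = capacity              # maximum burst size
        self.target_latency = target_latency  # latency (seconds) considered healthy
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def observe_latency(self, latency):
        """Adapt to real-time conditions: slow down when the backend is slow."""
        if latency > self.target_latency:
            scaled = self.base_rate * self.target_latency / latency
            # Never drop below 10% of the base rate (an arbitrary floor).
            self.rate = max(scaled, self.base_rate * 0.1)
        else:
            self.rate = self.base_rate

    def allow(self, cost=1.0):
        """Return True if a request costing `cost` tokens may proceed now."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would check `validate_prompt(prompt)` first, then `bucket.allow()`, and feed each measured response time back through `bucket.observe_latency(...)`. A production system would likely adapt to richer signals than latency alone.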
But part of the problem seems to be the true complexity of supporting GenAI. For non-GenAI cloud services, 54.7% of their mitigations are the speedy “ad-hoc fixes,” defined as “improvised, situation-specific steps” to first mitigate symptoms, like adding extra lines to the validation code to block malicious users who bypass size limitations. But only 22.4% of GenAI fixes are ad hoc. The researchers suggest GenAI cloud services, being in their early development stage, require “more diverse, sophisticated, and time-consuming methods.”
They expect monitoring tools for GenAI incidents will improve, helping to reduce time to mitigation.
But mitigating GenAI cloud service incidents was unique in another way, since “a specific root cause is not tied to a single type of fix…. Given the tight deadlines for on-call engineers, quick approaches like rollback are prioritized to reduce downtime.” (This leads to a telling statistic: “While code bugs account for 21.5% of the GenAI incidents, only 7.6% of fixes are code changes…”)
There were other interesting observations about the unique challenges of GenAI infrastructure, like how low-severity incidents “exhibit a significantly longer time-to-mitigation compared to other severity levels because these lower-priority GenAI incidents often remain unresolved for extended periods due to their low impact.” And 14.5% of reported incidents were “invalid outputs” like hallucinations or irrelevant responses, which they describe as “challenging to detect” and in need of automated checks.
Current detection methods include relying on the LLM’s own “self-judgment,” calculating consistency scores after multiple attempts, or using models fine-tuned with human-labeled data, but none of these methods is fully effective or cost-efficient. “More robust research is needed to address these limitations and develop scalable validation algorithms…”
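To make the “consistency scores after multiple attempts” idea concrete, here is a minimal Python sketch. The article does not say how the scores are actually computed, so the scoring method here (mean pairwise Jaccard overlap of lowercased words), the function names and the 0.5 threshold are all assumptions chosen for illustration. The sampled responses are passed in as plain strings, since how they are generated is outside the scope of the sketch.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty responses are trivially identical
    return len(words_a & words_b) / len(words_a | words_b)


def consistency_score(responses):
    """Mean pairwise similarity across several responses to the same prompt."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to compare")
    pairs = list(combinations(responses, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def looks_inconsistent(responses, threshold=0.5):
    """Flag a prompt for review when its sampled responses disagree.

    The threshold is an arbitrary illustrative value.
    """
    return consistency_score(responses) < threshold
```

Word overlap is a crude proxy: two responses can share most of their words and still contradict each other. It also requires several model calls per prompt, which hints at why the researchers call these methods neither fully effective nor cost-efficient.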
But their report also contains this interesting caveat. Since all the incidents came from Microsoft’s cloud systems — which already deploy automated tools to stop some incidents before they happen — this Microsoft-only dataset “may not fully represent the behavior of other GenAI cloud services.”
So the researchers are already planning a wider evaluation using multiple companies’ GenAI cloud services.