For decades, software quality assurance has operated on a simple premise: Test an application against a predictable script to see if it works as expected.
However, this entire model begins to break down when the user is an autonomous AI agent — a user that can think for itself, create its own path and deviate from any script we could possibly write.
As AI agents, enabled by protocols such as MCP, gain the ability to dynamically chain tools and services together, the traditional approach to QA is quickly becoming obsolete. Srinivasan Sekar, a Director of Engineering at LambdaTest, an AI-native software testing platform, notes that the shift from predictable execution to autonomous decision-making is prompting a complete rethink of how we test software. He argues that instead of testing static, predefined outcomes, we should be constantly looking at how an agent behaves and reasons.
A new framework for AI quality is now emerging, informed by leaders at the forefront of this transformation. It requires a new approach that moves beyond simple pass/fail results and instead focuses on validating the agent’s user experience as it navigates a complex digital world.
From Testing Outcomes to Validating Behavior
To understand the shift required for AI testing, Raju Malhotra, CPTO at Certinia, suggests a powerful analogy: Testing a self-driving car. You cannot simply test a self-driving car in a closed circuit with predictable conditions and declare it safe for the real world. Instead, you must test how it responds to a dynamic, unpredictable environment. He argues that the same principle applies to AI agents. The focus of testing must shift from verifying a single, specific output to evaluating a distribution of acceptable outcomes that operate within a set of real-time guardrails.
For QA teams, this means the first step is to redefine what a test case is. Instead of a rigid script with a single expected result, a modern test case for an AI agent defines a set of operational boundaries or guardrails. For example, a test might not specify the exact path an e-commerce agent takes to process a refund, but it would define clear guardrails such as the final refund amount must not exceed the original purchase price and no personally identifiable information should be logged in plain text.
This new philosophy is what Sekar calls “autonomous workflow validation.” He explains that the goal is no longer to validate a known, predefined path to success, because an autonomous agent may create its own novel path. Instead, the critical task for quality assurance is to evaluate the agent’s behavior and decision-making process at each step. The new fundamental question for testing becomes: Based on the context it was given, did the AI agent demonstrate sound reasoning by choosing the right tool for the right job?
Sai Krishna, also a Director of Engineering at LambdaTest, builds on this with a practical implementation approach. “We’ve developed what we call contextual assertion frameworks,” he explains. “Instead of asserting that an agent clicked button X, we assert that, given context Y, the agent’s choice was reasonable.
For instance, if an agent is troubleshooting a failed test, we validate whether it gathered sufficient diagnostic information before escalating — not whether it followed steps 1, 2 and 3 in order. 3 in order. This step requires instrumenting the agent’s decision points and creating a scoring system for reasoning quality, not just outcome accuracy.
Implementing this strategy requires building new tools for observability. Sekar notes that QA engineers need access to the agent’s decision logs, recording not only the final action but also the context received, the tools considered and the selection reasoning.
Sai Krishna adds that at LambdaTest, they’ve built a “reasoning replay” capability: “We can literally step through an agent’s thought process frame-by-frame, like debugging code. When a test fails, instead of asking ‘What went wrong?’, we ask, ‘Where did the reasoning diverge from expected patterns?’ This has cut our agent debugging time by 60%.” The test itself then becomes an audit of this reasoning process, with automated checks ensuring alignment with business rules and best practices.
The Rise of the ‘AI Red Team’ and ‘Supervisor’
If the goal is to evaluate the dynamic behavior of an autonomous agent, the methods for doing so must also become dynamic and autonomous. According to notable engineering leaders, the only effective way to validate a complex AI system is to use another AI. This has led to the rise of two new, specialized roles for AI in the quality assurance process: The adversary and the supervisor.
The adversarial approach, what Bobby DeSimone, CEO of Pomerium, calls the rise of the “AI red team,” is a strategy built on the principle of fighting AI with AI. He argues that to properly harden a system against unpredictable agentic behavior, you need specialized tester agents designed to do the illogical and unexpected things a human QA engineer might not consider. These AI red teams relentlessly probe the system with novel inputs and complex requests, seeking to find edge cases and vulnerabilities before they can be exploited in production.
In practice, implementing an AI red team means creating a separate suite of tests where the goal is not to validate correct behavior, but to actively induce failure. This involves prompting specialized adversarial agents with objectives such as “find a way to get the primary agent to violate its tone policy,” or “construct a query that bypasses the data access restrictions.” The results of these tests are not simple pass/fail metrics, but a catalog of vulnerabilities that need to be addressed.
Complementing this adversarial approach is the concept of the “AI supervisor,” as described by Mike Finley, CTO of AnswerRocket. While the red team acts as an attacker, the supervisor acts as a diligent manager, continuously overseeing the work of the primary AI agent. He suggests a two-part framework for this oversight.
First, the supervisor agent requires the primary agent to provide proof points for its work, demanding verifiable sources and a documented chain of reasoning for its decisions. Second, the supervisor actively monitors the primary agent’s output, checking not just for factual accuracy, but also for qualitative aspects such as adherence to company tone and policy.
Building an effective AI supervisor requires designing the primary agent to be inherently auditable. For every significant action, the agent must be engineered to output a reasoning payload alongside its result. This payload, containing the source data, the logic applied and a confidence score, becomes the raw material for the supervisor agent. The supervisory tests then become a continuous, automated audit, programmatically checking these proof points against a knowledge base of business rules and compliance requirements.
Building for Testability and the MLOps Foundation for Trust
These advanced testing agents, however, cannot operate in a vacuum; they must be built upon a solid MLOps foundation and a culture that designs for testability from day one.
Etan Lightstone, a product design leader at Domino Data Lab, argues that building trust in agents requires applying familiar operational principles. He suggests that for an enterprise with mature MLOps capabilities, trusting an agent is not enormously different from trusting a human user, because the same pillars of governance are in place: Robust logging of every action, complete auditability to trace what happened and the critical ability to roll back any action if something goes wrong.
This product-centric mindset also extends to how we design and test the MCP tools before they ever reach production. Lightstone proposes a novel approach he calls “usability testing for AI.” Just as a product team would run usability tests with human beings to uncover design flaws before a release, he advises that MCP servers should be tested with sample AI agents. This is an effective way to discover issues in how a tool’s functions are documented and described — which is critical, since this documentation effectively becomes part of the prompt that the AI agent uses.
Furthermore, he suggests we need to build “support links” for AI agents acting on our behalf. When a user gets stuck, they can often click a link to get help or submit feedback. Lightstone argues that AI agents need similar recovery mechanisms. This could be an MCP-exposed feedback tool that an agent can call if it cannot recover from an error or a dedicated function to get help from a documentation search. This approach treats the agent as a true user, building a more resilient and testable AI ecosystem from the ground up.
From an implementation perspective, Sai Krishna stresses the need for infrastructure. He says, “At LambdaTest, we see agent testing infrastructure as a top-tier product. We’ve set up special testing sandboxes where agents can fail safely and roll back all transactions.”
Just like coding, LambdaTest controls its guardrails. This way, when business rules change, teams can see exactly how they affect agents’ behavior in the test suite. This infrastructure-as-code method, along with Sekarʼs focus on MLOps basics, forms the operational backbone for dependable agentic systems.
Sekar and Sai Krishna both say that built-in recovery paths, agent-centric usability testing andstrong MLOps are all important areas of study that need to be mastered to move AI out of the experimental stage. Sekar notes, “The companies that will be successful with agentic AI are not the ones with the most advanced models. They have the most advanced systems for validation and observability.” Sai adds, “We tell our customers that if they can’t explain why their agent made a decision, they’re not ready to go live.”
The Path Forward
The emergence of agentic AI is transforming the role of quality assurance from a final, pre-deployment gatekeeper into a continuous, dynamic partner in a live system. The new framework for AI quality is no longer about writing rigid test cases for every possible outcome. Instead, it is about creating a resilient ecosystem of AI supervisors and adversarial red teams, built upon a strong MLOps foundation, to continuously evaluate an agent’s reasoning and behavior in real-time.
As the leaders in this article emphasize, this is not just an academic exercise in finding new ways to test software; it is about building the essential business and compliance guardrails required for the agentic era. The discipline of creating these new, sophisticated validation frameworks is the critical work that will separate experimental AI projects from the scalable, secure and trustworthy enterprise systems of the future.

