Microsoft on Tuesday introduced ASSERT, an open-source framework designed to help developers test whether AI systems behave correctly for their specific products and services. Short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing, the tool converts plain-English descriptions of an AI model's intended behavior, policies, and guardrails into structured test cases that can be executed and scored automatically. ASSERT also captures the intermediate steps an AI agent takes, including tool calls and decision paths, so engineers can pinpoint exactly where a system goes wrong.

The framework addresses a gap that general-purpose AI evaluations often miss: behavior that depends heavily on a particular application's context, tools, and rules. A developer could, for instance, tell ASSERT that a document research agent must never email people outside the company, should restrict confidential information to C-level executives, and must produce concise summaries that take prior context into account. From those instructions alone, the system generates scenarios designed to probe whether those policies are actually being followed over time, scoring the results and flagging regressions.

Sarah Bird, Microsoft's chief product officer of Responsible AI, framed evaluations as the foundation of any trustworthy deployment. "If you don't understand the behavior of the AI system, it's really hard to know if it's meeting your organization's bar," Bird said, adding that application-specific testing across multiple dimensions is essential to building confidence in a system.

Developers can also feed ASSERT additional system context, available tools, and hard constraints, allowing them to tailor evaluations to the precise shape of the product they are shipping.

The launch reflects a broader shift in how enterprises are approaching AI quality assurance as agentic systems grow more complex.
Rather than relying on broad benchmarks that measure general capabilities, teams now need ongoing, automated checks tied to their own workflows and risk tolerance. By making that process as simple as writing a specification, Microsoft is betting that more rigorous, continuous testing will become a routine part of AI development, and that gaps between what a model is supposed to do and what it actually does will surface earlier, before users encounter them.
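
To make the idea concrete, here is a minimal Python sketch of what a spec-driven check over an agent's recorded tool calls might look like, using the "never email people outside the company" rule from the example above. It is not ASSERT's API, which the announcement does not detail. Every name in it (`ToolCall`, `Trace`, `no_external_email`, `pass_rate`, `is_regression`, the `send_email` tool, and the `example.com` domain) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical company domain, assumed for this illustration only.
COMPANY_DOMAIN = "example.com"


@dataclass
class ToolCall:
    """One intermediate step an agent took: a tool name plus its arguments."""
    tool: str
    args: dict


@dataclass
class Trace:
    """The recorded steps for a single test scenario."""
    scenario: str
    steps: list = field(default_factory=list)


def no_external_email(trace):
    """Return the indices of steps that email anyone outside the company."""
    violations = []
    for index, step in enumerate(trace.steps):
        if step.tool != "send_email":
            continue
        recipients = step.args.get("to", [])
        if any(not r.lower().endswith("@" + COMPANY_DOMAIN) for r in recipients):
            violations.append(index)
    return violations


def pass_rate(traces, check):
    """Fraction of traces for which the check reports no violations."""
    if not traces:
        return 1.0
    passed = sum(1 for trace in traces if not check(trace))
    return passed / len(traces)


def is_regression(current, baseline, tolerance=0.0):
    """Flag a regression when the score drops below the baseline."""
    return current < baseline - tolerance


# Two hand-written traces standing in for generated scenarios.
traces = [
    Trace("summarize the quarterly report for a colleague", [
        ToolCall("search_docs", {"query": "quarterly report"}),
        ToolCall("send_email", {"to": ["ana@example.com"]}),
    ]),
    Trace("share findings with a vendor", [
        ToolCall("search_docs", {"query": "findings"}),
        ToolCall("send_email", {"to": ["bob@vendor.net"]}),
    ]),
]

violations = [no_external_email(trace) for trace in traces]
score = pass_rate(traces, no_external_email)
regressed = is_regression(score, baseline=1.0)
```

Because the check returns step indices rather than a bare pass or fail, a failing scenario points to the specific tool call that broke the rule, which mirrors the article's point about pinpointing where a system goes wrong. Comparing the score against a stored baseline is one simple way to flag regressions between runs.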