Scaling Redteaming of AI Guardrails

As AI adoption accelerates, enterprises must ensure their models operate safely and compliantly by evaluating model safety and guardrail performance. For example, guidance from MITRE and CISA emphasizes the importance of rigorous redteaming (adversarial testing designed to identify vulnerabilities) as part of holistic model assessment.

However, many large companies lack the resources and expertise to conduct thorough safety redteaming. For example, they may rely on manual testers who evaluate AI use cases one by one, or use limited public datasets that fail to capture the nuances of real-world use-case data. These approaches are not scalable and create bottlenecks for product teams looking to deploy AI applications as the number of use cases to test increases.

In this blog post, we will outline the challenges of manual redteaming, then demonstrate how Dynamo's automated evaluation tests help solve these challenges.

The Challenges of Redteaming AI Systems

Through our work with large enterprises, Dynamo has identified three major pain points when it comes to manually redteaming AI systems:

Pain point 1 - Manual redteaming requires extensive time and resources

Manual redteaming is slow and resource-intensive. Security teams must test individual prompts against models one at a time, then manually compile results and reports. As the number of AI use cases grows, companies must either expand and hire more testers, which is expensive and challenging given that redteaming requires specialized expertise, or accept delays to their AI product development.

To further illustrate the high time and resource cost of this process, we can break down the typical workflow that a security or compliance team must go through when manually evaluating an AI use case against a single policy, e.g. "Prohibit discriminatory language":

Define what policies to test the use case against (1-2 weeks) – First, teams must align with legal, risk, compliance, and product teams on what their policy definitions are, and how to deal with any possible edge cases. For example, if risk and compliance wanted to see compliance with the policy "prohibit discriminatory language", these parties would first need to arrive at a comprehensive definition for what discriminatory language means. Arriving at a comprehensive definition of these policies can be challenging and require back-and-forth deliberation between stakeholders on the strictness of the policy definition.
Create benchmark datasets per policy (1–3 weeks) – Next, teams must gather benchmarking data to create diverse, comprehensive datasets per policy. Data should either be human labeled or human audited to ensure alignment with the policy definitions. Oftentimes, the process of curating or labeling the benchmarking data will reveal further edge cases that need to be addressed with the help of legal or risk stakeholders. Therefore, steps 1 and 2 are repeated until the policy definition becomes more concrete.
Evaluate the guardrail performance (several days) – Next, teams must use the benchmarking datasets to evaluate the compliance of the AI system with regards to each policy and calculate performance metrics, such as F1, precision, recall, and false positive rate.
Run manual redteaming (1-2 weeks) – Once baseline policy compliance is measured, teams must review the benchmarking data and manually redteam the system and guardrails to identify generalizable weaknesses (failure modes). For example, the model may be weak to common misspellings, bullet pointed lists, or struggle when the prompt is especially long. Redteamers will also want to test the model's resilience against common adversarial techniques, including prompt injection or role-play.
Report Results (1-3 weeks) – Finally, teams should summarize key findings and report model performance results to any stakeholders. This includes explaining any evaluation methodology to nontechnical teams, grouping datapoints by failure mode, and outlining the potential weaknesses of the model with respect to each type of jailbreak attack.
If model performance is not at the levels desired, teams must improve the compliance of the model and repeat Step 3. through Step 5. until sufficient compliance with the policy is achieved.

In total, the time it takes to evaluate a single AI use case against a single policy can range between 4 to 10 weeks. In addition, enterprises often have 10+ policies per use-case that they need to define and redteam against, such as testing to ensure that LLMs do not give "investment advice", "legal advice", or "material nonpublic information". Thus, each steps must also be repeated per policy, creating an ever-growing amount of work for risk and security teams.

Pain point 2 - Redteaming results are inconsistent

Even with sufficient resources, redteaming often produces inconsistent results. Different testers use different techniques, leading to variability in findings and making it difficult to generate standardized reports. Without clear, repeatable testing metrics, enterprises may struggle to compare AI systems against one another or accurately measure the effectiveness of their AI security guardrails before and after implementation.

Pain Point 3 - Redteaming techniques must constantly evolve

New attack types are constantly emerging, and redteamers must constantly learn new techniques to keep pace with the latest attack strategies and threats being used in the real world. Enterprises need to ensure coverage against a large database of adversarial threats and known vulnerabilities, as they become known. Maintaining an in-house team to constantly track documented and emerging jailbreaking and prompt injection vulnerabilities can be costly and difficult to sustain for CISO offices.

DynamoEval Helps Scale Redteaming Efforts

To address these three challenges, DynamoEval offers a structured, automated approach to AI redteaming, helping enterprises conduct more efficient and thorough evaluations.

Automated Jailbreaking Tests Accelerate Safety Redteaming

Dynamo's Static Jailbreaking and Adaptive Jailbreaking tests can systematically probe AI models for safety vulnerabilities in a standardized fashion, addressing all three pain points listed above.

First, DynamoEval performs redteaming across 20+ attack vectors simultaneously, eliminating the need for manual testing of individual prompts. Dynamo's evaluation platform also encourages human review of test datapoints, so any incorrect classifications can be audited and relabeled. Next, reports are automatically generated with standardized metrics and key findings based on any model vulnerabilities. Finally, Dynamo's team of ML researchers constantly update the jailbreaking taxonomy to include state of the art attack vectors and new findings. This ensures that enterprises do not need to maintain an in-house team to constantly track documented and emerging jailbreaking and prompt injection vulnerabilities.

The Static Jailbreaking test evaluates an AI system’s ability to resist single-turn adversarial techniques, including the DAN (Do Anything Now) Attacks, Encoding Attacks, Persuasive Adversarial Prompts (PAP), Greedy Coordinate Gradient (GCG) Attacks, and more. In comparison, Dynamo's Adaptive jailbreaking test uses a series of multi-turn prompts that are continuously refined based on the AI system’s response. Dynamo pioneered two novel techniques for adaptive jailbreaking: Tree of Attacks with Pruning (TAP), a structured "tree" of prompt variations that iteratively optimizes jailbreak effectiveness, and Iterative Refinement Induced Self-Jailbreak (IRIS), where the attack learns from its own failures and self-improves over time.

Both of these jailbreaking evaluations significantly reduce the time and expertise required for redteaming while producing repeatable, high-quality evaluations.

The jailbreaking agent (in white) iteratively adapts its queries, eventually causing the AI system to output a noncompliant response (in blue).

‍Policy Compliance Tests Evaluate Custom Criteria

In addition to jailbreaking, DynamoEval’s Policy Compliance Test evaluates how well AI models adhere to specific custom policies by generating synthetic benchmarking data related to each policy automatically. For example, given a "Prohibit discriminatory language" policy, Dynamo will generate both compliant and non-compliant data with respect to the policy definition. The test also measures AI compliance before and after DynamoGuard guardrails are applied, giving enterprises a quantifiable way to demonstrate risk reduction after the application of guardrails. Therefore, this test reduces the time needed to curate benchmarking datasets from weeks to hours.

Automated Tests still Require Human Oversight and Review

DynamoEval’s automated tests are designed to accelerate and scale model evaluation efforts in conjunction with human expertise, not replace all human involvement. That's why Dynamo supports a rich set of audit and relabeling features. By integrating subject matter experts into the evaluation process, enterprises can ensure that automated redteaming findings are accurate and aligned with their desired policy definitions, while still saving considerable time and resources.

Before vs. After DynamoEval: A Clear Comparison

As AI systems become more complex, scalable and automated redteaming solutions will be critical for ensuring compliance and security. DynamoEval empowers enterprises to unblock AI product teams while maintaining rigorous security standards by providing faster, repeatable, and more effective AI evaluations.

Book a demo with our sales team today!

‍

Scaling Redteaming of AI Guardrails

The Challenges of Redteaming AI Systems

DynamoEval Helps Scale Redteaming Efforts

Before vs. After DynamoEval: A Clear Comparison

Book a demo with our sales team today!

Recent posts

Introducing DynamoGuard On-Device Guardrails: The Most Secure Generative AI Solution

Breaking the Bank on AI Guardrails? Here’s How to Minimize Costs Without Comprising Performance