Product
Oct 8, 2024

AI Content Moderation: Defining Policies and Synthetic Data Gen (Part 2)

As AI use grows, so do the challenges in content moderation. Learn how Dynamo AI refines guardrails to incorporate nuanced risk

Today, we dive into Dynamo AI's approach to content moderation, starting with the first and most crucial step: defining what content should be moderated. (This is a follow-up to Part 1 on why today’s content moderation systems aren’t as effective as they claim.)

Defining clear guardrails for moderation can be complex. Below, we explore these challenges and how our team helps address them by refining guardrail definitions, incorporating risk considerations, and using synthetic data for comprehensive model training.

The challenges of defining a guardrail

To develop a content moderation system, it is critical to systematically think about what content we want to moderate against. However, clearly defining this can be harder than it seems.

Let’s take the example guardrail: Block user requests for financial advice. 

For some user queries, this guardrail definition is enough to determine whether or not the query is compliant. For example, the definition successfully differentiates between the following two queries:

  • “Which mutual funds should I select for my retirement investment portfolio?” --> This query is clearly asking for financial advice on which mutual funds to invest in.
  • “How do I recover my password for my banking account?” --> This query is clearly not asking for financial advice; it is asking for help logging in to an account.

However, for other user queries, the guardrail definition doesn't provide a clear understanding of compliance or non-compliance. For example: 

  • “What are the advantages of storing funds in a savings account instead of a checking account?” --> To determine whether or not this query violates our guardrail, we need to first answer questions such as: (1) Is asking for pros and cons the same as asking for advice?, and (2) Does personal banking fall into the domain of finance?
  • “Is it wise to refinance one’s mortgage when the Fed raises interest rates?” --> To determine whether or not this query violates our guardrail, we need to first answer questions such as: (1) Do economics and real estate fall into our domain of finance?, and (2) Does asking for general guidance count as asking for advice?

While the answer may be clear to us humans and content reviewers, a model won’t immediately understand how to deal with each of these nuances.

To ensure our content moderation model accurately reflects all the nuances of our intended guardrail, we need to first clearly define the domain and carve out our specific guardrail. 
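
To make this concrete, here is a minimal sketch of how an initial guardrail definition, a few labeled example queries, and the open questions they raise might be captured before any model training. The structure and field names are hypothetical and are not DynamoGuard's API.

```python
from dataclasses import dataclass, field

@dataclass
class GuardrailDraft:
    """Hypothetical container for an initial, still-ambiguous guardrail definition."""
    name: str
    definition: str
    # Labeled examples: True means the guardrail should block the query.
    examples: dict[str, bool] = field(default_factory=dict)
    # Ambiguities to resolve before the definition is precise enough to train on.
    open_questions: list[str] = field(default_factory=list)

financial_advice = GuardrailDraft(
    name="no-financial-advice",
    definition="Block user requests for financial advice.",
    examples={
        "Which mutual funds should I select for my retirement portfolio?": True,
        "How do I recover my password for my banking account?": False,
    },
    open_questions=[
        "Is asking for pros and cons the same as asking for advice?",
        "Does personal banking fall into the domain of finance?",
        "Does asking for general guidance count as asking for advice?",
    ],
)
```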

Breaking down a guardrail definition

At Dynamo AI, once we have an initial guardrail definition, we use our models to identify the ambiguities in the guardrail description. For a successful, nuanced definition, each ambiguity must be identified and resolved.

For instance, for the financial advice guardrail, it’s unclear what falls into the domain of ‘finance.’ To clarify this, we can begin by defining the set of topics related to finance.

Depending on the end application, this may include subjects like investing, taxes, economics, lending, and financial regulations. For our out-of-the-box financial advice guardrail, we include 18 different topics that specify the various concepts we consider related to finance.

In parallel, we clarify the meaning of the word ‘advice’ to help us better understand what types of user requests related to each topic should be considered advice. In our example above, this may include things like ‘asking for a definition’ or ‘requesting personalized recommendations.’ 

As we resolve the ambiguities in our definition, we arrive at a more comprehensive guardrail description that captures the different types of user requests we will need to apply our guardrail against.
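
One way to picture the output of this decomposition step is as explicit lists of topics and request types. The snippet below is a simplified, hypothetical illustration (it shows only a handful of the 18 topics mentioned above), not DynamoGuard's internal representation.

```python
# Illustrative decomposition of the 'financial advice' guardrail.
FINANCE_TOPICS = {
    "investing", "taxes", "economics", "lending and credit",
    "real estate", "financial regulations",
}

# Request types that count as 'advice' for this guardrail.
ADVICE_LIKE_REQUESTS = {
    "personalized recommendation",
    "best practices",
    "general guidance",
}

# Request types that do not count as 'advice'.
NON_ADVICE_REQUESTS = {
    "definition or explanation",
    "factual information",
    "customer support",
}
```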

Incorporating risk considerations in guardrail definitions

When defining a guardrail, it is important to assess the operational risks associated with different types of user requests and model responses, and in particular the impact to individuals and organizations. This can help further define what should be considered as allowed or disallowed.

For instance, requesting definitions related to the topic of investing may be a lower-risk request, since this is a request for factual information and is unlikely to result in a compliance violation or consumer harm.

However, providing personalized advice on lending and credit decisions may be considered a higher-risk request, as this action deals with the financial livelihood of a user, is subject to a number of laws and regulations, and may cause consumer harm if incorrect information is provided. This, in turn, may lead to significant reputational risk for an organization.

To gauge the operational risk of an AI system, one key component is to assess how the model will be used. The types of user personas and the context in which the guardrail is used affect both the severity and the types of risks surfaced by non-compliant outputs.

For example, a consumer-facing chatbot may have many more associated risks than an internal employee-facing chatbot. Similarly, an AI system providing financial data may have more associated risks than a system focused on providing information about the weather.
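
As a rough illustration of how request type and deployment context combine, consider the hypothetical scoring below. The numbers and categories are made up for this example and do not reflect a Dynamo AI risk model.

```python
# Hypothetical risk scoring: combine the inherent risk of a request type
# with the exposure of the deployment context.
REQUEST_RISK = {"factual definition": 1, "general guidance": 2, "personalized advice": 3}
CONTEXT_RISK = {"internal employee chatbot": 1, "consumer-facing chatbot": 3}

def risk_level(request_type: str, context: str) -> str:
    score = REQUEST_RISK[request_type] * CONTEXT_RISK[context]
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"

print(risk_level("personalized advice", "consumer-facing chatbot"))   # high
print(risk_level("factual definition", "internal employee chatbot"))  # low
```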

Assessing the risks of our AI use case can further clarify our initial guardrail definition. Now, we can begin constructing a finalized guardrail definition: 

  1. Block user requests for financial advice
  2. Do not allow requests for best practices, general advice, or personalized advice on topics including: investing, corporate finance, lending and credit, real estate, investment banking, etc. 
  3. Allow requests for explanations of financial concepts and products, requests for factual information, requests for customer support information, etc.
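
Written out as a structured policy, the finalized definition might look something like the sketch below. The field names are illustrative, not DynamoGuard's policy format.

```python
# Hypothetical structured form of the finalized guardrail definition.
finalized_policy = {
    "name": "no-financial-advice",
    "definition": "Block user requests for financial advice.",
    "disallowed": [
        "Requests for best practices, general advice, or personalized advice "
        "on topics including investing, corporate finance, lending and credit, "
        "real estate, and investment banking.",
    ],
    "allowed": [
        "Requests for explanations of financial concepts and products.",
        "Requests for factual information.",
        "Requests for customer support information.",
    ],
}
```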

Now that we’ve reached a finalized guardrail definition, we can use DynamoGuard to generate our guardrail model. 

Synthetic data generation

When walking enterprises through the above process, we identified a key hurdle: coming up with a comprehensive, high-quality set of examples. Crowdsourcing examples from internal and external stakeholders often requires expertise, onboarding and training, and weeks of effort, and it may implicitly build in bias. To help enterprises solve these challenges and accelerate the guardrail development process, our Dynamo AI team developed new research techniques to synthetically generate high-quality, diverse, and comprehensive prompt examples.

Once we provide our risk taxonomy, DynamoGuard generates a set of synthetic data under the hood to train our model. The quality and comprehensiveness of this data is directly related to the performance of our guardrail.

A high-quality guardrail training dataset must contain enough examples to cover the domain we previously defined. This includes:

  1. On-topic examples that reflect each component of the guardrail 
  2. Diverse examples representing edge cases of user inputs and model responses 
  3. Grounded examples that represent realistic usage of the AI system, based on user-uploaded data
  4. Malicious examples reflecting different types of jailbreaking attacks that could be used to break the guardrail

In aggregate, our synthetic data generation pipeline simulates actors on a full spectrum of malicious intent, ranging from naive questions that unintentionally break compliance policies to malicious attackers who are actively seeking to elicit harmful output from LLMs.
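
A heavily simplified sketch of such a pipeline is below. The categories mirror the list above, the intent spectrum runs from naive to adversarial, and the `generate` function is a placeholder for whatever LLM client you use; none of this is Dynamo AI's actual research pipeline.

```python
import itertools

CATEGORIES = ["on-topic", "edge case", "grounded in uploaded data", "jailbreak attempt"]
INTENTS = ["naive user", "curious user", "adversarial attacker"]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; swap in your own client here."""
    raise NotImplementedError

def synthesize_training_prompts(policy: str, per_cell: int = 5) -> list[dict]:
    """Generate candidate training prompts for every (category, intent) pair."""
    examples = []
    for category, intent in itertools.product(CATEGORIES, INTENTS):
        instruction = (
            f"Policy: {policy}\n"
            f"Write {per_cell} user messages of type '{category}' from a '{intent}' "
            f"that a guardrail enforcing this policy must handle. "
            f"Return one message per line."
        )
        for line in generate(instruction).splitlines():
            if line.strip():
                examples.append({"category": category, "intent": intent, "prompt": line.strip()})
    return examples
```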

Below, we share a sample of synthetically generated training examples from DynamoGuard based on our financial advice guardrail.

A human-in-the-loop flow for guardrail creation

The final key piece in defining a guardrail is providing feedback. While DynamoGuard can create a well-performing guardrail based on an initial definition, it's critical to continue providing feedback to further validate and refine the guardrail to our particular use case. 

It can be difficult to capture all the nuances of a guardrail in the initial definition. Furthermore, we may not realize that we want to guardrail against a certain type of user input or model response until we see it in real time.

To help with this, DynamoGuard provides tools for giving feedback on each synthetically generated datapoint, as well as on real-time guardrail results for user inputs and model responses. The feedback is then used for model re-training to further polish the guardrail definition.
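
Conceptually, the feedback loop can be pictured as the sketch below: each human review records whether the reviewer agreed with the guardrail's decision, and once disagreement crosses a threshold the guardrail model is retrained. The schema and threshold are hypothetical, not the DynamoGuard API.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    """A single human review of a guardrail decision (hypothetical schema)."""
    prompt: str
    model_decision: str      # "blocked" or "allowed"
    reviewer_decision: str   # corrected label from the human reviewer
    note: str = ""

def needs_retraining(feedback: list[Feedback], threshold: float = 0.05) -> bool:
    """Retrain when the disagreement rate between model and reviewers is too high."""
    if not feedback:
        return False
    disagreements = sum(f.model_decision != f.reviewer_decision for f in feedback)
    return disagreements / len(feedback) >= threshold
```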

These feedback loops from human validation and automated red-teaming enable DynamoGuard to dynamically adapt to the latest threats and vulnerabilities to AI safety. Even after the guardrail is deployed, we enable teams to continually improve with live user data monitoring and review.

Ready to enhance your content moderation system? Discover how our tailored solutions can address real-world risks and improve compliance. Schedule a free demo.