As AI use grows, so do the challenges in content moderation. Learn how Dynamo AI refines guardrails to incorporate nuanced risk considerations.
Today, we dive into Dynamo AI's approach to content moderation, starting with the first and most crucial step — defining what content should be moderated. (This is a follow-up to Part 1 on why today’s content moderation systems aren’t as effective as they claim.)
Defining clear guardrails for moderation can be complex. We explore these challenges below, and how our team helps address them by refining guardrail definitions, incorporating risk considerations, and using synthetic data for comprehensive model training.
To develop a content moderation system, it is critical to systematically think about what content we want to moderate against. However, clearly defining this can be harder than it seems.
Let’s take the example guardrail: Block user requests for financial advice.
For some user queries, this guardrail definition is enough to understand whether or not the query should be compliant. For example, the definition successfully differentiates between the following two queries:
However, for other user queries, the guardrail definition doesn't provide a clear understanding of compliance or non-compliance. For example:
While the answer may be clear to human content reviewers, a model won’t immediately understand how to handle each of these nuances.
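To make the gap concrete, here is a minimal, hypothetical illustration of how a one-sentence policy leaves edge cases undecided. The example queries and labels below are illustrative assumptions for this sketch, not actual DynamoGuard data.

```python
# Hypothetical queries evaluated against the guardrail
# "Block user requests for financial advice".
# Queries and labels are illustrative assumptions, not DynamoGuard output.
queries = [
    # Clear-cut cases the one-line definition already handles:
    ("What are your customer support hours?", "compliant"),       # unrelated to finance
    ("Which stocks should I buy this year?", "non_compliant"),    # explicit advice request
    # Ambiguous cases the one-line definition does not resolve:
    ("What is a 401(k)?", "unclear"),                   # definition, or implicit retirement advice?
    ("Is now a good time to refinance?", "unclear"),    # general information vs. personalized advice
]

for text, label in queries:
    print(f"{label:>13}: {text}")
```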
To ensure our content moderation model accurately reflects all the nuances of our intended guardrail, we need to first clearly define the domain and carve out our specific guardrail.
At Dynamo AI, once we have an initial guardrail definition, we use our models to identify the ambiguities in the guardrail description. For a successful, nuanced definition, each ambiguity must be identified and resolved.
For instance, for the financial advice guardrail, it’s unclear what falls into the domain of ‘finance.’ To clarify this, we can begin by defining the set of topics related to finance.
Based on the end-application, this may include subjects like investing, taxes, economics, lending, and financial regulations. For our out-of-the-box financial advice guardrail, we include 18 different topics that specify the various types of concepts we consider as being related to finance.
In parallel, we clarify the meaning of the word ‘advice’ to help us better understand what types of user requests related to each topic should be considered advice. In our example above, this may include things like ‘asking for a definition’ or ‘requesting personalized recommendations.’
As we resolve the ambiguities in our definition, we get a more comprehensive guardrail description that captures the different types of user requests we will need to apply our guardrail against.
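As a sketch, the resolved definition can be captured in a small, structured form like the one below. The topic names and request-type labels here are illustrative assumptions; the production guardrail covers 18 finance-related topics, which we do not reproduce.

```python
# Illustrative structure for a refined guardrail definition.
# Topic and request-type names are assumptions for this sketch; the actual
# out-of-the-box guardrail includes 18 finance-related topics.
guardrail = {
    "name": "block_financial_advice",
    "topics": [
        "investing", "taxes", "economics", "lending", "financial_regulations",
        # ... remaining topics elided
    ],
    "request_types": {
        "definition_request": "asking for the definition of a financial term",
        "factual_question": "asking for general, factual financial information",
        "personalized_recommendation": "requesting advice tailored to the user's situation",
    },
}
```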
When defining a guardrail, it is important to assess the operational risks associated with different types of user requests and model responses, and in particular the impact to individuals and organizations. This can help further define what should be considered as allowed or disallowed.
For instance, requesting definitions related to the topic of investing may be a lower-risk request, since this is a request for factual information and is unlikely to result in a compliance violation or consumer harm.
However, requesting personalized advice or decisions on lending and credit may be considered higher risk, as these requests deal with the financial livelihood of a user, are subject to a number of laws and regulations, and may cause consumer harm if incorrect information is provided. This, in turn, may lead to significant reputational risk for an organization.
To gauge the operational risk of an AI system, one key component is to assess how the model will be used. The types of user personas and the context in which the guardrail is used affects both the severity and the types of risks surfaced by non-compliant outputs.
For example, a consumer-facing chatbot may have many more associated risks than an internal employee-facing chatbot. Similarly, an AI system providing financial data may have more associated risks than a system focused on providing information about the weather.
Assessing the risks of our AI use case can further clarify our initial guardrail definition. Now, we can begin constructing a finalized guardrail definition:
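One hypothetical way such a finalized definition could encode risk-informed decisions is a small policy table mapping topic and request type to an outcome. The rules, risk tiers, and conservative default below are assumptions for illustration, not Dynamo AI's actual policy format.

```python
# Hypothetical policy table mapping (topic, request type) pairs to a decision,
# informed by the risk assessment above. Rules and risk tiers are illustrative.
policy_rules = [
    {"topic": "investing", "request_type": "definition_request",
     "risk": "low", "decision": "allow"},
    {"topic": "lending", "request_type": "personalized_recommendation",
     "risk": "high", "decision": "block"},
]

def decide(topic: str, request_type: str) -> str:
    """Return the configured decision, defaulting to 'block' for unlisted pairs."""
    for rule in policy_rules:
        if rule["topic"] == topic and rule["request_type"] == request_type:
            return rule["decision"]
    return "block"  # conservative default chosen for this sketch

print(decide("lending", "personalized_recommendation"))  # -> block
```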
Now that we’ve reached a finalized guardrail definition, we can use DynamoGuard to generate our guardrail model.
When walking enterprises through the above process, we identified a key hurdle: coming up with a comprehensive, high-quality set of examples. Crowdsourcing examples from internal and external stakeholders often requires domain expertise, onboarding, and training, can take weeks, and may implicitly build bias into the dataset. To help enterprises solve these challenges and accelerate the guardrail development process, our Dynamo AI team developed new research techniques to synthetically generate high-quality, diverse, and comprehensive prompt examples.
After providing our risk taxonomy, DynamoGuard generates a set of synthetic data to train our model under the hood. The quality and comprehensiveness of this data is directly related to the performance of our guardrail.
A high-quality guardrail training dataset must contain enough examples to cover the domain we previously defined. This includes:
In aggregate, our synthetic data generation pipeline simulates actors on a full spectrum of malicious intent, ranging from naive questions that unintentionally break compliance policies to malicious attackers who are actively seeking to elicit harmful output from LLMs.
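As a rough sketch of how a persona-driven generation pipeline like this could be structured: the code below expands each persona and topic into a generation prompt and hands it to any text-generation function. The personas, prompt template, and stand-in generator are illustrative assumptions, not Dynamo AI's actual pipeline.

```python
# Sketch of a persona-driven synthetic data generator. Personas, the prompt
# template, and the stand-in llm() are illustrative assumptions.
from typing import Callable

PERSONAS = {
    "naive_user": "asks finance questions casually, without realizing they request advice",
    "informed_user": "asks precise, factual questions about finance topics",
    "adversarial_user": "deliberately phrases advice requests to slip past the guardrail",
}

PROMPT_TEMPLATE = (
    "You are simulating a user who {persona_desc}. "
    "Write one chat message to a banking assistant about the topic '{topic}'."
)

def build_generation_prompts(topics: list[str]) -> list[dict]:
    """Expand every (persona, topic) pair into a generation prompt."""
    prompts = []
    for persona, desc in PERSONAS.items():
        for topic in topics:
            prompts.append({
                "persona": persona,
                "topic": topic,
                "prompt": PROMPT_TEMPLATE.format(persona_desc=desc, topic=topic),
            })
    return prompts

def generate_examples(topics: list[str], llm: Callable[[str], str]) -> list[dict]:
    """Call any text-generation function on each prompt, keeping persona metadata."""
    return [{**p, "text": llm(p["prompt"])} for p in build_generation_prompts(topics)]

# Usage with a stand-in generator (replace the lambda with a real LLM call):
examples = generate_examples(
    ["investing", "lending"],
    llm=lambda prompt: f"<generated from: {prompt[:40]}...>",
)
print(len(examples))  # 3 personas x 2 topics = 6 examples
```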
Below, we share a sample of synthetically generated training examples from DynamoGuard based on our financial advice guardrail.
The final key piece in defining a guardrail is providing feedback. While DynamoGuard can create a well-performing guardrail based on an initial definition, it's critical to continue providing feedback to further validate and refine the guardrail to our particular use case.
It can be difficult to capture all the nuances of a guardrail in the initial definition. Furthermore, we may not realize that we want to guardrail against a certain type of user input or model response until we see it in real time.
To help with this, DynamoGuard has tools to provide feedback on each synthetically generated datapoint as well as real-time guardrail results on user inputs and model responses. The feedback is then used for model re-training to further polish the guardrail definition.
These feedback loops from human validation and automated red-teaming enable DynamoGuard to dynamically adapt to the latest threats and vulnerabilities to AI safety. Even after the guardrail is deployed, we enable teams to continually improve with live user data monitoring and review.
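One plausible way to represent this kind of feedback for retraining is sketched below: a record pairing the guardrail's decision with the reviewer's correction, plus a simple trigger for queuing a retraining run. The field names and threshold are assumptions for illustration, not DynamoGuard's internal schema.

```python
# Illustrative feedback record and retraining trigger.
# Field names and the threshold are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    text: str               # the user input or model response that was reviewed
    model_decision: str     # what the guardrail decided ("allow" / "block")
    reviewer_decision: str  # what the human reviewer says it should have been
    note: str = ""          # optional explanation used to refine the definition

def needs_retraining(records: list[FeedbackRecord], threshold: int = 50) -> bool:
    """Queue a retraining run once enough human corrections have accumulated."""
    corrections = [r for r in records if r.model_decision != r.reviewer_decision]
    return len(corrections) >= threshold
```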
Ready to enhance your content moderation system? Discover how our tailored solutions can address real-world risks and improve compliance. Schedule a free demo.