Product
Dec 8, 2024

AI Content Moderation: Defining Policies and Synthetic Data Gen (Part 2)

As AI use grows, so do the challenges in content moderation. Learn how Dynamo AI refines guardrails to incorporate nuanced risk

Today, we dive into Dynamo AI's approach to content moderation, starting with the first and most crucial step: defining what content should be moderated. (This is a follow-up to Part 1 on why today’s content moderation systems aren’t as effective as they claim.)

Defining clear guardrails for moderation can be complex. We explore these challenges below, and how our team helps address them by refining guardrail definitions, incorporating risk considerations, and using synthetic data for comprehensive model training.

The challenges of defining a guardrail

To develop a content moderation system, it is critical to systematically think about what content we want to moderate against. However, clearly defining this can be harder than it seems.

Let’s take the example guardrail: Block user requests for financial advice. 

For some user queries, this guardrail definition is enough to determine whether the query is compliant. For example, the definition successfully differentiates between the following two queries:

  • “Which mutual funds should I select for my retirement investment portfolio?” --> This query is clearly asking for financial advice on which mutual funds to invest in.
  • “How do I recover my password for my banking account?” --> This query is clearly not asking for financial advice; it is asking for help logging in to an account.

However, for other user queries, the guardrail definition doesn't make it clear whether the query is compliant. For example:

  • “What are the advantages of storing funds in a savings account instead of a checking account?” --> To determine whether or not this query violates our guardrail, we need to first answer questions such as: (1) Is asking for pros and cons the same as asking for advice?, and (2) Does personal banking fall into the domain of finance?
  • “Is it wise to refinance one’s mortgage when the Fed raises interest rates?” --> To determine whether or not this query violates our guardrail, we need to first answer questions such as: (1) Do economics and real estate fall into our domain of finance?, and (2) Does asking for general guidance count as asking for advice?

While the answer may be clear to us humans and content reviewers, a model won’t immediately understand how to deal with each of these nuances.

To ensure our content moderation model accurately reflects all the nuances of our intended guardrail, we need to first clearly define the domain and carve out our specific guardrail. 

Breaking down a guardrail definition

At Dynamo AI, once we have an initial guardrail definition, we use our models to identify the ambiguities in the guardrail description. For a successful, nuanced definition, each ambiguity must be identified and resolved.

For instance, for the financial advice guardrail, it’s unclear what falls into the domain of ‘finance.’ To clarify this, we can begin by defining the set of topics related to finance.

Depending on the end application, this may include subjects like investing, taxes, economics, lending, and financial regulations. For our out-of-the-box financial advice guardrail, we include 18 different topics that specify the types of concepts we consider to be related to finance.

In parallel, we clarify the meaning of the word ‘advice’ to help us better understand what types of user requests related to each topic should be considered advice. In our example above, this may include things like ‘asking for a definition’ or ‘requesting personalized recommendations.’ 

As we resolve the ambiguities in our definition, we get a more comprehensive guardrail description that covers the different types of user requests we will need to apply our guardrail against.
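One way to picture this breakdown is as a grid of topics and request types, where every cell needs an explicit allow or block decision. The snippet below is a minimal sketch of that idea, not DynamoGuard's implementation. The lists are abbreviated from the examples above, and the names `TOPICS`, `REQUEST_TYPES`, and `enumerate_cells` are hypothetical.

```python
from itertools import product

# Abbreviated, illustrative lists taken from the examples in the text.
# A production guardrail would enumerate many more topics and request types.
TOPICS = ["investing", "taxes", "economics", "lending", "financial regulations"]
REQUEST_TYPES = ["asking for a definition", "requesting personalized recommendations"]


def enumerate_cells(topics, request_types):
    """Return every (topic, request type) pair that needs an allow/block decision."""
    return list(product(topics, request_types))


cells = enumerate_cells(TOPICS, REQUEST_TYPES)
for topic, request_type in cells:
    # Each cell starts undecided; resolving it removes one ambiguity.
    print(f"{topic} | {request_type} -> undecided")
```

Listing the cells explicitly makes it harder to overlook a combination, such as a definition request on a lending topic, that the one-line guardrail never addressed.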

Incorporating risk considerations in guardrail definitions

When defining a guardrail, it is important to assess the operational risks associated with different types of user requests and model responses, particularly their impact on individuals and organizations. This can help further define what should be allowed or disallowed.

For instance, requesting definitions related to the topic of investing may be a lower-risk request, since this is a request for factual information and is unlikely to result in a compliance violation or consumer harm.

However, requests for personalized advice on lending and credit decisions may be considered higher risk, as they deal with the financial livelihood of a user, are subject to a number of laws and regulations, and may cause consumer harm if incorrect information is provided. This, in turn, may lead to significant reputational risk for an organization.

To gauge the operational risk of an AI system, one key component is to assess how the model will be used. The types of user personas and the context in which the guardrail is used affect both the severity and the types of risks surfaced by non-compliant outputs.

For example, a consumer-facing chatbot may have many more associated risks than an internal employee-facing chatbot. Similarly, an AI system providing financial data may have more associated risks than a system focused on providing information about the weather.
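As a rough illustration of how request type and deployment context can combine, the sketch below multiplies two ordinal scores. The numeric values and the names `REQUEST_RISK`, `AUDIENCE_RISK`, and `risk_score` are assumptions made up for this example. They are not a published scale or part of any Dynamo AI product.

```python
# Illustrative ordinal scores only; the numbers are assumptions, not a standard.
REQUEST_RISK = {
    "definition": 1,           # factual information, lower risk
    "personalized advice": 3,  # touches a user's financial livelihood
}
AUDIENCE_RISK = {
    "internal employee": 1,
    "consumer": 2,             # consumer-facing deployments carry more risk
}


def risk_score(request_type: str, audience: str) -> int:
    """Combine request-type risk and audience risk into one ordinal score."""
    return REQUEST_RISK[request_type] * AUDIENCE_RISK[audience]


for request_type in REQUEST_RISK:
    for audience in AUDIENCE_RISK:
        print(request_type, "/", audience, "=", risk_score(request_type, audience))
```

Even a coarse score like this can help a team agree on which combinations to block first and which to allow.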

Assessing the risks of our AI use case can further clarify our initial guardrail definition. Now, we can begin constructing a finalized guardrail definition: 

  1. Block user requests for financial advice
  2. Do not allow requests for best practices, general advice, or personalized advice on topics including: investing, corporate finance, lending and credit, real estate, investment banking, etc. 
  3. Allow requests for explanations of financial concepts and products, requests for factual information, requests for customer support information, etc.
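
The three rules above can be written down as a small structured policy. This is a hedged sketch rather than DynamoGuard's actual format: the `GuardrailPolicy` class and its field names are hypothetical, and it assumes that an upstream classifier or reviewer has already labeled each query with a topic and a request type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailPolicy:
    """Hypothetical container for a finalized guardrail definition."""
    name: str
    topics: frozenset
    blocked_request_types: frozenset
    allowed_request_types: frozenset

    def decide(self, topic: str, request_type: str) -> str:
        if topic not in self.topics:
            return "allow"   # outside the finance domain, so out of scope
        if request_type in self.blocked_request_types:
            return "block"
        if request_type in self.allowed_request_types:
            return "allow"
        return "review"      # not covered by the definition yet


policy = GuardrailPolicy(
    name="Block user requests for financial advice",
    topics=frozenset({"investing", "corporate finance", "lending and credit",
                      "real estate", "investment banking"}),
    blocked_request_types=frozenset({"best practices", "general advice",
                                     "personalized advice"}),
    allowed_request_types=frozenset({"concept explanation", "factual information",
                                     "customer support"}),
)

print(policy.decide("investing", "personalized advice"))  # block
print(policy.decide("investing", "concept explanation"))  # allow
```

The explicit "review" outcome is a design choice in this sketch: combinations the definition does not cover are surfaced for a human decision instead of being silently allowed or blocked.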

Now that we’ve reached a finalized guardrail definition, we can use DynamoGuard to generate our guardrail model.

Synthetic data generation

When walking enterprises through the above process, we identified a key hurdle: coming up with a comprehensive, high-quality set of examples. Crowdsourcing examples from internal and external stakeholders often requires expertise, onboarding and training, and weeks of effort, and it may implicitly build in bias. To help enterprises solve these challenges and accelerate the guardrail development process, our Dynamo AI team developed new research techniques to synthetically generate high-quality, diverse, and comprehensive prompt examples.

Once we provide our risk taxonomy, DynamoGuard generates a set of synthetic data to train our model under the hood. The quality and comprehensiveness of this data directly affect the performance of our guardrail.

A high-quality guardrail training dataset must contain enough examples to cover the domain we previously defined. This includes:

  1. On-topic examples that reflect each component of the guardrail 
  2. Diverse examples representing edge cases of user inputs and model responses 
  3. Grounded examples that represent realistic usage of the AI system, based on user-uploaded data
  4. Malicious examples reflecting different types of jailbreaking attacks that could break the guardrail
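
A simple way to sanity-check a training set against these four categories is to count examples per category and flag any that are missing. The sketch below assumes each example carries a `category` tag. That field, the category labels, and the `coverage_gaps` helper are hypothetical and are not DynamoGuard's data format.

```python
from collections import Counter

# Category labels mirror the four kinds of examples listed above.
REQUIRED_CATEGORIES = ("on-topic", "diverse", "grounded", "malicious")


def coverage_gaps(examples, min_per_category=1):
    """Return the required categories that have too few examples."""
    counts = Counter(example["category"] for example in examples)
    return [c for c in REQUIRED_CATEGORIES if counts[c] < min_per_category]


# Two queries from earlier in this post, tagged by hand for illustration.
examples = [
    {"text": "Which mutual funds should I select for my retirement investment portfolio?",
     "category": "on-topic"},
    {"text": "What are the advantages of storing funds in a savings account instead of a checking account?",
     "category": "diverse"},
]

print(coverage_gaps(examples))  # ['grounded', 'malicious']
```

A real pipeline would also need to check coverage within each category, for example across topics and attack types, which this count does not capture.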

In aggregate, our synthetic data generation pipeline simulates actors on a full spectrum of malicious intent, ranging from naive questions that unintentionally break compliance policies to malicious attackers who are actively seeking to elicit harmful output from LLMs.

Below, we share a sample of synthetically generated training examples from DynamoGuard based on our financial advice guardrail.

A human-in-the-loop flow for guardrail creation

The final key piece in defining a guardrail is providing feedback. While DynamoGuard can create a well-performing guardrail based on an initial definition, it's critical to continue providing feedback to further validate and refine the guardrail to our particular use case. 

It can be difficult to capture all the nuances of a guardrail in the initial definition. Furthermore, we may not realize that we want to guardrail against a certain type of user input or model response until we see it in real time.

To help with this, DynamoGuard has tools to provide feedback on each synthetically generated datapoint as well as real-time guardrail results on user inputs and model responses. The feedback is then used for model re-training to further polish the guardrail definition.
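
Conceptually, the feedback step amounts to letting reviewer labels override generated labels and adding newly reviewed live examples before retraining. The following is a minimal sketch under that assumption. The record layout and the `apply_feedback` function are hypothetical, and the labels shown are made up for illustration.

```python
def apply_feedback(dataset, feedback):
    """Return a new dataset where reviewer labels override existing labels.

    Feedback on texts not yet in the dataset (for example, live user inputs)
    is appended as new examples. The input lists are not modified.
    """
    corrections = {item["text"]: item["label"] for item in feedback}
    seen = {example["text"] for example in dataset}
    merged = [
        {**example, "label": corrections.get(example["text"], example["label"])}
        for example in dataset
    ]
    merged.extend(
        {"text": text, "label": label}
        for text, label in corrections.items()
        if text not in seen
    )
    return merged


dataset = [
    {"text": "Is it wise to refinance one's mortgage when the Fed raises interest rates?",
     "label": "allow"},  # illustrative generated label that a reviewer disagrees with
]
feedback = [
    {"text": "Is it wise to refinance one's mortgage when the Fed raises interest rates?",
     "label": "block"},
    {"text": "How do I recover my password for my banking account?",
     "label": "allow"},  # reviewed live input, new to the dataset
]

merged = apply_feedback(dataset, feedback)
print(len(merged), merged[0]["label"])
```

The merged set would then feed the next retraining run.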

These feedback loops from human validation and automated red-teaming enable DynamoGuard to dynamically adapt to the latest threats and vulnerabilities to AI safety. Even after the guardrail is deployed, we enable teams to continually improve with live user data monitoring and review.

Ready to enhance your content moderation system? Discover how our tailored solutions can address real-world risks and improve compliance. Schedule a free demo.