Product
Feb 20, 2025

Hallucinations Are Custom: Introducing Dynamo AI's LLM Failure Mode Taxonomy


As we’ve worked with large enterprises to productionize AI use cases, we’ve consistently found that today’s hallucination evaluations fail to capture the complexities of real-world LLM failure modes.

Most hallucination evals rely on a broad, one-size-fits-all definition of hallucinations, applying that definition across different industries and use cases. However, enterprises often tell us that the specific hallucinations they actually encounter in the real world are too nuanced and domain-specific to address effectively with a rigid, one-size-fits-all eval.

Our own research confirms this. In our experiments with judge models, we found that models trained on domain-specific data consistently outperform those without it, particularly when identifying unsupported statements within the same domain. In a recent study, we observed up to a 10% increase in F1 score for hallucination detection when using domain-aware models.
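
To make the measurement concrete, here is a minimal sketch of how an F1 comparison between two judge models can be computed from human-labeled examples. The label lists are illustrative placeholders, not data from our study.

```python
# Minimal sketch: scoring hallucination-detection judges against human labels.
# The label lists below are illustrative placeholders, not real benchmark data.

def f1_score(gold: list[int], predicted: list[int]) -> float:
    """F1 for the positive class (1 = hallucination, 0 = grounded)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical judge outputs on the same annotated set:
gold          = [1, 0, 1, 1, 0, 0, 1, 0]
generic_judge = [1, 0, 0, 1, 1, 0, 0, 0]
domain_judge  = [1, 0, 1, 1, 0, 0, 1, 1]

print(f"generic judge F1:      {f1_score(gold, generic_judge):.2f}")
print(f"domain-aware judge F1: {f1_score(gold, domain_judge):.2f}")
```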

Policy research also highlights the need for more refined hallucination evaluations. The Bank Policy Institute, in its commentary on AI use in Commodity Futures Trading Commission-regulated markets, emphasizes the importance of tracing how AI models process inputs into outputs. It advocates for both model evaluation, which assesses overall LLM performance, and system evaluation, which focuses on an LLM's effectiveness within specific use cases.

Despite these needs, the industry still lacks benchmarks or standardized hallucination taxonomies, making hallucinations an unsolved challenge in AI today. Not all hallucinations are the same, and their evaluations must capture these nuances to accurately reflect real-world performance and risk.

Let's demonstrate this with an example:

Imagine a user finds an incorrect fraud alert on a valid purchase and wants to dispute it, a very common scenario in the financial services sector. Let's assume the responses are generated by a RAG system, and for simplicity, let's assume all the information required to answer the question correctly is available in the retrieved context.

Context
To dispute your fraud alert, you can contact Mybank. You can dispute online, by mail, or by phone. Mybank will have 30 days to investigate a dispute.

Online

  • You can file a dispute online with mybank.com.
  • mybank.com has its own online interface.

By mail 

  • Download the dispute form from Mybank.
  • Include copies of documents that support your dispute.
  • Mail your dispute letter by certified mail and pay for a “return receipt”.

By phone

  • You can call Mybank to dispute information.

Human:
I found an incorrect fraud alert on a valid purchase. How do I dispute it?

LLM:
I'm sorry, I do not have the required context to answer that question.

The example above is a refusal, despite the LLM actually having the context required to answer the question. Many enterprises we work with define these invalid refusals as hallucinations. Invalid refusals do not disseminate misinformation, but they can erode user trust in the AI system over time.
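
One lightweight way to surface invalid refusals, sketched below as an illustrative heuristic rather than our production judge, is to flag refusal-style responses that appear even though the retrieved context covers the question.

```python
import re

# Illustrative heuristic: flag "invalid refusals" -- refusal-style answers
# produced even though the retriever returned context that covers the question.
REFUSAL_PATTERNS = [
    r"i (do not|don't) have (the|enough) (required )?context",
    r"i('m| am) sorry,? (but )?i (cannot|can't) answer",
    r"i (cannot|can't) help with that",
]

def looks_like_refusal(response: str) -> bool:
    text = response.lower()
    return any(re.search(pattern, text) for pattern in REFUSAL_PATTERNS)

def is_invalid_refusal(response: str, retrieved_context: str, question_keywords: list[str]) -> bool:
    """Refusal despite context that mentions the key entities in the question."""
    context = retrieved_context.lower()
    context_covers_question = all(keyword.lower() in context for keyword in question_keywords)
    return looks_like_refusal(response) and context_covers_question

context = "To dispute your fraud alert, you can contact Mybank online, by mail, or by phone."
response = "I'm sorry, I do not have the required context to answer that question."
print(is_invalid_refusal(response, context, ["dispute", "fraud alert"]))  # True
```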

Human:
I found an incorrect fraud alert on a valid purchase. How do I dispute it?

LLM:
To dispute an account, you’ll need to file a formal written dispute with Mybank, either online, by mail or phone. Once submitted, the bank is required to investigate the dispute within 30 days. However, if the account is more than 10 years old, the bank must remove it immediately without you needing to do anything. If the account is under 10 years old, the dispute investigation may take longer, possibly up to 60 days.

The example above contains statements that directly contradict each other: the first says there is a legal requirement to investigate within 30 days, while the second says the investigation may take up to 60 days. An attentive reader may catch this, but it is still very misleading. These types of hallucinations lead to experiences where the AI system is not helpful at all.
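
Self-contradictions of this kind can be checked by scoring pairs of generated statements against each other. The sketch below uses an off-the-shelf natural language inference model from the open-source transformers library as one possible approach; it is not the judge model we ship.

```python
# Sketch of pairwise contradiction checking with an off-the-shelf NLI model.
# Assumes the `transformers` library and the public `roberta-large-mnli` checkpoint.
from itertools import combinations

from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

sentences = [
    "Once submitted, the bank is required to investigate the dispute within 30 days.",
    "The dispute investigation may take longer, possibly up to 60 days.",
]

for premise, hypothesis in combinations(sentences, 2):
    # Passing a list of {"text", "text_pair"} dicts returns one result per pair.
    result = nli([{"text": premise, "text_pair": hypothesis}])[0]
    if result["label"] == "CONTRADICTION":
        print(f"Possible self-contradiction:\n  {premise}\n  {hypothesis}")
```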

Human:  
I found an incorrect fraud alert on a valid purchase. How do I dispute it?

LLM:
To dispute an account, you’ll need to file a formal written dispute with Mybank, either online, by mail or phone. If the account is found to be inaccurate, it will be removed from your report. However, keep in mind that accounts older than 5 years will automatically drop off your report, even without a dispute, as per the "Fraud Prevention Act."

In this example, on the other hand, the LLM provides a completely inaccurate fact about accounts older than 5 years automatically dropping off your report, referencing a law that does not exist. These types of hallucinations can have very serious real-world consequences, both for end users and for enterprises.

These three simple examples demonstrate the following:

  1. While each is considered a hallucination, their severities differ greatly from each other (a point the taxonomy sketch after this list makes explicit). Evaluation systems today miss this nuance, and once that nuance is lost, evaluations are no longer representative of the real-world consequences of deploying AI systems.
  2. Detecting different types of hallucinations effectively requires different techniques and judge models.
  3. The root causes behind different types of hallucinations, and therefore the remediations for them, are often very different from each other.
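
To make the first point concrete, the sketch below encodes each failure mode as its own label with its own severity weight. The type names, weights, and record structure are illustrative assumptions, not Dynamo AI's official taxonomy.

```python
# Illustrative sketch of a hallucination taxonomy with per-type severity.
# Names, severities, and the record structure are assumptions for this example,
# not Dynamo AI's official taxonomy.
from dataclasses import dataclass
from enum import Enum

class HallucinationType(Enum):
    INVALID_REFUSAL = "invalid_refusal"        # answer was in context, model refused anyway
    SELF_CONTRADICTION = "self_contradiction"  # response contradicts itself
    UNSUPPORTED_FACT = "unsupported_fact"      # claim not grounded in context (e.g. an invented law)

# Relative severity weights; real deployments would calibrate these per use case.
SEVERITY = {
    HallucinationType.INVALID_REFUSAL: 1,      # erodes trust, but spreads no misinformation
    HallucinationType.SELF_CONTRADICTION: 2,   # misleading, makes the system unhelpful
    HallucinationType.UNSUPPORTED_FACT: 3,     # can cause real-world harm
}

@dataclass
class Finding:
    hallucination_type: HallucinationType
    span: str  # the offending text in the response

def risk_score(findings: list[Finding]) -> int:
    """Aggregate severity across all findings for one response."""
    return sum(SEVERITY[finding.hallucination_type] for finding in findings)

findings = [Finding(HallucinationType.UNSUPPORTED_FACT, 'the "Fraud Prevention Act"')]
print(risk_score(findings))  # 3
```

Scoring evaluations with per-type severities keeps aggregate metrics tied to real-world risk, rather than collapsing everything into a single undifferentiated hallucination rate.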

At Dynamo AI, we've been deeply exploring this challenge and are excited to introduce a new solution that empowers customers to evaluate and safeguard their AI systems with highly nuanced hallucination criteria.

Our platform offers a flexible approach, allowing users to apply a set of pre-built hallucination criteria or define their own—using nothing but natural language. Once a custom criterion is defined, it seamlessly integrates into evaluations and guardrails, functioning just like our out-of-the-box policies to ensure AI reliability at scale.
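
As a rough illustration of the idea (and not Dynamo AI's actual SDK), a natural-language criterion can be compiled into an LLM-judge prompt whose verdict drives both evaluations and guardrails:

```python
# Generic illustration (not Dynamo AI's SDK) of turning a natural-language
# criterion into an LLM-judge prompt that returns a pass/fail verdict.
JUDGE_TEMPLATE = """You are evaluating an AI assistant's response.

Criterion: {criterion}

Retrieved context:
{context}

User question:
{question}

Assistant response:
{response}

Does the response violate the criterion? Answer VIOLATION or PASS, then explain briefly."""

custom_criterion = (
    "Flag any response that refuses to answer even though the retrieved "
    "context contains the information needed to answer the question."
)

prompt = JUDGE_TEMPLATE.format(
    criterion=custom_criterion,
    context="To dispute your fraud alert, you can contact Mybank online, by mail, or by phone.",
    question="I found an incorrect fraud alert on a valid purchase. How do I dispute it?",
    response="I'm sorry, I do not have the required context to answer that question.",
)
# `prompt` is then sent to a judge model; checking whether its answer starts
# with VIOLATION yields a binary signal usable in both evals and guardrails.
print(prompt)
```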

DynamoEval

Enables customers to stress-test their models and understand system quality and risk in the pre-deployment phase. These results feed an iterative process in which teams apply remediations based on the root causes of detected hallucinations, then evaluate again until preset quality and risk criteria are met.
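
In rough pseudocode, that loop looks like the sketch below; `evaluate_suite`, `remediate`, and the thresholds are hypothetical placeholders rather than DynamoEval's actual interface.

```python
# Rough sketch of the pre-deployment evaluate-remediate loop.
# `evaluate_suite`, `remediate`, and the thresholds are hypothetical placeholders,
# not DynamoEval's actual interface.
from typing import Callable

THRESHOLDS = {"invalid_refusal": 0.02, "self_contradiction": 0.01, "unsupported_fact": 0.005}
MAX_ITERATIONS = 5

def meets_criteria(rates: dict[str, float]) -> bool:
    """True when every per-type hallucination rate is within its preset threshold."""
    return all(rates[name] <= limit for name, limit in THRESHOLDS.items())

def evaluation_loop(
    system,
    test_set,
    evaluate_suite: Callable,  # returns per-type hallucination rates for `system` on `test_set`
    remediate: Callable,       # applies fixes (prompting, retrieval, fine-tuning) driven by root causes
):
    for _ in range(MAX_ITERATIONS):
        rates = evaluate_suite(system, test_set)
        if meets_criteria(rates):
            return system      # quality and risk criteria met; ready to deploy
        system = remediate(system, rates)
    raise RuntimeError("Quality/risk criteria not met within the iteration budget")
```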

DynamoGuard

Enables customers to guardrail their models once deployed to production and moderate incoming and outgoing messages. Customers can monitor performance and risk in real time, and continuously align the hallucination judge models on production data with humans in the loop.
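
Conceptually, a guardrail of this kind sits between the application and the model. The sketch below shows a generic middleware pattern under that assumption; it is not DynamoGuard's actual API.

```python
# Generic guardrail middleware pattern (not DynamoGuard's actual API):
# screen each response against hallucination judges before it reaches the user.
from typing import Callable

FALLBACK = (
    "I couldn't verify that answer against the available documents. "
    "Let me connect you with a specialist."
)

def guarded_chat(
    question: str,
    retrieve: Callable[[str], str],                 # retriever for the RAG system
    generate: Callable[[str, str], str],            # LLM call: (question, context) -> response
    judges: list[Callable[[str, str, str], bool]],  # each returns True on a criterion violation
) -> str:
    context = retrieve(question)
    response = generate(question, context)
    for judge in judges:
        if judge(question, context, response):
            # In production, the violation would also be logged for human review
            # and for realigning the judge models on real traffic.
            return FALLBACK
    return response
```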

More about Dynamo AI:

At Dynamo AI, we are committed to helping organizations measure and manage hallucinations effectively. Our comprehensive custom hallucinations offering provides deep insights into model performance and enables teams to guardrail their AI systems in production.

We also offer a range of AI privacy and security solutions to help you build trustworthy and responsible AI systems. To learn more about how Dynamo AI can help you evaluate and mitigate hallucinations in your AI systems, or to explore our AI privacy and security offerings, please request a demo.