As we’ve worked with large enterprises to productionize AI use cases, we’ve consistently found that today’s hallucination evaluations fail to capture the complexities of real-world LLM failure modes.
Most hallucination evals rely on a broad, one-size-fits-all definition of hallucinations, applying that definition across different industries and use cases. However, enterprises often tell us that the specific hallucinations they actually encounter in the real world are too nuanced and domain-specific to address effectively with such a rigid eval.
Our own research confirms this. In our experiments with judge models, we found that models trained on domain-specific data consistently outperform those without it, particularly when identifying unsupported statements within the same domain. In a recent study, we observed up to a 10% increase in hallucination-detection F1 score when using domain-aware models.
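For readers less familiar with the metric, here is a minimal sketch of how F1 is computed for a binary hallucination-detection task. The labels and predictions below are made up purely for illustration and do not come from our study.

# Minimal sketch: precision, recall, and F1 for a binary
# hallucination-detection task (1 = hallucinated, 0 = supported by context).
from sklearn.metrics import precision_recall_fscore_support

gold        = [1, 0, 1, 1, 0, 0, 1, 0]   # illustrative human-annotated labels
predictions = [1, 0, 0, 1, 0, 1, 1, 0]   # illustrative judge-model verdicts

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, predictions, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")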
Policy research also highlights the need for more refined hallucination evaluations. The Bank Policy Institute, in its commentary on AI use in Commodity Futures Trading Commission-regulated markets, emphasizes the importance of tracing how AI models process inputs into outputs. They advocate for both model evaluation, which assesses overall LLM performance, and system evaluation, which focuses on an LLM’s effectiveness within specific use cases.
Despite these needs, the industry still lacks benchmarks or standardized hallucination taxonomies, making hallucinations an unsolved challenge in AI today. Not all hallucinations are the same, and their evaluations must capture these nuances to accurately reflect real-world performance and risk.
Imagine a user finds an incorrect fraud alert on a valid purchase and wants to dispute it, a very common scenario in the financial services sector. Assume the responses below are generated by a RAG system and, for simplicity, that all the information required to answer the question correctly is available in the retrieved context.
Context
To dispute your fraud alert, you can contact Mybank. You can dispute online, by mail, or by phone. Mybank will have 30 days to investigate a dispute.
Online
By mail
By phone
Human:
I found an incorrect fraud alert on a valid purchase. How do I dispute it?
LLM:
I'm sorry, I do not have the required context to answer that question.
The example above is a refusal, even though the LLM has the context required to answer the question. Many enterprises we work with define these invalid refusals as hallucinations. Invalid refusals do not disseminate misinformation, but they erode user trust in the AI system over time.
Human:
I found an incorrect fraud alert on a valid purchase. How do I dispute it?
LLM:
To dispute an account, you’ll need to file a formal written dispute with the Mybank, either online, by mail or phone. Once submitted, the bank is required to investigate the dispute within 30 days. However, if the account is more than 10 years old, the bank must remove it immediately without you needing to do anything. If the account is under 10 years old, the dispute investigation may take longer, possibly up to 60 days.
The example above contains statements that directly contradict each other: the first says there is a legal requirement to investigate within 30 days, while the second says the investigation may take up to 60 days. An attentive reader may catch the inconsistency, but it is still misleading. These types of hallucinations lead to experiences where the AI system is not helpful at all.
Human:
I found an incorrect fraud alert on a valid purchase. How do I dispute it?
LLM:
To dispute an account, you’ll need to file a formal written dispute with Mybank, either online, by mail or phone. If the account is found to be inaccurate, it will be removed from your report. However, keep in mind that accounts older than 5 years will automatically drop off your report, even without a dispute, as per the "Fraud Prevention Act."
In this example, on the other hand, the LLM fabricates a fact about accounts older than 5 years automatically dropping off the report, citing a law that does not exist. These types of hallucinations can have serious real-world consequences for both end users and enterprises.
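To make the taxonomy above concrete, here is a minimal sketch of an LLM-as-judge check that classifies a response into these failure modes. The prompt wording, label set, and the call_llm helper are illustrative assumptions, not Dynamo AI's production judge.

# Sketch of an LLM-as-judge classifier for the failure modes discussed above.
# call_llm is a stand-in for whatever judge model an evaluation pipeline uses.

JUDGE_PROMPT = """You are auditing a RAG system.
Context:
{context}

Question:
{question}

Response:
{response}

Classify the response as exactly one of:
- INVALID_REFUSAL: the response refuses even though the context answers the question
- CONTRADICTION: the response contains statements that contradict each other or the context
- FABRICATION: the response asserts facts, figures, or laws not supported by the context
- FAITHFUL: the response is fully supported by the context

Answer with the label only."""

def classify_response(context: str, question: str, response: str, call_llm) -> str:
    # call_llm sends a prompt to the judge model and returns its text completion.
    verdict = call_llm(JUDGE_PROMPT.format(
        context=context, question=question, response=response
    ))
    return verdict.strip().upper()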
At Dynamo AI, we've been deeply exploring this challenge and are excited to introduce a new solution that empowers customers to evaluate and safeguard their AI systems with highly nuanced hallucination criteria.
Our platform offers a flexible approach, allowing users to apply a set of pre-built hallucination criteria or define their own—using nothing but natural language. Once a custom criterion is defined, it seamlessly integrates into evaluations and guardrails, functioning just like our out-of-the-box policies to ensure AI reliability at scale.
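As a rough illustration of what a natural-language criterion might look like, a team could express the failure modes from the earlier dispute examples along these lines. The field names below are hypothetical and do not reflect Dynamo AI's actual schema; the point is that each criterion is just plain language.

# Hypothetical example of custom hallucination criteria defined in natural
# language; field names are illustrative, not Dynamo AI's actual API.
custom_criteria = [
    {
        "name": "invalid_refusal",
        "definition": "The response refuses to answer even though the "
                      "retrieved context contains the information needed.",
        "severity": "medium",
    },
    {
        "name": "nonexistent_regulation",
        "definition": "The response cites a law, regulation, or policy "
                      "that does not appear in the retrieved context.",
        "severity": "high",
    },
]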
Evaluations enable customers to stress test their models and understand system quality and risk in the pre-deployment phase. Teams use these results in an iterative process: they remediate the root causes of detected hallucinations, then evaluate again until preset quality and risk criteria are met.
Guardrails enable customers to moderate incoming and outgoing messages once models are deployed to production. Customers can monitor performance and risk in real time and continuously align the hallucination judge models to production data with humans in the loop, as sketched below.
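For intuition, here is a minimal sketch of where such a guardrail sits in a production request path. The retrieve, generate, and check_hallucination callables are stand-ins for whatever retrieval, generation, and judge components a given deployment uses; this is not a definitive implementation.

# Sketch: gating a RAG response on a hallucination verdict before it reaches
# the user. The "FAITHFUL" label matches the judge sketch shown earlier.

def answer_with_guardrail(question, retrieve, generate, check_hallucination):
    # Retrieve supporting documents and draft a candidate answer.
    context = retrieve(question)
    draft = generate(question, context)
    # Grade the draft against the retrieved context.
    verdict = check_hallucination(context, question, draft)
    if verdict == "FAITHFUL":
        return draft
    # Block or soften unsupported answers instead of sending them to the user.
    return ("I couldn't verify that answer against the available records. "
            "Please contact support for help with this request.")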
At Dynamo AI, we are committed to helping organizations measure and manage hallucinations effectively. Our comprehensive custom hallucination offering provides deep insights into model performance and enables teams to guardrail their AI systems in production.
We also offer a range of AI privacy and security solutions to help you build trustworthy and responsible AI systems. To learn more about how Dynamo AI can help you evaluate and mitigate hallucinations in your AI systems, or to explore our AI privacy and security offerings, please request a demo.