AI and Testing: Faithfulness

In the previous post we looked at the Answer Relevancy metric with DeepEval and got some intuitions in place about how to start thinking about using metrics in general. In this post, we’ll look at a second metric that requires no faith but is all about being faithful.

We laid a lot of the groundwork in the previous post in terms of how to construct scripts that will run via DeepEval using one of the built-in metrics. Here we’re going to look at faithfulness, and I’m going to follow largely the same pattern as I did in the previous post just to lock in some of the overall lessons.

Staying True to the Source

Imagine asking a Game of Thrones fan about events in the books, and they confidently tell you details that sound perfectly plausible because they are canon, just not book canon. They’re mixing in scenes from the HBO series, maybe even details from the prequel series House of the Dragon. Note that this fan is not making things up, nor are they confused. The information they are giving is internally consistent and fits the world. It’s just drawn from the wrong source material.

The HBO show’s version of events might be perfectly valid storytelling, but if you asked specifically about the books, citing the show is a faithfulness failure even though it’s not technically “wrong.”

LLMs face this exact challenge. When they’re given source material (whether that’s retrieved documents, uploaded files, or context you’ve provided), they need to base their answers on what’s actually there, not on what their training data suggests should be there. An LLM might generate an answer that sounds authoritative and perfectly reasonable, but if it’s adding claims that aren’t supported by the source material, even if those claims are “true” in some broader sense, it’s failing the faithfulness test.

This is particularly critical in RAG systems, where the whole point is to ground the LLM’s responses in specific, retrieved information. A faithful answer means the LLM is acting like a careful scholar citing only what’s in the texts before it, rather than an overconfident student padding their essay with educated guesses.

Let’s jump right into an example and start with this code:

Fundamentally, except for the verbosity, this is very similar to the script we ended up with in the previous post for Answer Relevancy. Here, however, we’re using the Faithfulness metric. As before, we have a question. That question is provided as the input to the test cases. We also have the actual output that, again, we specify.

Remember from the last post: when we do this, we’re creating control cases so we can understand how the metric works.

One thing you might be wondering about: what is this “Jeff Nyman’s warp drive paper” that’s being referred to? Well, I do actually have such a paper and we will use that paper later. For now, here is a PDF of that paper and this is what the model would eventually need to read. I recommend downloading that file and putting it in the same directory as your project files although, again, please note that for this current script, the paper is not actually being referenced at all. We’ll get to that in a bit.

I’m using one of my own papers simply because that eases seeking permission for use. In this case, I only have to ask me! I intend no hubris in using my own work here. In fact, this paper is a bit embarrassing. It was candidate accepted but never actually got fully published. The only way I got published in this context was as a co-author, among many, on this paper.

One major difference in our code from the Answer Relevancy example is the retrieval_context, which we populate with another variable that I’ve called research. This is the key to understanding faithfulness: the retrieval context represents the “source material” or “ground truth” that you’ve provided to your system. In a real RAG application, this would be the document chunks that were retrieved based on the user’s question. Here, we’re manually providing excerpts from the paper.

Think of it this way: the input is what the user asks, the actual_output is what your LLM responds with, and the retrieval_context is what you gave the LLM to base its answer on (or what it found on its own). Faithfulness checks whether the actual_output stays true to the retrieval_context. The question in the input sets the topic, but it’s the relationship between output and context that faithfulness evaluates.
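To make those three fields concrete, here is a minimal sketch of how a pair of control cases might be assembled. Treat it as an illustration under assumptions rather than the exact script: the question wording, the two outputs, and the excerpt are paraphrased from the verbose logs in this post, `run_faithfulness_check` and `judge_model` are names introduced here, and the threshold of 0.7 is arbitrary. The DeepEval imports sit inside the function so the control data can be inspected without DeepEval installed.

```python
# Hypothetical control data, paraphrased from the verbose logs in this post.
QUESTION = (
    "What does Jeff Nyman's warp drive paper say about the energy "
    "needed to generate a warp bubble?"
)

# Stand-in for the "research" excerpt that serves as the retrieval context.
RESEARCH = [
    "Jeff Nyman's paper calculates the energy requirements for "
    "faster-than-light travel. It proposes that an advanced civilization "
    "would use matter/antimatter annihilation as the most efficient method "
    "to produce energy. A warp bubble would require approximately 10^28 kg "
    "of antimatter, roughly the mass-energy of Jupiter. The requirement "
    "would be significantly reduced by using a thin shell of modified "
    "space-time instead of a bubble encompassing the volume of the craft."
]

VALID_OUTPUT = (
    "Nyman's paper suggests the most efficient energy production method is "
    "matter/antimatter annihilation. Generating a warp bubble would require "
    "approximately 10^28 kg of antimatter, roughly equivalent to Jupiter's "
    "mass-energy. A thin-shell configuration could significantly decrease "
    "that requirement."
)

INVALID_OUTPUT = (
    "Nyman's paper proposes using zero-point energy extraction from the "
    "quantum vacuum to power the warp drive. The paper calculates this would "
    "require harnessing the Casimir effect across a surface area roughly "
    "equivalent to Earth's diameter."
)


def run_faithfulness_check(judge_model=None):
    """Score both control outputs against RESEARCH with DeepEval."""
    from deepeval import evaluate
    from deepeval.metrics import FaithfulnessMetric
    from deepeval.test_case import LLMTestCase

    metric = FaithfulnessMetric(
        threshold=0.7,       # arbitrary pass mark for this sketch
        model=judge_model,   # None uses whichever judge DeepEval is configured with
        include_reason=True,
        verbose_mode=True,   # prints the truths, claims, and verdicts
    )
    cases = [
        LLMTestCase(
            input=QUESTION,
            actual_output=output,
            retrieval_context=RESEARCH,
        )
        for output in (VALID_OUTPUT, INVALID_OUTPUT)
    ]
    return evaluate(test_cases=cases, metrics=[metric])
```

Calling `run_faithfulness_check()` requires DeepEval and a configured judge model. Nothing in it reads the PDF: the only source material the metric sees is the `RESEARCH` list.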

Let’s run this script and see what we get. For the valid test case, you will likely get something like this:


**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "Jeff Nyman's paper calculates the energy requirements for faster-than-light travel.",
    "The paper proposes that an advanced civilization would use matter/antimatter annihilation as the most efficient method to produce energy.",
    "According to the calculations in the paper, a warp bubble would require approximately 10^28 Kg of antimatter.",
    "This amount of antimatter is roughly equivalent to the mass-energy of the planet Jupiter.",
    "The energy requirement for the warp drive would be significantly reduced if using a thin-shell of modified space-time instead of a bubble encompassing the volume of the craft."
]

Claims:
[
    "Nyman's paper suggests the most efficient energy production method is matter/antimatter annihilation.",
    "Generating a warp bubble requires approximately 10^28 kg of antimatter according to Nyman's paper.",
    "The amount of antimatter required for generating a warp bubble would be roughly equivalent to Jupiter's mass-energy based on Nyman's paper.",
    "Using a thin-shell configuration rather than a full volume bubble could significantly decrease the requirement of antimatter according to Nyman's paper."
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "yes",
        "reason": null
    }
]

Score: 1.0
Reason: The score is 1.00 because there are no contradictions in the 'actual output' as indicated by the empty list of contradictions.

The faithfulness metric gave this response a perfect score of 1.0. But what does that actually mean? Faithfulness measures whether the generated answer stays true to the information in the retrieval context. Essentially, “Did the AI make anything up, or did it stick to what it was given?”

The metric extracted five key truths from my retrieval context (the passage from the paper about energy requirements) and four claims from the generated answer. Then it checked each claim against the available truths:

  • Claim 1: “Matter/antimatter annihilation is the most efficient method” ✔ Supported
  • Claim 2: “Requires ~10^28 kg of antimatter” ✔ Supported
  • Claim 3: “Equivalent to Jupiter’s mass-energy” ✔ Supported
  • Claim 4: “Thin-shell configuration reduces requirements” ✔ Supported

Since all four claims had a “yes” verdict, the score is 4/4 = 1.0.

It’s worth calling out that a perfect faithfulness score doesn’t mean the answer is comprehensive or well-written. It just means there are no overt hallucinations. Every factual assertion in the output can be traced back to something explicitly stated in the retrieval context. This is crucial for RAG systems where accuracy matters more than eloquence. You, presumably, would rather have a faithful-but-dry answer than a beautifully written one that invents details the source material never mentioned.

The invalid test case likely gave you something like this:


**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "Jeff Nyman's paper calculates the energy requirements for faster-than-light travel.",
    "The paper proposes that an advanced civilization would use matter/antimatter annihilation as the most efficient method to produce energy.",
    "According to the calculations in the paper, a warp bubble would require approximately 10^28 Kg of antimatter.",
    "This amount of antimatter is roughly equivalent to the mass-energy of the planet Jupiter.",
    "The energy requirement for the warp drive would be significantly reduced if using a thin-shell of modified space-time instead of a bubble encompassing the volume of the craft."
]

Claims:
[
    "Nyman's paper proposes using zero-point energy extraction from the quantum vacuum to power the warp drive.",
    "The paper calculates this would require harnessing the Casimir effect across a surface area roughly equivalent to Earth's diameter."
]

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The retrieval context states that Jeff Nyman's paper proposes using matter/antimatter annihilation as the most efficient method, not zero-point energy extraction from the quantum vacuum."
    },
    {
        "verdict": "idk",
        "reason": "There is no information in the retrieval context about harnessing the Casimir effect for powering the warp drive. The context does not provide enough evidence to confirm or deny this claim."
    }
]

Score: 0.5
Reason: The score is 0.50 because the actual output incorrectly mentions zero-point energy extraction from the quantum vacuum as the proposed method by Jeff Nyman's paper, contradicting the retrieval context which states that matter/antimatter annihilation is proposed.

This is what happens when the generated answer strays from the source material. This response scored 0.5 out of 1.0, meaning half of what it claimed was unfaithful to the retrieval context. The metric extracted the same five truths from the retrieval context, but this time found only two claims in the generated answer. Here’s how they were evaluated:

  • Claim 1: “Proposes using zero-point energy extraction from the quantum vacuum” ✘ Contradicted
  • Claim 2: “Requires harnessing the Casimir effect across Earth’s diameter” ❓ Unknown

The reason for the first verdict is that the paper explicitly states matter/antimatter annihilation is the proposed method, not zero-point energy. The reason for the second verdict is the retrieval context doesn’t mention the Casimir effect or Earth’s diameter at all.

The score calculation: the “no” verdict counts against the output but the “idk” verdict does not, so 1 of the 2 claims is counted as faithful and the score is 1/2 = 0.5.

Notice the metric didn’t just catch a factual error. It caught two different types of problems. One is direct contradiction: the answer claims something that conflicts with what the source says. The other is unsupported invention: the answer introduces details that simply don’t exist in the source.

The “idk” verdict is particularly interesting. The metric is sophisticated enough to say “I can’t verify this” rather than wrongly calling it true or false, which we saw in the relevancy context as well. In a production RAG system, when considering faithfulness, an “idk” should probably be treated similarly to a contradiction. I say this because both indicate the model went beyond its source material. Keep in mind that the score above did not do that for us: it was as generous to the “idk” verdict as answer relevancy was, so the stricter reading is one you have to apply yourself when you look at the verdicts.
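The arithmetic can be checked with a few lines of Python. This is a sketch of one tally that reproduces the scores in the logs, where only a “no” verdict counts against the output. It is inferred from the logged verdicts and scores, not copied from DeepEval’s source, and both `faithfulness_score` and its `penalize_idk` switch are names made up for illustration. The switch shows what the stricter production reading of “idk” would do to the same verdicts.

```python
def faithfulness_score(verdicts, penalize_idk=False):
    """Fraction of claims counted as faithful.

    By default only "no" counts against the output. With penalize_idk=True,
    "idk" is treated like a contradiction as well.
    """
    if not verdicts:
        return 1.0  # assumption: no claims means nothing to contradict
    failing = {"no", "idk"} if penalize_idk else {"no"}
    faithful = sum(1 for v in verdicts if v.strip().lower() not in failing)
    return faithful / len(verdicts)


valid_case = ["yes", "yes", "yes", "yes"]  # verdicts from the valid log
invalid_case = ["no", "idk"]               # verdicts from the invalid log

print(faithfulness_score(valid_case))                       # 1.0
print(faithfulness_score(invalid_case))                     # 0.5
print(faithfulness_score(invalid_case, penalize_idk=True))  # 0.0
```

Under the strict reading the invalid case drops from 0.5 to 0.0, which is arguably the more honest number for an answer in which nothing could be traced to the source.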

This is exactly the kind of hallucination that makes RAG systems unreliable if left unchecked. Faithfulness metrics help us catch these issues before they reach users.

Realistic RAG Example

That’s our controlled example. Let’s now have our test read the actual document. To do that, we’re going to have to build some minimal RAG functionality, which means we need to get a few things.

You likely already have langchain_community if you followed through with the LangChain posts, but the others are new. Remember to install these when you are in your virtual environment.

  python -m pip install langchain_community
  python -m pip install langchain_text_splitters
  python -m pip install pypdf
  python -m pip install chromadb

You will also need to pull another model:

  ollama pull nomic-embed-text

The nomic-embed-text model is a high-performance, open-source text encoder designed to convert text into numerical representations called embeddings. These embeddings allow computers to understand the semantic meaning and relationships within text rather than just matching keywords.

I get more in depth on that broader topic in my Text Classification series, none of which is needed to understand this series.

We don’t have to dig too deep into what this model does but, just to clarify the above a little bit, imagine you’re looking for specific information in a massive PDF, like a two hundred page manual. Instead of just doing a “Ctrl+F” search for a specific word, this model “reads” and memorizes the underlying concepts of every paragraph. This means if you search for “how to fix a flat tire,” it can find the relevant section even if the PDF uses the phrase “replacing a punctured tire.” It essentially acts as a super-powered indexer that understands the point of what’s written, making it possible for an AI to quickly grab the right context from your documents to answer your questions accurately.
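For a feel of what “finding semantically similar content” means mechanically, here is a toy sketch. The three-dimensional vectors are invented for illustration (real embeddings from a model like nomic-embed-text have hundreds of dimensions and come from the model, not from hand-typed numbers), but the comparison step, cosine similarity, is a common way to rank passages against a query.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three manual sections.
passages = {
    "replacing a punctured tire": [0.9, 0.1, 0.0],
    "adjusting the side mirrors": [0.1, 0.9, 0.1],
    "warranty and returns": [0.0, 0.2, 0.9],
}

# Made-up vector standing in for the query "how to fix a flat tire".
query_vector = [0.8, 0.2, 0.1]

ranked = sorted(
    passages,
    key=lambda name: cosine_similarity(query_vector, passages[name]),
    reverse=True,
)
print(ranked[0])  # replacing a punctured tire
```

The query and the top passage share no keywords in this toy setup. They match only because their vectors point in nearly the same direction, which is the job the embedding model does for real text.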

To get this example going, I’m just going to give you the full code, which is essentially about the simplest RAG application I could write around some of our previous code:

Congratulations! You just wrote a mini-RAG application. I put a few comments in there just to help situate the logic. With this code, we’re connecting our test to a real Retrieval-Augmented Generation (RAG) system, albeit a very, very (very!) simple one that we wrote as part of the test. Instead of manually providing the retrieval context, we’ll let the system automatically find and retrieve relevant passages from my paper.

Notice here how this is a tester stepping out a bit to act as a developer. I talked a little about that when I discussed testers acting like developers.

When you build a RAG system, you’re essentially creating a “smart search” over your documents. Granted, the RAG functionality isn’t the point of this post but let’s at least break down what’s happening here so the logic we’re using isn’t entirely opaque.

  • Document Preparation: The PDF is loaded and split into manageable chunks (around one thousand characters each, with some overlap to preserve context across boundaries). Think of this like breaking a book into passages that each contain a complete thought.
  • Creating a Searchable Index: Each chunk gets converted into a mathematical representation (an “embedding”) that captures its meaning. These embeddings are stored in a vector database, essentially a specialized search index optimized for finding semantically similar content rather than just matching keywords.
  • Query-Time Retrieval: When you ask a question, the system converts your question into the same type of embedding, then finds the chunks whose embeddings are most similar. In our case, we’re retrieving the top three most relevant chunks.
  • Generation with Context: The retrieved chunks become the retrieval_context, and your execution model generates a response based on those specific passages, just like a student writing an essay based on assigned readings.
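The document preparation step is the easiest one to picture in code. Below is a bare-bones sketch of fixed-size chunking with overlap. It is a simplification: a real splitter typically tries to break on paragraph or sentence boundaries rather than at arbitrary characters, and the overlap of 200 characters is an assumed value, since all we know is that there is “some overlap.”

```python
def split_with_overlap(text, chunk_size=1000, overlap=200):
    """Cut text into chunks of chunk_size characters that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# A 2,500-character stand-in for the text of a document.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)

print([len(c) for c in chunks])            # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True
```

The second print shows the overlap at work: the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.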

This is the real-world scenario faithfulness is designed for. We’re not hand-picking the perfect (or even imperfect!) excerpts anymore: the system is automatically deciding which parts of the document are relevant. This introduces new ways things can go wrong. The retrieval might find the wrong passages. The retrieved passages might not contain enough information. The model might blend information from the passages with its own knowledge.

Faithfulness testing becomes even more critical here because you need to verify the model is actually using what it retrieved, not making things up or adding details from elsewhere.
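As a rough picture of how the retrieval side might be wired with the packages installed above, here is a minimal sketch. It is not the listing from this post: `retrieve_context` and `build_prompt` are names introduced here, the overlap of 200 is assumed, and the prompt wording is invented. The library imports are kept inside the function so the prompt helper can be used on its own.

```python
def retrieve_context(pdf_path, question, k=3):
    """Return the text of the k chunks most similar to the question."""
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Load the PDF (one document per page) and split it into chunks.
    pages = PyPDFLoader(pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(pages)

    # Embed the chunks and index them in an in-memory Chroma store.
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    store = Chroma.from_documents(chunks, embeddings)

    # Embed the question and fetch the closest chunks.
    return [doc.page_content for doc in store.similarity_search(question, k=k)]


def build_prompt(question, context_chunks):
    """Ask the execution model to answer only from the retrieved chunks."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The list returned by `retrieve_context` is what would go into the test case as `retrieval_context`, and the string from `build_prompt` is what would be sent to the execution model. Keeping those two tied to the same chunks matters: the metric can only judge the answer against the context you actually hand it.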

Here’s an example of what I got when I ran this:


**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "The paper focuses on a realistic model of a warp drive rather than discussing its physical realizability.",
    "The first four sections review necessary physics for understanding the new warp drive model.",
    "Calculations regarding speed limits and energy requirements are presented in the paper.",
    "The cosmological constant ? is approximately 10^-47 (GeV)^4 or 10^-10 J/m3 in SI units.",
    "For a warp bubble expanding at the speed of light, the required energy would be increased by a factor of 10^52.",
    "The paper considers a spacecraft with dimensions that must be encompassed by the warp bubble.",
    "It is assumed that an advanced civilization could utilize the most efficient method of energy production."
]

Claims:
[
    "Jeff Nyman's approach focuses on a spacecraft with an 'exotic power generator' capable of manipulating extra dimensions to create a warp bubble.",
    "The paper calculates the *energy required* to create this warp bubble using Equation 31.",
    "The phrasing 'most efficient method of energy production' in Equation 31 is crucial, implying reliance on advanced civilization's energy technology without specifying the exact source.",
    "The calculations use the cosmological constant (?) as a base value for energy and increase it by a factor of 10^52 due to the expansion of space."
]

Verdicts:
[
    {
        "verdict": "no",
        "reason": null
    },
    {
        "verdict": "yes",
        "reason": "The context does not mention any specific power generator or manipulation of extra dimensions, contradicting the claim."
    },
    {
        "verdict": "idk",
        "reason": "While the context mentions advanced civilization's energy production methods, it does not specify that this is the 'most efficient method' as implied in the claim. The exact source of energy is also not detailed."
    },
    {
        "verdict": "no",
        "reason": null
    }
]

Score: 0.5
Reason: The score is 0.50 because there are no explicit contradictions provided in the 'contradictions' list, suggesting that the actual output aligns well with the retrieval context.

Keep in mind that now we’re testing a real LLM-generated response based on actual document retrieval and the results are instructive. The retriever pulled chunks from my paper and the execution model (ts-reasoner) generated a response from them.

The faithfulness score came in at 0.5: the same as our intentionally bad example earlier. But this time, the story is more nuanced. The metric extracted seven truths from the retrieved chunks and found four claims in the generated response. It’s the first claim that’s perhaps the most interesting here:

  • Claim 1: “Focuses on a spacecraft with an ‘exotic power generator’ manipulating extra dimensions” ✘ Rejected

The reason given was “The context does not mention any specific power generator or manipulation of extra dimensions.” (In the log above that reason is printed under the second verdict entry, but it plainly refers to this claim.) But, hold on. Looking at my paper, the entire premise is about manipulating extra dimensions! It’s literally in the title and throughout the document. From page 2:

“My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s).”

And from the conclusions (page 13):

“String theory suggests that dimensions are globally held compact by strings wrapping around them. If this is indeed the case, then it may be possible to even locally increase or decrease the string tension, or even locally counter the effects of some string winding modes.”

So Claim 1 is factually correct about the paper, but the faithfulness metric rejected it because that specific information wasn’t in the three chunks the retriever selected. This reveals something crucial about faithfulness evaluation. It’s not measuring whether the answer is correct about the source document. It’s measuring whether the answer is faithful to the retrieved context gathered from the source document.

This is actually the correct behavior for a RAG system, but it exposes a critical distinction. There is document-level correctness (Does this match what’s in the paper?) and retrieval-level faithfulness (Does this match what the retriever found?).

My RAG system’s retriever chose three chunks that apparently discussed cosmological constants, energy calculations, and general concepts, but missed the chunks containing the exotic power generator and extra dimension manipulation details.
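One cheap way to tell these two levels apart when a claim gets rejected is to look for its key phrase in both places. The sketch below does a plain case-insensitive substring check, which is much cruder than the metric’s own judgment but is enough to separate “the retriever missed it” from “it isn’t in the document at all.” The function names are invented here, and the sample strings are stand-ins built from the page 2 quote and one of the logged truths, not the actual chunk text.

```python
def normalize(text):
    """Lowercase and collapse whitespace so line breaks don't hide a match."""
    return " ".join(text.lower().split())


def locate_phrase(phrase, retrieved_chunks, full_document):
    """Report where a phrase can be found: retrieved context, document, or neither."""
    needle = normalize(phrase)
    if any(needle in normalize(chunk) for chunk in retrieved_chunks):
        return "in retrieved context"
    if needle in normalize(full_document):
        return "in document, missed by retrieval"
    return "not in document"


# Stand-in text: the page 2 quote plus one sentence echoing a logged truth.
full_document = (
    "My route was to envision a spacecraft with an exotic power generator "
    "that could create the necessary energies to locally manipulate the "
    "extra dimension(s). Calculations regarding speed limits and energy "
    "requirements are presented."
)
retrieved_chunks = [
    "Calculations regarding speed limits and energy requirements are presented."
]

print(locate_phrase("exotic power generator", retrieved_chunks, full_document))
# in document, missed by retrieval
print(locate_phrase("zero-point energy", retrieved_chunks, full_document))
# not in document
```

The first result points at the retriever and the second points at the model, which is the split you need before deciding what to fix.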

What this means is the output you got depends on the chunks your code gathered. That probably suggests to you a way to refine this script, right? We’re actually going to come back to this in a future post.

This is a perfect example of why retrieval quality is paramount in RAG systems. You can have a perfect source document, a capable language model, and accurate faithfulness evaluation, but if your retrieval selects the wrong chunks, the entire pipeline fails. The model might generate something true about the document but get penalized because it wasn’t in the retrieved context. Or, worse, it might only use the retrieved context and miss key information entirely.

Observability

A logical question here is, “Well, can I see the chunks it actually picked?” You sure can! Modify the code like this:

This will display each chunk that was retrieved, which page it came from (helpful for tracing back to the PDF), and the actual text content.
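A minimal sketch of that kind of display is below. It assumes the retrieved items look like LangChain documents, meaning each has a `page_content` string and a `metadata` dictionary with a `page` entry. `describe_chunks` is a name introduced here, and the sample objects are stand-ins, not real retrieved chunks.

```python
from types import SimpleNamespace


def describe_chunks(docs):
    """Format retrieved chunks as text; docs need .page_content and .metadata."""
    lines = []
    for number, doc in enumerate(docs, start=1):
        page = doc.metadata.get("page", "unknown")
        lines.append(f"--- Chunk {number} (page index: {page}) ---")
        lines.append(doc.page_content.strip())
    return "\n".join(lines)


# Stand-in objects shaped like retrieved documents.
sample_docs = [
    SimpleNamespace(page_content="First retrieved passage.\n", metadata={"page": 1}),
    SimpleNamespace(page_content="Second retrieved passage.", metadata={}),
]

print(describe_chunks(sample_docs))
```

As far as I know, the PDF loader’s page index starts at 0, so you may need to add one when matching a chunk to the page number printed in the PDF.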

Recapping

To understand faithfulness, consider forensic document examination. When an examiner analyzes a questioned signature, they can only make claims based on the exemplars (known samples) they have in front of them. If they say “this signature matches the exemplars,” they’re being faithful to their evidence. If they say “this person also uses distinctive flourishes on capital letters,” but none of the exemplars show capital letters, they’ve gone beyond their source material. That’s the case even if they’re entirely correct about the person’s general handwriting style. Faithfulness isn’t about being correct about reality; it’s about being traceable to the specific evidence at hand.

Imagine if we had our RAG looking at, say, documents of famous conspiracy theories about the Moon landings or extraterrestrial cover ups. Even though what those sources say may bear no correspondence to reality, someone can still be faithful to the concepts those theories discuss or the conclusions those theories come to. Faithfulness to the material does not depend on that material being factually accurate or truthful.

At the risk of an aside, I’m a Thomist in orientation and the same principle appears in Thomistic philosophy, particularly in how Thomas Aquinas approached scriptural interpretation. Aquinas distinguished between what a text actually says (sensus litteralis) and what can be legitimately inferred from it. He insisted that theological arguments must be grounded in what Scripture explicitly states, not merely in what seems consistent with it. A faithful reading meant you could point to the actual words on the page. An unfaithful reading (even a plausible one) introduced claims the text itself didn’t support. This is precisely what the Faithfulness metric does: it checks whether your LLM is working from the text it retrieved or smuggling in knowledge from elsewhere.

This is why Faithfulness matters as a metric: without it, you might have a RAG system that’s knowledgeable and sounds authoritative but is actually blending retrieved facts with training data, making it impossible to verify claims or trust the sourcing. The metric helps you catch that pattern before it undermines the reliability of your system.

What Does the Testing Tell Us

I really want to hammer home these points because they are crucial to understanding interaction with LLMs. Our testing reveals an important distinction between document accuracy and retrieval fidelity in RAG systems. When we evaluated ts-reasoner against my warp drive paper question, we observed three notably different scenarios that expose the layers where faithfulness can succeed or fail.

The controlled valid case (scoring 1.0) demonstrated that when given the right context, ts-reasoner generates responses that stay disciplined to the source material. All four claims traced directly back to explicit statements in the retrieval context: matter/antimatter annihilation, the 10^28 kilograms antimatter requirement, Jupiter’s mass-energy equivalence, and the thin-shell optimization.

Test Finding: This tells us that ts-reasoner can operate in “faithful scholar” mode when the retrieval context contains the information needed to answer the question. It doesn’t pad responses with plausible-sounding additions from its training data.

The controlled invalid case (scoring 0.5) was deliberately designed to fail, but how it failed matters. The metric caught two distinct failure modes: direct contradiction (claiming zero-point energy when the context explicitly stated matter/antimatter) and unsupported invention (introducing Casimir effects and Earth-diameter calculations that appeared nowhere in the source). The “idk” verdict on the second claim was particularly instructive.

Test Finding: The faithfulness metric doesn’t just flag “wrong answers.” It distinguishes between claims that contradict the source versus claims that simply aren’t supported by it. Both are faithfulness failures, but they represent different types of hallucination. This granularity matters when diagnosing why a RAG system is producing unfaithful outputs.

The RAG case (also scoring 0.5) exposed the most subtle and important finding. The response mentioned “exotic power generator” and “manipulating extra dimensions”, claims that are absolutely correct about the paper itself, appearing explicitly on pages 2 and 13. Yet the faithfulness metric rejected them because they weren’t in the three chunks the retriever selected.

Test Finding: This reveals that faithfulness metrics operate at the retrieval level, not the document level. The score of 0.5 wasn’t measuring whether ts-reasoner was correct about the paper; it was measuring whether ts-reasoner stayed faithful to what the retriever provided from the paper. This is the correct behavior for a RAG system, but it exposes a critical vulnerability: poor retrieval quality will cause faithfulness failures even when the model’s response is factually accurate about the source document.

What this testing taught us is that Faithfulness functions as a source-traceability detector, not a factual correctness detector. The RAG response wasn’t wrong about the paper; it was drawing from knowledge beyond the retrieved chunks. But in a production RAG system, that’s precisely what we want to prevent! If the model starts adding “correct” information that wasn’t in the retrieval context, users lose the ability to verify claims by checking the source chunks.

Students who use AI to craft their essays run into this problem quite a bit!

For our purposes, this means we now have a baseline understanding of a critical failure mode in RAG systems: the retrieval bottleneck. You can have a perfect source document, a capable language model, and accurate faithfulness evaluation, but if your retriever selects the wrong chunks, the entire pipeline fails.

Test Finding: The testing also revealed that monitoring faithfulness scores in production could serve as an early warning system for retrieval quality degradation. If faithfulness scores start dropping while your source documents and model remain unchanged, you know to investigate your retrieval mechanism. Perhaps your chunking strategy needs adjustment, or your embedding model isn’t capturing the right semantic relationships.
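As a sketch of what such an early warning might look like, the class below keeps a rolling window of recent faithfulness scores and flags when their mean falls too far below a baseline. Everything here is an assumption for illustration: the class name, the window size, and the allowed drop would all need tuning against your own traffic.

```python
from collections import deque


class FaithfulnessMonitor:
    """Flag when the rolling mean of recent scores drops below a baseline."""

    def __init__(self, baseline, window=20, max_drop=0.15):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically

    def record(self, score):
        """Add a score; return True when the rolling mean has dropped too far."""
        self.scores.append(score)
        rolling_mean = sum(self.scores) / len(self.scores)
        return (self.baseline - rolling_mean) > self.max_drop


monitor = FaithfulnessMonitor(baseline=0.9, window=3, max_drop=0.15)
alerts = [monitor.record(score) for score in [0.9, 0.9, 0.5, 0.5]]
print(alerts)  # [False, False, False, True]
```

A single low score does not trip the alert in this example. It takes a second one to pull the window’s mean down far enough, which keeps one unlucky retrieval from paging anyone.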

The variance between our controlled cases (where we hand-picked context) and the RAG case (where retrieval was automatic) quantifies the risk: moving from manual context selection to automated retrieval cut our faithfulness score in half, from 1.0 to 0.5. That’s not because ts-reasoner became less capable. It’s because the system stopped providing it with the right source material.

That distinction between the model I’m testing and the system in which that model is being utilized matters tremendously in testing AI. It would do me no good to say that ts-reasoner is terrible if, in fact, it was working just fine and it was the RAG portion that was suboptimal. Similarly, if the RAG portion is working demonstrably fine, but we’re still getting errors, that would suggest ts-reasoner needs some fine-tuning work.

Next Steps

I think we have a form of cadence here with these metrics, so in the next post I’m going to dive into another one to keep our momentum.

This article was written by Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.
