AI and Testing: Contextual Precision

In the previous post we looked at the Faithfulness metric with DeepEval and got some intuitions in place about how to start thinking about using metrics in general. In this post, we’ll look at a third metric.

Putting First Things First

I think it’s safe to assume that when you ask someone a question and they have multiple pieces of information to share, you would prefer the most relevant details upfront. If you ask “Why did my car fail inspection?” and someone responds with “Well, your manufacturer is generally known for reliability but cars have complex systems. Especially modern cars. And there are emissions standards to consider. Oh, and your brake pads are completely worn out,” they’ve buried the needed answer under a pile of context.

LLMs working with RAG systems face this exact challenge. When retrieving multiple chunks of information from a knowledge base, the model has to decide what order to present them in, or which ones to emphasize. A system might pull ten relevant passages, but if the passage that actually answers your question is sitting at position eight while tangentially related context fills positions one through seven, you’ve got a precision problem. The information is there, but it’s not where it needs to be. This is where contextual precision comes in. For this, we’ll look at the aptly named ContextualPrecisionMetric.

Initial Test Case

Let’s jump right into some logic.
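As a minimal sketch of the kind of test case this section describes, here is how the high precision example could be assembled with DeepEval’s LLMTestCase and ContextualPrecisionMetric. The three chunk strings are taken from the quotes in the verdict logs shown further down, and the question wording comes from later in this post. The expected_output and actual_output strings are placeholders I’m assuming for illustration, and the helper names (build_case, run_metric) are hypothetical. The metric needs an LLM judge to be configured, so the DeepEval imports are deferred until the helpers are actually called.

```python
# Sketch only: data for the high precision example, in retrieval order.
QUESTION = "What energy source does the paper propose?"

# Chunk text as quoted in the metric's verdict logs.
HIGH_PRECISION_CONTEXT = [
    "Jeff Nyman's paper proposes that an arbitrarily advanced civilization "
    "would utilize matter/antimatter annihilation as the most efficient "
    "energy production method. The paper determines this warp bubble would "
    "require around 10^28 Kg of antimatter to generate, roughly the "
    "mass-energy of the planet Jupiter.",
    "The paper notes this energy requirement would drop dramatically if "
    "using a thin-shell of modified space-time instead of a bubble "
    "encompassing the volume of the craft",
    "Through calculations based on the cosmological constant and spacecraft "
    "volume, the energy requirements for faster-than-light travel are "
    "determined.",
]

# Placeholder strings (assumed for illustration, not the original values).
EXPECTED_OUTPUT = (
    "Matter/antimatter annihilation, requiring around 10^28 Kg of antimatter."
)
ACTUAL_OUTPUT = EXPECTED_OUTPUT  # stub standing in for the model's answer


def build_case(retrieval_context):
    """Wrap the question, outputs, and ranked chunks in a DeepEval test case."""
    from deepeval.test_case import LLMTestCase

    return LLMTestCase(
        input=QUESTION,
        actual_output=ACTUAL_OUTPUT,
        expected_output=EXPECTED_OUTPUT,
        retrieval_context=retrieval_context,
    )


def run_metric(test_case):
    """Score one test case; requires a configured judge model."""
    from deepeval.metrics import ContextualPrecisionMetric

    metric = ContextualPrecisionMetric(
        threshold=0.5,        # pass/fail cutoff (an assumed value)
        include_reason=True,  # ask the judge to explain the score
        verbose_mode=True,    # print the verdict logs
    )
    metric.measure(test_case)
    return metric.score, metric.reason
```

Calling run_metric(build_case(HIGH_PRECISION_CONTEXT)) would then produce logs of the kind shown below. A low precision case would reuse the same helpers with a differently ordered list of chunks.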

You’ll notice we’re using variables similar to those in the previous posts. We have our question, which serves as the input. We have the actual_output, which is what the model actually generated (or, in this case, what we’re stubbing in for the model). We also have the retrieval_context, which indicates what chunks were retrieved (and in what order).

This time, however, we also have the expected_output. This is the ground truth for what the answer should contain. The Contextual Precision metric uses expected_output, along with the input, to do two primary things.

  1. Determine which nodes are relevant. The metric judges each retrieved chunk against the input and the expected output to decide whether that chunk would help produce the correct answer.
  2. Evaluate usefulness of the ordering. It checks if the relevant chunks were ranked highly in the retrieval.

Given those aspects, the metric then evaluates: “Given what the model should have generated, were the most useful chunks ranked first?” Note what’s happening here: Contextual Precision scores the retriever’s ranking, not the generated answer itself. The actual_output is still part of the test case, but the verdicts are driven by the expected output, which you can see in the “reason” text of the logs later in this post.

It’s worth noting that the retrieval context in the low precision test case has three chunks: the first is irrelevant, the second is tangential, and the third is relevant.

Run that script and let’s consider some possible execution output.

High Precision Case

Here is the output I got for the high precision test case:


===========================================================
HIGH PRECISION EXAMPLE (relevant chunks, well-ordered)
===========================================================
**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "yes",
        "reason": "'Jeff Nyman's paper proposes that an arbitrarily advanced civilization would utilize matter/antimatter annihilation as the most efficient energy production method. The paper determines this warp bubble would require around 10^28 Kg of antimatter to generate, roughly the mass-energy of the planet Jupiter.' directly addresses the question by providing the specific energy source and quantity needed."
    },
    {
        "verdict": "no",
        "reason": "'The paper notes this energy requirement would drop dramatically if using a thin-shell of modified space-time instead of a bubble encompassing the volume of the craft' does not provide the required information about the matter/antimatter annihilation and its quantity."
    },
    {
        "verdict": "no",
        "reason": "'Through calculations based on the cosmological constant and spacecraft volume, the energy requirements for faster-than-light travel are determined.' is too general and does not specify the exact energy source or the required amount of antimatter."
    }
]

Score: 1.0
Reason: The score is 1.00 because the first node directly addresses the question by providing the specific energy source (matter/antimatter annihilation) and its quantity (around 10^28 Kg, roughly the mass-energy of the planet Jupiter). The subsequent nodes are ranked lower as they either provide irrelevant information about alternative methods or are too general without specifying the exact requirements.

What does this output tell us? Well, Contextual Precision measures whether your retrieval system ranks relevant chunks highly and avoids polluting results with noise. Unlike faithfulness (which checks if answers match sources), this metric evaluates the quality of your retrieval ordering. For the question “What energy source does the paper propose?”, I provided three retrieved chunks in order. Here’s how the metric evaluated them:

  • Chunk 1 ✔ Relevant (Verdict: “yes”)
  • Chunk 2 ✘ Not directly relevant (Verdict: “no”)
  • Chunk 3 ✘ Too general (Verdict: “no”)

The perfect score of 1.0 reflects that the most relevant chunk appeared first in the retrieval results. The metric’s scoring formula rewards this heavily. Having your best information at position #1 is exactly what you want. Notice that chunks 2 and 3 were marked as “no” (not directly relevant), but this didn’t hurt the score. Why? Because they appeared after the highly relevant chunk. The metric is primarily checking: “Did irrelevant chunks push relevant ones down the ranking?”
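To make “rewards this heavily” concrete: as I understand DeepEval’s documentation, the score is a weighted cumulative precision. For each position k that holds a relevant node, take the fraction of relevant nodes among the first k, then average those fractions over the number of relevant nodes. The sketch below is my own re-implementation of that idea, not DeepEval’s code. The function name is hypothetical, and the input is the list of yes/no verdicts from the log above.

```python
def contextual_precision(verdicts):
    """Weighted cumulative precision over ranked yes/no verdicts.

    verdicts: booleans in retrieval order (True means verdict "yes").
    """
    relevant_seen = 0
    total = 0.0
    for k, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_seen += 1
            # Precision at rank k, counted only where a relevant node sits.
            total += relevant_seen / k
    if relevant_seen == 0:
        return 0.0  # nothing relevant was retrieved at all
    return total / relevant_seen


# Verdicts from the high precision log: yes, no, no.
print(contextual_precision([True, False, False]))  # 1.0
```

Only positions holding a “yes” contribute to the sum, so trailing “no” verdicts cost nothing. A “no” that sits above a “yes” lowers that node’s precision at its rank, and that is where the penalty comes from.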

This is ideal retrieval behavior. When a user asks a specific question, your system should surface the most directly relevant information first, follow with supporting or related details, and keep tangential information ranked lower. In practice, achieving this consistently is challenging, especially when you have multiple documents with overlapping terminology.

Low Precision Case

Here is some output I got for the low precision case.


===========================================================
LOW PRECISION EXAMPLE (noise + relevant chunk buried)
===========================================================
**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The first document does not mention anything about Jeff Nyman's paper or the energy source needed for a warp bubble."
    },
    {
        "verdict": "no",
        "reason": "The second document provides context on the concept of warp drives but does not specify the energy source proposed by Jeff Nyman."
    },
    {
        "verdict": "yes",
        "reason": "The third document directly states that 'Jeff Nyman's paper proposes that an arbitrarily advanced civilization would utilize matter/antimatter annihilation as the most efficient energy production method, requiring around 10^28 Kg of antimatter.' This information is crucial to the expected output."
    }
]

Score: 0.3333333333333333
Reason: The score is 0.33 because nodes 1 and 2 are ranked higher than node 3 despite being irrelevant, as they do not address Jeff Nyman's paper or the energy source needed for a warp bubble.

Here, we gave the system the same question, but this time the retrieved chunks were poorly ordered, with irrelevant content ranked higher than the actual answer.

  • Chunk 1 ✘ Irrelevant (Verdict: “no”)
  • Chunk 2 ✘ Tangentially related (Verdict: “no”)
  • Chunk 3 ✔ Actually answers the question! (Verdict: “yes”)

The score of 0.33 reflects a significant precision problem. The metric penalizes this ordering because, as it says, “nodes 1 and 2 are ranked higher than node 3 despite being irrelevant.” The relevant information exists; it’s just buried at position #3 beneath noise. This is a classic RAG failure mode.
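The 0.33 isn’t arbitrary. As I understand DeepEval’s documentation, with n retrieved nodes, r_k equal to 1 when node k is relevant (0 otherwise), and R the number of relevant nodes, the score is a weighted cumulative precision. Plugging in the verdicts from the log (no, no, yes) gives R = 1 and a single non-zero term at k = 3:

```latex
\text{Contextual Precision}
  = \frac{1}{R} \sum_{k=1}^{n}
    \left( \frac{\text{relevant nodes in the top } k}{k} \times r_k \right)
  = \frac{1}{1} \left( \frac{0}{1} \cdot 0 + \frac{0}{2} \cdot 0 + \frac{1}{3} \cdot 1 \right)
  = \frac{1}{3} \approx 0.33
```

Had the relevant chunk been at position #2 instead, the same arithmetic would give 1/2.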

It’s easy to see why this matters in practice. Imagine you’re building a RAG system with a token limit. If you can only pass the top two chunks to your LLM, then, in our high precision case, the model gets the perfect chunk at position #1. Wonderful! However, in the low precision case, the model gets two irrelevant chunks and never sees the answer. Not so wonderful.
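The top-two scenario can be sketched in a few lines. The verdict lists mirror the two cases above, and top_k_has_answer is a hypothetical helper of mine, not part of DeepEval.

```python
def top_k_has_answer(verdicts, k):
    """Return True if any of the first k ranked chunks is relevant."""
    return any(verdicts[:k])


high_precision = [True, False, False]  # relevant chunk at position 1
low_precision = [False, False, True]   # relevant chunk at position 3

print(top_k_has_answer(high_precision, k=2))  # True
print(top_k_has_answer(low_precision, k=2))   # False
```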

Even if you pass all three chunks, ordering still matters, because many models give more weight to earlier context. Having your best information at position #3 means the model might prioritize the wrong information, the answer might be less confident or accurate, or generation might waste tokens processing irrelevant content.

Fine, but then why did this happen in the above case? The example was crafted to show the retriever matching on semantic similarity rather than actual relevance. In a real system, you would have to look at how your retriever actually scored each chunk, but this control case is a good example of what does happen. “Kaluza-Klein” and “extra dimensions” are mentioned in my paper. “Warp drive” appears in both irrelevant chunks. An embedding-based retriever could easily score these chunks as relevant because of that overlap in terminology. However, topical similarity does not equal query relevance. Just because a chunk discusses related concepts doesn’t mean it answers the specific question asked.

What this shows us is that contextual precision exposes a critical truth about RAG systems: retrieval quality determines generation quality. You can have the perfect answer in your document store and a very capable language model, yet still have incredibly poor retrieval ranking. If so, your model will produce bad results. This is why evaluating retrieval separately from generation is essential, and why metrics like Contextual Precision are so valuable for diagnosing RAG systems.

Hooking Up the Model

Let’s now hook up the ts-reasoner model and execute against that.

This is very similar to what we did for the Faithfulness metric. Let’s consider some possible output (and I’ll truncate it a bit just to save space).


============================================================
RETRIEVED CHUNKS
============================================================
...
============================================================
GENERATED RESPONSE
============================================================

============================================================
CONTEXTUAL PRECISION EVALUATION
============================================================
**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The first document does not provide any specific information about the energy source needed for generating a warp bubble. It only discusses the conceptual framework and goals of the paper."
    },
    {
        "verdict": "no",
        "reason": "The second document also does not address the energy requirements or sources for creating a warp bubble, focusing instead on the plausibility and physical constraints of a warp drive."
    },
    {
        "verdict": "yes",
        "reason": "The third document explicitly discusses the energy required to create the necessary warp bubble. It states that 'I can now look at the energy required to create the necessary warp bubble' and provides detailed calculations, such as Equation 29, which leads to the conclusion that 'the total amount of energy \u2018injected\u2019 locally would equal [Equation 31]'."
    }
]

Score: 0.3333333333333333
Reason: The score is 0.33 because nodes 1 and 2 are ranked higher than node 3, despite node 3 being relevant as it explicitly discusses the energy source needed for generating a warp bubble. Nodes 1 and 2 provide no useful information regarding the input question.

One thing you might notice there is mention of a “second” and “third” document. What’s going on here? We’re only reading in one document, aren’t we? Yes, we are. However, in DeepEval’s terminology, “document” refers to chunks in the retrieval context, not separate PDF files. Each item in your retrieval_context list is considered a “document” or “node” by the metric.

As before, with this example, we’re seeing what happens when we let the actual RAG system retrieve chunks from the paper: no hand-crafted examples, just the retriever doing its job.

The retriever pulled three chunks from my paper:

  • Chunk 1 (from page 1) ✘ Not relevant
  • Chunk 2 (from page 6) ✘ Not relevant
  • Chunk 3 (from page 10) ✔ Actually answers the question!

I’m getting the page number information from the chunks that I truncated out of the output.

The score of 0.33 reveals, as before, a significant retrieval problem. Despite having the correct information in the paper, the retriever ranked two irrelevant chunks (1 and 2) higher than the relevant one (3) and buried the actual answer at position #3. Note that this test would have completely failed if we had only passed the top two chunks to the model.

The metric’s reasoning makes this clear: “Nodes 1 and 2 are ranked higher than node 3, despite node 3 being relevant as it explicitly discusses the energy source… Nodes 1 and 2 provide no useful information regarding the input question.”

Why did this happen? The question asked specifically about energy sources, but, as with the control example I showed, the retriever matched on broader semantic similarity. My chunks 1 and 2 from the original output looked like this:


--- Chunk 1 (Page 1) ---
I find it very challenging to make predictions on how this warp drive might actually function. My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s). In this way, an advanced spacecraft would expand/contract the compactified space-time around it, thereby creating the propulsion effect. That being said, my goal in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.

--- Chunk 2 (Page 6) ---
of Relativity. An element missing from all the papers is that there is little or no suggestion as to how such a warp bubble may be created. I do not plan to buck that trend too much in that the aim of this paper is not to discuss the plausibility of a warp drive. This means I am not addressing the valid questions associated with violation of the null

Those chunks mention “warp bubble” and “warp drive.” They discuss the concept rather than the energy requirements. The embedding similarity was high enough to rank them above the calculation section. This is the same failure mode we saw in our control example: topical relevance doesn’t equal query relevance. The retriever found chunks about warp drives when it needed chunks about energy calculations.

This 0.33 score reveals the core challenge in RAG systems. Looking at what the model generated, it mentioned concepts like “exotic power generators” and “extra dimension manipulation.” Those concepts are in the paper, but they weren’t in the retrieved chunks that the model had access to. This means the model was either drawing on its training data (hallucinating from general knowledge about warp drive papers; mine isn’t the only one out there), or making inferences beyond what was explicitly stated in the retrieved context.

The root cause isn’t a generation problem: it’s a retrieval problem. The chunks that would have supported an accurate, grounded answer existed in the document but were either never retrieved or were ranked too low to be useful. This is exactly what Contextual Precision measures: not whether the right information exists in your knowledge base, but whether your retrieval system surfaces it appropriately.

This demonstrates why RAG evaluation must assess retrieval separately from generation. When you see poor answer quality, the diagnostic questions should be:

  1. Did the retriever find the right chunks at all?
  2. Did it rank them appropriately?
  3. Was all necessary information retrieved?

In this case, the answers are: “eventually” (it found the right chunk), “no” (it ranked it last), and “partially” (the information was there but buried). This single retrieval failure cascades through the entire pipeline, leading to answers that lack grounding in the actual source material.
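Those three questions map onto simple ranking checks. This is a minimal sketch, assuming you already have a yes/no relevance verdict per retrieved chunk, as in the logs above. For the third question it also needs a count of how many relevant chunks exist in the whole document. That count is something you’d have to supply yourself, and the 2 used here is a made-up figure for illustration. The function and field names are hypothetical.

```python
def diagnose_retrieval(verdicts, total_relevant_in_document):
    """Answer the three diagnostic questions from ranked yes/no verdicts."""
    found = sum(verdicts)  # number of relevant chunks actually retrieved
    first_rank = next(
        (rank for rank, ok in enumerate(verdicts, start=1) if ok), None
    )
    coverage = (
        found / total_relevant_in_document if total_relevant_in_document else 0.0
    )
    return {
        "found_any": found > 0,             # 1. found the right chunks at all?
        "first_relevant_rank": first_rank,  # 2. ranked appropriately?
        "coverage": coverage,               # 3. all necessary info retrieved?
    }


# Verdicts from the RAG run: no, no, yes. The 2 is illustrative only.
print(diagnose_retrieval([False, False, True], total_relevant_in_document=2))
```

With those inputs, the report says a relevant chunk was found, that it first appears at rank 3, and that only half of the assumed relevant material made it into the context.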

If this were our test result, what would the test be telling us to do? It’s specifically indicating that improving this system would require tuning the retrieval side. What kind of tuning? Well, that would have to be experimented with. Some ideas:

  • Chunk size (maybe 1000 tokens is too large?)
  • Retrieval strategy (maybe semantic search alone isn’t enough?)
  • Number of chunks retrieved (maybe k=5 would help?)
  • Query reformulation (maybe “energy calculations” vs “energy source”?)

Notice how the Contextual Precision metric helps us identify these problems before they become answer quality issues.

What Does the Testing Tell Us

A point I really want to bring home here is that our Contextual Precision testing reveals the root cause of problems we observed in the Faithfulness evaluation. When we tested ts-reasoner in the previous post, we saw a faithfulness score of 0.5, with the model making claims about “exotic power generators” and “manipulating extra dimensions” that weren’t in the retrieved context. At the time, we identified this as a retrieval bottleneck. Now, with Contextual Precision, we can see exactly how that bottleneck operates.

The controlled high precision case (scoring 1.0) demonstrated ideal retrieval behavior: the most relevant chunk appeared first, followed by supporting information, with tangential content ranked last. When retrieval works this way, the generation model has exactly what it needs at the top of its context window.

Test Finding: Perfect retrieval ordering enables faithful generation. When the retriever surfaces the right information in the right order, downstream metrics like Faithfulness naturally improve because the model has no reason to reach beyond its context.

The controlled low precision case (scoring 0.33) showed what happens when retrieval fails: two irrelevant chunks ranked above the one that actually answers the question. The relevant information existed; it was simply buried beneath noise. This is exactly the failure mode that leads to faithfulness problems.

Test Finding: Contextual Precision measures the same retrieval bottleneck that Faithfulness detects, but it catches it at the source. Where Faithfulness tells you what (“the model made unsupported claims”), Contextual Precision tells you why (“because the retriever failed to surface the right chunks appropriately”).

This relationship mirrors forensic investigation. Faithfulness is like examining a crime scene and noting “the evidence doesn’t support this conclusion.” Contextual Precision is like discovering “because the evidence collection team photographed the wrong area first, missing the actual evidence site.” The crime scene tells you what went wrong; the collection protocol tells you why. Just as forensic failures cascade from collection through analysis to conclusion, RAG failures cascade from retrieval through generation.

The RAG case (also scoring 0.33) exposed how this plays out with real retrieval. The system pulled three chunks from the paper, but ranked the relevant one last. This ordering explains the faithfulness failures we saw earlier. When ts-reasoner generated claims about “exotic power generators” and “extra dimension manipulation,” it wasn’t hallucinating random concepts; it was filling gaps left by poor retrieval with knowledge from its training data. Those concepts are in the paper, on pages 2 and 13, as we saw with Faithfulness, but they weren’t in the chunks the retriever selected.

Notice how subtle that can be! An AI model can pull in information from somewhere else that actually does align with what you’re conversing with it about, yet the result is still muddied because you can no longer tell what came from the source and what came from somewhere else.

Think of this like archaeological stratigraphy. In a well-preserved site, the most recent and relevant artifacts appear in upper layers, clearly dated and contextualized. In a disturbed site, later materials get mixed into earlier strata. The information exists but its ordering is corrupted. Finding our answer at position #3 beneath irrelevant chunks is like discovering Roman coins beneath Bronze Age pottery: the artifacts are genuine, but something has compromised their positional relationship.

Test Finding: The cascade from retrieval failure to generation failure is measurable and predictable. A Contextual Precision score of 0.33 (relevant chunk ranked last) directly led to a Faithfulness score of 0.5 (claims from outside retrieved context). The model isn’t the problem; the retrieval mechanism is.

What makes this particularly important is that both metrics scored the same scenario differently, revealing different aspects of the same failure. Faithfulness measured the symptom (unfaithful output), while Contextual Precision measured the cause (poor retrieval ordering). This demonstrates why RAG evaluation requires multiple metrics: a single metric tells you that something failed, but multiple metrics working together tell you why and where.

To bring that point home, consider what these two metrics reveal when used together:

  • High Faithfulness + High Contextual Precision: System working well
  • High Faithfulness + Low Contextual Precision: Model compensating for poor retrieval (risky)
  • Low Faithfulness + High Contextual Precision: Generation problem, not retrieval
  • Low Faithfulness + Low Contextual Precision: Retrieval failure cascading to generation (our case)
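That four-way reading can be written down as a small lookup. In this sketch the 0.7 cutoff between “high” and “low” is an assumption of mine, chosen so that the 0.5 and 0.33 scores from these posts both count as low. The function name is hypothetical and the returned strings are the labels from the list above.

```python
def diagnose(faithfulness, contextual_precision, threshold=0.7):
    """Map a pair of metric scores onto the four cases listed above."""
    faithful = faithfulness >= threshold
    precise = contextual_precision >= threshold
    if faithful and precise:
        return "System working well"
    if faithful:
        return "Model compensating for poor retrieval (risky)"
    if precise:
        return "Generation problem, not retrieval"
    return "Retrieval failure cascading to generation"


# Scores from the ts-reasoner runs: faithfulness 0.5, contextual precision 0.33.
print(diagnose(0.5, 0.33))  # Retrieval failure cascading to generation
```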

If you’ll forgive yet another analogy, this multi-metric analysis works like differential diagnosis in medicine. A patient presenting with fever could have causes A, B, C, or D. You need multiple tests to triangulate the root cause. Faithfulness alone is like measuring temperature; it tells you that something is wrong. Adding Contextual Precision is like checking white blood cell count; now you can distinguish between infection types and localize the problem to specific subsystems. In our case, both metrics pointing to the same failure mode (low scores on both) indicates the infection site is in retrieval, not generation.

The gap between our controlled cases and the RAG case quantifies the impact of retrieval quality. In our high precision example, the retriever scored 1.0 because we hand-crafted optimal ordering. In the RAG case, automatic retrieval scored 0.33, a 67% degradation. That degradation lines up with our faithfulness drop from 1.0 (controlled valid case) to 0.5 (RAG case) in the previous post.

Test Finding: Retrieval quality determines generation quality in RAG systems. Improving Contextual Precision scores should improve Faithfulness scores, because better chunk ordering gives the model better source material to work from. This is why the tuning suggestions matter: adjustments to chunk size, retrieval strategy, or the number of chunks retrieved aren’t just optimizations; they’re interventions that address the root cause of faithfulness failures.

For our purposes, this testing establishes that ts-reasoner is capable of faithful generation when provided with appropriate context. The 0.5 faithfulness score, from the previous post, wasn’t indicating a model problem; it was indicating a retrieval problem. The 0.33 contextual precision score in this post confirmed this diagnosis by showing exactly where the retrieval failed: relevant information ranked last, irrelevant information ranked first.

This is why monitoring multiple metrics in production is essential. A faithfulness drop without a corresponding contextual precision drop would suggest investigating the generation model. But faithfulness and contextual precision dropping together? That’s your retrieval system degrading.

The most important insight from this testing is that RAG failures cascade from the bottom up, not the top down. If your retrieval is excellent but your model hallucinates, that’s a generation problem. But if your retrieval is poor, even a perfect model will struggle to maintain faithfulness because it’s working from inadequate source material. As stated earlier, but worth repeating, Contextual Precision gives us the diagnostic tool to catch retrieval problems before they become answer quality problems, which is exactly what effective testing should do: identify failure points early in the pipeline where they’re easier and cheaper to fix.

This is the cost-of-mistake curve I previously talked about, in action.

Next Steps!

We’ve now looked at two metrics, and I’ve started to show how they interrelate. Next up are three related posts, which will be published simultaneously, where I bring together a few of these ideas and give you an idea of what actually testing an LLM looks like in context. Check out the first of these posts to continue the thread!


This article was written by Jeff Nyman

