AI and Testing: Recall, Relevancy, and Richer Evaluation

In the previous posts we looked at the Faithfulness and Contextual Precision metrics with DeepEval, and started building an intuition for how retrieval failures cascade into generation failures. Those two metrics told us what went wrong and where in the pipeline. In this post, we’ll add three more tools to the diagnostic kit: Contextual Recall, Contextual Relevancy, and G-Eval.

As part of this post, we’ll also switch documents. Instead of the warp drive paper we looked at before, we’ll evaluate a philosophical essay, which turns out to be a surprisingly instructive stress test for a RAG system.

If you’ve gone through my previous posts, you likely have all the dependencies you need, from LangChain, to DeepEval, to BeautifulSoup. I will also use my own ts-reasoner and ts-evaluator models, which, again, we’ve used in previous posts.

A Different Kind of Document

The warp drive paper was a good first subject for RAG evaluation because it has discrete factual claims: specific equations, specific energy quantities, specific proposals. When a retriever missed the energy calculation section, the failure was visible and measurable. Either the chunk about 10²⁸ kg of antimatter was retrieved or it wasn’t.

A philosophical essay presents a different challenge. Consider an essay that uses the words “limits,” “coherence,” “history,” and “Creator” in nearly every section, but with shifting meanings depending on context. In one section, “limits” refers to the speed of light. In another, it refers to the limits of mathematical elegance as a guide to physics. In another, it refers to creaturely epistemic finitude. A semantic retriever working from vector embeddings will see these chunks as highly similar. The vocabulary trap is built into the document itself.

This is actually a gift for evaluation purposes. It means we can construct questions where a naive retriever will almost certainly pull topically-related but query-irrelevant chunks, and we can measure exactly how badly that hurts the downstream response.

The essay we’ll use is my “Coherence at the Edge: Colliders, Constraints, and the Architecture of Reality”, which is a draft piece that moves from particle physics through cosmology into speculative metaphysics and theology, and back to particle physics again.

The essay is a working draft, which means it’s structurally complete but the argument is still being refined. That’s actually a realistic condition for RAG evaluation: real knowledge bases contain drafts, working documents, and evolving content alongside polished material. You could argue that’s what most of the Internet is!

Three Metrics, One Pipeline

Before we look at code, it’s worth being clear about what each new metric is measuring and how it relates to what we’ve already seen.

Contextual Recall asks whether the retriever found all the chunks needed to fully answer the question. Contextual Precision, from my previous post, penalizes irrelevant chunks ranked too high. Recall penalizes missing chunks: relevant information that existed in the document but never made it into the retrieval context. A question whose complete answer is distributed across multiple sections of the essay will expose recall failures cleanly.

Contextual Relevancy measures the signal-to-noise ratio of the retrieved set. It doesn’t care about ordering (that’s Precision) or completeness (that’s Recall); instead, it asks what fraction of the retrieved chunks were actually relevant at all. This is where the vocabulary trap pays off as a test design: when “history” and “Creator” appear in six sections with different meanings, a retriever matching on semantic similarity will almost certainly pull topically related but query-irrelevant chunks.

G-Eval is different in kind from the other two. Rather than measuring a RAG-specific property, it lets you define your own evaluation criteria in natural language and uses a judge model to score against them. This makes it well-suited to a philosophical essay, where the interesting failures aren’t about missing chunks but about flattening nuance: presenting a deliberately hedged argument as a settled conclusion, or collapsing a careful distinction between two philosophical positions into an oversimplification.

Think of these three metrics as completing the diagnostic picture. Faithfulness told us whether the output was grounded in the retrieved context. Precision told us whether the right chunks were ranked first. Now:

Recall tells us whether the retriever found everything it needed
Relevancy tells us how much noise was in what it did find
G-Eval tells us whether the response honored the document’s actual argument

Test Design

The code for this post is in my repo for this series, specifically evaluate_coherence_essay.py. Feel free to grab it; I’ll reference it in some of what follows.

Just as with my posts using DeepEval, we’ll run controlled cases first (hand-crafted retrieval contexts with known properties) before running live RAG cases against the actual essay. This gives us a baseline before we introduce the unpredictability of real retrieval.

We need questions whose answer structure matches what each metric is designed to expose. Here’s the reasoning behind each one.

Q1: For Contextual Recall

We need a question whose complete answer is genuinely distributed across multiple chunks. Answering it partially is easy; answering it fully requires the retriever to have found everything. The “Guardrails or Geometry” section of the essay offers exactly this: it enumerates three distinct interpretations of why physical limits exist. A retriever can easily find one or two of them, but finding all three requires the right chunk.

Here is the proposed question: In the essay “Coherence at the Edge”, what three interpretations does the author offer for why physical limits like the speed of light exist?

The expected answer names all three: the structural interpretation (limits arise from mathematical consistency), the evolutionary interpretation (life requires such constraints), and the metaphysical/theological interpretation (creaturely knowledge is bounded by design). The low-recall context replaces the third chunk with a chunk that discusses limits in a different section of the essay; it’s topically related but doesn’t contain the third interpretation.

Q2: For Contextual Relevancy

Here we need a question where the essay’s vocabulary overlap across sections will reliably seduce the retriever into the wrong places. The word “history” and the concept of “Creator” appear in at least six sections of the essay. But the specific argument that history might be constitutive of a Creator — not merely something a Creator interacts with — lives only in the “History as the Medium of Divinity” section.

Here is the proposed question: What does the essay “Coherence at the Edge” mean when it says history might be ‘constitutive’ of a Creator rather than merely something a Creator interacts with?

The low-relevancy context for this question contains chunks that discuss history and Creator from other sections — “The God at the End of Time,” “The Universe as Tapestry” — all of which are topically related but don’t contain the constitutive argument. The vocabulary trap is working exactly as designed.

Q3: For G-Eval

G-Eval needs a question where the interesting failure mode isn’t a missing chunk but a flattened argument. The essay explicitly positions itself between pantheism (the universe is God) and panentheism (the universe is in God but God exceeds it), rather than committing to either. A model summarizing too quickly will almost certainly collapse this into one position or the other.

Here is the proposed test question for this: Does the essay “Coherence at the Edge” argue that the universe is God, or does it take a more nuanced position? What exactly is that position?

We evaluate this response against three criteria we define ourselves: epistemic humility (does the response respect the essay’s speculative hedges), argumentative fidelity (does it represent the essay’s actual position without collapsing its nuance), and conceptual precision (does it use the essay’s key terms correctly). These are criteria a RAG-specific metric can’t assess, which is precisely G-Eval’s value.

Loading the Essay

Since our source is an HTML page rather than a PDF, I’m going to use LangChain’s WebBaseLoader instead of PyPDFLoader. The loading call needs one small but important configuration: I’m going to tell BeautifulSoup to preserve the h2 section headers alongside paragraph text.

from bs4.filter import SoupStrainer
from langchain_community.document_loaders import WebBaseLoader

ESSAY_URL = "https://testerstories.com/files/ai_testing/coherence-at-the-edge.html"

loader = WebBaseLoader(
  web_paths=[ESSAY_URL],
  bs_kwargs={
    "parse_only": SoupStrainer(["h2", "p", "li"])
  },
)
documents = loader.load()

The SoupStrainer call is doing meaningful work here. Without it, BeautifulSoup strips everything to plain text and the section headers (“Guardrails or Geometry,” “History as the Medium of Divinity,” and so on) disappear into the extracted content as unmarked text. By explicitly preserving h2 tags, each chunk that falls within a section will tend to carry that section’s header text, giving the embeddings a stronger signal for distinguishing chunks that share vocabulary across sections.

This is worth noting for testing your own RAG pipelines. If your source documents have meaningful structural markers — headers, labels, section titles — preserving them in the extracted text often improves retrieval quality more than tuning chunk size does.
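To make the idea of preserving structural markers concrete, here is a minimal standard-library sketch. It is not what the script does (the script relies on WebBaseLoader and SoupStrainer as shown above); it goes one step further and explicitly prefixes each paragraph with its section title. The class name, the helper names, and the sample markup are all my own inventions for illustration; only the quoted wording is borrowed from the essay.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the wording is borrowed from the essay,
# but this markup is invented purely for illustration.
SAMPLE_HTML = """
<div class="banner">DRAFT</div>
<h2>Guardrails or Geometry</h2>
<p>One interpretation is purely <em>structural</em>.</p>
<h2>History as the Medium of Divinity</h2>
<p>So I'm wondering whether history is similarly constitutive for a Creator.</p>
"""


class SectionTextExtractor(HTMLParser):
    """Collect (tag, text) pairs for h2, p, and li elements only."""

    KEEP = {"h2", "p", "li"}

    def __init__(self):
        super().__init__()
        self._current = None  # tag currently being collected, if any
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text outside a kept element (such as the banner div) is ignored.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append((tag, text))
            self._current = None


def extract_blocks(html):
    parser = SectionTextExtractor()
    parser.feed(html)
    return parser.blocks


def label_with_sections(blocks):
    """Prefix every non-header block with the most recent h2 text."""
    section = None
    labeled = []
    for tag, text in blocks:
        if tag == "h2":
            section = text
        else:
            labeled.append(f"[{section}] {text}" if section else text)
    return labeled


blocks = extract_blocks(SAMPLE_HTML)
labeled = label_with_sections(blocks)
for line in labeled:
    print(line)
```

Whether explicit prefixes like these help more than simply keeping the headers in the text stream is something you would have to measure on your own documents.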

We also use a smaller chunk size than the warp drive example:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
  chunk_size=600,
  chunk_overlap=100,
)
chunks = text_splitter.split_documents(documents)

The warp drive script used 1000 characters. Here we use 600. The essay’s paragraphs are short and the argument is tightly interwoven. At 1000 characters, chunks routinely span two sections and blur the very boundaries we’re trying to use as test levers. Smaller chunks preserve section identity better, which makes the vocabulary trap more instructive rather than less.

The Full Script

The complete script follows the same structure as my previous DeepEval posts: controlled cases first, then live RAG cases, and then, new in this post, the G-Eval section.

The script itself is heavily commented so I won’t detail all of that here. One thing I will point out is that you’ll notice the G-Eval metric takes a criteria parameter written in plain English, which I indicated above. This is what separates it from the other metrics: instead of a fixed formula for what “good” means, you define it yourself.

The judge model reads your criteria and scores the response against them. This makes G-Eval powerful for any domain where quality is more about reasoning and nuance than about factual grounding: philosophical writing, legal analysis, clinical summaries, anything where “did it get the facts right” is less important than “did it represent the argument faithfully.”

G-Eval only judges the test case fields you hand it, and in this script the retrieval context isn’t one of them. It evaluates the generated response against your defined criteria directly, using input, actual_output, and expected_output. Retrieval still happens to produce the response; it just isn’t part of what G-Eval judges. This is intentional: you’re separating the question of retrieval quality from the question of response quality.

The controlled cases in the script establish our expected baselines before live retrieval introduces real-world uncertainty. For Recall, a high-recall context contains all three interpretations in order; a low-recall context replaces the third interpretation with a thematically-related chunk from another section. For Relevancy, a high-relevancy context comes directly from the “History as the Medium of Divinity” section; a low-relevancy context contains chunks about history and Creator from other sections, which is the vocabulary trap in action.

Once you have the script from the repository, you can run it directly (making sure as in previous posts to be in your virtual environment):

python evaluate_coherence_essay.py

Let’s look at what comes back. I’m not going to show the full output here, but I’ll give you a flavor and explain what I’m seeing. Let’s start with PART 1: CONTROLLED CASES.

High Recall Case (Q1)

The high recall case is our baseline: a hand-crafted context where we’ve placed exactly the right chunks in the retrieval set, one for each of the three interpretations the question asks about. Here’s what the metric returned:


Verdicts:
[
    {
        "verdict": "yes",
        "reason": "First node: 'One interpretation is purely structural:
        these limits are simply consequences of deep mathematical
        consistency.'"
    },
    {
        "verdict": "yes",
        "reason": "Second node: 'Another interpretation is evolutionary:
        perhaps life can only arise in universes with such constraints.'"
    },
    {
        "verdict": "yes",
        "reason": "Third node: 'A third interpretation is more metaphysical
        or theological: that reality has built-in finitude.'"
    }
]

Score: 1.0
Reason: The score is 1.00 because all sentences in the expected output
are clearly and directly supported by the nodes in the retrieval context,
with no unsupportive reasons present.

A perfect score of 1.0, and the reasoning tells us exactly why: every sentence in the expected output found support in the retrieval context. The metric works by checking each sentence of the expected output against the retrieved chunks and asking whether the chunks together are sufficient to produce that sentence. All three interpretations were present, so all three sentences were supported.

This is worth pausing on because Contextual Recall asks a different question from Contextual Precision, even though both lean on the expected output, meaning what the model should have generated, rather than on what it actually generated. Precision asks whether the chunks relevant to that expected answer are ranked above the irrelevant ones. Recall asks: “Given what a correct answer would look like, does the retrieved context contain everything needed to produce it?”

This distinction matters practically. A model might generate a confident-sounding response that only addresses two of the three interpretations, and Precision might not flag anything wrong if those two chunks were well-ranked. Recall catches the gap at the source: the third chunk simply wasn’t there to support the third interpretation, regardless of how well the model handled what it did receive.

The 1.0 score here is our control. It tells us the question is well-formed, the expected output is grounded in identifiable chunks, and the metric is working correctly. Now we can deliberately break it.

Low Recall Case (Q1)

The low recall case replaced the third interpretation chunk with a chunk that discusses limits in a different section of the essay: topically related, but not containing the theological/metaphysical interpretation the question requires. Here’s what came back:


Verdicts:
[
    {
        "verdict": "no",
        "reason": "The expected output does not match any part of the
        retrieval context."
    },
    {
        "verdict": "yes",
        "reason": "1st node: 'One interpretation is purely structural:
        these limits are simply consequences of deep mathematical
        consistency.'"
    },
    {
        "verdict": "yes",
        "reason": "2nd node: 'Another interpretation is evolutionary:
        perhaps life can only arise in universes with such constraints.'"
    },
    {
        "verdict": "no",
        "reason": "The expected output does not match any part of the
        retrieval context."
    }
]

Score: 0.5
Reason: The score is 0.50 because the expected output contains
interpretations that partially align with the node(s) in retrieval
context, but there are significant gaps where no matching content
was found.

The score drops to 0.5, and the verdict breakdown tells the story precisely. Two of the four sentences in the expected output found support in the context; two did not. The two “no” verdicts correspond to the opening framing sentence (“The essay offers three interpretations…”) and the third interpretation itself, both of which require the missing theological chunk to be supportable.
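The arithmetic behind both controlled scores is simple enough to sketch. The snippet below is a minimal illustration, assuming the score is just the fraction of expected-output sentences that drew a “yes” verdict, which is consistent with the 1.0 and 0.5 results shown so far. The function name recall_score is mine, not part of DeepEval, and the verdict lists are transcribed from the output above.

```python
def recall_score(verdicts):
    """Fraction of expected-output sentences judged supportable ("yes")."""
    if not verdicts:
        return 0.0
    supported = sum(1 for verdict in verdicts if verdict == "yes")
    return supported / len(verdicts)


# Verdicts transcribed from the two controlled cases.
high_recall_verdicts = ["yes", "yes", "yes"]
low_recall_verdicts = ["no", "yes", "yes", "no"]

print(recall_score(high_recall_verdicts))  # 1.0
print(recall_score(low_recall_verdicts))   # 0.5
```

Note that the denominator is the number of sentences the judge extracted from the expected output, not the number of chunks retrieved.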

Notice something interesting here compared to the Contextual Precision output from my previous post. In that post, a score of 0.33 meant the relevant chunk was present but buried at position three. Here, a score of 0.5 means two of three interpretations are present but the third simply doesn’t exist in the retrieval context at all. These are different failure modes with a similar surface appearance (both produce poor scores) but the diagnostic implications are different.

A Precision failure says: the right information was retrieved, just ranked poorly. Fix the ranking. A Recall failure says: the right information was never retrieved in the first place. Fix the retrieval coverage, perhaps by increasing k, adjusting chunk boundaries, or reconsidering your embedding strategy.
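To see why a Precision failure responds to reranking while a Recall failure does not, here is a small sketch of a rank-weighted precision calculation. It follows the weighted cumulative precision formula as I understand it from DeepEval's documentation, so treat it as an approximation rather than the library's actual code. The function name and the boolean relevance lists are my own.

```python
def contextual_precision(relevance):
    """Rank-weighted precision over an ordered list of chunk verdicts.

    relevance[i] is True when the chunk at rank i + 1 was judged relevant.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    relevant_so_far = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            score += relevant_so_far / rank  # precision at this rank
    return score / total_relevant


buried = contextual_precision([False, False, True])        # relevant chunk at rank 3
ranked_first = contextual_precision([True, False, False])  # same chunk, reranked
never_retrieved = contextual_precision([False, False, False])

print(round(buried, 2))   # 0.33
print(ranked_first)       # 1.0
print(never_retrieved)    # 0.0
```

Reordering moves the first case to the second. No amount of reordering rescues the third, which is the situation a Recall failure describes.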

A score of 0.5 here also surfaces something worth noting for your own test design. The metric evaluated four sentences from the expected output, not three interpretations as we might intuitively expect. The opening framing sentence (“The essay offers three interpretations”) was treated as a claim requiring support just like the interpretations themselves. Since no single chunk explicitly states that there are exactly three interpretations, that sentence drew a “no” verdict. This is the metric being appropriately strict: it checks whether the retrieval context could ground every sentence of a correct answer, including structural framing sentences, not just the content claims.

This is a useful reminder when writing expected outputs for Recall tests. Framing sentences that summarize or enumerate (“the essay offers three interpretations,” “the paper proposes two strategies”) need to be supportable by the retrieved chunks just as much as the substantive claims do. If your expected output contains summary framing that no single chunk explicitly states, those sentences can draw “no” verdicts even when every substantive chunk was retrieved.

High Relevancy Case (Q2)

This is where the evaluation produces something unexpected, and worth examining carefully before moving on.


Score: 0.125
Reason: The score is 0.12 because most of the statements in the retrieval
context are irrelevant to the question about history being constitutive
of a Creator. The only relevant statement is 'So I'm wondering whether
history is similarly constitutive for a Creator. Not just something the
Creator interacts with, but something without which "Creator" becomes
unintelligible.'

We have a score of 0.125 on the high relevancy case. But, hold on. We hand-crafted this context from the exact section of the essay that contains the answer. How did this happen?

Well, the verdict breakdown tells the story. Contextual Relevancy doesn’t evaluate chunks as whole units. Instead, it decomposes each chunk into individual statements and evaluates each statement independently. Look at what it did with our three chunks:

From chunk 1 (the human/physics analogy):

✘ “A human cannot exist apart from physics because human embodiment is constituted by physical processes” — marked irrelevant
✘ “Physics is not an external container; it is part of what makes a human be a human” — marked irrelevant
✔ “So I’m wondering whether history is similarly constitutive for a Creator” — marked relevant

From chunk 2 (the silence/sound analogy):

✘ “Is timelessness parasitic on time the way silence is parasitic on sound?” — marked irrelevant
✘ “If there were never any sound at all, would ‘silence’ even be intelligible?” — marked irrelevant
✘ “Atemporality only makes sense in contrast to temporality” — marked irrelevant
✘ “If there were no change, no becoming anywhere, the concept of ‘outside time’ might dissolve into meaninglessness” — marked irrelevant

From chunk 3 (the ontological inversion):

✘ “My model inverts the classical view. It suggests that history might be ontologically basic” — marked irrelevant

Only one statement out of eight was considered relevant. Hence 0.125.
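The difference between statement-level and chunk-level scoring is easy to see in a few lines. This sketch uses the verdicts listed above; the chunk-level figure is a hypothetical alternative I am computing only for contrast, not something the metric reports.

```python
# Statement verdicts transcribed from the breakdown above
# (True = marked relevant, False = marked irrelevant).
statement_verdicts = {
    "chunk 1 (human/physics analogy)": [False, False, True],
    "chunk 2 (silence/sound analogy)": [False, False, False, False],
    "chunk 3 (ontological inversion)": [False],
}


def statement_level_score(verdicts_by_chunk):
    """Relevant statements divided by all statements, across every chunk."""
    flat = [v for verdicts in verdicts_by_chunk.values() for v in verdicts]
    return sum(flat) / len(flat) if flat else 0.0


def chunk_level_score(verdicts_by_chunk):
    """Hypothetical contrast: a chunk counts if any statement in it is relevant."""
    if not verdicts_by_chunk:
        return 0.0
    hits = sum(1 for verdicts in verdicts_by_chunk.values() if any(verdicts))
    return hits / len(verdicts_by_chunk)


print(statement_level_score(statement_verdicts))        # 0.125
print(round(chunk_level_score(statement_verdicts), 2))  # 0.33
```

The same retrieval set looks nearly three times better if you count chunks instead of statements, which is why it matters to know which unit the metric is scoring.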

The judge model is reading the question very narrowly: “what does ‘constitutive’ mean in this context?” It sees that one sentence directly uses that word in relation to a Creator, and judges everything else as context-setting rather than answer-bearing. From a strictly literal standpoint, it isn’t wrong. The human/physics analogy, the silence/sound analogy, and the ontological inversion are all argumentative scaffolding for the constitutive claim in that they explain and justify it, but they don’t directly state it.

This exposes something important about how Contextual Relevancy works that differs from how we might intuitively think about relevance. The metric is asking “does this statement answer the question?” not “does this statement help the model answer the question?” Analogical reasoning and argumentative setup are highly relevant to understanding an answer, but the metric scores them as noise because they don’t contain a direct assertion about the query topic.

This is a meaningful distinction for philosophical or argumentative documents. A technical paper can often answer a question with a single sentence containing the key fact. A philosophical essay characteristically answers questions through analogy, negation, and staged argument: the setup is doing real work. The issue is that Contextual Relevancy, as currently defined, treats that setup as irrelevant.

This doesn’t make the metric wrong; it makes it a precise tool with a specific definition of relevance that may not match your intuitions about a particular document type.

This result is actually useful information about your RAG pipeline. If your retriever is consistently pulling chunks whose statements score as irrelevant, it suggests the chunks are providing context the model uses but the metric can’t credit. That’s worth knowing! It means your faithfulness scores and your relevancy scores may diverge in ways that require interpretation rather than simple optimization.

Now let’s see what the low relevancy case produces, where the vocabulary trap is working as designed.

Low Relevancy Case (Q2)

The low relevancy context replaced the direct-answer chunks with chunks drawn from other sections of the essay that discuss history and Creator and, again, this is the vocabulary trap we designed deliberately. Here’s what came back:


Score: 0.0
Reason: The score is 0.00 because none of the provided statements from
the retrieval context are relevant to the question about what 'history
might be constitutive of a Creator' as asked in the input.

A clean zero. Every statement across all three chunks was marked irrelevant:

✘ “The ‘God at the End of Time’ idea preserves teleology without requiring temporal priority”
✘ “If the Creator guarantees the outcome by guiding the evolutionary arc, then the loop is no longer precarious”
✘ “The Creator is the emergent totality; perhaps the final integrated consciousness of the whole”
✘ “Evolution becomes the mechanism by which divinity realizes itself”

Every one of these sentences concerns the Creator or divinity. Two of them are directly about the relationship between Creator and history. And yet all four were marked irrelevant. I’ll say it again: this is the vocabulary trap working exactly as designed.

The contrast with the high relevancy case is revealing. There, we had chunks from the right section and still scored only 0.125 because most of the argumentative scaffolding didn’t register as directly relevant. Here, we have chunks that are thematically adjacent — they come from sections that genuinely discuss related ideas — and they score 0.0. The metric is correctly distinguishing between “discusses similar concepts” and “addresses this question.”

Put side by side, the two cases make a clean diagnostic pair:

High relevancy context (correct section, argumentative prose): 0.125
Low relevancy context (wrong section, vocabulary overlap): 0.0

The gap between them is meaningful even if neither score is high. A score differential of 0.125 across a vocabulary-trap document is actually a reasonable signal. It tells you the retriever found something closer to the answer even if it couldn’t find a chunk with a single direct assertion. In a production system, a relevancy score near zero is a stronger alarm than one near 0.1, even if both feel low in absolute terms.

This also reframes the high relevancy result from earlier. A score of 0.125 on our hand-crafted “best possible” context now looks less like a failure and more like a property of the document type. Philosophical prose built on analogy and staged argument will tend to produce low Contextual Relevancy scores even when the retrieval is correct. This is because the metric’s definition of relevance favors direct assertion over argumentative development. The 0.0 on the vocabulary-trap chunks, by contrast, is a genuine failure signal that the right section was never found.

This pair of results suggests a practical guideline: for argumentative or philosophical documents, treat Contextual Relevancy scores as relative signals rather than absolute ones. The question to ask isn’t “did we score above some threshold?” but “did the live RAG case score meaningfully higher than a known-bad context?” If your live retrieval scores near 0.0, the vocabulary trap is likely winning. If it scores near 0.125 or above, the retriever probably found the right neighborhood even if not the single best sentence.
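One way to apply that guideline is to compare a live score against the two controlled baselines instead of a fixed threshold. The helper below is a rule-of-thumb sketch of my own, not a DeepEval feature: it simply reports which baseline a live score sits closer to. The baseline values are the two controlled results from this post; the live scores passed in are made-up inputs for illustration.

```python
KNOWN_BAD = 0.0     # low relevancy controlled case
KNOWN_GOOD = 0.125  # high relevancy controlled case


def closer_baseline(live_score, known_bad=KNOWN_BAD, known_good=KNOWN_GOOD):
    """Return which controlled baseline the live score is nearer to."""
    distance_to_good = abs(live_score - known_good)
    distance_to_bad = abs(live_score - known_bad)
    return "known-good" if distance_to_good <= distance_to_bad else "known-bad"


# Made-up live scores, purely to exercise the helper.
print(closer_baseline(0.02))  # known-bad
print(closer_baseline(0.10))  # known-good
print(closer_baseline(0.50))  # known-good
```

A nearest-baseline rule is crude, but it keeps the interpretation anchored to results you have actually observed for this document type.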

With that calibration in mind, the live RAG cases become particularly interesting. Let’s see how the actual retriever performs against both questions before moving to G-Eval.

Setting Up the Live Retrieval

Before the live RAG cases run, the script loads the essay, chunks it, and reports what it’s working with:


Loading essay from: https://testerstories.com/files/ai_testing/coherence-at-the-edge.html
Loaded 1 document(s).
Total characters: 65,808
Split into 132 chunks

A few things are worth noting here. The essay comes in as a single document: WebBaseLoader fetches the page and the SoupStrainer extracts the h2, p, and li content into one continuous text block. Those 65,808 characters represent the essay’s prose after stripping the HTML structure, the draft banner, the byline, and any other markup that didn’t match our tag filter.

The 132 chunks from a 600-character split with 100-character overlap is a notably higher chunk count than the warp drive script produced in my earlier posts. That’s partly a consequence of the smaller chunk size, but also a reflection of how differently the two documents are structured. The warp drive paper has long technical paragraphs with dense notation, so its chunks fill up quickly. The essay has short, punchy paragraphs, many only two or three sentences long. At 600 characters, a substantial number of those paragraphs become their own chunks, which is actually what we want: it keeps section-level content from bleeding across boundaries.

132 chunks from a document of this length also means the retriever is working with reasonable granularity. With k=3, we’re asking it to find the three most relevant chunks out of 132, which is roughly the top 2% of the document. Whether it finds the right 2% is exactly what the next two cases will tell us.
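As a sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes a naive fixed-size sliding window, which is not how RecursiveCharacterTextSplitter works (that splitter breaks on separators such as paragraph boundaries), so the estimate is only a rough guide. The function name is mine.

```python
import math


def sliding_window_chunks(total_chars, chunk_size, overlap):
    """Estimate the chunk count for a naive fixed-size window with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap
    return 1 + math.ceil((total_chars - chunk_size) / step)


estimate = sliding_window_chunks(65_808, chunk_size=600, overlap=100)
top_k_share = 3 / 132  # k=3 out of the 132 chunks the script reported

print(estimate)               # 132
print(f"{top_k_share:.1%}")   # 2.3%
```

The estimate happens to land on the reported count for this essay, but that agreement shouldn’t be expected in general; a separator-aware splitter can produce more or fewer chunks than a fixed window would.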

Contextual Recall — Live RAG Case (Q1)

This is where the live retrieval pipeline takes over. No hand-crafted contexts, just the retriever working against 132 chunks of the actual essay. Here’s what it found:


--- Chunk 1 ---
and sharply restrictive in others. Take faster-than-light travel. In
special relativity, the light-speed limit isn't just about propulsion
difficulty. It's woven into spacetime geometry itself. If you could
transmit information faster than light, you could create causal
paradoxes, effectively enabling...

--- Chunk 2 ---
signs. They may simply reflect that spacetime and causality have a
specific architecture. A two-dimensional creature cannot step "above"
its plane, not because of prohibition but because of structure. Here's
a deeper question hiding in what I'm talking about here: Are these
limits frustrating because...

--- Chunk 3 ---
the model leans toward plenitude (everything possible exists) or
purpose (some possibilities are preferred). And that, interestingly,
brings me back to the question of limits: not just physical limits,
but modal limits. If this kind of thing is possible for the Creator...

All three chunks discuss limits. None of them contain the three interpretations. The retriever found the right general neighborhood — “Guardrails or Geometry” and surrounding sections — but landed on the examples and questions surrounding the enumeration rather than the enumeration itself. The three-item list (“one interpretation is purely structural… another interpretation is evolutionary… a third interpretation is more metaphysical”) was split across chunks that ranked below these three.

The score confirms the damage:


Score: 0.0
Reason: The score is 0.00 because none of the sentences in the expected
output relate to or reference any of the specific examples or concepts
(such as faster-than-light travel, wormholes, or colliders) from the
retrieval context.

Every verdict is “no,” and the reasoning is identical across all five: the expected output doesn’t reference faster-than-light travel, wormholes, or colliders. This is the metric making a correct observation in an unexpected direction. It isn’t just saying the retrieved chunks lacked the three interpretations; it’s saying the expected output and the retrieved chunks are talking about different parts of the same argument.

The chunks discuss the illustrations of limits; the expected output describes the taxonomy of interpretations. They’re from the same section of the essay, separated by only a few paragraphs, but the retriever found one and missed the other entirely.

The generated response is instructive here too. The model produced something that looks superficially reasonable:


Interpretation 1: Protecting Causal Structure
Interpretation 2: Spacetime Architecture
Interpretation 3: A Question of Design/Modal Limits

But look at where these came from. The model synthesized three interpretations from the chunks it received: chunks about faster-than-light travel, the two-dimensional creature analogy, and modal limits. It reverse-engineered a three-part answer from illustrative material, producing something that rhymes with the correct answer without actually being grounded in the passage where the essay explicitly enumerates its three interpretations.

This is, in fact, the exact same failure mode we saw with the warp drive paper: the model filling retrieval gaps with inference and synthesis rather than source material.

It’s worth calling out that this is a compound failure. The retriever missed the right chunk. The model then compensated by constructing a plausible answer from what it did receive. The result looks reasonable on the surface but scores 0.0 on Recall because the expected output, grounded in the actual enumeration, finds no support in the retrieved context.

This is one of the subtler failure modes in RAG systems, and one of the harder ones to catch without evaluation tooling. A model that generates a confident, well-structured three-part answer from the wrong chunks is more dangerous than a model that says “I don’t know,” because the confident wrong answer passes casual inspection. Contextual Recall catches it precisely because it checks whether the expected output is supportable, not whether the actual output looks plausible.

The root cause here is the same as what we saw in the warp drive Contextual Precision case: the retriever matched on topic rather than on the specific argumentative move the question was targeting. “Limits” as a concept saturates the essay. The retriever found vivid, concrete chunks about limits — faster-than-light travel, the two-dimensional creature — and ranked them above the more abstract, enumeration-style chunk that actually answers the question.

Fixing this would require either a smaller chunk size to isolate the enumeration from its surrounding illustrations, or a retrieval strategy that rewards structural signals like numbered lists over narrative prose.
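The second of those options can be sketched in a few lines. To be clear, this is not something the script in this post does. The function names, the enumeration regex, and the `bonus` weight are all hypothetical, and the sketch assumes the retriever hands back (chunk text, similarity score) pairs.

```python
import re

# Hypothetical markers of enumeration-style prose: "1.", "(2)", "First,", etc.
# This is a crude heuristic; a year at the end of a sentence ("2012.") would
# also match, so treat it as a starting point rather than a finished rule.
ENUMERATION_PATTERN = re.compile(
    r"(?:^|\s)(?:\(?\d+[.)]|first|second|third)(?=[\s,:])",
    re.IGNORECASE,
)

def structure_bonus(chunk, bonus=0.05, cap=3):
    """Return a small score bonus per enumeration marker, up to `cap` markers."""
    hits = len(ENUMERATION_PATTERN.findall(chunk))
    return bonus * min(hits, cap)

def rerank_with_structure_bonus(scored_chunks, bonus=0.05):
    """Re-sort (chunk, similarity) pairs by similarity plus the structural bonus."""
    return sorted(
        scored_chunks,
        key=lambda pair: pair[1] + structure_bonus(pair[0], bonus),
        reverse=True,
    )

# Invented example chunks and similarity scores, for illustration only.
scored = [
    ("Imagine a creature trying to travel faster than light.", 0.82),
    ("First, limits protect causality. Second, they reflect architecture. "
     "Third, they mark modal limits.", 0.78),
]
reranked = rerank_with_structure_bonus(scored)
```

With these made-up numbers, the enumeration-style chunk moves ahead of the narrative one even though its raw similarity is lower. Whether a bonus like this helps or hurts on a real document is something the metrics in this post would have to tell you.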

Contextual Relevancy — Live RAG Case (Q3)

The live retriever pulled three chunks for the constitutive history question. Before we look at the scores, it’s worth reading what it actually found:


--- Chunk 1 ---
is not an external container; it is part of what makes a human be a
human. So I'm wondering whether history is similarly constitutive for
a Creator. Not just something the Creator interacts with, but something
without which "Creator" becomes unintelligible. Classical theism answers
my question differently...

--- Chunk 2 ---
of a Creator constituted by history itself. The Higgs and the Quiet
After. It's interesting to consider the Large Hadron Collider. Many
people are aware that it found something called the Higgs Boson back
in 2012...

--- Chunk 3 ---
of divinization. But here's a question that may sharpen the issue. If
we imagine an atemporal Creator entirely apart from history, we preserve
transcendence but risk relational distance. If we imagine a Creator
constituted by history, we preserve intimacy but risk contingency...

Chunk 1 is from the right section. Chunk 3 is also from a relevant part of the essay. Chunk 2, however, is a striking retrieval artifact: it appears to straddle a chunk boundary that falls between the introduction’s closing sentence (“the possibility of a Creator constituted by history itself”) and the opening of the first section, “The Higgs and the Quiet After.” The retriever found a chunk whose first sentence is directly relevant and whose remaining sentences are about the Large Hadron Collider.

This is a chunking boundary problem made visible. The introduction ends with a sentence containing “Creator constituted by history itself,” which has high embedding similarity to the query. But the chunk that sentence belongs to also contains the first several sentences of the opening section on particle physics, which is content that has nothing to do with the question. The retriever selected this chunk because of one highly relevant sentence, then dragged four irrelevant ones along with it.

The score reflects this mixed bag:


Score: 0.4666...
Reason: The score is 0.47 because the majority of statements provided
are irrelevant to the input question. However, several relevant
statements directly address the core issue.

The statement-level breakdown, which I won’t reproduce here, told the story precisely. From chunk 1: four irrelevant statements, one relevant. From chunk 2: one relevant statement (“of a Creator constituted by history itself”), then four statements about the Higgs boson marked irrelevant. From chunk 3: five statements all marked relevant: the atemporal Creator trade-off, the contingency problem, the question of plural histories.

Counting across all three chunks: 7 relevant statements out of 15 total, which tracks with the 0.47 score.
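The arithmetic is simple enough to check by hand, but here it is spelled out. This assumes the score is just the ratio of relevant statements to total statements, which is what the counts above imply. The dictionary layout is mine, not DeepEval's.

```python
from fractions import Fraction

# Per-chunk tallies as described above: (relevant, irrelevant) statement counts.
chunk_verdicts = {
    "chunk 1": (1, 4),
    "chunk 2": (1, 4),
    "chunk 3": (5, 0),
}

relevant = sum(r for r, _ in chunk_verdicts.values())
total = sum(r + i for r, i in chunk_verdicts.values())
score = Fraction(relevant, total)

print(f"{relevant}/{total} = {float(score):.4f}")  # prints: 7/15 = 0.4667
```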

Now compare this to our controlled cases:

High relevancy (hand-crafted correct chunks): 0.125
Low relevancy (vocabulary trap chunks): 0.0
Live RAG (actual retrieval): 0.47

The live retrieval outscored our hand-crafted high relevancy context. This is initially surprising, but it makes sense on reflection. The live retriever found chunk 3 — a passage with five directly relevant statements about the Creator/history trade-off — which our controlled high relevancy context didn’t include. Our hand-crafted context was built around the constitutive argument itself; the live retriever additionally found the passage that develops the implications of that argument, which the metric credits as relevant.

This outcome carries an important lesson for how we interpret controlled versus live results. The controlled cases establish what the metric does under known conditions. The live cases show what actually happens, and sometimes the retriever finds things we didn’t anticipate, for better or worse. Chunk 2’s boundary artifact is a genuine problem; chunk 3’s retrieval is a genuine win. The 0.47 score bundles both together.

The Higgs boson sentences appearing in chunk 2 are a direct consequence of a chunk that straddles the transition from the introduction into the first section. This is the kind of artifact that’s invisible until you print the retrieved chunks and read them carefully, which is why the script prints them before running the metric. In a production pipeline, regularly inspecting retrieved chunks alongside metric scores is more diagnostic than relying on scores alone.
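If you want a similar habit in your own pipeline, a small formatting helper is enough. This is a minimal sketch, not the helper from the script: `format_chunks` and its parameters are hypothetical, and it assumes the chunks arrive as plain strings (with LangChain documents you would likely pass each document's `page_content`).

```python
import textwrap

def format_chunks(chunks, width=72, preview_chars=300):
    """Render retrieved chunks in a '--- Chunk N ---' layout for reading."""
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        text = " ".join(chunk.split())  # collapse internal whitespace
        if len(text) > preview_chars:
            # Truncate long chunks so the printout stays readable.
            text = text[:preview_chars].rstrip() + "..."
        wrapped = textwrap.fill(text, width=width)
        blocks.append(f"--- Chunk {number} ---\n{wrapped}")
    return "\n\n".join(blocks)
```

Calling `print(format_chunks(retrieved))` before invoking the metric puts the raw material and the score side by side in the same log, which is where a boundary artifact like the Higgs sentences becomes obvious.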

The generated response is worth noting here too. The model produced a genuinely thoughtful answer, correctly identifying the constitutive argument, the contrast with classical theism, and the transcendence/contingency trade-off. It had enough relevant material in chunks 1 and 3 to reason well, and it largely ignored the Higgs boson content in chunk 2. This is an example of a model compensating gracefully for a noisy retrieval context, which is, I should note, the opposite of what we saw in the Recall case, where the model compensated poorly by synthesizing from illustrative rather than argumentative content.

G-Eval — Live RAG Case (Q5)

Here, the retriever pulled three chunks for the pantheism/panentheism question, and for once the retrieval was well-matched to the query:


--- Chunk 1 ---
The "Creator at the end" isn't intervening from the future; rather,
the universe's entire timeline is a coherent solution to a larger
equation whose final state is maximal awareness. That starts to sound
less like classical theism and more like a kind of teleological
monism; where the uni...

--- Chunk 2 ---
In pantheism, God and the universe are identical. In panentheism, the
universe is in God but God is more than the universe. My idea hovers
somewhere in between. The Creator is not prior to the universe in
time, but neither is the Creator reducible to any single stage within
it. The Creator is the em...

--- Chunk 3 ---
"the risk of the latter is that it can start to feel circular: the
universe produces the Creator who explains the universe." Yes, but we
also assume the universe had a beginning...

Chunk 2 is exactly the passage where the essay explicitly names both pantheism and panentheism and then says its own position “hovers somewhere in between.” This is the most important chunk for the question, and the retriever found it. The vocabulary in the query (“universe,” “God,” “nuanced position”) mapped cleanly onto this passage in a way that the more abstractly-phrased questions about interpretations and constitutive history did not. Concrete philosophical terminology tends to retrieve better than structural or argumentative terminology, and this result is consistent with that pattern.

The generated response is the most sophisticated of the set:


The essay takes a nuanced position, arguing that the universe is the
Creator — specifically, the emergent totality and final integrated
consciousness of the whole. It's not a simple pantheistic
identification, but a teleological monism where the universe embodies
the unfolding of divinity through evolutionary processes.

The model correctly identified that the essay doesn’t simply say “the universe is God.” It named the pantheism/panentheism distinction, located the essay’s position between them, used “teleological monism” correctly, and noted the circular reasoning problem the essay itself acknowledges.

That’s a genuinely faithful summary of a genuinely subtle argument!

The G-Eval score reflects this:


Score: 0.8

Reason: The response accurately represents the nuanced position of the
essay, avoiding oversimplification by distinguishing between pantheism
and panentheism. It correctly uses key terms like 'emergent totality'
and 'teleological monism.' However, it could have more explicitly
acknowledged the speculative nature of the argument as suggested in
the expected output.

The deduction came from Criterion 1 — Epistemic Humility. The judge model noted that the response presents the essay’s position with more confidence than the essay itself does. The essay uses “perhaps,” “might,” and “one could frame it as” throughout this argument. The model’s summary, while accurate about what the position is, states it as if the essay is asserting it rather than exploring it. “The essay takes a nuanced position, arguing that the universe is the Creator” reads more settled than the essay’s own hedged, exploratory register.

This is exactly the kind of failure mode G-Eval was designed to catch and that no RAG-specific metric could have flagged. The retrieval was good. The response was factually accurate. The argumentative structure was correctly represented. The only problem was a subtle shift in epistemic register — from speculative to assertive — that changed the character of the essay’s argument without getting any individual claim wrong.

That distinction matters for a document like this one. The essay is explicit that it is not arguing “therefore design” or committing to a theological position. A summary that presents its speculations as conclusions misrepresents the intellectual posture of the writing even while accurately describing its content. The 0.2 deduction is the metric putting a number on that misrepresentation.
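A cheap way to get a first read on register, before paying for an LLM judge, is to count hedging markers. This is no substitute for G-Eval, and it is not part of the script. It is a crude sketch built on the hedges mentioned above (“perhaps,” “might,” “one could frame it as”), with two extra markers that are my own assumption. The function name is hypothetical, and the hedged sample sentence is invented for illustration, not quoted from the essay.

```python
import re

# The first three markers come from the discussion above; "may" and "seems"
# are assumed additions. Tune the list for your own document.
HEDGE_MARKERS = ["perhaps", "might", "one could frame it as", "may", "seems"]

def hedge_density(text):
    """Hedge markers per 100 words: a rough proxy for epistemic register."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 0.0
    normalized = " ".join(words)
    hits = sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", normalized))
        for marker in HEDGE_MARKERS
    )
    return 100.0 * hits / len(words)

# The summary sentence quoted above, and an invented hedged sentence.
summary = "The essay takes a nuanced position, arguing that the universe is the Creator."
hedged = "Perhaps one could frame it as a kind of teleological monism, though it might not hold."

print(hedge_density(summary), hedge_density(hedged))
```

A summary whose density sits far below that of its source passage is a hint that speculation has been flattened into assertion, which is the same shift the judge model flagged here.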

The 0.8 score on G-Eval alongside the 0.0 on Contextual Recall and 0.47 on Contextual Relevancy illustrates something important about the relationship between these metrics. G-Eval is measuring a different dimension entirely. The retrieval for Q5 happened to be good, which gave the model enough material to reason faithfully. If the retrieval had been poor, as it was for Q1, the G-Eval score would likely have dropped too, but for reasons the other metrics would already have identified. When retrieval is adequate, G-Eval surfaces the next layer of failure: not what was missing from the context, but how well the model honored the argument it found there.

Reading the Summary

Once all seven cases have run, the script prints a consolidated results table:


============================================================
RESULTS SUMMARY
============================================================

Controlled Cases:
  Recall     — High Recall (Q1):      1.00 | The score is 1.00 because all
    sentences in the expected output are clearly and directly supported by
    the nodes in the retrieval context, with no unsupportive reasons present.
  Recall     — Low Recall (Q1):       0.50 | The score is 0.50 because the
    expected output contains interpretations that partially align with the
    node(s) in retrieval context, but there are significant gaps where no
    matching content was found.
  Relevancy  — High Relevancy (Q3):   0.12 | The score is 0.12 because most
    of the statements in the retrieval context are irrelevant to the question
    about history being constitutive of a Creator.
  Relevancy  — Low Relevancy (Q3):    0.00 | The score is 0.00 because none
    of the provided statements from the retrieval context address the specific
    concept of history being constitutive of a Creator.

Live RAG Cases:
  Recall     — Live RAG (Q1):         0.00 | The score is 0.00 because none
    of the sentences in the expected output relate to or reference any of the
    specific examples or concepts from the retrieval context.
  Relevancy  — Live RAG (Q3):         0.47 | The score is 0.47 because the
    majority of statements provided are irrelevant to the concept of history
    constituting a Creator. However, several relevant statements directly
    address the input's query.

G-Eval:
  PhilosophicalEssayFidelity (Q5):    0.80 | The response accurately
    identifies the nuanced position of the essay, avoiding oversimplification
    by distinguishing between pantheism and panentheism. However, it could
    have more explicitly acknowledged the speculative nature of the argument.

This table is worth reading because it lets you see the pattern across every case at once rather than case by case. The controlled cases form the baseline half of the comparison: 1.00 and 0.50 for Recall show the metric working as designed; 0.12 and 0.00 for Relevancy establish our baseline for what the metric considers relevant in this document type.

The live RAG cases then land against those baselines: 0.00 for Recall confirming a complete retrieval miss, 0.47 for Relevancy showing the retriever found the right neighborhood despite a chunking artifact. G-Eval’s 0.80 sits apart from the others, measuring a different dimension entirely.

A word on reproducibility is worth repeating here, even though I’ve covered it in many of my other posts: because both the execution model and the judge model are LLMs, the outputs you see when running this script will not be identical to the outputs shown in this post. Scores may shift slightly between runs, retrieved chunks may vary if the vector store is rebuilt, and the judge model’s reasoning text will differ in phrasing even when the score is similar.

This is normal and expected. What should remain stable across runs is the pattern: high recall contexts scoring higher than low recall contexts, vocabulary-trap contexts scoring near zero on relevancy, and G-Eval catching epistemic register failures that the RAG-specific metrics cannot see. If you find the pattern breaking significantly — for example, a low relevancy context scoring higher than a high relevancy context — that’s a signal worth investigating in your own judge model configuration rather than a sign the script is wrong.
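Because the pattern matters more than the exact numbers, it can be worth asserting the pattern rather than the values. Here is a minimal sketch of that idea. The dictionary keys and the 0.1 tolerance for “near zero” are my own assumptions, not anything DeepEval defines.

```python
def check_expected_pattern(scores):
    """Return human-readable violations of the expected score ordering."""
    problems = []
    if not scores["recall_high"] > scores["recall_low"]:
        problems.append("high-recall context did not outscore low-recall context")
    if not scores["relevancy_high"] > scores["relevancy_low"]:
        problems.append("high-relevancy context did not outscore low-relevancy context")
    if scores["relevancy_low"] > 0.1:  # assumed tolerance for "near zero"
        problems.append("vocabulary-trap context scored well above zero")
    return problems

# Controlled-case scores from the run shown in this post.
run = {
    "recall_high": 1.00,
    "recall_low": 0.50,
    "relevancy_high": 0.125,
    "relevancy_low": 0.0,
}
print(check_expected_pattern(run) or "pattern holds")  # prints: pattern holds
```

A check like this tolerates run-to-run drift in the individual scores while still flagging the inversions worth investigating.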

What Does the Testing Tell Us

Across seven test cases (two controlled and one live for each of Recall and Relevancy, plus one for G-Eval), a consistent picture has emerged. Let’s pull the thread here.

The controlled cases did their job. The high recall case scored 1.0 and the low recall case scored 0.5, confirming that the metric correctly detects when a required chunk is absent from the retrieval context. The high relevancy case scored 0.125 and the low relevancy case scored 0.0, confirming that the metric correctly distinguishes between chunks from the right section and chunks from the wrong section, even when both use the same vocabulary. The gap between 0.125 and 0.0 is meaningful even if neither number looks impressive in isolation.

Test Finding: For philosophical and argumentative documents, Contextual Relevancy scores should be read as relative signals rather than absolute ones. The metric’s definition of relevance favors direct assertion over analogical reasoning and argumentative scaffolding. A score near 0.125 on a correctly-retrieved context is a property of the document type, not a retrieval failure. A score near 0.0 on a live retrieval case is a genuine alarm.

The live RAG cases then revealed how the actual retriever performed against those baselines. For Q1, the retriever scored 0.0 on Recall: a complete miss. It found chunks from the right section of the essay but landed on the illustrative examples surrounding the three-interpretation enumeration rather than the enumeration itself. The model then compensated by synthesizing a plausible-looking three-part answer from illustrative material, producing something that passes casual inspection but is not grounded in the enumeration the expected output draws on. This is the subtler failure mode: not an obviously wrong answer, but a confabulated one that can’t be traced back to the source.

This, by the way, is why “hallucination” in an AI context is not quite the simple matter that many seem to believe it is.

Test Finding: Chunking boundaries matter more for argumentative prose than for technical prose. The three interpretations in the essay are surrounded by vivid concrete examples — faster-than-light travel, the two-dimensional creature, wormholes — that have stronger embedding signals than the abstract enumeration itself. The retriever found the illustrations and missed the taxonomy. Reducing chunk size further, or experimenting with semantic chunking strategies that respect paragraph boundaries, would be the first intervention to try.
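To make “respect paragraph boundaries” concrete, here is a minimal sketch of a splitter that only ever breaks between paragraphs. It is not the splitter used in the script. The function name and the `max_chars` default are hypothetical, and it assumes paragraphs are separated by blank lines.

```python
def chunk_by_paragraph(text, max_chars=500):
    """Pack whole paragraphs into chunks of at most max_chars where possible.

    A paragraph longer than max_chars is kept whole rather than split,
    so the limit is a target, not a guarantee.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

The design choice is that a chunk can come in under the size limit but never starts or ends mid-paragraph, so an enumeration that lives in its own paragraph stays intact instead of being cut in two.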

For Q3, the live retriever scored 0.47 on Relevancy, actually outperforming the hand-crafted high relevancy context. It found the key constitutive-history passage and a highly relevant trade-off passage from later in the same section. It also found a chunk boundary artifact that dragged several Higgs boson sentences into the retrieval context. The 0.47 score bundles a genuine retrieval win and a chunking artifact together into a single number, which is why reading the raw chunks alongside the score matters.

Test Finding: Chunk boundary artifacts are invisible without inspection. A chunk that straddles a section transition can have high embedding similarity to a query because of one sentence, while carrying several irrelevant sentences from an adjacent section. Printing retrieved chunks before evaluating them is diagnostic practice, not optional housekeeping.

G-Eval told a different story from the other two metrics, because it was measuring something different. The retrieval for Q5 was good: the retriever found the pantheism/panentheism passage directly. The model produced an accurate, well-structured response that correctly represented the essay’s position between two named philosophical stances. It scored 0.8. The 0.2 deduction came entirely from Criterion 1 (Epistemic Humility) because the response stated the essay’s speculative position with more confidence than the essay itself does.

Test Finding: G-Eval catches the failure mode that comes after retrieval succeeds. When the retriever finds the right chunks and the model reasons faithfully from them, RAG-specific metrics have nothing left to flag. G-Eval then surfaces the next layer: not what was missing from the context, but whether the model honored the argument’s register. For a document that is explicitly exploratory rather than assertive, collapsing speculation into conclusion is a meaningful misrepresentation even when every individual claim is accurate.

Taken together, these three metrics extend the diagnostic framework we’ve been building across this series. To the table from the Contextual Precision post, we can now add:

Low Recall + Low Relevancy: The retriever missed the target passage and filled the context with noise (the Q1 failure mode)
Moderate Relevancy + chunking artifact: The retriever found the right neighborhood but a boundary problem introduced irrelevant content (the Q3 live case)
Good retrieval + G-Eval deduction: Retrieval succeeded but the model shifted the essay’s epistemic register (the Q5 result)

Each combination points to a different intervention. Low Recall and Relevancy together point to chunking strategy and retrieval coverage. A chunking artifact with moderate Relevancy points to chunk size and boundary handling. A G-Eval deduction on an otherwise successful case points to prompt design, specifically instructing the model to preserve hedging language when summarizing speculative arguments.
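That mapping can be written down as a small triage function. Treat this as a sketch of the reasoning rather than a rule. The function name, the 0.2 cut-off for “low,” and the `artifact_seen` flag (set by a human after reading the chunks) are all my own assumptions. When Relevancy was not measured for a case, the sketch falls back on Recall alone.

```python
def suggest_intervention(recall=None, relevancy=None, geval=None, artifact_seen=False):
    """Map a score combination to the intervention discussed above.

    Thresholds are illustrative assumptions, not values from DeepEval.
    """
    low = 0.2  # assumed cut-off for a "low" retrieval score
    recall_is_low = recall is not None and recall <= low
    relevancy_is_low = relevancy is None or relevancy <= low
    if recall_is_low and relevancy_is_low:
        return "chunking strategy and retrieval coverage"
    if artifact_seen:
        return "chunk size and boundary handling"
    if geval is not None and geval < 1.0:
        return "prompt design: preserve hedging language"
    return "no intervention suggested"

# The three live results from this post.
print(suggest_intervention(recall=0.0))                          # Q1
print(suggest_intervention(relevancy=0.47, artifact_seen=True))  # Q3
print(suggest_intervention(geval=0.8))                           # Q5
```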

So, my essay proved to be a more demanding test subject than the warp drive paper, not because it’s more complex, but because its vocabulary is more uniformly distributed. A technical paper concentrates its key claims in specific sections with distinctive terminology. A philosophical essay returns to the same concepts repeatedly from different angles, which means the retriever has to work harder to find not just the right topic but the right argumentative move within that topic. That difficulty is precisely what makes it a useful stress test for a RAG system you intend to use on real-world documents.

Next Steps!

We now have five (six, if we include answer relevancy) RAG-specific metrics and one open-ended evaluation metric in our toolkit, and we’ve seen how they form a diagnostic system rather than a checklist. Each metric catches something the others miss, and the pattern across scores is more informative than any individual number.

The next post in this series takes the evaluation pipeline in a different direction: conversational evaluation. Everything we’ve done so far has been single-turn: one question, one retrieved context, one response, one score. But many real-world LLM deployments are multi-turn. A user asks a question, the system responds, the user follows up, and the system needs to maintain consistency, remember what was established earlier, and avoid contradicting itself across turns.

This turns out to be a harder evaluation problem than single-turn RAG, and one where intuitions about model quality are often wrong in both directions. Models get criticized for losing conversational thread when they haven’t, and praised for consistency when they’re actually just repeating themselves. DeepEval has metrics specifically designed for this, and applying them rigorously requires the same discipline we’ve built up here: controlled cases first, live cases second, and a clear sense of what each metric is actually measuring before we interpret the numbers. So, in the next post, we’ll dig into all that.
