This series has now covered seven metrics across two evaluation paradigms. We’ve applied them to a warp drive paper, a mass extinction paper, a philosophical essay on colliders and cosmology, and an essay on time travel and metaphysics. Each post introduced a metric, showed what it catches, and explained what its scores mean in context. This post does something different: it stops introducing new metrics and instead asks what the full set reveals when used together.

What We’ve Built
Before pulling the framework together, it’s worth naming what each metric in the series actually measures. These distinctions matter because the value of the framework comes not from running any single metric but from reading the relationships between them.
The single-turn metrics all operate on one question, one retrieval context, and one response:
- Faithfulness asks whether the response is grounded in the retrieved context. It catches generation problems, by which I mean claims the model made that weren’t in the chunks it received.
- Contextual Precision asks whether the most relevant chunks were ranked first. It catches retrieval ordering problems, by which I mean the right information was retrieved but buried beneath noise.
- Contextual Recall asks whether the retriever found all the chunks needed for a complete answer. It catches retrieval coverage problems, by which I mean the right information existed in the document but was never surfaced.
- Contextual Relevancy asks what fraction of the retrieved chunks were actually relevant. It catches noise problems, by which I mean the retrieval set was polluted with topically-related but query-irrelevant content.
- G-Eval asks whether the response honored the source document’s argumentative register. It catches representation problems, by which I mean the response was factually accurate but mischaracterized the document’s confidence level or argumentative posture.
The conversational metrics operate on a full multi-turn exchange:
- ConversationCompletenessMetric asks whether the conversation addressed the user’s goal across all turns. It catches thread-dropping, which is where the model answered each question locally but failed to maintain the argument across the exchange.
- ConversationalGEval asks whether specific properties (semantic precision, epistemic register, conceptual distinctions) were maintained across turns. It catches drift, in the sense where the model started with the right framing and gradually lost it.
These seven metrics are not a complete evaluation toolkit. They are a diagnostic starting point. What they give you is the ability to distinguish between failure modes that look similar from the outside — a bad answer could come from bad retrieval, bad generation, bad ordering, bad coverage, or bad register — and point toward different interventions for each.
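One practical consequence of these distinctions is that each metric reads different fields of a test case. Here’s a plain-dict sketch of which fields each single-turn metric consumes, based on my reading of the deepeval docs — treat the exact field sets as a sketch and verify against your installed version:

```python
# Which test-case fields each single-turn metric reads. Plain-dict sketch
# based on my reading of the deepeval documentation -- not authoritative.
REQUIRED_FIELDS = {
    "faithfulness":         {"input", "actual_output", "retrieval_context"},
    "contextual_precision": {"input", "actual_output", "expected_output", "retrieval_context"},
    "contextual_recall":    {"input", "actual_output", "expected_output", "retrieval_context"},
    "contextual_relevancy": {"input", "actual_output", "retrieval_context"},
    # G-Eval reads whatever you list in evaluation_params; in this series
    # that is input, actual_output, and expected_output.
    "g_eval":               {"input", "actual_output", "expected_output"},
}

def missing_fields(metric: str, test_case: dict) -> set:
    """Return the fields a given metric needs that the test case lacks."""
    present = {k for k, v in test_case.items() if v is not None}
    return REQUIRED_FIELDS[metric] - present
```

Writing it down makes one cost visible: Precision and Recall both need an expected_output, which is why those two metrics are the most expensive to set up — you have to hand-craft the reference answer.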
The Diagnostic Framework
I would argue that the most useful thing this series has produced is not any individual metric but the pattern of relationships between them. Here’s the decision framework that emerges from running all seven together:
| Pattern | Diagnosis | Intervention |
|---|---|---|
| Low Faithfulness + Low Precision | Retrieval failure cascading to generation. The model is working from poorly ordered or irrelevant chunks and filling gaps with training data. | Fix retrieval ordering. Adjust chunk size, embedding strategy, or retrieval strategy before addressing generation. |
| Low Faithfulness + High Precision | Generation problem, not retrieval. The retriever found the right chunks in the right order, but the model made claims beyond them. | Investigate the generation model. Consider stricter prompting constraints or a different model. |
| High Faithfulness + Low Precision | Model compensating for poor chunk ordering. The model produced a grounded response despite the relevant chunk being buried. This is a risky success. | Fix retrieval ordering. The system is fragile even if the current output looks acceptable. |
| Low Recall + Low Relevancy | Retriever missed the target section entirely and filled the context with noise. | Investigate chunking strategy and embedding model. The retriever is matching on topic rather than query intent. |
| Low Recall + High Relevancy | Retriever found relevant chunks but missed some required ones. The answer is partially supported. | Increase k. Consider whether the required information spans multiple sections that need separate retrieval. |
| Low G-Eval | Response overstated the document’s confidence or misrepresented its argumentative posture. Factually accurate but rhetorically unfaithful. | Adjust prompt instructions to preserve hedging language. Consider whether the model needs explicit guidance on epistemic register. |
| High Completeness + Low Conv G-Eval | Conversation addressed the user’s goal but lost a conceptual distinction or drifted in register across turns. | Investigate which turn the drift occurred. Run single-turn metrics on the problematic turn to determine whether the cause is retrieval or generation. |
| Low Completeness | Conversation failed to address the user’s goal. A thread was dropped and never recovered. | Verify history management. Check whether the model is receiving prior turns in its context window. |
This framework is most useful when metrics diverge. When everything scores high, the system is working. Great! When everything scores low, something is fundamentally broken. Not so great, but informative! The interesting diagnostic signal comes from mixed patterns — one metric high, another low — because those mixtures point to specific failure modes rather than general system failure.
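To make the mixed-pattern logic concrete, here is a minimal sketch of the Faithfulness/Precision rows of the table as code. The 0.7 threshold is an arbitrary illustration, not a recommendation — calibrate cutoffs against your own score distributions:

```python
def diagnose(faithfulness: float, precision: float, threshold: float = 0.7) -> str:
    """Map a (faithfulness, precision) score pair to a diagnosis from the
    table above. The threshold is illustrative; tune it for your system."""
    low_f = faithfulness < threshold
    low_p = precision < threshold
    if low_f and low_p:
        return "retrieval ordering failure cascading to generation"
    if low_f:
        return "generation problem: model outran well-ordered chunks"
    if low_p:
        return "risky success: model compensating for poor ordering"
    return "healthy on this axis"
```

For example, `diagnose(0.5, 0.58)` returns the cascading-retrieval diagnosis, while `diagnose(0.5, 0.9)` points at generation instead — same Faithfulness score, opposite intervention.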
Document Type Matters
One finding that emerged across the series but was never stated as a finding is that the right metric combination depends on the type of document being evaluated. This is worth making explicit before we run the synthesis script.
A technical paper with discrete factual claims, like the warp drive paper or the mass extinction paper, is best evaluated with Faithfulness and Contextual Precision as the primary metrics. The claims are specific and verifiable. In the case of the warp drive paper, either the chunk says the energy requirement is 10^28 kilograms or it doesn’t. Recall and Relevancy add coverage information. G-Eval adds less value here because the paper’s epistemic register is not especially subtle; it asserts rather than hedges.
An argumentative essay with recurring vocabulary, like the coherence essay, stresses Recall and Relevancy most severely. The vocabulary overlap across sections means the retriever will consistently find topically-related but query-irrelevant chunks. G-Eval becomes more important here because the essay’s argument involves deliberate hedging and structural analogies that a model might sharpen into firmer claims.
A philosophical essay with cumulative argument structure, like the time travel essay, is where the conversational metrics earn their keep. The argument builds across sections in a way that single-turn evaluation can’t fully capture. A model evaluated on individual questions might score well on each while still failing to maintain the thread across a multi-turn exchange. Conversational completeness and conversational G-Eval surface failure modes that the single-turn metrics simply can’t see.
This is not a rigid taxonomy. Real document collections are mixed. A knowledge base might contain technical documentation, policy documents, and argumentative analyses side by side. The practical implication is to run a broad metric suite when evaluating against a new document type, and then narrow to the metrics that produced the most diagnostic signal once you understand the document’s properties.
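One way to operationalize “broad first, then narrow” is a mapping from document profile to primary metrics. The groupings below just restate the paragraphs above; the dict itself is an illustration, not a prescription:

```python
# Primary metrics by document profile, restating the discussion above.
# "broad" is the starting point for any unfamiliar document type.
METRIC_SUITES = {
    "broad": ["faithfulness", "precision", "recall", "relevancy",
              "g_eval", "completeness", "conversational_g_eval"],
    "technical_paper":     ["faithfulness", "precision"],
    "argumentative_essay": ["recall", "relevancy", "g_eval"],
    "cumulative_argument": ["completeness", "conversational_g_eval"],
}

def suite_for(doc_type: str) -> list:
    """Fall back to the broad suite for unknown document types."""
    return METRIC_SUITES.get(doc_type, METRIC_SUITES["broad"])
```

The fallback branch is the important part: an unrecognized document type should trigger the full suite, not a guess.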
The Synthesis Script
The evaluate_synthesis.py script runs all seven metrics against a single question and follow-up, targeting the time travel essay. The question, essentially my core test case, was chosen to stress the full suite simultaneously:
What does the essay mean when it says bootstrap loops are category errors rather than merely improbable events, and how does that connect to the distinction between physical and metaphysical impossibility?
This question requires a specific chunk from the bootstrap paradox section; it won’t be well-served by chunks from the Novikov or short-circuit sections, even though those also discuss impossibility. The phrase “category error” appears in a specific passage, so Contextual Precision has something precise to evaluate. The question has two parts joined by “and how does that connect,” which means a complete answer requires the retriever to surface both the category error claim and the physical/metaphysical distinction, giving Recall something to evaluate. And the essay’s framing of this as a metaphysical rather than physical claim is carefully hedged, which gives G-Eval its target.
The follow-up question then probes the divine aseity parallel:
You mentioned divine aseity in that context. Does the essay treat the parallel between bootstrap objects and divine aseity as a philosophical argument for God’s existence, or as something weaker than that?
This follow-up does two things simultaneously. It tests whether the first response actually introduced divine aseity! If it didn’t, likely because the retriever missed the relevant chunk, the phrase “you mentioned divine aseity” will expose that gap when the model tries to respond to a reference that wasn’t in its prior turn. The question also tests epistemic register directly, asking the model to characterize the strength of an argument that the essay itself treats as structurally illuminating rather than probative.
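In the script, this exchange feeds the conversational metrics as an ordered list of role/content turns. Sketched here as plain data (the actual script wraps these in deepeval’s Turn objects, per the previous post; the assistant contents are placeholders for the real RAG responses):

```python
# The two-turn exchange as plain role/content pairs. Assistant contents are
# placeholders -- in the script they are the actual generated responses.
turns = [
    {"role": "user", "content": (
        "What does the essay mean when it says bootstrap loops are category "
        "errors rather than merely improbable events, and how does that "
        "connect to the distinction between physical and metaphysical "
        "impossibility?"
    )},
    {"role": "assistant", "content": "<first RAG response goes here>"},
    {"role": "user", "content": (
        "You mentioned divine aseity in that context. Does the essay treat "
        "the parallel between bootstrap objects and divine aseity as a "
        "philosophical argument for God's existence, or as something weaker "
        "than that?"
    )},
    {"role": "assistant", "content": "<second RAG response goes here>"},
]
```

Note that the follow-up’s trap only works if turn order is preserved: “you mentioned divine aseity” refers back to the first assistant turn, so history management is itself under test here.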
Note that the test cases here take the form of questions. This matters: the questions are the experiments. Each one is designed to expose specific retrieval and generation conditions, and that design is what gives the resulting scores their diagnostic value.
The script is structured in four parts. Part 1 sets up the models, metrics, document loading, and questions. Part 2 runs all five single-turn metrics against the primary question. Part 3 runs both conversational metrics against the two-turn exchange. Part 4 prints a synthesis summary that shows all seven scores together alongside the diagnostic interpretation framework.
That last piece, the diagnostic interpretation printed alongside the scores, is new to this script. In the earlier scripts, the framework for reading scores lived in the blog post. Here it lives in the output itself, so when you’re running the script in your own environment, you see not just your scores but the map for interpreting them.
Here’s the overall shape of the metric instantiations before you run it:

```python
from deepeval.metrics import (
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    GEval,
    ConversationCompletenessMetric,
    ConversationalGEval,
)
from deepeval.test_case import LLMTestCaseParams, TurnParams

# Single-turn metrics
faithfulness_metric = FaithfulnessMetric(model=judge_model, verbose_mode=True)
precision_metric = ContextualPrecisionMetric(model=judge_model, verbose_mode=True)
recall_metric = ContextualRecallMetric(model=judge_model, verbose_mode=True)
relevancy_metric = ContextualRelevancyMetric(model=judge_model, verbose_mode=True)

geval_metric = GEval(
    name="EpistemicRegister",
    model=judge_model,
    verbose_mode=True,
    criteria="""...""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

# Conversational metrics
completeness_metric = ConversationCompletenessMetric(
    model=judge_model, verbose_mode=True
)

conv_geval_metric = ConversationalGEval(
    name="MetaphysicalDistinction",
    model=judge_model,
    verbose_mode=True,
    criteria="""...""",
    evaluation_params=[TurnParams.ROLE, TurnParams.CONTENT],
)
```
Notice that the single-turn G-Eval uses LLMTestCaseParams while the conversational G-Eval uses TurnParams. This is the distinction established in the previous post: the two G-Eval variants operate on different data structures and require different parameter enumerations.
Once you have the script from the repository, run it directly (making sure you are in your virtual environment):

```shell
python evaluate_synthesis.py
```
Let’s look at what comes back.
Part 2: Single-Turn Evaluation — Retrieval and Response
Before any metric runs, the script prints what the retriever found and what the model generated. Both are worth examining before the scores land.
============================================================
PART 2: SINGLE-TURN EVALUATION
============================================================
Retrieving context for primary question...
Retrieved chunks:
--- Chunk 1 ---
fails for the same structural reason as the bootstrap paradox, just expressed in probabilistic rather than logical terms. In both cases, the temporal system encounters something that does not fit — information with no valid cause, an object with no origin — and the result is short-circuit, collapse, ...
--- Chunk 2 ---
outside Godself for existence. God is not caused by anything; God simply is. A bootstrap object is a crude, finite, temporal imitation of this property: an uncaused thing inside a universe that is supposed to contain only caused things. This is why the paradox produces such strong intuitive resistan ...
--- Chunk 3 ---
to prevent atemporal phenomena from emerging inside it. Origin-lessness is not a physical property; it's a metaphysical one. And temporal physics resists metaphysical leakage; not physically, by erecting a barrier you could detect with instruments, but conceptually, by making such loops category err ...
All three chunks are from the bootstrap paradox section. Chunk 3 contains the “category error” phrase the question specifically targeted. Chunk 2 contains the divine aseity comparison. Chunk 1 contains the structural parallel between the lottery scheme and the bootstrap paradox. This is a notably clean retrieval result: the retriever found the right section and surfaced three complementary chunks from it rather than pulling from adjacent sections. The question’s specificity helped: “category error” and “physical and metaphysical impossibility” are distinctive enough phrases that the embedding similarity pointed directly at the right passage.
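The mechanism behind “pointed directly at the right passage” is just vector similarity. Here’s a toy sketch of the ranking step with hand-rolled cosine similarity and made-up vectors — the actual script uses its embedding model and vector store from the earlier posts:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_chunks(query_vec, chunk_vecs):
    """Return chunk indices ordered by similarity to the query, best first."""
    scored = sorted(enumerate(chunk_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [i for i, _ in scored]

# Toy vectors: chunk 0 lands first because it's nearest the query direction,
# regardless of which chunk actually best answers the question -- which is
# exactly the ordering behavior Contextual Precision exists to audit.
query = [1.0, 0.2, 0.0]
chunks = [[0.9, 0.3, 0.1], [0.2, 1.0, 0.0], [0.8, 0.1, 0.4]]
print(rank_chunks(query, chunks))  # [0, 2, 1]
```

Distinctive phrases like “category error” produce query vectors that sit close to one specific passage; vaguer questions smear similarity across sections, and that’s when ordering problems appear.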
The generated response is worth reading in full before the metrics evaluate it, because it illustrates something important about how a capable model handles good retrieval:
GENERATED RESPONSE:
Okay, let's break this down step-by-step. The core argument here is about how our intuitive understanding of the universe clashes with the potential for paradoxes like the bootstrap paradox.
1. **Initial Framing:** The essay repeatedly emphasizes that bootstrap paradoxes aren’t just “weird” from a technical standpoint (difficult engineering). They represent a fundamental conflict between our *temporal* understanding of the universe and something *outside* of that temporal framework.
2. **Categorization – The Key Distinction:** The essay argues that bootstrap loops are “category errors.” This means they aren’t just *improbable* – unlikely to occur – but fundamentally *incompatible* with the way temporal physics is constructed. Let’s unpack what this means:
* **Temporal Physics’s Assumption:** Temporal physics, as presented here, assumes that *everything* has a preceding origin state. This is a core principle to prevent things from arising out of nothing. It’s a constraint built into the system.
* **The Bootstrap Paradox's Challenge:** A bootstrap loop (like the paradox) introduces an object/information with *no* origin. It's inherently “atemporal” – existing outside of time’s causal chain.
* **Category Error:** Because temporal physics is built around a universe of caused things, the existence of something *uncaused* – an origin-less object – creates a fundamental mismatch. It’s like trying to fit a square peg into a round hole. The system simply can't process it. It’s not a matter of probability; it’s a rejection of the fundamental categories the system uses.
3. **Physical vs. Metaphysical Impossibility:** This is where the connection to the broader argument about God and existence is made. The essay suggests:
* **Physical Impossibility:** A physical impossibility would be something that violates the laws of physics as we understand them (e.g., faster-than-light travel). We could, theoretically, refine our understanding of physics to allow it.
* **Metaphysical Impossibility:** A bootstrap paradox represents a metaphysical impossibility – a contradiction of the *fundamental assumptions* about the nature of reality. It’s not a constraint we could overcome with a new scientific theory, but a reflection of a deeper, perhaps unresolvable, tension between our models of the universe and something beyond.
4. **Analogy – Self-Preservation:** The essay uses the analogy of a conscious self preserving its identity by disallowing contradictions. Similarly, the universe avoids the bootstrap paradox by rejecting the idea of an origin-less object.
**In summary,** the essay argues that bootstrap loops aren't just improbable; they represent a category error because they introduce a property (origin-lessness) that is fundamentally incompatible with the temporal framework of physics. This highlights a distinction between physical and metaphysical impossibilities – a challenge that transcends mere engineering problems and delves into the very nature of existence.
Do you want me to elaborate on any particular aspect of this explanation, such as the implications of the “Godself” concept or how this relates to other paradoxes?
The model correctly identified the category error argument, correctly distinguished physical from metaphysical impossibility, and correctly grounded both in the retrieved chunks. It also picked up the divine aseity parallel from chunk 2 and used it appropriately in the body of the response, which matters for the follow-up question that references “you mentioned divine aseity.”
There are two things worth flagging before the metrics score it, however.
- First, the model’s verbose step-by-step formatting is present again: numbered points, bold headers, a final summary labeled “In summary.” This is a stylistic property of the ts-reasoner model that doesn’t affect the substance of the answer but is worth noting for production contexts where response format matters.
- Second, the response closes with an offer to elaborate: “Do you want me to elaborate on any particular aspect of this explanation, such as the implications of the ‘Godself’ concept…” This kind of conversational tag is again a model tendency rather than an evaluation concern, but it’s worth being aware that some metrics might interpret unsolicited elaboration offers as incomplete grounding.
With the retrieval and response in hand, let’s see what the five single-turn metrics make of them.
Faithfulness
This is a genuinely surprising result and one of the most instructive outputs in the series.
============================================================
FAITHFULNESS
============================================================
**************************************************
Faithfulness Verbose Logs
**************************************************
Truths (limit=None):
[
"The lottery scheme fails for the same structural reason as the bootstrap paradox.",
"Both the lottery scheme and the bootstrap paradox involve a temporal system encountering something that does not fit, leading to a short-circuit or collapse of the loop.",
"In both cases, the result is the erasure of the loop due to an inconsistency in origin."
]
Claims:
[
"The essay argues that bootstrap paradoxes represent a fundamental conflict between temporal understanding and something outside of that temporal framework.",
"Bootstrap loops are categorized as 'category errors' because they introduce an object with no origin, which is incompatible with the core principles of temporal physics.",
"Temporal physics assumes everything has a preceding origin state to prevent things from arising out of nothing.",
"A bootstrap loop introduces an atemporal element that cannot be processed within the system's framework due to its fundamental assumptions about reality.",
"The essay suggests that physical impossibilities can potentially be addressed by refining our understanding of physics, whereas metaphysical impossibilities are contradictions of fundamental assumptions about the nature of reality.",
"Bootstrap paradoxes highlight a distinction between physical and metaphysical impossibilities, with the latter representing a deeper tension beyond mere engineering problems."
]
Verdicts:
[
{
"verdict": "no",
"reason": null
},
{
"verdict": "yes",
"reason": "The context states that both the lottery scheme and bootstrap paradox involve erasure due to an inconsistency, which directly contradicts the claim that they are categorized as 'category errors' because of introducing an object with no origin."
},
{
"verdict": "no",
"reason": null
},
{
"verdict": "yes",
"reason": "The context mentions that both paradoxes involve erasure due to inconsistency, which contradicts the claim about atemporal elements not being processable within the system's framework."
},
{
"verdict": "idk",
"reason": "While the context discusses erasure and inconsistencies, it does not provide direct evidence for or against the idea that physical impossibilities can be addressed by refining our understanding of physics. The claim is more about a potential solution rather than a contradiction."
},
{
"verdict": "no",
"reason": "The context states that both paradoxes involve erasure due to inconsistency, which directly contradicts the claim that bootstrap paradoxes highlight a distinction between physical and metaphysical impossibilities with the latter representing deeper tensions beyond mere engineering problems."
}
]
Score: 0.5
Reason: The score is 0.50 because the actual output incorrectly distinguishes between physical and metaphysical impossibilities in bootstrap paradoxes, contradicting the context which states that both paradoxes involve erasure due to inconsistency.
We got a 0.5 on Faithfulness despite what looked like a well-grounded response. The verdict breakdown reveals what happened, and it’s worth reading carefully because the failure is subtle.
The metric extracted six claims from the response and checked each against the retrieved chunks. Three were marked “yes” or “idk” (reminder: “idk” means “I don’t know,” and it does not count against the score). Three were marked “no.” The verdicts that carry stated reasons all share the same complaint: “The context states that both paradoxes involve erasure due to inconsistency, which directly contradicts the claim…”
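The 0.50 can be reproduced from the verdict list alone. Based on this output, deepeval appears to treat “idk” as non-penalizing — here’s a sketch of that arithmetic, inferred from the printed verdicts rather than taken from the library’s documentation:

```python
def faithfulness_score(verdicts):
    """Fraction of extracted claims not contradicted by the retrieval
    context. Only 'no' penalizes; 'idk' counts in the claim's favor."""
    supported = sum(1 for v in verdicts if v != "no")
    return supported / len(verdicts)

# The six verdicts printed above:
verdicts = ["no", "yes", "no", "yes", "idk", "no"]
print(faithfulness_score(verdicts))  # 0.5
```

Three non-“no” verdicts out of six claims gives exactly 0.5, which is why the score alone can’t tell you whether the “no” verdicts reflect hallucination or something subtler.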
But wait. The response’s claim about category errors and metaphysical impossibility is not a contradiction of the retrieved chunks; in fact, it’s an accurate summary of what the essay argues! The problem is in what the retriever actually surfaced. Look at chunk 1 again: it describes the lottery scheme failing “for the same structural reason as the bootstrap paradox.” Both involve “a temporal system encountering something that does not fit” leading to “short-circuit, collapse.” That chunk is about the parallel between the two paradoxes at the level of causal instability. It’s not about the metaphysical category error argument specifically.
The Faithfulness metric is therefore doing something technically correct but contextually misleading. It’s checking whether the response’s claims are supported by the retrieved chunks, and the retrieved chunks, while all from the right section, emphasized the causal instability framing more than the metaphysical category error framing. The response correctly articulated the category error argument from chunk 3, but then extended it into a physical/metaphysical distinction that chunk 1’s framing of “erasure due to inconsistency” does not directly support. The metric read that extension as a contradiction rather than a synthesis.
This is one of the most important findings in the synthesis script, and it brings the diagnostic framework to life in a concrete way.
Test Finding: A Faithfulness score of 0.5 alongside what appears to be accurate retrieval does not necessarily mean the model hallucinated. It may mean the model synthesized across retrieved chunks in a way that introduced claims not directly supported by any individual chunk. In this case, the model correctly understood the essay’s argument and articulated it accurately, which is great. But it articulated that argument in a way that went slightly beyond what the three specific chunks it received explicitly stated. Faithfulness evaluates grounding in the retrieved context, not accuracy relative to the document as a whole. Those are different things, and this result illustrates exactly where they diverge.
The practical implication is worth naming. If you saw only this Faithfulness score of 0.5 without reading the chunks and the response, you might conclude the model was hallucinating or the retrieval was bad. The score alone doesn’t tell you which! That’s why the Contextual Precision, Recall, and Relevancy scores matter: they will tell us whether the retrieval was actually the problem or whether the 0.5 reflects a generation-side synthesis that outran its sources.
Contextual Precision
Here’s the section for this output:
============================================================
CONTEXTUAL PRECISION
============================================================
**************************************************
Contextual Precision Verbose Logs
**************************************************
Verdicts:
[
{
"verdict": "no",
"reason": "The context does not directly address the distinction between physical and metaphysical impossibility or provide a clear connection to bootstrap loops as category errors. Instead, it discusses how both paradoxes lead to short-circuiting and collapse within temporal systems."
},
{
"verdict": "yes",
"reason": "This sentence explains that 'the bootstrap object introduces an atemporal property \u2014 origin-lessness \u2014 into a temporal container that's not built to hold it,' which aligns with the idea of bootstrap loops being category errors. It supports the expected output by elaborating on why such objects are considered metaphysical impossibilities."
},
{
"verdict": "yes",
"reason": "This context clarifies that 'the universe preserves causal integrity the same way a conscious self preserves identity: by disallowing contradictions in origin,' which directly supports the idea of bootstrap loops being treated as category errors rather than merely improbable events. It reinforces the distinction between physical and metaphysical impossibility."
}
]
Score: 0.5833333333333333
Reason: The score is 0.58 because nodes ranked 1 and 2 are marked 'no', which should be lower in ranking compared to the relevant nodes (ranked 3). Nodes 3 provide direct support for the input's query, making them more valuable than the irrelevant or less pertinent information provided by nodes 1 and 2.
Now the retrieval picture comes into focus. The Faithfulness score of 0.5 was not caused by bad retrieval; it was caused by retrieval ordering. The chunk most directly on point was retrieved, but it was ranked last.
The verdict breakdown tells the story, with one caveat: the summary reason misstates its own verdict list (it claims nodes 1 and 2 were marked “no,” but the printed verdicts and the score arithmetic say otherwise). Chunk 1 (the lottery/bootstrap parallel about erasure and inconsistency) was marked “no” because it doesn’t directly address the category error argument or the physical/metaphysical distinction. Chunk 2 (the divine aseity comparison) was marked “yes” for elaborating why bootstrap objects count as metaphysical impossibilities. Chunk 3 (the one containing “origin-lessness is not a physical property; it’s a metaphysical one” and the “category error” language) was marked “yes” as directly supporting both parts of the question.
The chunk that best answers the question was ranked third. The only chunk marked “no” was ranked first, ahead of both relevant chunks.
This is the Contextual Precision failure mode we established back in the Contextual Precision post, now appearing in the synthesis script in a new form. In the previous post, the warp drive paper’s energy calculation chunk was buried at position three beneath topically-related but query-irrelevant content. Here, the bootstrap paradox section’s category error chunk is buried at position three beneath other chunks from the same section that address adjacent arguments rather than the specific argument the question targeted.
The connection to the Faithfulness score should now be clear. The model received chunk 1 first (the erasure and inconsistency framing) and that framing influenced how it structured its response. It used chunk 3’s category error language correctly, but it also leaned on chunk 1’s framing in ways that produced claims the metric read as contradicting the retrieved context. Had chunk 3 been ranked first, the response would likely have led with the metaphysical category error argument rather than the causal instability argument, and the Faithfulness verdicts would probably have landed differently.
Test Finding: This is the Low Faithfulness + Low Precision pattern from the diagnostic framework: retrieval ordering failure cascading into generation. The model wasn’t hallucinating. It was working faithfully from what it received, in the order it received it. The intervention indicated by this pattern is retrieval ordering rather than generation: fix the ranking, and the Faithfulness score should improve without any changes to the model.
What makes this result particularly instructive for the synthesis post is that it demonstrates the framework working as designed. A Faithfulness score of 0.5 seen in isolation looks like a generation problem. A Precision score of 0.58 seen in isolation looks like a retrieval problem. Seen together, they produce the specific diagnosis: ordering failure cascading to generation. Neither score alone was sufficient to identify the cause. The combination was.
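For the record, the 0.58 is reproducible from the printed verdict list using weighted cumulative precision as I understand deepeval’s formula: for each chunk judged relevant, take the precision at its rank (relevant chunks seen so far divided by rank), then average those values. A sketch, with chunk 1 counted irrelevant and chunks 2 and 3 relevant per the verdict list above:

```python
def contextual_precision(relevant):
    """Weighted cumulative precision over a ranked list of relevance
    verdicts. relevant[k-1] is True if the chunk at rank k was judged
    relevant ('yes')."""
    precisions, seen = [], 0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            seen += 1
            precisions.append(seen / rank)  # precision@rank at each hit
    return sum(precisions) / len(precisions) if precisions else 0.0

# Chunk 1 "no", chunks 2 and 3 "yes" -> mean of 1/2 and 2/3:
print(contextual_precision([False, True, True]))  # ~0.583
```

The same three chunks with the relevant ones first — `[True, True, False]` — would score 1.0, which is the entire case for fixing ordering rather than retrieval.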
Contextual Recall
Here’s the section for this output:
============================================================
CONTEXTUAL RECALL
============================================================
**************************************************
Contextual Recall Verbose Logs
**************************************************
Verdicts:
[
{
"verdict": "no",
"reason": "The retrieval context does not contain any information about distinguishing between two kinds of impossibility or drawing a parallel with divine aseity.",
"expected_output": "The essay distinguishes between two kinds of\nimpossibility. Physical impossibility means the universe blocks paradoxical\ncausal loops through mechanisms like chronology protection and retrocausal\ninstability. Metaphysical impossibility means bootstrap loops are category\nerrors -- not just causally unstable but ontologically misplaced -- because\nthey introduce origin-lessness into a temporal system that is structured\naround the grammar of cause and effect.\n\nThe essay then draws a structural parallel between bootstrap objects and\ndivine aseity: both are origin-less, but a bootstrap object is origin-less\nwithin time through a closed curve, whereas God is self-existent outside\ntime entirely. The essay treats this as structurally illuminating rather\nthan probative. It does not argue that the parallel proves God's existence.\nIt ends by saying the question points outside the system entirely, which\nis a gesture rather than a conclusion."
},
{
"verdict": "no",
"reason": "The retrieval context does not mention the specific types of impossibility (physical and metaphysical) or the structural parallels drawn in the expected output.",
"expected_output": "The essay distinguishes between two kinds of\nimpossibility. Physical impossibility means the universe blocks paradoxical\ncausal loops through mechanisms like chronology protection and retrocausal\ninstability. Metaphysical impossibility means bootstrap loops are category\nerrors -- not just causally unstable but ontologically misplaced -- because\nthey introduce origin-lessness into a temporal system that is structured\naround the grammar of cause and effect.\n\nThe essay then draws a structural parallel between bootstrap objects and\ndivine aseity: both are origin-less, but a bootstrap object is origin-less\nwithin time through a closed curve, whereas God is self-existent outside\ntime entirely. The essay treats this as structurally illuminating rather\nthan probative. It does not argue that the parallel proves God's existence.\nIt ends by saying the question points outside the system entirely, which\nis a gesture rather than a conclusion."
}
]
Score: 0.0
Reason: The score is 0.00 because the retrieval context lacks any information about distinguishing between physical and metaphysical impossibility, drawing a parallel with divine aseity, or discussing specific types of impossibility as mentioned in the expected output.
A complete miss on Recall, and at first glance this seems to contradict what we just saw. The Precision output confirmed that chunk 3 contained both the “origin-lessness is not a physical property; it’s a metaphysical one” line and the “category error” language. How can Recall score 0.0 when that chunk was retrieved?
The answer is in what Recall is evaluating against. Unlike Precision, which evaluates whether the right chunks were ranked appropriately, Recall evaluates whether the retrieved context can support the expected output: the hand-crafted answer that represents what a complete, faithful response should look like. And the expected output makes two distinct claims that require two distinct chunks:
- First, it distinguishes physical impossibility from metaphysical impossibility in specific terms: chronology protection and retrocausal instability for the physical kind, ontological misplacement for the metaphysical kind.
- Second, it draws the bootstrap / divine aseity parallel and explicitly characterizes the essay’s epistemic register: “structurally illuminating rather than probative,” ending with “a gesture rather than a conclusion.”
The three retrieved chunks together do not contain enough information to produce either of those specific formulations. Chunk 3 has the category error language and the metaphysical leakage framing but not the explicit physical/metaphysical distinction stated in those terms. Chunk 2 has the divine aseity comparison but not the epistemic characterization: the “structurally illuminating rather than probative” framing and the “gesture rather than a conclusion” closing come from a different part of the essay that was not retrieved.
The metric’s reasoning confirms this: “The retrieval context lacks any information about distinguishing between physical and metaphysical impossibility, drawing a parallel with divine aseity, or discussing specific types of impossibility as mentioned in the expected output.” The chunks that were retrieved addressed adjacent content but not the specific claims the expected output requires.
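The 0.0 follows mechanically from the verdict list. Contextual Recall, under the standard attributable-claims-over-total definition, is the fraction of expected-output claims the judge could attribute to the retrieved context; with both verdicts “no,” that fraction is zero. A minimal sketch of the arithmetic:

```python
# Contextual Recall: fraction of expected-output claims the judge
# attributed to the retrieved context ("yes" verdicts over all verdicts).
def recall_score(verdicts: list[str]) -> float:
    if not verdicts:
        return 0.0
    return verdicts.count("yes") / len(verdicts)

# Both expected-output claims were judged unattributable in this run.
print(recall_score(["no", "no"]))  # 0.0
```

The takeaway is that Recall is only as generous as the expected output is specific: two distinct claims requiring two distinct chunks means two chances to miss.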
Test Finding: This is the Low Recall + Low Precision pattern from the diagnostic framework, but with an important nuance. It’s not that the retriever found the wrong section. After all, all three chunks are from the bootstrap paradox section. Instead, the issue is that the question required content from two distinct sub-arguments within that section, and the retriever surfaced chunks weighted toward one of them. The category error argument was retrieved; the explicit physical/metaphysical distinction and the epistemic register characterization were not. Increasing k (retrieving more chunks) would be the first intervention to try. With k=5 or k=6, the retriever might surface the additional chunks needed to support the full expected output.
This is also the point where the synthesis framework pays off most clearly. Precision told us the ordering was wrong. Recall tells us coverage was incomplete. Together they suggest a retriever that found the right neighborhood but not the full set of required content within it. This is a more specific diagnosis than either metric could provide alone.
At this point, you might be thinking: “Sheesh, to wade through all this stuff about the paper, I really have to read the paper.” Well, yeah, sort of. Or, at the very least, understand it. In RAG-style applications, your source documents are the test substrate: you have to know them well enough to judge what the metrics report. Critically, of course, you can try all of this on papers of your choosing!
Contextual Relevancy
Here’s the section for this output:
============================================================
CONTEXTUAL RELEVANCY
============================================================
**************************************************
Contextual Relevancy Verbose Logs
**************************************************
Verdicts:
[
{
"verdicts": [
{
"statement": "bootstrap loops are category errors rather than merely improbable events",
"verdict": "yes",
"reason": null
},
{
"statement": "the loop's own erasure is the result of a short-circuit",
"verdict": "yes",
"reason": null
},
{
"statement": "physics can gesture toward but not answer what lies behind the short-circuit and grounds the consistency it protects",
"verdict": "yes",
"reason": null
},
{
"statement": "This is a working draft. Feedback and pushback welcome.",
"verdict": "no",
"reason": "The statement 'This is a working draft. Feedback and pushback welcome.' does not address the specific question about bootstrap loops being category errors or connect to the distinction between physical and metaphysical impossibility."
}
]
},
{
"verdicts": [
{
"statement": "bootstrap loops are category errors rather than merely improbable events",
"verdict": "no",
"reason": "The context does not mention 'bootstrap loops' or their classification as category errors."
},
{
"statement": "This is why the paradox produces such strong intuitive resistance.",
"verdict": "no",
"reason": "While this statement refers to a paradox, it does not directly address bootstrap loops being category errors."
},
{
"statement": "The bootstrap object introduces an atemporal property \u2014 origin-lessness \u2014 into a temporal container that's not built to hold it.",
"verdict": "no",
"reason": "This statement discusses the nature of a bootstrap object but does not relate to bootstrap loops being category errors."
}
]
},
{
"verdicts": [
{
"statement": "bootstrap loops are category errors rather than merely improbable events",
"verdict": "yes",
"reason": null
},
{
"statement": "the distinction between physical and metaphysical impossibility is relevant to how the universe preserves causal integrity",
"verdict": "yes",
"reason": null
}
]
}
]
Score: 0.5555555555555556
Reason: The score is 0.56 because while several statements mention bootstrap loops being category errors, they do not directly address their connection to the distinction between physical and metaphysical impossibility. The context lacks explicit references to this specific connection.
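The unusual-looking 0.5555… is just the fraction of relevant statements pooled across all chunks: five “yes” verdicts out of nine statements total. A quick sketch of the arithmetic, using the verdict lists above:

```python
# Contextual Relevancy pools statement verdicts across all retrieved
# chunks: relevant statements / total statements.
chunk_verdicts = [
    ["yes", "yes", "yes", "no"],  # chunk 1: draft notice flagged irrelevant
    ["no", "no", "no"],           # chunk 2: all statements judged off-query
    ["yes", "yes"],               # chunk 3: both statements on point
]
flat = [v for chunk in chunk_verdicts for v in chunk]
score = flat.count("yes") / len(flat)
print(score)  # 0.5555555555555556
```

Note that the pooling means a single contaminated chunk and a wholly irrelevant chunk both drag the score in the same direction; the verdict breakdown, not the number, tells you which happened.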
A moderate relevancy score, and the verdict breakdown reveals something we haven’t seen before in the series: a chunk boundary artifact producing an irrelevant statement inside an otherwise relevant chunk.
Look at chunk 1’s verdicts. Three statements were marked “yes”: the category error claim, the short-circuit framing, and the gesture toward what lies behind the short-circuit. Then a fourth statement from the same chunk, “This is a working draft. Feedback and pushback welcome.”, was marked “no” because it does not address the question at all. The draft notice at the bottom of the essay bled into the chunk. The retriever pulled a chunk whose boundary happened to include the closing line of the document, and the metric correctly identified that line as irrelevant to the question.
This is the same chunking artifact we saw in the coherence essay post, where the introduction’s closing sentence about “a Creator constituted by history itself” fell in the same chunk as the opening sentences of the first section on particle physics. There, a single highly relevant sentence dragged several irrelevant sentences along with it. Here, a single irrelevant sentence from the document’s closing note was dragged into an otherwise relevant chunk. Both are consequences of chunk boundaries falling at structurally meaningful boundaries in the document (section transitions, document endings), without awareness of that structure.
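The mechanics are easy to reproduce. Fixed-size chunking has no awareness of document structure, so whatever text sits at a structural boundary rides along with its neighbors. A toy sketch (the text and the 90-character window are illustrative, not the script’s actual 600-character setting):

```python
# Naive fixed-size chunking: split on character count, ignoring structure.
def chunk_text(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = (
    "...making such loops category errors rather than physical events. "
    "This is a working draft. Feedback and pushback welcome."
)
chunks = chunk_text(doc, 90)
# The closing note bleeds into the same chunk as the final argument.
print(chunks[0])
```

Structure-aware splitters (on headings, paragraphs, or sentences) avoid this class of artifact at the cost of variable chunk sizes.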
Chunk 2’s verdicts tell a different story. All three statements were marked “no,” including “The bootstrap object introduces an atemporal property — origin-lessness — into a temporal container that’s not built to hold it.” The metric’s reason is striking: “This statement discusses the nature of a bootstrap object but does not relate to bootstrap loops being category errors.” This is the metric being narrowly literal. The statement is directly relevant to understanding why bootstrap loops are category errors. In fact, origin-lessness is precisely the atemporal property that makes them ontologically misplaced. But the metric evaluated relevance at the statement level against the specific query phrasing, and the statement does not mention “bootstrap loops being category errors” in those exact terms.
Chunk 3 scored well: both statements were marked “yes” because the chunk explicitly connects bootstrap loops to the category error framing and to the physical/metaphysical distinction.
Test Finding: The 0.56 Relevancy score alongside the 0.0 Recall score and 0.58 Precision score now gives us a complete single-turn diagnostic picture. The retriever found the right section, but with imperfect coverage, imperfect ordering, and a chunking artifact contaminating one chunk. This isn’t a catastrophic failure (the chunks are all from the correct part of the document, after all), but it is a retrieval system operating at the edge of its capability for this specific question. The question’s two-part structure, requiring both the category error argument and the physical/metaphysical distinction, exceeded what k=3 retrieval with 600-character chunks could reliably surface in a single pass. The intervention is clear: increase k, and consider whether 600-character chunks are too small to preserve the argumentative sub-structure of a dense section like the bootstrap paradox closing.
With all four retrieval and generation metrics now in hand, we have the first half of the diagnostic picture. Before we look at G-Eval, it’s worth noting what the four scores together say: Faithfulness 0.5, Precision 0.58, Recall 0.0, Relevancy 0.56. That is a consistent pattern of moderate retrieval performance with a specific coverage gap: not a system that is fundamentally broken, but one that would benefit from targeted tuning before being deployed against this kind of multi-part philosophical question.
G-Eval (Epistemic Register)
Here is the output:
============================================================
G-EVAL (Epistemic Register)
============================================================
**************************************************
EpistemicRegister [GEval] Verbose Logs
**************************************************
Criteria:
Evaluate whether the response accurately preserves the
epistemic register of the essay's argument about bootstrap paradoxes and divine aseity.
The essay presents the parallel between bootstrap objects and divine aseity as structurally illuminating rather than probative. It treats bootstrap loops as category errors -- not just improbable but conceptually misplaced -- because they introduce an atemporal property (origin-lessness) into a temporal container not built to hold it.
The essay does not argue that this parallel proves God's existence. It explicitly ends by saying the question "points outside the system entirely" rather than resolving it.
Responses that accurately convey this distinction between structural resonance and theological proof should score higher. Responses that present the bootstrap / divine aseity parallel as a strong argument for God's existence should score lower.
Evaluation Steps:
[
"Evaluate whether the response accurately distinguishes between the structural resonance of bootstrap paradoxes and divine aseity, and their lack of probative value for theological arguments.",
"Assess if the response correctly identifies that the essay treats bootstrap loops as category errors due to introducing atemporal properties into temporal contexts.",
"Check if the response acknowledges that the parallel is presented as structurally illuminating but not as a strong argument for God's existence, and does not resolve the question of divine aseity.",
"Verify that the response explicitly notes the distinction between structural resonance and theological proof, as emphasized in the essay."
]
Rubric:
None
Score: 0.7
Reason: The response accurately distinguishes between physical and metaphysical impossibility and correctly identifies that bootstrap loops are category errors. However, it does not explicitly mention the structural parallel with divine aseity or acknowledge that the essay treats this as structurally illuminating but not probative for God's existence.
The highest single-turn score so far, and the most nuanced result. The deduction is precise and traceable.
The evaluation steps the metric generated from our criteria asked four things: did the response distinguish the structural resonance from its lack of probative value, did it correctly identify bootstrap loops as category errors, did it acknowledge that the parallel is illuminating but not probative, and did it explicitly note the distinction between structural resonance and theological proof? The first two were satisfied. The third and fourth were not.
Looking back at the generated response, this makes sense. The model correctly articulated the category error argument and the physical/metaphysical distinction. It used chunk 2’s divine aseity content (“a bootstrap object is a crude, finite, temporal imitation of this property: an uncaused thing inside a universe that is supposed to contain only caused things”) but used it to explain the category error argument rather than to explicitly draw the structural parallel and then characterize that parallel as illuminating rather than probative. The closing of the essay (“it points outside the system entirely”) was not in the retrieved chunks and therefore was not in the response.
This is the G-Eval failure mode we identified in the coherence essay post: the response was factually accurate about what the essay argues, but it didn’t represent the essay’s epistemic posture completely. There the deduction came from presenting speculative positions as settled conclusions. Here the deduction comes from omission: the response correctly avoided overstating the theological argument, but it also didn’t articulate what the essay’s actual epistemic register is. It said nothing about the parallel being “structurally illuminating rather than probative.” It didn’t close with the essay’s own closing gesture.
The distinction is subtle but meaningful. A response that overstates the essay’s theological argument fails G-Eval by commission. A response that correctly handles what it received but omits the epistemic characterization fails G-Eval by omission. Both produce deductions, but they point to different interventions. The first suggests the model needs stronger prompting constraints to preserve hedging language. The second suggests the retriever needs to surface the chunk containing the essay’s closing passage (the “points outside the system entirely” framing) so the epistemic register can be fully represented.
Test Finding: G-Eval failures that trace to omission rather than commission are retrieval problems wearing generation clothes. The model didn’t misrepresent the essay’s register; it simply lacked the chunk needed to fully represent it. This is why Recall matters alongside G-Eval: a 0.0 Recall score combined with a 0.7 G-Eval score is a signal that the response’s epistemic incompleteness traces to missing coverage rather than the model’s tendency to overstate. Increase Recall, and G-Eval should rise with it.
The five single-turn scores together now form a coherent diagnostic picture:
- Faithfulness 0.50: generation synthesis outran its retrieved sources
- Precision 0.58: the most relevant chunk was ranked last
- Recall 0.00: the retrieval set lacked coverage of the full expected answer
- Relevancy 0.56: moderate noise including a chunking artifact
- G-Eval 0.70: epistemic register partially preserved but incomplete due to missing coverage
Every single-turn failure traces to the same root cause: retrieval coverage and ordering. The model handled what it received reasonably well. The retriever didn’t give it everything it needed, in the right order, to produce a complete and precisely grounded response. That is exactly the diagnosis the framework was designed to produce.
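This read-the-pattern step can itself be partially mechanized. A hedged sketch of a rule-based interpreter over the single-turn scores (the thresholds and branch messages are illustrative, not canonical entries from the diagnostic framework):

```python
# Map a single-turn score profile to a first-guess diagnosis.
# Thresholds are illustrative; tune them for your own system.
def diagnose(scores: dict[str, float]) -> str:
    if scores["recall"] < 0.2 and scores["precision"] > 0.4:
        return ("Coverage gap: retriever found the right neighborhood "
                "but missed required chunks. Try increasing k.")
    if scores["precision"] < 0.4 and scores["recall"] > 0.6:
        return "Ordering problem: right chunks retrieved but ranked poorly."
    if scores["faithfulness"] < 0.4 and scores["recall"] > 0.6:
        return "Generation problem: model outran well-covered context."
    return "Mixed profile: inspect the verbose logs per metric."

run = {"faithfulness": 0.50, "precision": 0.58,
       "recall": 0.00, "relevancy": 0.56, "geval": 0.70}
print(diagnose(run))  # the coverage-gap branch fires for this run
```

Rules like these are no substitute for reading the verdicts, but they make the cross-metric pattern explicit enough to gate in a test suite.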
Part 3: Conversational Evaluation — Follow-up Retrieval and Response
The follow-up question (“You mentioned divine aseity in that context. Does the essay treat the parallel between bootstrap objects and divine aseity as a philosophical argument for God’s existence, or as something weaker than that?”) retrieved three chunks for its context:
--- Chunk 1 ---
a self-existent object. Not morally. Not ontologically. But formally: they are uncaused within the system they inhabit. They have no genealogy. They have no causal history. They exist as a brute fact inside a universe that is otherwise built on the grammar of cause and effect. Classical metaphysics r...
--- Chunk 2 ---
outside Godself for existence. God is not caused by anything; God simply is. A bootstrap object is a crude, finite, temporal imitation of this property: an uncaused thing inside a universe that is supposed to contain only caused things...
--- Chunk 3 ---
to prevent atemporal phenomena from emerging inside it. Origin-lessness is not a physical property; it's a metaphysical one. And temporal physics resists metaphysical leakage; not physically, by erecting a barrier you could detect with instruments, but conceptually, by making such loops category err...
Two of these three chunks also appeared in the primary question’s retrieval set: chunks 2 and 3. The follow-up question’s vocabulary overlaps heavily with the primary question’s vocabulary, which is why the retriever returned much of the same material. Chunk 1 is new and is actually the most useful for this specific question: it contains the “not morally, not ontologically, but formally” framing and the “classical metaphysics” reference, which directly set up the divine aseity comparison. The retriever converging on similar chunks across both turns is a pattern worth noting: it means the conversational context is not significantly broadening the retrieval set, which limits how much new grounding the model can draw on as the conversation progresses.
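That convergence can be measured directly, which is a cheap way to monitor whether conversational context is broadening retrieval across turns. A sketch using chunk identifiers (the IDs are hypothetical labels standing in for the actual chunks):

```python
# Jaccard overlap between two turns' retrieval sets. High overlap means
# the follow-up question is not broadening the model's grounding.
def retrieval_overlap(turn_a: set[str], turn_b: set[str]) -> float:
    return len(turn_a & turn_b) / len(turn_a | turn_b)

primary  = {"chunk_causal", "chunk_aseity", "chunk_category_error"}
followup = {"chunk_formal", "chunk_aseity", "chunk_category_error"}
print(retrieval_overlap(primary, followup))  # 0.5 -> 2 shared of 4 total
```

Tracking this number across a conversation would surface retrieval stagnation before any LLM-judged metric does.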
The follow-up response is the most interesting generated output in the script. The model correctly answered the trap question. Turn 1 in Conversation C of the previous post was the same trap — does the essay prove God’s existence — and the live model scored 0.8 on epistemic humility there. Here, the response is more precise than that:
FOLLOW-UP RESPONSE:
Okay, let's analyze this question step-by-step.
1. **Recall the Essay’s Framing:** The essay repeatedly establishes the bootstrap object as a *crude* and *finite* imitation of God’s aseity – the property of existing without cause. It's a deliberately flawed analogy, highlighting the problems with attempting to replicate this seemingly fundamental aspect of reality.
2. **Analyzing the Parallel:** The essay *does* draw a parallel between bootstrap objects and divine aseity, but it's not presenting a robust philosophical argument *for* God’s existence. Instead, it's using the paradox to expose the limitations of our attempts to conceptualize aseity.
3. **Weakness of the Argument:** The essay explicitly frames the bootstrap object as a “crude, finite, temporal imitation.” This suggests the comparison isn’t meant to *prove* God’s existence but to demonstrate *how* a flawed attempt at mirroring aseity leads to paradox. It's a thought experiment, not a theological argument.
4. **The Core Point - Metaphysical Resistance:** The key takeaway isn’t “therefore God exists.” It's that the universe "resists metaphysical leakage"—it resists introducing atemporal properties into a temporal framework. The bootstrap object’s existence represents a failure of this resistance, revealing a tension within our models of reality.
5. **Risk of Misinterpretation:** A potential risk is interpreting the essay as subtly advocating for aseity as a core feature of reality simply because it’s the only thing that can avoid paradox. However, the essay is careful to state it's a metaphysical concept that resists physical instantiation.
6. **Answer to the User's Question:** Therefore, the essay treats the parallel between bootstrap objects and divine aseity as a **demonstration of a problem, not as an argument for God's existence.** It’s a weaker form of argument—an illustration of the difficulty, perhaps impossibility, of replicating aseity within a temporal universe. It highlights the resistance the universe exhibits to the introduction of origin-lessness.
Do you want me to elaborate on the specific features that make this parallel a "crude" or "finite" one, or perhaps discuss the implications of this resistance for our understanding of causality?
That characterization is accurate and well-hedged. The model correctly identified the parallel as a thought experiment rather than a theological argument, used the essay’s own phrase “crude, finite, temporal imitation,” and distinguished the metaphysical resistance point from a claim about God’s existence. The history management is also visibly working: the response opens by recalling the essay’s framing from the prior turn rather than starting fresh.
Two things are worth flagging before the conversational metrics score this.
- First, the model again closes with an offer to elaborate (“Do you want me to elaborate on the specific features…”) which is a stylistic tendency of the ts-reasoner model that has appeared in every generated response in the synthesis script.
- Second, and more substantively, the response describes the parallel as a “demonstration of a problem” and an “illustration of difficulty.” Both are accurate, but slightly more instrumentalist than the essay’s framing. The essay presents the parallel as structurally illuminating in its own right, not merely as a demonstration of a problem. Whether the conversational G-Eval metric catches that subtle shift is exactly what the scores will tell us.
Conversation Completeness
Here’s the section for this output:
User Intentions:
[
"User seeks clarification and deeper explanation of the bootstrap paradox in relation to physical and metaphysical impossibility, and its connection to divine aseity"
]
Verdicts:
[
{
"verdict": "yes",
"reason": null
}
]
Score: 1.0
Reason: The score is 1.0 because there are no incompletenesses provided, indicating that the LLM's response fully addressed the user's intention of seeking clarification and deeper explanation of the bootstrap paradox in relation to physical and metaphysical impossibility, and its connection to divine aseity.
A perfect completeness score, and the user intention the metric extracted is worth reading carefully. The metric synthesized a single unified goal from the two-turn exchange: “User seeks clarification and deeper explanation of the bootstrap paradox in relation to physical and metaphysical impossibility, and its connection to divine aseity.” That’s an accurate distillation of what the two questions together were trying to establish.
The 1.0 is consistent with the pattern we saw in the conversational evaluation post. Completeness evaluates whether the user’s goal was addressed, not whether each turn was optimally grounded. The first response explained the category error argument and the physical/metaphysical distinction. The second response correctly characterized the divine aseity parallel as structurally illuminating rather than probative. Together they addressed the unified goal. Further, the history management worked, as evidenced by the second response explicitly building on what was established in the first.
Notice something important about what the verbose output shows: the full turn structure including the retrieval context attached to each assistant turn. This is the retrieval_context field on the Turn objects that we populated in the live case construction, and it’s now visible in the completeness metric’s turn log. The metric can see not just what the model said but what it was working from.
For the completeness metric, this context is informational rather than evaluative: it evaluates the conversation’s goal achievement rather than its grounding. But for a more sophisticated conversational evaluation setup, having retrieval context attached to each turn opens the door to metrics that can assess whether the conversation’s completeness was achieved through good grounding or through synthesis beyond the retrieved material.
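The shape of a turn with attached retrieval context is simple. Here is a plain-dataclass sketch of the structure the verbose log reflects (not DeepEval’s actual `Turn` class, just the fields that matter here):

```python
from dataclasses import dataclass, field

# Minimal stand-in for a conversational turn carrying its own grounding.
@dataclass
class Turn:
    role: str                                        # "user" or "assistant"
    content: str                                     # the turn's text
    retrieval_context: list[str] = field(default_factory=list)

turns = [
    Turn("user", "Does the essay treat bootstrap loops as category errors?"),
    Turn("assistant", "Yes: it frames them as ontologically misplaced...",
         retrieval_context=["...making such loops category errors..."]),
]
# A turn-aware metric can now see both the answer and its grounding.
assert turns[1].retrieval_context
```

With grounding attached per turn, a custom metric could score each assistant turn’s faithfulness against its own context rather than only judging the conversation as a whole.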
The contrast with the single-turn scores is exactly what the synthesis framework predicts.
- Faithfulness 0.5, Precision 0.58, Recall 0.0, Relevancy 0.56: all moderate or poor, all tracing to retrieval gaps and ordering problems.
- Completeness 1.0: the conversation succeeded at the level of user intent despite those retrieval limitations.
This is the same pattern we identified in Conversation A of the conversational evaluation post: completeness and retrieval quality can diverge significantly, and both matter for a complete picture of system performance.
Conversational G-Eval (Metaphysical Distinction)
Here is the output for this section:
============================================================
CONVERSATIONAL G-EVAL (Metaphysical Distinction)
============================================================
**************************************************
MetaphysicalDistinction [Conversational GEval] Verbose Logs
**************************************************
Criteria:
Evaluate whether the conversation correctly maintains the distinction between two kinds of impossibility the essay draws:
1. Physical impossibility: the universe enforcing consistency by blocking paradoxical causal loops through mechanisms like chronology protection and retrocausal instability.
2. Metaphysical impossibility: bootstrap loops being category errors because they introduce origin-lessness -- an atemporal property -- into a temporal system not structured to contain it.
The essay treats these as related but distinct. Physical impossibility operates through causal mechanisms. Metaphysical impossibility operates through conceptual structure. A bootstrap loop is not just causally unstable; it is ontologically misplaced.
Conversations that maintain this distinction should score higher. Conversations that collapse the two into a single undifferentiated concept of impossibility should score lower.
Evaluation Steps:
[
"Evaluate whether each turn clearly differentiates between physical and metaphysical impossibility in its Role and Content.",
"Assess if the Content of a turn accurately reflects the Role assigned, ensuring that the distinction is maintained throughout the conversation.",
"Analyze the overall flow of the conversation to determine if it consistently upholds the essay's distinction without collapsing the two types of impossibility into one concept.",
"Score each turn based on how well it adheres to the criteria, considering both the Role and Content in relation to maintaining the distinction."
]
Rubric:
None
Score: 0.7
Reason: The assistant effectively differentiates between physical and metaphysical impossibility in its content, aligning well with the user's questions. However, it could have more explicitly stated the distinction in earlier turns to better adhere to the evaluation steps.
A 0.7 on the conversational G-Eval, and the reasoning points to something precise: the distinction was present but not stated explicitly enough in the earlier turns. The metric’s evaluation steps asked whether each turn clearly differentiated between the two kinds of impossibility, and whether the conversation maintained that distinction throughout without collapsing the two into a single undifferentiated concept.
The first turn partially satisfied this: it distinguished physical from metaphysical impossibility and used the category error framing correctly. But the distinction was introduced somewhat late in the first response, after the model had already framed the argument in terms of the causal/erasure framing from chunk 1. The second turn focused on the divine aseity parallel and the epistemic register question rather than returning to reinforce the physical/metaphysical distinction explicitly.
The 0.3 deduction reflects the metric’s finding that the distinction was carried through the conversation implicitly rather than explicitly: present in the substance of what was said, but not foregrounded in a way that would make it unambiguous to a reader following the exchange without prior knowledge of the essay.
This is a different kind of conversational failure mode from the ones we identified in the previous post. In Conversation B, the semantic precision failure came from cross-register conflation: the model let analogies between the three kinds of consistency blur the distinctions the essay draws. Here the failure is not conflation but insufficient articulation: the model maintained the distinction internally but did not state it with enough clarity and repetition across both turns to satisfy the metric’s requirement that each turn clearly differentiate the two.
The distinction between these two failure modes matters for intervention. A conflation failure suggests the model needs stronger prompting constraints around preserving distinctions when using cross-register analogies. An articulation failure suggests the model needs prompting that encourages it to state key distinctions explicitly and early, particularly in the first turn of a conversation where that distinction is the primary thing being asked about.
Test Finding: Conversational G-Eval scores that fall short due to insufficient articulation rather than active conflation represent a gentler failure mode than semantic drift, but one that is still meaningful for systems where precision of expression matters.
This is really important! Someone following this conversation without having read the essay would come away with a correct general impression but might not be able to reproduce the physical/metaphysical distinction precisely. For a teaching or reference system, that gap matters. For a quick-answer system, it might not. The appropriate threshold depends on the deployment context, which is exactly why G-Eval’s custom criteria are more useful than fixed metrics for this kind of evaluation.
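One way to make that context-dependence concrete is to gate the same score against different thresholds per deployment profile. A sketch with illustrative numbers (these are not recommendations, just a way to encode the teaching-versus-quick-answer distinction):

```python
# Per-deployment thresholds for the conversational G-Eval score.
# The numbers are illustrative, not recommendations.
THRESHOLDS = {
    "teaching": 0.85,      # distinctions must be explicit and reproducible
    "reference": 0.80,
    "quick_answer": 0.60,  # a correct general impression is enough
}

def passes(score: float, profile: str) -> bool:
    return score >= THRESHOLDS[profile]

score = 0.70
print(passes(score, "quick_answer"))  # True
print(passes(score, "teaching"))      # False
```

The same 0.70 passes one deployment and fails another, which is the whole argument for custom criteria over fixed metrics.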
Reading the Summary
The synthesis summary prints all seven scores alongside the diagnostic framework in one place:
Single-Turn Metrics:
Faithfulness : 0.50
Contextual Precision : 0.58
Contextual Recall : 0.00
Contextual Relevancy : 0.56
G-Eval (Register) : 0.70
Conversational Metrics:
Completeness : 1.00
Conv G-Eval (Dist.) : 0.70
This table is the whole point of the synthesis script. Not any individual score, but the pattern across all seven at once. And the pattern here is unusually clean: it tells a coherent story with a single root cause.
Look at the single-turn metrics first. Faithfulness 0.50, Precision 0.58, Recall 0.00, Relevancy 0.56. Three of the four cluster in the 0.50-0.58 range, and Recall is a complete miss. This is a cluster pattern: moderate scores across Precision, Relevancy, and Faithfulness, with Recall at zero. Such a pattern is the signature of a retriever that found the right section but lacked the coverage to support a complete answer. The closest diagnostic framework entry is “Low Recall + High Relevancy: retriever found relevant chunks but missed some,” although here Relevancy is only moderate rather than high, which suggests the retriever also introduced some noise alongside the coverage gap.
G-Eval at 0.70 tells us the response handled what it did receive reasonably well on epistemic register. The deduction came from omission (the missing chunk about the essay’s closing gesture) rather than from the model overstating the theological argument. This is consistent with the Recall score: the model could not represent what it was not given.
The conversational metrics then complete the picture. Completeness at 1.00 confirms the conversation addressed the user’s goal despite the retrieval limitations. Conversational G-Eval at 0.70 confirms the physical/metaphysical distinction was maintained but not articulated explicitly enough across both turns.
The unified diagnosis, and thus our primary test result, is that retrieval coverage is this system’s primary constraint. Every single-turn metric failure (the 0.50 Faithfulness, the 0.58 Precision, the 0.00 Recall, the 0.56 Relevancy) traces to k=3 retrieval being insufficient for a two-part question targeting a dense section of the essay. The model handled what it received competently: the conversational metrics scored 1.00 and 0.70 despite the retrieval limitations, which tells us the generation side isn’t the problem. If you increased k to 5 or 6 and reran the script, you would expect Recall to rise significantly, Faithfulness and Precision to follow, and G-Eval to improve as the missing closing passage becomes available to the model. The Conversational G-Eval articulation gap might also close if the model has richer context to draw on in its first turn.
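The prediction about k is easy to reason about in miniature. The sketch below uses invented chunk IDs and a toy `recall_at_k` helper, not the real retriever; it only shows why widening k from 3 to 6 moves Recall when the needed chunks sit just past the cutoff.

```python
# Toy illustration (invented chunk IDs): why raising k lifts Recall
# when the retriever ranks the needed chunks just below the cutoff.

def recall_at_k(retrieved, needed):
    """Fraction of required chunks that appear in the retrieved list."""
    return len(needed & set(retrieved)) / len(needed) if needed else 1.0

# Suppose a complete answer needs chunks 2, 7, and 9, and the ranked
# list puts them at positions 4-6 -- exactly the k=3 failure mode above.
ranked = [1, 3, 5, 2, 7, 9]
needed = {2, 7, 9}

print(recall_at_k(ranked[:3], needed))  # k=3 -> 0.0, the score we saw
print(recall_at_k(ranked[:6], needed))  # k=6 -> 1.0 once coverage widens
```

Real retrievers rarely behave this cleanly, but the mechanism is the same: a coverage failure at small k can coexist with decent ranking of what was retrieved.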
I’ll state again: this is the framework working exactly as designed. Seven scores, one diagnosis, one clear intervention. Not because the numbers are neat, but because the relationships between them are interpretable. And those relationships are only visible when all seven metrics run together against the same question.
The diagnostic interpretation block printed at the end of the script output is there precisely for this moment. A reader running the script in their own environment, without the blog post in front of them, can look at their scores, find the pattern that matches, and identify the indicated intervention without needing to refer back to the series. The framework travels with the script.
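A diagnostic block like that can be as simple as a few pattern matches over the score dict. The thresholds, pattern names, and the `diagnose` helper below are assumptions for illustration; the real script’s framework entries may be worded and ordered differently.

```python
# Hedged sketch of a diagnostic interpretation block: map a score
# pattern to a likely root cause. Thresholds here are illustrative.

def diagnose(scores):
    """Return the diagnostic framework entry matching the score pattern."""
    if scores["recall"] < 0.2 and scores["relevancy"] >= 0.5:
        return ("Low Recall + moderate-to-high Relevancy: retriever found "
                "relevant chunks but missed some -- widen coverage (raise k).")
    if scores["precision"] < 0.5 <= scores["recall"]:
        return ("Low Precision + adequate Recall: the right chunks were "
                "retrieved but buried -- fix ranking.")
    if scores["faithfulness"] < 0.5 <= scores["recall"]:
        return ("Low Faithfulness despite good retrieval: generation is "
                "drifting from context -- tighten prompt constraints.")
    return "No single dominant pattern: inspect the metrics individually."

print(diagnose({"faithfulness": 0.50, "precision": 0.58,
                "recall": 0.00, "relevancy": 0.56}))
```

With the run’s actual scores, the first branch fires, which is the coverage diagnosis the post reaches by hand.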
What All This Has Told Us About Testing
This series began with a simple question: how do you know if an LLM-based system is actually working? Not just working in the sense of producing fluent text. After all, that bar is easy to clear. Rather, working in the sense of answering faithfully, retrieving correctly, maintaining coherence across a conversation, and representing its source material honestly. Those are harder questions, and answering them requires more than running a system and reading its output.
What the series established, across eight metrics and four documents, is that LLM evaluation is fundamentally a diagnostic practice rather than a scoring exercise. A single metric score tells you that something is wrong. The pattern across multiple metrics tells you what’s wrong and where to look for the fix. That distinction matters enormously in practice, because the interventions for different failure modes are completely different.
A retrieval ordering failure requires changes to chunk size, embedding strategy, or retrieval ranking. A generation faithfulness failure requires changes to prompt constraints or model selection. A conversational drift failure requires changes to history management or prompting guidance around preserving distinctions. None of these interventions helps with the others. Knowing which one applies is the difference between productive debugging and guesswork.
The series also established that document type shapes the evaluation strategy.
- The warp drive paper and the mass extinction paper, both technical documents with discrete factual claims, stressed Faithfulness and Contextual Precision most severely.
- The coherence essay, an argumentative document with recurring vocabulary across sections, stressed Recall and Relevancy.
- The time travel essay, a philosophical document with cumulative argument structure, required the conversational metrics to expose failure modes that single-turn evaluation simply couldn’t see.
A RAG system deployed against a mixed knowledge base will encounter all three document types, which means a complete evaluation suite needs all of these metrics: not as a checklist to run once, but as a diagnostic toolkit to reach into selectively based on what the system is being asked to do.
Perhaps the most important finding across the series is the one that’s easiest to overlook: the gap between local success and global success. A system can retrieve relevant chunks, generate faithful responses, answer each question correctly in isolation, and still fail at the level of a multi-turn conversation. The conversational evaluation post demonstrated this directly: Conversation A scored 1.0 on completeness even though the retriever pulled the wrong chunks for turn 4, because the model compensated successfully enough to address the user’s goal.
The synthesis script demonstrated the inverse: three single-turn metrics in the 0.5-0.6 range and Recall at zero, but 1.00 on conversational completeness, because the conversation succeeded at the level of user intent despite imperfect grounding at every individual turn. These two things, turn-level grounding quality and conversation-level goal achievement, are genuinely different properties of a system, and optimizing for one does not automatically improve the other.
The evaluation framework that emerged from the series can be stated simply.
- Before you interpret any metric score, ask which failure mode it could be detecting.
- Before you act on any metric score, ask what the adjacent metrics say, because the combination is more informative than any individual number.
- Before you conclude that a system is working or broken, ask whether you’re evaluating it against the right document type with the right metric combination.
- Before you deploy any evaluation result as a conclusion, ask whether the failure traces to retrieval, generation, ordering, coverage, register, or conversational coherence, because each of those is a different problem requiring a different fix.
That is what eight metrics across four documents and several thousand lines of evaluation output have produced: not a set of scores, but a way of reading scores. A diagnostic sensibility rather than a checklist. That sensibility is what makes evaluation useful rather than merely numerical. And it is, in the end, what testing is for.