AI and Testing: Evaluating Conversations

In the previous posts in the DeepEval series, we built up a diagnostic framework for evaluating RAG systems, covering Faithfulness, Contextual Precision, Contextual Recall, Contextual Relevancy, and G-Eval. All of those metrics operate on single turns: one question, one retrieval context, one response, one score. In this post we’ll move into different territory: conversational evaluation.

In doing this, we’re also going to switch documents. Instead of the philosophical essay on colliders and cosmology, fun as that might be, we’ll be evaluating a new piece on time travel, causality, and the consistency of reality.

Why Conversational Evaluation Is Different

Most LLM evaluation work focuses on individual question-answer pairs. That focus makes sense as a starting point because it isolates variables cleanly: you can measure whether the retriever found the right chunks, whether the model answered faithfully, whether the response was relevant. But real deployments are rarely single-turn. Users ask follow-up questions. They reference things said earlier. They probe inconsistencies. They push on weak points. And when that happens, a new set of failure modes appears that single-turn metrics simply can’t see.

A model that handles each question in isolation might answer turn 1 perfectly, answer turn 2 perfectly, and still produce a turn 3 response that contradicts turn 1. Why? Because it treated each turn as fresh context rather than as part of a continuous exchange. Or it might maintain factual consistency across turns but gradually drift in its characterization of the source material, becoming more assertive about a claim that the document itself hedges carefully. Or it might use a term correctly in turn 1 and conflate it with a related but distinct concept by turn 3, having lost track of a distinction the source material drew explicitly.

These are conversational failure modes, and they require conversational metrics to detect. DeepEval provides exactly these through ConversationCompletenessMetric and ConversationalGEval. The first evaluates whether a conversation fully addressed what it should across all its turns. The second extends the G-Eval approach — custom criteria defined in natural language — to the conversational context, evaluating properties of the exchange as a whole rather than any single response.

Our New Document Test Case

The previous posts used a warp drive paper, a mass extinction paper, and a philosophical essay on colliders and cosmology as their source documents. For conversational evaluation we need a document with specific properties that are harder to exploit in single-turn testing. We need claims that depend on earlier claims, so that follow-up questions can reference something established in a previous turn. We need terminology that shifts meaning across sections, so we can test whether the model tracks semantic distinctions across turns. And we need a mix of factual and interpretive content, so we can evaluate both factual consistency and fidelity to the source’s own argumentative register.

In this context, “register” refers to the specific style, tone, and linguistic choices (vocabulary, sentence structure, formality level) that an author (in this case, me!) uses to construct their argument.

Given all this, I set out to create a large test case. The document we’ll use is my “A Message from a Future That No Longer Exists: Time Travel, Paradox, and the Consistency of Reality.” It begins with a deceptively simple thought experiment: can you use a time machine to win the lottery? That thread is then followed through chaos theory, retrocausal physics, Gödel’s incompleteness theorems, the philosophy of memory and consciousness, and finally a dip into classical theology. The argument is genuinely cumulative: each section builds explicitly on the previous one, which means a model that “resets” between turns will produce responses that are individually plausible but mutually inconsistent.

If you’ve been following along with the code in these posts, you should have all the dependencies and models you’ll need.

Three Conversations, Three Failure Modes

The script we’ll look at in this post runs three separate four-turn conversations against the essay, each designed to expose a distinct conversational failure mode. Before looking at code, it’s worth being clear about what each conversation is testing and why.

Conversation A: Consistency Across a Conceptual Chain

This conversation tests whether the model maintains the essay’s chaos-sensitivity argument across four turns without drifting or contradicting itself. The turns are designed as a progressive probe: the first establishes the basic problem, the second introduces a mitigating factor the essay directly addresses (remote purchase), the third asks whether there’s any escape from the problem, and the fourth presses on the implication of the one escape the essay acknowledges.

  • Turn 1: What is the basic problem with using a time machine to win the lottery?
  • Turn 2: What if the message is sent remotely and I buy the ticket online? Surely the physical distance reduces the problem?
  • Turn 3: Does that mean the lottery scheme is completely hopeless, or are there conditions under which it could work?
  • Turn 4: You mentioned Novikov self-consistency. But doesn’t that mean I was always going to win — which means I didn’t really change anything?

The metric for this conversation is ConversationCompletenessMetric. A model that drops the chaos-sensitivity thread after turn 1 and treats each subsequent turn as a fresh question will miss the implication that turn 4 is asking about. A model that maintains the thread will carry the argument forward coherently across all four turns.

Conversation B: Semantic Drift Across Sections

This conversation targets the word “consistency,” which the essay uses in three distinct registers across different sections: physical consistency in the Novikov self-consistency principle, logical consistency in the Gödel incompleteness sense, and psychological consistency in the memory coherence sense. The essay treats these as analogous but distinct. A model that conflates them will produce a turn 4 answer that either collapses the distinctions or reasserts them without having tracked the differences through turns 2 and 3.

  • Turn 1: What does the essay mean when it says the universe enforces consistency?
  • Turn 2: Is that the same thing as what Gödel was talking about with formal systems?
  • Turn 3: And how does memory fit into this? Is memory doing the same kind of consistency work as physics and logic?
  • Turn 4: Are these three kinds of consistency actually the same thing at different levels, or are they genuinely different things?

The metric for this conversation is ConversationalGEval with a semantic precision criterion. This is the vocabulary trap problem from the single-turn tests in the previous post, but extended across a conversation: the same word appears across multiple sections with shifting meaning, and the model must track those shifts rather than flattening them into a single undifferentiated concept.

Conversation C: Epistemic Register Across the Theological Section

This conversation tests whether the model preserves the essay’s speculative, exploratory register when handling its theological arguments. The essay is explicit that it is not making a proof; it frames the cosmic safety parameters argument as a metaphysical intuition, and the bootstrap paradox / divine aseity parallel as structurally illuminating rather than probative. A model that sharpens these into stronger claims misrepresents the essay’s actual posture even when every individual factual claim is accurate.

  • Turn 1: Does the essay argue that God’s existence is proven by the impossibility of time travel?
  • Turn 2: So it’s not making a proof. What kind of argument is it actually making?
  • Turn 3: The essay draws a distinction between bootstrap objects and divine aseity. Can you explain that distinction clearly?
  • Turn 4: Does the essay treat the bootstrap paradox / divine self-existence parallel as a strong argument, a weak analogy, or something in between?

Turn 1 is a trap, and thus a good test. A model that answers carelessly will overstate the essay’s claims. Turn 4 requires the model to have tracked the essay’s own epistemic hedging across the conversation and report it accurately. The metric is ConversationalGEval with an epistemic humility criterion, which is the same approach we used for G-Eval in the previous post, but now applied across a full four-turn exchange rather than a single response.

What Changes in the Script

The script follows the same three-part structure established in this series: controlled cases first, then live RAG cases, then a summary. The loading setup is identical to the previous post: WebBaseLoader with a SoupStrainer preserving section headers, a 600-character chunk size with 100-character overlap, Chroma for the vector store.

The differences are in the test case structure and the metric imports. Instead of LLMTestCase, conversational evaluation uses ConversationalTestCase wrapping a list of Turn objects. Each Turn has a role (either "user" or "assistant") and content. A four-turn conversation becomes eight Turn objects, alternating user and assistant.
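As a minimal sketch of that structure, the snippet below expands four question/answer pairs into the eight alternating turns and wraps them in a ConversationalTestCase. The helper names (interleave, build_test_case) and the placeholder answers are mine for illustration, not the script's. The DeepEval import sits inside the function so the pure interleaving step can be run on its own.

```python
from typing import List, Tuple


def interleave(qa_pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Expand (question, answer) pairs into alternating (role, content) pairs."""
    sequence = []
    for question, answer in qa_pairs:
        sequence.append(("user", question))
        sequence.append(("assistant", answer))
    return sequence


def build_test_case(qa_pairs, chatbot_role, user_description):
    """Wrap the alternating sequence in DeepEval's conversational types."""
    # Imported here so interleave() works even where DeepEval is not installed.
    from deepeval.test_case import ConversationalTestCase, Turn

    turns = [Turn(role=role, content=content)
             for role, content in interleave(qa_pairs)]
    return ConversationalTestCase(
        turns=turns,
        chatbot_role=chatbot_role,          # what the assistant is meant to be doing
        user_description=user_description,  # who is asking the questions
    )


# Placeholder content: the script's hand-crafted answers are not reproduced here.
demo_pairs = [(f"question {n}", f"answer {n}") for n in range(1, 5)]
demo_sequence = interleave(demo_pairs)
print(len(demo_sequence))  # 8
```

Keeping the interleaving separate from the DeepEval wrapping makes it easy to check the turn order before any judge model is involved.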

The chatbot_role and user_description fields are worth noting. These are optional on the ConversationalTestCase object, but they meaningfully improve scoring accuracy for a relatively dense document. They tell the judge model what kind of conversation it’s evaluating: what the chatbot is supposed to be doing and who is asking the questions. All of this helps the judge calibrate what “complete” and “epistemically humble” mean in this specific context.

The ConversationalGEval metric also uses a different parameter enumeration than regular GEval. Instead of LLMTestCaseParams, it uses TurnParams:
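A hedged sketch of what such a definition can look like follows. The criterion text is my own paraphrase of the semantic precision idea, not the wording used in the script, and judge_model is a hypothetical argument standing in for whatever judge you have configured.

```python
# Hypothetical criterion text; the script's actual wording is not shown in this post.
SEMANTIC_PRECISION_CRITERIA = (
    "Evaluate whether the assistant keeps physical, logical, and psychological "
    "consistency distinct across the conversation instead of conflating them."
)


def semantic_precision_metric(judge_model=None):
    """Build a conversational G-Eval metric that judges each turn's role and content."""
    # Imported lazily so this module loads even where DeepEval is not installed.
    from deepeval.metrics import ConversationalGEval
    from deepeval.test_case import TurnParams

    return ConversationalGEval(
        name="Semantic Precision",
        criteria=SEMANTIC_PRECISION_CRITERIA,
        # TurnParams rather than LLMTestCaseParams: the unit of evaluation is a turn.
        evaluation_params=[TurnParams.ROLE, TurnParams.CONTENT],
        model=judge_model,
        verbose_mode=True,
    )
```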

Managing Conversation History

The most important difference from single-turn RAG is how the live cases generate responses. In single-turn evaluation, each question is sent to the model with its retrieved context and nothing else. In conversational evaluation, the model needs the full prior exchange to avoid resetting between turns. The script handles this with a run_conversation helper that builds a history string from prior turns and prepends it to each new prompt:
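The following is a sketch of that idea under stated assumptions rather than the script's actual helper. The retrieve and generate arguments are hypothetical callables standing in for the vector store lookup and the model call, and the prompt layout is my own.

```python
from typing import Callable, Dict, List


def format_history(records: List[Dict]) -> str:
    """Render prior turns as a plain-text transcript."""
    lines = []
    for record in records:
        lines.append(f"User: {record['question']}")
        lines.append(f"Assistant: {record['answer']}")
    return "\n".join(lines)


def run_conversation(
    questions: List[str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> List[Dict]:
    """Answer each question with the prior exchange prepended to the prompt."""
    records: List[Dict] = []
    for question in questions:
        chunks = retrieve(question)        # retrieval sees only the current question
        history = format_history(records)  # empty string on the first turn
        prompt = (
            f"Conversation so far:\n{history or '(none)'}\n\n"
            "Context:\n" + "\n---\n".join(chunks) + "\n\n"
            f"Question: {question}"
        )
        answer = generate(prompt)
        records.append({"question": question, "answer": answer, "context": chunks})
    return records
```

Note that only generation sees the history in this sketch. Retrieval is still driven by the current question alone.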

Without this history management, the model treats each turn as fresh context and the conversational failure modes become invisible: the model won’t drift or contradict itself if it doesn’t remember what it said before. The history string is what makes the consistency and drift failure modes testable.

The live case ConversationalTestCase construction also differs slightly from the controlled cases. Because we want to pass the retrieval context for each assistant turn, the Turn objects are built with the context attached:
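Here is one way that could look, assuming each live turn has been recorded as a dict with question, answer, and context keys (a shape I am inventing for this sketch). The turn_factory parameter is likewise a hypothetical hook: by default it resolves to DeepEval's Turn, but any callable with the same keyword arguments will do, which makes the function easy to exercise without the library.

```python
def turns_from_records(records, turn_factory=None):
    """Convert recorded live turns into alternating user/assistant Turn objects."""
    if turn_factory is None:
        # Imported lazily so the function can be exercised with a stand-in factory.
        from deepeval.test_case import Turn
        turn_factory = Turn

    turns = []
    for record in records:
        turns.append(turn_factory(role="user", content=record["question"]))
        turns.append(
            turn_factory(
                role="assistant",
                content=record["answer"],
                # The chunks retrieved for this turn travel with the assistant reply.
                retrieval_context=record["context"],
            )
        )
    return turns
```

Only the assistant turns carry retrieval context here, since the user turns have nothing retrieved on their behalf.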

The Full Script

The complete script is available as evaluate_time_travel_essay.py. Here’s the overall shape of the imports and metric definitions before you run it:
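What follows is a rough sketch of that shape rather than an excerpt from the script. It shows the completeness metric and a measure-and-collect loop. The function names, the judge_model argument, and the (label, metric, test_case) triple layout are assumptions of mine. The DeepEval import is placed inside the function so the loop can be tried with any object that exposes measure() and score.

```python
def completeness_metric(judge_model=None):
    """Build the conversation-level completeness metric."""
    # Imported lazily so score_cases() can be used without DeepEval installed.
    from deepeval.metrics import ConversationCompletenessMetric

    return ConversationCompletenessMetric(
        threshold=0.5,
        model=judge_model,
        verbose_mode=True,  # prints intentions, verdicts, score, and reason
    )


def score_cases(cases):
    """Run each (label, metric, test_case) triple and collect the scores."""
    results = {}
    for label, metric, test_case in cases:
        metric.measure(test_case)      # DeepEval metrics store their result on the object
        results[label] = metric.score
    return results
```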

Once you have the script from the repository, run it directly (making sure, as always, that you are in your virtual environment):

python evaluate_time_travel_essay.py

Let’s look at what comes back.

I should note that I won’t reproduce the output in full, since that has gotten quite lengthy in previous posts. Here I’ll just cover the important bits and count on you to check your own output from running the script.

PART 1: CONTROLLED CASES

Conversation A — Completeness (controlled)

The controlled case for Conversation A gives us our baseline for what a complete, coherent response chain looks like. Before the metric evaluates anything, it prints the full turn structure — all eight turns alternating user and assistant — which confirms the conversation was constructed correctly and gives the judge model the complete exchange to work from.


User Intentions:
[
    "User seeks understanding of the limitations and implications
    of using a time machine to win the lottery"
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    }
]

Score: 1.0
Reason: The score is 1.0 because there are no incompletenesses provided,
indicating that all user intentions have been fully met by the LLM responses.

A perfect score of 1.0, and the reasoning is instructive: the metric first identifies what the user was trying to accomplish across the conversation — a single unified intention extracted from all four turns — and then evaluates whether the assistant responses collectively satisfied that intention. What it did not do was evaluate each turn independently. Instead, it read the conversation as a whole and asked: did this exchange leave the user’s goal unmet?
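As I understand DeepEval's documentation, the score is the share of extracted user intentions that received a "yes" verdict. The sketch below reproduces that arithmetic from a verdict list shaped like the output above. It is an illustration of the ratio, not the library's internal code, and raising on an empty list is my own choice.

```python
def completeness_score(verdicts):
    """Fraction of user intentions whose verdict is 'yes'."""
    if not verdicts:
        # Sketch-level choice: refuse to score a conversation with no intentions.
        raise ValueError("no verdicts to score")
    satisfied = sum(
        1 for verdict in verdicts
        if verdict["verdict"].strip().lower() == "yes"
    )
    return satisfied / len(verdicts)


# The controlled case above: one intention, one "yes" verdict.
print(completeness_score([{"verdict": "yes", "reason": None}]))  # 1.0
```

With a single extracted intention the score can only be 0.0 or 1.0, which is worth remembering when reading the completeness results in this post.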

This is the critical difference between ConversationCompletenessMetric and the single-turn metrics we’ve used before. Contextual Recall asked whether the retrieved chunks contained everything needed to answer a question. Completeness asks whether the conversation, taken as a whole, addressed what the user was trying to understand. A model could retrieve perfectly and generate faithfully on every individual turn, and still score poorly on completeness if it dropped a thread mid-conversation and never returned to it.

The 1.0 here tells us two things. First, the hand-crafted responses do in fact constitute a complete treatment of the user’s goal: the baseline is valid. Second, the metric is sensitive enough to identify that goal from the conversational structure rather than requiring it to be stated explicitly. Nobody in the conversation said “I want to understand the limitations and implications of using a time machine to win the lottery.” The metric inferred it from the pattern of questions and answers across all four turns.

That inference is what we’ll be testing against when the live RAG case runs. A model that resets between turns, or that answers each question in isolation without building on what came before, will produce a conversation that satisfies each question locally but fails the completeness test globally.

Conversation B — Semantic Precision (controlled)

Here is the result output:


Score: 1.0
Reason: The conversation effectively distinguishes between physical,
logical, and psychological consistency in Role and Content. Each turn
clearly explains the concept relevant to its role without conflating
it with others. The assistant provides a clear distinction and
integration of these concepts as required by the evaluation steps.

Another perfect score, and the verbose output you get here is worth reading carefully because it shows how ConversationalGEval operationalizes custom criteria differently from the single-turn GEval we used in the previous post.

If you look at your output, notice what the metric likely did with the criteria we wrote. It translated them into four concrete evaluation steps:

  • Does each turn clearly distinguish between the three types of consistency in role and content?
  • Is there any conflation of the three types?
  • Does the conversation maintain clear distinctions while exploring all three meanings?
  • Does each turn’s role and content reinforce the differentiation or integration appropriately?

This translation from criteria to steps is ConversationalGEval doing something the judge model really can’t do reliably without it: converting an abstract quality (“semantic precision”) into a checklist that can be applied systematically across the full exchange. The judge model then works through those steps against the eight turns of the conversation and produces a verdict.

The 1.0 confirms that our hand-crafted responses drew the distinctions correctly. Turn 2 treated physical and logical consistency as analogous but not identical. Turn 3 introduced psychological consistency as a third register of the same pattern without collapsing it into the first two. Turn 4 named all three explicitly and gave each its own domain. The metric tracked all of that.

There’s also something worth noting about the rubric field that you’ll see in your output: it should show “None”. ConversationalGEval supports an optional scoring rubric, which is essentially a more detailed breakdown of what different score ranges mean. However, I didn’t define one. When no rubric is present, the metric uses its own judgment about how to translate the evaluation steps into a score.

For the purposes of the example used in this post, leaving the rubric undefined keeps the criteria readable and the setup simple. For a production evaluation system, defining a rubric would give you more control over where score thresholds fall.
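If you wanted to try a rubric, a sketch might look like the following. The band boundaries and descriptions are invented for illustration. I am also assuming that ConversationalGEval accepts the same Rubric class that GEval does, with score ranges on a 0 to 10 scale that must not overlap, so check that against your installed version.

```python
# Hypothetical bands for a semantic precision rubric (0-10 scale).
RUBRIC_BANDS = [
    ((0, 3), "The three senses of consistency are treated as one concept."),
    ((4, 7), "Distinctions are drawn but some analogies blur them."),
    ((8, 10), "All three senses stay distinct across every turn."),
]


def validate_bands(bands):
    """Check that every band lies within 0-10 and that no two bands overlap."""
    previous_high = -1
    for (low, high), _ in sorted(bands, key=lambda band: band[0]):
        if not (0 <= low <= high <= 10):
            raise ValueError(f"band ({low}, {high}) is outside 0-10")
        if low <= previous_high:
            raise ValueError(f"band ({low}, {high}) overlaps the previous band")
        previous_high = high


def to_rubric(bands):
    """Convert validated bands to DeepEval Rubric objects (import path assumed)."""
    from deepeval.metrics.g_eval import Rubric

    validate_bands(bands)
    return [Rubric(score_range=score_range, expected_outcome=text)
            for score_range, text in bands]
```

The resulting list would be passed as the rubric argument when constructing the metric.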

One more thing this output confirms: the verbose mode for ConversationalGEval prints the full criteria and evaluation steps before the score, which is useful when you’re iterating on criteria. If the metric is scoring in unexpected ways, the printed steps tell you exactly how it decomposed your criteria, and whether that decomposition matches what you intended.

Conversation C — Epistemic Humility (controlled)

Here is the result from the output:


Score: 1.0
Reason: The assistant accurately reflects the essay's framing of cosmic
safety parameters as a metaphysical intuition rather than a scientific
proof, correctly interprets the structural analogy between bootstrap
paradoxes and divine aseity as exploratory, uses appropriate hedging
language, and maintains an epistemic register consistent with the
essay's tone.

A third perfect score, and this one is the most interesting of the three controlled baselines because the criterion being evaluated is the subtlest. Completeness asks whether a goal was satisfied. Semantic Precision asks whether distinctions were maintained. Epistemic Humility asks whether the model correctly represented not just what the essay argues but how confidently it argues it. That’s a question about register rather than content.

The evaluation steps the metric generated from our criteria are worth reading carefully:

  • Did the response accurately reflect the cosmic safety parameters argument as metaphysical intuition rather than scientific proof?
  • Did it correctly convey the bootstrap paradox / divine aseity parallel as exploratory rather than probative?
  • Did it use appropriate hedging language?
  • Did it maintain the essay’s epistemic register throughout?

That last step (“throughout”) is what makes this a conversational criterion rather than a single-turn one. A model could answer turn 1 correctly (no, the essay does not argue God’s existence is proven) and then drift in turn 4 toward treating the parallel as a stronger claim than the essay makes. The metric evaluates whether the register is maintained across the full exchange, not just at any individual point.

The 1.0 confirms our hand-crafted responses achieved this. Turn 1 explicitly refused the “proof” framing. Turn 2 named the architectural approach directly. Turn 3 explained the bootstrap / aseity distinction without overstating it. Turn 4 landed on “structurally illuminating rather than probative” and noted the essay ends with a gesture rather than a conclusion. The full conversational arc preserved the essay’s epistemic posture.

This sets up a meaningful comparison with the live RAG case. A model generating responses from retrieved chunks, without a curated understanding of the essay’s overall register, will be tempted to sharpen the argument, especially in response to the turn 1 trap question. Whether it resists that temptation across all four turns is exactly what the live case will reveal.

With all three controlled cases scoring 1.0, we have a clean baseline. Now let’s see what happens when the actual retriever and model take over.

PART 2: LIVE RAG SETUP

Setting Up the Live Retrieval

Before the live conversations run, the script loads the essay and reports what it’s working with:


Loading essay from: https://testerstories.com/files/ai_testing/message-from-a-future.html
Loaded 1 document(s).
Total characters: 35,098
Split into 70 chunks.

The contrast with the coherence essay from the previous post is immediate. That document came in at 65,808 characters and produced 132 chunks. This one is roughly half the size (35,098 characters and 70 chunks) at the same 600-character chunk size with 100-character overlap. The difference reflects the structure of the two documents. The coherence essay had thirteen sections with sustained multi-paragraph development in each. This essay has twelve sections, but several of them are tighter and more argumentatively compressed: the short-circuit section, the orphaned information section, and the bootstrap paradox section all make their points in fewer words than their equivalents in the coherence essay.

Seventy chunks from a 600-character split means the retriever is working with reasonable granularity. With k=3, we’re asking it to surface the three most relevant chunks out of 70 (roughly the top 4% of the document) for each turn of each conversation. That’s a proportionally larger slice than the same k would be against the coherence essay’s 132 chunks (roughly 2%), which may help or hurt depending on how well the section headers and vocabulary guide the embeddings.
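The arithmetic behind these figures is easy to check. The sketch below estimates chunk counts with a naive fixed-window model (each new chunk starts 500 characters after the previous one) and computes what share of the chunks k=3 covers. The real splitter breaks on separators rather than at fixed offsets, so treat the estimate as a sanity check, not a prediction.

```python
import math


def estimated_chunks(total_chars, chunk_size=600, overlap=100):
    """Chunk count for a fixed sliding window with the given size and overlap."""
    step = chunk_size - overlap
    return max(1, math.ceil((total_chars - overlap) / step))


def selection_fraction(k, n_chunks):
    """Share of the chunk pool returned when the retriever takes the top k."""
    return k / n_chunks


documents = [
    ("time travel essay", 35_098, 70),   # (label, characters, reported chunks)
    ("coherence essay", 65_808, 132),
]
for label, chars, reported in documents:
    print(
        f"{label}: estimated {estimated_chunks(chars)} chunks, "
        f"reported {reported}, k=3 covers {selection_fraction(3, reported):.1%}"
    )
```

Under this model the estimates come out at 70 and 132, matching the reported counts, and three chunks cover about 4.3% of this essay against about 2.3% of the coherence essay.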

The conversational structure introduces an additional variable: each turn retrieves independently based on the turn’s question alone, without any awareness of what was retrieved for prior turns. Whether that per-turn retrieval coherently supports a four-turn conversation is part of what we’re about to find out.

Conversation A — Live RAG (Completeness)

Here is the output I got:


User Intentions:
[
    "User seeks clarification on the implications of using a time
    machine to win the lottery according to the essay's arguments"
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": ""
    }
]

Score: 1.0
Reason: The score is 1.0 because there are no incompletenesses provided,
indicating that the LLM's response fully addressed the user's intention
of seeking clarification on the implications of using a time machine to
win the lottery according to the essay's arguments.

The live case scored 1.0 on completeness, matching the controlled baseline. But the path to that score is considerably more interesting than the number suggests, and the verbose output reveals details worth examining carefully.

What the retriever found: The retrieval was largely appropriate. Turn 1 pulled the opening puzzle framing and the two naive assumptions. Turn 2 pulled the power grid cascade and the timing sensitivity sections. Turn 3 pulled the grandfather paradox comparison and the chaos amplifier argument. Turn 4, however, is where things got interesting in my output: the retriever pulled chunks from the memory and imago Dei sections rather than from the Novikov discussion. The chunks mention “the self-consistency requirement is not just a feature of timelines; it’s a feature of persons” and the passage about consciousness and paradox. These are from a later part of the essay than the question warrants.

What the model did with it: This is the key observation. In my output (and again, you have to run it yourself to see the full thing), the model’s turn 4 response cited a chunk from the Gödel section (“the future depends on the past, the past is changed by the future, and both depend on each other for their content”) and used it to answer the Novikov question. That chunk is not about Novikov self-consistency specifically. It’s about the self-referential structure of causal loops in general. The model reframed it as a Novikov argument and produced a coherent-sounding answer, but the answer drew on material from a different section of the essay than the question was targeting.

This is the same failure mode we saw with the warp drive paper in the Faithfulness and Contextual Precision posts: the model compensating for retrieval gaps by synthesizing from adjacent material. The answer isn’t wrong; it’s consistent with the essay’s broader argument. However, the answer isn’t grounded in the Novikov section specifically.

Why completeness still scored 1.0: The completeness metric is evaluating whether the user’s goal was addressed, not whether each response was grounded in the optimal chunk. The user wanted to understand the implications of the time machine lottery scheme across four turns of questioning, and across all four turns the model produced substantive, thematically coherent responses that built on each other. The history management worked: turn 4’s response explicitly referenced the Novikov principle mentioned in turn 3, showing the model retained the conversational thread. From the metric’s perspective, the conversation was complete.

This is an important distinction between conversational completeness and retrieval faithfulness. A conversation can be complete, as in fully addressing the user’s goal, while individual responses are drawing on the wrong chunks. Completeness tells you whether the conversation succeeded at the level of user intent. It doesn’t tell you whether the retrieval grounding was accurate at each turn.

For a full diagnostic picture you would want to run both conversational and single-turn metrics on the same material, which is exactly the layered approach this series has been building toward.

One more thing worth noting: the model’s verbose formatting (numbered steps, bold headers, “Final Answer” labels) is a stylistic property of the ts-reasoner model rather than a sign of good or poor retrieval. The completeness metric correctly ignores this and evaluates the substance of what was communicated. But it’s worth being aware of in production: a model that over-structures its responses can obscure retrieval gaps behind the appearance of organized reasoning.

That last point is easy to demonstrate if you run this kind of test against, say, ChatGPT, Claude, Grok, and Gemini. Notice, however, that what we have built here is a basis for tests that can evaluate it objectively.

Conversation B — Live RAG (Semantic Precision)

Here is the result output I got:


Score: 0.8
Reason: The conversation effectively distinguishes between physical,
logical, and psychological consistency in Role and Content. Each turn
clearly explains the concept from a different perspective while
maintaining clarity and coherence. However, there is room for
improvement in ensuring that each type of consistency is treated as
distinct rather than conflated, which slightly reduces the score.

The score drops from the controlled baseline of 1.0 to 0.8: a meaningful deduction, and the reasoning points directly at the failure mode this conversation was designed to expose. The metric found that while each turn addressed the right concept, there was conflation creeping in somewhere across the exchange. The distinctions were not maintained as cleanly as the controlled case established they should be.

The truncated turn outputs that you’ll see can make it difficult to pinpoint exactly where the conflation occurred without the full responses, but the pattern is predictable given what we know about the retrieval. The three registers of consistency live in different sections of the essay: physical consistency in the short-circuit and Novikov sections, logical consistency in the Gödel section, psychological consistency in the memory section. For each turn, the retriever pulled chunks based on that turn’s question alone, without awareness of what was retrieved in prior turns.

This creates a specific risk for a vocabulary-trap conversation like Conversation B. Turn 1 asks about physical consistency and retrieves appropriate chunks. Turn 2 asks about Gödel and retrieves chunks from the Gödel section. But turn 3, which is asking about memory, likely retrieved chunks that discuss consistency in a way that bleeds across registers, because the memory section itself connects back to both the physical and logical registers to make its argument. The essay explicitly says a paradoxical memory is “a Gödel sentence in consciousness.” A model building on that chunk will naturally blend the logical and psychological registers when describing how memory works, because the essay itself uses the logical register to explain the psychological one.

The 0.2 deduction is the metric catching exactly that bleed. The model didn’t invent the conflation. In fact, it followed the essay’s own argumentative structure, which uses cross-register analogies to build its case. But from the evaluator’s perspective, a response that says memory does “the same kind of consistency work as Gödel” without carefully re-establishing the distinction between them will score as partially conflating the two.

This is a specific test finding: semantic precision failures in conversational evaluation often trace back to the source document rather than the model. When an essay deliberately uses one register to illuminate another, as this essay does when it calls a paradoxical memory “a Gödel sentence in consciousness,” a model faithfully summarizing that argument will reproduce the cross-register language. The question the metric is really asking is whether the model re-establishes the distinction after using the analogy, or lets the analogy collapse the categories. A 0.8 suggests the model did the former imperfectly: it maintained most of the distinctions but let one or two analogies land without sufficient re-anchoring.

This is a subtler failure mode than retrieval gap or topic drift. The model didn’t lose track of the conversation. It didn’t reset between turns. It followed the argument faithfully, and in doing so, reproduced a tension in the essay’s own rhetorical strategy. That is exactly the kind of nuanced result that conversational evaluation, and specifically ConversationalGEval with a precisely defined criterion, is designed to surface.

Conversation C — Live RAG (Epistemic Humility)

Here is the result output:


Score: 0.8
Reason: The assistant accurately reflects the essay's framing of cosmic
safety parameters as a metaphysical intuition rather than a scientific
proof. It correctly interprets the structural analogy between bootstrap
paradoxes and divine aseity as an exploratory approach, using appropriate
hedging language to indicate this stance. However, there is a slight gap
in fully maintaining the essay's epistemic register throughout,
particularly in some of the more direct statements.

The same score as Conversation B but for a different reason. Where Conversation B’s deduction came from cross-register conflation of analogous concepts, here the deduction comes from something more subtle: the model made some statements that were “more direct” than the essay’s own register warrants.

This is the failure mode Conversation C was specifically designed to catch, and it’s worth understanding precisely what happened. The essay’s theological and metaphysical arguments are consistently hedged: “might,” “perhaps,” “in this view,” “points toward,” “gestures at.” These hedges are not rhetorical decoration. They are the essay’s honest signal that it’s exploring a structural resonance rather than asserting a conclusion. A response that says “the essay argues X” where the essay actually says “one might read this as suggesting X” has tightened the argument without authorization.

The turn 1 trap question (does the essay prove God’s existence from time travel’s impossibility?) was answered correctly: the model said no. That’s the right answer and it’s the most important one to get right. When I look at my output, turns 2 and 3 appear to have navigated the distinction between architectural argument and apologetic proof reasonably well, since the score is 0.8 rather than lower. The deduction almost certainly came in turn 4, where the model was asked to characterize the strength of the bootstrap paradox / divine aseity parallel. Placing that parallel on the spectrum between “strong argument” and “weak analogy” requires the model to reproduce the essay’s own careful positioning. And here the metric found that at least one statement came down more firmly than the essay itself does.

The diagnostic value here is different from Conversation B. In Conversation B, the failure traced to retrieval: the essay’s cross-register rhetoric was reproduced faithfully but without sufficient re-anchoring. Here, the failure is more likely generative: the model, having correctly understood the essay’s argument, stated it slightly more confidently than the essay itself does. This is a generation-side failure rather than a retrieval-side failure, and it’s precisely the kind of failure that epistemic register evaluation is designed to catch: one that would be invisible to Faithfulness, Precision, Recall, and Relevancy metrics alike.

Comparing the two live GEval scores side by side is instructive. Both conversations scored 0.8. But the nature of the 0.2 deduction differs:

  • Conversation B lost points for conflating analogous but distinct concepts: a precision failure rooted in how the essay uses cross-register analogies to build its argument
  • Conversation C lost points for making direct statements where the essay hedges: a register failure rooted in how confidently the model characterized the essay’s conclusions

Both are real failures. Neither would have been visible without the conversational GEval framework and the criteria we defined. And both trace, ultimately, to properties of the source document rather than deficiencies in the model: the essay’s rhetorical strategy creates exactly the retrieval and generation pressures that these failure modes exploit.

Reading the Summary

With all six cases complete, the results table gives us the clearest view of what the conversational evaluation found:


Controlled Cases:
  Completeness     — Conv A: 1.00
  Semantic Prec.   — Conv B: 1.00
  Epistemic Hum.   — Conv C: 1.00

Live RAG Cases:
  Completeness     — Conv A: 1.00
  Semantic Prec.   — Conv B: 0.80
  Epistemic Hum.   — Conv C: 0.80

The pattern is clean and readable. All three controlled cases scored 1.0, confirming the baselines were valid: the hand-crafted responses satisfied completeness, maintained semantic distinctions, and preserved epistemic register across all four turns. The live RAG cases then diverged from those baselines in a specific and informative way: Conversation A held at 1.0 while Conversations B and C dropped to 0.8.
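If you want that comparison in a form you can rerun, a minimal sketch looks something like the following. The scores are copied from the summary table above. The dictionary layout and the score_deltas name are my own assumptions for illustration and are not part of the evaluation script.

```python
# Scores copied from the summary table above.
controlled = {
    "Completeness": 1.00,
    "Semantic Precision": 1.00,
    "Epistemic Humility": 1.00,
}
live = {
    "Completeness": 1.00,
    "Semantic Precision": 0.80,
    "Epistemic Humility": 0.80,
}


def score_deltas(baseline, observed):
    """Return baseline minus observed per metric, rounded to hide float noise."""
    return {name: round(baseline[name] - observed[name], 2) for name in baseline}


deltas = score_deltas(controlled, live)

# Metrics whose live score fell below the controlled baseline.
dropped = [name for name, delta in deltas.items() if delta > 0]

for name, delta in deltas.items():
    print(
        f"{name:<20} controlled={controlled[name]:.2f} "
        f"live={live[name]:.2f} delta={delta:.2f}"
    )
```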

That divergence isn’t random. It reflects a structural property of the three conversations and what each metric was measuring.

Conversation A held because ConversationCompletenessMetric evaluates whether the user’s goal was addressed across the exchange. The retriever found relevant chunks for each turn, the history management kept the model from resetting, and the model produced substantive responses that built on each other. The goal of understanding the implications of the time machine lottery scheme was addressed. Completeness doesn’t penalize the model for drawing on adjacent chunks or for synthesizing across sections. It asks whether the conversation succeeded at the level of user intent, and this one did.

Conversations B and C dropped because ConversationalGEval evaluates finer-grained properties that completeness can’t see. Semantic precision penalizes cross-register conflation even when the conversation is complete. Epistemic humility penalizes register drift even when the argument is correctly characterized. Both of these failure modes are invisible to completeness evaluation: a conversation can be fully complete and still lose points on precision and register.

The important test finding here is that the three metrics are not redundant. They measure different things, and a conversation can score differently on each without contradiction. A 1.0 on completeness with a 0.8 on semantic precision is not, going with the theme of the essay, a paradox: it means the conversation addressed the user’s goal while imperfectly maintaining a conceptual distinction. A 1.0 on completeness with a 0.8 on epistemic humility means the conversation addressed the user’s goal while occasionally overstating the source material’s confidence level. Both are real findings, and neither would surface without the metric that caught it.

The 0.2 deductions on Conversations B and C also share a common origin, despite targeting different failure modes. Both trace to properties of the source document rather than deficiencies in the model. That’s an important distinction if you’re testing a model! The essay uses cross-register analogies to build its argument: calling a paradoxical memory “a Gödel sentence in consciousness,” connecting the Novikov principle to Hawking’s chronology protection, and framing the bootstrap paradox as a structural imitation of divine aseity. These rhetorical moves are what make the essay intellectually interesting (or so I tell myself). They’re also exactly what creates pressure on a model summarizing it across multiple turns: the model must reproduce the essay’s analogical reasoning without letting the analogies collapse the distinctions the essay itself maintains, and without sharpening the essay’s speculative hedges into firmer claims than the essay makes.

The test finding here is that conversational evaluation exposes failure modes that are native to argumentative documents: documents that use analogy, cross-register reasoning, and deliberate epistemic hedging as rhetorical tools. Single-turn metrics can detect whether the right chunk was retrieved and whether the response was faithful to it. Conversational metrics detect whether the model honored the document’s argumentative strategy across an extended exchange. These are different things, and both matter for any system that will be asked to answer multi-turn questions about complex source material.

What Does the Testing Tell Us?

Six test cases across three conversations, two metric types, and one argumentative essay that moves from chaos theory through Gödel to classical theology. The pattern that emerged is worth pulling together before I close this post.

The controlled cases did what they always do in this series: established that the metrics work, the criteria are well-formed, and the baselines are valid. All three scored 1.0. That’s not a finding about the model or the retriever; it’s a finding about the test design. A controlled case that doesn’t score 1.0 on a hand-crafted response is a signal that the criteria need refinement before live results can be interpreted. Clean baselines came back, which means the live deductions are real signals rather than noise.
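One way to enforce that ordering is to refuse to read live results until the baselines check out. The sketch below is an assumption about how you might wire that up. The function names, the expected score of 1.0 as a parameter, and the tolerance are all hypothetical choices, not something the script itself does.

```python
def failed_baselines(controlled_scores, expected=1.0, tolerance=1e-9):
    """Return the names of controlled cases that missed the expected score."""
    return [
        name
        for name, score in controlled_scores.items()
        if abs(score - expected) > tolerance
    ]


def interpret_live(controlled_scores, live_scores):
    """Return live scores only if every controlled baseline is clean."""
    bad = failed_baselines(controlled_scores)
    if bad:
        # A dirty baseline means the criteria need work, not the model.
        raise ValueError(f"Refine criteria before reading live results: {bad}")
    return dict(live_scores)


# Scores from the summary table in this post.
controlled = {"Completeness": 1.0, "Semantic Precision": 1.0, "Epistemic Humility": 1.0}
live = {"Completeness": 1.0, "Semantic Precision": 0.8, "Epistemic Humility": 0.8}

print(interpret_live(controlled, live))
```

The design choice here is to fail loudly. A baseline miss is a test-design problem, so it should stop the run rather than quietly sit next to the live numbers.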

Test Finding: The history management in the run_conversation helper is load-bearing. Conversation A scored 1.0 on completeness in the live case, which means the model successfully maintained the thread of the argument across four turns, referencing Novikov in turn 4 because turn 3 introduced it, building the disturbance argument in turn 2 on the foundation laid in turn 1. Without the history string being passed to each prompt, these connections would not have formed. The model would have answered each turn in isolation, producing individually coherent but collectively disconnected responses that would have failed completeness. The infrastructure matters as much as the metrics.
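To show the shape of what “passing the history string” means, here is a stripped-down sketch. This is not the actual run_conversation helper: the name run_conversation_sketch, the prompt wording, and the answer_fn callable are all assumptions for illustration, and retrieval is left out entirely. The only point is that each prompt carries the prior turns.

```python
def run_conversation_sketch(questions, answer_fn):
    """Ask each question in order, feeding prior turns back into every prompt.

    answer_fn is any callable that takes a prompt string and returns a reply.
    """
    history = ""
    transcript = []
    for question in questions:
        if history:
            prompt = f"Conversation so far:\n{history}\nUser: {question}"
        else:
            prompt = f"User: {question}"
        answer = answer_fn(prompt)
        transcript.append((question, answer))
        # Append this turn so the next prompt can see it.
        history += f"User: {question}\nAssistant: {answer}\n"
    return transcript


# A stub model that records the prompts it receives, so we can inspect them.
seen_prompts = []


def stub_model(prompt):
    seen_prompts.append(prompt)
    return f"reply {len(seen_prompts)}"


transcript = run_conversation_sketch(
    ["What is the lottery scheme?", "How does Novikov apply to it?"],
    stub_model,
)
print(seen_prompts[1])
```

If you delete the line that appends to history, the second prompt contains only the second question, which is the isolated-turn behavior described above.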

Test Finding: Conversational completeness and single-turn faithfulness measure different things and can diverge significantly. Conversation A scored 1.0 on completeness despite the retriever pulling memory and imago Dei chunks for turn 4: chunks from a different section than the Novikov question warranted. The model synthesized a coherent answer from adjacent material, and completeness credited the result because the user’s goal was met. A faithfulness metric on the same turn would have flagged the synthesis as drawing beyond the retrieved context. Neither finding is wrong. They reveal different layers of the same pipeline. A production system needs both.

Test Finding: The two 0.2 deductions on Conversations B and C share a structural origin even though they target different failure modes. Both trace to the essay’s own rhetorical strategy: its use of cross-register analogies and deliberate epistemic hedging as argumentative tools. The Semantic Precision deduction came from the model faithfully reproducing the essay’s cross-register reasoning without fully re-anchoring the distinctions afterward. The Epistemic Humility deduction came from the model correctly characterizing the essay’s argument but stating it slightly more confidently than the essay’s own hedges warrant. In both cases the model followed the document. The metric found where following the document faithfully is not quite the same as representing the document precisely.

That distinction matters. It goes a long way toward explaining why people so often misjudge how effective, or not, generative AI is when it works from source documents.

This points to something worth naming explicitly. Argumentative documents, such as essays, analyses, philosophical pieces, and legal briefs, create a different evaluation challenge than purely factual documents. A technical paper makes claims that can be verified against retrieved chunks: going back to the warp drive paper we looked at, either the chunk says the energy requirement is 10^28 kg or it doesn’t. An argumentative essay makes moves that can only be evaluated against the essay’s overall rhetorical strategy: not just what it claims but how confidently; not just what distinctions it draws but how carefully it maintains them across the full argument. Single-turn metrics can evaluate the first kind of claim adequately. Conversational metrics with precisely defined criteria are required for the second.

Test Finding: The three conversational metrics in this script are not a complete evaluation toolkit; they’re a starting point that reveals where to look next. A 1.0 on completeness with 0.8 on semantic precision means the system addressed the user’s goal while imperfectly maintaining a conceptual distinction. The next diagnostic question is whether the precision failure traces to retrieval (the wrong chunks creating cross-register pressure) or to generation (the model collapsing distinctions the retrieved chunks actually maintained). Answering that requires running single-turn relevancy and faithfulness metrics on the individual turns where the precision failure occurred. The conversational metrics locate the problem. The single-turn metrics identify its source.
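That triage step can be sketched as a small decision function. Everything here is an assumption for illustration: the function name, the 0.7 threshold, and the rule ordering are mine. The per-turn scores in the example are made-up placeholders, not results from this post’s runs. In practice you would feed in the single-turn relevancy and faithfulness scores for the turn where the conversational metric flagged a problem.

```python
def triage_turn(relevancy, faithfulness, threshold=0.7):
    """Suggest where to look first for a turn flagged by a conversational metric.

    Low relevancy points at retrieval. Acceptable relevancy with low
    faithfulness points at generation. If both look fine, the single-turn
    metrics did not see the problem, and the flag needs a closer manual read.
    """
    if relevancy < threshold:
        return "retrieval"
    if faithfulness < threshold:
        return "generation"
    return "inconclusive"


# Placeholder scores purely to exercise the three branches.
examples = {
    "noisy retrieval": (0.4, 0.9),
    "clean retrieval, drifting answer": (0.9, 0.5),
    "both look fine": (0.9, 0.9),
}

for label, (relevancy, faithfulness) in examples.items():
    print(f"{label}: {triage_turn(relevancy, faithfulness)}")
```

The “inconclusive” branch matters as much as the other two. A register or precision failure can leave both single-turn scores looking healthy, which is the reason the conversational metric was needed in the first place.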

That relationship, with conversational metrics locating problems and single-turn metrics diagnosing their source, is the larger framework this series has been building toward.

Faithfulness tells you what the model said that wasn’t in the retrieved context. Precision tells you whether the right chunks were ranked first. Recall tells you whether all necessary chunks were found. Relevancy tells you how much noise was in what was retrieved. G-Eval tells you whether the model honored the document’s argument. Completeness tells you whether the conversation served the user’s goal. Semantic precision tells you whether distinctions were maintained across turns. Epistemic humility tells you whether the document’s register was preserved.

From a testing perspective, no single metric tells you all of this. The diagnostic picture requires the combination, and knowing which metrics to reach for depends on understanding what kind of document you’re evaluating, what kind of questions will be asked of it, and what kind of failure modes matter most for the system you’re building.

If that sounds like a fun testing challenge, well, yeah, it can be!

Next Steps!

We’ve covered a lot in these DeepEval posts in terms of metrics. The next post will be a synthesis of all that.

This article was written by Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.
