AI and Testing: Improving Retrieval Quality, Part 1

In the previous post on Contextual Precision, we diagnosed a critical problem in our RAG system: poor retrieval quality was causing failures that we also observed in the Faithfulness post. In this first of three related posts, we’re going to dig in a bit. This will be our first extended example of what testing a generative AI really looks like.

Needless to say, the context of those last posts, including the results I displayed, is critical for understanding this post and the two that follow.

Our previous testing revealed that relevant information was being buried beneath irrelevant chunks, leading to, at least in the cases I showed, a Contextual Precision score of 0.33 and a Faithfulness score of 0.5. The diagnosis was clear: the retrieval mechanism, not the generation model, was the bottleneck.

However, I think we could all agree that diagnosis without treatment is incomplete. In this post, and the following two, we’ll put our testing framework to work doing what it’s designed for: guiding improvement. We’ll experiment with different retrieval strategies, measure their impact using the Faithfulness and Contextual Precision metrics together, and demonstrate how targeted changes to the retrieval pipeline can improve overall system quality. Or perhaps we’ll show that such changes have no effect at all. Which will it be? Well, let’s find out!

Establishing the Baseline

Before we can improve anything, we need to establish where we’re starting from. Our current configuration (from both the Faithfulness and Contextual Precision posts) uses:

Chunk size: 1000 characters
Chunk overlap: 200 characters
Number of chunks retrieved (k): 3
Embedding model: nomic-embed-text
Retrieval strategy: semantic similarity search

This configuration produced our (or at least my) problematic results: Contextual Precision of 0.33 (relevant chunk ranked last) and Faithfulness of 0.5 (claims from outside retrieved context). Let’s formalize this baseline measurement using DeepEval’s evaluate() function, which allows us to run multiple metrics together.

You might remember that I briefly showed evaluate(), as distinct from measure(), in the Answer Relevancy post and I promised to return to it. This post will do exactly that.

In order to do this, I’m going to have to provide you with some boilerplate code that will serve as our test harness.

Notice here that we’re combining the original bespoke test harness we wrote prior to looking at DeepEval (our testing example) with the DeepEval metrics we’ve been looking at recently.

RAG Implementation Code

To get started, put the following logic in place. (This script is available as retrieval-quality-001.py in my repo for this series.)

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from deepeval.metrics import ContextualPrecisionMetric, FaithfulnessMetric
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase
from deepeval import evaluate

def create_rag_system(chunk_size=1000, chunk_overlap=200, k=3):
  """Create a RAG system with configurable parameters."""
  loader = PyPDFLoader("./arXiv-jnyman-051011v3.pdf")
  documents = loader.load()

  text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
  )

  chunks = text_splitter.split_documents(documents)

  embeddings = OllamaEmbeddings(model="nomic-embed-text")
  vectorstore = Chroma.from_documents(chunks, embeddings)

  retriever = vectorstore.as_retriever(search_kwargs={"k": k})

  return retriever, len(chunks)

def create_rag_system_semantic(k=3):
  """Create a RAG system with semantically-aware chunking."""
  loader = PyPDFLoader("./arXiv-jnyman-051011v3.pdf")
  documents = loader.load()

  # Use separators that respect document structure
  text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n\n\n", "\n\n", "\n", ". ", " ", ""],
    length_function=len
  )

  chunks = text_splitter.split_documents(documents)

  embeddings = OllamaEmbeddings(model="nomic-embed-text")
  vectorstore = Chroma.from_documents(chunks, embeddings)

  retriever = vectorstore.as_retriever(search_kwargs={"k": k})

  return retriever, len(chunks)

def run_test(retriever, question, expected_output, show_chunks=True):
  """Run a complete test with both metrics."""

  execution_model = ChatOllama(model="jeffnyman/ts-reasoner")
  judge_model = OllamaModel(model="jeffnyman/ts-evaluator")

  # Get relevant context
  retrieved_docs = retriever.invoke(question)
  context = [doc.page_content for doc in retrieved_docs]

  # Display the retrieved chunks
  if show_chunks:
    print("\n" + "-" * 60)
    print("RETRIEVED CHUNKS:")
    print("-" * 60)

    for i, chunk in enumerate(context, 1):
      print(f"\n--- Chunk {i} ---")
      print(chunk)

    print("-" * 60 + "\n")

  # Generate response
  prompt = f"Based on this context: {context}\n\nQuestion: {question}"
  response = execution_model.invoke(prompt).content

  # Create test case
  test_case = LLMTestCase(
    input=question,
    actual_output=response,
    expected_output=expected_output,
    retrieval_context=context
  )

  # Create metrics
  precision_metric = ContextualPrecisionMetric(
    model=judge_model,
    verbose_mode=True
  )

  faithfulness_metric = FaithfulnessMetric(
    model=judge_model,
    verbose_mode=True
  )

  # Evaluate with both metrics
  results = evaluate(
    test_cases=[test_case],
    metrics=[precision_metric, faithfulness_metric]
  )

  return results, context, response

def get_scores(results):
  """Safely extract scores from results."""
  if results is not None:
    metrics_data = results.test_results[0].metrics_data
    if metrics_data is not None:
      return {m.name: m.score for m in metrics_data}

  return {}

def print_scores(label, results, baseline_results=None):
  """Print scores with optional comparison to baseline."""
  print(f"\n{label} Scores:")
  scores = get_scores(results)

  if scores:
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"Contextual Precision: {scores.get('Contextual Precision')}")
    print(f"Faithfulness: {scores.get('Faithfulness')}")

    if baseline_results is not None:
      baseline_scores = get_scores(baseline_results)

      if baseline_scores:
        precision_change = scores.get("Contextual Precision", 0) \
          - baseline_scores.get("Contextual Precision", 0)
        faithfulness_change = scores.get("Faithfulness", 0) \
          - baseline_scores.get("Faithfulness", 0)

        print("\nComparison to Baseline:")
        print(f"Contextual Precision: {precision_change:+.2f}")
        print(f"Faithfulness: {faithfulness_change:+.2f}")
  else:
    print("No metrics data available.")

As with all code in this series, I recommend spending some time looking at what the code is actually doing.

This code is setting up an experiment to test how well an AI system can answer questions about a document. Specifically, we’ll use my warp drive paper again since I want to play off of the previous results. You can think of what we’re doing above as something like a reading comprehension test. Our goal is not just grading the AI’s final answer, but also examining how it found and used information from the source material. The RAG I’m showing you here is largely what we developed in the Faithfulness post, but with some more structure.

First, there are the document preparation functions (create_rag_system and create_rag_system_semantic). These functions take the PDF and chunk it into smaller pieces. The code implements two approaches: basic chunking (like cutting a piece of paper with scissors at regular intervals), and semantic chunking (like cutting the paper along natural boundaries, such as paragraphs, sections, and sentences).
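To make the "scissors at regular intervals" picture concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. To be clear, this is not how RecursiveCharacterTextSplitter works internally (it splits on separators and merges pieces up to the size limit); the chunk_text name is my own, and the sketch only shows what the chunk_size and chunk_overlap numbers mean.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size character windows that share an overlap.

    Illustrative only: a naive stand-in for a real text splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []

    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text

    return chunks


# A 2500-character stand-in document
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)

print(len(chunks))               # 3
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The last 200 characters of each chunk are repeated at the start of the next one, which is the shared context the overlap setting is meant to provide.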

Then there’s the testing function (run_test). This is where the actual evaluation happens. Given a question (which I haven’t put in place yet), the system searches through all the chunks to find the most relevant ones (the “retrieval” part). An AI model (ts-reasoner) reads those chunks and generates an answer (the “generation” part). A separate “judge” AI model (ts-evaluator) evaluates both the retrieval quality and the answer quality.

Then there’s comparison logic. The print_scores function lets you compare different configurations against a baseline: essentially A/B testing of different chunking strategies to see which helps the AI perform better.

The whole system is designed to be a controlled experiment: you can adjust parameters (chunk sizes, how many chunks to retrieve) and measure whether those changes improve the AI’s ability to accurately answer questions about the paper we’re using as the source material.

I’ll call out again here that instead of running measure() on individual metrics, we’re using evaluate() to run both Contextual Precision and Faithfulness together. This gives us a holistic view of system performance: we can see both how well the retriever orders chunks and how faithfully the model uses those chunks in a single test run.

Regarding that ability to configure, the create_rag_system() function is parameterized, which means we can easily experiment with different configurations. This is exactly what we want: a testing framework that makes experimentation easy and results comparable.
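As a side illustration of why the parameterization matters, here is a small sketch that builds a grid of keyword arguments you could hand to create_rag_system(). The build_configs helper and the specific values in the grid are hypothetical; they are not configurations I've tested in this post.

```python
from itertools import product


def build_configs(chunk_sizes, overlaps, ks):
    """Return one kwargs dict per valid (chunk_size, chunk_overlap, k) combination."""
    configs = []
    for chunk_size, chunk_overlap, k in product(chunk_sizes, overlaps, ks):
        # An overlap has to fit inside a chunk, so skip impossible pairs
        if chunk_overlap < chunk_size:
            configs.append(
                {"chunk_size": chunk_size, "chunk_overlap": chunk_overlap, "k": k}
            )
    return configs


# Hypothetical grid; the baseline (1000, 200, 3) is one of the entries
configs = build_configs([500, 1000], [100, 200], [3, 5])

for cfg in configs:
    print(cfg)
    # In the real script each entry would be used like this:
    # retriever, num_chunks = create_rag_system(**cfg)
```

This grid is separate from retrieval-quality-001.py. It just shows how a parameterized function lets you line up comparable experiments.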

Now add the following to the end of the script:

question = """Please consider Jeff Nyman's warp drive paper.
What energy source does the paper propose would be needed to
generate the warp bubble for faster-than-light travel?"""

expected_output = """Matter/antimatter annihilation, requiring
approximately 10^28 kg of antimatter (equivalent to Jupiter's
mass-energy)."""

# =========================================================
# BASELINE
# =========================================================
print("=" * 60)
print("BASELINE: chunk_size=1000, chunk_overlap=200, k=3")
print("=" * 60)

retriever, num_chunks = create_rag_system(
  chunk_size=1000,
  chunk_overlap=200,
  k=3
)

print(f"Document split into {num_chunks} chunks")

baseline_results, baseline_context, baseline_response = run_test(
  retriever,
  question,
  expected_output
)

print_scores("Baseline", baseline_results)

This section runs our first experiment: establishing a baseline measurement that we’ll compare all future tests against. We’re doing something we’ve been doing all along: asking the system a specific question about my warp drive paper, namely what energy source would power the warp bubble. We’ve also provided the correct answer (matter/antimatter annihilation with a specific mass requirement), which serves as the answer key.

The baseline uses our standard configuration: chunks of 1000 characters, 200 characters of overlap between chunks (so adjacent pieces share some context), and retrieving the top 3 most relevant chunks for each question.

Think of this like calibrating an instrument before taking measurements. We’re establishing a reference point: “Here’s how well the system performs with reasonable default settings.” Once we have these baseline scores for Contextual Precision and Faithfulness, we can experiment with different configurations (smaller chunks, more overlap, retrieving more pieces) and see whether each change makes things better or worse.

Running this baseline should give us scores similar to what we saw in previous posts:

Contextual Precision: ~0.33 (relevant chunk ranked last)
Faithfulness: ~0.5 (model making claims beyond retrieved context)

Your actual scores may vary slightly due to the non-deterministic nature of LLMs, but they should be in this range if, and this is an important caveat, you’re seeing the same retrieval problems I diagnosed in those previous posts.

Note that any divergences you see from what I have reported are an interesting thing to consider when you think about reproducible tests and, even more importantly, reproducible test results and (potentially) reproducible bug reports.
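One way to take that reproducibility concern seriously is to record the score from each run and look at the spread rather than trusting a single number. Here is a minimal sketch of that idea. The summarize_runs helper and the 0.1 tolerance are my own assumptions; the two Faithfulness values are the 0.5 from the earlier post and the 0.33 that this post's run ends up producing.

```python
from statistics import mean


def summarize_runs(scores, tolerance=0.1):
    """Summarize repeated scores for one metric and flag a large spread.

    tolerance is an arbitrary, illustrative cutoff, not a DeepEval setting.
    """
    if not scores:
        raise ValueError("need at least one score")

    spread = max(scores) - min(scores)
    return {
        "runs": len(scores),
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "spread": spread,
        "stable": spread <= tolerance,
    }


# Faithfulness as reported in the earlier post and in this post's run
faithfulness_runs = [0.5, 1 / 3]
summary = summarize_runs(faithfulness_runs)
print(summary)
```

With only two runs this is hardly statistics, but the "stable" flag coming back False is a reminder that a single score from a non-deterministic judge shouldn't be treated as a fixed fact.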

Running the Baseline

You should see something like this when you start running:


============================================================
BASELINE: chunk_size=1000, chunk_overlap=200, k=3
============================================================
Document split into 38 chunks

------------------------------------------------------------
RETRIEVED CHUNKS:
------------------------------------------------------------

--- Chunk 1 ---
I find it very challenging to mak e predictions on how this warp drive might actually function. My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract the compactified space-time around it, thereby creating the propulsion effect. That being said, my goal in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.

--- Chunk 2 ---
of Relativity. An element missing from all the papers is that there is little or no suggestion as to how such a warp bubble may be created. I do not plan to buck that trend too much in that the aim of this paper is not to discuss the plausibility of a warp drive. This means I am not addressing the valid questions associated with violation of the null

--- Chunk 3 ---
I can now look at the energy required to create the necessary warp bubble. The accepted value of the cosmological constant is Λ ? 10 -47(GeV)4. Converting again into SI units gives Λ ? 10 -10J/m3. Now, for a warp bubble expanding at the speed of light I would need to increase this again by a factor of 10 52 as I have H ∝ √Λ . I can thus say this:

Equation 29
Here Λc is the local value of the cosmological constant when space is expanding at c. To make this a concrete example, I will consider a spacecraft of these dimensions:

Equation 30
If I postulate that the warp bubble must, at least, encompass the volume of the craft,  the total amount of energy ‘injected’ locally would equal

Equation 31
Assuming some arbitrarily advanced civilization was able to create such an effect ,  I will further postulate that this civilization would be able to utilize the most efficient method of  energy production,
------------------------------------------------------------

If you don’t want to see the actual chunks as part of your output, note that run_test() has a show_chunks parameter. It defaults to True, but you can set it to False when running a given test.

You might notice some odd spacing in between words or even within words. This is a classic PDF extraction artifact. PDFs are notoriously tricky because they don’t actually store text the way you would expect. They store rendering instructions for where to position individual glyphs (character shapes) on a page.

You really have to think of a PDF like a mosaic made of tiny tiles. From a distance it looks like continuous text, but up close you see it’s individual pieces positioned precisely. When you try to “read” the mosaic back into text, you have to guess which tiles belong together as words. Case in point here, the PDF likely has specific spacing commands between certain letters for visual alignment that the extractor interprets as word boundaries.

But, wait! Isn’t this bad? If the extractions are wrong, can’t that compromise things? For RAG purposes, this might not actually matter much. The embedding model likely handles “mak e” and “make” similarly enough that retrieval still works. The LLM reading the context can also be robust to these quirks, as ts-reasoner is.

Notice that the script is grabbing three chunks but, at the top of the output, it says “Document split into 38 chunks.” What’s going on there? There are actually two different “chunks” being referred to.

38 chunks refers to the total number of chunks the document was split into when building the vector database
3 chunks (k=3) refers to the number of most relevant chunks retrieved for this specific question

Think of it like a library with thirty-eight books on your shelf (the full chunked document), but when someone asks you a specific question, you only pull down the three most relevant books to answer it. You don’t read all thirty-eight every time.
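Pulling three books off a shelf of thirty-eight can be sketched as a ranking problem. The following is a toy version that assumes cosine similarity over tiny hand-made vectors; the real vector store uses high-dimensional embeddings from nomic-embed-text, and its distance function and indexing may differ. The function names here are mine.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings" standing in for five stored chunks
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [2.0, 0.5]]
query_vec = [1.0, 0.2]

print(top_k(query_vec, chunk_vecs, k=3))  # [4, 0, 2]
```

Every stored chunk gets a score, but only the k best are handed to the model. The order of that short list is exactly what Contextual Precision evaluates.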

After that output, I got this:


**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The first document does not provide any specific information about the energy source needed for generating a warp bubble. It only discusses the conceptual framework and goals of the paper."
    },
    {
        "verdict": "no",
        "reason": "The second document also does not address the energy requirements or sources for creating a warp bubble, focusing instead on the plausibility and physical constraints of a warp drive."
    },
    {
        "verdict": "yes",
        "reason": "The third document explicitly discusses the energy required to create the necessary warp bubble. It provides detailed calculations and equations that directly relate to the expected output: 'Matter/antimatter annihilation, requiring approximately 10^28 kg of antimatter (equivalent to Jupiter's mass-energy).'"
    }
]

Score: 0.3333333333333333
Reason: The score is 0.33 because nodes 1 and 2 are ranked higher than node 3, despite the third document explicitly addressing the energy source needed for generating a warp bubble. Nodes 1 and 2 should be ranked lower as they do not provide relevant information.

======================================================================
**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "The paper focuses on a realistic model of a warp drive rather than discussing its physical realizability.",
    "The first four sections review necessary physics for understanding the new warp drive model.",
    "Calculations regarding speed limits and energy requirements are presented in the paper.",
    "The cosmological constant Λ is approximately 10^-47 (GeV)^4 or 10^-10 J/m3 in SI units.",
    "For a warp bubble expanding at the speed of light, the required energy would be increased by a factor of 10^52.",
    "The paper considers a spacecraft with dimensions that must be encompassed by the warp bubble.",
    "It is assumed that an advanced civilization could utilize the most efficient method of energy production."
]

Claims:
[
    "The paper focuses on modeling the energy requirements for creating a warp bubble, rather than its feasibility.",
    "The paper references the cosmological constant (Λ) and Hubble's constant (H) as fundamental to the energy calculation.",
    "Equation 29 in the paper shows how the expansion rate of space directly relates to Λ.",
    "The paper calculates the energy required by scaling the cosmological constant by a factor of 10^52, justified by the relationship H ∝ √Λ.",
    "The author assumes that an advanced civilization would use the most efficient energy production method available.",
    "Equation 31 in the paper explicitly states that the energy 'injected' locally equals the volume of the craft multiplied by the scaled cosmological constant."
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context states that calculations regarding speed limits and energy requirements are presented, but it does not mention Hubble's constant (H) as a fundamental part of the energy calculation."
    },
    {
        "verdict": "no",
        "reason": "While the context mentions equations related to \u039b, Equation 29 is not referenced in the provided information."
    },
    {
        "verdict": "no",
        "reason": "The context does mention scaling the cosmological constant by a factor of 10^52 but does not explicitly state that this is justified by H\u221d \u221a\u039b. The relationship between H and \u039b is not mentioned in the provided information."
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context mentions an assumption about advanced civilizations using efficient energy production methods, but it does not explicitly state that Equation 31 in the paper makes this claim."
    }
]

Score: 0.3333333333333333
Reason: The score is 0.33 because the actual output introduces elements (Hubble's constant H and Equation 29) not supported by the context, and omits key relationships (the connection between H and Λ) that are present in the contradictions.

We’ve seen these outputs before but let’s unpack what these metrics are telling us about our RAG system’s performance. The Contextual Precision metric evaluated our three retrieved chunks and found exactly the problem we’ve been diagnosing in these posts:

Chunk 1: Verdict “no”
✘ Discusses conceptual framework, not the specific energy source
Chunk 2: Verdict “no”
✘ Focuses on plausibility, not energy requirements
Chunk 3: Verdict “yes”
✔ Actually answers the question with calculations and equations

The score of 0.33 tells us that two irrelevant chunks were ranked higher than the one relevant chunk. As the metric explains: “nodes 1 and 2 are ranked higher than node 3, despite the third document explicitly addressing the energy source.” This is the classic retrieval failure mode we identified in the Contextual Precision post: the right information exists in our source document, but the retriever buried it beneath noise.

Think of this like a search engine showing you two unrelated Wikipedia articles before finally showing the one that actually answers your question. The information is there, but the ordering undermines usability.
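If you're wondering where exactly 0.33 comes from, it helps to compute it by hand. The sketch below follows my understanding of the weighted cumulative precision formula that DeepEval documents for this metric: at each rank holding a relevant chunk, take the fraction of relevant chunks seen so far, then average over the relevant chunks. Treat the implementation as an assumption on my part rather than DeepEval's actual code.

```python
def contextual_precision(verdicts):
    """Weighted cumulative precision over ranked "yes"/"no" relevance verdicts."""
    relevant_total = sum(1 for v in verdicts if v == "yes")
    if relevant_total == 0:
        return 0.0  # nothing relevant was retrieved at all

    relevant_so_far = 0
    weighted_sum = 0.0

    for rank, verdict in enumerate(verdicts, start=1):
        if verdict == "yes":
            relevant_so_far += 1
            # Precision at this rank, counted only where a relevant chunk sits
            weighted_sum += relevant_so_far / rank

    return weighted_sum / relevant_total


# The verdicts from the verbose log: relevant chunk ranked last
print(contextual_precision(["no", "no", "yes"]))  # 0.3333333333333333

# The same chunks with the relevant one ranked first
print(contextual_precision(["yes", "no", "no"]))  # 1.0
```

The same three chunks produce either 0.33 or 1.0 depending purely on their order, which is why this metric is about ranking rather than about whether the right chunk was found at all.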

Now let’s consider Faithfulness, and here’s where things get interesting and perhaps surprising. Our Faithfulness score is also 0.33, significantly lower than the 0.5 we saw in the standalone Faithfulness post. What happened? The metric extracted seven “truths” from our retrieved chunks: facts that were explicitly stated in the context we provided. Then it evaluated six claims that ts-reasoner made in its response. Let’s look at the verdicts:

Claim 1: “The paper focuses on modeling energy requirements…”
✔ Supported
Claim 2: References cosmological constant Λ and Hubble’s constant H
✘ H not mentioned in context
Claim 3: “Equation 29 shows expansion rate relates to Λ”
✘ Equation 29 not in retrieved chunks
Claim 4: Scaling by 10⁵² justified by H ∝ √Λ
✘ Relationship not stated in context
Claim 5: Advanced civilization uses efficient energy production
✔ Supported
Claim 6: Equation 31 explicitly states energy formula
✘ Not explicitly stated this way

Only two claims out of six were fully supported by the retrieved context, giving us 2/6 ≈ 0.33.
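That arithmetic is simple enough to sketch directly. The function below is a simplified ratio of supported claims to total claims, using the six verdicts from the verbose log. It's an illustration of the calculation as I've described it, not DeepEval's implementation, which may treat edge cases (such as verdicts other than "yes" and "no") differently.

```python
def faithfulness_score(verdicts):
    """Fraction of claims whose verdict is "yes" (supported by the context)."""
    if not verdicts:
        raise ValueError("need at least one claim verdict")

    supported = sum(1 for v in verdicts if v == "yes")
    return supported / len(verdicts)


# Verdicts for claims 1 through 6, copied from the verbose log
claim_verdicts = ["yes", "no", "no", "no", "yes", "no"]

score = faithfulness_score(claim_verdicts)
print(score)  # 0.3333333333333333
```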

Here’s the crucial insight: ts-reasoner isn’t hallucinating nonsense. Hubble’s constant, Equation 29, and the H ∝ √Λ relationship are actually in the paper; they’re just not in the three chunks the retriever selected. The model is drawing on legitimate information from the broader document (or possibly its training data about warp drive physics), but that information wasn’t provided in the retrieval context.

This is the retrieval-generation cascade we’ve been discussing. Poor retrieval quality (Contextual Precision = 0.33) directly causes faithfulness failures (Faithfulness = 0.33) because the model doesn’t have access to the complete information it needs. The model is trying to provide a comprehensive answer, but it’s working from incomplete source material.

Notice how both metrics, at least in this case, scored identically: 0.33. This isn’t coincidence. Rather, it’s confirmation of our diagnosis: Low Contextual Precision + Low Faithfulness = Retrieval Problem. If Contextual Precision were high but Faithfulness low, we would suspect a generation problem (the model ignoring good context). But when both are low together, it indicates the retrieval system isn’t surfacing the right chunks, which then cascades into faithfulness failures.

The model needs Equation 29, the H ∝ √Λ relationship, and explicit references to Hubble’s constant to give a fully faithful answer. Those details exist in the paper, albeit on pages that weren’t included in our three retrieved chunks. The retriever found some relevant information (Chunk 3 with the energy calculations), but missed critical supporting details.

This baseline measurement confirms what we suspected: our current retrieval configuration (chunk_size=1000, chunk_overlap=200, k=3) is insufficient for this question. The system is splitting the document in ways that separate related concepts, and retrieving too few chunks to capture the complete answer.

Test Finding: When both Contextual Precision and Faithfulness score low on the same test, the root cause is almost certainly retrieval quality, not model capability. The generation model can only be as faithful as the context it receives allows.
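That finding can be turned into a rough triage helper. The sketch below encodes the two cases I've described (both low means retrieval, high precision with low faithfulness means generation) and uses 0.5 as the cutoff since that's the threshold both metrics ran with. The diagnose function, its wording, and the handling of the remaining two combinations are my own assumptions, and it's a heuristic, not a proof.

```python
def diagnose(contextual_precision, faithfulness, threshold=0.5):
    """Map a pair of metric scores to a suspected problem area (heuristic)."""
    low_precision = contextual_precision < threshold
    low_faithfulness = faithfulness < threshold

    if low_precision and low_faithfulness:
        # Both low together: the retriever isn't surfacing the right chunks
        return "retrieval problem suspected"
    if low_faithfulness:
        # Good context, unfaithful answer: the model is ignoring what it was given
        return "generation problem suspected"
    if low_precision:
        # Assumption: not a case discussed in this post
        return "retrieval ordering is weak, but the answer stayed faithful"
    return "no problem indicated by these two metrics"


# The baseline scores from this run
print(diagnose(1 / 3, 1 / 3))  # retrieval problem suspected
```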

The next bit of output I got was this:


Metrics Summary

  - ❌ Contextual Precision (score: 0.3333333333333333, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.33 because nodes 1 and 2 are ranked higher than node 3, despite the third document explicitly addressing the energy source needed for generating a warp bubble. Nodes 1 and 2 should be ranked lower as they do not provide relevant information., error: None)
  - ❌ Faithfulness (score: 0.3333333333333333, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.33 because the actual output introduces elements (Hubble's constant H and Equation 29) not supported by the context, and omits key relationships (the connection between H and Λ) that are present in the contradictions., error: None)

For test case:

  - input: Please consider Jeff Nyman's warp drive paper.
What energy source does the paper propose would be needed to
generate the warp bubble for faster-than-light travel?

  - actual output: Okay, let's break down Jeff Nyman's reasoning regarding the energy source for the warp bubble. Here’s my step-by-step thought process:

1.  **Identify the Core Problem:** The central challenge is generating the energy to create the warp bubble itself. The paper isn't focused on the *feasibility* of the warp drive, but rather on modeling its energy requirements.

2.  **Key Physics Reference:** The paper explicitly references the cosmological constant (Λ) and Hubble's constant (H) as fundamental to the energy calculation. The equation 29 shows how the expansion rate of space directly relates to Λ.

3.  **Calculation Approach:**  The paper calculates the energy required by scaling the cosmological constant by a factor of 10^52. This scaling is justified by the equation H ∝ √Λ (Hubble constant is proportional to the square root of the cosmological constant).  This is a critical step – the energy needed isn’t just the cosmological constant itself, but amplified by this relationship.

4.  **Assumed Advanced Technology:** The author postulates that this advanced civilization would use the *most efficient* energy production method. This is a crucial assumption. Without knowing what this "most efficient" method is, we can’t pinpoint a specific energy source.

5.  **Explicit Equation 31:**  Equation 31 explicitly states that the energy “injected” locally would equal the volume of the craft multiplied by the scaled cosmological constant. This reinforces that it's about creating a localized distortion of spacetime.

**Final Answer:**

Based on Jeff Nyman’s reasoning, the paper proposes that an advanced civilization would utilize an energy source that allows them to produce the energy needed to scale the cosmological constant by a factor of 10^52. The exact *type* of energy source is left unspecified due to the focus on the modeling process and the assumption of an arbitrarily advanced civilization employing the most efficient method.

  - expected output: Matter/antimatter annihilation, requiring approximately 10^28 kg of antimatter (equivalent to Jupiter's mass-energy).

  - context: None

  - retrieval context: ['I find it very challenging to mak e predictions on how this warp drive might actually function. My route \nwas to envision a spacecraft with an exotic power generator that could create the necessary energies to \nlocally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract \nthe compactified space-time around it, thereby creating the propulsion effect. That being said, my goal \nin this paper is to work on realistic model rather than a physically realizable device.  \nThe first four sections of this paper revie w the necessary physics required to appreciate the new warp \ndrive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding \nspeed limits and energy requirements will also be presented.', 'of Relativity. An element missing from all the papers is that there is little or no suggestion as to how \nsuch a warp bubble may be created. \nI do not plan to buck that trend too much in that the aim of this paper is not to discuss the plausibility of \na warp drive. This means I am not addressing the valid questions associated with violation of the null', 'I can now look at the energy required to create the necessary warp bubble. The accepted value of the \ncosmological constant is Λ ? 10 -47(GeV)4. Converting again into SI units gives Λ ? 10 -10J/m3. Now, for a \nwarp bubble expanding at the speed of light I would need to increase this again by a factor of 10 52 as I \nhave H ∝ √Λ . I can thus say this: \n \nEquation 29 \nHere Λc is the local value of the cosmological constant when space is expanding at c. 
To make this a \nconcrete example, I will consider a spacecraft of these dimensions: \n \nEquation 30 \nIf I postulate that the warp bubble must, at least, encompass the volume of the craft,  the total amount \nof energy ‘injected’ locally would equal \n \nEquation 31 \nAssuming some arbitrarily advanced civilization was able to create such an effect ,  I will further \npostulate that this civilization would be able to utilize the most efficient method of  energy production,']

======================================================================

Overall Metric Pass Rates

Contextual Precision: 0.00% pass rate
Faithfulness: 0.00% pass rate

Baseline Scores:
Contextual Precision: 0.3333333333333333
Faithfulness: 0.3333333333333333

Let’s dig into this a bit.

Evaluation Output

The evaluate() function from DeepEval generates a structured report card for each test case. Think of it like a teacher grading an exam: not just giving you a final score, but showing their work and explaining why you got that grade.

Each metric gets its own line with several components. The first is a pass/fail indicator (✅ or ❌). This is binary: did the metric meet its threshold? You also get the score, which is the actual numeric value the metric calculated. These are the same scores you saw in the verbose logs at the start of the output. Then there is the threshold, which is the minimum acceptable score (default: 0.5). Combining all that, we see that both Contextual Precision and Faithfulness got ❌ because 0.333 is below the 0.5 threshold.

Notice that mention of strict: False. In DeepEval, strict mode makes a metric binary: when strict is true, the score is forced to 1 for a perfect result and 0 for anything else, and the threshold is effectively raised to 1. When strict is false, as it is here, you get the graded score and the normal comparison against the threshold. Also notice that you get the reason. This is the judge’s explanation copied from the verbose logs, giving you the specific diagnosis: “nodes 1 and 2 are ranked higher than node 3” for Contextual Precision, and “introduces elements (Hubble’s constant H and Equation 29) not supported by the context” for Faithfulness.
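
To make the pass/fail logic concrete, here is a minimal sketch of how a score, a threshold, and the strict flag combine. This is not DeepEval's own code. The function name metric_passes is hypothetical, and the strict branch reflects my reading of the DeepEval documentation (a perfect score is required), so treat that part as an assumption.

```python
def metric_passes(score: float, threshold: float = 0.5, strict: bool = False) -> bool:
    """Return True if a metric score counts as a pass.

    Hypothetical helper for illustration only; not part of DeepEval.
    """
    if strict:
        # Assumption: strict mode only accepts a perfect score.
        return score >= 1.0
    # Normal mode: the score only has to reach the threshold.
    return score >= threshold


print(metric_passes(1 / 3))             # False: 0.33 is below 0.5
print(metric_passes(0.5))               # True: meeting the threshold passes
print(metric_passes(0.8, strict=True))  # False: strict needs a perfect score
```

Under this sketch, both of our baseline scores fail in either mode, which matches the two ❌ markers in the output.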

The “For test case:” section is DeepEval showing you the complete picture of what was being evaluated. Think of this like a lab report: it’s documenting all the inputs and outputs of your experiment. The input is your original query about the energy source. DeepEval echoes it back so you can see exactly what question was being evaluated. The actual_output (populated by the execution model) is where you see what ts-reasoner generated. Notice how detailed and structured the response is: the model gave us a step-by-step breakdown citing Equation 29, discussing H ∝ √Λ, and referencing Equation 31. This looks like a thoughtful, well-reasoned answer.

The expected_output comes directly from the script. This is our gold standard answer: matter/antimatter annihilation requiring approximately 10²⁸ kg of antimatter. It’s what we wanted the model to say. Here’s the critical observation: the actual output never mentions matter/antimatter annihilation. The model focused on the mathematical framework but concluded with “the exact *type* of energy source is left unspecified,” which is factually incorrect! The paper does specify matter/antimatter annihilation later on.

Finally, the retrieval_context (coming from our RAG system) shows the three document chunks our retriever pulled from the vector store. Looking at this context, you can now see why the model struggled: none of these chunks contain the words “matter,” “antimatter,” or “annihilation.” The information simply wasn’t available to the model, regardless of how capable ts-reasoner might be.

Notice how the current output format for the retrieval context makes it hard to distinguish where one chunk ends and another begins. The three chunks are just concatenated in a list, which creates a wall of text that’s difficult to parse visually. DeepEval is fundamentally an evaluation framework, not a debugging tool. The primary consumer of this output might be expected to be other code (parsing logs, generating reports) rather than human eyes scanning console output. In that context, a simple list representation is actually easier to work with programmatically. This is why it helps to generate the chunks separately, as we did earlier.
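
If you want a quick way to make a retrieval context list readable on its own, something like the following would do. It is a small sketch, not the code from the script: format_chunks is a hypothetical helper, and it assumes the retrieval context is a plain list of strings, as it is in the output above.

```python
import textwrap


def format_chunks(chunks: list[str], width: int = 70) -> str:
    """Render each retrieved chunk under its own numbered header."""
    total = len(chunks)
    sections = []
    for position, chunk in enumerate(chunks, start=1):
        header = f"--- Chunk {position} of {total} ({len(chunk)} chars) ---"
        # Collapse the embedded line breaks and extra spaces, then re-wrap.
        body = textwrap.fill(" ".join(chunk.split()), width=width)
        sections.append(f"{header}\n{body}")
    return "\n\n".join(sections)


# Stand-in strings, not the real chunks from the paper.
sample = ["first chunk \nwith a line break", "second chunk"]
print(format_chunks(sample))
```

The position number in each header is the useful part: it is the same ranking that Contextual Precision is judging.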

So, in total, what we have here is the dynamic result of our execution model processing the question (actual_output), where everything else is either predefined (input, expected_output) or retrieved (retrieval_context).

Baseline Results

Running our baseline configuration produced a Contextual Precision score of 0.33 and a Faithfulness score of 0.33. Both metrics failed their threshold of 0.5, giving us a 0% pass rate. These identical scores aren’t a coincidence: they reveal the cascade of failure in our RAG system.

Here I’m going to repeat some points I made earlier because I want to reinforce the analysis now that we did a deep dive into the display of the results. The Contextual Precision metric shows the familiar pattern: the retriever pulled three chunks from the paper, but ranked them poorly. The first two chunks discussed the conceptual framework and goals of the paper without addressing the specific energy source question. The third chunk explicitly discussed energy requirements and calculations, but was buried at position #3. This 0.33 score quantifies what we diagnosed earlier: relevant information exists but gets outranked by topically similar introductory content.
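
It can help to see where that 0.33 comes from. The sketch below is my reading of the weighted cumulative precision formula that DeepEval documents for this metric, written as a standalone function. It is not the library's implementation, and the function name is my own. The input is the judge's relevance verdict for each chunk, in ranked order.

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Weighted cumulative precision over ranked relevance verdicts.

    For every relevant chunk at rank k, add (relevant chunks seen so far / k),
    then divide the sum by the total number of relevant chunks.
    """
    relevant_total = sum(relevance)
    if relevant_total == 0:
        return 0.0
    relevant_so_far = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            weighted_sum += relevant_so_far / rank
    return weighted_sum / relevant_total


# Our baseline: irrelevant, irrelevant, relevant.
print(round(contextual_precision([False, False, True]), 2))  # 0.33
# The same chunks with the relevant one ranked first.
print(contextual_precision([True, False, False]))  # 1.0
```

Notice that the second call uses exactly the same three chunks. Only the order changed, and the score went from 0.33 to 1.0. That is what it means for this metric to measure ranking rather than content.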

The Faithfulness score of 0.33 reveals why poor retrieval cascades into generation problems. The metric evaluated six claims that ts-reasoner made:

Claim 1: “The paper focuses on modeling energy requirements…”
✔ Supported
Claim 2: References cosmological constant Λ and Hubble’s constant H
✘ H not mentioned in retrieved chunks
Claim 3: “Equation 29 shows expansion rate relates to Λ”
✘ Equation 29 not in retrieved chunks
Claim 4: Scaling by 10⁵² justified by H ∝ √Λ
✘ Relationship not stated in context
Claim 5: Advanced civilization uses efficient energy production
✔ Supported
Claim 6: Equation 31 explicitly states energy formula
✘ Not explicitly stated this way

Only two claims out of six were fully supported by the retrieved context, giving us 2/6 ≈ 0.33. But the crucial insight stated earlier is one I’ll state again here: ts-reasoner isn’t hallucinating nonsense. Hubble’s constant, Equation 29, and the H ∝ √Λ relationship are actually in the paper. Look at Chunk 3, where it says “I have H ∝ √Λ” and references “Equation 29.” The model is accurately extracting information from the retrieved chunks, but the Faithfulness metric is correctly flagging that these details, while present, aren’t stated as explicitly or completely as the model’s claims suggest.
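
The arithmetic behind that score is simple enough to write down. This is a minimal sketch that assumes Faithfulness is the fraction of claims the judge marked as supported. The verdict list mirrors the six claims above, and the function name is hypothetical.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of claims that the judge marked as supported."""
    if not verdicts:
        # Assumption for this sketch: refuse to score an empty claim list.
        raise ValueError("need at least one claim verdict")
    return sum(verdicts) / len(verdicts)


# Claims 1 through 6 from the list above: supported or not.
claim_verdicts = [True, False, False, False, True, False]
print(f"{faithfulness_score(claim_verdicts):.2f}")  # 0.33
```

Flip just one of those verdicts to supported and the score becomes 3/6 = 0.5, which would pass the threshold. With so few claims, a single judgment call by the evaluator moves the result a long way.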

Think of this like archaeological stratigraphy. In a well-preserved site, the most recent and relevant artifacts appear in upper layers, clearly dated and contextualized. In a disturbed site, later materials get mixed into earlier strata. Finding our answer at position #3 beneath irrelevant chunks is like discovering critical evidence buried beneath layers of contextual noise: the information is genuine, but its positional relationship is compromised.

This baseline demonstrates the diagnostic pattern we’ve been tracking: Low Contextual Precision + Low Faithfulness = Retrieval Problem. When both metrics score identically low, it indicates the retrieval system isn’t surfacing the right chunks, which then cascades into faithfulness issues. The model is trying to provide a comprehensive answer, but it’s working from incomplete source material.

Test Finding: When evaluation output shows both metrics failing with identical scores, and the model produces detailed but incomplete responses, the diagnostic pattern points to retrieval quality as the bottleneck. The system is splitting the document in ways that separate related concepts, and retrieving too few chunks to capture the complete answer. This establishes our improvement target: we need to enhance retrieval quality to give the generation model better source material to work with.

The Unanswered Question

Let’s step back and look at what actually happened here from the user’s perspective. We asked a straightforward question: “What energy source does the paper propose would be needed to generate the warp bubble for faster-than-light travel?”

The expected answer is clear and specific: matter/antimatter annihilation, requiring approximately 10²⁸ kg of antimatter (equivalent to Jupiter’s mass-energy).

What did ts-reasoner tell us? The model provided a detailed, well-structured analysis of the cosmological constant, the H ∝ √Λ relationship, scaling factors, and equations. It concluded that an advanced civilization “would utilize an energy source that allows them to produce the energy needed to scale the cosmological constant by a factor of 10^52,” but then added: “The exact type of energy source is left unspecified due to the focus on the modeling process.”

The question was not answered. We asked “what energy source?” and the model essentially responded “the paper doesn’t say.” But that’s incorrect! The paper explicitly discusses matter/antimatter annihilation as the energy source.

So what went wrong? It’s tempting to blame the model: “ts-reasoner failed to identify the energy source!” But look at the retrieval context again. Search through all three chunks for the words “matter,” “antimatter,” or “annihilation.” You won’t find them. That information simply wasn’t in the chunks the retriever selected.
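
You don't have to do that search by eye. Here is a small sketch of the check, assuming the retrieval context is a list of strings. The helper name find_terms is hypothetical, and the three strings are abbreviated excerpts of the retrieved chunks with whitespace normalized, not the full text.

```python
def find_terms(chunks: list[str], terms: list[str]) -> dict[str, list[int]]:
    """Map each term to the 1-based positions of the chunks that contain it."""
    hits = {}
    for term in terms:
        needle = term.lower()
        hits[term] = [
            position
            for position, chunk in enumerate(chunks, start=1)
            if needle in chunk.lower()
        ]
    return hits


# Abbreviated excerpts from the three retrieved chunks shown earlier.
retrieved = [
    "an exotic power generator that could create the necessary energies",
    "the aim of this paper is not to discuss the plausibility of a warp drive",
    "utilize the most efficient method of energy production",
]

results = find_terms(retrieved, ["matter", "antimatter", "annihilation"])
missing = [term for term, positions in results.items() if not positions]
print(missing)  # ['matter', 'antimatter', 'annihilation']
```

This is a plain substring match, so it is a blunt instrument. But when every key term from the expected output comes back missing, that is a strong hint to look at retrieval before looking at the model.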

This is the forensic principle at work: you can only make claims based on the evidence you have access to. If a document examiner is asked “Did this person sign the contract?” but is only given photocopies of pages 1 through 3 of a 5-page contract, and the signature is on page 4, the examiner can’t answer the question. Not because they lack expertise, but because they lack access to the relevant evidence.

The ts-reasoner model did exactly what it should do: it analyzed the context it received, extracted relevant details faithfully (the cosmological constant calculations, the assumptions about advanced civilizations), and acknowledged the limitation of that context by stating the energy source type was “left unspecified” in what it had access to. The model didn’t hallucinate an answer. It didn’t make something up. It worked responsibly with incomplete information.

The failure is in retrieval, not generation. This distinction is critical for diagnosing and fixing RAG systems. If we misidentify this as a generation problem (“the model isn’t smart enough”), we might try fine-tuning ts-reasoner or switching to a more powerful model. But that would be treating the symptom, not the cause. The real problem is that our retriever ranked chunks poorly and didn’t surface the information the model needed to answer the question.

Test Finding: When a model produces well-reasoned but incomplete answers, check the retrieval context before assuming model failure. If the missing information isn’t in the retrieved chunks, you’re looking at a retrieval problem that no amount of model improvement will fix. The diagnostic question isn’t “Why didn’t the model know this?” but rather “Why didn’t the retriever provide this?”

This is why we measure both Contextual Precision and Faithfulness together. Faithfulness alone might make it look like the model is failing (0.33 score). But Contextual Precision reveals the root cause: poor chunk ranking (also 0.33). When both metrics fail together, it’s the retrieval system pointing us toward where we need to focus our improvement efforts.

So, what do we do with this? Well, we can test it! Specifically, we can see if some experimental interventions can improve the retrieval precision to surface that missing matter/antimatter information.

A Note on Non-Determinism

I should note that one time I ran this script my Faithfulness baseline actually came back as 0.8! Quite a bit different from the 0.33 shown here.

This variance isn’t a bug; it’s a fundamental characteristic of working with LLMs and RAG systems. Several sources of non-determinism can affect your results and it’s worth pausing a bit here to explain what those are.

Vector similarity search: Chroma’s retrieval can have slight variations in how it ranks semantically similar chunks, especially when multiple chunks have comparable similarity scores.
LLM generation: Unless you set temperature=0, the execution model (ts-reasoner) can produce a different response on every run. Even the same retrieved chunks can lead to different phrasings, emphasis, or detail levels in the generated answer.
Evaluation model variability: The judge model (ts-evaluator) is also an LLM, and its assessment of claims and verdicts can vary between runs, particularly for borderline cases.

When I got a Faithfulness score of 0.8, ts-reasoner likely generated a response that happened to align more closely with what was explicitly stated in the retrieved chunks, making fewer inferential claims about Hubble’s constant or equation relationships. The retrieval context might have been identical, but the model’s interpretation and presentation of that context differed.
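
One practical response to this variance is to run the same test several times and look at the spread rather than a single number. The sketch below shows the idea. Everything in it is an assumption for illustration: summarize_runs is a hypothetical helper, and run_once stands in for whatever function executes your evaluation and returns one score. For the demo I feed it the two Faithfulness values mentioned in this post instead of calling a model.

```python
import statistics
from typing import Callable


def summarize_runs(run_once: Callable[[], float], runs: int = 5,
                   threshold: float = 0.5) -> dict:
    """Call run_once repeatedly and summarize the scores it returns."""
    scores = [run_once() for _ in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "pass_rate": sum(score >= threshold for score in scores) / len(scores),
    }


# Replay the two Faithfulness scores I actually saw: 0.33 and 0.8.
recorded = iter([1 / 3, 0.8])
summary = summarize_runs(lambda: next(recorded), runs=2)
print(summary)
```

Two runs is far too few to say anything statistical, but even this tiny summary makes the point: the same configuration passed once and failed once.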

This is, again, analogous to archaeological field work. If you send three different teams to excavate the same site using the same methodology, they’ll produce similar but not identical results. One team might find an artifact two centimeters deeper than another team recorded. The stratigraphy is the same, but human (or in our case, algorithmic) interpretation introduces variance.

What this means for you: If you run this exact code and get different scores, even significantly different ones, that’s expected. You might see Contextual Precision of 0.5 where I got 0.33, or Faithfulness of 0.6 where I got 0.33. The specific numbers matter less than the diagnostic patterns.

The methodology I’ve shown you is how to interpret whatever results you get:

Are both metrics low? That’s a retrieval problem.
Is Contextual Precision low but Faithfulness high? That’s likely the model compensating for poor retrieval.
Is Contextual Precision high but Faithfulness low? This is likely a generation problem, not retrieval.
Are both metrics high? The system is working well.
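
Those four questions amount to a small decision table, which can be written as a function. This is only a sketch of the reasoning above. The name diagnose is hypothetical, and it assumes that "low" means below the 0.5 threshold we have been using.

```python
def diagnose(contextual_precision: float, faithfulness: float,
             threshold: float = 0.5) -> str:
    """Map a pair of metric scores to the diagnostic pattern described above."""
    precision_ok = contextual_precision >= threshold
    faithful_ok = faithfulness >= threshold
    if not precision_ok and not faithful_ok:
        return "retrieval problem"
    if not precision_ok and faithful_ok:
        return "model likely compensating for poor retrieval"
    if precision_ok and not faithful_ok:
        return "likely a generation problem"
    return "system working well"


print(diagnose(1 / 3, 1 / 3))  # retrieval problem
print(diagnose(1 / 3, 0.8))    # model likely compensating for poor retrieval
```

The second call is the run where my Faithfulness came back as 0.8. The numbers changed, but the pattern still points at retrieval.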

The chunk analysis technique (examining what was actually retrieved), the claim-by-claim verdict review (understanding why the metrics scored as they did), and the multi-metric diagnostic approach (using both together to triangulate the problem): these are all tools that work regardless of your specific numeric results.

In fact, while it’s frustrating as a blog writer trying to teach, experiencing this variance firsthand is valuable. It teaches you something critical about AI testing: you can’t rely on exact reproducibility the way you might with traditional software testing. Instead, you need to understand the patterns of behavior, the types of failures, and the diagnostic signals that point toward root causes. That’s what we’re building here: not a script that always produces the same numbers, but a framework for understanding and improving RAG systems regardless of the specific numbers any given run produces.

Test Finding: Non-determinism in AI testing isn’t a problem to eliminate; it’s a characteristic to understand and work with. Build your testing methodology around pattern recognition and diagnostic thinking rather than exact reproducibility. The goal is not “this test always scores 0.33” but rather “when I see this pattern of scores, here’s what it tells me about my system.”

Next Steps!

Keep this script handy because we’re going to add experiments to it in the next post.
