AI and Testing: Improving Retrieval Quality, Part 2

In the previous post we set up a test experiment around DeepEval and used DeepEval’s evaluation function to establish a quality baseline. That post ended with the need for experiments we can compare against that baseline, and that’s what we’ll do in this post.

We’re going to continue right from the previous post and the script we ended up with, so make sure you have that loaded up in your editor. You can get the final version we ended up with from my repo: retrieval-quality-001.py.

The Experimental Approach

Before we dive into experiments, let me explain how we’re going to work through this post, because there’s an important distinction between what I did to generate the analysis you’re about to read, and what I recommend you do as you follow along.

What I did: I set up all four experiments in my script at once, then ran the entire test suite in a single session: one baseline followed by all four experiments sequentially. This gave me a consistent baseline to compare against and minimized the temporal variance between experiments. All the results and analysis you’ll see in this post come from that single run.

What you should do: As you read through this post, I’ll show you each experiment incrementally: explaining the hypothesis, showing the code to add, and walking through my results. But here’s the key: don’t run your script after adding each experiment. Instead, add all four experiments to your script as we go through the post, and then run everything once at the end. This way, you’ll get comparable results from a single session, just like I did.

This can matter quite a bit, at least from a pedagogical perspective. As I mentioned at the end of the previous post, RAG systems are non-deterministic. If you run your baseline, then add Experiment 1 and run again, then add Experiment 2 and run again, you’re essentially getting a fresh baseline each time. Your Experiment 1 might compare against a baseline of 0.33, but by the time you run Experiment 2, your “re-baselined” system might score 0.5, making the comparisons meaningless.

Running everything in one session keeps your baseline consistent across all experiments, making it possible to actually compare which interventions helped most. Think of it like a scientific experiment with a control group. If you change your control group between each test, you can’t meaningfully compare the experimental groups. The baseline needs to remain stable for the comparisons to be valid.
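To make the control-group point concrete, here is a minimal sketch of how a shifting baseline distorts a comparison. The helper name score_deltas and the experiment score of 0.5 are assumptions for illustration only; the two baseline values (0.33 and 0.5) are the ones from the example above.

```python
def score_deltas(baseline, experiments):
    """Return each experiment's per-metric change relative to ONE shared baseline."""
    return {
        name: {
            metric: round(scores[metric] - baseline[metric], 4)
            for metric in baseline
        }
        for name, scores in experiments.items()
    }


# Hypothetical experiment score, used only to illustrate the effect.
experiments = {"Experiment 2": {"Contextual Precision": 0.5}}

# The same experiment result judged against two different baselines.
original_baseline = {"Contextual Precision": 0.33}
rebaselined = {"Contextual Precision": 0.5}

against_original = score_deltas(original_baseline, experiments)
against_rebaselined = score_deltas(rebaselined, experiments)

print("Against original baseline:", against_original)
print("Against re-run baseline:  ", against_rebaselined)
```

The experiment looks like an improvement against one baseline and like no change against the other, even though nothing about the experiment itself differs. Holding the baseline fixed for the whole session removes that ambiguity.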

The practical upshot is that this post is more about showing you my analysis of the results I got. You can then run the script and perform a similar analysis on the results you got.

What We’re Testing

We established in the previous post that our baseline configuration (chunk_size=1000, chunk_overlap=200, k=3) produced poor results: Contextual Precision of 0.33 and Faithfulness of 0.33. Both metrics indicated that relevant information was being retrieved but poorly ranked, and the model was working with incomplete context.

Our experiments will test different retrieval strategies to see if we can improve these scores:

Experiment 1: Smaller chunks – Does reducing chunk size improve retrieval precision?
Experiment 2: More retrieval – Does retrieving more chunks (k=5) provide better context?
Experiment 3: Semantic chunking – Does respecting document structure boundaries help?
Experiment 4: Combined approach – What happens when we combine multiple strategies?

For each experiment, I’ll show you:

The hypothesis behind the change
The code to implement it
The complete output from my run
How to analyze what the results mean

Again, your specific scores will differ from mine due to non-determinism, but the analytical framework (looking at directional changes, chunk quality, and metric consistency) will help you understand your own results.

As described above, I ran everything for this post in one session. And as expected, my baseline scores differ slightly from the previous post. Here is the full output of my current baseline:


============================================================
BASELINE: chunk_size=1000, chunk_overlap=200, k=3
============================================================
Document split into 38 chunks

------------------------------------------------------------
RETRIEVED CHUNKS:
------------------------------------------------------------

--- Chunk 1 ---
I find it very challenging to mak e predictions on how this warp drive might actually function. My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract the compactified space-time around it, thereby creating the propulsion effect. That being said, my goal in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.

--- Chunk 2 ---
of Relativity. An element missing from all the papers is that there is little or no suggestion as to how such a warp bubble may be created. I do not plan to buck that trend too much in that the aim of this paper is not to discuss the plausibility of a warp drive. This means I am not addressing the valid questions associated with violation of the null

--- Chunk 3 ---
I can now look at the energy required to create the necessary warp bubble. The accepted value of the cosmological constant is Λ ≈ 10 -47(GeV)4. Converting again into SI units gives Λ ≈ 10 -10J/m3. Now, for a warp bubble expanding at the speed of light I would need to increase this again by a factor of 10 52 as I have H ∝ √Λ . I can thus say this:

Equation 29
Here Λc is the local value of the cosmological constant when space is expanding at c. To make this a concrete example, I will consider a spacecraft of these dimensions:

Equation 30
If I postulate that the warp bubble must, at least, encompass the volume of the craft,  the total amount of energy ‘injected’ locally would equal

Equation 31
Assuming some arbitrarily advanced civilization was able to create such an effect ,  I will further postulate that this civilization would be able to utilize the most efficient method of  energy production,
------------------------------------------------------------

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The first document does not provide any specific information about the energy source needed for generating a warp bubble. It only discusses the conceptual framework and goals of the paper."
    },
    {
        "verdict": "no",
        "reason": "The second document also does not address the energy requirements or sources for creating a warp bubble, focusing instead on the plausibility and physical constraints of a warp drive."
    },
    {
        "verdict": "yes",
        "reason": "The third document explicitly discusses the energy required to create the necessary warp bubble. It provides detailed calculations and equations that directly relate to the expected output: 'Matter/antimatter annihilation, requiring approximately 10^28 kg of antimatter (equivalent to Jupiter's mass-energy).'"
    }
]

Score: 0.3333333333333333
Reason: The score is 0.33 because nodes 1 and 2 are ranked higher than node 3, despite the third document explicitly addressing the energy source needed for generating a warp bubble. Nodes 1 and 2 should be ranked lower as they do not provide relevant information.

======================================================================
**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "The paper focuses on a realistic model of a warp drive rather than discussing its physical realizability.",
    "The first four sections review necessary physics for understanding the new warp drive model.",
    "Calculations regarding speed limits and energy requirements are presented in the paper.",
    "The cosmological constant Λ is approximately 10^-47 (GeV)^4 or 10^-10 J/m3 in SI units.",
    "For a warp bubble expanding at the speed of light, the required energy would be increased by a factor of 10^52.",
    "The paper considers a spacecraft with dimensions that must be encompassed by the warp bubble.",
    "It is assumed that an advanced civilization could utilize the most efficient method of energy production."
]

Claims:
[
    "Jeff's primary goal is to build a model that can analyze the energy requirements for a warp drive, not to create a physically realistic device.",
    "The core of the model involves creating a warp bubble through localized manipulation of extra dimensions using an exotic power generator.",
    "The paper leverages the cosmological constant (Λ) as a fundamental parameter, noting its value in SI units as approximately 10^-10 J/m^3.",
    "A key calculation in the model involves multiplying the cosmological constant by a factor of 10^52 due to the relationship H ∝ √Λ.",
    "The paper assumes that an advanced civilization would use the most efficient method of energy production for the generator's purpose.",
    "The entire model hinges on the assumption of an incredibly efficient energy production method, which is far beyond current understanding of physics.",
    "The magnitude of the energy required (10^52) is significant and reinforces the technological challenges involved."
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context states that calculations regarding speed limits and energy requirements are presented, indicating that the model does involve creating a warp bubble through localized manipulation of extra dimensions."
    },
    {
        "verdict": "yes",
        "reason": "The context mentions that the paper leverages the cosmological constant (\u039b) as a fundamental parameter with its value in SI units as approximately 10^-10 J/m^3, which aligns with the claim."
    },
    {
        "verdict": "no",
        "reason": "While the context does not explicitly state that H \u221d \u221a\u039b, it mentions calculations involving the cosmological constant and energy requirements, suggesting a relationship between these parameters."
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "idk",
        "reason": "The context states that an advanced civilization could utilize the most efficient method of energy production, but it does not provide enough information to definitively say whether this assumption is based on current understanding or not."
    },
    {
        "verdict": "no",
        "reason": "The context indicates that the magnitude of the energy required (10^52) is significant and aligns with the technological challenges involved in the model, supporting the claim."
    }
]

Score: 0.5714285714285714
Reason: The score is 0.57 because the actual output contradicts the context by implying that calculations regarding speed limits are present, which is not mentioned in the retrieval context. Additionally, it suggests a specific relationship (H ∝ √Λ) between parameters without evidence from the context.

======================================================================
======================================================================

Metrics Summary

  - ❌ Contextual Precision (score: 0.3333333333333333, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.33 because nodes 1 and 2 are ranked higher than node 3, despite the third document explicitly addressing the energy source needed for generating a warp bubble. Nodes 1 and 2 should be ranked lower as they do not provide relevant information., error: None)
  - ✅ Faithfulness (score: 0.5714285714285714, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.57 because the actual output contradicts the context by implying that calculations regarding speed limits are present, which is not mentioned in the retrieval context. Additionally, it suggests a specific relationship (H ∝ √Λ) between parameters without evidence from the context., error: None)

For test case:

  - input: Please consider Jeff Nyman's warp drive paper.
What energy source does the paper propose would be needed to
generate the warp bubble for faster-than-light travel?

  - actual output: Okay, let's break down Jeff Nyman's approach to this warp drive concept. Here's my step-by-step reasoning:

1.  **Initial Goal:** Jeff's primary goal isn’t to create a physically realistic warp drive, but to build a *model* that can analyze the energy requirements. He’s acknowledging the significant challenges in actually building such a device.

2.  **Warp Bubble Creation:** The core of the model is creating a warp bubble through localized manipulation of extra dimensions, driven by an exotic power generator.

3.  **Cosmological Constant:** The paper leverages the cosmological constant (Λ) – approximately 10^-47 (GeV)⁴ – as a fundamental parameter. This is a crucial starting point. The paper notes a conversion to SI units (Λ ≈ 10^-10 J/m³).

4.  **Expansion Factor:**  A key calculation involves multiplying the cosmological constant by a factor of 10⁵² due to the relationship H ∝ √Λ (where H is the Hubble constant). This amplification is central to generating the necessary energy.

5.  **Efficient Energy Production:** The paper explicitly states the assumption that an advanced civilization would use the “most efficient method of energy production.” This is a key assumption that drives the calculations.

**Therefore, based on this analysis, the paper proposes that an exotic power generator, utilizing an extremely efficient method of energy production, is needed to generate the warp bubble.** The generator's purpose is to amplify the cosmological constant to create the necessary energy density for localized spacetime manipulation.

**Risk Assessment & Considerations:**

*   **Assumption of Efficiency:** The entire calculation hinges on the assumption of an incredibly efficient energy production method – something far beyond our current understanding of physics. This is a *major* assumption and a potential point of failure for the model's realism.
*   **Scale of Energy:** The magnitude of the energy required (10⁵²) is staggering, reinforcing the technological challenges.
*   **Falsification:**  If a more realistic model could be developed that required significantly less energy, it would challenge the assumptions underpinning this approach.

Do you want me to delve deeper into a particular aspect of this reasoning, such as the justification for the 10⁵² factor or a discussion of potential alternative energy sources?

  - expected output: Matter/antimatter annihilation, requiring
approximately 10^28 kg of antimatter (equivalent to Jupiter's
mass-energy).

  - context: None

  - retrieval context: ['I find it very challenging to mak e predictions on how this warp drive might actually function. My route \nwas to envision a spacecraft with an exotic power generator that could create the necessary energies to \nlocally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract \nthe compactified space-time around it, thereby creating the propulsion effect. That being said, my goal \nin this paper is to work on realistic model rather than a physically realizable device.  \nThe first four sections of this paper revie w the necessary physics required to appreciate the new warp \ndrive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding \nspeed limits and energy requirements will also be presented.', 'of Relativity. An element missing from all the papers is that there is little or no suggestion as to how \nsuch a warp bubble may be created. \nI do not plan to buck that trend too much in that the aim of this paper is not to discuss the plausibility of \na warp drive. This means I am not addressing the valid questions associated with violation of the null', 'I can now look at the energy required to create the necessary warp bubble. The accepted value of the \ncosmological constant is Λ ≈ 10 -47(GeV)4. Converting again into SI units gives Λ ≈ 10 -10J/m3. Now, for a \nwarp bubble expanding at the speed of light I would need to increase this again by a factor of 10 52 as I \nhave H ∝ √Λ . I can thus say this: \n \nEquation 29 \nHere Λc is the local value of the cosmological constant when space is expanding at c. 
To make this a \nconcrete example, I will consider a spacecraft of these dimensions: \n \nEquation 30 \nIf I postulate that the warp bubble must, at least, encompass the volume of the craft,  the total amount \nof energy ‘injected’ locally would equal \n \nEquation 31 \nAssuming some arbitrarily advanced civilization was able to create such an effect ,  I will further \npostulate that this civilization would be able to utilize the most efficient method of  energy production,']

======================================================================

Overall Metric Pass Rates

Contextual Precision: 0.00% pass rate
Faithfulness: 100.00% pass rate

Baseline Scores:
Contextual Precision: 0.3333333333333333
Faithfulness: 0.5714285714285714

In the last post, I did a deep dive into each section of this output, so I won’t repeat that here. But do note the following:

Previous post baseline: Contextual Precision: 0.33; Faithfulness: 0.33
This post’s baseline: Contextual Precision: 0.33 (identical); Faithfulness: 0.57 (higher)

What stayed the same and what changed is important to note. The retrieved chunks were identical: same three chunks, same problematic ranking (irrelevant chunks 1 and 2, relevant chunk 3). This gives me the same Contextual Precision score of 0.33, confirming my retrieval diagnosis.
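If you want to see why a "no, no, yes" ranking lands on exactly 0.33, here is a small sketch of a rank-weighted precision calculation. It follows the weighted cumulative precision formula as I understand it from DeepEval's description of Contextual Precision; treat the function as an illustrative assumption rather than DeepEval's actual implementation.

```python
def contextual_precision(verdicts):
    """Rank-weighted precision over an ordered list of 'yes'/'no' verdicts.

    For every relevant node at rank k, add (relevant nodes seen so far / k),
    then divide by the total number of relevant nodes.
    """
    total_relevant = sum(1 for v in verdicts if v == "yes")
    if total_relevant == 0:
        return 0.0

    score = 0.0
    relevant_so_far = 0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict == "yes":
            relevant_so_far += 1
            score += relevant_so_far / rank
    return score / total_relevant


# The verdict order from the baseline log: relevant chunk ranked last.
baseline_score = contextual_precision(["no", "no", "yes"])

# The same three chunks with the relevant one ranked first.
ideal_score = contextual_precision(["yes", "no", "no"])

print(f"Relevant chunk last:  {baseline_score:.2f}")
print(f"Relevant chunk first: {ideal_score:.2f}")
```

The point of the sketch is that the metric punishes position, not just presence. The relevant chunk was retrieved in both cases; only its rank changed.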

But Faithfulness improved from 0.33 to 0.57. Why? Looking at the actual output, I found that ts-reasoner generated a different response this time. Instead of making specific claims about “Equation 29” and “Hubble’s constant H,” the model focused more on what was explicitly in the retrieved chunks: the cosmological constant calculations, the assumption of efficient energy production, and the exotic power generator concept. The response was more cautious and stayed closer to what it could directly support from the context.

This demonstrates an important point: retrieval is more deterministic than generation. The same question with the same chunking strategy tends to retrieve the same chunks (hence identical Contextual Precision). But the LLM’s interpretation and presentation of those chunks varies between runs (hence variable Faithfulness).
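The 0.57 Faithfulness score can be reproduced from the verdict list in the verbose log. This is a sketch under one assumption: that any verdict other than "no" (so both "yes" and "idk") counts as not contradicting the context. That assumption matches the numbers in my log, but I'm presenting it as a reading of the output, not as DeepEval's definitive code.

```python
def faithfulness_score(verdicts):
    """Fraction of claims whose verdict is anything other than 'no'."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    supported = sum(1 for v in verdicts if v.strip().lower() != "no")
    return supported / len(verdicts)


# Verdicts copied, in order, from the baseline Faithfulness log.
baseline_verdicts = ["yes", "no", "yes", "no", "yes", "idk", "no"]

baseline_faithfulness = faithfulness_score(baseline_verdicts)
print(f"Faithfulness: {baseline_faithfulness:.2f}")
```

Notice how sensitive this is with only seven claims: flipping a single verdict moves the score by about 0.14, which is one reason Faithfulness varies more between runs than Contextual Precision does.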

What matters for our experiments is this: we have a consistent Contextual Precision baseline of 0.33 showing poor chunk ranking. Our experiments will test whether different retrieval strategies can improve that score. The Faithfulness variance between runs just reinforces why we’re running all experiments in one session: to keep the baseline stable for meaningful comparisons.

Experiment 1: Smaller Chunks

Our first hypothesis is that 1000-character chunks might be too large, causing the retriever to match on broad topical similarity rather than specific query relevance. Think about it: a large chunk discussing “warp drives in general” might score higher in semantic similarity than a smaller, more focused chunk that specifically discusses “energy calculations for warp bubbles.”

Smaller chunks should be more semantically focused, which could help the retriever surface precisely relevant information rather than topically related information. Let’s try cutting the chunk size in half to 500 characters, with proportionally reduced overlap (100 characters). Add this experiment to the bottom of your current script.

# =========================================================
# EXPERIMENT 1: Smaller Chunks
# =========================================================
print("\n" + "=" * 60)
print("EXPERIMENT 1: chunk_size=500, chunk_overlap=100, k=3")
print("=" * 60)

retriever_exp1, num_chunks_exp1 = create_rag_system(
  chunk_size=500,
  chunk_overlap=100,
  k=3
)

print(f"Document split into {num_chunks_exp1} chunks")

exp1_results, exp1_context, exp1_response = run_test(
  retriever_exp1,
  question,
  expected_output
)

print_scores("Experiment 1", exp1_results, baseline_results)
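Before looking at any results, it helps to have a rough expectation of how many chunks this configuration will produce. The sketch below assumes a plain fixed-size sliding window, which is simpler than what a separator-aware splitter does, and the document length of 30,000 characters is a hypothetical value chosen for illustration.

```python
import math


def estimate_chunk_count(doc_length, chunk_size, chunk_overlap):
    """Estimate chunk count for a fixed-size sliding window over doc_length characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if doc_length <= 0:
        return 0
    if doc_length <= chunk_size:
        return 1
    # Each new chunk advances by the stride: the size minus the overlap.
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((doc_length - chunk_size) / stride)


# Hypothetical document length, not measured from the actual paper.
DOC_LENGTH = 30_000

baseline_estimate = estimate_chunk_count(DOC_LENGTH, chunk_size=1000, chunk_overlap=200)
exp1_estimate = estimate_chunk_count(DOC_LENGTH, chunk_size=500, chunk_overlap=100)

print(f"Baseline estimate:     {baseline_estimate} chunks")
print(f"Experiment 1 estimate: {exp1_estimate} chunks")
```

Halving both the size and the overlap halves the stride (800 characters down to 400), so the estimate roughly doubles. A real splitter that breaks on paragraph and sentence boundaries will not match this exactly, so treat the estimate as a sanity check on the order of magnitude rather than a prediction.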

What to Look For

When examining these results, we’re watching for several things:

Did Contextual Precision improve? If smaller chunks helped the retriever rank relevant information higher, we should see a score increase.
Did Faithfulness correlate? As I predicted in the Faithfulness post, if Contextual Precision improves, Faithfulness should improve too, since better retrieval leads to more grounded generation.
Are the retrieved chunks different? We can compare the actual chunk content between baseline and this experiment to see if the retriever is pulling different sections of the paper.
What’s the total chunk count? Smaller chunks mean the document gets split into more pieces. The baseline created 38 chunks; this configuration should create roughly double that. More granular chunks could help precision but might also split related information awkwardly.
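For the "are the retrieved chunks different?" question, eyeballing long chunks in terminal output gets tedious. Here is a small sketch that compares two lists of retrieved chunk strings. The function name compare_contexts and the variable baseline_context are assumptions of mine, and the sample strings are placeholders rather than real chunks; in your script you would pass in whatever lists of chunk text your baseline and experiment runs returned.

```python
def normalize(text):
    """Collapse runs of whitespace so line-wrapping differences are ignored."""
    return " ".join(text.split())


def compare_contexts(baseline_chunks, experiment_chunks):
    """Report which chunks are shared and which are unique to each run."""
    base = {normalize(chunk) for chunk in baseline_chunks}
    exp = {normalize(chunk) for chunk in experiment_chunks}
    return {
        "shared": sorted(base & exp),
        "only_baseline": sorted(base - exp),
        "only_experiment": sorted(exp - base),
    }


# Placeholder strings standing in for real retrieved chunks.
baseline_context = ["intro \nparagraph", "scope statement", "energy calculation"]
exp1_context = ["intro paragraph", "quantum field theory aim", "prior work summary"]

diff = compare_contexts(baseline_context, exp1_context)

print(f"Shared chunks: {len(diff['shared'])} of {len(exp1_context)}")
for chunk in diff["only_baseline"]:
    print("Dropped from baseline:", chunk[:80])
for chunk in diff["only_experiment"]:
    print("New in experiment:    ", chunk[:80])
```

Exact matching after whitespace normalization only catches identical chunks. When chunk sizes differ, a chunk from one run may be a fragment of a chunk from the other, so a substring check would be a reasonable extension.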

Remember: Don’t run this yet! Add this code to your script and continue reading. You should run all experiments together at the end. Or, rather, do whatever you please! Just keep in mind the caveat about baselines.

Here is what I got when I ran this:


============================================================
EXPERIMENT 1: chunk_size=500, chunk_overlap=100, k=3
============================================================
Document split into 69 chunks

------------------------------------------------------------
RETRIEVED CHUNKS:
------------------------------------------------------------

--- Chunk 1 ---
I find it very challenging to mak e predictions on how this warp drive might actually function. My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract the compactified space-time around it, thereby creating the propulsion effect. That being said, my goal in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.

--- Chunk 2 ---
energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble could be gen erated using ideas and mathematics that are consistent with quantum field theory . This may have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently advanced technology. By associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of

--- Chunk 3 ---
for example [19]. The basic idea of all of these is to formulate a solution to Einstein’s equations whereby a warp bubble is driven by a local expansion of space-time behind the bubble and a contraction ahead of the bubble. One common feature of these papers is that thei r physical foundation is the General Theory of Relativity. An element missing from all the papers is that there is little or no suggestion as to how such a warp bubble may be created.
------------------------------------------------------------

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The first document does not provide any specific information about the energy source needed for generating the warp bubble, nor does it mention matter/antimatter annihilation or Jupiter's mass-energy."
    },
    {
        "verdict": "no",
        "reason": "The second document discusses the theoretical consistency of a warp bubble with quantum field theory and the Casimir Energy but does not specify the energy source required for its creation."
    },
    {
        "verdict": "yes",
        "reason": "The third document mentions that 'An element missing from all the papers is that there is little or no suggestion as to how such a warp bubble may be created,' which indirectly supports the need for an advanced and specific energy source, aligning with the expected output."
    }
]

Score: 0.3333333333333333
Reason: The score is 0.33 because nodes 1 and 2 are ranked higher than node 3, despite being marked 'no'. Node 3 directly addresses the need for an advanced energy source, which aligns with the input's query, while nodes 1 and 2 do not provide specific information about the required energy source.

======================================================================
**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "The paper aims to work on a realistic model of a warp drive rather than a physically realizable device.",
    "The first four sections of the paper review necessary physics for understanding the new warp drive model.",
    "Calculations regarding speed limits and energy requirements will be presented in the latter part of the paper.",
    "The goal is to suggest that a warp bubble could be generated using ideas consistent with quantum field theory.",
    "All papers on this topic are based on the General Theory of Relativity.",
    "There is little or no suggestion as to how such a warp bubble may be created in any of the papers."
]

Claims:
[
    "The paper discusses a theoretical warp drive based on manipulating spacetime.",
    "The core idea is to create a 'warp bubble' using an exotic power generator.",
    "The paper focuses on a model rather than a physically realizable device.",
    "The proposed exotic power generator would be used to locally manipulate extra dimensions.",
    "Manipulating these extra dimensions is intended to lead to the expansion and contraction of spacetime around the spacecraft.",
    "The purpose of this manipulation is to hypothesize how such a bubble could be created by sufficiently advanced technology."
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context states that the paper aims to work on a realistic model of a warp drive rather than a physically realizable device, and does not mention an exotic power generator."
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "idk",
        "reason": "While the context mentions ideas consistent with quantum field theory, it does not provide information about using an exotic power generator to manipulate extra dimensions."
    },
    {
        "verdict": "no",
        "reason": "The context states that all papers on this topic are based on the General Theory of Relativity and do not mention manipulating extra dimensions or expanding/contracting spacetime around a spacecraft."
    },
    {
        "verdict": "idk",
        "reason": "There is no direct contradiction in the context regarding the purpose of the manipulation, only that it aims to suggest how such a bubble could be created by sufficiently advanced technology."
    }
]

Score: 0.6666666666666666
Reason: The score is 0.67 because the actual output incorrectly suggests that the paper discusses exotic power generators and manipulating extra dimensions, which contradicts the stated focus on a realistic model based on General Theory of Relativity.

======================================================================
======================================================================

Metrics Summary

  - ❌ Contextual Precision (score: 0.3333333333333333, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.33 because nodes 1 and 2 are ranked higher than node 3, despite being marked 'no'. Node 3 directly addresses the need for an advanced energy source, which aligns with the input's query, while nodes 1 and 2 do not provide specific information about the required energy source., error: None)
  - ✅ Faithfulness (score: 0.6666666666666666, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.67 because the actual output incorrectly suggests that the paper discusses exotic power generators and manipulating extra dimensions, which contradicts the stated focus on a realistic model based on General Theory of Relativity., error: None)

For test case:

  - input: Please consider Jeff Nyman's warp drive paper.
What energy source does the paper propose would be needed to
generate the warp bubble for faster-than-light travel?

  - actual output: Okay, let's break down this problem step-by-step, applying Jeff Nyman's testing mindset.

1.  **Understanding the Context:** We have a paper discussing a theoretical warp drive based on manipulating spacetime. The core idea is to create a "warp bubble" using an exotic power generator. The paper explicitly states it's focusing on a *model* rather than a physically realizable device.

2.  **Identifying Key Elements:** The paper proposes an exotic power generator. This generator’s role is to create the necessary energy to locally manipulate extra dimensions, leading to the expansion and contraction of spacetime around the spacecraft.

3.  **Analyzing the Relevant Passages:** The passage states: "My route \nwas to envision a spacecraft with an exotic power generator that could create the necessary energies to \nlocally manipulate the extra dimension(s)." and “This may have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently \nadvanced technology.”

4.  **Synthesizing the Answer:** Based on this analysis, the paper proposes that an *exotic power generator* is needed to generate the warp bubble. This generator is intended to manipulate extra dimensions and create the necessary energy for spacetime distortion.

**Final Answer:** The paper proposes an exotic power generator to generate the warp bubble for faster-than-light travel.

**Risk Assessment & Considerations:**

*   **Assumption:** The paper assumes the existence and controllability of “extra dimensions.” This is a significant assumption, as the existence of such dimensions is currently theoretical.
*   **Falsification:**  A key area for testing this idea would be to develop a mathematical model that *explicitly* describes the energy requirements and spacetime distortion caused by this exotic power generator.  A model that accurately predicts behavior, even if theoretically based, would provide evidence supporting the concept. Conversely, if the model fails to produce a consistent or plausible description of the warp bubble’s formation, it would cast doubt on the approach.

  - expected output: Matter/antimatter annihilation, requiring
approximately 10^28 kg of antimatter (equivalent to Jupiter's
mass-energy).

  - context: None

  - retrieval context: ['I find it very challenging to mak e predictions on how this warp drive might actually function. My route \nwas to envision a spacecraft with an exotic power generator that could create the necessary energies to \nlocally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract \nthe compactified space-time around it, thereby creating the propulsion effect. That being said, my goal \nin this paper is to work on realistic model rather than a physically realizable device.  \nThe first four sections of this paper revie w the necessary physics required to appreciate the new warp \ndrive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding \nspeed limits and energy requirements will also be presented.', 'energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble \ncould be gen erated using ideas and mathematics that are consistent with quantum field theory . This \nmay have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently \nadvanced technology. \nBy associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of', 'for example [19]. The basic idea of all of these is to formulate a solution to Einstein’s equations whereby \na warp bubble is driven by a local expansion of space-time behind the bubble and a contraction ahead of \nthe bubble. One common feature of these papers is that thei r physical foundation is the General Theory \nof Relativity. An element missing from all the papers is that there is little or no suggestion as to how \nsuch a warp bubble may be created.']

======================================================================

Overall Metric Pass Rates

Contextual Precision: 0.00% pass rate
Faithfulness: 100.00% pass rate

Experiment 1 Scores:
Contextual Precision: 0.3333333333333333
Faithfulness: 0.6666666666666666

Comparison to Baseline:
Contextual Precision: +0.00
Faithfulness: +0.10

Experiment 1 Results: Smaller Chunks

Reducing chunk size from 1000 to 500 characters split our document into 69 chunks (up from 38 in the baseline). This is roughly the doubling we expected: more granular chunks mean more retrieval options. But did this granularity help?

Well, the scores tell a mixed story.

Contextual Precision: 0.33 (unchanged from baseline)
Faithfulness: 0.67 (+0.10 improvement from baseline’s 0.57)

Contextual Precision remained at 0.33, showing the same problematic pattern: relevant information ranked last. Looking at the retrieved chunks, we can see why. The first two chunks still discuss general concepts (the exotic power generator, quantum field theory consistency) without addressing the specific energy source question. Only Chunk 3 gets closer to the answer by noting that existing papers don’t explain “how such a warp bubble may be created.”
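To see why a single relevant chunk in last place produces exactly 0.33, here is a minimal sketch of a rank-weighted precision calculation. It assumes the metric uses the formula described in DeepEval's documentation (precision at each relevant rank, averaged over the number of relevant chunks). The function name and the boolean verdict list are illustrative only and are not part of DeepEval's API or of our script.

```python
def contextual_precision(verdicts):
    """Rank-weighted precision over retrieved chunks.

    verdicts: booleans in rank order, True when the chunk at that
    rank was judged relevant to the question.
    """
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0  # nothing relevant was retrieved at all

    score = 0.0
    relevant_so_far = 0
    for rank, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at this rank; only relevant positions contribute.
            score += relevant_so_far / rank
    return score / relevant_total


# One relevant chunk, ranked third of three (consistent with the 0.33 above).
print(round(contextual_precision([False, False, True]), 2))  # 0.33

# The same relevant chunk ranked first would score perfectly.
print(round(contextual_precision([True, False, False]), 2))  # 1.0
```

Under this formula the score depends on where the relevant chunk lands, not just on whether it was retrieved, which is why swapping in different chunks with the same ordering problem leaves the number unchanged.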

What’s particularly interesting is that the chunks retrieved are completely different from the baseline, yet they exhibit the same ranking problem. The baseline pulled a chunk about energy calculations with the cosmological constant. This experiment pulled chunks about quantum field theory and general warp drive concepts. Both retrievals failed to surface the matter/antimatter annihilation information, just in different ways.

This reveals something important about smaller chunks: they give the retriever more options to choose from (69 vs 38), but more options don’t automatically mean better choices. The retriever is still matching on semantic similarity, and “exotic power generator” and “quantum field theory” are semantically similar to “energy source” even though they’re not the specific answer we need.

However, Faithfulness improved modestly to 0.67. Why? Looking at the actual output, ts-reasoner concluded that “an exotic power generator” is the proposed energy source. This is technically present in the retrieved chunks (Chunk 1 explicitly mentions it), so the model stayed more faithful to what it received. The model didn’t reach for external concepts like Hubble’s constant or specific equations that weren’t in the context.

But notice: the question still isn’t answered correctly. The expected output is “matter/antimatter annihilation,” not “exotic power generator.” An exotic power generator is a mechanism, not an energy source. It’s like asking “what fuel does your car use?” and getting the answer “a fuel injection system.” Technically related, but not what was asked.

What went wrong? The problem is that none of these three chunks contain the words “matter,” “antimatter,” or “annihilation.” The information simply isn’t available to the model. Smaller chunks didn’t help because the retriever still isn’t finding the right section of the paper. We’ve made the haystack bigger (69 chunks instead of 38) but the needle is still missing from what gets retrieved.
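A quick way to confirm this kind of gap without reading every chunk by eye is a keyword check over the retrieved text. The sketch below is a hypothetical helper, not part of the script from the previous post; the sample strings are short excerpts from the three chunks retrieved in this experiment.

```python
def chunks_mentioning(chunks, keywords):
    """Return the 1-based ranks of chunks that contain any keyword.

    Matching is a case-insensitive substring test, so it is only a
    rough diagnostic (for example, "matter" also matches "no matter").
    """
    lowered = [keyword.lower() for keyword in keywords]
    return [
        rank
        for rank, text in enumerate(chunks, start=1)
        if any(keyword in text.lower() for keyword in lowered)
    ]


# Short excerpts from the three chunks retrieved in Experiment 1.
retrieved = [
    "envision a spacecraft with an exotic power generator",
    "mathematics that are consistent with quantum field theory",
    "little or no suggestion as to how such a warp bubble may be created",
]

print(chunks_mentioning(retrieved, ["matter", "antimatter", "annihilation"]))  # []
print(chunks_mentioning(retrieved, ["exotic power generator"]))  # [1]
```

An empty list for the expected-answer terms means the generator never had a chance, whatever the Faithfulness score says.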

Test Finding: Smaller chunks alone don’t improve retrieval quality if the semantic matching strategy remains unchanged. The retriever found different chunks than the baseline but made the same type of error: selecting topically related but not query-relevant information. Faithfulness improved slightly because the model had less temptation to reach beyond the context, but the fundamental retrieval failure persists.

This experiment demonstrates why chunk size is only one variable in retrieval quality. The strategy for matching chunks to queries matters as much as the chunk granularity. We’re still using pure semantic similarity search, which continues to prioritize broad topical relevance over specific answer content.

Experiment 2: More Retrieved Chunks

Our second hypothesis addresses a different potential problem: what if the relevant information is being retrieved, but we’re not pulling enough chunks to capture the complete answer? By only retrieving k=3 chunks, we might be missing critical context. Looking at the baseline results, the energy calculation chunk appeared at position #3. What if the specific detail about matter/antimatter annihilation is in a chunk that ranked fourth or fifth?

Increasing k to 5 might give the retriever more opportunities to find all the pieces of information needed to answer the question completely. However, there’s a tradeoff: more chunks means more tokens consumed, potentially more noise in the context, and a larger window for the model to get distracted from the most relevant information. There’s also a ranking consideration. If we retrieve 5 chunks but the most relevant ones are still at positions 4 and 5, Contextual Precision will remain low even though we’ve retrieved more total chunks.

Add the next experiment to the bottom of your script.

# =========================================================
# EXPERIMENT 2: More Chunks
# =========================================================
print("\n" + "=" * 60)
print("EXPERIMENT 2: chunk_size=1000, chunk_overlap=200, k=5")
print("=" * 60)

retriever_exp2, num_chunks_exp2 = create_rag_system(
  chunk_size=1000,
  chunk_overlap=200,
  k=5
)

print(f"Document split into {num_chunks_exp2} chunks")

exp2_results, exp2_context, exp2_response = run_test(
  retriever_exp2,
  question,
  expected_output
)

print_scores("Experiment 2", exp2_results, baseline_results)

What to Look For

This experiment tests whether the problem is missing information rather than poorly ordered information. Here’s what we’re watching for:

Does the relevant chunk appear at all? With 5 chunks instead of 3, we might finally retrieve the section from page 11 that explicitly mentions matter/antimatter annihilation.
If it appears, where is it ranked? Finding the right chunk at position 5 is better than not finding it at all, but it still indicates a ranking problem. Contextual Precision measures whether relevant chunks are ranked highly, so adding chunks without improving their position won’t necessarily improve this metric.
Does Faithfulness improve? If we now have access to the complete answer (matter/antimatter annihilation + the 10²⁸ kg requirement), the model should be able to generate a faithful response that matches the expected output.
Is there helpful redundancy or harmful noise? More chunks might reinforce key concepts through repetition, or they might dilute the signal with tangentially related information.

The key point is that this experiment separates two problems: “retrieval coverage” (did we get the right chunks?) and “retrieval precision” (did we rank them well?). Both matter, but they require different solutions. Here’s the output I got:


============================================================
EXPERIMENT 2: chunk_size=1000, chunk_overlap=200, k=5
============================================================
Document split into 38 chunks

------------------------------------------------------------
RETRIEVED CHUNKS:
------------------------------------------------------------

--- Chunk 1 ---
I find it very challenging to mak e predictions on how this warp drive might actually function. My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract the compactified space-time around it, thereby creating the propulsion effect. That being said, my goal in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.

--- Chunk 2 ---
I find it very challenging to mak e predictions on how this warp drive might actually function. My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract the compactified space-time around it, thereby creating the propulsion effect. That being said, my goal in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.

--- Chunk 3 ---
energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble could be gen erated using ideas and mathematics that are consistent with quantum field theory . This may have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently advanced technology. By associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of

--- Chunk 4 ---
for example [19]. The basic idea of all of these is to formulate a solution to Einstein’s equations whereby a warp bubble is driven by a local expansion of space-time behind the bubble and a contraction ahead of the bubble. One common feature of these papers is that thei r physical foundation is the General Theory of Relativity. An element missing from all the papers is that there is little or no suggestion as to how such a warp bubble may be created.

--- Chunk 5 ---
in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.
------------------------------------------------------------

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The context does not mention any specific energy source needed for generating a warp bubble."
    },
    {
        "verdict": "no",
        "reason": "This document is repetitive and does not provide information about the energy requirements for the warp drive."
    },
    {
        "verdict": "no",
        "reason": "The context discusses the theoretical consistency of the concept but does not specify any particular energy source or its requirements."
    },
    {
        "verdict": "no",
        "reason": "This document focuses on the general idea and mathematical formulation without detailing the specific energy needs for a warp bubble."
    },
    {
        "verdict": "no",
        "reason": "The context is about the overall approach to the paper but does not provide information regarding the energy source required for generating a warp bubble."
    }
]

Score: 0
Reason: The score is 0.00 because all nodes in the retrieval contexts (ranks 1-5) are marked as 'no', indicating they do not address the specific question about the energy source needed for generating a warp bubble, which directly relates to the input.

======================================================================
**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "The author aims to work on a realistic model rather than a physically realizable device for the warp drive.",
    "The first four sections of the paper review the necessary physics required to appreciate the new warp drive model.",
    "Calculations regarding speed limits and energy requirements will be presented in the latter part of the paper.",
    "The author envisions a spacecraft with an exotic power generator that could create energies to locally manipulate extra dimensions.",
    "A warp bubble is driven by a local expansion of space-time behind the bubble and contraction ahead of it, according to some papers.",
    "The physical foundation for these ideas is General Theory of Relativity.",
    "There is little or no suggestion as to how such a warp bubble may be created in existing literature."
]

Claims:
[
    "The core concept involves a spacecraft with an 'exotic power generator' that manipulates extra dimensions to locally alter spacetime.",
    "The paper heavily relies on General Relativity and Kaluza-Klein modes, particularly linking the cosmological constant to Casimir Energy.",
    "The specific mechanism described is a local expansion of space-time behind the bubble and a contraction ahead of the bubble, driven by the exotic power generator.",
    "The paper outlines that a sufficiently advanced technology may allow us to hypothesize how such a bubble could be created.",
    "The function of an 'exotic power generator' is central to the concept of creating a warp bubble through spacetime distortions.",
    "The exact nature of this 'exotic power generator' is not specified in the paper."
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context does not mention Kaluza-Klein modes or a specific link to the cosmological constant and Casimir Energy."
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context states that there is little suggestion on how such a warp bubble may be created in existing literature, implying uncertainty rather than outlining specific mechanisms."
    },
    {
        "verdict": "yes",
        "reason": null
    }
]

Score: 0.6
Reason: The score is 0.60 because the actual output incorrectly includes details about Kaluza-Klein modes and a direct link to the cosmological constant and Casimir Energy, which are not present in the context. Additionally, it overstates the existence of specific mechanisms for creating a warp bubble, contradicting the context's indication of uncertainty.

======================================================================
======================================================================

Metrics Summary

  - ❌ Contextual Precision (score: 0.0, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.00 because all nodes in the retrieval contexts (ranks 1-5) are marked as 'no', indicating they do not address the specific question about the energy source needed for generating a warp bubble, which directly relates to the input., error: None)
  - ✅ Faithfulness (score: 0.6, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.60 because the actual output incorrectly includes details about Kaluza-Klein modes and a direct link to the cosmological constant and Casimir Energy, which are not present in the context. Additionally, it overstates the existence of specific mechanisms for creating a warp bubble, contradicting the context's indication of uncertainty., error: None)

For test case:

  - input: Please consider Jeff Nyman's warp drive paper.
What energy source does the paper propose would be needed to
generate the warp bubble for faster-than-light travel?

  - actual output: Okay, let's break down Jeff Nyman's warp drive paper and identify the proposed energy source. Here’s my step-by-step reasoning:

1.  **Initial Vision:** The core concept is a spacecraft with an “exotic power generator” that manipulates extra dimensions to locally alter spacetime. This generator is meant to expand and contract spacetime around the spacecraft, creating the propulsion effect.

2.  **Theoretical Foundation:** The paper heavily relies on General Relativity and Kaluza-Klein modes, particularly linking the cosmological constant to Casimir Energy. This suggests the energy source isn't a conventional one (like chemical rockets).

3.  **Specific Mechanism:** The paper outlines a specific mechanism: “a local expansion of space-time behind the bubble and a contraction ahead of the bubble.” This expansion and contraction are driven by the exotic power generator.

4. **Key Quote:** "…a sufficiently advanced technology… may have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently advanced technology.” This indicates the power source is beyond our current understanding, reliant on advanced technology.

5. **Summary of the energy source**: Based on the above, the paper proposes an **exotic power generator** that can create the necessary energies to locally manipulate the extra dimensions and drive the spacetime distortions required for warp bubble creation.  It’s not explicitly defined *how* this generator works, but its function is central to the whole concept.

**Therefore, the answer is an exotic power generator that can manipulate extra dimensions.**

**Risks & Considerations:**

*   **Unrealistic:** The entire premise of warp drive is highly speculative and faces significant theoretical challenges (causality violations, energy requirements, etc.). This is acknowledged within the text itself.
*   **Undefined Generator:** The exact nature of this "exotic power generator" is not specified, leaving a huge area for potential ambiguity and further research.
*   **Assumption:** The paper implicitly assumes the existence and controllable manipulation of extra dimensions, a concept that’s not yet proven.

  - expected output: Matter/antimatter annihilation, requiring
approximately 10^28 kg of antimatter (equivalent to Jupiter's
mass-energy).

  - context: None

  - retrieval context: ['I find it very challenging to mak e predictions on how this warp drive might actually function. My route \nwas to envision a spacecraft with an exotic power generator that could create the necessary energies to \nlocally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract \nthe compactified space-time around it, thereby creating the propulsion effect. That being said, my goal \nin this paper is to work on realistic model rather than a physically realizable device.  \nThe first four sections of this paper revie w the necessary physics required to appreciate the new warp \ndrive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding \nspeed limits and energy requirements will also be presented.', 'I find it very challenging to mak e predictions on how this warp drive might actually function. My route \nwas to envision a spacecraft with an exotic power generator that could create the necessary energies to \nlocally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract \nthe compactified space-time around it, thereby creating the propulsion effect. That being said, my goal \nin this paper is to work on realistic model rather than a physically realizable device.  \nThe first four sections of this paper revie w the necessary physics required to appreciate the new warp \ndrive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding \nspeed limits and energy requirements will also be presented.', 'energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble \ncould be gen erated using ideas and mathematics that are consistent with quantum field theory . This \nmay have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently \nadvanced technology. 
\nBy associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of', 'for example [19]. The basic idea of all of these is to formulate a solution to Einstein’s equations whereby \na warp bubble is driven by a local expansion of space-time behind the bubble and a contraction ahead of \nthe bubble. One common feature of these papers is that thei r physical foundation is the General Theory \nof Relativity. An element missing from all the papers is that there is little or no suggestion as to how \nsuch a warp bubble may be created.', 'in this paper is to work on realistic model rather than a physically realizable device.  \nThe first four sections of this paper revie w the necessary physics required to appreciate the new warp \ndrive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding \nspeed limits and energy requirements will also be presented.']

======================================================================

Overall Metric Pass Rates

Contextual Precision: 0.00% pass rate
Faithfulness: 100.00% pass rate

Experiment 2 Scores:
Contextual Precision: 0.0
Faithfulness: 0.6

Comparison to Baseline:
Contextual Precision: -0.33
Faithfulness: +0.03

Experiment 2 Results: More Retrieved Chunks

Increasing k from 3 to 5 chunks should have given the retriever more opportunities to find relevant information. Instead, things got worse. The scores tell a troubling story:

Contextual Precision: 0.0 (-0.33 from baseline) ❌
Faithfulness: 0.6 (+0.03 from baseline, essentially flat)

Contextual Precision dropped to zero: the worst possible score. Looking at the verdicts, all five retrieved chunks were marked “no,” meaning none of them address the specific question about energy sources. The metric’s reasoning is blunt: “all nodes in the retrieval contexts (ranks 1-5) are marked as ‘no’, indicating they do not address the specific question about the energy source needed for generating a warp bubble.”

But here’s what’s particularly revealing: look at the retrieved chunks themselves. Chunks 1 and 2 are identical! The retriever pulled the same passage from page 2 twice, wasting one of our five retrieval slots on duplicate content. Then chunks 3, 4, and 5 are all different fragments discussing general concepts (quantum field theory, Einstein’s equations, the paper’s structure), but none from page 11 where the actual answer lives.

This demonstrates a critical failure mode: retrieving more chunks doesn’t help if the retrieval strategy itself is broken. We gave the system more chances to find the answer, and it responded by retrieving redundant and irrelevant content. It’s like asking someone to search harder for your car keys and watching them check the same couch cushion twice while never looking in the actual location (your pocket).

Faithfulness stayed roughly flat at 0.6 (up just 0.03 from baseline). The model generated the same type of response it did in Experiment 1: discussing the “exotic power generator” concept because that’s what appears in the retrieved chunks. The model concluded “the answer is an exotic power generator that can manipulate extra dimensions,” which, as before, confuses a mechanism with an energy source and completely misses matter/antimatter annihilation.
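The 0.6 itself is easy to reproduce from the verdict list in the verbose log. Here is a minimal sketch, assuming the score is the fraction of verdicts that are not “no”. That is my reading of how DeepEval scores Faithfulness, and the function below is an illustration rather than DeepEval's own code.

```python
def faithfulness_score(verdicts):
    """Fraction of claim verdicts that do not contradict the context.

    verdicts: list of verdict strings such as "yes" or "no".
    An empty list returns 0.0, an arbitrary choice for this sketch.
    """
    if not verdicts:
        return 0.0
    supported = sum(1 for v in verdicts if v.strip().lower() != "no")
    return supported / len(verdicts)


# Verdicts copied from the Experiment 2 Faithfulness log.
exp2_verdicts = ["yes", "no", "yes", "no", "yes"]
print(faithfulness_score(exp2_verdicts))  # 0.6
```

Note that the log lists six claims but only five verdicts; the reported 0.6 corresponds to three “yes” verdicts out of five.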

What did this experiment reveal? The problem isn’t retrieval coverage (not retrieving enough chunks); it’s retrieval accuracy (not finding the right chunks at all). Adding more chunks when the ranking strategy is fundamentally flawed just adds more noise. The semantic similarity search is still matching on broad topical relevance (“warp drive,” “energy,” “spacecraft”) rather than specific query relevance (“what energy source powers it?”).

The duplicate chunk (chunks 1 and 2 being identical) is particularly diagnostic. This suggests the vector similarity scores for multiple chunks are so close that the retriever can’t meaningfully distinguish between them, potentially pulling near-duplicates because they have nearly identical embeddings. This is a sign that the chunking strategy itself might be creating ambiguity in the vector space.
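Whatever the cause, repeated chunks are easy to detect before they reach the model. The following is a hypothetical helper, not something in the script we have been building; it assumes the retrieved chunks are available as plain strings in rank order.

```python
def dedupe_chunks(chunks):
    """Drop repeated chunks, keeping the first occurrence in rank order.

    Chunks are compared after collapsing whitespace and lowercasing, so
    copies that differ only in line breaks count as duplicates.
    """
    seen = set()
    unique = []
    for text in chunks:
        key = " ".join(text.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Five retrieval slots, two of which hold the same passage.
retrieved = [
    "My route was to envision a spacecraft with an exotic power generator",
    "My route \nwas to envision a spacecraft with an exotic power generator",
    "consistent with quantum field theory",
    "a warp bubble is driven by a local expansion of space-time",
    "my goal in this paper is to work on realistic model",
]

unique = dedupe_chunks(retrieved)
print(len(retrieved) - len(unique))  # 1 wasted slot
```

This only removes exact repeats after normalization. It does nothing for overlapping fragments, such as a shorter chunk that is contained inside a longer one.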

Test Finding: Increasing k (number of chunks retrieved) is only helpful when the retrieval strategy successfully surfaces relevant chunks. If the strategy is ranking irrelevant chunks highly, retrieving more of them doesn’t solve the problem. Instead, it compounds the problem by adding noise and, potentially, duplicates. This experiment indicates that our issue is retrieval quality, not retrieval quantity. We need to change how we retrieve, not just how much we retrieve.

This result also teaches us something about RAG system diagnosis: when Contextual Precision drops to zero, it’s a clear signal that the retrieval mechanism is completely missing the target. No amount of downstream generation optimization will fix that. The model can’t be faithful to information it never received.

Experiment 3: Combined Approach

Our hypotheses haven’t panned out so far. Experiment 1 (smaller chunks) changed what was retrieved but didn’t improve ranking. Experiment 2 (more chunks) actually made things worse, dropping Contextual Precision to zero and retrieving duplicate content. But what if the problem is that we need both changes working together?

This third experiment combines both strategies: smaller chunks (500 characters) and more retrieved chunks (k=5). The reasoning is that smaller chunks create more granular options, and retrieving more of them increases the chances that at least one will come from page 11, where the actual answer lives. With 69 chunks to choose from (the same chunking as Experiment 1) and 5 retrieval slots, maybe we’ll finally surface the matter/antimatter annihilation information.

However, given what we’ve seen so far, we should be cautious. Combining two strategies that individually failed to improve Contextual Precision might just give us more irrelevant, smaller chunks. The fundamental problem (semantic similarity matching on broad topics rather than specific answers) remains unchanged.

Add this to your script:

# =========================================================
# EXPERIMENT 3: Combined (Smaller + More)
# =========================================================
print("\n" + "=" * 60)
print("EXPERIMENT 3: chunk_size=500, chunk_overlap=100, k=5")
print("=" * 60)

retriever_exp3, num_chunks_exp3 = create_rag_system(
  chunk_size=500,
  chunk_overlap=100,
  k=5
)

print(f"Document split into {num_chunks_exp3} chunks")

exp3_results, exp3_context, exp3_response = run_test(
  retriever_exp3,
  question,
  expected_output
)

print_scores("Experiment 3", exp3_results, baseline_results)

What to Look For

This experiment tests whether combining strategies produces synergistic improvement or simply compounds existing problems. Here’s what we’re watching for:

Does quantity overcome quality issues? With 5 chances to retrieve from 69 smaller chunks, do we finally get content from page 11?
Additive improvements? If both strategies help independently (which we haven’t seen yet), combining them might produce the best scores.
Diminishing returns or conflicting effects? More granular chunks might mean more duplicates or near-duplicates when retrieving 5 instead of 3, similar to what we saw in Experiment 2.
Does Faithfulness correlate with retrieval? If we do retrieve the right information, will the model finally generate a response mentioning matter/antimatter annihilation?

The critical question: can parameter tuning alone solve our retrieval problem, or do we need a fundamentally different approach? This experiment will help answer that. Here is the output I got from this:


============================================================
EXPERIMENT 3: chunk_size=500, chunk_overlap=100, k=5
============================================================
Document split into 69 chunks

------------------------------------------------------------
RETRIEVED CHUNKS:
------------------------------------------------------------

--- Chunk 1 ---
I find it very challenging to mak e predictions on how this warp drive might actually function. My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract the compactified space-time around it, thereby creating the propulsion effect. That being said, my goal in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.

--- Chunk 2 ---
I find it very challenging to mak e predictions on how this warp drive might actually function. My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract the compactified space-time around it, thereby creating the propulsion effect. That being said, my goal in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.

--- Chunk 3 ---
energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble could be gen erated using ideas and mathematics that are consistent with quantum field theory . This may have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently advanced technology. By associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of

--- Chunk 4 ---
energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble could be gen erated using ideas and mathematics that are consistent with quantum field theory . This may have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently advanced technology. By associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of

--- Chunk 5 ---
for example [19]. The basic idea of all of these is to formulate a solution to Einstein’s equations whereby a warp bubble is driven by a local expansion of space-time behind the bubble and a contraction ahead of the bubble. One common feature of these papers is that thei r physical foundation is the General Theory of Relativity. An element missing from all the papers is that there is little or no suggestion as to how such a warp bubble may be created.
------------------------------------------------------------

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The context does not mention any specific energy source needed for generating the warp bubble."
    },
    {
        "verdict": "no",
        "reason": "This document is a duplicate and repeats the same information as the first one, which also does not address the required energy source."
    },
    {
        "verdict": "no",
        "reason": "The context discusses the theoretical basis for generating a warp bubble but does not specify the energy source needed. It mentions ideas consistent with quantum field theory and the Casimir Energy due to Kaluza Klein modes, but these are not directly linked to the specific energy requirement."
    },
    {
        "verdict": "no",
        "reason": "Similar to the third document, this one also discusses theoretical aspects without specifying the exact energy source required for generating a warp bubble."
    },
    {
        "verdict": "no",
        "reason": "This context focuses on the physical foundation of the solution and does not provide any specific information about the energy source needed for creating a warp bubble. It mentions that there is little or no suggestion as to how such a bubble may be created, which aligns with the lack of relevant information in the other documents."
    }
]

Score: 0
Reason: The score is 0.00 because all five retrieval contexts provide no specific information about the energy source needed for generating the warp bubble, and they are ranked consecutively without any relevant nodes to address the query.

======================================================================
**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "The paper aims to work on a realistic model of a warp drive rather than a physically realizable device.",
    "The first four sections of the paper review the necessary physics required to appreciate the new warp drive model.",
    "Calculations regarding speed limits and energy requirements will be presented in the latter part of the paper.",
    "The aim is to suggest that a warp bubble could be generated using ideas and mathematics consistent with quantum field theory.",
    "The basic idea involves formulating a solution to Einstein’s equations where a warp bubble is driven by local expansion of spacetime behind it and contraction ahead of it.",
    "All papers on this topic lack suggestions for how such a warp bubble may be created."
]

Claims:
[
    "The paper proposes an 'exotic power generator' as the energy source needed to generate the warp bubble."
]

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The retrieval context does not mention any specific energy source, only that all papers on this topic lack suggestions for how such a warp bubble may be created."
    }
]

Score: 0.0
Reason: The score is 0.00 because the actual output incorrectly implies that the retrieval context mentions specific energy sources, which it does not.

======================================================================
======================================================================

Metrics Summary

  - ❌ Contextual Precision (score: 0.0, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.00 because all five retrieval contexts provide no specific information about the energy source needed for generating the warp bubble, and they are ranked consecutively without any relevant nodes to address the query., error: None)
  - ❌ Faithfulness (score: 0.0, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.00 because the actual output incorrectly implies that the retrieval context mentions specific energy sources, which it does not., error: None)

For test case:

  - input: Please consider Jeff Nyman's warp drive paper.
What energy source does the paper propose would be needed to
generate the warp bubble for faster-than-light travel?

  - actual output: Okay, let's break down this question step-by-step, applying a tester's mindset to understand Jeff Nyman's proposed warp drive concept.

1. **Understanding the Context:** We're analyzing a paper aiming for a *realistic model* of a warp drive, not a physically realizable device. The core idea is manipulating spacetime using an exotic power generator.

2. **Identifying Key Claims:** The text explicitly states: "My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s)."

3. **Deconstructing the Mechanism:** The generator’s purpose is to expand and contract spacetime around the spacecraft. This implies a significant energy input is required for this dimensional manipulation.

4. **Connecting to Theoretical Framework:** The paper leverages concepts like the cosmological constant, Casimir energy, and Kaluza-Klein modes – all related to quantum field theory. This suggests the energy source isn’t just brute force but is tied to fundamental physics.

5. **Risk Assessment:** A major risk here is the reliance on "exotic" energy.  The paper doesn't detail the source, highlighting the challenge Jeff Nyman identifies.  This is a crucial assumption.  If this energy source is not adequately explained or justified, the entire model falls apart.

6. **Synthesizing the Answer:** Based on the provided text, the paper proposes an “exotic power generator” as the energy source needed to create the warp bubble. This generator is intended to manipulate extra dimensions and, consequently, spacetime.

**Final Answer:** The paper proposes an “exotic power generator” as the energy source needed to generate the warp bubble.

  - expected output: Matter/antimatter annihilation, requiring approximately 10^28 kg of antimatter (equivalent to Jupiter's mass-energy).

  - context: None

  - retrieval context: ['I find it very challenging to mak e predictions on how this warp drive might actually function. My route \nwas to envision a spacecraft with an exotic power generator that could create the necessary energies to \nlocally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract \nthe compactified space-time around it, thereby creating the propulsion effect. That being said, my goal \nin this paper is to work on realistic model rather than a physically realizable device.  \nThe first four sections of this paper revie w the necessary physics required to appreciate the new warp \ndrive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding \nspeed limits and energy requirements will also be presented.', 'I find it very challenging to mak e predictions on how this warp drive might actually function. My route \nwas to envision a spacecraft with an exotic power generator that could create the necessary energies to \nlocally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract \nthe compactified space-time around it, thereby creating the propulsion effect. That being said, my goal \nin this paper is to work on realistic model rather than a physically realizable device.  \nThe first four sections of this paper revie w the necessary physics required to appreciate the new warp \ndrive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding \nspeed limits and energy requirements will also be presented.', 'energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble \ncould be gen erated using ideas and mathematics that are consistent with quantum field theory . This \nmay have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently \nadvanced technology. 
\nBy associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of', 'energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble \ncould be gen erated using ideas and mathematics that are consistent with quantum field theory . This \nmay have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently \nadvanced technology. \nBy associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of', 'for example [19]. The basic idea of all of these is to formulate a solution to Einstein’s equations whereby \na warp bubble is driven by a local expansion of space-time behind the bubble and a contraction ahead of \nthe bubble. One common feature of these papers is that thei r physical foundation is the General Theory \nof Relativity. An element missing from all the papers is that there is little or no suggestion as to how \nsuch a warp bubble may be created.']

======================================================================

Overall Metric Pass Rates

Contextual Precision: 0.00% pass rate
Faithfulness: 0.00% pass rate

Experiment 3 Scores:
Contextual Precision: 0.0
Faithfulness: 0.0

Comparison to Baseline:
Contextual Precision: -0.33
Faithfulness: -0.57

Experiment 3 Results: Combined Approach

Okay, so we combined both strategies: smaller chunks (500 characters) and more retrieval (k=5). This produced the worst results yet. Both metrics completely failed. The scores are catastrophic:

Contextual Precision: 0.0 (-0.33 from baseline) ❌
Faithfulness: 0.0 (-0.57 from baseline) ❌

This is a complete system failure. Not only did we fail to retrieve relevant information (Contextual Precision = 0.0), but the model’s response was also completely unfaithful to what little context it did receive (Faithfulness = 0.0).

Looking at the retrieved chunks reveals the problem immediately: we have two pairs of exact duplicates. Chunks 1 and 2 are identical copies of the page 2 introduction. Chunks 3 and 4 are identical copies of the quantum field theory discussion from page 8. Only chunk 5 is unique. Out of five retrieval slots, we wasted two on redundant content, leaving us with effectively only three unique chunks, and none of them from page 11 where the answer lives.

This is worse than Experiment 2, which had one duplicate pair. The smaller chunk size (500 characters) combined with the larger retrieval count (k=5) has amplified the duplicate problem. When chunks are small and semantically similar passages get split across multiple chunks, the vector embeddings become so close that the retriever can’t distinguish between them. It ends up pulling near-identical content multiple times.
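A quick way to quantify the wasted slots is to count unique chunks in the retrieved list. The following is a minimal sketch, not part of the original script: summarize_duplicates is a hypothetical helper, and the single-letter strings are stand-ins for the five chunks shown above.

```python
from collections import Counter


def summarize_duplicates(chunks):
    """Return (unique_count, wasted_slots) for a list of retrieved chunk texts."""
    # Collapse whitespace so line-wrap differences don't hide exact duplicates.
    normalized = [" ".join(chunk.split()) for chunk in chunks]
    counts = Counter(normalized)
    unique_count = len(counts)
    wasted_slots = len(normalized) - unique_count
    return unique_count, wasted_slots


# Stand-ins for the Experiment 3 retrieval:
# A = page 2 introduction, B = page 8 passage, C = the unique fifth chunk.
retrieved = ["A", "A", "B", "B", "C"]
print(summarize_duplicates(retrieved))  # (3, 2)
```

If the context value returned by run_test is a list of chunk strings (an assumption about that helper), it could be passed in directly to flag duplicate-heavy retrievals before reading the metric logs.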

But the Faithfulness collapse to 0.0 is even more revealing. The metric evaluated just one claim from the model’s output: “The paper proposes an ‘exotic power generator’ as the energy source needed to generate the warp bubble.” The verdict was “no” with this reasoning: “The retrieval context does not mention any specific energy source, only that all papers on this topic lack suggestions for how such a warp bubble may be created.”

Wait, hold on. Chunk 1 explicitly says “a spacecraft with an exotic power generator that could create the necessary energies.” How is that claim unfaithful? The answer lies in the Faithfulness metric’s sophistication. The chunk says the paper envisions an exotic power generator as part of the conceptual framework, but it doesn’t propose it as the specific energy source that powers the device. It’s a subtle but important distinction, and the metric caught it. The model is making a stronger claim than the text supports, which is why Faithfulness flagged it as unfaithful.
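For reference, the 0.0 follows directly from the verdict list in the log. Here is a minimal sketch, assuming the Faithfulness score is the share of extracted claims whose verdict is not "no" (that reading matches the logged numbers). The faithfulness_score function is a hypothetical helper written for illustration, not part of DeepEval's API.

```python
def faithfulness_score(verdicts):
    """Fraction of claims that the retrieval context does not contradict."""
    if not verdicts:
        # This sketch rejects the no-claims case rather than guess at a score.
        raise ValueError("at least one claim verdict is required")
    supported = sum(1 for v in verdicts if v.strip().lower() != "no")
    return supported / len(verdicts)


# Experiment 3 produced a single claim, and its verdict was "no".
exp3_verdicts = ["no"]
print(faithfulness_score(exp3_verdicts))  # 0.0
```

With only one claim extracted, the score can only be 0.0 or 1.0, which is part of why this run looks so dramatic.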

What this experiment proved definitively is that parameter tuning alone can’t fix a fundamentally broken retrieval strategy. So, let’s take stock of where we’re at. We’ve now tried:

Smaller chunks (Exp 1): CP stayed at 0.33
More retrieval (Exp 2): CP dropped to 0.0
Both combined (Exp 3): CP = 0.0, F = 0.0, complete failure

Every variation of our semantic similarity approach either maintained the baseline failure or made it worse. The problem isn’t the chunk size. The problem isn’t the number of chunks retrieved. The problem is that semantic similarity search is matching on broad topical keywords (“warp drive,” “energy,” “spacecraft”) rather than the specific content that answers the query (“matter/antimatter annihilation”).

The retriever keeps pulling from pages 2 and 8 (conceptual framework, theoretical physics) because those pages are semantically dense with warp drive terminology. Page 11 (the energy requirements section with the actual answer) doesn’t use as much of this terminology; instead, it’s focused on calculations and specific numbers. To a semantic similarity model, pages 2 and 8 look more relevant even though they don’t contain the answer.

Test Finding: When multiple experiments with different parameter configurations all fail in similar ways, you’re seeing a systemic problem, not a tuning problem. The diagnostic pattern here is clear: our retrieval strategy is fundamentally misaligned with the query type. Continuing to adjust chunk sizes and retrieval counts is like rearranging deck chairs on the Titanic. Instead, we need a different ship entirely. This points us toward needing a different retrieval approach, perhaps one that respects document structure (semantic chunking) or uses hybrid search combining semantic and keyword matching.

It’s worth pointing out that this complete failure is actually valuable. It rules out parameter tuning as a solution and forces us to consider more fundamental changes to our approach.

Experiment 4: Semantic Chunking

After three failed experiments, it’s clear that parameter tuning alone won’t solve our problem. Our final experiment takes a fundamentally different approach: instead of splitting text arbitrarily by character count, we’ll use semantic chunking that attempts to respect document structure by splitting at natural boundaries like paragraphs, sentences, and section breaks.

The hypothesis is that semantically coherent chunks (meaning, those that contain complete thoughts rather than arbitrary character slices) might produce better embeddings and thus better retrieval matches. When you split mid-sentence or mid-paragraph, you create chunks whose meaning is fragmented. A chunk that says “I can thus say this:” (ending arbitrarily) has poor semantic content compared to one that contains “I can thus say this: [complete equation and explanation].”

Now, I’ll note here that LangChain doesn’t have a built-in semantic chunker in the same way as RecursiveCharacterTextSplitter, which we’ve been using so far. But we can approximate the concept by using sentence-aware splitting with natural separators. That’s what my create_rag_system_semantic() function does: it tells the splitter to prefer breaking at the widest gaps (\n\n\n), then at regular paragraph breaks (\n\n), then at sentence ends (. ), and to split mid-sentence only as a last resort.
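To make the separator-priority idea concrete, here is a simplified, pure-Python sketch. It is not LangChain's implementation and not the create_rag_system_semantic() function itself; split_by_priority is a hypothetical helper, it ignores chunk overlap, and it drops the separator at each cut. It only illustrates the "try the coarsest boundary first, fall back to finer ones" behavior.

```python
def split_by_priority(text, separators, chunk_size):
    """Split text at the coarsest separator that works, recursing to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return split_by_priority(text, finer, chunk_size)

    chunks = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too big: retry it with finer separators.
            chunks.extend(split_by_priority(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one. Still para one.\n\nPara two is here."
print(split_by_priority(sample, ["\n\n\n", "\n\n", ". "], chunk_size=30))
# ['Para one. Still para one.', 'Para two is here.']
```

The sample has no triple-newline break, so the sketch falls through to the paragraph break and never needs to cut inside a sentence.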

For a true semantic chunking approach, you would want something a wee bit more sophisticated than what I threw together here, perhaps using an LLM to identify topic boundaries or analyzing section headers. What I’m doing here is a lightweight approximation that respects syntactic structure.

Add this code to the end of your script:

# =========================================================
# EXPERIMENT 4: Semantic Chunking
# =========================================================
print("\n" + "=" * 60)
print("EXPERIMENT 4: Semantic chunking, k=3")
print("=" * 60)

retriever_exp4, num_chunks_exp4 = create_rag_system_semantic(k=3)
print(f"Document split into {num_chunks_exp4} chunks")

exp4_results, exp4_context, exp4_response = run_test(
  retriever_exp4,
  question,
  expected_output
)

print_scores("Experiment 4", exp4_results, baseline_results)

What to Look For

Semantic chunking tests whether respecting document structure improves retrieval quality. Here’s what we’re watching for:

Are chunks more coherent? Examine the retrieved chunks to see if they contain complete thoughts, full sentences, or intact paragraphs rather than arbitrary mid-sentence splits.
Does coherence help retrieval? More semantically complete chunks should produce better-quality embeddings, potentially improving the retriever’s ability to match on meaning rather than just keywords.
Do we finally get page 11? If semantic coherence helps the retriever distinguish between conceptual framework sections (pages 2-8) and calculation sections (page 11), we might finally retrieve the energy requirements content.
Does the chunk count change significantly? This experiment uses chunk_size=800 with natural separators. Depending on document structure, this might create fewer or more chunks than our character-based approaches.

This is our last experiment with the basic RAG architecture. If semantic chunking doesn’t improve things, it would suggest we need more fundamental changes: maybe hybrid search (combining semantic and keyword matching), query reformulation, or even re-ranking retrieved results. Here is the output I got:


============================================================
EXPERIMENT 4: Semantic chunking, k=3
============================================================
Document split into 43 chunks

------------------------------------------------------------
RETRIEVED CHUNKS:
------------------------------------------------------------

--- Chunk 1 ---
I find it very challenging to mak e predictions on how this warp drive might actually function. My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract the compactified space-time around it, thereby creating the propulsion effect. That being said, my goal in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.

--- Chunk 2 ---
I find it very challenging to mak e predictions on how this warp drive might actually function. My route was to envision a spacecraft with an exotic power generator that could create the necessary energies to locally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract the compactified space-time around it, thereby creating the propulsion effect. That being said, my goal in this paper is to work on realistic model rather than a physically realizable device. The first four sections of this paper revie w the necessary physics required to appreciate the new warp drive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding speed limits and energy requirements will also be presented.

--- Chunk 3 ---
energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble could be gen erated using ideas and mathematics that are consistent with quantum field theory . This may have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently advanced technology. By associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of
------------------------------------------------------------

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The first two documents do not provide any specific information about the energy source needed for generating a warp bubble. They only discuss the general approach and goals of the paper."
    },
    {
        "verdict": "no",
        "reason": "The second document is identical to the first, reiterating the same points without providing relevant details on the energy requirements."
    },
    {
        "verdict": "yes",
        "reason": "The third document mentions 'By associating the cosmological constant with the Casimir Energy due to the Kaluza Klein modes of', which hints at a potential energy source for generating a warp bubble, even though it does not explicitly state matter/antimatter annihilation."
    }
]

Score: 0.3333333333333333
Reason: The score is 0.33 because nodes ranked 1 and 2 are irrelevant as they do not provide specific information about the energy source needed for generating a warp bubble, while node 3 hints at a potential energy source but does not explicitly state it.

======================================================================
**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "The author aims to work on a realistic model rather than a physically realizable device for the warp drive.",
    "The paper will cover the necessary physics required to appreciate the new warp drive model in its first four sections.",
    "Calculations regarding speed limits and energy requirements will be presented in the latter part of the paper.",
    "The author envisions a spacecraft with an exotic power generator that could locally manipulate extra dimensions.",
    "Expanding or contracting compactified space-time around a spacecraft is proposed as a method to create propulsion.",
    "The goal is to suggest how a warp bubble could potentially be generated using ideas and mathematics consistent with quantum field theory.",
    "The paper aims to address energy conditions or issues regarding causality in the context of the warp drive model."
]

Claims:
[
    "The paper establishes that the aim isn’t a physically realizable warp drive but a realistic model based on theoretical physics.",
    "The core idea is manipulating spacetime using an exotic power generator that interacts with extra dimensions.",
    "The spacecraft would utilize an 'exotic power generator' to 'locally manipulate the extra dimension(s).'",
    "This generator’s role is to expand and contract spacetime around the spacecraft, leading to propulsion.",
    "The paper connects this to Kaluza-Klein modes and the cosmological constant, suggesting a link to Casimir energy."
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context states that the author aims to work on a realistic model rather than a physically realizable device, which directly contradicts the claim that the aim is not about a realistic model."
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context mentions an exotic power generator and manipulating extra dimensions, aligning with the claim."
    },
    {
        "verdict": "yes",
        "reason": null
    }
]

Score: 0.6
Reason: The score is 0.60 because it contains contradictions regarding the author's aim being realistic versus non-realistic, which significantly impacts faithfulness.

======================================================================
======================================================================

Metrics Summary

  - ❌ Contextual Precision (score: 0.3333333333333333, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.33 because nodes ranked 1 and 2 are irrelevant as they do not provide specific information about the energy source needed for generating a warp bubble, while node 3 hints at a potential energy source but does not explicitly state it., error: None)
  - ✅ Faithfulness (score: 0.6, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.60 because it contains contradictions regarding the author's aim being realistic versus non-realistic, which significantly impacts faithfulness., error: None)

For test case:

  - input: Please consider Jeff Nyman's warp drive paper.
What energy source does the paper propose would be needed to
generate the warp bubble for faster-than-light travel?

  - actual output: Okay, let's break this down systematically, applying a tester's mindset to Jeff Nyman's warp drive paper.

1.  **Understanding the Context:** We have two identical excerpts from the paper. Both establish that the aim isn’t a physically realizable warp drive but a *realistic model* based on theoretical physics. The core idea is manipulating spacetime using an exotic power generator that interacts with extra dimensions.

2.  **Identifying Key Elements:** The paper explicitly states that the spacecraft would utilize an “exotic power generator” to “locally manipulate the extra dimension(s).” This generator is the critical component for creating the warp bubble.

3.  **Analyzing the Proposed Mechanism:** The power generator’s role is to expand and contract spacetime around the spacecraft, leading to propulsion. The paper connects this to Kaluza-Klein modes and the cosmological constant, suggesting a link to Casimir energy.

4.  **Synthesizing the Answer:** Based on the above, the paper proposes that an "exotic power generator" is needed to generate the warp bubble. This generator’s function is to manipulate extra dimensions and, consequently, spacetime.

**Final Answer:** The paper proposes an “exotic power generator” as the energy source needed to generate the warp bubble.

**Risks/Assumptions/Edge Cases:**

*   **Assumption:** The paper assumes the existence and manipulability of extra dimensions. This is a core assumption of the Kaluza-Klein model and is currently unproven.
*   **Risk:** The paper's reliance on theoretical physics makes it vulnerable to future discoveries that contradict the model.
*   **Edge Case:** The definition and capabilities of the "exotic power generator" are not fully fleshed out, representing a significant area for further investigation.

  - expected output: Matter/antimatter annihilation, requiring
approximately 10^28 kg of antimatter (equivalent to Jupiter's
mass-energy).

  - context: None

  - retrieval context: ['I find it very challenging to mak e predictions on how this warp drive might actually function. My route \nwas to envision a spacecraft with an exotic power generator that could create the necessary energies to \nlocally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract \nthe compactified space-time around it, thereby creating the propulsion effect. That being said, my goal \nin this paper is to work on realistic model rather than a physically realizable device.  \nThe first four sections of this paper revie w the necessary physics required to appreciate the new warp \ndrive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding \nspeed limits and energy requirements will also be presented.', 'I find it very challenging to mak e predictions on how this warp drive might actually function. My route \nwas to envision a spacecraft with an exotic power generator that could create the necessary energies to \nlocally manipulate the extra dimension(s). In this way, an advanced spacecraft wo uld expand/contract \nthe compactified space-time around it, thereby creating the propulsion effect. That being said, my goal \nin this paper is to work on realistic model rather than a physically realizable device.  \nThe first four sections of this paper revie w the necessary physics required to appreciate the new warp \ndrive model. The remainder of the paper will introduce the propulsion concept. Calculations regarding \nspeed limits and energy requirements will also be presented.', 'energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble \ncould be gen erated using ideas and mathematics that are consistent with quantum field theory . This \nmay have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently \nadvanced technology. 
\nBy associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of']

======================================================================

Overall Metric Pass Rates

Contextual Precision: 0.00% pass rate
Faithfulness: 100.00% pass rate

Experiment 4 Scores:
Contextual Precision: 0.3333333333333333
Faithfulness: 0.6

Comparison to Baseline:
Contextual Precision: +0.00
Faithfulness: +0.03

Experiment 4 Results: Semantic Chunking

Semantic chunking brought us back to baseline performance but offered no improvement. After the complete failures of Experiments 2 and 3, at least we didn’t make things worse, but we also didn’t solve the problem. The scores:

Contextual Precision: 0.33 (+0.00 from baseline, identical)
Faithfulness: 0.6 (+0.03 from baseline, essentially flat)

We’re right back where we started. Contextual Precision returned to the baseline 0.33, showing the same problematic pattern: relevant information ranked last. Faithfulness at 0.6 is barely different from baseline’s 0.57, within the margin of variance we’ve seen throughout these experiments.

Looking at the retrieved chunks reveals a now-familiar story: chunks 1 and 2 are exact duplicates of the page 2 introduction, and chunk 3 is the partial quantum field theory passage from page 8. Despite using structure-aware separators (prioritizing paragraph and sentence boundaries), the semantic chunking approach created 43 total chunks but still pulled the same problematic content we’ve been seeing across all experiments.

The Contextual Precision metric did note something slightly positive about chunk 3: it “hints at a potential energy source” by mentioning “the cosmological constant with the Casimir Energy due to the Kaluza Klein modes.” This is technically closer to the answer than previous experiments where all chunks were completely irrelevant. But “hints at” isn’t “explicitly states,” and the metric correctly gave it a “yes” verdict while still penalizing the overall score because this relevant hint is buried at position #3.
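The rank penalty is easy to see by working the number out by hand. This is a minimal sketch of the weighted formula as I understand it from DeepEval's documentation: for each relevant node at rank k, add (relevant nodes seen so far / k), then divide by the total number of relevant nodes. The contextual_precision function is a hypothetical helper for illustration, not DeepEval's own code.

```python
def contextual_precision(verdicts):
    """Rank-weighted precision; verdicts[i] is True if the node at rank i+1 is relevant."""
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0  # nothing relevant was retrieved at any rank
    score = 0.0
    relevant_so_far = 0
    for k, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_so_far += 1
            score += relevant_so_far / k
    return score / relevant_total


# Experiment 4 verdicts: no, no, yes (the only relevant node sits at rank 3).
print(contextual_precision([False, False, True]))  # 0.3333333333333333

# The same single relevant node at rank 1 would score perfectly.
print(contextual_precision([True, False, False]))  # 1.0
```

Same content, different position: the score moves from 0.33 to 1.0 purely because of where the relevant chunk lands in the ranking.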

The duplicate chunks are particularly disappointing here because semantic chunking was supposed to help with this problem. By respecting paragraph and sentence boundaries, we expected chunks to be more semantically distinct from each other, which should have produced more varied embeddings and thus more diverse retrieval results. Instead, we got the same page 2 passage twice, suggesting that even with better chunking boundaries, the underlying semantic similarity search is still too coarse to distinguish between slightly different passages about similar topics.
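One cheap guard against the duplicate problem is to drop repeated chunk text before it reaches the model. This is only a sketch and it isn’t part of the script we built: dedupe_chunks is a hypothetical helper, the strings are placeholders, and it assumes the retrieved chunks are available as plain strings. It removes exact repeats after whitespace normalization and nothing more.

```python
def dedupe_chunks(chunks):
    """Return chunks in their original rank order with exact repeats removed."""
    seen = set()
    unique = []
    for text in chunks:
        # Collapse whitespace so trivially different copies compare equal.
        key = " ".join(text.split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Placeholder strings standing in for retrieved chunk text.
retrieved = [
    "intro passage from page 2",
    "intro  passage from page 2",  # same text, extra space
    "partial quantum field theory passage from page 8",
]
print(dedupe_chunks(retrieved))
```

Note what this does and doesn’t buy you. It stops one passage from occupying two of the three slots, but it won’t surface the page 11 content on its own.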

Looking at the model’s actual output, it concluded, once again, that “the paper proposes an ‘exotic power generator’ as the energy source.” This is the same answer we’ve gotten in every experiment. The model is being consistent with what it receives, but what it receives never includes the critical information from page 11 about matter/antimatter annihilation.

So, let’s consider what semantic chunking did and didn’t do.

The approach created 43 chunks (compared to baseline’s 38), suggesting it did respect document structure to some degree. After all, we didn’t just get arbitrary character slices. The chunk boundaries are likely cleaner, breaking at paragraph or sentence boundaries rather than mid-thought. This is good practice in general and probably produces higher-quality embeddings.

But higher-quality embeddings of the wrong content don’t help. The retriever is still matching on broad semantic similarity to terms like “warp drive,” “energy,” “spacecraft,” and “extra dimensions,” all of which appear heavily in the conceptual framework sections (pages 2-8) but not in the calculations section (page 11). Page 11 talks about specific numbers, equations, matter, antimatter, and Jupiter’s mass. These terms have lower semantic similarity to the query “what energy source” than the general physics terminology on earlier pages.

Test Finding: Structural improvements to chunking (respecting paragraphs and sentences) can create cleaner, more coherent chunks, but they don’t solve retrieval accuracy problems when the fundamental matching strategy is misaligned with the query type. Semantic chunking is valuable for creating better embeddings, but it doesn’t change the fact that semantic similarity search prioritizes topical relevance over specific answer content. This experiment confirms what Experiments 1 through 3 demonstrated: our problem is strategic, not tactical.

We’ve now exhausted the parameter and chunking strategy space within the basic RAG architecture. Every experiment (whether adjusting chunk size, retrieval count, or chunking approach) has either maintained the baseline failure or made it worse. The consistent pattern points to a fundamental limitation: pure semantic similarity search cannot reliably distinguish between “talks about the topic” and “contains the specific answer.”

This is valuable negative evidence. We now know definitively that improving this RAG system requires moving beyond pure semantic similarity search. Potential solutions might include those I mentioned earlier: hybrid search (combining semantic and keyword/BM25 matching), query decomposition (breaking “what energy source” into multiple sub-queries), re-ranking retrieved results with a more sophisticated model, or even using an LLM to reformulate the query before retrieval. But those approaches are beyond the scope of parameter tuning. They require architectural changes to the RAG pipeline itself.
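To make the hybrid search idea a bit more concrete, here’s a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a semantic ranking with a keyword ranking. None of this is in our script. The chunk IDs and both rankings are made-up placeholders, and k=60 is just the value conventionally used for the RRF constant, not something tuned for this document.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Placeholder rankings: the semantic list favors the conceptual pages,
# while the keyword list puts the calculation page first.
semantic_ranking = ["page2_intro", "page8_qft", "page11_calc"]
keyword_ranking = ["page11_calc", "page2_intro", "page8_qft"]

print(reciprocal_rank_fusion([semantic_ranking, keyword_ranking]))
```

With these placeholder inputs, the calculation chunk moves from third place in the semantic ranking to second place in the fused ranking. Whether a real keyword ranker would actually put the page 11 chunk first for our query is exactly the kind of thing that would need its own experiment.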

Results Summary

Now let’s compile all our results into a comparison table. Add this code to the end of your script to generate a summary:

# =========================================================
# RESULTS SUMMARY
# =========================================================
print("\n" + "=" * 60)
print("RESULTS SUMMARY")
print("=" * 60)
print(f"{'Configuration':<40} {'Precision':>12} {'Faithfulness':>12} {'Change':>12}")
print("-" * 60)

configs = [
  ("Baseline (1000/200/k=3)", baseline_results, None),
  ("Exp 1: Smaller chunks (500/100/k=3)", exp1_results, baseline_results),
  ("Exp 2: More chunks (1000/200/k=5)", exp2_results, baseline_results),
  ("Exp 3: Both (500/100/k=5)", exp3_results, baseline_results),
  ("Exp 4: Semantic (800/150/k=3)", exp4_results, baseline_results)
]

baseline_scores = get_scores(baseline_results)

for name, results, baseline in configs:
  scores = get_scores(results)
  precision = scores.get("Contextual Precision", 0.0)
  faithfulness = scores.get("Faithfulness", 0.0)

  if baseline is None:
    change = "baseline"
  else:
    precision_change = precision \
      - baseline_scores.get("Contextual Precision", 0.0)
    faithfulness_change = faithfulness \
      - baseline_scores.get("Faithfulness", 0.0)
    change = f"P:{precision_change:+.2f} F:{faithfulness_change:+.2f}"

  print(f"{name:<40} {precision:>12.2f} {faithfulness:>12.2f} {change:>12}")

This summary table gives us a clear view of which strategies worked and which didn’t. We can see at a glance:

Which experiment produced the best Contextual Precision score
Which experiment produced the best Faithfulness score
Whether the two metrics moved together (as we predicted)
How much improvement each strategy provided

My Results Summary

Here’s what I got as output:


============================================================
RESULTS SUMMARY
============================================================
Configuration                               Precision Faithfulness       Change
------------------------------------------------------------
Baseline (1000/200/k=3)                          0.33         0.57     baseline
Exp 1: Smaller chunks (500/100/k=3)              0.33         0.67 P:+0.00 F:+0.10
Exp 2: More chunks (1000/200/k=5)                0.00         0.60 P:-0.33 F:+0.03
Exp 3: Both (500/100/k=5)                        0.00         0.00 P:-0.33 F:-0.57
Exp 4: Semantic (800/150/k=3)                    0.33         0.60 P:+0.00 F:+0.03

As yet another note on variability: if you run these experiments yourself, your exact scores will likely differ somewhat from mine because of LLM non-determinism. However, the relative trends (i.e., which experiments improved or degraded performance) should remain reasonably consistent.
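If you want to put a number on that variability, one option is to repeat a configuration several times and compare the spread of the scores against the deltas in the table. Here’s a minimal sketch using only the standard library. The score lists are illustrative placeholders, not results from my run, and is_within_noise is a hypothetical helper with a deliberately crude rule of thumb.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) for repeated-run scores."""
    return mean(scores), stdev(scores)


def is_within_noise(baseline_scores, experiment_scores):
    """Crude check: is the mean shift smaller than the larger spread?"""
    base_mean, base_sd = summarize(baseline_scores)
    exp_mean, exp_sd = summarize(experiment_scores)
    return abs(exp_mean - base_mean) <= max(base_sd, exp_sd)


# Placeholder Faithfulness scores from five hypothetical repeated runs.
baseline_runs = [0.57, 0.60, 0.50, 0.67, 0.57]
experiment_runs = [0.60, 0.67, 0.57, 0.60, 0.50]

print(summarize(baseline_runs))
print(is_within_noise(baseline_runs, experiment_runs))
```

A check like this is what separates “Faithfulness went up by 0.03” from “Faithfulness didn’t move.” It costs extra LLM calls, so treat it as something to reach for when a small delta is about to drive a decision.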

Our test results tell our developers a clear and somewhat sobering story: none of our tuning interventions improved the system. Every experiment either had no effect on Contextual Precision or made it worse. Faithfulness moved only slightly in Experiments 1, 2, and 4 (between +0.03 and +0.10) and collapsed to 0.00 in Experiment 3.

What the Results Reveal

Let’s analyze what these specific results teach us.

The baseline performed moderately. With a Faithfulness score of 0.57, ts-reasoner was doing a reasonable job extracting and synthesizing information from the chunks it received, though not perfectly. The Contextual Precision of 0.33 showed that the real problem was retrieval quality: relevant information was consistently ranked last. The model could only be as good as the chunks it was given, and those chunks, while reasonably well processed, didn’t contain the complete answer.

Naive parameter tuning made things worse. Whether we made chunks smaller (Exp 1), retrieved more (Exp 2), combined both approaches (Exp 3), or respected document structure (Exp 4), every variation either maintained the baseline failure or made it worse.

Chunking strategy cannot fix a retrieval strategy problem. All four experiments tried to solve the problem by changing how we chunk the document or how many chunks we retrieve. But the fundamental issue is that semantic similarity search ranks “exotic power generator” and “warp bubble” chunks higher than “10²⁸ kg antimatter calculation” chunks. No amount of chunking adjustment can fix this: the retrieval algorithm itself is matching on the wrong features.

Multiple metrics are essential for diagnosis. If we had only measured Faithfulness, Experiment 2 (F: 0.6) would look nearly as good as the baseline (F: 0.57). Only Contextual Precision (0.0 vs 0.33) reveals the complete retrieval failure. Similarly, if we had only measured Contextual Precision, Experiments 2 and 3 (both 0.0) would look identical. Only Faithfulness (0.60 vs 0.00) shows that generation collapsed entirely in Experiment 3. Both metrics together paint the complete picture.

What Does the Testing Tell Us?

Our experimental results provide a masterclass in what not to do when tuning RAG systems, and more importantly, they reveal why understanding failure modes is more valuable than achieving quick wins.

Helping teams understand failure modes is one of the core ethical concerns of test specialists!

Let’s synthesize what I think we’ve learned across all our testing.

The Diagnostic Value of Failure

The baseline established that ts-reasoner is performing moderately well (Faithfulness: 0.57) despite poor retrieval precision (Contextual Precision: 0.33). This immediately told us where to focus: not on improving the generation model, but on improving retrieval. Our four experiments then tested common tuning strategies and revealed that none of them addressed the actual problem.

This is the power of multi-metric evaluation. Without Contextual Precision, we might have concluded that the baseline was “approximately 60% correct” and moved on. Without Faithfulness, we wouldn’t have caught that generation collapsed entirely in Experiment 3 (0.00) while holding at 0.60 in Experiment 2, even though both scored 0.0 on retrieval precision. Together, these metrics triangulate the problem: the retriever is selecting the wrong chunks, and no amount of chunking adjustment fixes this because the problem is in the retrieval strategy, not the chunking strategy.

Confirming the Cascade from Previous Posts

Remember the cascade we identified in our Contextual Precision post: poor retrieval ordering (CP: 0.33) leads to incomplete information reaching the model, which leads to answers that miss critical details even when they’re technically faithful to what was retrieved. Our improvement experiments confirm this cascade works in reverse too: when we made retrieval worse (Exp 2: CP -> 0.0), Faithfulness held steady (0.6) because the model stayed grounded in its inadequate context. When we fragmented the chunks (Exp 1), Faithfulness didn’t drop at all; it rose slightly (0.57 -> 0.67). Fragmented context provides less material for faithful synthesis, yet it still provided enough of value.

The pattern is clear in our results table: when retrieval degrades (Exp 2 and 3: CP drops to 0.0), Faithfulness either stays moderate (0.60) or collapses entirely (0.00) depending on whether the model can extract anything useful from wrong content. When retrieval maintains baseline precision (Exp 1 and 4: CP stays at 0.33), Faithfulness hovers around 0.60-0.67 because the model works faithfully with incomplete context.

This confirms the cascade works bidirectionally: poor retrieval limits what the model can generate, but the model staying faithful to inadequate content doesn’t help answer the question. You can be perfectly faithful to the wrong information.

The Embedding Similarity Problem

Every experiment pointed to the same root cause: semantic similarity search matches queries to chunks based on semantic relatedness, not answer relevance. When we ask “What energy source does the paper propose?”, the retriever finds chunks about:

“Exotic power generators” (appears multiple times)
“Warp bubbles” and “warp drives” (high keyword overlap)
“Quantum field theory” and “extra dimensions” (conceptually related)

But it doesn’t find the chunk containing “10²⁸ kg of antimatter” because this specific calculation appears in a section discussing equations and numbers rather than conceptual frameworks. The embedding similarity algorithm sees this calculation section as less semantically similar to “energy source” than the conceptual sections, even though it’s more relevant for answering the question.

No chunking strategy can fix this mismatch. Whether we make chunks smaller (Exp 1), retrieve more of them (Exp 2), combine both approaches (Exp 3), or respect document structure (Exp 4), the underlying problem remains: the retrieval algorithm is optimizing for the wrong objective.

This points us toward interventions that address the retrieval strategy itself: hybrid search combining semantic and keyword matching, query reformulation, reranking retrieved results, or section-aware retrieval with metadata tags. But before pursuing these architectural changes, we should first understand whether this is purely a retrieval problem or whether different query types might work better with our existing system—which is exactly what we’ll explore in the next post.

That’s a great test finding! Job well done! Paycheck earned!

The Value of Negative Results

These experiments produced what scientists call “negative results,” meaning tests that didn’t achieve the desired outcome.

This is not “negative testing.” Loyal readers will know I’m not fond of that term.

In academic research, negative results are often unpublished, considered failures. But in engineering, negative results are invaluable because they tell you what not to do and help you understand why the system behaves as it does.

Our negative results taught us:

The baseline was actually performing reasonably well for faithfulness
The bottleneck is retrieval strategy, not chunking strategy
Naive parameter tuning can make things worse
Multiple metrics are essential for accurate diagnosis
Understanding failure modes guides you toward effective solutions

If Experiment 1 had shown a dramatic improvement, we might have stopped there and missed the deeper insight about retrieval strategy. By systematically testing and failing, we’ve diagnosed the actual problem and can now pursue interventions that might actually work.

Testing as a Diagnostic Tool

Throughout this series of posts, we looked at two metrics in isolation and then we built up a testing framework that, using those metrics, does more than just measure scores. It diagnoses problems:

Faithfulness tells us: Is the model staying grounded in its context?
Contextual Precision tells us: Is the retriever surfacing the right information?
Together they tell us: Where in the pipeline are things breaking?

This diagnostic approach transforms testing from a pass/fail gate into an engineering tool. When Faithfulness is reasonably good but Contextual Precision is low (like our baseline at F: 0.57, CP: 0.33, or Experiment 2), we know retrieval is the problem. When both are low (like Experiment 3), we know the entire pipeline is failing. When both metrics stay roughly stable even though the retrieved chunks change (like Experiment 1 vs baseline), we know the retrieval strategy is consistently broken in the same way.
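That reading of the metric pair can be written down as a small lookup. Here’s a minimal sketch: diagnose is a hypothetical helper, and the 0.5 cutoff is my assumption for illustration, so adjust it to whatever threshold your metrics use. The “generation” branch (good retrieval, poor grounding) isn’t a case we observed in these experiments; I’ve included it only so the function covers all four combinations.

```python
def diagnose(faithfulness, contextual_precision, threshold=0.5):
    """Map a (Faithfulness, Contextual Precision) pair to a likely failure area."""
    grounded = faithfulness >= threshold
    well_ranked = contextual_precision >= threshold

    if grounded and well_ranked:
        return "no obvious failure"
    if grounded and not well_ranked:
        return "retrieval"
    if not grounded and well_ranked:
        return "generation"
    return "entire pipeline"


# Scores taken from the results summary table.
results = {
    "Baseline": (0.57, 0.33),
    "Exp 2": (0.60, 0.00),
    "Exp 3": (0.00, 0.00),
}
for name, (f_score, cp_score) in results.items():
    print(f"{name}: {diagnose(f_score, cp_score)}")
```

With the scores from the summary table, the baseline and Experiment 2 both come back as “retrieval” and Experiment 3 comes back as “entire pipeline.” A lookup like this is a starting point for a conversation with the team, not a replacement for reading the retrieved chunks.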

Each metric pair tells a story about system behavior, and each experiment adds another data point to help us understand the system’s failure modes.

This is what effective testing looks like in any context: not just measuring scores (“pass/fail”), but building understanding.

I will note here that the script we built up is in my repo as retrieval-quality-002.py.

Next Steps!

Before we take the plunge into complex architectural changes, there’s a simpler question worth exploring: does our RAG system work better for different types of questions? In the next post, we’ll test whether the issue is purely the retrieval strategy, or whether there’s also a query-document mismatch at play.

Stories from a Software Tester

Twice upon a time, in another space, no distance in any direction from here …
