AI and Testing: Improving Retrieval Quality, Part 3

In the previous post we ran four experiments attempting to improve our RAG system’s retrieval quality through parameter tuning: smaller chunks, more retrieval, both combined, and semantic chunking. Every experiment either maintained the baseline failure or made it worse. Let’s continue investigating!

Before we get going, let’s remind ourselves of our context. We have a hypothetical scenario here:

Our developers are using a model they have developed (ts-reasoner).
Our developers have implemented RAG logic that they have developed.
We are testing both to make sure they are of reasonable quality.

You might think: “But the RAG is really just part of our test script.” Yes, it is. In reality, we would be running against the actual RAG from a development repo, although it is quite possible that we would build a test double of the RAG, as I have been doing in these posts.

With that scenario in mind, in the previous post, none of our interventions improved Contextual Precision, and several degraded Faithfulness. We provisionally concluded that the problem wasn’t our chunking strategy. Instead, it was our retrieval strategy’s fundamental mismatch with the query type. In other words, our RAG seems to be not so great.

Given that, one thing we could do is just pass these results back to the developers and suggest they get cracking on making a better retrieval system. However, before we do that, we’re going to test a different hypothesis entirely. Instead of trying to fix the RAG system, we’re going to question whether the system is actually broken. What if our baseline RAG configuration works perfectly well for some queries, just not the specific one we’ve been asking?

Reframing Our Test

Let’s consider our original question: “What energy source does the paper propose?” Note that to understand what the AI should find, you would realistically have to read the paper yourself. This was something I short-circuited for you a bit by simply providing you the expected output.

If you do read the paper, you realize that answering this question requires retrieving specific factual information from a calculations section (page 11: matter/antimatter annihilation, 10²⁸ kg). But our retriever kept surfacing conceptual framework content from pages 2 through 8 (exotic power generators, extra dimensions, quantum field theory). The diagnostic pattern from the Part 2 post suggested that semantic similarity search matches well on broad topics but struggles with specific facts.

So, here’s the question: what if we ask questions that should match those conceptual sections? Questions like “How does manipulating extra dimensions create a warp bubble?” or “What role do Kaluza-Klein modes play?” If the system performs well on these queries, it would strongly suggest something important: the RAG isn’t fundamentally broken. Rather, it’s optimized for a different kind of question than we were asking.

This matters because in production RAG systems, understanding when your system works and when it fails is more valuable than trying to make one configuration work for everything. Let’s test this hypothesis.

The Experiment

We’ll use the exact same baseline configuration from the previous posts:

Chunk size: 1000 characters
Chunk overlap: 200 characters
Retrieval: k=3 chunks
Same models: ts-reasoner (execution), ts-evaluator (judge)

The only thing we’re changing is the question. Instead of one query, we’ll test three different conceptual questions to see if the pattern holds:

“How does Jeff Nyman propose that manipulating extra dimensions creates a warp bubble?”
“What role do Kaluza-Klein modes play in the warp drive concept?”
“What is the relationship between the cosmological constant and warp bubble formation in the paper?”

Now, here’s a key thing: if our hypothesis is correct, these questions should produce much higher Contextual Precision and Faithfulness scores than our original query. Why? Because the answers actually live in the sections where our retriever keeps looking.

So let’s get our script in place. (All the code I show in this post will be available at retrieval-quality-003.py.)

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from deepeval.metrics import ContextualPrecisionMetric, FaithfulnessMetric
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase
from deepeval import evaluate

def create_rag_system(chunk_size=1000, chunk_overlap=200, k=3):
  """Create a RAG system with configurable parameters."""
  loader = PyPDFLoader("./arXiv-jnyman-051011v3.pdf")
  documents = loader.load()

  text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
  )

  chunks = text_splitter.split_documents(documents)

  embeddings = OllamaEmbeddings(model="nomic-embed-text")
  vectorstore = Chroma.from_documents(chunks, embeddings)

  retriever = vectorstore.as_retriever(search_kwargs={"k": k})

  return retriever, len(chunks)

def run_test(retriever, question, expected_output, show_chunks=True):
  """Run a complete test with both metrics."""

  execution_model = ChatOllama(model="jeffnyman/ts-reasoner")
  judge_model = OllamaModel(model="jeffnyman/ts-evaluator")

  # Get relevant context
  retrieved_docs = retriever.invoke(question)
  context = [doc.page_content for doc in retrieved_docs]

  if show_chunks:
    print("\n" + "-" * 60)
    print("RETRIEVED CHUNKS:")
    print("-" * 60)

    for i, chunk in enumerate(context, 1):
      print(f"\n--- Chunk {i} ---")
      print(chunk)

    print("-" * 60 + "\n")

  # Generate response
  prompt = f"Based on this context: {context}\n\nQuestion: {question}"
  response = execution_model.invoke(prompt).content

  # Create test case
  test_case = LLMTestCase(
    input=question,
    actual_output=response,
    expected_output=expected_output,
    retrieval_context=context
  )

  # Create metrics
  precision_metric = ContextualPrecisionMetric(
    model=judge_model,
    verbose_mode=True
  )

  faithfulness_metric = FaithfulnessMetric(
    model=judge_model,
    verbose_mode=True
  )

  # Evaluate with both metrics
  results = evaluate(
    test_cases=[test_case],
    metrics=[precision_metric, faithfulness_metric]
  )

  return results, context, response

def get_scores(results):
  """Safely extract scores from results."""
  if results is not None:
    metrics_data = results.test_results[0].metrics_data
    if metrics_data is not None:
      return {m.name: m.score for m in metrics_data}

  return {}

def print_scores(label, results):
  """Print scores."""
  print(f"\n{label} Scores:")
  scores = get_scores(results)

  if scores:
    print(f"Contextual Precision: {scores.get('Contextual Precision')}")
    print(f"Faithfulness: {scores.get('Faithfulness')}")
  else:
    print("No metrics data available.")

# =========================================================
# Setup baseline RAG system
# =========================================================
print("=" * 60)
print("BASELINE CONFIGURATION: chunk_size=1000, chunk_overlap=200, k=3")
print("=" * 60)

retriever, num_chunks = create_rag_system(
  chunk_size=1000,
  chunk_overlap=200,
  k=3
)

print(f"Document split into {num_chunks} chunks\n")

This code stops at the baseline setup. I’m showing it just so you can see that it is broadly the same code we had in place before, but you’ll notice there are no questions or expected outputs yet. We’re going to add these as experiments. Go ahead and add all three experiments to the script:

# =========================================================
# TEST 1: Extra Dimensions and Warp Bubble Creation
# =========================================================
print("=" * 60)
print("TEST 1: Extra Dimensions Question")
print("=" * 60)

question_1 = """How does Jeff Nyman propose that manipulating
extra dimensions creates a warp bubble?"""

expected_1 = """By locally manipulating the radius of extra
dimensions, which creates an asymmetry in the cosmological
constant that expands and contracts space-time around the
spacecraft."""

results_1, context_1, response_1 = run_test(
  retriever,
  question_1,
  expected_1
)

print_scores("Test 1", results_1)

# =========================================================
# TEST 2: Kaluza-Klein Modes
# =========================================================
print("\n" + "=" * 60)
print("TEST 2: Kaluza-Klein Modes Question")
print("=" * 60)

question_2 = """What role do Kaluza-Klein modes play in Jeff Nyman's
warp drive concept?"""

expected_2 = """Kaluza-Klein graviton modes contribute to the
Casimir energy in higher dimensions, which is associated with
the cosmological constant. This relationship between the
compactified extra dimensions and the cosmological constant is
fundamental to the warp drive mechanism."""

results_2, context_2, response_2 = run_test(
  retriever,
  question_2,
  expected_2
)

print_scores("Test 2", results_2)

# =========================================================
# TEST 3: Cosmological Constant Relationship
# =========================================================
print("\n" + "=" * 60)
print("TEST 3: Cosmological Constant Question")
print("=" * 60)

question_3 = """What is the relationship between the cosmological
constant and warp bubble formation in Jeff Nyman's paper?"""

expected_3 = """The cosmological constant is linked to the radius
of extra dimensions through Casimir energy. By manipulating the
extra dimension radius, the local cosmological constant can be
adjusted, creating expansion and contraction of space-time that
forms the warp bubble."""

results_3, context_3, response_3 = run_test(
  retriever,
  question_3,
  expected_3
)

print_scores("Test 3", results_3)

And, as before, let’s have a results summary section. Add this to the script:

# =========================================================
# RESULTS SUMMARY
# =========================================================
print("\n" + "=" * 60)
print("RESULTS SUMMARY")
print("=" * 60)
print(f"{'Test':<50} {'Precision':>12} {'Faithfulness':>12}")
print("-" * 60)

tests = [
  ("Test 1: Extra Dimensions", results_1),
  ("Test 2: Kaluza-Klein Modes", results_2),
  ("Test 3: Cosmological Constant", results_3)
]

for name, results in tests:
  scores = get_scores(results)
  # A metric that errored can report a score of None, so fall back to 0.0.
  precision = scores.get("Contextual Precision") or 0.0
  faithfulness = scores.get("Faithfulness") or 0.0
  print(f"{name:<50} {precision:>12.2f} {faithfulness:>12.2f}")

There are some key changes from the previous script that are worth calling out.

Single RAG setup: Create the retriever once, reuse for all three tests.
Three separate test sections: Each with its own question and expected output, unlike our previous experiments, which iterated over a single question.
Expected outputs: I’ve written these based on what the paper actually says. You can refine them, should you wish.
Results summary table: Clean comparison of all three tests.
Removed baseline comparison: Since we’re not comparing to Part 2’s baseline, we’re just showing the current experiment results.

Note one of my points above: the expected outputs are based on the paper content. You may want to adjust them based on what you think good answers should be.
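Because the three test sections all follow the same shape, they could also be driven from a list. What follows is only a sketch of that alternative, not what the script above does. The names `run_all` and `fake_run_test` are hypothetical, the expected outputs are abbreviated, and the stand-in function replaces the real `run_test` so the snippet runs without Ollama or the PDF.

```python
from typing import Callable, Dict, List, Tuple

# (label, question, expected output) triples. The expected outputs are
# abbreviated here; the script above carries the full text.
TEST_CASES: List[Tuple[str, str, str]] = [
    (
        "Test 1: Extra Dimensions",
        "How does Jeff Nyman propose that manipulating extra dimensions creates a warp bubble?",
        "By locally manipulating the radius of extra dimensions.",
    ),
    (
        "Test 2: Kaluza-Klein Modes",
        "What role do Kaluza-Klein modes play in Jeff Nyman's warp drive concept?",
        "Kaluza-Klein graviton modes contribute to the Casimir energy in higher dimensions.",
    ),
    (
        "Test 3: Cosmological Constant",
        "What is the relationship between the cosmological constant and warp bubble formation in Jeff Nyman's paper?",
        "The cosmological constant is linked to the radius of extra dimensions through Casimir energy.",
    ),
]


def run_all(
    cases: List[Tuple[str, str, str]],
    run_fn: Callable[[str, str], object],
) -> Dict[str, object]:
    """Call run_fn(question, expected) for each case and collect results by label."""
    results: Dict[str, object] = {}
    for label, question, expected in cases:
        results[label] = run_fn(question, expected)
    return results


def fake_run_test(question: str, expected: str) -> dict:
    # Hypothetical stand-in for run_test(retriever, question, expected).
    return {"question": question, "expected": expected}


all_results = run_all(TEST_CASES, fake_run_test)
for label in all_results:
    print(label)
```

In the real script, the stand-in would be swapped for something like `lambda q, e: run_test(retriever, q, e)`. I kept the three sections explicit in this post because they are easier to narrate one at a time.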

Experiment 1 Output

Here is what I got from the first experiment:


============================================================
BASELINE CONFIGURATION: chunk_size=1000, chunk_overlap=200, k=3
============================================================
Document split into 38 chunks

============================================================
TEST 1: Extra Dimensions Question
============================================================

------------------------------------------------------------
RETRIEVED CHUNKS:
------------------------------------------------------------

--- Chunk 1 ---
Cosmological Constant Manipulation in Extra
Dimensions for Exotic Field Propulsion
Jeff Nyman
(Original Dated: 15 November 2009)
(Initial Review: 20 January 2011)
(Revision: 30 January 2011)
(Peer Review: 28 March 2011)
(Revision: 3 April 2011)
(Candidate Acceptance: 10 May 2011)

Abstract
In this paper, I propose a  new approach to generating a warp bubble metric necessary for an Alcubierre -like warp drive effect. The  warp bubble would theoretically allow a spacecraft to travel at arbitrarily  high velocities. Key to this idea is the ability to locally manipulate an extra dimension. String theory suggests that dimensions are globally held compact by strings wrapping around them  which means it may be possible to locally increase or decrease the str ing tension, or even counter the eff ects of some string winding modes. This would change the size of the extra dimensions, allowing for faster-

--- Chunk 2 ---
energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble could be gen erated using ideas and mathematics that are consistent with quantum field theory . This may have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently advanced technology. By associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of gravitons in higher dimensions, especially in the context of M -theory derived or inspired models, it is possible to form a relationship between Λ and the radius of the compact extra dimension. As I have shown with equation (16), the following holds:

Equation 17
An easier way of developing this relationship is to put things in terms of Hubble’s constant H which describes the rate of expansion of space per unit distance of space.

Equation 18
I can also put this in terms of the radius of the extra dimension, which gives:

Equation 19

--- Chunk 3 ---
of Relativity. An element missing from all the papers is that there is little or no suggestion as to how such a warp bubble may be created. I do not plan to buck that trend too much in that the aim of this paper is not to discuss the plausibility of a warp drive. This means I am not addressing the valid questions associated with violation of the null
------------------------------------------------------------

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "yes",
        "reason": "The first document mentions 'locally manipulating an extra dimension' which is directly relevant to the question about how Jeff Nyman proposes to create a warp bubble."
    },
    {
        "verdict": "no",
        "reason": "While this document provides mathematical relationships and equations, it does not explicitly discuss the method of locally manipulating extra dimensions as asked in the question. It focuses more on the theoretical framework rather than the specific technique."
    },
    {
        "verdict": "no",
        "reason": "This document is largely irrelevant to the question about creating a warp bubble through manipulation of extra dimensions, instead focusing on the plausibility and challenges associated with such a concept."
    }
]

Score: 1.0
Reason: The score is 1.00 because the first node (ranked 1) directly addresses the method Jeff Nyman proposes for creating a warp bubble through locally manipulating extra dimensions, making it highly relevant. The subsequent nodes (ranks 2 and 3) are less relevant as they either focus on broader theoretical aspects or challenges, which do not specifically address the question.

======================================================================
**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "The paper proposes a new approach to generating a warp bubble metric for an Alcubierre-like warp drive effect.",
    "Key to this idea is the ability to locally manipulate an extra dimension.",
    "String theory suggests that dimensions are globally compacted by strings wrapping around them.",
    "The paper aims to suggest how such a bubble could be created using ideas and mathematics consistent with quantum field theory.",
    "Equation (16) in the paper establishes a relationship between the cosmological constant Λ and the radius of the compact extra dimension.",
    "Hubble’s constant H is used to describe the rate of expansion of space per unit distance of space, which helps develop the relationship mentioned above.",
    "The paper expresses that there is little or no suggestion as to how such a warp bubble may be created."
]

Claims:
[
    "The central hypothesis is that manipulating extra dimensions can generate a warp bubble, enabling faster-than-light travel, similar to the Alcubierre drive concept.",
    "Nyman grounds this in string theory, specifically the idea that extra dimensions are compactified (curled up) around strings.",
    "The key mechanism is the ability to locally change the tension of these strings. By increasing or decreasing this tension, or altering the string winding modes, he suggests this alters the size of the extra dimensions.",
    "Changing the size of the extra dimensions, according to Nyman, directly influences the geometry of spacetime around the spacecraft, creating the warp bubble.",
    "The specific equations (16, 17, 18, and 19) illustrate this relationship between string tension changes and the creation of a warp bubble.",
    "Nyman associates the cosmological constant (Λ) – representing the energy density of space – with the Casimir energy arising from Kaluza-Klein modes of gravitons within these higher dimensions.",
    "The Hubble constant (H) is introduced to relate the size of the extra dimension to the rate of expansion of space, further illustrating the geometric relationship.",
    "Nyman acknowledges the missing element in existing warp drive research: a discussion of how such a warp bubble could actually be created. His paper aims to fill this gap, focusing on the theoretical relationship rather than practical feasibility."
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context does not directly contradict the claim that Nyman grounds this in string theory, specifically the idea of compactified extra dimensions. However, it focuses more on the manipulation of these dimensions rather than explicitly mentioning strings."
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context supports that changing the size of the extra dimensions influences spacetime geometry, which is consistent with the claim. The specific mechanism involving string tension changes and winding modes is not explicitly mentioned but aligns with the broader idea."
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context does not directly contradict the claim that equations (16, 17, 18, and 19) illustrate the relationship between string tension changes and warp bubble creation. The equations are mentioned but their exact nature is not detailed."
    },
    {
        "verdict": "no",
        "reason": "The context supports Nyman's association of the cosmological constant (\u039b) with Casimir energy, which aligns with the claim."
    },
    {
        "verdict": "yes",
        "reason": null
    }
]

Score: 0.5
Reason: The score is 0.50 because while some aspects are aligned, key details such as explicit mentions of strings, compactified extra dimensions, and specific mechanisms like string tension changes and winding modes are missing from the actual output.

======================================================================
======================================================================

Metrics Summary

  - ✅ Contextual Precision (score: 1.0, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 1.00 because the first node (ranked 1) directly addresses the method Jeff Nyman proposes for creating a warp bubble through locally manipulating extra dimensions, making it highly relevant. The subsequent nodes (ranks 2 and 3) are less relevant as they either focus on broader theoretical aspects or challenges, which do not specifically address the question., error: None)
  - ✅ Faithfulness (score: 0.5, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.50 because while some aspects are aligned, key details such as explicit mentions of strings, compactified extra dimensions, and specific mechanisms like string tension changes and winding modes are missing from the actual output., error: None)

For test case:

  - input: How does Jeff Nyman propose that manipulating
extra dimensions creates a warp bubble?

  - actual output: Okay, let's break down Jeff Nyman’s proposal step-by-step, focusing on how he connects extra dimension manipulation to warp bubble creation.

1.  **Core Idea:** The central hypothesis is that manipulating extra dimensions can generate a warp bubble, enabling faster-than-light travel, similar to the Alcubierre drive concept.

2.  **String Theory Foundation:** Nyman grounds this in string theory, specifically the idea that extra dimensions are compactified (curled up) around strings.

3.  **Local Tension Manipulation:** The key mechanism is the ability to *locally* change the tension of these strings. By increasing or decreasing this tension, or altering the string winding modes, he suggests this alters the size of the extra dimensions.

4.  **Dimension Size & Warp Bubble:** Changing the size of the extra dimensions, according to Nyman, directly influences the geometry of spacetime around the spacecraft, creating the warp bubble. The specific equations (16, 17, 18, and 19) illustrate this relationship.

5.  **Cosmological Constant Link:** Nyman associates the cosmological constant (Λ) – representing the energy density of space – with the Casimir energy arising from Kaluza-Klein modes of gravitons within these higher dimensions. This links the warp bubble concept to quantum field theory.

6.  **Hubble Constant Connection:** The Hubble constant (H) is introduced to relate the size of the extra dimension to the rate of expansion of space, further illustrating the geometric relationship.

7.  **Lack of Creation Method:**  Crucially, Nyman acknowledges the missing element in existing warp drive research: a discussion of *how* such a warp bubble could actually be created. His paper aims to fill this gap, focusing on the theoretical relationship rather than practical feasibility.

**In short,** Nyman proposes that by manipulating extra dimensions—specifically through changes in string tension—the geometry of spacetime can be altered to form a warp bubble, as suggested by the cosmological constant.

---
Do you want me to elaborate on any specific aspect, such as the equations or the link to Kaluza-Klein modes?

  - expected output: By locally manipulating the radius of extra
dimensions, which creates an asymmetry in the cosmological
constant that expands and contracts space-time around the
spacecraft.

  - context: None

  - retrieval context: ['Cosmological Constant Manipulation in Extra \nDimensions for Exotic Field Propulsion \nJeff Nyman \n(Original Dated: 15 November 2009) \n(Initial Review: 20 January 2011) \n(Revision: 30 January 2011) \n(Peer Review: 28 March 2011) \n(Revision: 3 April 2011) \n(Candidate Acceptance: 10 May 2011) \n \nAbstract \nIn this paper, I propose a  new approach to generating a warp bubble metric necessary \nfor an Alcubierre -like warp drive effect. The  warp bubble would theoretically allow a \nspacecraft to travel at arbitrarily  high velocities. Key to this idea is the ability to locally \nmanipulate an extra dimension. String theory suggests that dimensions are globally held \ncompact by strings wrapping around them  which means it may be possible to locally \nincrease or decrease the str ing tension, or even counter the eff ects of some string  \nwinding modes. This would change the size of the extra dimensions, allowing for faster-', 'energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble \ncould be gen erated using ideas and mathematics that are consistent with quantum field theory . This \nmay have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently \nadvanced technology. \nBy associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of \ngravitons in higher dimensions, especially in the context of M -theory derived or inspired models, it is \npossible to form a relationship between Λ and the radius of the compact extra dimension. As I have \nshown with equation (16), the following holds: \n \nEquation 17 \nAn easier way of developing this relationship is to put things in terms of Hubble’s constant H which \ndescribes the rate of expansion of space per unit distance of space. \n \nEquation 18 \nI can also put this in terms of the radius of the extra dimension, which gives: \n \nEquation 19', 'of Relativity. \nAn element missing from all the papers is that there is little or no suggestion as to how \nsuch a warp bubble may be created. \nI do not plan to buck that trend too much in that the aim of this paper is not to discuss the plausibility of \na warp drive. This means I am not addressing the valid questions associated with violation of the null']

======================================================================

Overall Metric Pass Rates

Contextual Precision: 100.00% pass rate
Faithfulness: 100.00% pass rate

Test 1 Scores:
Contextual Precision: 1.0
Faithfulness: 0.5

Wow! Look at that Contextual Precision: 1.0! This is exactly what we hoped for. Let me narrate this result a bit, because it’s essentially a complete reversal: the scores tell a dramatically different story than anything we’ve seen so far.

Contextual Precision: 1.0 (vs. baseline 0.33 with the energy source question)
Faithfulness: 0.5 (vs. baseline 0.57)

Contextual Precision achieved a perfect score. The metric’s reasoning is clear: “The first node (ranked 1) directly addresses the method Jeff Nyman proposes for creating a warp bubble through locally manipulating extra dimensions, making it highly relevant.” Look at what was retrieved:

Chunk 1: The paper’s abstract. Literally the opening summary that explains the core concept of manipulating extra dimensions to create a warp bubble. This is exactly what the question asks about.
Chunks 2 and 3: Supporting material about the mathematical relationships (equations 17-19) and context about existing warp drive research. These are marked “no” for direct relevance, but they don’t hurt the score because the most relevant chunk is ranked first.

This is the opposite of what we saw in Part 2. In every experiment there, relevant information was buried at position #3 (when it appeared at all). Here, the most relevant chunk is at position #1, exactly where it should be.
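To see why rank matters so much here, it can help to compute the score by hand. What follows is a minimal sketch, assuming the metric uses the weighted cumulative precision formula that DeepEval documents for Contextual Precision. The function name `contextual_precision` is my own, not part of any library. The first verdict list mirrors this test (relevant chunk first, then two "no" verdicts) and the second mirrors the Part 2 pattern of a relevant chunk buried in third position.

```python
def contextual_precision(verdicts):
    """Weighted cumulative precision over ranked relevance verdicts.

    verdicts: list of "yes"/"no" strings, ordered by retrieval rank.
    Assumed formula: average of (relevant chunks seen so far / rank),
    taken only at the ranks where a relevant chunk appears.
    """
    relevant_so_far = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict == "yes":
            relevant_so_far += 1
            weighted_sum += relevant_so_far / rank
    if relevant_so_far == 0:
        return 0.0  # nothing relevant was retrieved at all
    return weighted_sum / relevant_so_far


top_ranked = contextual_precision(["yes", "no", "no"])  # relevant chunk first
buried = contextual_precision(["no", "no", "yes"])      # relevant chunk last
print(round(top_ranked, 2), round(buried, 2))  # prints: 1.0 0.33
```

Under that assumption, the same three chunks produce 1.0 or 0.33 depending purely on where the relevant one lands, which is consistent with the jump we see between the baseline and this test.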

Great, but why did this work? The question “How does manipulating extra dimensions create a warp bubble?” matches perfectly with the semantic content of the paper’s abstract and theoretical framework sections. These sections are about this very concept; they use terms like “extra dimensions,” “warp bubble,” “manipulating,” and “string theory” extensively. The semantic similarity search found exactly what it was optimized to find: conceptually relevant content.

Compare this to Part 2’s question “What energy source?” which required finding a specific calculation buried in a section focused on numbers rather than concepts. That’s a semantic mismatch. Our new question has semantic alignment.

Faithfulness scored 0.5, which might seem lower than we would like, but look at why. The model generated a detailed, well-structured response covering:

String theory foundation
Local tension manipulation
Dimension size changes
Cosmological constant links
The Hubble constant relationship

The metric flagged several claims as not explicitly stated in the retrieved chunks: statements about “string winding modes,” “specific equations (16, 17, 18, 19),” and detailed mechanisms. The model is adding detail and structure that, while consistent with the paper’s concepts, goes slightly beyond what’s explicitly in these three chunks.

Even so, the key insight is that the model is answering the question correctly and comprehensively. It’s explaining how manipulating extra dimensions creates a warp bubble, which is exactly what was asked. The 0.5 Faithfulness score reflects the model synthesizing and elaborating rather than strictly paraphrasing, but the core answer is there and accurate.

So, our test finding here is that when query semantics align with document section semantics, semantic similarity search works beautifully. Contextual Precision jumped from 0.33 to 1.0 (a complete reversal!) simply by asking a question that matches the type of content the retriever naturally surfaces. The RAG system isn’t broken; it’s optimized for conceptual queries about theoretical frameworks, not specific factual queries about numerical calculations.

Experiment 2 Output

Here is the output I got for the second experiment:


============================================================
TEST 2: Kaluza-Klein Modes Question
============================================================

------------------------------------------------------------
RETRIEVED CHUNKS:
------------------------------------------------------------

--- Chunk 1 ---
Cosmological Constant Manipulation in Extra
Dimensions for Exotic Field Propulsion
Jeff Nyman
(Original Dated: 15 November 2009)
(Initial Review: 20 January 2011)
(Revision: 30 January 2011)
(Peer Review: 28 March 2011)
(Revision: 3 April 2011)
(Candidate Acceptance: 10 May 2011)

Abstract
In this paper, I propose a  new approach to generating a warp bubble metric necessary for an Alcubierre -like warp drive effect. The  warp bubble would theoretically allow a spacecraft to travel at arbitrarily  high velocities. Key to this idea is the ability to locally manipulate an extra dimension. String theory suggests that dimensions are globally held compact by strings wrapping around them  which means it may be possible to locally increase or decrease the str ing tension, or even counter the eff ects of some string winding modes. This would change the size of the extra dimensions, allowing for faster-

--- Chunk 2 ---
of Relativity. An element missing from all the papers is that there is little or no suggestion as to how such a warp bubble may be created. I do not plan to buck that trend too much in that the aim of this paper is not to discuss the plausibility of a warp drive. This means I am not addressing the valid questions associated with violation of the null

--- Chunk 3 ---
increase or decrease the str ing tension, or even counter the eff ects of some string winding modes. This would change the size of the extra dimensions, allowing for faster-than-light propulsion in global reference frames. Calculations of the energy requirements of a hypothetical warp drive are put forth and a “top” speed limit is proposed based on the size of the extra dimensions.



arXiv:Candidate Paper [gr-qc] 10 May 2011

[Type the sender company name] [Type the company address] [Type the company phone number]
------------------------------------------------------------

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "yes",
        "reason": "The context mentions 'increase or decrease the string tension, or even counter the effects of some string winding modes. This would change the size of the extra dimensions, allowing for faster-than-light propulsion in global reference frames.' which is relevant to the role of Kaluza-Klein modes in Jeff Nyman's warp drive concept."
    },
    {
        "verdict": "no",
        "reason": "The second document does not contain any information directly related to Kaluza-Klein modes or their role in a warp drive. It discusses the plausibility and challenges of creating a warp bubble metric, but does not mention Kaluza-Klein modes."
    },
    {
        "verdict": "no",
        "reason": "The third document is an abstract that lacks specific details about Kaluza-Klein modes or their role in Jeff Nyman's concept. It focuses on the energy requirements and speed limits of a hypothetical warp drive, which does not directly address the question."
    }
]

Score: 1.0
Reason: The score is 1.00 because the first node is ranked as 'yes' and provides relevant information about Kaluza-Klein modes in relation to Jeff Nyman's warp drive concept, while the subsequent nodes are ranked as 'no', containing no direct or useful information on the topic.

======================================================================
**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "The paper proposes a new approach to generating a warp bubble metric for an Alcubierre-like warp drive effect.",
    "Key to this idea is the ability to locally manipulate an extra dimension.",
    "String theory suggests that dimensions are globally compacted by strings wrapping around them.",
    "Manipulating string tension or counteracting some string winding modes could change the size of extra dimensions.",
    "The paper calculates energy requirements for a hypothetical warp drive.",
    "A 'top' speed limit is proposed based on the size of the extra dimensions."
]

Claims:
[
    "The core of the paper is about generating a warp bubble using manipulation of extra dimensions.",
    "Kaluza-Klein (KK) modes are hypothetical particles associated with the compactification of extra spatial dimensions.",
    "Nyman proposes manipulating 'string tension' within these compactified extra dimensions, effectively controlling the size of the dimensions themselves.",
    "Controlling the string tension is analogous to controlling the vibrational modes (KK modes) of these dimensions.",
    "The document notes that other papers lack a discussion on how the warp bubble is created.",
    "Nyman’s approach aims to change the size of extra dimensions by altering string tension, which is essential for creating the warp bubble."
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context states that manipulating string tension or counteracting some string winding modes could change the size of extra dimensions, but it does not mention Kaluza-Klein (KK) modes directly."
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The context mentions manipulating string tension or counteracting some string winding modes to change the size of extra dimensions, but it does not explicitly state that this is analogous to controlling KK modes."
    },
    {
        "verdict": "idk",
        "reason": "While the context suggests that the paper discusses energy requirements and a top speed limit for a hypothetical warp drive, it does not provide direct evidence about other papers lacking discussion on how the warp bubble is created."
    },
    {
        "verdict": "yes",
        "reason": null
    }
]

Score: 0.6666666666666666
Reason: The score is 0.67 because the actual output implies a direct connection between manipulating string tension or counteracting some string winding modes and changing the size of extra dimensions, as well as relating it to Kaluza-Klein (KK) modes, which are not explicitly mentioned in the context.

======================================================================
======================================================================

Metrics Summary

  - ✅ Contextual Precision (score: 1.0, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 1.00 because the first node is ranked as 'yes' and provides relevant information about Kaluza-Klein modes in relation to Jeff Nyman's warp drive concept, while the subsequent nodes are ranked as 'no', containing no direct or useful information on the topic., error: None)
  - ✅ Faithfulness (score: 0.6666666666666666, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.67 because the actual output implies a direct connection between manipulating string tension or counteracting some string winding modes and changing the size of extra dimensions, as well as relating it to Kaluza-Klein (KK) modes, which are not explicitly mentioned in the context., error: None)

For test case:

  - input: What role do Kaluza-Klein modes play in Jeff Nyman's
warp drive concept?

  - actual output: Okay, let's break this down systematically to answer this question about Jeff Nyman's warp drive concept.

1.  **Initial Understanding:** The core of the paper is about generating a warp bubble using manipulation of extra dimensions. The central idea is rooted in string theory.

2.  **Kaluza-Klein Modes - Identifying the Relevant Concept:** Kaluza-Klein (KK) modes are hypothetical particles associated with the compactification of extra spatial dimensions. In simpler terms, if extra dimensions exist and are curled up into very small sizes, the resulting vibrations of these curled-up dimensions can manifest as new particles we observe as “KK modes”.  These modes represent quantized excitations of these compactified dimensions.

3.  **Nyman’s Connection to KK Modes:** The paper explicitly states, “String theory suggests that dimensions are globally held compact by strings wrapping around them… This would change the size of the extra dimensions, allowing for faster-than-light propulsion in global reference frames.” This directly links to KK modes. Nyman proposes manipulating the "string tension" within these compactified extra dimensions, effectively controlling the size of the dimensions themselves.  This manipulation is *analogous* to controlling the vibrational modes (KK modes) of these dimensions.

4.  **How it Relates to the Warp Bubble:** By altering the string tension, Nyman’s method aims to change the size of these extra dimensions. A change in the size of these extra dimensions is essential to creating the warp bubble—the structure required for the Alcubierre-like warp drive.

5.  **Addressing the Missing Piece:** The document notes that other papers lack a discussion of *how* the warp bubble is created. Nyman’s approach attempts to fill this gap, proposing that controlling the string tension – and thus the KK modes – is a mechanism for creating the bubble.

**Therefore, Kaluza-Klein modes (specifically, the vibrational modes associated with the compactification of extra dimensions) play a crucial role in Nyman's warp drive concept as the mechanism by which the size of these extra dimensions is actively manipulated to generate the warp bubble.**

---
Do you want me to:

*   Expand on a particular aspect of this explanation?
*   Explore the energy requirements discussed in the paper?
*   Consider the potential risks or falsification of this approach?

  - expected output: Kaluza-Klein graviton modes contribute to the
Casimir energy in higher dimensions, which is associated with
the cosmological constant. This relationship between the
compactified extra dimensions and the cosmological constant is
fundamental to the warp drive mechanism.

  - context: None

  - retrieval context: ['Cosmological Constant Manipulation in Extra \nDimensions for Exotic Field Propulsion \nJeff Nyman \n(Original Dated: 15 November 2009) \n(Initial Review: 20 January 2011) \n(Revision: 30 January 2011) \n(Peer Review: 28 March 2011) \n(Revision: 3 April 2011) \n(Candidate Acceptance: 10 May 2011) \n \nAbstract \nIn this paper, I propose a  new approach to generating a warp bubble metric necessary \nfor an Alcubierre -like warp drive effect. The  warp bubble would theoretically allow a \nspacecraft to travel at arbitrarily  high velocities. Key to this idea is the ability to locally \nmanipulate an extra dimension. String theory suggests that dimensions are globally held \ncompact by strings wrapping around them  which means it may be possible to locally \nincrease or decrease the str ing tension, or even counter the eff ects of some string  \nwinding modes. This would change the size of the extra dimensions, allowing for faster-', 'of Relativity. An element missing from all the papers is that there is little or no suggestion as to how \nsuch a warp bubble may be created. \nI do not plan to buck that trend too much in that the aim of this paper is not to discuss the plausibility of \na warp drive. This means I am not addressing the valid questions associated with violation of the null', 'increase or decrease the str ing tension, or even counter the eff ects of some string  \nwinding modes. This would change the size of the extra dimensions, allowing for faster-\nthan-light propulsion in global reference frames. Calculations of the energy \nrequirements of a hypothetical warp drive are put forth and a “top” speed limit is \nproposed based on the size of the extra dimensions. \n \n \n  \narXiv:Candidate Paper [gr-qc] 10 May 2011 \n  \n[Type the sender company name] [Type the company address] [Type the company phone number]']

======================================================================

Overall Metric Pass Rates

Contextual Precision: 100.00% pass rate
Faithfulness: 100.00% pass rate

Here we have yet another perfect Contextual Precision score! Test 2 confirms the pattern.

Contextual Precision: 1.0 (perfect, again!)
Faithfulness: 0.67 (up from Test 1’s 0.5)

Contextual Precision achieved another perfect score because the metric found the relevant information ranked first: “The first node is ranked as ‘yes’ and provides relevant information about Kaluza-Klein modes in relation to Jeff Nyman’s warp drive concept.” Consider what was retrieved:

Chunk 1: The abstract yet again, mentioning string theory, compactified dimensions, and string winding modes. This is conceptually relevant to Kaluza-Klein modes even though it doesn’t use that exact terminology.
Chunks 2 and 3: One discusses the paper’s scope (not directly relevant), the other repeats part of the abstract with additional context about energy calculations.

The retriever successfully identified the most relevant chunk and ranked it first. This is the second consecutive test where the system performed flawlessly on retrieval precision. Faithfulness, meanwhile, improved to 0.67 (from 0.5 in Test 1). The model generated a comprehensive, well-structured answer explaining:

What Kaluza-Klein modes are (quantized excitations of compactified dimensions)
How they relate to my overall warp concept (manipulating string tension = controlling vibrational modes)
Why this matters for the warp drive (changing dimension size creates the bubble)

It’s worth noting that the metric flagged that “Kaluza-Klein modes” aren’t explicitly mentioned in the retrieved chunks, and, further, it noted that the model is making analogical connections (“controlling string tension is analogous to controlling KK modes”) that go slightly beyond what’s explicitly stated. But the core answer is accurate and well-grounded in the paper’s concepts.

There’s an interesting detail here. Notice that the model correctly explains KK modes even though the term doesn’t appear in the retrieved chunks. The abstract discusses “string winding modes” and “compactified dimensions,” and the model correctly identifies these as related to the Kaluza-Klein framework. This shows the model has strong background knowledge about theoretical physics and can make appropriate connections. Thus, it’s not just parroting text.

The Faithfulness metric appropriately penalizes this (hence 0.67 not 1.0) because the model is synthesizing and inferring rather than strictly quoting. However, from a practical standpoint, this is exactly the kind of intelligent synthesis we want from a RAG system: using retrieved context plus domain knowledge to provide useful explanations.
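The 0.67 itself can be reproduced from the six verdicts in the Faithfulness log for this test. This is a minimal sketch, assuming the score is the fraction of claims that did not receive an explicit "no" verdict (so "idk" does not count against the model), which is what the logged numbers suggest. The function name `faithfulness` and the choice to return 1.0 for an empty list are my own assumptions. The 0.5 threshold comes from the Metrics Summary above.

```python
def faithfulness(verdicts):
    """Fraction of claims not contradicted by the retrieval context.

    Assumption: only an explicit "no" counts against the score;
    "idk" is treated the same as "yes".
    """
    if not verdicts:
        return 1.0  # assumption for this sketch: no claims, nothing unfaithful
    not_contradicted = sum(1 for verdict in verdicts if verdict != "no")
    return not_contradicted / len(verdicts)


# Verdicts copied from the Test 2 Faithfulness log, in order.
test2_verdicts = ["yes", "no", "yes", "no", "idk", "yes"]
THRESHOLD = 0.5  # threshold shown in the Metrics Summary

score = faithfulness(test2_verdicts)
passed = score >= THRESHOLD
print(round(score, 2), passed)  # prints: 0.67 True
```

Two claims were rejected out of six, so four survive, giving 4/6. Note that a score of exactly 0.5 would also pass under a "greater than or equal to" comparison, which would explain why Test 1’s Faithfulness of 0.5 still shows a 100% pass rate.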

Consider what this means for testing. You have to make allowances for how the system synthesizes knowledge that it may already have. This is why metrics are so important for checking if that synthesis is accurate and in line with the resources being referenced.

Our test finding is that we’re two for two. Both conceptual questions achieved perfect Contextual Precision (1.0), retrieving the most relevant chunk first. Again, this is a stark contrast to Part 2 where every experiment scored 0.0 or 0.33 for Contextual Precision. The pattern is clear and consistent: semantic similarity search excels at matching conceptual queries to conceptual content.

Thus, the evidence is mounting that this isn’t luck. Rather, it’s a systematic difference in how the retriever handles different query types.

Experiment 3 Output

Here’s what I got for experiment 3’s output:


============================================================
TEST 3: Cosmological Constant Question
============================================================

------------------------------------------------------------
RETRIEVED CHUNKS:
------------------------------------------------------------

--- Chunk 1 ---
energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble could be gen erated using ideas and mathematics that are consistent with quantum field theory . This may have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently advanced technology. By associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of gravitons in higher dimensions, especially in the context of M -theory derived or inspired models, it is possible to form a relationship between Λ and the radius of the compact extra dimension. As I have shown with equation (16), the following holds:

Equation 17
An easier way of developing this relationship is to put things in terms of Hubble’s constant H which describes the rate of expansion of space per unit distance of space.

Equation 18
I can also put this in terms of the radius of the extra dimension, which gives:

Equation 19

--- Chunk 2 ---
I can now look at the energy required to create the necessary warp bubble. The accepted value of the cosmological constant is Λ ? 10 -47(GeV)4. Converting again into SI units gives Λ ? 10 -10J/m3. Now, for a warp bubble expanding at the speed of light I would need to increase this again by a factor of 10 52 as I have H ∝ √Λ . I can thus say this:

Equation 29
Here Λc is the local value of the cosmological constant when space is expanding at c. To make this a concrete example, I will consider a spacecraft of these dimensions:

Equation 30
If I postulate that the warp bubble must, at least, encompass the volume of the craft,  the total amount of energy ‘injected’ locally would equal

Equation 31
Assuming some arbitrarily advanced civilization was able to create such an effect ,  I will further postulate that this civilization would be able to utilize the most efficient method of  energy production,

--- Chunk 3 ---
Equation 12
Using the fact that ζ(?2m) = 0, for m any natural number, I obtain the vacuum energy:

Equation 13

Equation 14

Equation 15
This result can be generalized to 4+n dimensions

Equation 16
Here the degrees of freedom of the graviton a re expressed in the brackets. This vacuum energy due to the massive KK modes is associated with the cosmological constant.
5. Warp Drives
Numerous papers discussing the idea of warp drives have emerged in the literature in recent years. See for example [19]. The basic idea of all of these is to formulate a solution to Einstein’s equations whereby a warp bubble is driven by a local expansion of space-time behind the bubble and a contraction ahead of the bubble. One common feature of these papers is that thei r physical foundation is the General Theory of Relativity. An element missing from all the papers is that there is little or no suggestion as to how such a warp bubble may be created.
------------------------------------------------------------

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "yes",
        "reason": "The first document directly links the cosmological constant (\u039b) with the radius of the extra dimension, which is crucial for understanding the relationship between \u039b and warp bubble formation. It also introduces relevant equations such as Equation 17, which are essential to the expected output."
    },
    {
        "verdict": "yes",
        "reason": "The second document provides a specific value for the cosmological constant (\u039b \u2248 10 -47(GeV)4 or \u039b \u2248 10 -10J/m3), and discusses how this relates to the energy required to create a warp bubble. This information is directly relevant to the expected output."
    },
    {
        "verdict": "no",
        "reason": "The third document, while containing mathematical equations related to vacuum energy and gravitons, does not provide any direct link between the cosmological constant and the formation of a warp bubble as described in the expected output. It is more focused on theoretical aspects unrelated to the specific relationship being asked about."
    }
]

Score: 1.0
Reason: The score is 1.00 because nodes 1 and 2 are directly relevant, providing essential information linking the cosmological constant with warp bubble formation through equations and specific values. Node 3, ranked third, is less relevant as it does not address the specific relationship in question.

======================================================================
**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "The paper suggests that a warp bubble could be generated using ideas and mathematics consistent with quantum field theory.",
    "Equation 16 relates the degrees of freedom of the graviton to vacuum energy due to massive Kaluza Klein modes, which is associated with the cosmological constant.",
    "Equation 29 involves the relationship between the local value of the cosmological constant (Λc) and Hubble’s constant (H).",
    "The total amount of energy required to create a warp bubble for a spacecraft would be approximately 10^42 Joules, based on Equation 31.",
    "The paper considers the possibility that an advanced civilization could utilize the most efficient method of energy production to create such an effect."
]

Claims:
[
    "The central argument is that a warp bubble could be generated using quantum field theory concepts and the cosmological constant.",
    "Nyman connects the cosmological constant (Λ) to the Casimir energy via Kaluza-Klein modes of gravitons in higher dimensions (M-theory).",
    "Equation 16 states that the vacuum energy associated with massive Kaluza-Klein modes is equivalent to the cosmological constant.",
    "The relationship between Λ and Hubble’s constant (H) is established, implying that the size of the extra dimension is linked to the expansion rate of space.",
    "To create a warp bubble expanding at the speed of light, the energy required is increased by a factor of 10^52 based on H ∝ √Λ.",
    "The energy needed to create the warp bubble locally equals the local cosmological constant (Λc) multiplied by the volume of the spacecraft as outlined in Equation 30.",
    "Nyman assumes a sufficiently advanced civilization could utilize the most efficient energy production methods."
]

Verdicts:
[
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The claim states that Nyman connects the cosmological constant (\u039b) to Casimir energy via Kaluza-Klein modes of gravitons in higher dimensions, which is not mentioned in the retrieval context. The context only mentions Equation 16 relating vacuum energy and massive Kaluza-Klein modes to the cosmological constant."
    },
    {
        "verdict": "yes",
        "reason": null
    },
    {
        "verdict": "no",
        "reason": "The claim suggests a direct relationship between \u039b and H, implying that the size of the extra dimension is linked to the expansion rate of space. However, the context only mentions Equation 29 involving the local value of the cosmological constant (\u039bc) and Hubble\u2019s constant (H), without establishing such a direct link."
    },
    {
        "verdict": "no",
        "reason": "The claim states that the energy required is increased by a factor of 10^52 based on H \u221d \u221a\u039b, which contradicts the context. The context mentions Equation 31 stating the total amount of energy required to create a warp bubble for a spacecraft would be approximately 10^42 Joules."
    },
    {
        "verdict": "no",
        "reason": "The claim states that the energy needed to create the warp bubble locally equals the local cosmological constant (\u039bc) multiplied by the volume of the spacecraft as outlined in Equation 30, which is not mentioned in the context. The context only mentions a total amount of energy required based on Equation 31."
    },
    {
        "verdict": "yes",
        "reason": null
    }
]

Score: 0.42857142857142855
Reason: The score is 0.43 because the actual output introduces several unsupported claims and direct relationships that are not present or implied in the retrieval context, such as connecting Nyman's work to Casimir energy via Kaluza-Klein modes, establishing a direct link between Λ and H, and specifying an exact factor of energy increase based on H ∝ √Λ.

======================================================================
======================================================================

Metrics Summary

  - ✅ Contextual Precision (score: 1.0, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 1.00 because nodes 1 and 2 are directly relevant, providing essential information linking the cosmological constant with warp bubble formation through equations and specific values. Node 3, ranked third, is less relevant as it does not address the specific relationship in question., error: None)
  - ❌ Faithfulness (score: 0.42857142857142855, threshold: 0.5, strict: False, evaluation model: jeffnyman/ts-evaluator (Ollama), reason: The score is 0.43 because the actual output introduces several unsupported claims and direct relationships that are not present or implied in the retrieval context, such as connecting Nyman's work to Casimir energy via Kaluza-Klein modes, establishing a direct link between Λ and H, and specifying an exact factor of energy increase based on H ∝ √Λ., error: None)

For test case:

  - input: What is the relationship between the cosmological
constant and warp bubble formation in Jeff Nyman's paper?

  - actual output: Okay, let's break down the relationship between the cosmological constant and warp bubble formation as presented in Jeff Nyman's paper. Here’s my step-by-step reasoning:

1.  **Core Idea:** The central argument is that a warp bubble *could* be generated using quantum field theory concepts and the cosmological constant. This isn't a fully developed theory, but a hypothesis.

2.  **Cosmological Constant & Casimir Energy:** Nyman connects the cosmological constant (Λ) to the Casimir energy via Kaluza-Klein modes of gravitons in higher dimensions (M-theory). Equation 16 is key to this connection – it states that the vacuum energy associated with massive KK modes is *equivalent* to the cosmological constant.

3.  **Hubble Constant & Radius:** He then simplifies the relationship by relating Λ to Hubble’s constant (H). Equation 17 and 18 establish this connection, implying that the size of the extra dimension (where these KK modes exist) is linked to the expansion rate of space.

4.  **Energy Calculation for the Bubble:** The paper moves to calculate the energy needed to *create* the warp bubble. Specifically, he posits that to expand at the speed of light, the energy required is increased by a factor of 10⁵², based on H ∝ √Λ.  Equation 29 clearly outlines this scaling.

5.  **Local Energy Injection:**  To make this concrete, he considers a spacecraft within the bubble.  The energy *injected* locally to create the bubble equals the local cosmological constant (Λc) multiplied by the volume of the craft. Equation 30 demonstrates this, linking Λc to the volume of the spacecraft.

6.  **Advanced Technology Assumption:**  Finally, Nyman assumes a sufficiently advanced civilization could utilize the most efficient energy production methods.

**Therefore, the relationship, as presented, is that the cosmological constant (Λ) is the fundamental energy driving the creation of a warp bubble. It’s linked to the expansion rate of space (via Hubble’s constant), the size of extra dimensions, and the amount of energy needed to locally distort spacetime to create the bubble.**

**Risks/Assumptions:** This is heavily reliant on several assumptions:
    *   The existence and manipulability of extra dimensions.
    *   The validity of using the cosmological constant to directly drive spacetime distortion.
    *   The ability of a sufficiently advanced civilization to manage such immense energy requirements.

Do you want me to elaborate on any specific part of this relationship or explore some of the underlying assumptions in more detail?

  - expected output: The cosmological constant is linked to the radius
of extra dimensions through Casimir energy. By manipulating the
extra dimension radius, the local cosmological constant can be
adjusted, creating expansion and contraction of space-time that
forms the warp bubble.

  - context: None

  - retrieval context: ['energy condition or issues regarding causality. The aim of this paper is to suggest that a warp bubble \ncould be gen erated using ideas and mathematics that are consistent with quantum field theory . This \nmay have the effect of allowing us to hypothesize how such a bubble could be created by a sufficiently \nadvanced technology. \nBy associating the cosmological constant with  the Casimir Energy due to the Kaluza Klein modes of \ngravitons in higher dimensions, especially in the context of M -theory derived or inspired models, it is \npossible to form a relationship between Λ and the radius of the compact extra dimension. As I have \nshown with equation (16), the following holds: \n \nEquation 17 \nAn easier way of developing this relationship is to put things in terms of Hubble’s constant H which \ndescribes the rate of expansion of space per unit distance of space. \n \nEquation 18 \nI can also put this in terms of the radius of the extra dimension, which gives: \n \nEquation 19', 'I can now look at the energy required to create the necessary warp bubble. The accepted value of the \ncosmological constant is Λ ? 10 -47(GeV)4. Converting again into SI units gives Λ ? 10 -10J/m3. Now, for a \nwarp bubble expanding at the speed of light I would need to increase this again by a factor of 10 52 as I \nhave H ∝ √Λ . I can thus say this: \n \nEquation 29 \nHere Λc is the local value of the cosmological constant when space is expanding at c. 
To make this a \nconcrete example, I will consider a spacecraft of these dimensions: \n \nEquation 30 \nIf I postulate that the warp bubble must, at least, encompass the volume of the craft,  the total amount \nof energy ‘injected’ locally would equal \n \nEquation 31 \nAssuming some arbitrarily advanced civilization was able to create such an effect ,  I will further \npostulate that this civilization would be able to utilize the most efficient method of  energy production,', 'Equation 12 \nUsing the fact that ?(?2m) = 0, for m any natural number, I obtain the vacuum energy: \n \nEquation 13 \n \nEquation 14 \n \nEquation 15 \nThis result can be generalized to 4+n dimensions \n \nEquation 16 \nHere the degrees of freedom of the graviton a re expressed in the brackets. This vacuum energy due to \nthe massive KK modes is associated with the cosmological constant. \n5. Warp Drives \nNumerous papers discussing the idea of warp drives have emerged in the literature in recent years. See \nfor example [19]. The basic idea of all of these is to formulate a solution to Einstein’s equations whereby \na warp bubble is driven by a local expansion of space-time behind the bubble and a contraction ahead of \nthe bubble. One common feature of these papers is that thei r physical foundation is the General Theory \nof Relativity. An element missing from all the papers is that there is little or no suggestion as to how \nsuch a warp bubble may be created.']

======================================================================

Overall Metric Pass Rates

Contextual Precision: 100.00% pass rate
Faithfulness: 0.00% pass rate

Test 3 Scores:
Contextual Precision: 1.0
Faithfulness: 0.42857142857142855

Okay, we’re three for three on Contextual Precision! However, Test 3 reveals something interesting about Faithfulness. Our scores show a split:

Contextual Precision: 1.0 (perfect for the third time!)
Faithfulness: 0.43 (below threshold, the lowest yet)

First things first: the retriever nailed it again. The metric found two highly relevant chunks ranked first and second:

Chunk 1: Directly explains the relationship between the cosmological constant (Λ) and the radius of extra dimensions via Casimir energy and Kaluza-Klein modes. This is exactly what the question asks about.
Chunk 2: Provides specific calculations about the cosmological constant (Λ ≈ 10⁻⁴⁷ (GeV)⁴), the H ∝ √Λ relationship, and how these relate to warp bubble energy requirements.
Chunk 3: Background material about vacuum energy and KK modes (marked “no” for direct relevance).

This is the third consecutive test where the most relevant chunks appeared first. The pattern is undeniable: when you ask conceptual questions that match the paper’s theoretical framework sections, semantic similarity search performs perfectly.
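
To see why a "yes, yes, no" ordering scores a perfect 1.0 while a relevant chunk buried in third place scores 0.33, here is a minimal sketch of a rank-weighted precision calculation. I'm assuming the metric uses the weighted cumulative precision formulation that is common in RAG evaluation tooling. The function name and the boolean verdict list are my own illustration, not something taken from the test script.

```python
def contextual_precision(verdicts):
    """Weighted cumulative precision over ranked retrieval verdicts.

    verdicts: booleans in rank order, True when the chunk at that rank
    was judged relevant to the expected output.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_seen += 1
            # Precision at this rank, counted only at relevant positions.
            weighted_sum += relevant_seen / rank
    if relevant_seen == 0:
        return 0.0
    return weighted_sum / relevant_seen


# Test 3: chunks 1 and 2 judged relevant, chunk 3 not.
test_3 = contextual_precision([True, True, False])

# A ranking where the only relevant chunk sits in third place.
buried = contextual_precision([False, False, True])

print(test_3, round(buried, 2))  # 1.0 0.33
```

The point of the weighting is that the same relevant chunk is worth less the further down the ranking it appears, which is why ordering matters as much as whether the chunk was retrieved at all.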

Yet, we can’t deny that Faithfulness dropped to 0.43, failing the threshold. Why? The model generated a comprehensive, well-structured answer with detailed reasoning, but it made several claims that went beyond what’s explicitly in the retrieved chunks. Let’s look at what the model claimed:

Specific equation numbers and what they establish
The 10⁵² scaling factor based on H ∝ √Λ
That Equation 30 links Λc to spacecraft volume
Specific relationships between equations

Now, let’s consider what the metric flagged:

Some of these details are in the chunks but not stated as explicitly as the model claims
The model is synthesizing across equations and making connections
The model is adding structure and explanation that, while consistent with the paper, goes beyond strict paraphrasing
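
The 0.42857142857142855 figure is worth a second look, because it is not an arbitrary number. Assuming the metric scores Faithfulness as the share of extracted claims that the retrieved context supports, a sketch like this recovers the simplest ratio consistent with the score. The claim counts are inferred from the score rather than shown in the output above, and the helper name is hypothetical.

```python
from fractions import Fraction


def faithfulness(supported_claims, total_claims):
    """Share of extracted claims that the retrieved context supports."""
    if total_claims == 0:
        return 0.0
    return supported_claims / total_claims


# Test 3 Faithfulness score, copied from the output above.
reported = 0.42857142857142855

# Find the simplest ratio (denominator of 20 or less) matching the score.
ratio = Fraction(reported).limit_denominator(20)

print(ratio)                           # 3/7
print(faithfulness(3, 7) == reported)  # True
```

The simplest match is 3 supported claims out of 7, although any multiple of that ratio, such as 6 out of 14, would produce the same score.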

Yet, here’s what’s crucial: look at what was retrieved. Chunk 2 is from page 11: the energy calculations section! This is the first time across all our tests (in Part 2 and here in Part 3) that we successfully retrieved content from page 11. The question about the cosmological constant’s relationship to warp bubble formation pulled both conceptual framework content (Chunk 1, page 8) and calculation content (Chunk 2, page 11).

Why did this happen here? The question “What is the relationship between the cosmological constant and warp bubble formation?” spans both conceptual and calculational content. The cosmological constant is discussed conceptually (its relationship to extra dimensions and Casimir energy) on pages 6 through 8 and mathematically (its specific value and role in energy calculations) on page 11. The retriever found both!

You’ll note what I did here for this third test case: this was a hybrid query (part conceptual, part specific) and the semantic search successfully retrieved content from both domains. This is why your prompts become a type of test condition that you have to think carefully about.

Neat! But we still have that lower Faithfulness score, right? We do, but the lower Faithfulness score isn’t necessarily bad. The model is doing what we want: synthesizing information from multiple chunks to provide a comprehensive answer. It’s not hallucinating. All the concepts it discusses are in the retrieved content. It’s just adding explanatory structure and making explicit connections that are implicit in the equations.

This is the tension between “faithful paraphrasing” and “useful synthesis.” For production RAG systems, you might actually prefer this kind of intelligent synthesis over strict paraphrasing, as long as it stays grounded in the source material, which it does, in this case.

Our test finding here is that we have perfect Contextual Precision across all three tests (3/3 = 1.0). Every conceptual query retrieved the most relevant chunks first. This is a complete reversal from Part 2, where every configuration scored 0.0 or 0.33.

The Faithfulness variance (0.5 -> 0.67 -> 0.43) reflects how much synthesis versus paraphrasing the model does, which depends on the complexity of the answer. Simpler questions (Test 1) get more synthetic answers. Questions with explicit calculations (Test 3) get answers where the model tries to connect specific equations, sometimes going slightly beyond what’s explicitly stated.

The core lesson stands: The RAG system works excellently for conceptual queries. The “problem” in Part 2 wasn’t the system. Instead, it was asking the wrong type of question for what the system does well.

Note that we still have an issue though, right? After all, our users don’t necessarily know “what the system does well” unless we tell them. So our test results are caveated with that reality.

Results Summary and Analysis

Let’s compile our results and compare them to Part 2’s baseline:

Test	Contextual Precision	Faithfulness
Part 2 Baseline (Energy Source)	0.33	0.57
Test 1: Extra Dimensions	1.0	0.5
Test 2: Kaluza-Klein Modes	1.0	0.67
Test 3: Cosmological Constant	1.0	0.43
Average (Conceptual Questions)	1.0	0.53

I’m going to repeat what I’ve said numerous times: the pattern is unmistakable. Contextual Precision achieved a perfect score (1.0) for all three conceptual questions, compared to 0.33 for the specific factual query in Part 2. This is a complete reversal in retrieval performance, achieved simply by asking different types of questions.

What the Results Reveal

The RAG system isn’t broken; it’s specialized. Our baseline configuration performs excellently for conceptual queries about theoretical frameworks but struggles with specific factual queries requiring numerical data from calculation sections. This isn’t a flaw; it’s a characteristic of semantic similarity search.

Retrieval precision is query-type dependent. All three conceptual questions retrieved the most relevant chunks first (perfect CP scores), while the specific factual question consistently buried relevant information at position #3 or failed to retrieve it at all. The difference isn’t the system. It’s the match between query type and content type.

Perfect retrieval doesn’t guarantee perfect faithfulness. Despite achieving 1.0 Contextual Precision across all tests, Faithfulness scores varied from 0.43 to 0.67. This reflects how much synthesis versus paraphrasing the model performs. Test 3’s lower score came from the model making explicit connections between equations that were implicit in the retrieved chunks. Useful synthesis, perhaps, but flagged by the strict Faithfulness metric.

The paper’s structure matters. My warp drive paper has a specific characteristic that amplifies this query-type mismatch: it’s heavily mathematical with distinct conceptual and calculational sections. Pages 2 through 8 are dense with theoretical framework terminology (extra dimensions, string theory, Kaluza-Klein modes, quantum field theory), while page 11 is dense with equations and specific numbers (10²⁸ kg, antimatter, Jupiter’s mass). These sections have very different semantic profiles.

Conceptual questions naturally match the terminology-rich theoretical sections. Questions like “How do extra dimensions create a warp bubble?” trigger retrieval of content using those exact terms. But specific factual questions like “What energy source?” require matching to a calculation section where the language is more numerical than conceptual. The term “matter/antimatter annihilation” appears exactly once in the entire paper, embedded in a paragraph focused on mass-energy calculations rather than energy source concepts.

This isn’t unique to my paper. In fact, it’s common in technical documents. Research papers typically have:

Conceptual sections: Introductions, theoretical frameworks, literature reviews (terminology-dense)
Methodological sections: Experimental procedures, mathematical derivations (equation-dense)
Results sections: Data, calculations, specific findings (number-dense)
Discussion sections: Implications, connections, interpretation (concept-dense again)

Semantic similarity search naturally gravitates toward terminology-dense sections because those sections have richer semantic content for embedding models to match against. Sections dominated by equations, numbers, and calculations have sparser semantic signals, making them harder to retrieve even when they contain the specific answer.

The Diagnostic Insight

Part 2’s experiments tested whether parameter tuning could fix our retrieval problem. All four experiments failed, proving that the issue wasn’t chunking strategy. Part 3 tested whether the problem was the system itself or the query type. Three successful tests prove it’s the latter.

Test Finding: RAG system performance must be evaluated across different query types, not just different configurations. A system that scores 0.33 on one query type and 1.0 on another isn’t inconsistent. It’s revealing its operational characteristics. As I stated before, understanding when your system works well and when it struggles is more valuable than trying to make one configuration work for everything.

This has immediate practical implications. If your team is building a RAG system for technical documentation, you need to:

Characterize your query distribution: Will users primarily ask conceptual questions (“How does X work?”) or specific factual questions (“What is the value of Y?”)?
Test across query types: Don’t just test one question repeatedly. Test representative examples of each query category you expect.
Consider hybrid approaches: For documents with mixed content types, you might need different retrieval strategies for different query types, or a hybrid approach combining semantic and keyword search.
Set appropriate expectations: If your system excels at conceptual queries but struggles with specific facts, document this behavior rather than treating it as a universal failure.
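
One lightweight way to act on the first two points in that list is to tag each test case with a query type and report scores per category rather than as one blended number. The sketch below uses the scores from the results table earlier in this post. The category labels and the function name are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# (query type, test name, contextual precision, faithfulness)
# Scores come from the results table; the labels are an assumed
# categorization, not something the evaluation tool produced.
results = [
    ("factual", "Part 2 Baseline (Energy Source)", 0.33, 0.57),
    ("conceptual", "Test 1: Extra Dimensions", 1.0, 0.5),
    ("conceptual", "Test 2: Kaluza-Klein Modes", 1.0, 0.67),
    ("conceptual", "Test 3: Cosmological Constant", 1.0, 0.43),
]


def summarize_by_query_type(rows):
    """Average each metric within a query type."""
    grouped = defaultdict(list)
    for query_type, _name, precision, faith in rows:
        grouped[query_type].append((precision, faith))

    summary = {}
    for query_type, scores in grouped.items():
        summary[query_type] = {
            "tests": len(scores),
            "contextual_precision": round(mean(p for p, _ in scores), 2),
            "faithfulness": round(mean(f for _, f in scores), 2),
        }
    return summary


for query_type, stats in summarize_by_query_type(results).items():
    print(query_type, stats)
```

Reporting by category keeps a strong result on one query type from masking a weak result on another, which is exactly what a single overall average would do here.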

The Paper Structure Factor

It’s worth acknowledging that my warp drive paper’s heavy mathematical content may make it a particularly challenging case for pure semantic similarity search. A paper with more evenly distributed prose throughout might not show such a stark difference between query types. However, this characteristic makes it an excellent teaching example precisely because it exaggerates a pattern that exists in most technical documents: different sections have different semantic densities and require different retrieval approaches.

If our team were optimizing a production RAG system specifically for physics (or any mathematical) papers, we might:

Add metadata tags distinguishing theoretical vs. calculation sections
Use hybrid search to catch specific terms like “antimatter” and “10²⁸ kg” via keyword matching
Implement query classification to route conceptual vs. factual questions differently
Parse equations separately and make them searchable by their components
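
As a rough illustration of the hybrid search idea, here is a sketch that fuses a semantic ranking with a simple keyword ranking using reciprocal rank fusion. Everything here is hypothetical: the chunk texts are paraphrased stand-ins rather than quotes from the paper, the semantic ranking is hard-coded instead of coming from an embedding model, and the keyword scorer is a bare term-overlap count rather than a real lexical search engine.

```python
import re


def keyword_rank(query, chunks):
    """Rank chunk indices by how many query terms appear verbatim."""
    terms = set(re.findall(r"\w+", query.lower()))

    def overlap(index):
        words = set(re.findall(r"\w+", chunks[index].lower()))
        return len(terms & words)

    return sorted(range(len(chunks)), key=overlap, reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings; k=60 is a commonly used default."""
    scores = {}
    for ranking in rankings:
        for position, index in enumerate(ranking, start=1):
            scores[index] = scores.get(index, 0.0) + 1.0 / (k + position)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical chunks, paraphrased for illustration only.
chunks = [
    "Extra dimensions and Kaluza-Klein modes in the theoretical framework.",
    "The warp bubble as a local expansion and contraction of space-time.",
    "Energy budget: antimatter annihilation is the most efficient source.",
]

# Assumed semantic retriever output that ranks the calculation chunk last.
semantic_ranking = [0, 1, 2]

keyword_ranking = keyword_rank("antimatter energy source", chunks)
fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])

print(keyword_ranking, fused)
```

With these toy inputs the keyword ranking puts the calculation chunk first, and the fused ranking lifts it from third place to second. That is a modest gain, but it shows how an exact-term signal can pull number-dense content upward without discarding what the semantic search found.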

But, for our purposes, understanding why the system behaves differently across query types is more valuable than immediately fixing it. That understanding guides us toward the right solutions rather than trying random parameter adjustments.

What Does the Testing Tell Us?

Our testing across Parts 2 and 3 demonstrates a complete diagnostic cycle:

Part 2: Diagnosis through failure. Four experiments with different chunking strategies all failed to improve Contextual Precision, proving the problem wasn’t parameter tuning. The consistent failure pointed toward a fundamental mismatch between retrieval strategy and query type.

Part 3: Validation through success. Three conceptual queries all achieved perfect Contextual Precision, proving the system works excellently for certain query types. The success confirmed that the system itself isn’t broken. It’s optimized for conceptual matching rather than specific fact retrieval.

Together, these results teach us that effective RAG evaluation requires testing across multiple dimensions: different configurations (Part 2) and different query types (Part 3). Testing only one dimension gives incomplete information. Part 2 alone would suggest the system is fundamentally flawed. Part 3 alone would suggest it works perfectly. Only by testing both do we understand the system’s true operational profile: excellent for conceptual queries, poor for specific factual queries.

This is diagnostic testing at its best: not just measuring scores, but building understanding of system behavior across different conditions. That understanding guides improvement efforts toward interventions that actually address the root cause rather than symptoms.

Next Steps!

These last posts have been heavy, right? Lots of output, lots of explanation. I will remind you, however, that what we did here is exactly what you would do when testing in an AI context. There is one thing we didn’t consider, however, regarding these test experiments we went through. (Hint: we have a data condition that we haven’t varied. What might it be?) Let’s use the next post to consider that.

Stories from a Software Tester

Twice upon a time, in another space, no distance in any direction from here …
