DSPy and RAG: Grounding Answers in Documents

In the previous post we looked at the idea of building up a pipeline with DSPy. In this post, we’ll put that idea into practice with something we looked at in my AI and Testing series: RAG.

To recap the previous post: we built a two-step DSPy pipeline with one module, two predictors, and two compiled prompts generated from a single forward() call. We also saw field descriptions appear for the first time, in the form of a desc on an OutputField that got compiled directly into the prompt schema. The declaration was starting to carry intent, not just structure.

This post takes that further. We’re going to build a retrieval-augmented generation (RAG) pipeline using DSPy, Chroma, and a real academic paper as the knowledge source. Unlike in my AI and Testing series, we will not be using one of my own papers for this. Here we’ll use Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task by Kosmyna et al. (MIT Media Lab, 2025).

It’s arguably a fitting choice: a blog about AI evaluation, using an AI pipeline, to answer questions about a paper studying what AI does to cognition.

The pipeline splits across two scripts. The first, ingest.py, extracts text from the PDF, chunks it, embeds the chunks, and stores them in a local Chroma vector database. The second, rag.py, is the DSPy pipeline that takes a question, retrieves relevant chunks, and generates a grounded answer. You run ingestion once; you run the RAG script as many times as you like.

Both scripts are available to download: ingest.py and rag.py. The PDF of the paper is also available (brain-on-chatgpt.pdf) and it should be placed in the same directory as both scripts.

The original paper I am using is here: Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. The only change I have made is to condense the PDF, from about 37 MB to around 5 MB.

Dependencies

If you’ve been following this series, you already have DSPy and Ollama installed. For this post you’ll also need the following (and remember to be in your virtual environment):


python -m pip install pypdf chromadb ollama

And you’ll need the nomic-embed-text embedding model pulled in Ollama:


ollama pull nomic-embed-text

This model handles both ingestion (embedding the chunks) and retrieval (embedding your query). Using the same model for both is important. If the chunks and the query are embedded by different models, their vectors land in different spaces that aren’t comparable, and retrieval will produce poor results.

Step One: Ingestion

Run ingestion first:

  python ingest.py

Here’s what that produced on the paper:


Reading: brain-on-chatgpt.pdf
Extracted 437,329 characters of text.
Created 324 chunks (size=1500, overlap=150).
Embedding 324 chunks via nomic-embed-text. This may take a minute...
Ingested 324 chunks into collection 'brain_on_chatgpt'.
Chroma store saved to: ~/blog-ai-testing/chroma_store

A few things are worth noting in those numbers. The paper is 216 pages, which yielded 437,329 characters of extracted text. That gets split into 324 chunks of up to 1,500 characters each, with 150 characters of overlap between adjacent chunks. The overlap keeps ideas that fall on a chunk boundary from being cleanly severed: both neighboring chunks contain the overlapping passage, so retrieval has a chance to find it from either side.
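
The chunk count is easy to sanity-check. Here is a minimal sketch, assuming ingest.py uses a plain sliding window; the function name chunk_text is hypothetical and not taken from the script. Each window starts 1,350 characters (1,500 minus 150) after the previous one, so 437,329 characters need 324 windows.

```python
def chunk_text(text: str, size: int = 1500, overlap: int = 150) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # each window starts this far after the previous one
    return [text[start:start + size] for start in range(0, len(text), stride)]


# Stand-in text with the same length as the extracted paper.
chunks = chunk_text("x" * 437_329)
print(len(chunks))  # 324
```

The last chunk comes out shorter than 1,500 characters, because the text runs out before the window does.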

Each of those 324 chunks gets embedded individually by nomic-embed-text via Ollama, producing a vector representation that captures the chunk’s semantic content. Those vectors, along with the original chunk text, are stored in a Chroma collection persisted to disk. After ingestion finishes, the chroma_store directory in your working folder contains everything the RAG pipeline needs for retrieval. You don’t need to re-run ingestion unless you change the PDF or the chunking parameters.

The chunking strategy here is deliberately simple: fixed-size windows with overlap. More sophisticated approaches exist, such as splitting on sentence boundaries, on section headers, or grouping by semantic similarity. That said, fixed-size chunking is a reliable baseline and keeps the focus on the DSPy pipeline rather than chunking theory. The script’s comments flag this explicitly if you want to extend it.
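
If you do want to extend it, one of the alternatives mentioned above, splitting on sentence boundaries, might look something like this. It’s a rough sketch rather than anything from ingest.py: the function name chunk_by_sentences is made up, the sentence splitter is a naive regular expression, and there is no overlap between chunks.

```python
import re


def chunk_by_sentences(text: str, max_chars: int = 1500) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    # Naive splitter: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_by_sentences("One two. Three four. Five six.", max_chars=20))
```

The trade-off is that chunk sizes become uneven, and abbreviations or decimal numbers will confuse a splitter this simple.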

Step Two: The RAG Pipeline

With ingestion done, run the RAG script:

  python rag.py

Or pass your own question:

  python rag.py "What were the study's main limitations?"

The default question, used for the walkthrough below, is: What did the EEG data reveal about the LLM group compared to the Brain-only group?

How the Pipeline Works

The forward() method in RAGPipeline runs in two phases. First, retrieval: the question is embedded using nomic-embed-text and compared against the stored chunk vectors in Chroma. The five closest chunks by vector similarity are returned. Second, generation: those five chunks are joined into a numbered context string and passed, alongside the original question, into a ChainOfThought predictor compiled against RAGSignature.
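
To make the retrieval phase less abstract, here is a toy version in plain Python. It is a sketch under stated assumptions, not the code in rag.py: an in-memory list stands in for Chroma, the vectors are hand-written rather than produced by nomic-embed-text, and the function names and the "[1]" numbering format are my own guesses at what a numbered context string could look like.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    """Return the text of the k stored chunks most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_context(chunks: list[str]) -> str:
    """Join chunks into one numbered context string."""
    return "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
store = [
    ("chunk about topic A", [1.0, 0.0]),
    ("chunk about topic B", [0.0, 1.0]),
    ("chunk mixing A and B", [0.7, 0.7]),
]
top = retrieve([0.9, 0.1], store, k=2)
print(build_context(top))
```

The query vector points mostly toward topic A, so the pure-A chunk ranks first and the mixed chunk second. Nothing in that ranking knows what the text says; it only knows which vectors point in a similar direction.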

Keeping retrieval as a plain function rather than a DSPy module is a deliberate choice. It makes the two phases visually distinct in the code and keeps the Chroma mechanics readable without hiding them inside DSPy abstractions. The boundary between “find relevant text” and “reason over relevant text” is worth seeing clearly on a first encounter with RAG.

Understanding the Output

The Prediction Object


=== Prediction ===
Prediction(
    reasoning='The context describes EEG data collected from both the LLM group
    and the Brain-only group. Specifically, it details differences in delta and
    high-delta band activity, as well as alpha band activity, between the two
    groups. It mentions Figures 66 and 67, which show Dynamic Direct Transfer
    Functions (dDTFs) representing the transfer of activity between electrodes.
    The question asks about the EEG data, so we need to extract relevant
    information about the differences observed in these bands between the groups.',
    answer='The EEG data revealed that the LLM group showed higher high-alpha
    activity compared to the Brain-only group. Figures 66 and 67, displaying
    dDTFs for low alpha, alpha, and high alpha bands between the LLM and Search
    Engine groups (which included the Brain-only group), showed that high-alpha
    activity was marginally higher in the LLM group.'
)

The reasoning field shows the model orienting itself to what the retrieved context actually contains before committing to an answer: it notes the delta and alpha band differences, the dDTF figures, and the two groups being compared. The answer then pulls from chunk 5 of the retrieved context, which described alpha band activity differences between the LLM and Search Engine groups.

There’s something worth being precise about here. The question asked about LLM versus Brain-only. The answer addresses LLM versus Search Engine. That’s not a model failure. That’s a retrieval signal. The chunks that came back didn’t contain a direct LLM-to-Brain-only EEG comparison at that level of specificity; the most relevant passage the retriever found was the alpha band comparison against the Search Engine group. The model worked faithfully with what it was given, which is exactly what the answer field’s description asked for: a precise answer grounded in the provided context, introducing nothing not present in the context.

This illustrates a foundational RAG principle: generation quality is bounded by retrieval quality. The best model in the world can’t compensate for a context window that doesn’t contain the answer. If you want a cleaner answer to the LLM-versus-Brain-only EEG question, the path forward is better retrieval (more chunks, better chunking boundaries, a more specific query), not a better generator.

The Retrieved Context

The generated prompt’s user message shows exactly what the five retrieved chunks contained. This is worth reading carefully, because it makes retrieval concrete rather than abstract.

Chunk 1 covers delta and high-delta band activity comparing Brain-only and Search Engine groups. Chunk 2 is about NLP analysis of named entities across groups, with a tangential mention of social media influence in Brain-only essays. Chunk 3 describes the session 4 crossover study design. Chunk 4 covers Named Entity Recognition frequencies for the Search Engine and Brain-only groups. Chunk 5, the one that actually answered the question, describes alpha band dDTF results showing marginally higher high-alpha activity in the LLM group.

Four of the five retrieved chunks were only loosely relevant to the question. That’s not unusual for top-k retrieval against a 216-page paper with a broad question. Vector similarity finds passages that share vocabulary and semantic neighborhood with the query, but it has no awareness of the paper’s argument structure or which section is authoritative for a given topic. Chunk 5 was the right chunk; chunks 1 through 4 were plausible neighbors that happened to share enough semantic overlap with “EEG,” “LLM group,” and “Brain-only group” to rank in the top five.

If you went through my AI and Testing series, you saw much more in-depth examples of paper structure and how that impacted outcomes.

The model correctly identified chunk 5 as the load-bearing passage and grounded its answer there. But it’s worth being clear-eyed about what retrieval gave it to work with.
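
That kind of inspection can be made a little more mechanical. The sketch below is my own illustration, not part of rag.py: for each retrieved chunk it reports which of the terms you care about are mentioned. The chunk texts are placeholders, not quotations from the paper, and term_coverage is a hypothetical helper name.

```python
def term_coverage(chunks: list[str], terms: list[str]) -> list[dict]:
    """Report, per retrieved chunk, which required terms it mentions."""
    report = []
    for number, chunk in enumerate(chunks, start=1):
        lowered = chunk.lower()
        # Case-insensitive substring match, so "LLM" also matches "LLMs".
        found = [t for t in terms if t.lower() in lowered]
        report.append(
            {"chunk": number, "found": found, "covers_all": len(found) == len(terms)}
        )
    return report


# Placeholder chunk texts, not quotations from the paper.
retrieved = [
    "placeholder text mentioning the Brain-only and Search Engine groups",
    "placeholder text mentioning the LLM and Search Engine groups",
]
for row in term_coverage(retrieved, ["LLM", "Brain-only"]):
    print(row)
```

If no chunk covers every term in the question, that is a hint the retriever never surfaced a passage comparing the things you asked about, which is the situation described above. Substring matching is crude, but it is cheap enough to run on every query.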

The Compiled Prompt


System message:

Your input fields are:
1. `context` (str): Relevant passages retrieved from the paper 'Your Brain on
   ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay
   Writing Task'.
2. `question` (str): A question about the paper's findings, methodology, or
   conclusions.
Your output fields are:
1. `reasoning` (str):
2. `answer` (str): A precise answer grounded in the provided context. Do not
   introduce information not present in the context.
All interactions will be structured in the following way, with the appropriate
values filled in.

...

In adhering to this structure, your objective is:
        Answer questions about the paper using only the provided context.

This system message is the fullest demonstration yet of how much work a Signature declaration does. Compare it against the system messages from the previous scripts.

The context field carries a description that names the paper explicitly. The question field carries a description that scopes what kind of question is expected. The answer field carries a description that states a faithfulness constraint: the model is told not to introduce information absent from the context. And the class docstring, Answer questions about the paper using only the provided context, appears verbatim as the objective statement.

That’s four distinct layers of intent compiled from the Signature: field names, field types, field descriptions, and the class docstring. None of it was written as a prompt string. All of it showed up in what the model actually saw.
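
To see how a declaration could turn into that text, here is a toy renderer. To be clear about what this is: it is not DSPy’s code and not how DSPy is implemented internally. The Field dataclass and both render functions are invented for illustration, and the elided middle of the system message is left out. The field names, types, descriptions, and objective are copied from the compiled prompt shown above.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical stand-in for one declared field: name, type, description."""
    name: str
    type_name: str
    desc: str = ""


def render_fields(title: str, fields: list[Field]) -> list[str]:
    """Render a numbered list of fields under a title line."""
    lines = [title]
    for i, f in enumerate(fields, start=1):
        # rstrip() drops the trailing space when a field has no description.
        lines.append(f"{i}. `{f.name}` ({f.type_name}): {f.desc}".rstrip())
    return lines


def render_system_message(inputs: list[Field], outputs: list[Field], objective: str) -> str:
    """Assemble the field lists and the objective into one message."""
    lines = render_fields("Your input fields are:", inputs)
    lines += render_fields("Your output fields are:", outputs)
    lines += ["In adhering to this structure, your objective is:", f"        {objective}"]
    return "\n".join(lines)


message = render_system_message(
    inputs=[
        Field("context", "str", "Relevant passages retrieved from the paper 'Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task'."),
        Field("question", "str", "A question about the paper's findings, methodology, or conclusions."),
    ],
    outputs=[
        Field("reasoning", "str"),
        Field("answer", "str", "A precise answer grounded in the provided context. Do not introduce information not present in the context."),
    ],
    objective="Answer questions about the paper using only the provided context.",
)
print(message)
```

Each of the four layers shows up as a separate argument or attribute here, which is the point: change a description in the declaration and the text the model sees changes with it.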

In the previous posts, field descriptions were present but optional, a sort of courtesy that nudged the model toward a concise summary. Here they’re genuinely load-bearing. Without the answer field’s faithfulness constraint, a capable model will freely blend retrieved context with its own parametric knowledge, and you lose the grounding that makes RAG useful. The description is what keeps the answer tethered to the paper.

Questions Worth Trying

The default question probes a specific technical result. Try a few others to see how retrieval behaves across different parts of the paper:


python rag.py "What was the study's methodology for the EEG data collection?"
python rag.py "What were the main limitations acknowledged by the authors?"
python rag.py "How did the LLM group's essays differ linguistically from the Brain-only group?"
python rag.py "What does the paper conclude about cognitive debt?"

Pay attention to the retrieved chunks in the generated prompt for each question. You’ll see retrieval pulling from very different parts of the paper depending on the query. Questions about methodology will surface different chunks than questions about conclusions. That variation is what makes inspect_history worth keeping in the script: it lets you see not just the answer but what the model was working from when it produced it.

Where This Goes Next

The pipeline we built here is functional RAG, but it’s unoptimized. The chunking is fixed-size, the retrieval is vanilla top-k cosine similarity, and the Signature descriptions are hand-written intuitions about what the model needs to know. Those are reasonable starting points, but DSPy’s real promise is that you don’t have to stay there.

DSPy’s optimization layer, dspy.MIPROv2 and related optimizers, can take a set of example question-answer pairs, a quality metric, and the pipeline we built here, and search for better prompt strategies automatically. With MIPROv2, that means proposing new instruction text and selecting few-shot examples to inject into the prompt. The Signature you wrote doesn’t change; DSPy adjusts what it compiles from it.
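
The “quality metric” part is less exotic than it sounds. A DSPy metric is an ordinary Python callable that receives an example, a prediction, and an optional trace, and returns a score. Here is a sketch of one possible metric, token-overlap F1 between a reference answer and the predicted answer. The choice of F1 is mine, purely for illustration, and the SimpleNamespace objects are stand-ins for DSPy’s example and prediction objects with made-up strings.

```python
import re
from collections import Counter
from types import SimpleNamespace


def _tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(example, pred, trace=None) -> float:
    """Token-overlap F1 between the reference answer and the predicted answer."""
    gold = Counter(_tokens(example.answer))
    guess = Counter(_tokens(pred.answer))
    overlap = sum((gold & guess).values())  # tokens shared by both answers
    if overlap == 0:
        return 0.0
    precision = overlap / sum(guess.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


# Stand-ins for a DSPy example and prediction; both strings are made up.
example = SimpleNamespace(answer="alpha beta gamma")
pred = SimpleNamespace(answer="alpha beta delta")
print(round(token_f1(example, pred), 3))  # 0.667
```

Token overlap rewards shared wording, not faithfulness to the context, so for a RAG pipeline you would likely want something stricter. It does show the shape of what an optimizer needs from you.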

That would probably be the next natural step in this series: taking the RAG pipeline from hand-tuned to optimized, and seeing what DSPy’s compilation model looks like when it’s working on your behalf rather than just executing your declarations.

That said, it would take this DSPy series well beyond its purpose: simple, pedagogical examples with some test thinking applied.

Did This Matter for Testers?

Stepping back from the mechanics: what DSPy represents is a shift in how LLM-powered software gets built. As these systems move from experiments into production, the fragility of hand-written prompt strings becomes a real engineering liability: they’re hard to version, hard to test, and they break silently when models change. DSPy’s answer is to treat prompting as compilation: you declare what your program needs, and the framework handles how to ask for it. That separation of declaration from execution is the same principle that made higher-level programming languages worth adopting, and it’s likely to matter for the same reasons.

For developers, this means pipelines that are composable, inspectable, and eventually optimizable without rewriting prompt strings by hand. For testers, it means there’s a new layer of the stack worth understanding: one where the “code” being tested includes Signatures, compiled prompts, and retrieval pipelines, and where inspect_history is as important a diagnostic tool as any assertion in a test suite.
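
As a small illustration of what such an assertion might look like, here is a crude grounding check: it flags any number cited in an answer that never appears in the context the model was given. This is my own sketch, not something from rag.py. The helper name and the sample strings are invented, and it only catches numeric drift, such as a figure number that was never retrieved.

```python
import re


def ungrounded_numbers(answer: str, context: str) -> set[str]:
    """Return numbers cited in the answer that never appear in the context."""
    number = re.compile(r"\d+(?:\.\d+)?")
    return set(number.findall(answer)) - set(number.findall(context))


# Invented strings for illustration only.
context = "Placeholder context that refers to Figure 66 and Figure 67."
grounded = "Figures 66 and 67 show the comparison."
drifted = "Figures 66 and 70 show the comparison."

assert ungrounded_numbers(grounded, context) == set()
print(ungrounded_numbers(drifted, context))  # {'70'}
```

A check like this only works if you can get at the context the model actually received, which is exactly what inspecting the compiled prompt gives you.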

The question of whether an LLM-powered system is behaving correctly increasingly depends on being able to see what it was actually asked, not just what it answered.

Stories from a Software Tester