DSPy: Declaring Instead of Prompting

In my AI and Testing series, which ran for a couple of months, I focused heavily on the testing side of things. I now want to consider AI in some specific contexts that you are likely to come across and show how those contexts work. The first that I’ll focus on is a tool called DSPy.

Imagine you have a more-or-less precise robotic manipulator calibrating the intersection of two distinct, glowing streams. One stream (the “blue logic”) is angular and structured with geometric circuit patterns and digital glyphs, which represents programmatic code. The other stream (the “gold intuition”) is fluid, soft, and text-based, which represents natural language. Where they weave, they generate sparking white “compilation” points.

That metaphorical image is essentially what DSPy provides, but let’s talk about how it all works.

If you’ve spent time with LLMs, you know the drill: you write a prompt, it works, then you change the model or the task slightly, and you’re back to rewriting the prompt by hand. Prompts are strings. Strings are fragile. And when your pipeline grows beyond one step, you end up with a tangle of formatted text and manual parsing that breaks in ways that are hard to debug and painful to fix.

DSPy is a response to that problem. Built by the Stanford NLP group, it treats LLM pipelines as programs rather than prompt templates, and that shift in framing has real practical consequences for how you build, inspect, and eventually optimize LLM-powered logic.

This post introduces DSPy through two small scripts. The first establishes the baseline machinery. The second changes exactly one word, and in doing so, makes DSPy’s core claim demonstrably visible. Both scripts are available for you to download and run.

What Problem Does DSPy Solve?

Here’s an analogy that might help frame it. Early programmers wrote machine code by hand. They controlled every instruction. That gave them power but also enormous maintenance burden. Higher-level languages introduced compilers: you write source code that describes what you want, and the compiler figures out how to express that to the machine.

DSPy does something similar for LLM prompting. You describe your task as a typed signature (essentially: what goes in, what comes out) and DSPy compiles that into a structured prompt. You don’t write the prompt string. You don’t maintain it when the model changes. You declare the interface, and DSPy handles the expression.

There’s another important consequence of this approach: because prompts are compiled artifacts rather than hand-authored strings, DSPy can optimize them. You can provide examples, define a metric, and let DSPy search for better prompt strategies automatically. That optimization story is beyond scope for this post, but it’s worth naming upfront, because it’s the reason the framework is designed the way it is.

The Core Abstractions

DSPy has three layers you’ll encounter in any program, no matter how simple or complex.

The Language Model (LM). DSPy wraps your LLM backend using LiteLLM’s provider format. Once configured globally, every DSPy module in your process uses it without needing it passed around explicitly. This is analogous to configuring a database connection at application startup: you set it once, and everything downstream inherits it.

LiteLLM is an open-source AI gateway and Python SDK that provides a unified interface to call over many different Large Language Models, including OpenAI, Anthropic, Huggingface, and Cohere.

The Signature. A Signature is a typed declaration of what a task takes as input and produces as output. Think of it like a function signature in a statically typed language: it describes the interface, not the implementation. DSPy reads the field names, types, and ordering to automatically construct a prompt. You don’t write prompts; you write contracts.

The Module. A DSPy Module is the executable unit, modeled after PyTorch’s nn.Module. It wraps one or more predictors, the objects that actually compile a Signature into a prompt, call the LM, and parse the response back into typed fields. Simple programs have a single module with a single predictor. Complex pipelines chain many modules together, each with its own Signature.

PyTorch is a leading open-source machine learning framework used primarily for developing and training deep learning models. In PyTorch, nn.Module is the base class for all neural network modules and it acts as a foundational building block that allows you to define complex, stateful computations, such as layers or entire models.

With those three layers in mind, let’s look at the simplest possible DSPy program.

Script 1: The Baseline

The first script, dspy1.py, demonstrates all three layers in minimal form: one Signature, one predictor, one question, one answer.

Let’s consider a few prerequisites before you run it.

Similar to my AI and Testing series, I’m going to use Ollama for this so that you can use free models locally. Accordingly, you will need to get Ollama on your system. My post on Ollama and Models contains the basics.

I’ve provided some models for use in my posts on AI, and one of those is known as TesterStories Reasoner. If you want to get that model, just execute the following command:

  ollama pull jeffnyman/ts-reasoner

You can actually use any model you want to for this and I’ll indicate where you can change it in the script, although it’s pretty obvious.

I’ll be focusing on Python scripts, so having a relatively recent Python 3.x version installed on your system would be good. The best thing to do is set up a project directory where you can work. When you set up Python projects, it helps to set up a virtual environment. Without any secondary tooling, the easiest way to do this is:

  python -m venv .venv

You can name your virtual environment whatever you want, but .venv is considered the standard and IDEs will tend to look for that first. To activate this virtual environment, you can do this on a POSIX system:

   source .venv/bin/activate

On Windows, do this:

  .\.venv\Scripts\activate

To get out of the virtual environment, you can do this:

  deactivate

Once you have all that setup, the main thing to do is install the DSPy dependency into your virtual environment.

  python -m pip install dspy

The script we’re using for this post accepts an optional command-line argument. If you pass a question, it uses that; if you don’t, it falls back to a default. So you can just run the script like this:

  python dspy1.py

You can also run the script like this:

  python dspy1.py "What year did the Berlin Wall fall?"

The script, which I should note is heavily commented, does three things: configures the LM backend, defines a Signature and a Module, then runs the module and prints both the structured result and the generated prompt via inspect_history.

That LM backend in the script is where you can change the model from ts-reasoner to any other model you have installed.

What the Output Tells You

When you run the script, you’ll see three sections.

The Prediction object.

=== Prediction ===
Prediction(
    answer='42'
)

You didn’t get back a raw string. You got a structured Prediction object with typed fields; specifically, the answer field you declared in QASignature. You can access it programmatically as result.answer. The Signature you wrote shaped what came back. If you had called the LLM directly via the API, you would have received a blob of text and had to parse it yourself.

The generated prompt.

=== Generated Prompt ===

System message:

Your input fields are:
1. question (str):
Your output fields are:
1. answer (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: Given the fields `question`, produce the fields `answer`.

User message:

[[ ## question ## ]]
What is the answer to life, the universe and everything?

Respond with the corresponding output fields, starting with the field `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

This is the prompt DSPy compiled from your Signature. You didn’t write it. The system message establishes the schema: naming the fields, showing the section marker format, stating the objective. Your field names (question and answer) became the section headers. The user message slots into your actual input and instructs the model to respond using the same marker structure.

Those markers, [[ ## fieldname ## ]], are not formatting noise. They’re parse targets. DSPy uses them to extract the model’s response back into the typed fields on the Prediction object. The full round-trip is: declaration → compiled prompt → structured response → typed object.

The model’s response.

Response:

[[ ## answer ## ]]
42
[[ ## completed ## ]]

The model responded in exactly the format DSPy specified. It didn’t just say “42” but, rather, it echoed the section marker structure so DSPy could parse the output unambiguously. The answer='42' you saw in the Prediction object came directly from this. The model cooperated with a parsing protocol that DSPy established in the system message. Notice that you didn’t have to write any of the scaffolding that made that possible.

Script 2: One Word, New Behavior

The second script, dspy2.py, is structurally identical to the first. Same LM configuration. Same Signature. Same module shape. The only difference is a single word in the module’s __init__:

# Script 1
self.predictor = dspy.Predict(QASignature)

# Script 2
self.predictor = dspy.ChainOfThought(QASignature)

That’s it. Predict becomes ChainOfThought. Run the script the same way as you did the first one.

What Changes in the Output

The Prediction object now has two fields.

=== Prediction ===
Prediction(
    reasoning="This is a classic question from the science fiction comedy series
    *The Hitchhiker's Guide to the Galaxy* by Douglas Adams. The answer, according to the supercomputer Deep Thought, is 42. The question itself was never fully explained in the books, adding to the humor and absurdity of the situation.",
    answer='42'
)

You declared one output field in your Signature: answer. The Prediction object now has two: reasoning and answer. DSPy injected the reasoning field internally when you switched to ChainOfThought. You didn’t declare it. You didn’t touch the Signature. It appeared because the predictor strategy requested it.

The compiled prompt reflects the new field.

System message:

Your input fields are:
1. `question` (str):
Your output fields are:
1. `reasoning` (str):
2. `answer` (str):

...

User message:

[[ ## question ## ]]
What is the answer to life, the universe and everything?

Respond with the corresponding output fields, starting with the field
`[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

Compare this system message against the one from the first script. The output fields section now lists reasoning before answer. The user message instructs the model to start at [[ ## reasoning ## ]] before reaching [[ ## answer ## ]]. DSPy compiled a structurally different prompt from the same Signature, purely because the predictor strategy changed.

What Doesn’t Change

The Signature is identical. QASignature still declares exactly one input field and one output field, with the same names and types as the first script. Nothing about the contract changed. Only the strategy for fulfilling it did.

This is the claim DSPy makes that’s worth taking seriously: the declaration of what you want is completely independent of how DSPy asks for it. Swapping from a direct answer to a chain-of-thought reasoning strategy required no prompt rewriting, no string editing, no new parsing logic. One word changed, and DSPy recompiled everything around it.

A Note on inspect_history

Both scripts use dspy.inspect_history(n=1) to print the generated prompt. It’s tempting to treat this as a debugging convenience that you might strip out once things are working. That undersells it.

Because DSPy compiles prompts rather than accepting hand-authored ones, inspect_history is how you verify what was actually sent to the model versus what you assumed was sent. As your Signatures grow more complex (adding field descriptions, chain-of-thought, few-shot examples), the compiled prompt can diverge from your intuition in subtle ways. Being able to inspect the compiled artifact is the same discipline as being able to read compiler output: you want to know what your declaration produced, not just trust that compilation went as intended.

This, needless to say, is an aspect of testability.

Think of it as your test oracle for the prompt layer. Which, given what this blog is about, seems worth keeping around.

Where This Goes Next

These two scripts establish the foundation: declare a Signature, choose a predictor strategy, get typed output back, inspect what was compiled. The Signature stays stable; the strategy adapts around it.

The natural next question is: what happens when a single step isn’t enough? What if you need to expand an answer and then compress it? What if the output of one LLM call needs to become the input of another?

That’s where DSPy’s pipeline model comes in: multiple Signatures, multiple predictors, explicit data flow between steps, and inspect_history showing you every compiled prompt in sequence. That’s the focus of the next post.

Stories from a Software Tester

Twice upon a time, in another space, no distance in any direction from here …