AI and Testing: A Testing Example

In this post, my goal is to write a relatively substantive test case and, while doing so, bring together many of the topics talked about in previous posts of this series.

The Idea of Testing

You’ll notice that the script we work on in this post doesn’t test AI in the way we usually talk about testing software. There are no assertions, no expected values, and no pass/fail conditions. Instead, what I’ll be doing here is setting up a controlled conversational environment and then probing it with deliberately simple, almost naïve questions. These questions will not be designed to trick the model or extract hidden knowledge; they’re meant to expose how the model reasons, how it scopes its answers, and how it carries conceptual continuity across turns. In other words, the test isn’t the question; it’s the behavior.

We’ll build this script incrementally, using it as a kind of conversational test harness. Each addition will be intentional: memory, prompt structure, model choice, and output handling will all become things we can reason about and observe. Think of this less like unit (or even integration) testing and more like a physics experiment. We define the boundaries, set initial conditions, and then watch what happens when we apply small forces. Along the way, we’ll talk about what it even means to test an AI system (especially one whose primary output is language rather than logic) and how our traditional testing instincts need to evolve when the system under test reasons probabilistically instead of deterministically.

The Idea of History

Imagine walking into a conversation mid-stream and hearing someone say, “Yes, but do those values really define the minimal scale?” Without context, you’d be lost. What values? Minimal scale of what? The question is meaningless in isolation. This is exactly the challenge language models face with every interaction, unless they have access to conversational history.

History is crucial to large language models because history is context. The models themselves are stateless, so most LLM applications resend the conversation so far with each request, which lets the model resolve references like “those values” or “the approach we discussed earlier.” Sometimes they also pull history from previously stored conversations, allowing continuity across sessions. But how do we actually implement this? And more importantly, how do we verify it’s working correctly?

Testing with History!

Adding to what I said above, in this post we’ll build a conversational AI system with memory using LangChain, but we’ll approach it as a deliberate test case. We’ll set up specific test conditions (the infrastructure for maintaining history) and carefully chosen data conditions (questions that can only be answered correctly with conversational context). By the end, you’ll hopefully not only understand how to implement conversation history, but also how to design experiments that prove your system behaves as intended.

You’ll note with this example that we are bringing together a whole lot of concepts from all the previous posts on LangChain. Here are the dependencies you will need (many of which you will already have if you’ve been working through these posts):

  python -m pip install python-dotenv
  python -m pip install langchain
  python -m pip install langchain-ollama
  python -m pip install langchain-community

The only one you likely don’t have yet is langchain-community. This dependency is pretty much what it sounds like: it contains third-party integrations and components contributed and maintained by the community.

Building Our Test Case

Unlike previous posts, I’m going to build this logic with you, largely step by step, just to show you how I approach testing some of these concepts.

Observability Platform

I do want to use LangSmith for my observability so I’m going to put in place the logic to read my .env file.
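To make that step concrete, here’s a minimal standard-library sketch of what “reading a .env file” amounts to. In the actual script, python-dotenv (installed above) would normally handle this; parse_env and load_env_file are hypothetical helper names of mine, and the two variable names in the example are the ones I understand LangSmith to read, so check them against the LangSmith documentation for your version.

```python
import os


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Copy values from the file into os.environ without overriding existing ones."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


# Hypothetical .env contents; the key is a placeholder, not a real credential.
example = 'LANGSMITH_TRACING=true\nLANGSMITH_API_KEY="placeholder-key"\n# a comment\n'
settings = parse_env(example)
```

The point for testing purposes is simply that observability settings live outside the script, so the script itself stays the same whether tracing is on or off.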

Model Setup

Next, let’s set up our model. This is pretty much the context required to do anything useful for our test case. We need a model to connect to and run against.

The details of setting up a model and using LangSmith were covered in the post Local LLMs and LangChain.

Template Setup

At this point, I know I want to build a template, specifically a prompt template.

The first thing I want to establish is the shape of the conversation itself. Before we talk about models, memory, or execution, we need to decide what kind of interaction we’re even having. This prompt template is that decision made explicit. It defines three distinct roles in the exchange: a system voice that constrains behavior, a placeholder for prior context, and a human input that represents the current turn. At this stage, nothing is “intelligent” yet; we’re just defining the slots where intelligence will later operate.

Two of the fields here deserve special attention, even though they look deceptively simple: {history} and {prompt}. They aren’t values yet; they’re obligations.

  • {prompt} represents the immediate question we’re asking right now.
  • {history} represents everything the model is allowed to remember about how we got here.

By separating those concerns up front, we’re already making a testing decision: we’re saying that continuity of reasoning matters, and that we want to control it explicitly rather than let it be implicit or accidental. As we go on, we’ll make those placeholders real and see how changing what goes into them changes what comes out.
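Here’s a plain-Python sketch of those three slots, just to make the shape visible. It is not the LangChain code itself: in LangChain this shape is typically expressed with a ChatPromptTemplate and a MessagesPlaceholder for the history. The render_messages function and the system text are assumptions of mine for illustration.

```python
SYSTEM_TEXT = "You are a concise physics tutor."  # hypothetical system message


def render_messages(history, prompt):
    """Return the (role, content) list the model would see for one turn."""
    messages = [("system", SYSTEM_TEXT)]   # the voice that constrains behavior
    messages.extend(history)               # the {history} slot: prior turns, possibly empty
    messages.append(("human", prompt))     # the {prompt} slot: the current question
    return messages


first_turn = render_messages([], "What is the Planck length?")
later_turn = render_messages(
    [("human", "What is the Planck length?"), ("ai", "About 1.6e-35 meters.")],
    "What is the Planck time?",
)
```

On the first turn the history slot is simply empty; on later turns it carries whatever we chose to put there. That choice is the thing under test.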

Templates were covered in the post LangChain Templates, and messages were covered in LangChain Messages.

Chain Setup

With the prompt template in place, the next step is to decide what happens after those messages are assembled.

The chain definition does that by composing three distinct responsibilities into a pipeline: the prompt structure, the model that will respond to it, and a parser that turns the model’s reply into something usable. Nothing runs yet. No question is asked. We’re simply defining how information will flow once it does. In testing terms, this is like assembling the system under test before we start injecting inputs.

What’s important here is that this chain is a stable object. It’s the thing we’ll later wrap with history, observe over multiple turns, and reason about as a unit. The model isn’t being called directly, and the prompt isn’t being rendered in isolation. Instead, we’re saying: “When something happens, it will always happen in this order.” That predictability is intentional. If we’re going to test how an AI behaves over time, we first need a consistent path from input to output. Otherwise, we can’t tell whether a change in behavior came from the model, the memory, or the wiring itself.

Think of this like plumbing. The template shapes the water, the model pressurizes it, and the parser decides what kind of container it ends up in. Until the pipes are connected, turning on the faucet doesn’t mean anything. I would also add that this is closer to defining a test harness than running a test. We’re not asserting outcomes yet; we’re ensuring that every future interaction flows through the same, inspectable path.
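The plumbing idea can be sketched without any model at all. This is a toy composition in plain Python, standing in for the template, model, and parser stages; the pipeline helper and the fake model are my own illustrative assumptions, not LangChain objects.

```python
from functools import reduce


def pipeline(*steps):
    """Compose steps left to right, so the output of one feeds the next."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def render(inputs):
    # Stand-in for the prompt template: build the message list.
    return [("system", "Answer briefly."), ("human", inputs["prompt"])]


def fake_model(messages):
    # Stand-in for the LLM: echoes the last human message inside a reply object.
    return {"role": "ai", "content": f"echo: {messages[-1][1]}"}


def parse(reply):
    # Stand-in for a string output parser: keep only the text.
    return reply["content"]


chain = pipeline(render, fake_model, parse)  # defined here, but not yet run
answer = chain({"prompt": "What is the Planck length?"})
```

Because the path is fixed, swapping fake_model for a real one changes exactly one stage, and any difference in output can be attributed to that stage.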

Chaining is talked about in LangChain and Orchestration.

Adding State

Up to this point, everything we’ve defined has been stateless. The chain takes an input, produces an output, and forgets everything immediately afterward. That’s fine for single-turn interactions, but it’s insufficient if we want to observe continuity of reasoning across multiple prompts. This next addition is where that changes. Instead of modifying the chain itself, we wrap it with something that knows how to supply and persist conversational state.

At this point, your editor will flag read_session_history as undefined, and that’s okay for now.

This wrapper, RunnableWithMessageHistory, doesn’t introduce memory by magic. It introduces a contract. We give it the runnable we want to execute (chain), tell it which incoming value represents the current human input (input_messages_key="prompt"), and tell it where prior messages should be injected (history_messages_key="history"). At this level, we’re not saying how history is stored or retrieved, only that, whenever the chain runs, it will be run in the presence of whatever “history” means for the current session.

What’s deliberately hidden here is the most important part: read_session_history. That function is the escape hatch. It’s how we decide what “memory” actually is. In the next step, we’ll open that up and build it piece by piece, because this is where design decisions start to matter. Do we keep everything in memory? Do we persist it? Do we isolate sessions? The wrapper doesn’t care. It only cares that, given a session identifier, it can obtain something that behaves like message history. That separation is intentional and, I might add, very testable.

In fact, I should note that this is exactly how we would talk about a test double. RunnableWithMessageHistory depends on something that provides history, but it doesn’t care whether that something is a fake, a stub, or a real database. That choice comes later.
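To show the contract rather than the library, here’s a toy wrapper in plain Python. It is only a sketch of the idea, not how RunnableWithMessageHistory is implemented: the WithHistory class, the fake chain, and the direct session_id argument are simplifications I’m assuming for illustration.

```python
class WithHistory:
    """Toy history wrapper: inject prior turns, run, then record the new turn."""

    def __init__(self, runnable, get_history, input_key, history_key):
        self.runnable = runnable
        self.get_history = get_history  # session_id -> list-like history
        self.input_key = input_key
        self.history_key = history_key

    def invoke(self, inputs, session_id):
        past = self.get_history(session_id)
        reply = self.runnable({**inputs, self.history_key: list(past)})
        past.append(("human", inputs[self.input_key]))
        past.append(("ai", reply))
        return reply


def count_turns(inputs):
    # Fake chain: reports how many prior messages it was shown.
    return f"{len(inputs['history'])} prior messages; asked: {inputs['prompt']}"


fake_store = {"demo": []}  # a test double for whatever provides history
wrapped = WithHistory(count_turns, lambda sid: fake_store[sid], "prompt", "history")
first = wrapped.invoke({"prompt": "What is the Planck length?"}, "demo")
second = wrapped.invoke({"prompt": "What is the Planck time?"}, "demo")
```

Notice that the wrapper only ever talks to get_history. Here it’s backed by a dictionary of lists, but nothing in WithHistory would change if it were backed by a database.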

Reading History

Now let’s look at what read_session_history actually does and why I deliberately didn’t start there. First, let’s introduce some variables.

Before we look at how session history is retrieved, we need two small pieces of setup. The first is store, which is nothing more than a dictionary we’ll use to hold conversation state in memory. The second is session_id, which gives us a way to name a conversation and retrieve its history consistently across turns. At this stage, neither of these does anything interesting on its own. They simply establish where state can live and how we’ll refer to it.

The important thing to notice is that we’re separating identity from mechanism.

  • session_id answers the question “which conversation is this?”
  • store answers the question “where could its history be kept?”

Perhaps think of session_id as a test case name and store as the scratch space where that test’s artifacts are kept. Nothing happens until something asks for them; but once it does, we know exactly where to look.

The function we’ll look at next is where those two ideas finally come together. Until then, this setup is intentionally dull. And that’s a good thing! In testing, the most reliable systems are often the ones whose state management is boring, explicit, and easy to reason about.

Now we can finally look at the function that turns the idea of “history” into something concrete. The read_session_history function takes a session identifier and returns an object that represents the conversation so far. If this is the first time we’ve seen that session, it creates a new message history and stores it; otherwise, it simply returns what already exists. There’s no persistence, no database, and no cleverness here: just a straightforward mapping between a session name and a growing list of messages.

What matters isn’t what this function does, but what it guarantees. Given the same session ID, we will always get the same history object back. That single property is what makes multi-turn conversations possible. Every time the chain runs, it can append new messages and see previous ones without needing to know where or how they’re stored. In testing terms, this is controlled shared state: explicit, inspectable, and easy to reset when needed.

This is essentially lazy initialization for conversation memory. We don’t create history until it’s needed, and we don’t duplicate it once it exists. If you’ve ever used a test fixture that only spins up when a test first touches it, the pattern should feel familiar.
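Here’s a minimal sketch of that lazy-initialization pattern. The names store, session_id, and read_session_history come from the discussion above; the InMemoryHistory class and the session name are stand-ins I’m assuming, since the real script would return a LangChain message history object instead.

```python
store = {}                   # session_id -> history object, held in memory
session_id = "planck-demo"   # hypothetical session name


class InMemoryHistory:
    """Minimal stand-in for a chat message history."""

    def __init__(self):
        self.messages = []

    def add_message(self, role, content):
        self.messages.append((role, content))

    def clear(self):
        self.messages = []


def read_session_history(sid):
    # Lazy initialization: create on first request, reuse afterwards.
    if sid not in store:
        store[sid] = InMemoryHistory()
    return store[sid]


first_lookup = read_session_history(session_id)
first_lookup.add_message("human", "What is the Planck length?")
second_lookup = read_session_history(session_id)
```

The guarantee is visible in the last three lines: the second lookup returns the very same object, message included, rather than a fresh one.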

Clearing History

Before we run anything, there’s one more deliberate step: clearing the session history.

This statement forces the conversation to start from a known state by removing any messages that may already exist for the given session ID. Without this, repeated runs of the script could silently accumulate context and produce different answers to the same questions. This is a problem that’s especially hard to notice when you’re dealing with language instead of numbers.

I should note that this is a testing move, not an AI move. We’re asserting that every execution begins with a clean slate unless we explicitly decide otherwise. By clearing the history up front, we make the behavior of the system repeatable and observable. If the model responds differently, we can be confident that the difference came from a change we made on purpose, not from leftover conversational residue.

This is the conversational equivalent of resetting a database, clearing caches, or reinitializing fixtures before a test run. Stateful systems are powerful, but only if you can control when state matters and when it doesn’t.
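A small simulation shows why this matters. It assumes a history that outlives a single run (as a persistent store would) and a hypothetical run_script function standing in for one execution of the script; the model reply is a placeholder.

```python
store = {"planck-demo": []}  # history that survives between simulated runs


def run_script(clear_first):
    """Simulate one execution; return how much context the first question saw."""
    history = store["planck-demo"]
    if clear_first:
        history.clear()  # reset to a known state before asking anything
    seen_by_first_question = len(history)
    history.append(("human", "What is the Planck length?"))
    history.append(("ai", "(model reply)"))
    return seen_by_first_question


leaky_runs = [run_script(clear_first=False) for _ in range(3)]
clean_runs = [run_script(clear_first=True) for _ in range(3)]
```

Without the reset, each run hands the “first” question more leftover context than the run before it. With the reset, every run starts from zero.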

History in a Database

Let’s now add a slight variation to how our conversation history is stored.

Now that the basic version of read_session_history is clear, we can expand it without changing its role. The function still answers the same question (“given a session identifier, where does the conversation history live?”) but it now supports two different answers. The USE_SQLITE flag lets us choose between an in-memory history and a persistent one backed by SQLite, while keeping the rest of the script completely unaware of that choice.

The key point is that nothing else changes. The wrapper doesn’t care. The chain doesn’t care. Even the prompts don’t care. As long as the function returns something that behaves like BaseChatMessageHistory, the system works. This is intentional. We’re isolating the storage decision behind a single seam so we can switch strategies without rewriting or retesting everything upstream. If this feels like dependency inversion or swapping a test double for a real implementation, that’s not an accident.

This is where “memory” stops being a metaphor and becomes an engineering decision. In-memory history is fast, disposable, and ideal for experimentation. SQLite-backed history is slower, persistent, and better for long-running or inspectable conversations. The important thing is that we can choose between them by flipping a flag, not by changing behavior everywhere else.

This gives us a controlled way to decide whether state should survive a process restart. That’s not an AI concern; it’s a test design concern. This is also dependency injection without the ceremony. The system depends on message history, not on a particular storage mechanism. That makes it trivial to reason about, reset, and experiment with, which is exactly what we want when we’re probing behavior instead of asserting correctness.
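Here’s one way that seam could look, sketched with Python’s built-in sqlite3 module. The real script would more likely lean on a SQLite-backed history class from langchain-community; the MemoryHistory and SqliteHistory classes, the table layout, the extra function parameters, and the temporary database path below are all assumptions of mine for illustration.

```python
import os
import sqlite3
import tempfile

USE_SQLITE = True
store = {}


class MemoryHistory:
    def __init__(self):
        self.messages = []

    def add(self, role, content):
        self.messages.append((role, content))


class SqliteHistory:
    def __init__(self, session_id, path):
        self.session_id = session_id
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages "
            "(id INTEGER PRIMARY KEY, session_id TEXT, role TEXT, content TEXT)"
        )

    @property
    def messages(self):
        query = "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id"
        return self.conn.execute(query, (self.session_id,)).fetchall()

    def add(self, role, content):
        self.conn.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (self.session_id, role, content),
        )
        self.conn.commit()


def read_session_history(session_id, path, use_sqlite=USE_SQLITE):
    # One seam: callers never learn which backend they were given.
    if use_sqlite:
        return SqliteHistory(session_id, path)
    return store.setdefault(session_id, MemoryHistory())


db_path = os.path.join(tempfile.mkdtemp(), "history.db")
history = read_session_history("planck-demo", db_path)
history.add("human", "What is the Planck length?")
reloaded = read_session_history("planck-demo", db_path)  # as if after a restart
```

Both backends expose the same messages and add members, which is all the rest of the script would need. The reloaded object opens its own connection and still sees the earlier message, which is the “survives a restart” property in miniature.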

The Actual Conversation

At this point, we’re finally ready to run the conversation.

Each call to history.invoke represents a single turn, but it’s important to notice what’s actually being passed in. The only explicit input is the current {prompt}, which is the immediate question we want answered. Everything else the model sees comes from the structure we’ve already defined: the prompt template, the chain, and the accumulated history associated with the session.

The config parameter is where the final piece clicks into place. By supplying a session_id, we’re telling the history wrapper which conversation this invocation belongs to. That identifier is passed through to read_session_history, which retrieves the appropriate message history and injects it into the {history} placeholder in the prompt template. The result is that each successive question is answered in the context of the previous ones, even though we never manually pass earlier responses around. Continuity emerges not because the model “remembers,” but because we’ve deliberately arranged for past messages to be part of the input every time.

Seen through a testing lens, each invocation is just another test step operating against shared state. The state isn’t hidden inside the model; it’s external, inspectable, and controlled by us. Change the session_id, and you get a brand-new conversation. Clear the history, and you reset the experiment. Leave it intact, and you can observe how reasoning evolves over time.
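The following toy mirrors that flow so the routing is easy to see. The nested config shape, with session_id under a "configurable" key, follows the convention the LangChain history wrapper uses; everything else (the invoke function, the placeholder replies, the session names, and my paraphrase of the three questions) is assumed for illustration.

```python
store = {}


def read_session_history(session_id):
    return store.setdefault(session_id, [])


def invoke(inputs, config):
    """Toy turn: look up history by session_id, answer, then record the exchange."""
    session = config["configurable"]["session_id"]
    past = read_session_history(session)
    reply = f"(reply given {len(past)} prior messages)"
    past += [("human", inputs["prompt"]), ("ai", reply)]  # in-place, so store sees it
    return reply


questions = [
    "What is the Planck length?",
    "What is the Planck time?",
    "Do those values really define the minimal scale?",
]
main = {"configurable": {"session_id": "planck-demo"}}
replies = [invoke({"prompt": q}, config=main) for q in questions]

# Same third question, different session: it arrives with no context at all.
cold = invoke({"prompt": questions[2]}, config={"configurable": {"session_id": "cold"}})
```

The third question in the main session arrives with four prior messages behind it. The same question under a different session_id arrives with none, which is exactly the isolation we want to be able to observe.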

Considering the Questions

Notice that the questions themselves are simple. What’s being exercised here isn’t factual recall so much as conceptual continuity: whether the system can carry an idea forward and refine it when prompted. We’re essentially testing whether the model can maintain conceptual threads. We start with the Planck length, move to the Planck time, and then ask about their combined significance, which requires the model to hold two prior answers in context and synthesize them. That’s a great way to demonstrate the value of conversation history.

Planck units are ideal here because they have definite answers. This lets us evaluate whether the conversational context actually improves accuracy compared to asking the third question cold. The broader point is that test design often comes down to careful consideration of your prompts.
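Since “definite answers” is doing the work here, it helps to have reference values on hand when reading the model’s replies. This short calculation uses the standard definitions, with the Planck length as the square root of ħG/c³ and the Planck time as that length divided by c, along with CODATA 2018 constants. The variable names are my own.

```python
import math

# CODATA 2018 values in SI units
HBAR = 1.054571817e-34  # reduced Planck constant, J*s
G = 6.67430e-11         # gravitational constant, m^3 kg^-1 s^-2
C = 299_792_458.0       # speed of light, m/s (exact by definition)

planck_length = math.sqrt(HBAR * G / C**3)  # meters
planck_time = planck_length / C             # seconds
```

That works out to roughly 1.6e-35 meters and 5.4e-44 seconds. If a model’s first two answers are far from those figures, the third answer is being built on a bad foundation, and that is worth knowing before judging its reasoning.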

Think of this like building a legal case. Each question is like introducing a piece of evidence. Without memory (session history), each question is like starting a trial from scratch. With memory, you’re building an argument where each piece of evidence references and builds on what came before.

This is just how a user would use such a system. It’s also exactly how we would test such a system, which happens to correspond to how we would develop one.

Checking Observability

Were you to view this in LangSmith, you would see that you have a RunnableWithMessageHistory.

If you were to dig into one of those entries, you would see that LangSmith is keeping track of that history.

If you change USE_SQLITE to a value of True, that will write a database file to the same directory as your project. You can, of course, change the name of the database in the code.

There are Visual Studio Code extensions to view your database. If you want to just check it with a simple online approach, you can use SQLite Viewer.

This Script as a Test Case

What we’ve constructed here is essentially a controlled experiment for conversational AI behavior. Like any rigorous test, it has two critical dimensions: test conditions (the infrastructure being evaluated) and data conditions (the specific inputs used for evaluation).

The test conditions are the mechanisms we’re actually examining: Does the RunnableWithMessageHistory wrapper properly maintain conversational context? Does the session management correctly isolate conversations by ID? Does the chain successfully integrate history into the prompt template? These are the “what are we testing?” questions. These are the architectural components whose behavior we want to observe and verify.

The data conditions are our carefully chosen inputs that expose whether those mechanisms work as intended. The three questions about Planck units aren’t arbitrary. I specifically designed those to require contextual memory. The first two establish independent facts, while the third uses the demonstrative “those values,” which is only interpretable if the model has access to prior turns.

With this test, we’ve also chosen data conditions for storage (SQLite vs. in-memory), model selection (qwen3:latest), and session management (a single, named session). By varying these data conditions while keeping test conditions constant (or vice versa), we can isolate exactly what influences conversational coherence.

In essence, this script follows the logic of forensic investigation: we’ve set up a controlled scenario where we know what should happen if our system works correctly, and the outputs either confirm or refute that hypothesis. The progression from independent questions to context-dependent ones functions like a chain of evidence, where each link depends on what came before.

Next Steps!

This was a long post because I tried to foreground structure before behavior. I really wanted to distill a lot of what we’ve talked about in this series to date but also make it clear that we’ve really been doing testing the whole time, even if it felt like we were doing development. In fact, we were doing both.

There are a few more points to cover with this example, but I’ll save those for the next post.

This article was written by Jeff Nyman
