AI and Testing: Refining Tests

In the previous post I provided an extended testing example where we wrote an “AI test case” together. This post will provide some more test thinking around that initial test case.

Our Test Script

First, let’s make sure we have the test script that we ended up with.

One change I made for this post (highlighted above) was to use Qwen2.5 as the model rather than Qwen3. When playing around with these variations, it helps to have a “non-heavy” model, meaning one that doesn’t do as much reasoning. To get Qwen2.5, you can use the same command we used in prior posts:

  ollama run qwen2.5:latest

As always in these posts, you can use whatever model you want. Choosing models is part of the test design. I’m choosing Qwen2.5 because it sits between a reasoning model, like Qwen3, and an overly simplistic model, like Phi3. Part of the point of our test case here is that we can switch models relatively easily.

It’s worth noting that, in the previous post, we crafted a test case that happens to run via an automated (code-based) script but that was created with test thinking firmly in place. This is (or at least, should be) the case with all such automation tooling that supports some form of testing.

Reporting

While we do have some simple “reporting” at the bottom of our test, as any tester will tell you, how we show our results matters. In order to provide a little more nuanced reporting, replace the last three lines of our previous code (the print statements) with this logic:

For those learning Python, just know that enumerate() is a Python function that loops through a sequence while keeping track of the index. The 1 means “start counting at 1” (not 0), which is more human-friendly. So, in this case, i becomes the message number (1, 2, 3…), and msg is the actual message object. Each message object has a class name like HumanMessage, AIMessage, or SystemMessage. So, I retrieve the class name as a string and then strip out the word “Message,” leaving just Human, AI, or System. Finally, if the message content exceeds 100 characters, the logic here takes just the first 100 characters and adds “…” to indicate truncation. Otherwise, it shows the full content.
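As a rough sketch of the logic just described (not the exact code from the script), the following shows the enumerate() loop, the class-name stripping, and the 100-character truncation. The HumanMessage and AIMessage classes here are simple stand-ins so the example runs on its own, and format_history is a hypothetical helper name.

```python
from dataclasses import dataclass


# Stand-in message classes (assumption) so the sketch runs without any library.
@dataclass
class HumanMessage:
    content: str


@dataclass
class AIMessage:
    content: str


def format_history(messages, limit=100):
    """Return one report line per message, numbered from 1."""
    lines = []
    for i, msg in enumerate(messages, 1):
        # "HumanMessage" -> "Human", "AIMessage" -> "AI"
        role = type(msg).__name__.replace("Message", "")
        content = msg.content
        if len(content) > limit:
            # Keep the first `limit` characters and mark the truncation.
            content = content[:limit] + "..."
        lines.append(f"{i}. [{role}] {content}")
    return lines


history = [
    HumanMessage("(a short prompt)"),
    AIMessage("(a long reply) " * 20),
]
for line in format_history(history):
    print(line)
```

Returning the lines rather than printing them directly is just a convenience: it makes the formatting itself something you can check.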

In terms of what this is doing, as opposed to how, the history inspection portion is largely self-explanatory, but its value is easy to underestimate. By explicitly printing the conversation history, we’re surfacing the actual data structure the system is working with: a sequence of human and AI messages accumulated over time. In this case, we can clearly see six messages total (three prompts and three responses) which confirms that each exchange is being captured exactly once and in the expected order.

More importantly, this makes the idea of “memory” concrete rather than abstract. There’s no hidden state or opaque recall happening inside the model. What the AI sees on each turn is precisely what we can inspect here: literal message content being replayed back into the prompt. For testing purposes, this gives us a control case. We can now correlate changes in model behavior with changes in recorded history, instead of guessing whether context was preserved, truncated, or ignored. Observability like this turns conversational behavior into something we can reason about, not just react to.

A guiding point here is that once we can see the conversation as data, we can start testing it like data.

Adding a Control Case

Let’s now add the following below the logic we just added.

To give you an idea, with the non-control version of my test, the output I got for my third question was:


The Planck length and Planck time are theoretical minimums derived from fundamental constants in physics. They often serve as a framework for understanding the limits of current physical theories, but it's not definitively known if they define the actual minimal scales of physical events. Current scientific understanding does not rule out the possibility of smaller scales beyond these thresholds.

Whereas the output I got for my control variation on the third question was:


Without context: The term "minimal scale of physical events" is broad, but Planck units often define fundamental scales like the Planck length (about 1.6 x 10^-35 meters) for space, suggesting these might be considered minimal physical scales in some contexts. However, whether they definitively set the minimum scale depends on one's theoretical framework and current understanding of physics.

What you see may differ, but what should be consistent is that the control comparison produces a noticeably different result. By asking the same question in a fresh session, I remove all prior context and force the model to respond to the phrase “those values” in isolation. This gives me a baseline for how the system behaves when referents are missing and continuity is broken.
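To make the control idea concrete, here is a minimal sketch that doesn’t call a model at all. It only shows what each session would hand to the model: the same final question, with or without prior turns. The build_payload helper and the tuple format are assumptions for illustration, and the earlier turns are placeholders rather than the actual prompts.

```python
def build_payload(history, question):
    """Return what a model would receive: prior turns plus the new question."""
    return list(history) + [("human", question)]


QUESTION = "Do those values define the minimal scale of physical events?"

# Placeholder turns standing in for the first two exchanges of the main session.
main_history = [
    ("human", "(first prompt)"),
    ("ai", "(first response)"),
    ("human", "(second prompt)"),
    ("ai", "(second response)"),
]

main_payload = build_payload(main_history, QUESTION)
control_payload = build_payload([], QUESTION)  # fresh session: no history

# The final prompt is identical; only the preceding context differs.
print("main turns sent:   ", len(main_payload))
print("control turns sent:", len(control_payload))
print("same question:     ", main_payload[-1] == control_payload[-1])
```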

Not Just Pass / Fail

Notice that a traditional notion of “expected results” doesn’t apply. And that’s the point. We’re not asserting a single correct answer. Instead, we’re defining a set of plausible outcomes. The model might express confusion, ask for clarification, or fall back to a generic explanation of Planck-scale physics without anchoring “those values” to anything specific. Any of these outcomes is acceptable. What would be suspicious is a highly specific, context-aware answer that assumes knowledge it was never given.

This is how we empirically demonstrate that history matters. Not as a conceptual claim, but as an observable difference in behavior under controlled conditions. By holding the prompt constant and varying only the presence of prior context, we turn conversational continuity into something we can test, compare, and reason about. Another way to say all this is that our test doesn’t prove what the model knows; it proves what the model was told.

At this point we’ve proven something important: history changes the model’s behavior in observable ways. However, before we go any further, we should do what testers always do when a test harness starts to grow: verify the harness. In other words, we’re not yet judging whether the model’s answers are “correct.” We’re first confirming that our conversation instrumentation is behaving the way we think it is.

Invariants

This is where lightweight invariants come in. An invariant is simply a property that should always hold if our setup is working: message counts match the number of turns, roles alternate in the expected order, and sessions stay isolated from each other. These checks don’t require an oracle for the model’s content. They’re closer to “pre-flight checks” that make later observations credible. If these invariants fail, any conclusions we draw about history, context, or reasoning are on shaky ground.

Go ahead and add this next bit of logic to the end of the script.

I added some lightweight comments in there to reflect what the code is doing. I will also add that this logic requires a change to the existing code:

This change only applies if your USE_SQLITE is set to True. With that setting, the database file persists between runs, so the control-session would accumulate messages across script executions and its message count would no longer match what the invariant expects.

You should get something like this as your output:


Session: main (jeff-chat)
  Messages: 6 (expected 6) -> PASS
  Role order: ['HumanMessage', 'AIMessage', 'HumanMessage', 'AIMessage', 'HumanMessage', 'AIMessage'] -> PASS
  Non-empty content -> PASS

Session: control (control-session)
  Messages: 2 (expected 2) -> PASS
  Role order: ['HumanMessage', 'AIMessage'] -> PASS
  Non-empty content -> PASS

These invariants are intentionally boring. They don’t tell us whether the model is right; they tell us whether our experiment is wired correctly. If the “main” session doesn’t have six messages after three prompts, or if the roles don’t alternate, then we’re not actually testing conversational continuity; instead, we’re testing an accident of our implementation. Once these checks pass, we can start evaluating the model’s behavior with more confidence.
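For reference, here is a minimal sketch of checks along these lines, assuming each session’s history is a list of message objects with a content attribute. It is an approximation, not the script’s actual code: the message classes are stand-ins, and check_invariants is a hypothetical name.

```python
from dataclasses import dataclass


# Stand-in message classes (assumption) so the sketch runs without any library.
@dataclass
class HumanMessage:
    content: str


@dataclass
class AIMessage:
    content: str


def check_invariants(messages, expected_turns):
    """Return a dict mapping each invariant name to True (PASS) or False (FAIL)."""
    roles = [type(m).__name__ for m in messages]
    # One human prompt followed by one AI response, per turn.
    expected_roles = ["HumanMessage", "AIMessage"] * expected_turns
    return {
        "message_count": len(messages) == 2 * expected_turns,
        "role_order": roles == expected_roles,
        "non_empty_content": all(m.content.strip() for m in messages),
    }


control = [HumanMessage("(prompt)"), AIMessage("(response)")]
for name, ok in check_invariants(control, expected_turns=1).items():
    print(f"{name} -> {'PASS' if ok else 'FAIL'}")
```

Notice that none of these checks look at what the model actually said, only at the shape of the recorded history.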

Now that we’ve verified the harness, we can start evaluating behavior. Let’s classify what “reasonable” looks like when the prompt references missing context.

Acceptable Outcome Classes

The catch here is the same one testers always run into with systems that generate language: we don’t have a single crisp expected result. In classic deterministic code, we would assert an exact output. Here, the best we can do is define what “reasonable” looks like.

So, instead of expected results, we define acceptable outcome classes. For the control prompt (“Do those values define the minimal scale of physical events?”) the key constraint is that “those values” has no referent. A well-behaved model should react in one of a few ways: it can ask for clarification, it can explicitly note the missing context, or it can give a generic explanation while carefully hedging. What we’re looking to catch is the opposite behavior: confident specificity, where the model either implies it remembers earlier values from a session in which it never received them or simply hallucinates what “those values” means.

This becomes a lightweight oracle. It’s not about whether the physics is right. It’s about whether the model’s confidence and specificity match the information it was actually given.

Go ahead and add the following logic:

That’s a lot of code and I put in some comments to at least situate you. However, while it’s a lot, it’s also relatively simple. The code defines a function that takes in one thing (the AI’s response as text) and spits out two things: a label (like “GOOD” or “SUSPICIOUS”) and an explanation of why. The code creates several lists of phrases. These are like different colored highlighters. Each list represents a different pattern: clarification phrases (yellow highlighter), uncertainty phrases (green highlighter), Planck physics terms (blue highlighter), and confident specificity phrases (red highlighter). Then, the code runs through checks, asking yes/no questions.

  • “Does the response contain any yellow-highlighted phrases?” If so, store as asks_for_clarification.
  • “Does the response contain any green-highlighted phrases?” If so, store as acknowledges_uncertainty.
  • “Are there any numbers in it?” If so, store as has_number.

And so on. Finally, the code works through an if-then ladder, checking conditions in priority order, sort of like a flowchart. The first matching condition wins. If the response asks for clarification, classify as GOOD and stop. If the response shows uncertainty and mentions Planck stuff, then classify as HEDGED GOOD and stop. And so on down the ladder. Whatever classification wins gets packaged up with its explanation and printed out, along with the original response for manual inspection.
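A condensed sketch of that shape looks like the following. The phrase lists, the lower rungs of the ladder, the way has_number is used, and the exact HEDGED label are my assumptions for illustration. Only the overall structure and the other three class names come from the walkthrough and output in this post.

```python
import re

# Phrase lists are illustrative assumptions, not the script's actual lists.
CLARIFICATION_PHRASES = ["could you clarify", "which values", "what values", "please specify"]
UNCERTAINTY_PHRASES = ["without context", "not definitively", "it depends", "might", "unclear"]
PLANCK_TERMS = ["planck"]
CONFIDENT_PHRASES = ["as we discussed", "as mentioned earlier", "the values you gave", "those values are"]


def contains_any(text, phrases):
    """True if any phrase appears as a substring of text."""
    return any(phrase in text for phrase in phrases)


def classify_response(response):
    """Return (label, explanation) for a response to the no-history prompt."""
    text = response.lower()
    asks_for_clarification = contains_any(text, CLARIFICATION_PHRASES)
    acknowledges_uncertainty = contains_any(text, UNCERTAINTY_PHRASES)
    mentions_planck = contains_any(text, PLANCK_TERMS)
    confident_specificity = contains_any(text, CONFIDENT_PHRASES)
    has_number = bool(re.search(r"\d", text))

    # Priority ladder: the first matching condition wins.
    if asks_for_clarification:
        return ("CLARIFICATION-SEEKING (GOOD)",
                'The response requests missing referents for "those values," '
                "which fits the no-history condition.")
    if acknowledges_uncertainty and mentions_planck:
        return ("HEDGED (GOOD)",
                "The response discusses Planck-scale ideas while flagging uncertainty.")
    if confident_specificity or (has_number and not acknowledges_uncertainty):
        return ("CONFIDENT SPECIFICITY (SUSPICIOUS)",
                "The response appears to infer or assert specific prior values "
                "despite having no session context.")
    if mentions_planck:
        return ("GENERIC FALLBACK (MIXED)",
                "The response defaults to Planck-scale explanations. This can be "
                "acceptable, but watch for unjustified certainty.")
    return ("UNCLASSIFIED", "No known pattern matched; inspect the response manually.")


label, why = classify_response("Could you clarify which values you mean?")
print(f"Class: {label}")
print(f"Why:   {why}")
```

Substring matching like this is deliberately crude. It will misfire on phrasing it hasn’t seen, which is one reason the original response should always be printed alongside the label for manual inspection.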

What I see when I run this:


Class: CLARIFICATION-SEEKING (GOOD)
Why:   The response requests missing referents for “those values,” which fits the no-history condition.

It’s equally possible (not necessarily equally likely) to get this:


Class: GENERIC FALLBACK (MIXED)
Why:   The response defaults to Planck-scale explanations. This can be acceptable, but watch for unjustified certainty.

It’s even equally possible (again, not necessarily equally likely) to get this:


Class: CONFIDENT SPECIFICITY (SUSPICIOUS)
Why:   The response appears to infer or assert specific prior values despite having no session context.

I say equally possible but not equally likely because this is not a “truth checker.” It’s a behavior checker. We’re looking for alignment between what the model says and what it was actually given. The classifier is intentionally blunt: it sorts responses into buckets that represent acceptable behavior under missing context, and it flags cases where the model seems overly confident or oddly specific.

Our Test Ladder is Building

Now you’ve got a nice ladder.

  • Harness invariants (are we running the experiment correctly?)
  • Outcome classes (is the behavior appropriate given the information?)

What most logically comes next? Well, that’s what we’ll explore soon but, in the next post, we’re going to take a slight side trip to show what refactoring our above code looks like.


This article was written by Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.

2 thoughts on “AI and Testing: Refining Tests”

  1. So much learning from this page to finally end up with a result of

    Class: GENERIC FALLBACK (MIXED)
    Why: The response defaults to Planck-scale explanations. This can be acceptable, but watch for unjustified certainty.

    1. Indeed, sometimes it can feel like a whole lot of work to end up with a … “mixed”, shall we say? … result.
