AI and Testing: Refactoring Tests

In the previous post, we refined an AI test case that we had previously created as a testing example. In this brief post, I want to show a refactoring of that code. We will also align on the output of this test.

Refactoring Exercise

I’m choosing to focus on refactoring because it often doesn’t get talked about. Developers perform this activity all the time and test engineers, particularly those writing automation code, should certainly be aware of the practice. First, let’s consider the code we ended up with:

Wow, right!? Just seeing it all in one shot shows how much work we got through.

The code above works, but it has some pedagogical rough edges. Let’s refactor it to make the testing concepts clearer. Rather than go through this step by step, in which the only thing I would likely be testing is everyone’s patience, I’ll show you what I did to refactor the logic and then explain the key elements to notice.

In the “CONVERSATION EXECUTION” section, notice I extracted the prompts into variables (prompt1, prompt2, prompt3) rather than writing them twice: once in the invoke() call and again in the print() statement. This follows the DRY principle: “Don’t Repeat Yourself.”

Whether and to what extent to apply the DRY principle in testing-related code has long been debated. I don’t plan to settle that debate. What I will say is that when you’re building test harnesses, focusing (at least to some extent) on DRY isn’t just about code aesthetics. If you need to refine your test prompts (and you will; testing AI systems is iterative), you want to change them in exactly one place. Duplication creates maintenance headaches: you modify the prompt in the invoke call but forget to update the print statement, and suddenly your output logs don’t match what you actually asked the model.

More importantly, the variables make it easy to reference specific turns in your analysis. Later, when I inspect the control response, I can clearly say “the third prompt deliberately uses a referent (‘those values’) that requires prior context.” The variable name prompt3 gives me a clean handle for discussing that specific test case.
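To make the idea concrete, here is a minimal sketch of the pattern rather than the harness itself. The fake_invoke() function is a hypothetical stand-in for the real invoke() call, and the text of the first two prompts is assumed for illustration; only the third prompt is quoted in this post.

```python
# Hypothetical stand-in for the real model call, so the sketch runs offline.
def fake_invoke(prompt: str) -> str:
    return f"(model reply to: {prompt})"


# prompt1 and prompt2 are assumed placeholders; prompt3 is the one quoted in the post.
prompt1 = "What is the Planck length?"
prompt2 = "What is the Planck time?"
prompt3 = "Do those values define the minimal scale of physical events?"


def run_turn(prompt: str) -> str:
    """Send one prompt and log exactly the text that was sent."""
    print(f"HUMAN: {prompt}")
    response = fake_invoke(prompt)
    print(f"AI: {response}")
    return response


# Each prompt is written once, so the log cannot drift from what was asked.
responses = [run_turn(p) for p in (prompt1, prompt2, prompt3)]
```

Because the logged text and the sent text come from the same variable, editing a prompt changes both at once.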

In the “LIGHTWEIGHT INVARIANTS” section, notice I extracted the alternation check into its own function, check_role_alternation(). This isn’t strictly necessary for this simple harness, but it does illustrate a useful pattern: when you’re building test infrastructure, isolating individual checks makes them easier to debug, test, and reuse. If I later wanted to check alternation in a different context (say, verifying a conversation loaded from a database) I could call this function directly.
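A check like this might look roughly as follows. This is a sketch under assumptions: messages are modeled as plain (role, content) tuples with the role strings "human" and "ai", which may not match the message objects the real harness uses.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content): an assumed simplification


def check_role_alternation(messages: List[Message]) -> bool:
    """Return True if roles alternate human, ai, human, ai, starting with human."""
    for idx, (role, _content) in enumerate(messages):
        # Even indices (0, 2, 4...) should be human; odd indices should be ai.
        expected = "human" if idx % 2 == 0 else "ai"
        if role != expected:
            return False
    return True


history = [("human", "question"), ("ai", "answer")]
print(check_role_alternation(history))
```

Since the function only needs a list of messages, it could be pointed at a history loaded from anywhere, not just a live session.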

In this same section, I also felt I had a problem with my check_invariants() function. It does something conceptually simple: verify that a conversation history looks well-formed. However, the code didn’t announce clearly enough what it was doing. The modulo arithmetic (idx % 2), which is now refactored into the above function, was correct but not self-documenting.

Adding a docstring that lists the three invariants upfront creates a conceptual roadmap. Then, section comments (“Invariant 1:”, “Invariant 2:”) create landmarks as you read through. The inline explanations (“Even indices (0,2,4…) should be Human”) can help transform mysterious arithmetic into explicit rules.
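As a rough illustration of that layout, here is a sketch that assumes the same simplified (role, content) tuples. The expected_turns parameter and the choice to return a list of failure messages are assumptions made for this example, not necessarily how the real function reports results.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content): an assumed simplification


def check_role_alternation(messages: List[Message]) -> bool:
    """True when even indices (0, 2, 4...) are human and odd indices are ai."""
    return all(
        role == ("human" if idx % 2 == 0 else "ai")
        for idx, (role, _content) in enumerate(messages)
    )


def check_invariants(messages: List[Message], expected_turns: int) -> List[str]:
    """Verify that a conversation history looks well-formed.

    Invariants checked:
      1. The history holds exactly two messages per turn.
      2. Roles alternate, starting with the human.
      3. No message has empty content.

    Returns a list of failure descriptions; an empty list means all passed.
    """
    failures: List[str] = []

    # Invariant 1: one human message and one AI message per turn.
    expected_count = 2 * expected_turns
    if len(messages) != expected_count:
        failures.append(f"expected {expected_count} messages, found {len(messages)}")

    # Invariant 2: roles alternate human, ai, human, ai.
    if not check_role_alternation(messages):
        failures.append("roles do not alternate human/ai")

    # Invariant 3: no message is empty or whitespace-only.
    for idx, (_role, content) in enumerate(messages):
        if not content.strip():
            failures.append(f"message {idx} is empty")

    return failures
```

Returning descriptions rather than a single boolean is one design option; it tells you which invariant broke without rerunning anything.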

The biggest issue I felt I had was in the “ACCEPTABLE OUTCOME CLASSES” section, specifically in my classify_control_response() function. The original version mixed what we’re looking for (specific phrases) with why we’re looking for it (detecting patterns). When you’re reading through 30+ string literals inline, it’s hard to see the forest for the trees.

I realized I could separate concerns by extracting the phrase lists to module-level constants. This gives me two benefits. First, the constants become documentation; they show exactly what patterns have been observed across different models. Second, the classification function can focus on logic rather than listing strings. When you read the function, you see “Pattern 1: Asking for clarification” rather than wading through nine different ways to ask “what do you mean?”
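The shape of that separation can be sketched like this. The phrase lists here are short, invented examples rather than the lists from the real harness, and the "GOOD" and "SUSPICIOUS" label strings are assumed; only "GENERIC FALLBACK (MIXED)" appears in the actual report.

```python
# Empirical observations: phrases seen in responses (illustrative, not exhaustive).
CLARIFICATION_PHRASES = (
    "which values",
    "what values",
    "what do you mean",
    "could you clarify",
    "don't have enough context",
)

FALSE_CONTEXT_PHRASES = (
    "as we discussed",
    "as i mentioned",
    "earlier in our conversation",
)


def _contains_any(text: str, phrases) -> bool:
    """Case-insensitive substring match, treating curly apostrophes as straight."""
    lowered = text.lower().replace("\u2019", "'")
    return any(phrase in lowered for phrase in phrases)


def classify_control_response(response: str) -> str:
    """Describe how a model handled a question whose referent has no context."""
    # Pattern 1: Asking for clarification.
    if _contains_any(response, CLARIFICATION_PHRASES):
        return "CLARIFICATION REQUEST (GOOD)"

    # Pattern 2: Claiming a prior discussion that never happened.
    if _contains_any(response, FALSE_CONTEXT_PHRASES):
        return "FALSE CONTEXT CLAIM (SUSPICIOUS)"

    # Pattern 3: Anything else is treated as a generic fallback.
    return "GENERIC FALLBACK (MIXED)"


print(classify_control_response("Which values are you referring to?"))
```

Adding a newly observed phrase then means touching only a constant, leaving the classification logic alone.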

These refactorings follow roughly the same pattern: separate mechanism from meaning. Constants capture the empirical observations (these are the phrases we’ve seen). Functions capture the conceptual framework (these are the patterns we’re testing for). Comments explain the bridge between them.

For a testing harness, in particular, this matters more than it would in, say, a traditional test script. I say that because you’re not just making the code work; you’re teaching readers of your code how to think about AI behavior systematically.

This isn’t just cleaner code. It’s clearer thinking! When you’re building testing harnesses, you’re building conceptual tools. Any refactoring I do in this context makes those concepts visible.

The Test Report

Let’s focus on the output we get from this test. Here I won’t reproduce all of what you might see. What I do want to make clear is that you’re not just looking at program output. You are looking at a structured test report. Each section tells you something specific about how the conversational AI system behaved under test conditions.

The Main Experiment (Conversation with History)

This section shows the actual conversation flow. The model receives three sequential prompts and successfully tracks context across turns. In the third response, you should see the model referencing the specific values it mentioned in the previous two answers. This demonstrates that the history mechanism is working; the model has access to prior turns.

History Inspection

This is your “ground truth” verification. You’re looking under the hood to confirm that the conversation history contains exactly what you expect: six messages (three human, three AI), properly alternated, all non-empty. If this section showed only two messages or revealed gaps in the history, you would know something broke in your session management. This is basic infrastructure validation.
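An inspection step of this kind could be as small as the following sketch. It assumes the history is a list of (role, content) tuples, and inspect_history() is a hypothetical name chosen for this example.

```python
from collections import Counter


def inspect_history(messages):
    """Print each message with its index and role; return per-role counts."""
    for idx, (role, content) in enumerate(messages):
        # Truncate long content so the report stays readable.
        print(f"[{idx}] {role.upper()}: {content[:60]}")
    return Counter(role for role, _content in messages)


# Placeholder history standing in for a real three-turn session.
sample_history = [
    ("human", "first question"), ("ai", "first answer"),
    ("human", "second question"), ("ai", "second answer"),
    ("human", "third question"), ("ai", "third answer"),
]
counts = inspect_history(sample_history)
print(dict(counts))
```

A count other than three human and three AI messages for a three-turn run would point at session management rather than at the model.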

The Control Experiment

Here’s where the test design pays off. We ask the exact same third question (“Do those values define the minimal scale of physical events?”) but to a fresh session with no history. The model has no prior context (no previous mention of Planck length or Planck time) yet the question uses the referent “those values.”

How does the model respond? Well, generally, it makes an educated guess. It falls back to general physics knowledge and talks about Planck constants, but you’ll likely notice the language is more generic and hedged. It doesn’t (and shouldn’t) say something like “The Planck length and time [that we just discussed]…” because there was no prior discussion.
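The design of the control can be sketched without a real model. Everything here is a stand-in: Session is a hypothetical minimal class, stub_model simply reports how much history it was handed, and the first two prompts are assumed placeholders.

```python
class Session:
    """Hypothetical minimal session that stores (role, content) pairs."""

    def __init__(self, model):
        self.model = model
        self.history = []

    def ask(self, prompt: str) -> str:
        # The model sees only what this session has accumulated so far.
        reply = self.model(self.history, prompt)
        self.history.append(("human", prompt))
        self.history.append(("ai", reply))
        return reply


def stub_model(history, prompt):
    """Stand-in model: reports how many prior messages it could see."""
    return f"reply with {len(history)} prior messages in context"


prompt1 = "What is the Planck length?"  # assumed placeholder
prompt2 = "What is the Planck time?"    # assumed placeholder
prompt3 = "Do those values define the minimal scale of physical events?"

# Main experiment: three turns in one session.
main = Session(stub_model)
for prompt in (prompt1, prompt2, prompt3):
    main.ask(prompt)

# Control experiment: the same third prompt in a fresh session.
control = Session(stub_model)
control_reply = control.ask(prompt3)

print(main.history[-1][1])
print(control_reply)
```

The only variable that differs between the two runs of prompt3 is the history, which is what makes the comparison meaningful.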

Harness Invariants

These are your sanity checks: automated verification that the test infrastructure itself is working correctly. Both sessions pass all three checks: correct message counts, proper role alternation, no empty content. If any of these failed, you would know the problem was with your test harness, not the model’s behavior.

Outcome Classification

This is your automated oracle: a lightweight classifier that categorizes the control response. In most cases, you’ll likely get “GENERIC FALLBACK (MIXED).” The model likely defaulted to talking about Planck scales (reasonable domain knowledge) but likely didn’t explicitly request clarification about “those values” (which would have been ideal epistemic behavior).

Lots of “likely” in what I just said. Keep in mind something we talked about in the previous post: the classification isn’t pass/fail; it’s descriptive. “MIXED” means “this behavior is acceptable but not optimal.” A “GOOD” classification would mean the model asked “Which values?” or said “I don’t have enough context.” A “SUSPICIOUS” classification would mean the model confidently asserted it had discussed specific values when it hadn’t.

Why This Matters

This output structure (experiment, inspection, control, validation, classification) is reproducible. You can run this same harness against different local models and, in fact, against distributed models (GPT-4, Claude, Grok, etc.) and compare their classifications. You can modify the prompts and see how behavior changes. You can add more invariants or refine your classification patterns.
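Comparing models could look something like the sketch below. It is deliberately abstract: each "model" is just a callable that takes a prompt and returns text, the two stubs return canned strings, and toy_classify is a simplified stand-in for the real classifier. No real provider API is shown or implied.

```python
from typing import Callable, Dict

CONTROL_PROMPT = "Do those values define the minimal scale of physical events?"


def compare_models(
    models: Dict[str, Callable[[str], str]],
    classify: Callable[[str], str],
) -> Dict[str, str]:
    """Send the control prompt to each model and classify each reply."""
    results = {}
    for name, ask in models.items():
        results[name] = classify(ask(CONTROL_PROMPT))
    return results


def toy_classify(response: str) -> str:
    """Simplified stand-in for the harness classifier."""
    return "GOOD" if "which values" in response.lower() else "MIXED"


# Hypothetical stubs; a real run would wrap actual model clients here.
stub_models = {
    "model-a": lambda prompt: "Which values do you mean?",
    "model-b": lambda prompt: "Planck units are often discussed as a lower bound.",
}

report = compare_models(stub_models, toy_classify)
for name, label in report.items():
    print(f"{name:10} {label}")
```

Keeping the prompt and classifier fixed while swapping the model is what lets the classifications be compared side by side.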

So, again, this isn’t just output. It’s a test report based on reproducible evidence. You’re not just running code. You’re systematically probing how conversational AI systems handle context, ambiguity, and missing information.

Next Steps!

This was a bit of an interlude post to take us from our initial testing example to scaling that example up. In the next post, we’ll do exactly that type of scaling and we’ll start with the refactored test we ended up with in this post.


This article was written by Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.
