In the previous post I provided an extended testing example where we wrote an “AI test case” together. This post will provide some more test thinking around that initial test case.

Our Test Script
First, let’s make sure we have the test script that we ended up with.
from dotenv import load_dotenv
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_community.chat_message_histories import SQLChatMessageHistory

env = load_dotenv(".env")

store = {}
session_id = "jeff-chat"

MODEL = "qwen2.5:latest"
USE_SQLITE = False
DB = "jeff-chat.db"

# ============================================================
# SESSION HISTORY MANAGEMENT
# ============================================================

def read_session_history(session_id: str) -> BaseChatMessageHistory:
    if USE_SQLITE:
        return SQLChatMessageHistory(
            session_id=session_id,
            connection=f"sqlite:///{DB}"
        )
    else:
        if (session_id not in store):
            store[session_id] = ChatMessageHistory()
        return store[session_id]

read_session_history(session_id).clear()

# ============================================================
# MODEL SETUP
# ============================================================

model = ChatOllama(
    model=MODEL,
    base_url="http://localhost:11434",
)

template = ChatPromptTemplate.from_messages([
    ("system", "Please answer as concisely as possible."),
    ("placeholder", "{history}"),
    ("human", "{prompt}")
])

chain = template | model | StrOutputParser()

history = RunnableWithMessageHistory(
    chain,
    read_session_history,
    input_messages_key="prompt",
    history_messages_key="history"
)

# ============================================================
# CONVERSATION EXECUTION
# ============================================================

response1 = history.invoke(
    {"prompt": "What is the smallest possible length?"},
    config={"configurable": {"session_id": session_id}}
)

response2 = history.invoke(
    {"prompt": "What is the smallest possible time?"},
    config={"configurable": {"session_id": session_id}}
)

response3 = history.invoke(
    {"prompt": "Do those values define the minimal scale of physical events?"},
    config={"configurable": {"session_id": session_id}}
)

print(response1, end="\n\n")
print(response2, end="\n\n")
print(response3)
One change I made for this post (the MODEL setting above) was to use Qwen2.5 as the model rather than Qwen3. When playing around with these variations, it helps to have a “non-heavy” model, meaning one that doesn’t do as much reasoning. To get Qwen2.5, you can run the same command we used in prior posts:
ollama run qwen2.5:latest
As always in these posts, you can use whatever model you want. Choosing models is part of the test design. I’m choosing Qwen2.5 because it sits between a reasoning model, like Qwen3, and an overly simplistic model, like Phi3. Part of the point of our test case here is that we can switch models relatively easily.
It’s worth noting that, in the previous post, we crafted a test case that happens to run via an automated (code-based) script but that was created with test thinking firmly in place. This is (or at least, should be) the case with all such automation tooling that supports some form of testing.
Reporting
While we do have some simple “reporting” at the bottom of our test, as any tester will tell you, how we show our results matters. In order to provide a little more nuanced reporting, replace the last three lines of our previous code (the print statements) with this logic:
print("=" * 60)
print("CONVERSATION WITH HISTORY")
print("=" * 60)

print(response1, end="\n\n")
print(response2, end="\n\n")
print(response3, end="\n\n")

# ============================================================
# HISTORY INSPECTION
# ============================================================

print("=" * 60)
print("INSPECTING CONVERSATION HISTORY")
print("=" * 60)

session = read_session_history(session_id)

print(f"Total messages in history: {len(session.messages)}")
print("\nMessage contents:")

for i, msg in enumerate(session.messages, 1):
    role = msg.__class__.__name__.replace("Message", "")
    content_str = str(msg.content)

    if len(content_str) > 100:
        content = content_str[:100] + "..."
    else:
        content = content_str

    print(f" {i}. [{role}] {content}")

print()
For those learning Python, just know that enumerate() is a Python function that loops through a sequence while keeping track of the index. The 1 means “start counting at 1” (not 0), which is more human-friendly. So, in this case, i becomes the message number (1, 2, 3…), and msg is the actual message object. Each message object has a class name like HumanMessage, AIMessage, or SystemMessage. So, I retrieve the class name as a string and then strip out the word “Message,” leaving just Human, AI, or System. Finally, if the message content exceeds 100 characters, the logic here takes just the first 100 characters and adds “...” to indicate truncation. Otherwise, it shows the full content.
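If you want to poke at that formatting logic on its own, here is a minimal sketch that runs without LangChain or a model. The HumanMessage and AIMessage classes below are hypothetical stand-ins that only carry a content attribute, and format_history is a helper name I'm introducing purely for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for LangChain's message classes. Only the
# class name and a `content` attribute matter for this sketch.
@dataclass
class HumanMessage:
    content: str


@dataclass
class AIMessage:
    content: str


def format_history(messages, limit=100):
    """Return one display line per message, numbered from 1."""
    lines = []
    for i, msg in enumerate(messages, 1):
        # "HumanMessage" -> "Human", "AIMessage" -> "AI"
        role = msg.__class__.__name__.replace("Message", "")
        text = str(msg.content)
        if len(text) > limit:
            # Keep the first `limit` characters and mark the cut.
            text = text[:limit] + "..."
        lines.append(f" {i}. [{role}] {text}")
    return lines


demo = [
    HumanMessage("What is the smallest possible length?"),
    AIMessage("x" * 150),  # long enough to trigger truncation
]

for line in format_history(demo):
    print(line)
```

The second line of output is cut to 100 characters plus the trailing dots, while the short prompt is shown in full.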
In terms of what this is doing, as opposed to how, the history inspection portion is largely self-explanatory, but its value is easy to underestimate. By explicitly printing the conversation history, we’re surfacing the actual data structure the system is working with: a sequence of human and AI messages accumulated over time. In this case, we can clearly see six messages total (three prompts and three responses) which confirms that each exchange is being captured exactly once and in the expected order.
More importantly, this makes the idea of “memory” concrete rather than abstract. There’s no hidden state or opaque recall happening inside the model. What the AI sees on each turn is precisely what we can inspect here: literal message content being replayed back into the prompt. For testing purposes, this gives us a control case. We can now correlate changes in model behavior with changes in recorded history, instead of guessing whether context was preserved, truncated, or ignored. Observability like this turns conversational behavior into something we can reason about, not just react to.
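To make "literal message content being replayed" tangible, here is a toy sketch with no model at all. It is not how RunnableWithMessageHistory is implemented; fake_model, ask, and transcript are hypothetical names, and the only point is to show that each turn sends the accumulated messages plus the new prompt.

```python
# A toy illustration of "memory as replayed messages". Nothing here is
# LangChain API: `fake_model` and `ask` are hypothetical stand-ins.

seen_by_model = []  # records exactly what the "model" receives per turn


def fake_model(messages):
    """Pretend model: logs its input and returns a canned reply."""
    seen_by_model.append(list(messages))
    return f"reply to: {messages[-1][1]}"


def ask(transcript, prompt):
    """Send all prior messages plus the new prompt, then record both sides."""
    outgoing = transcript + [("human", prompt)]
    reply = fake_model(outgoing)
    transcript.append(("human", prompt))
    transcript.append(("ai", reply))
    return reply


transcript = []
ask(transcript, "What is the smallest possible length?")
ask(transcript, "What is the smallest possible time?")

# On the second turn the "model" was handed turn one verbatim.
for role, text in seen_by_model[1]:
    print(f"[{role}] {text}")
```

On the second call the stub receives three messages: the first prompt, the first reply, and the new prompt. That is the whole of the "memory" in this sketch.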
A guiding point here is that once we can see the conversation as data, we can start testing it like data.
Adding a Control Case
Let’s now add the following below the logic we just added.
# ============================================================
# CONTROL COMPARISON (No History)
# ============================================================

print("=" * 60)
print("CONTROL: SAME QUESTION WITHOUT HISTORY")
print("=" * 60)

control_response = history.invoke(
    {"prompt": "Do those values define the minimal scale of physical events?"},
    config={"configurable": {"session_id": "control-session"}}
)

print(f"Without context: {control_response}")
print()
To give you an idea, with the non-control version of my test, the output I got for my third question was:
The Planck length and Planck time are theoretical minimums derived from fundamental constants in physics. They often serve as a framework for understanding the limits of current physical theories, but it's not definitively known if they define the actual minimal scales of physical events. Current scientific understanding does not rule out the possibility of smaller scales beyond these thresholds.
Whereas the output I got for my control variation on the third question was:
Without context: The term "minimal scale of physical events" is broad, but Planck units often define fundamental scales like the Planck length (about 1.6 x 10^-35 meters) for space, suggesting these might be considered minimal physical scales in some contexts. However, whether they definitively set the minimum scale depends on one's theoretical framework and current understanding of physics.
What you see may differ, but what should be consistent is that the control comparison produces a noticeably different result. By asking the same question in a fresh session, I remove all prior context and force the model to respond to the phrase “those values” in isolation. This gives me a baseline for how the system behaves when referents are missing and continuity is broken.
Not Just Pass / Fail
Notice that a traditional notion of “expected results” doesn’t apply. And that’s the point. We’re not asserting a single correct answer. Instead, we’re defining a set of plausible outcomes. The model might express confusion, ask for clarification, or fall back to a generic explanation of Planck-scale physics without anchoring “those values” to anything specific. Any of these outcomes is acceptable. What would be suspicious is a highly specific, context-aware answer that assumes knowledge it was never given.
This is how we empirically demonstrate that history matters. Not as a conceptual claim, but as an observable difference in behavior under controlled conditions. By holding the prompt constant and varying only the presence of prior context, we turn conversational continuity into something we can test, compare, and reason about. Another way to say all this is that our test doesn’t prove what the model knows; it proves what the model was told.
At this point we’ve proven something important: history changes the model’s behavior in observable ways. However, before we go any further, we should do what testers always do when a test harness starts to grow: verify the harness. In other words, we’re not yet judging whether the model’s answers are “correct.” We’re first confirming that our conversation instrumentation is behaving the way we think it is.
Invariants
This is where lightweight invariants come in. An invariant is simply a property that should always hold if our setup is working: message counts match the number of turns, roles alternate in the expected order, and sessions stay isolated from each other. These checks don’t require an oracle for the model’s content. They’re closer to “pre-flight checks” that make later observations credible. If these invariants fail, any conclusions we draw about history, context, or reasoning are on shaky ground.
Go ahead and add this next bit of logic to the end of the script.
# ============================================================
# LIGHTWEIGHT INVARIANTS (Harness sanity checks)
# ============================================================

print("=" * 60)
print("HARNESS INVARIANTS")
print("=" * 60)

def check_invariants(name: str, session_id: str, expected_turns: int):
    session = read_session_history(session_id)
    msgs = session.messages

    expected_messages = expected_turns * 2  # each turn = human + ai

    # 1) Count invariant
    count_ok = (len(msgs) == expected_messages)

    # 2) Role alternation invariant: HumanMessage, AIMessage, HumanMessage, ...
    roles = [m.__class__.__name__ for m in msgs]
    alternation_ok = True
    for idx, role in enumerate(roles):
        if idx % 2 == 0 and role != "HumanMessage":
            alternation_ok = False
            break
        if idx % 2 == 1 and role != "AIMessage":
            alternation_ok = False
            break

    # 3) Non-empty content invariant (useful to catch weird parsing/empty messages)
    nonempty_ok = all(str(m.content).strip() for m in msgs)

    # Report
    status_count = "PASS" if count_ok else "FAIL"
    status_alternation = "PASS" if alternation_ok else "FAIL"
    status_nonempty = "PASS" if nonempty_ok else "FAIL"

    print(f"Session: {name} ({session_id})")
    print(f" Messages: {len(msgs)} (expected {expected_messages}) -> {status_count}")
    print(f" Role order: {roles[:6]}{'...' if len(roles) > 6 else ''} -> {status_alternation}")
    print(f" Non-empty content -> {status_nonempty}")
    print()

# Main conversation had 3 turns
check_invariants("main", session_id, expected_turns=3)

# Control conversation had 1 turn
check_invariants("control", "control-session", expected_turns=1)

print()
I added some lightweight comments in there to reflect what the code is doing. I will also add that this logic requires a change to the existing code:
...
read_session_history(session_id).clear()
read_session_history("control-session").clear()
...
This change only matters if USE_SQLITE is set to True. In that case the database file persists between runs, so without the extra clear() the control-session would accumulate messages across script executions and the count invariant would fail.
You should get something like this as your output:
Session: main (jeff-chat)
Messages: 6 (expected 6) -> PASS
Role order: ['HumanMessage', 'AIMessage', 'HumanMessage', 'AIMessage', 'HumanMessage', 'AIMessage'] -> PASS
Non-empty content -> PASS
Session: control (control-session)
Messages: 2 (expected 2) -> PASS
Role order: ['HumanMessage', 'AIMessage'] -> PASS
Non-empty content -> PASS
These invariants are intentionally boring. They don’t tell us whether the model is right; they tell us whether our experiment is wired correctly. If the “main” session doesn’t have six messages after three prompts, or if the roles don’t alternate, then we’re not actually testing conversational continuity; instead, we’re testing an accident of our implementation. Once these checks pass, we can start evaluating the model’s behavior with more confidence.
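One invariant named earlier, session isolation, is only checked indirectly by the message counts. As a sketch of what a more direct check could look like, the snippet below uses a plain dictionary of (role, content) pairs in place of the real history objects. The check_isolation helper and the demo_store layout are assumptions made for illustration, not part of the script.

```python
# Hypothetical store: session id -> list of (role, content) pairs.
# Real LangChain history objects expose `.messages` instead; this
# sketch avoids that dependency so it can run on its own.

QUESTION = "Do those values define the minimal scale of physical events?"

demo_store = {
    "jeff-chat": [
        ("human", "What is the smallest possible length?"),
        ("ai", "placeholder reply"),
        ("human", "What is the smallest possible time?"),
        ("ai", "placeholder reply"),
        ("human", QUESTION),
        ("ai", "placeholder reply"),
    ],
    "control-session": [
        ("human", QUESTION),
        ("ai", "placeholder reply"),
    ],
}


def check_isolation(store, main_id, control_id, shared_prompts):
    """True if the sessions are separate and no unexpected prompt is shared."""
    main_prompts = {text for role, text in store[main_id] if role == "human"}
    control_prompts = {text for role, text in store[control_id] if role == "human"}
    # Prompts deliberately sent to both sessions are allowed to overlap.
    leaked = (main_prompts & control_prompts) - set(shared_prompts)
    distinct_objects = store[main_id] is not store[control_id]
    return distinct_objects and not leaked


ok = check_isolation(demo_store, "jeff-chat", "control-session", [QUESTION])
print(f"Isolation -> {'PASS' if ok else 'FAIL'}")
```

The shared_prompts argument exists because the control prompt is intentionally identical in both sessions; only prompts outside that list count as leakage.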
Now that we’ve verified the harness, we can start evaluating behavior. Let’s classify what “reasonable” looks like when a prompt references missing context.
Acceptable Outcome Classes
The catch here is the same one testers always run into with systems that generate language: we don’t have a single crisp expected result. In classic deterministic code, we would assert an exact output. Here, the best we can do is define what “reasonable” looks like.
So, instead of expected results, we define acceptable outcome classes. For the control prompt (“Do those values define the minimal scale of physical events?”) the key constraint is that “those values” has no referent. A well-behaved model should react in one of a few ways: it can ask for clarification, it can explicitly note the missing context, or it can give a generic explanation while carefully hedging. What we’re looking to catch is the opposite behavior: confident specificity, where the model either implies it remembers earlier values it never received in this session or simply hallucinates what “those values” means.
This becomes a lightweight oracle. It’s not about whether the physics is right. It’s about whether the model’s confidence and specificity match the information it was actually given.
Go ahead and add the following logic:
# ============================================================
# ACCEPTABLE OUTCOME CLASSES (Oracle-lite)
# ============================================================

print("=" * 60)
print("OUTCOME CLASSIFICATION (CONTROL RESPONSE)")
print("=" * 60)

def classify_control_response(response: str) -> tuple[str, str]:
    r = response.strip()
    r_low = r.lower()

    # Signals that the model is asking for missing referents
    clarification_markers = [
        "what do you mean", "which values", "what values",
        "those values refer", "can you clarify", "could you clarify",
        "clarify", "which ones", "what are those"
    ]

    # Signals that the model acknowledges missing context / uncertainty
    uncertainty_markers = [
        "without context", "without more context", "not enough context",
        "i don't have", "i don't know which", "unclear", "ambiguous",
        "depends on what you mean"
    ]

    # Signals of generic physics fallback (often fine if it stays
    # general / hedged)
    planck_markers = [
        "planck", "quantum", "scale", "fundamental",
        "minimum length", "minimum time"
    ]

    # Signals of confident specificity (riskier in the control session)
    specificity_markers = [
        "the values are", "those values are", "you mean",
        "as we discussed", "as mentioned earlier", "as i said",
        "as i told you"
    ]

    # Heuristic: look for numbers/units that could indicate the
    # model is inventing specifics. (Not always bad, but suspicious
    # if no context was provided.)
    has_number = any(ch.isdigit() for ch in r_low)
    mentions_planck = any(m in r_low for m in planck_markers)

    asks_for_clarification = any(m in r_low for m in clarification_markers) or r.endswith("?")
    acknowledges_uncertainty = any(m in r_low for m in uncertainty_markers)
    confident_specific = any(m in r_low for m in specificity_markers)

    # Classification logic (simple on purpose)
    if asks_for_clarification:
        return ("CLARIFICATION-SEEKING (GOOD)",
                "The response requests missing referents for "
                "“those values,” which fits the no-history condition.")

    if acknowledges_uncertainty and mentions_planck:
        return ("HEDGED GENERIC FALLBACK (GOOD)",
                "The response notes missing context and then stays "
                "general (e.g., Planck-scale discussion).")

    if acknowledges_uncertainty and not mentions_planck:
        return ("UNCERTAINTY ACKNOWLEDGED (GOOD)",
                "The response explicitly flags ambiguity or missing "
                "context without overcommitting to specifics.")

    # Here’s the main “smell test”: confident + specific,
    # especially with numbers, in a control session
    if confident_specific or (has_number and mentions_planck and not acknowledges_uncertainty):
        return ("CONFIDENT SPECIFICITY (SUSPICIOUS)",
                "The response appears to infer or assert specific "
                "prior values despite having no session context.")

    if mentions_planck:
        return ("GENERIC FALLBACK (MIXED)",
                "The response defaults to Planck-scale explanations. "
                "This can be acceptable, but watch for unjustified "
                "certainty.")

    return ("OTHER / UNCLASSIFIED",
            "The response doesn't match the main expected patterns. "
            "Inspect manually to decide if it's reasonable.")

label, rationale = classify_control_response(control_response)

print(f"Class: {label}")
print(f"Why: {rationale}")
print()
print("Raw response:")
print(control_response)
print()
That’s a lot of code and I put in some comments to at least situate you. However, while it’s a lot, it’s also relatively simple. The code defines a function that takes in one thing (the AI’s response as text) and spits out two things: a label (like “GOOD” or “SUSPICIOUS”) and an explanation of why. The code creates several lists of phrases. These are like different colored highlighters. Each list represents a different pattern: clarification phrases (yellow highlighter), uncertainty phrases (green highlighter), Planck physics terms (blue highlighter), and confident specificity phrases (red highlighter). Then, the code runs through checks, asking yes/no questions.
- “Does the response contain any yellow-highlighted phrases?” If so, store as asks_for_clarification.
- “Does the response contain any green-highlighted phrases?” If so, store as acknowledges_uncertainty.
- “Are there any numbers in it?” If so, store as has_number.
And so on. Finally, the code works through an if-then ladder, checking conditions in priority order, sort of like a flowchart. The first matching condition wins. If the response asks for clarification, classify it as CLARIFICATION-SEEKING (GOOD) and stop. If the response shows uncertainty and mentions Planck stuff, classify it as HEDGED GENERIC FALLBACK (GOOD) and stop. And so on down the ladder. Whatever classification wins gets packaged up with its explanation and printed out, along with the original response for manual inspection.
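The "first match wins" idea is easier to see when the ladder is stripped down to its skeleton. The sketch below is not the classifier from this post; it uses three simplified, hypothetical rules stored as data, with list order standing in for priority.

```python
# Each rule is (label, predicate). Order encodes priority: the first
# predicate that returns True decides the label. These rules are a
# simplified illustration, not the full marker lists from the post.
RULES = [
    ("CLARIFICATION-SEEKING",
     lambda text: "clarify" in text or text.endswith("?")),
    ("UNCERTAINTY ACKNOWLEDGED",
     lambda text: "without context" in text or "unclear" in text),
    ("GENERIC FALLBACK",
     lambda text: "planck" in text),
]


def classify(response):
    """Return the label of the first rule whose predicate matches."""
    text = response.strip().lower()
    for label, matches in RULES:
        if matches(text):
            return label
    return "OTHER / UNCLASSIFIED"


# Made-up sample responses, written only to exercise each rung.
samples = [
    "Could you clarify which values you mean?",
    "Without context it is unclear, but Planck units are one candidate.",
    "Planck units set the scale.",
    "Forty-two.",
]

for s in samples:
    print(f"{classify(s):<26} <- {s}")
```

Notice that the second sample mentions Planck units but still lands in the uncertainty bucket, because that rule sits higher on the ladder. Reordering the list changes the outcome, which is why priority order is itself a test design decision.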
What I see when I run this:
Class: CLARIFICATION-SEEKING (GOOD)
Why: The response requests missing referents for “those values,” which fits the no-history condition.
It’s equally possible (not necessarily equally likely) to get this:
Class: GENERIC FALLBACK (MIXED)
Why: The response defaults to Planck-scale explanations. This can be acceptable, but watch for unjustified certainty.
It’s even equally possible (again, not necessarily equally likely) to get this:
Class: CONFIDENT SPECIFICITY (SUSPICIOUS)
Why: The response appears to infer or assert specific prior values despite having no session context.
I say equally possible but not equally likely because this is not a “truth checker.” It’s a behavior checker. We’re looking for alignment between what the model says and what it was actually given. The classifier is intentionally blunt: it sorts responses into buckets that represent acceptable behavior under missing context, and it flags cases where the model seems overly confident or oddly specific.
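Since a single run only shows one of the possible outcomes, a natural follow-up is to repeat the control prompt several times and count how often each class shows up. The sketch below shows just the tallying step. The responses are made-up placeholders rather than real model output, and tally_classes and toy_classifier are hypothetical names.

```python
from collections import Counter


def tally_classes(responses, classifier):
    """Count how often each label occurs across a batch of responses."""
    return Counter(classifier(r) for r in responses)


def toy_classifier(response):
    # Placeholder classifier that returns only a label. It stands in
    # for the much richer classify_control_response in the script.
    return "ASKS" if response.strip().endswith("?") else "ANSWERS"


# Made-up placeholder responses, NOT real model output.
canned = ["Which values?", "Planck units.", "What values do you mean?"]

counts = tally_classes(canned, toy_classifier)
total = sum(counts.values())

for label, n in counts.most_common():
    print(f"{label}: {n}/{total}")
```

To use this with the real classifier, which returns a (label, rationale) pair, you could pass something like lambda r: classify_control_response(r)[0]. Each repeated run would also need its own fresh or cleared session, otherwise the control session stops being a no-history condition after the first prompt.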
Our Test Ladder is Building
Now you’ve got a nice ladder.
- Harness invariants (are we running the experiment correctly?)
- Outcome classes (is the behavior appropriate given the information?)
What most logically comes next? Well, that’s what we’ll explore soon but, in the next post, we’re going to take a slight side trip to show what refactoring our above code looks like.
So much learning from this page to finally end up with a result of
Class: GENERIC FALLBACK (MIXED)
Why: The response defaults to Planck-scale explanations. This can be acceptable, but watch for unjustified certainty.
Indeed, sometimes it can feel like a whole lot of work to end up with a … “mixed”, shall we say? … result.