In the previous post, we refined an AI test case that we had previously created as a testing example. In this brief post, I want to show a refactoring of that code. We will also align on the output of this test.

Refactoring Exercise
I’m choosing to focus on refactoring because it’s something that often doesn’t get talked about. Developers perform this activity all the time and, certainly, test engineers (such as those writing automation code) should be aware of the practice. First, let’s consider the code we ended up with:
```python
from dotenv import load_dotenv

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_community.chat_message_histories import SQLChatMessageHistory

env = load_dotenv(".env")

store = {}
session_id = "jeff-chat"

MODEL = "qwen2.5:latest"
USE_SQLITE = False
DB = "jeff-chat.db"

# ============================================================
# SESSION HISTORY MANAGEMENT
# ============================================================


def read_session_history(session_id: str) -> BaseChatMessageHistory:
    if USE_SQLITE:
        return SQLChatMessageHistory(
            session_id=session_id,
            connection=f"sqlite:///{DB}"
        )
    else:
        if (session_id not in store):
            store[session_id] = ChatMessageHistory()
        return store[session_id]


read_session_history(session_id).clear()
read_session_history("control-session").clear()

# ============================================================
# MODEL SETUP
# ============================================================

model = ChatOllama(
    model=MODEL,
    base_url="http://localhost:11434",
)

template = ChatPromptTemplate.from_messages([
    ("system", "Please answer as concisely as possible."),
    ("placeholder", "{history}"),
    ("human", "{prompt}")
])

chain = template | model | StrOutputParser()

history = RunnableWithMessageHistory(
    chain,
    read_session_history,
    input_messages_key="prompt",
    history_messages_key="history"
)

# ============================================================
# CONVERSATION EXECUTION
# ============================================================

response1 = history.invoke(
    {"prompt": "What is the smallest possible length?"},
    config={"configurable": {"session_id": session_id}}
)

response2 = history.invoke(
    {"prompt": "What is the smallest possible time?"},
    config={"configurable": {"session_id": session_id}}
)

response3 = history.invoke(
    {"prompt": "Do those values define the minimal scale of physical events?"},
    config={"configurable": {"session_id": session_id}}
)

print("=" * 60)
print("CONVERSATION WITH HISTORY")
print("=" * 60)
print(response1, end="\n\n")
print(response2, end="\n\n")
print(response3, end="\n\n")

# ============================================================
# HISTORY INSPECTION
# ============================================================

print("=" * 60)
print("INSPECTING CONVERSATION HISTORY")
print("=" * 60)

session = read_session_history(session_id)

print(f"Total messages in history: {len(session.messages)}")
print("\nMessage contents:")

for i, msg in enumerate(session.messages, 1):
    role = msg.__class__.__name__.replace("Message", "")
    content_str = str(msg.content)

    if len(content_str) > 100:
        content = content_str[:100] + "..."
    else:
        content = content_str

    print(f"  {i}. [{role}] {content}")

print()

# ============================================================
# CONTROL COMPARISON (No History)
# ============================================================

print("=" * 60)
print("CONTROL: SAME QUESTION WITHOUT HISTORY")
print("=" * 60)

control_response = history.invoke(
    {"prompt": "Do those values define the minimal scale of physical events?"},
    config={"configurable": {"session_id": "control-session"}}
)

print(f"Without context: {control_response}")
print()

# ============================================================
# LIGHTWEIGHT INVARIANTS (Harness sanity checks)
# ============================================================

print("=" * 60)
print("HARNESS INVARIANTS")
print("=" * 60)


def check_invariants(name: str, session_id: str, expected_turns: int):
    session = read_session_history(session_id)
    msgs = session.messages

    expected_messages = expected_turns * 2  # each turn = human + ai

    # 1) Count invariant
    count_ok = (len(msgs) == expected_messages)

    # 2) Role alternation invariant: HumanMessage, AIMessage, HumanMessage, ...
    roles = [m.__class__.__name__ for m in msgs]
    alternation_ok = True

    for idx, role in enumerate(roles):
        if idx % 2 == 0 and role != "HumanMessage":
            alternation_ok = False
            break
        if idx % 2 == 1 and role != "AIMessage":
            alternation_ok = False
            break

    # 3) Non-empty content invariant (useful to catch weird parsing/empty messages)
    nonempty_ok = all(str(m.content).strip() for m in msgs)

    # Report
    status_count = "PASS" if count_ok else "FAIL"
    status_alternation = "PASS" if alternation_ok else "FAIL"
    status_nonempty = "PASS" if nonempty_ok else "FAIL"

    print(f"Session: {name} ({session_id})")
    print(f"  Messages: {len(msgs)} (expected {expected_messages}) -> {status_count}")
    print(f"  Role order: {roles[:6]}{'...' if len(roles) > 6 else ''} -> {status_alternation}")
    print(f"  Non-empty content -> {status_nonempty}")
    print()


# Main conversation had 3 turns
check_invariants("main", session_id, expected_turns=3)

# Control conversation had 1 turn
check_invariants("control", "control-session", expected_turns=1)

print()

# ============================================================
# ACCEPTABLE OUTCOME CLASSES (Oracle-lite)
# ============================================================

print("=" * 60)
print("OUTCOME CLASSIFICATION (CONTROL RESPONSE)")
print("=" * 60)


def classify_control_response(response: str) -> tuple[str, str]:
    r = response.strip()
    r_low = r.lower()

    # Signals that the model is asking for missing referents
    clarification_markers = [
        "what do you mean",
        "which values",
        "what values",
        "those values refer",
        "can you clarify",
        "could you clarify",
        "clarify",
        "which ones",
        "what are those"
    ]

    # Signals that the model acknowledges missing context / uncertainty
    uncertainty_markers = [
        "without context",
        "without more context",
        "not enough context",
        "i don't have",
        "i don't know which",
        "unclear",
        "ambiguous",
        "depends on what you mean"
    ]

    # Signals of generic physics fallback (often fine if it stays
    # general / hedged)
    planck_markers = [
        "planck",
        "quantum",
        "scale",
        "fundamental",
        "minimum length",
        "minimum time"
    ]

    # Signals of confident specificity (riskier in the control session)
    specificity_markers = [
        "the values are",
        "those values are",
        "you mean",
        "as we discussed",
        "as mentioned earlier",
        "as i said",
        "as i told you"
    ]

    # Heuristic: look for numbers/units that could indicate the
    # model is inventing specifics. (Not always bad, but suspicious
    # if no context was provided.)
    has_number = any(ch.isdigit() for ch in r_low)

    mentions_planck = any(m in r_low for m in planck_markers)
    asks_for_clarification = any(m in r_low for m in clarification_markers) or r.endswith("?")
    acknowledges_uncertainty = any(m in r_low for m in uncertainty_markers)
    confident_specific = any(m in r_low for m in specificity_markers)

    # Classification logic (simple on purpose)
    if asks_for_clarification:
        return ("CLARIFICATION-SEEKING (GOOD)",
                "The response requests missing referents for "
                "“those values,” which fits the no-history condition.")

    if acknowledges_uncertainty and mentions_planck:
        return ("HEDGED GENERIC FALLBACK (GOOD)",
                "The response notes missing context and then stays "
                "general (e.g., Planck-scale discussion).")

    if acknowledges_uncertainty and not mentions_planck:
        return ("UNCERTAINTY ACKNOWLEDGED (GOOD)",
                "The response explicitly flags ambiguity or missing "
                "context without overcommitting to specifics.")

    # Here’s the main “smell test”: confident + specific,
    # especially with numbers, in a control session
    if confident_specific or (has_number and mentions_planck and not acknowledges_uncertainty):
        return ("CONFIDENT SPECIFICITY (SUSPICIOUS)",
                "The response appears to infer or assert specific "
                "prior values despite having no session context.")

    if mentions_planck:
        return ("GENERIC FALLBACK (MIXED)",
                "The response defaults to Planck-scale explanations. "
                "This can be acceptable, but watch for unjustified "
                "certainty.")

    return ("OTHER / UNCLASSIFIED",
            "The response doesn't match the main expected patterns. "
            "Inspect manually to decide if it's reasonable.")


label, rationale = classify_control_response(control_response)

print(f"Class: {label}")
print(f"Why: {rationale}")
print()
print("Raw response:")
print(control_response)
print()
```
Wow, right!? Seeing it all in one shot shows just how much work we got through.
The code above works, but it has some pedagogical rough edges. Let’s refactor it to make the testing concepts clearer. Rather than go through this step by step, in which the only thing I would likely be testing is everyone’s patience, I’ll show you what I did to refactor the logic and then explain the key elements to notice.
```python
from dotenv import load_dotenv

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_community.chat_message_histories import SQLChatMessageHistory

env = load_dotenv(".env")

store = {}
session_id = "jeff-chat"

MODEL = "qwen2.5:latest"
USE_SQLITE = False
DB = "jeff-chat.db"

# ============================================================
# SESSION HISTORY MANAGEMENT
# ============================================================


def read_session_history(session_id: str) -> BaseChatMessageHistory:
    if USE_SQLITE:
        return SQLChatMessageHistory(
            session_id=session_id,
            connection=f"sqlite:///{DB}"
        )
    else:
        if (session_id not in store):
            store[session_id] = ChatMessageHistory()
        return store[session_id]


read_session_history(session_id).clear()
read_session_history("control-session").clear()

# ============================================================
# MODEL SETUP
# ============================================================

model = ChatOllama(
    model=MODEL,
    base_url="http://localhost:11434",
)

template = ChatPromptTemplate.from_messages([
    ("system", "Please answer as concisely as possible."),
    ("placeholder", "{history}"),
    ("human", "{prompt}")
])

chain = template | model | StrOutputParser()

history = RunnableWithMessageHistory(
    chain,
    read_session_history,
    input_messages_key="prompt",
    history_messages_key="history"
)

# ============================================================
# CONVERSATION EXECUTION
# ============================================================

prompt1 = "What is the smallest possible length?"
prompt2 = "What is the smallest possible time?"
prompt3 = "Do those values define the minimal scale of physical events?"

response1 = history.invoke(
    {"prompt": prompt1},
    config={"configurable": {"session_id": session_id}}
)

response2 = history.invoke(
    {"prompt": prompt2},
    config={"configurable": {"session_id": session_id}}
)

response3 = history.invoke(
    {"prompt": prompt3},
    config={"configurable": {"session_id": session_id}}
)

print("=" * 60)
print("CONVERSATION WITH HISTORY")
print("=" * 60)

print(f"Q: {prompt1}")
print(f"A: {response1}")
print()

print(f"Q: {prompt2}")
print(f"A: {response2}")
print()

print(f"Q: {prompt3}")
print(f"A: {response3}")
print()

# ============================================================
# HISTORY INSPECTION
# ============================================================

print("=" * 60)
print("INSPECTING CONVERSATION HISTORY")
print("=" * 60)

session = read_session_history(session_id)

print(f"Total messages in history: {len(session.messages)}")
print("\nMessage contents:")

for i, msg in enumerate(session.messages, 1):
    role = msg.__class__.__name__.replace("Message", "")
    content_str = str(msg.content)

    if len(content_str) > 100:
        content = content_str[:100] + "..."
    else:
        content = content_str

    print(f"  {i}. [{role}] {content}")

print()

# ============================================================
# CONTROL COMPARISON (No History)
# ============================================================

print("=" * 60)
print("CONTROL: SAME QUESTION WITHOUT HISTORY")
print("=" * 60)

control_response = history.invoke(
    {"prompt": prompt3},
    config={"configurable": {"session_id": "control-session"}}
)

print(f"Without context: {control_response}")
print()

# ============================================================
# LIGHTWEIGHT INVARIANTS (Harness sanity checks)
# ============================================================

print("=" * 60)
print("HARNESS INVARIANTS")
print("=" * 60)


def check_role_alternation(roles: list[str]) -> bool:
    """
    Verify roles alternate: Human, AI, Human, AI, ...

    Even positions (0,2,4...) must be Human.
    Odd positions (1,3,5...) must be AI.

    Usage:
        alternation_ok = check_role_alternation(roles)
    """
    for idx, role in enumerate(roles):
        if idx % 2 == 0 and role != "HumanMessage":
            return False
        if idx % 2 == 1 and role != "AIMessage":
            return False
    return True


def check_invariants(name: str, session_id: str, expected_turns: int):
    """
    Verify conversation history meets basic sanity checks.

    Invariants:
    1. Message count matches expected turns (1 turn = human + AI)
    2. Roles strictly alternate (Human, AI, Human, AI, ...)
    3. All messages have non-empty content
    """
    session = read_session_history(session_id)
    msgs = session.messages

    # Invariant 1: Correct message count
    # Each turn = 1 human message + 1 AI response
    expected_messages = expected_turns * 2
    count_ok = (len(msgs) == expected_messages)

    # Invariant 2: Strict alternation of roles
    # Even indices (0,2,4...) should be Human
    # Odd indices (1,3,5...) should be AI
    roles = [m.__class__.__name__ for m in msgs]
    alternation_ok = check_role_alternation(roles)

    # Invariant 3: No empty messages
    nonempty_ok = all(str(m.content).strip() for m in msgs)

    # Report results
    status_count = "PASS" if count_ok else "FAIL"
    status_alternation = "PASS" if alternation_ok else "FAIL"
    status_nonempty = "PASS" if nonempty_ok else "FAIL"

    print(f"Session: {name} ({session_id})")
    print(f"  Message count: {len(msgs)} "
          f"(expected {expected_messages}) -> {status_count}")
    print(f"  Role alternation: {roles[:6]}"
          f"{'...' if len(roles) > 6 else ''} -> "
          f"{status_alternation}")
    print(f"  Non-empty content -> {status_nonempty}")
    print()


check_invariants("main", session_id, expected_turns=3)
check_invariants("control", "control-session", expected_turns=1)

# ============================================================
# OUTCOME CLASSIFICATION PATTERNS
# ============================================================

# Phrases observed in model responses when handling questions
# with missing referents. Extend these as you test more models.

CLARIFICATION_PHRASES = [
    "what do you mean",
    "which values",
    "what values",
    "those values refer",
    "can you clarify",
    "could you clarify",
    "clarify",
    "which ones",
    "what are those"
]

UNCERTAINTY_PHRASES = [
    "without context",
    "without more context",
    "not enough context",
    "i don't have",
    "i don't know which",
    "unclear",
    "ambiguous",
    "depends on what you mean"
]

FALSE_CONFIDENCE_PHRASES = [
    "the values are",
    "those values are",
    "you mean",
    "as we discussed",
    "as mentioned earlier",
    "as i said",
    "as i told you"
]

PLANCK_PHRASES = [
    "planck",
    "quantum",
    "scale",
    "fundamental",
    "minimum length",
    "minimum time"
]

# ============================================================
# ACCEPTABLE OUTCOME CLASSES (Oracle-lite)
# ============================================================

print("=" * 60)
print("OUTCOME CLASSIFICATION (CONTROL RESPONSE)")
print("=" * 60)


def classify_control_response(response: str) -> tuple[str, str]:
    """
    Classify how the model handles a question with missing referents.

    Good responses: ask for clarification or admit uncertainty
    Bad responses: confidently infer non-existent prior context
    """
    r_low = response.lower()

    # Pattern 1: Asking "which values?" or "what do you mean?"
    asks_question = response.endswith("?")
    seeks_clarification = any(phrase in r_low
                              for phrase in CLARIFICATION_PHRASES)

    if asks_question and seeks_clarification:
        return ("CLARIFICATION-SEEKING (GOOD)",
                "Requests missing referents for 'those values'")

    # Pattern 2: Saying "I don't know without context"
    admits_uncertainty = any(phrase in r_low
                             for phrase in UNCERTAINTY_PHRASES)
    mentions_planck = any(phrase in r_low
                          for phrase in PLANCK_PHRASES)

    if admits_uncertainty and mentions_planck:
        return ("HEDGED GENERIC FALLBACK (GOOD)",
                "Notes missing context, stays general")

    if admits_uncertainty:
        return ("UNCERTAINTY ACKNOWLEDGED (GOOD)",
                "Flags ambiguity without overcommitting")

    # Pattern 3: Saying "as we discussed..." (but we didn't!)
    false_confidence = any(phrase in r_low
                           for phrase in FALSE_CONFIDENCE_PHRASES)
    has_number = any(ch.isdigit() for ch in r_low)

    if false_confidence:
        return ("CONFIDENT SPECIFICITY (SUSPICIOUS)",
                "Asserts prior context that doesn't exist")

    if has_number and mentions_planck and not admits_uncertainty:
        return ("CONFIDENT SPECIFICITY (SUSPICIOUS)",
                "Infers specific values without justification")

    # Pattern 4: Generic fallback to domain knowledge
    if mentions_planck:
        return ("GENERIC FALLBACK (MIXED)",
                "Defaults to Planck-scale explanation")

    return ("UNCLASSIFIED", "Inspect manually")


label, rationale = classify_control_response(control_response)

print(f"Class: {label}")
print(f"Why: {rationale}")
print()
print("Raw response:")
print(control_response)
print()
```
In the “CONVERSATION EXECUTION” section, notice I extracted the prompts into variables (prompt1, prompt2, prompt3) rather than writing them twice: once in the invoke() call and again in the print() statement. This follows the DRY principle: “Don’t Repeat Yourself.”
Whether and to what extent to apply the DRY principle in testing-related code has long been debated. I don’t plan to settle that debate. What I will say is that when you’re building test harnesses, focusing (at least to some extent) on DRY isn’t just about code aesthetics. If you need to refine your test prompts (and you will; testing AI systems is iterative), you want to change them in exactly one place. Duplication creates maintenance headaches: you modify the prompt in the invoke call but forget to update the print statement, and suddenly your output logs don’t match what you actually asked the model.
More importantly, the variables make it easy to reference specific turns in your analysis. Later, when I inspect the control response, I can clearly say “the third prompt deliberately uses a referent (‘those values’) that requires prior context.” The variable name prompt3 gives me a clean handle for discussing that specific test case.
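If you wanted to push the "change it in exactly one place" idea a little further, one option is to keep the prompts in a list and loop over them. The following is a minimal sketch, not part of the harness above. The names `PROMPTS`, `run_conversation`, and `fake_invoke` are hypothetical; `fake_invoke` is a stand-in for `history.invoke()` so the example runs without a model.

```python
# Hypothetical stand-in for history.invoke(): it only echoes the prompt,
# so this sketch runs without Ollama or LangChain installed.
def fake_invoke(payload: dict, config: dict) -> str:
    return f"(answer to: {payload['prompt']})"


# The prompts live in exactly one place.
PROMPTS = [
    "What is the smallest possible length?",
    "What is the smallest possible time?",
    "Do those values define the minimal scale of physical events?",
]


def run_conversation(invoke, prompts: list[str],
                     session_id: str) -> list[tuple[str, str]]:
    """Send each prompt once and pair it with the response it produced."""
    config = {"configurable": {"session_id": session_id}}
    transcript = []
    for prompt in prompts:
        response = invoke({"prompt": prompt}, config)
        transcript.append((prompt, response))
    return transcript


transcript = run_conversation(fake_invoke, PROMPTS, "jeff-chat")

# The printed question is always the prompt that was actually sent.
for question, answer in transcript:
    print(f"Q: {question}")
    print(f"A: {answer}")
    print()
```

The trade-off is that you lose the named handle `prompt3`; you would refer to `PROMPTS[2]` instead. For a three-turn example, the named variables in the refactored code are arguably easier to talk about.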
In the “LIGHTWEIGHT INVARIANTS” section, notice I extracted the alternation check into its own function, check_role_alternation(). This isn’t strictly necessary for this simple harness, but it does illustrate a useful pattern: when you’re building test infrastructure, isolating individual checks makes them easier to debug, test, and reuse. If I later wanted to check alternation in a different context (say, verifying a conversation loaded from a database) I could call this function directly.
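As a rough illustration of the "easier to test" point, the extracted function can be exercised with hand-written role lists, with no model and no history store involved. This sketch repeats the function body so it runs on its own; the sample lists are made up for the example.

```python
def check_role_alternation(roles: list[str]) -> bool:
    """Same logic as the harness function, repeated so the sketch runs alone."""
    for idx, role in enumerate(roles):
        if idx % 2 == 0 and role != "HumanMessage":
            return False
        if idx % 2 == 1 and role != "AIMessage":
            return False
    return True


# Hand-written role lists; these are illustrative, not real histories.
well_formed = ["HumanMessage", "AIMessage", "HumanMessage", "AIMessage"]
doubled_human = ["HumanMessage", "HumanMessage", "AIMessage"]
starts_with_ai = ["AIMessage", "HumanMessage"]

assert check_role_alternation(well_formed) is True
assert check_role_alternation(doubled_human) is False
assert check_role_alternation(starts_with_ai) is False

# An empty list and a lone human message both pass alternation,
# which is why the separate message-count invariant is still needed.
assert check_role_alternation([]) is True
assert check_role_alternation(["HumanMessage"]) is True

print("alternation checks behave as expected")
```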
In this same section, I also felt I had a problem with my check_invariants() function. It does something conceptually simple: verify that a conversation history looks well-formed. However, the code didn’t announce what it’s doing clearly enough. The modulo arithmetic (idx % 2), which is now refactored into the above function, was correct, but not self-documenting.
Adding a docstring that lists the three invariants upfront creates a conceptual roadmap. Then, section comments (“Invariant 1:”, “Invariant 2:”) create landmarks as you read through. The inline explanations (“Even indices (0,2,4…) should be Human”) can help transform mysterious arithmetic into explicit rules.
The biggest issue I felt I had was in the “ACCEPTABLE OUTCOME CLASSES” section, specifically in my classify_control_response() function. The original version mixed what we’re looking for (specific phrases) with why we’re looking for it (detecting patterns). When you’re reading through 30+ string literals inline, it’s hard to see the forest for the trees.
I realized I could separate concerns by extracting the phrase lists to module-level constants. This gives me two benefits. First, the constants become documentation; they show exactly what patterns have been observed across different models. Second, the classification function can focus on logic rather than listing strings. When you read the function, you see “Pattern 1: Asking for clarification” rather than wading through nine different ways to ask “what do you mean?”
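One side effect of having the phrase lists at module level is that other helpers can use them too. As a small sketch, here is a hypothetical `matched_phrases()` helper (not part of the harness) that reports which phrases actually fired, which can be handy when you’re trying to understand why a response landed in a given class. Only two of the four lists are repeated here to keep the example short.

```python
CLARIFICATION_PHRASES = [
    "what do you mean", "which values", "what values",
    "those values refer", "can you clarify", "could you clarify",
    "clarify", "which ones", "what are those",
]

UNCERTAINTY_PHRASES = [
    "without context", "without more context", "not enough context",
    "i don't have", "i don't know which", "unclear", "ambiguous",
    "depends on what you mean",
]


def matched_phrases(text: str, phrases: list[str]) -> list[str]:
    """Return the phrases found in text (case-insensitive), in list order."""
    lowered = text.lower()
    return [phrase for phrase in phrases if phrase in lowered]


# Invented sample responses, used only to show the helper.
print(matched_phrases("Could you clarify which values you mean?",
                      CLARIFICATION_PHRASES))
print(matched_phrases("I don't have enough context.",
                      UNCERTAINTY_PHRASES))
```

Note that the match is a plain substring test, so "clarify" fires whenever "could you clarify" does. That’s fine for a lightweight classifier, but it’s worth knowing when you extend the lists.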
These refactorings follow roughly the same pattern: separate mechanism from meaning. Constants capture the empirical observations (these are the phrases we’ve seen). Functions capture the conceptual framework (these are the patterns we’re testing for). Comments explain the bridge between them.
For a testing harness, in particular, this matters more than it would in, say, a traditional test script. I say that because you’re not just making the code work; you’re teaching readers of your code how to think about AI behavior systematically.
This isn’t just cleaner code. It’s clearer thinking! When you’re building testing harnesses, you’re building conceptual tools. Any refactoring I do in this context makes those concepts visible.
The Test Report
Let’s focus on the output we get from this test. Here I won’t reproduce all of what you might see. What I do want to make clear is that you’re not just looking at program output. You are looking at a structured test report. Each section tells you something specific about how the conversational AI system behaved under test conditions.
The Main Experiment (Conversation with History)
This section shows the actual conversation flow. The model receives three sequential prompts and successfully tracks context across turns. You should be able to notice how the third response indicates that the model is clearly referencing the specific values it mentioned in the previous two answers. This demonstrates that the history mechanism is working; the model has access to prior turns.
History Inspection
This is your “ground truth” verification. You’re looking under the hood to confirm that the conversation history contains exactly what you expect: six messages (three human, three AI), properly alternated, all non-empty. If this section showed only two messages or revealed gaps in the history, you would know something broke in your session management. This is basic infrastructure validation.
The Control Experiment
Here’s where the test design pays off. We ask the exact same third question (“Do those values define the minimal scale of physical events?”) but to a fresh session with no history. The model has no prior context (no previous mention of Planck length or Planck time) yet the question uses the referent “those values.”
How does the model respond? Well, generally, it makes an educated guess. It falls back to general physics knowledge and talks about Planck units, but you’ll likely notice the language is more generic and hedged. It doesn’t (and shouldn’t) say something like “The Planck length and time [that we just discussed]…” because there was no prior discussion.
Harness Invariants
These are your sanity checks: automated verification that the test infrastructure itself is working correctly. Both sessions pass all three checks: correct message counts, proper role alternation, no empty content. If any of these failed, you would know the problem was with your test harness, not the model’s behavior.
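The harness prints PASS or FAIL for a human to read. If you wanted a script to react to a broken harness (for example, to stop before classifying anything), one possible variation is to return the invariants as data. This is a sketch under assumptions: `collect_invariants` is a hypothetical name, and the two small classes are stand-ins for LangChain’s message classes, on the assumption that only the class name and a `content` attribute matter for these checks.

```python
from dataclasses import dataclass


# Stand-ins for LangChain's message classes (assumption: the checks only
# look at the class name and the .content attribute).
@dataclass
class HumanMessage:
    content: str


@dataclass
class AIMessage:
    content: str


def collect_invariants(msgs: list, expected_turns: int) -> dict[str, bool]:
    """Return each invariant as a named boolean instead of printing it."""
    roles = [m.__class__.__name__ for m in msgs]
    return {
        "count": len(msgs) == expected_turns * 2,
        "alternation": all(
            role == ("HumanMessage" if idx % 2 == 0 else "AIMessage")
            for idx, role in enumerate(roles)
        ),
        "nonempty": all(str(m.content).strip() for m in msgs),
    }


# Invented histories: one well-formed, one with a whitespace-only reply.
good = [HumanMessage("Hi"), AIMessage("Hello")]
broken = [HumanMessage("Hi"), AIMessage("   ")]

good_result = collect_invariants(good, expected_turns=1)
broken_result = collect_invariants(broken, expected_turns=1)

failed = [name for name, ok in broken_result.items() if not ok]
print(f"good history passes everything: {all(good_result.values())}")
print(f"broken history fails: {failed}")
```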
Outcome Classification
This is your automated oracle: a lightweight classifier that categorizes the control response. In most cases, you’ll likely get “GENERIC FALLBACK (MIXED).” The model likely defaulted to talking about Planck scales (reasonable domain knowledge) but likely didn’t explicitly request clarification about “those values” (which would have been ideal epistemic behavior).
Lots of “likely” in what I just said. Keep in mind something we talked about in the previous post: the classification isn’t pass/fail; it’s descriptive. “MIXED” means “this behavior is acceptable but not optimal.” A “GOOD” classification would mean the model asked “Which values?” or said “I don’t have enough context.” A “SUSPICIOUS” classification would mean the model confidently asserted it had discussed specific values when it hadn’t.
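To see these labels side by side without calling a model, you can feed hand-written responses through the same decision order. The sketch below repeats the phrase lists and condenses the refactored logic into a hypothetical `classify_label()` that returns only the label (rationales omitted) so it runs on its own. The sample responses are invented for illustration; they are not real model output.

```python
CLARIFICATION_PHRASES = [
    "what do you mean", "which values", "what values",
    "those values refer", "can you clarify", "could you clarify",
    "clarify", "which ones", "what are those",
]
UNCERTAINTY_PHRASES = [
    "without context", "without more context", "not enough context",
    "i don't have", "i don't know which", "unclear", "ambiguous",
    "depends on what you mean",
]
FALSE_CONFIDENCE_PHRASES = [
    "the values are", "those values are", "you mean",
    "as we discussed", "as mentioned earlier", "as i said",
    "as i told you",
]
PLANCK_PHRASES = [
    "planck", "quantum", "scale", "fundamental",
    "minimum length", "minimum time",
]


def classify_label(response: str) -> str:
    """Same decision order as classify_control_response(), label only."""
    r_low = response.lower()

    def has(phrases: list[str]) -> bool:
        return any(phrase in r_low for phrase in phrases)

    if response.endswith("?") and has(CLARIFICATION_PHRASES):
        return "CLARIFICATION-SEEKING (GOOD)"

    admits_uncertainty = has(UNCERTAINTY_PHRASES)
    mentions_planck = has(PLANCK_PHRASES)
    if admits_uncertainty and mentions_planck:
        return "HEDGED GENERIC FALLBACK (GOOD)"
    if admits_uncertainty:
        return "UNCERTAINTY ACKNOWLEDGED (GOOD)"

    if has(FALSE_CONFIDENCE_PHRASES):
        return "CONFIDENT SPECIFICITY (SUSPICIOUS)"
    has_number = any(ch.isdigit() for ch in r_low)
    if has_number and mentions_planck and not admits_uncertainty:
        return "CONFIDENT SPECIFICITY (SUSPICIOUS)"

    if mentions_planck:
        return "GENERIC FALLBACK (MIXED)"
    return "UNCLASSIFIED"


# Invented responses, one per outcome family.
SAMPLES = [
    "Which values are you referring to?",
    "It is unclear what you are referring to.",
    "As we discussed, the values are exact.",
    "The Planck length and Planck time are often treated as fundamental limits.",
]

labels = [classify_label(sample) for sample in SAMPLES]
for sample, label in zip(SAMPLES, labels):
    print(f"{label:38} <- {sample}")
```

Running canned responses like this is also a cheap way to check the classifier itself before trusting what it says about a live model.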
Why This Matters
This output structure (experiment, inspection, control, validation, classification) is reproducible. You can run this same harness against different local models and, in fact, against distributed models (GPT-4, Claude, Grok, etc.) and compare their classifications. You can modify the prompts and see how behavior changes. You can add more invariants or refine your classification patterns.
So, again, I will say that this isn’t just output. It’s a test report based on reproducible evidence. You’re not just running code. You’re systematically probing how conversational AI systems handle context, ambiguity, and missing information.
Next Steps!
This was a bit of an interlude post to take us from our initial testing example to scaling that example up. In the next post, we’ll do exactly that type of scaling and we’ll start with the refactored test we ended up with in this post.