AI and Testing: Refactoring Tests

In the previous post, we refined an AI test case that we had built earlier as a testing example. In this brief post, I want to show a refactoring of that code. We will also align on the output of this test.

Refactoring Exercise

I’m choosing to focus on refactoring because it’s something that often doesn’t get talked about. Developers perform this activity all the time and, certainly, test engineers (such as those writing automation code) should be aware of the practice. First, let’s consider the code we ended up with:

from dotenv import load_dotenv
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_community.chat_message_histories import SQLChatMessageHistory

env = load_dotenv(".env")
store = {}
session_id = "jeff-chat"

MODEL = "qwen2.5:latest"
USE_SQLITE = False
DB = "jeff-chat.db"

# ============================================================
# SESSION HISTORY MANAGEMENT
# ============================================================

def read_session_history(session_id: str) -> BaseChatMessageHistory:
  if USE_SQLITE:
    return SQLChatMessageHistory(
      session_id=session_id,
      connection=f"sqlite:///{DB}"
    )
  else:
    if (session_id not in store):
      store[session_id] = ChatMessageHistory()

    return store[session_id]

read_session_history(session_id).clear()
read_session_history("control-session").clear()

# ============================================================
# MODEL SETUP
# ============================================================

model = ChatOllama(
  model=MODEL,
  base_url="http://localhost:11434",
)

template = ChatPromptTemplate.from_messages([
  ("system", "Please answer as concisely as possible."),
  ("placeholder", "{history}"),
  ("human", "{prompt}")
])

chain = template | model | StrOutputParser()

history = RunnableWithMessageHistory(
  chain,
  read_session_history,
  input_messages_key="prompt",
  history_messages_key="history"
)

# ============================================================
# CONVERSATION EXECUTION
# ============================================================

response1 = history.invoke(
  {"prompt": "What is the smallest possible length?"},
  config={"configurable": {"session_id": session_id}}
)

response2 = history.invoke(
  {"prompt": "What is the smallest possible time?"},
  config={"configurable": {"session_id": session_id}}
)

response3 = history.invoke(
  {"prompt": "Do those values define the minimal scale of physical events?"},
  config={"configurable": {"session_id": session_id}}
)

print("=" * 60)
print("CONVERSATION WITH HISTORY")
print("=" * 60)
print(response1, end="\n\n")
print(response2, end="\n\n")
print(response3, end="\n\n")

# ============================================================
# HISTORY INSPECTION
# ============================================================

print("=" * 60)
print("INSPECTING CONVERSATION HISTORY")
print("=" * 60)

session = read_session_history(session_id)

print(f"Total messages in history: {len(session.messages)}")
print("\nMessage contents:")

for i, msg in enumerate(session.messages, 1):
  role = msg.__class__.__name__.replace("Message", "")
  content_str = str(msg.content)

  if len(content_str) > 100:
    content = content_str[:100] + "..."
  else:
    content = content_str

  print(f"  {i}. [{role}] {content}")

print()

# ============================================================
# CONTROL COMPARISON (No History)
# ============================================================

print("=" * 60)
print("CONTROL: SAME QUESTION WITHOUT HISTORY")
print("=" * 60)

control_response = history.invoke(
  {"prompt": "Do those values define the minimal scale of physical events?"},
  config={"configurable": {"session_id": "control-session"}}
)

print(f"Without context: {control_response}")
print()

# ============================================================
# LIGHTWEIGHT INVARIANTS (Harness sanity checks)
# ============================================================

print("=" * 60)
print("HARNESS INVARIANTS")
print("=" * 60)

def check_invariants(name: str, session_id: str, expected_turns: int):
  session = read_session_history(session_id)
  msgs = session.messages
  expected_messages = expected_turns * 2  # each turn = human + ai

  # 1) Count invariant
  count_ok = (len(msgs) == expected_messages)

  # 2) Role alternation invariant: HumanMessage, AIMessage, HumanMessage, ...
  roles = [m.__class__.__name__ for m in msgs]
  alternation_ok = True

  for idx, role in enumerate(roles):
    if idx % 2 == 0 and role != "HumanMessage":
      alternation_ok = False
      break

    if idx % 2 == 1 and role != "AIMessage":
      alternation_ok = False
      break

  # 3) Non-empty content invariant (useful to catch weird parsing/empty messages)
  nonempty_ok = all(str(m.content).strip() for m in msgs)

  # Report
  status_count = "PASS" if count_ok else "FAIL"
  status_alternation = "PASS" if alternation_ok else "FAIL"
  status_nonempty = "PASS" if nonempty_ok else "FAIL"

  print(f"Session: {name} ({session_id})")
  print(f"  Messages: {len(msgs)} (expected {expected_messages}) -> {status_count}")
  print(f"  Role order: {roles[:6]}{'...' if len(roles) > 6 else ''} -> {status_alternation}")
  print(f"  Non-empty content -> {status_nonempty}")
  print()

# Main conversation had 3 turns
check_invariants("main", session_id, expected_turns=3)

# Control conversation had 1 turn
check_invariants("control", "control-session", expected_turns=1)

print()

# ============================================================
# ACCEPTABLE OUTCOME CLASSES (Oracle-lite)
# ============================================================

print("=" * 60)
print("OUTCOME CLASSIFICATION (CONTROL RESPONSE)")
print("=" * 60)

def classify_control_response(response: str) -> tuple[str, str]:
  r = response.strip()
  r_low = r.lower()

  # Signals that the model is asking for missing referents
  clarification_markers = [
    "what do you mean", "which values", "what values",
    "those values refer", "can you clarify", "could you clarify",
    "clarify", "which ones", "what are those"
  ]

  # Signals that the model acknowledges missing context / uncertainty
  uncertainty_markers = [
    "without context", "without more context", "not enough context",
    "i don't have", "i don't know which", "unclear", "ambiguous",
    "depends on what you mean"
  ]

  # Signals of generic physics fallback (often fine if it stays
  # general / hedged)
  planck_markers = [
    "planck", "quantum", "scale", "fundamental",
    "minimum length", "minimum time"
  ]

  # Signals of confident specificity (riskier in the control session)
  specificity_markers = [
    "the values are", "those values are", "you mean", "as we discussed",
    "as mentioned earlier", "as i said", "as i told you"
  ]

  # Heuristic: look for numbers/units that could indicate the
  # model is inventing specifics. (Not always bad, but suspicious
  # if no context was provided.)
  has_number = any(ch.isdigit() for ch in r_low)
  mentions_planck = any(m in r_low for m in planck_markers)

  asks_for_clarification = any(m in r_low for m in clarification_markers) or r.endswith("?")
  acknowledges_uncertainty = any(m in r_low for m in uncertainty_markers)
  confident_specific = any(m in r_low for m in specificity_markers)

  # Classification logic (simple on purpose)
  if asks_for_clarification:
    return ("CLARIFICATION-SEEKING (GOOD)",
            "The response requests missing referents for "
            "“those values,” which fits the no-history condition.")

  if acknowledges_uncertainty and mentions_planck:
    return ("HEDGED GENERIC FALLBACK (GOOD)",
            "The response notes missing context and then stays "
            "general (e.g., Planck-scale discussion).")

  if acknowledges_uncertainty and not mentions_planck:
    return ("UNCERTAINTY ACKNOWLEDGED (GOOD)",
            "The response explicitly flags ambiguity or missing "
            "context without overcommitting to specifics.")

  # Here’s the main “smell test”: confident + specific,
  # especially with numbers, in a control session
  if confident_specific or (has_number and mentions_planck and not acknowledges_uncertainty):
    return ("CONFIDENT SPECIFICITY (SUSPICIOUS)",
            "The response appears to infer or assert specific "
            "prior values despite having no session context.")

  if mentions_planck:
    return ("GENERIC FALLBACK (MIXED)",
            "The response defaults to Planck-scale explanations. "
            "This can be acceptable, but watch for unjustified "
            "certainty.")

  return ("OTHER / UNCLASSIFIED",
          "The response doesn't match the main expected patterns. "
          "Inspect manually to decide if it's reasonable.")

label, rationale = classify_control_response(control_response)

print(f"Class: {label}")
print(f"Why:   {rationale}")
print()
print("Raw response:")
print(control_response)
print()

Wow, right!? Seeing it all in one shot shows just how much work we got through.

The code above works, but it has some pedagogical rough edges. Let’s refactor it to make the testing concepts clearer. Rather than go through this step by step, in which the only thing I would likely be testing is everyone’s patience, I’ll show you what I did to refactor the logic and then explain the key elements to notice.

from dotenv import load_dotenv
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_community.chat_message_histories import SQLChatMessageHistory

env = load_dotenv(".env")
store = {}
session_id = "jeff-chat"

MODEL = "qwen2.5:latest"
USE_SQLITE = False
DB = "jeff-chat.db"

# ============================================================
# SESSION HISTORY MANAGEMENT
# ============================================================

def read_session_history(session_id: str) -> BaseChatMessageHistory:
  if USE_SQLITE:
    return SQLChatMessageHistory(
      session_id=session_id,
      connection=f"sqlite:///{DB}"
    )
  else:
    if (session_id not in store):
      store[session_id] = ChatMessageHistory()

    return store[session_id]

read_session_history(session_id).clear()
read_session_history("control-session").clear()

# ============================================================
# MODEL SETUP
# ============================================================

model = ChatOllama(
  model=MODEL,
  base_url="http://localhost:11434",
)

template = ChatPromptTemplate.from_messages([
  ("system", "Please answer as concisely as possible."),
  ("placeholder", "{history}"),
  ("human", "{prompt}")
])

chain = template | model | StrOutputParser()

history = RunnableWithMessageHistory(
  chain,
  read_session_history,
  input_messages_key="prompt",
  history_messages_key="history"
)

# ============================================================
# CONVERSATION EXECUTION
# ============================================================

prompt1 = "What is the smallest possible length?"
prompt2 = "What is the smallest possible time?"
prompt3 = "Do those values define the minimal scale of physical events?"

response1 = history.invoke(
  {"prompt": prompt1},
  config={"configurable": {"session_id": session_id}}
)

response2 = history.invoke(
  {"prompt": prompt2},
  config={"configurable": {"session_id": session_id}}
)

response3 = history.invoke(
  {"prompt": prompt3},
  config={"configurable": {"session_id": session_id}}
)

print("=" * 60)
print("CONVERSATION WITH HISTORY")
print("=" * 60)
print(f"Q: {prompt1}")
print(f"A: {response1}")
print()
print(f"Q: {prompt2}")
print(f"A: {response2}")
print()
print(f"Q: {prompt3}")
print(f"A: {response3}")
print()

# ============================================================
# HISTORY INSPECTION
# ============================================================

print("=" * 60)
print("INSPECTING CONVERSATION HISTORY")
print("=" * 60)

session = read_session_history(session_id)

print(f"Total messages in history: {len(session.messages)}")
print("\nMessage contents:")

for i, msg in enumerate(session.messages, 1):
  role = msg.__class__.__name__.replace("Message", "")
  content_str = str(msg.content)

  if len(content_str) > 100:
    content = content_str[:100] + "..."
  else:
    content = content_str

  print(f"  {i}. [{role}] {content}")

print()

# ============================================================
# CONTROL COMPARISON (No History)
# ============================================================

print("=" * 60)
print("CONTROL: SAME QUESTION WITHOUT HISTORY")
print("=" * 60)

control_response = history.invoke(
  {"prompt": prompt3},
  config={"configurable": {"session_id": "control-session"}}
)

print(f"Without context: {control_response}")
print()

# ============================================================
# LIGHTWEIGHT INVARIANTS (Harness sanity checks)
# ============================================================

print("=" * 60)
print("HARNESS INVARIANTS")
print("=" * 60)

def check_role_alternation(roles: list[str]) -> bool:
  """
  Verify roles alternate: Human, AI, Human, AI, ...
  Even positions (0,2,4...) must be Human.
  Odd positions (1,3,5...) must be AI.
  """
  for idx, role in enumerate(roles):
    if idx % 2 == 0 and role != "HumanMessage":
      return False
    if idx % 2 == 1 and role != "AIMessage":
      return False

  return True

def check_invariants(name: str, session_id: str, expected_turns: int):
  """
  Verify conversation history meets basic sanity checks.

  Invariants:
  1. Message count matches expected turns (1 turn = human + AI)
  2. Roles strictly alternate (Human, AI, Human, AI, ...)
  3. All messages have non-empty content
  """
  session = read_session_history(session_id)
  msgs = session.messages

  # Invariant 1: Correct message count
  # Each turn = 1 human message + 1 AI response
  expected_messages = expected_turns * 2
  count_ok = (len(msgs) == expected_messages)

  # Invariant 2: Strict alternation of roles
  # Even indices (0,2,4...) should be Human
  # Odd indices (1,3,5...) should be AI
  roles = [m.__class__.__name__ for m in msgs]
  alternation_ok = check_role_alternation(roles)

  # Invariant 3: No empty messages
  nonempty_ok = all(str(m.content).strip() for m in msgs)

  # Report results
  status_count = "PASS" if count_ok else "FAIL"
  status_alternation = "PASS" if alternation_ok else "FAIL"
  status_nonempty = "PASS" if nonempty_ok else "FAIL"

  print(f"Session: {name} ({session_id})")
  print(f"  Message count: {len(msgs)} "
        f"(expected {expected_messages}) -> {status_count}")
  print(f"  Role alternation: {roles[:6]}"
        f"{'...' if len(roles) > 6 else ''} -> "
        f"{status_alternation}")
  print(f"  Non-empty content -> {status_nonempty}")
  print()

check_invariants("main", session_id, expected_turns=3)
check_invariants("control", "control-session", expected_turns=1)

# ============================================================
# OUTCOME CLASSIFICATION PATTERNS
# ============================================================
# Phrases observed in model responses when handling questions
# with missing referents. Extend these as you test more models.

CLARIFICATION_PHRASES = [
  "what do you mean", "which values", "what values",
  "those values refer", "can you clarify",
  "could you clarify", "clarify", "which ones",
  "what are those"
]

UNCERTAINTY_PHRASES = [
  "without context", "without more context",
  "not enough context", "i don't have",
  "i don't know which", "unclear", "ambiguous",
  "depends on what you mean"
]

FALSE_CONFIDENCE_PHRASES = [
  "the values are", "those values are", "you mean",
  "as we discussed", "as mentioned earlier",
  "as i said", "as i told you"
]

PLANCK_PHRASES = [
  "planck", "quantum", "scale", "fundamental",
  "minimum length", "minimum time"
]

# ============================================================
# ACCEPTABLE OUTCOME CLASSES (Oracle-lite)
# ============================================================

print("=" * 60)
print("OUTCOME CLASSIFICATION (CONTROL RESPONSE)")
print("=" * 60)

def classify_control_response(response: str) -> tuple[str, str]:
  """
  Classify how the model handles a question with missing
  referents.

  Good responses: ask for clarification or admit uncertainty
  Bad responses: confidently infer non-existent prior context
  """
  r_low = response.lower()

  # Pattern 1: Asking "which values?" or "what do you mean?"
  asks_question = response.strip().endswith("?")
  seeks_clarification = any(phrase in r_low
                            for phrase in CLARIFICATION_PHRASES)

  if asks_question and seeks_clarification:
    return ("CLARIFICATION-SEEKING (GOOD)",
            "Requests missing referents for 'those values'")

  # Pattern 2: Saying "I don't know without context"
  admits_uncertainty = any(phrase in r_low
                          for phrase in UNCERTAINTY_PHRASES)
  mentions_planck = any(phrase in r_low
                       for phrase in PLANCK_PHRASES)

  if admits_uncertainty and mentions_planck:
    return ("HEDGED GENERIC FALLBACK (GOOD)",
            "Notes missing context, stays general")

  if admits_uncertainty:
    return ("UNCERTAINTY ACKNOWLEDGED (GOOD)",
            "Flags ambiguity without overcommitting")

  # Pattern 3: Saying "as we discussed..." (but we didn't!)
  false_confidence = any(phrase in r_low
                        for phrase in FALSE_CONFIDENCE_PHRASES)
  has_number = any(ch.isdigit() for ch in r_low)

  if false_confidence:
    return ("CONFIDENT SPECIFICITY (SUSPICIOUS)",
            "Asserts prior context that doesn't exist")

  if has_number and mentions_planck and not admits_uncertainty:
    return ("CONFIDENT SPECIFICITY (SUSPICIOUS)",
            "Infers specific values without justification")

  # Pattern 4: Generic fallback to domain knowledge
  if mentions_planck:
    return ("GENERIC FALLBACK (MIXED)",
            "Defaults to Planck-scale explanation")

  return ("UNCLASSIFIED", "Inspect manually")

label, rationale = classify_control_response(control_response)

print(f"Class: {label}")
print(f"Why:   {rationale}")
print()
print("Raw response:")
print(control_response)
print()
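
One practical payoff of pulling check_role_alternation out into its own function is that it now takes a plain list of role names, so it can be exercised without a model, a session store, or Ollama running. What follows is a minimal sketch of that idea, not part of the script above: the function is repeated so the snippet runs on its own, and the sample role lists and their names are made up purely for illustration.

```python
def check_role_alternation(roles: list[str]) -> bool:
  """Return True when roles alternate Human, AI, Human, AI, ..."""
  for idx, role in enumerate(roles):
    if idx % 2 == 0 and role != "HumanMessage":
      return False
    if idx % 2 == 1 and role != "AIMessage":
      return False

  return True

# Hypothetical role sequences (illustrative only, not real model output)
samples = {
  "well_formed": ["HumanMessage", "AIMessage", "HumanMessage", "AIMessage"],
  "doubled_human": ["HumanMessage", "HumanMessage", "AIMessage"],
  "starts_with_ai": ["AIMessage", "HumanMessage"],
  "empty": [],
}

for name, roles in samples.items():
  status = "PASS" if check_role_alternation(roles) else "FAIL"
  print(f"{name}: {status}")

# Prints:
# well_formed: PASS
# doubled_human: FAIL
# starts_with_ai: FAIL
# empty: PASS
```

Note that the empty list passes: with no messages there is nothing to violate the alternation rule. That is one reason the message-count invariant is checked separately in check_invariants rather than being folded into this function.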

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

300

301

302

303

304

305

from dotenv import load_dotenv

from langchain_ollama import ChatOllama

from langchain_core.prompts import ChatPromptTemplate

from langchain_core.runnables import chain

from langchain_core.output_parsers import StrOutputParser

from langchain_core.runnables.history import RunnableWithMessageHistory

from langchain_core.chat_history import BaseChatMessageHistory

from langchain_community.chat_message_histories import ChatMessageHistory

from langchain_community.chat_message_histories import SQLChatMessageHistory

env = load_dotenv(".env")

store = {}

session_id = "jeff-chat"

MODEL = "qwen2.5:latest"

USE_SQLITE = False

DB = "jeff-chat.db"

# ============================================================

# SESSION HISTORY MANAGEMENT

# ============================================================

def read_session_history(session_id: str) -> BaseChatMessageHistory:

if USE_SQLITE:

return SQLChatMessageHistory(

session_id=session_id,

connection=f"sqlite:///{DB}"

)

else:

if (session_id not in store):

store[session_id] = ChatMessageHistory()

return store[session_id]

read_session_history(session_id).clear()

read_session_history("control-session").clear()

# ============================================================

# MODEL SETUP

# ============================================================

model = ChatOllama(

model=MODEL,

base_url="http://localhost:11434",

)

template = ChatPromptTemplate.from_messages([

("system", "Please answer as concisely as possible."),

("placeholder", "{history}"),

("human", "{prompt}")

])

chain = template | model | StrOutputParser()

history = RunnableWithMessageHistory(

chain,

read_session_history,

input_messages_key="prompt",

history_messages_key="history"

)

# ============================================================

# CONVERSATION EXECUTION

# ============================================================

prompt1 = "What is the smallest possible length?"

prompt2 = "What is the smallest possible time?"

prompt3 = "Do those values define the minimal scale of physical events?"

response1 = history.invoke(

{"prompt": prompt1},

config={"configurable": {"session_id": session_id}}

)

response2 = history.invoke(

{"prompt": prompt2},

config={"configurable": {"session_id": session_id}}

)

response3 = history.invoke(

{"prompt": prompt3},

config={"configurable": {"session_id": session_id}}

)

print("=" * 60)

print("CONVERSATION WITH HISTORY")

print("=" * 60)

print(f"Q: {prompt1}")

print(f"A: {response1}")

print()

print(f"Q: {prompt2}")

print(f"A: {response2}")

print()

print(f"Q: {prompt3}")

print(f"A: {response3}")

print()

# ============================================================
# HISTORY INSPECTION
# ============================================================

print("=" * 60)
print("INSPECTING CONVERSATION HISTORY")
print("=" * 60)

session = read_session_history(session_id)
print(f"Total messages in history: {len(session.messages)}")
print("\nMessage contents:")

for i, msg in enumerate(session.messages, 1):
  role = msg.__class__.__name__.replace("Message", "")
  content_str = str(msg.content)
  if len(content_str) > 100:
    content = content_str[:100] + "..."
  else:
    content = content_str
  print(f" {i}. [{role}] {content}")

print()

# ============================================================
# CONTROL COMPARISON (No History)
# ============================================================

print("=" * 60)
print("CONTROL: SAME QUESTION WITHOUT HISTORY")
print("=" * 60)

control_response = history.invoke(
  {"prompt": prompt3},
  config={"configurable": {"session_id": "control-session"}}
)

print(f"Without context: {control_response}")
print()

# ============================================================
# LIGHTWEIGHT INVARIANTS (Harness sanity checks)
# ============================================================

print("=" * 60)
print("HARNESS INVARIANTS")
print("=" * 60)

def check_role_alternation(roles: list[str]) -> bool:
  """
  Verify roles alternate: Human, AI, Human, AI, ...

  Even positions (0,2,4...) must be Human.
  Odd positions (1,3,5...) must be AI.
  """
  for idx, role in enumerate(roles):
    if idx % 2 == 0 and role != "HumanMessage":
      return False
    if idx % 2 == 1 and role != "AIMessage":
      return False
  return True

def check_invariants(name: str, session_id: str, expected_turns: int):
  """
  Verify conversation history meets basic sanity checks.

  Invariants:
  1. Message count matches expected turns (1 turn = human + AI)
  2. Roles strictly alternate (Human, AI, Human, AI, ...)
  3. All messages have non-empty content
  """
  session = read_session_history(session_id)
  msgs = session.messages

  # Invariant 1: Correct message count
  # Each turn = 1 human message + 1 AI response
  expected_messages = expected_turns * 2
  count_ok = (len(msgs) == expected_messages)

  # Invariant 2: Strict alternation of roles
  # Even indices (0,2,4...) should be Human
  # Odd indices (1,3,5...) should be AI
  roles = [m.__class__.__name__ for m in msgs]
  alternation_ok = check_role_alternation(roles)

  # Invariant 3: No empty messages
  nonempty_ok = all(str(m.content).strip() for m in msgs)

  # Report results
  status_count = "PASS" if count_ok else "FAIL"
  status_alternation = "PASS" if alternation_ok else "FAIL"
  status_nonempty = "PASS" if nonempty_ok else "FAIL"

  print(f"Session: {name} ({session_id})")
  print(f" Message count: {len(msgs)} "
        f"(expected {expected_messages}) -> {status_count}")
  print(f" Role alternation: {roles[:6]}"
        f"{'...' if len(roles) > 6 else ''} -> "
        f"{status_alternation}")
  print(f" Non-empty content -> {status_nonempty}")
  print()

check_invariants("main", session_id, expected_turns=3)
check_invariants("control", "control-session", expected_turns=1)

# ============================================================
# OUTCOME CLASSIFICATION PATTERNS
# ============================================================

# Phrases observed in model responses when handling questions
# with missing referents. Extend these as you test more models.

CLARIFICATION_PHRASES = [
  "what do you mean", "which values", "what values",
  "those values refer", "can you clarify",
  "could you clarify", "clarify", "which ones",
  "what are those"
]

UNCERTAINTY_PHRASES = [
  "without context", "without more context",
  "not enough context", "i don't have",
  "i don't know which", "unclear", "ambiguous",
  "depends on what you mean"
]

FALSE_CONFIDENCE_PHRASES = [
  "the values are", "those values are", "you mean",
  "as we discussed", "as mentioned earlier",
  "as i said", "as i told you"
]

PLANCK_PHRASES = [
  "planck", "quantum", "scale", "fundamental",
  "minimum length", "minimum time"
]

# ============================================================
# ACCEPTABLE OUTCOME CLASSES (Oracle-lite)
# ============================================================

print("=" * 60)
print("OUTCOME CLASSIFICATION (CONTROL RESPONSE)")
print("=" * 60)

def classify_control_response(response: str) -> tuple[str, str]:
  """
  Classify how the model handles a question with missing
  referents.

  Good responses: ask for clarification or admit uncertainty
  Bad responses: confidently infer non-existent prior context
  """
  r_low = response.lower()

  # Pattern 1: Asking "which values?" or "what do you mean?"
  asks_question = response.endswith("?")
  seeks_clarification = any(phrase in r_low
                            for phrase in CLARIFICATION_PHRASES)

  if asks_question and seeks_clarification:
    return ("CLARIFICATION-SEEKING (GOOD)",
            "Requests missing referents for 'those values'")

  # Pattern 2: Saying "I don't know without context"
  admits_uncertainty = any(phrase in r_low
                           for phrase in UNCERTAINTY_PHRASES)
  mentions_planck = any(phrase in r_low
                        for phrase in PLANCK_PHRASES)

  if admits_uncertainty and mentions_planck:
    return ("HEDGED GENERIC FALLBACK (GOOD)",
            "Notes missing context, stays general")

  if admits_uncertainty:
    return ("UNCERTAINTY ACKNOWLEDGED (GOOD)",
            "Flags ambiguity without overcommitting")

  # Pattern 3: Saying "as we discussed..." (but we didn't!)
  false_confidence = any(phrase in r_low
                         for phrase in FALSE_CONFIDENCE_PHRASES)
  has_number = any(ch.isdigit() for ch in r_low)

  if false_confidence:
    return ("CONFIDENT SPECIFICITY (SUSPICIOUS)",
            "Asserts prior context that doesn't exist")

  if has_number and mentions_planck and not admits_uncertainty:
    return ("CONFIDENT SPECIFICITY (SUSPICIOUS)",
            "Infers specific values without justification")

  # Pattern 4: Generic fallback to domain knowledge
  if mentions_planck:
    return ("GENERIC FALLBACK (MIXED)",
            "Defaults to Planck-scale explanation")

  return ("UNCLASSIFIED", "Inspect manually")

label, rationale = classify_control_response(control_response)

print(f"Class: {label}")
print(f"Why: {rationale}")
print()
print("Raw response:")
print(control_response)
print()

In the “CONVERSATION EXECUTION” section, notice I extracted the prompts into variables (prompt1, prompt2, prompt3) rather than writing them twice: once in the invoke() call and again in the print() statement. This follows the DRY principle: “Don’t Repeat Yourself.”

Whether and to what extent to apply the DRY principle in testing-related code has long been debated. I don’t plan to settle that debate. What I will say is that when you’re building test harnesses, focusing (at least to some extent) on DRY isn’t just about code aesthetics. If you need to refine your test prompts (and you will; testing AI systems is iterative), you want to change them in exactly one place. Duplication creates maintenance headaches: you modify the prompt in the invoke call but forget to update the print statement, and suddenly your output logs don’t match what you actually asked the model.

More importantly, the variables make it easy to reference specific turns in your analysis. Later, when I inspect the control response, I can clearly say “the third prompt deliberately uses a referent (‘those values’) that requires prior context.” The variable name prompt3 gives me a clean handle for discussing that specific test case.
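As a minimal sketch of that single-source-of-truth idea, the prompts could go one step further and live in a single list that both the invocation and the reporting read from. This is an illustration, not the harness itself: `PROMPTS`, `run_conversation()`, and `fake_invoke()` are hypothetical names, and `fake_invoke()` is a stand-in for `history.invoke()` so the sketch runs without a model.

```python
# Prompts are defined exactly once; everything else reads from this list.
PROMPTS = [
  "What is the smallest possible length?",
  "What is the smallest possible time?",
  "Do those values define the minimal scale of physical events?",
]

def fake_invoke(prompt: str) -> str:
  """Hypothetical stand-in for history.invoke(); returns a canned reply."""
  return f"(model reply to: {prompt})"

def run_conversation(prompts: list[str], invoke=fake_invoke) -> list[tuple[str, str]]:
  """Send each prompt in order and pair it with the response it produced."""
  return [(prompt, invoke(prompt)) for prompt in prompts]

transcript = run_conversation(PROMPTS)

# The report prints the same strings that were sent, so logs cannot drift.
for question, answer in transcript:
  print(f"Q: {question}")
  print(f"A: {answer}")
  print()
```

The trade-off is that a list loses the named handle (`prompt3`) that makes a specific turn easy to talk about, which is one reason the harness keeps separate variables.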

In the “LIGHTWEIGHT INVARIANTS” section, notice I extracted the alternation check into its own function, check_role_alternation(). This isn’t strictly necessary for this simple harness, but it does illustrate a useful pattern: when you’re building test infrastructure, isolating individual checks makes them easier to debug, test, and reuse. If I later wanted to check alternation in a different context (say, verifying a conversation loaded from a database) I could call this function directly.
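To make the reuse point concrete, here is a small sketch that calls the alternation check on plain lists of role names, with no session or model involved. The function body is repeated so the example runs on its own, and the three role sequences are made up for illustration.

```python
def check_role_alternation(roles: list[str]) -> bool:
  """Repeated from the harness so this sketch is self-contained."""
  for idx, role in enumerate(roles):
    if idx % 2 == 0 and role != "HumanMessage":
      return False
    if idx % 2 == 1 and role != "AIMessage":
      return False
  return True

# Hypothetical role sequences, as they might look after loading stored sessions.
cases = {
  "well_formed": ["HumanMessage", "AIMessage", "HumanMessage", "AIMessage"],
  "missing_reply": ["HumanMessage", "HumanMessage", "AIMessage"],
  "starts_with_ai": ["AIMessage", "HumanMessage"],
}

results = {name: check_role_alternation(roles) for name, roles in cases.items()}

for name, ok in results.items():
  print(f"{name}: {'PASS' if ok else 'FAIL'}")
```

One thing this makes easy to see: an empty list passes the alternation check, so alternation alone doesn't prove a history exists. In the harness, the message count invariant covers that case.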

In this same section, I also felt I had a problem with my check_invariants() function. It does something conceptually simple: verify that a conversation history looks well-formed. However, the code didn’t announce what it was doing clearly enough. The modulo arithmetic (idx % 2), which is now refactored into the above function, was correct, but not self-documenting.

Adding a docstring that lists the three invariants upfront creates a conceptual roadmap. Then, section comments (“Invariant 1:”, “Invariant 2:”) create landmarks as you read through. The inline explanations (“Even indices (0,2,4…) should be Human”) can help transform mysterious arithmetic into explicit rules.

The biggest issue I felt I had was in the “ACCEPTABLE OUTCOME CLASSES” section, specifically in my classify_control_response() function. The original version mixed what we’re looking for (specific phrases) with why we’re looking for it (detecting patterns). When you’re reading through 30+ string literals inline, it’s hard to see the forest for the trees.

I realized I could separate concerns by extracting the phrase lists to module-level constants. This gives me two benefits. First, the constants become documentation; they show exactly what patterns have been observed across different models. Second, the classification function can focus on logic rather than listing strings. When you read the function, you see “Pattern 1: Asking for clarification” rather than wading through nine different ways to ask “what do you mean?”

These refactorings follow roughly the same pattern: separate mechanism from meaning. Constants capture the empirical observations (these are the phrases we’ve seen). Functions capture the conceptual framework (these are the patterns we’re testing for). Comments explain the bridge between them.
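The split between data and logic can be sketched in a few lines. The phrase lists below are shortened subsets of the constants in the listing, and `matches_any()` is a hypothetical helper that does not appear in the harness; it simply names the `any(phrase in r_low ...)` expression the classifier repeats.

```python
# Data: phrases observed in responses (shortened, illustrative subsets).
CLARIFICATION_PHRASES = ["what do you mean", "which values", "clarify"]
UNCERTAINTY_PHRASES = ["without context", "not enough context", "ambiguous"]

def matches_any(text: str, phrases: list[str]) -> bool:
  """Hypothetical helper: True if any phrase occurs in text, ignoring case."""
  lowered = text.lower()
  return any(phrase in lowered for phrase in phrases)

# Logic: the calling code reads as a statement about behavior, not strings.
sample = "Could you clarify which values you mean?"
seeks_clarification = matches_any(sample, CLARIFICATION_PHRASES)
admits_uncertainty = matches_any(sample, UNCERTAINTY_PHRASES)

print(f"seeks clarification: {seeks_clarification}")
print(f"admits uncertainty: {admits_uncertainty}")
```

Adding a newly observed phrase then means editing a list, and the function that encodes the reasoning stays untouched.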

For a testing harness in particular, this matters more than it would in, say, a traditional test script. You’re not just making the code work; you’re teaching readers of your code how to think about AI behavior systematically.

This isn’t just cleaner code. It’s clearer thinking! When you’re building testing harnesses, you’re building conceptual tools. Any refactoring I do in this context makes those concepts visible.

The Test Report

Let’s focus on the output we get from this test. Here I won’t reproduce all of what you might see. What I do want to make clear is that you’re not just looking at program output. You are looking at a structured test report. Each section tells you something specific about how the conversational AI system behaved under test conditions.

The Main Experiment (Conversation with History)

This section shows the actual conversation flow. The model receives three sequential prompts and tracks context across turns. Notice how the third response refers back to the specific values the model gave in its previous two answers. This demonstrates that the history mechanism is working; the model has access to prior turns.

History Inspection

This is your “ground truth” verification. You’re looking under the hood to confirm that the conversation history contains exactly what you expect: six messages (three human, three AI), properly alternated, all non-empty. If this section showed only two messages or revealed gaps in the history, you would know something broke in your session management. This is basic infrastructure validation.

The Control Experiment

Here’s where the test design pays off. We ask the exact same third question (“Do those values define the minimal scale of physical events?”) but to a fresh session with no history. The model has no prior context (no previous mention of Planck length or Planck time) yet the question uses the referent “those values.”

How does the model respond? Well, generally, it makes an educated guess. It falls back to general physics knowledge and talks about Planck units, but you’ll likely notice the language is more generic and hedged. It doesn’t (and shouldn’t) say something like “The Planck length and time [that we just discussed]…” because there was no prior discussion.

Harness Invariants

These are your sanity checks: automated verification that the test infrastructure itself is working correctly. Both sessions pass all three checks: correct message counts, proper role alternation, no empty content. If any of these failed, you would know the problem was with your test harness, not the model’s behavior.
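If you wanted the harness to stop on a broken invariant rather than only print PASS or FAIL, one option is a variant that returns its results. The sketch below is an assumption about how that might look, not the harness’s code: `evaluate_invariants()` is a hypothetical name, and it takes plain (role, content) pairs instead of reading a session so it can run without LangChain.

```python
def evaluate_invariants(messages: list[tuple[str, str]],
                        expected_turns: int) -> dict[str, bool]:
  """Hypothetical variant of check_invariants() that returns results.

  Takes (role, content) pairs and reports the same three invariants.
  """
  roles = [role for role, _ in messages]
  return {
    # Invariant 1: each turn is one human message plus one AI response.
    "count": len(messages) == expected_turns * 2,
    # Invariant 2: even indices are Human, odd indices are AI.
    "alternation": all(
      role == ("HumanMessage" if idx % 2 == 0 else "AIMessage")
      for idx, role in enumerate(roles)
    ),
    # Invariant 3: no message is empty or whitespace-only.
    "nonempty": all(content.strip() for _, content in messages),
  }

# Made-up histories: one healthy, one with a blank AI reply.
good = [("HumanMessage", "Hi"), ("AIMessage", "Hello")]
broken = [("HumanMessage", "Hi"), ("AIMessage", "   ")]

good_result = evaluate_invariants(good, expected_turns=1)
broken_result = evaluate_invariants(broken, expected_turns=1)

print(good_result)
print(broken_result)

# A caller could refuse to classify anything until the harness is sound.
failed = [name for name, ok in broken_result.items() if not ok]
print(f"Failed invariants: {failed}")
```

Returning a dictionary keeps reporting and decision-making separate: the same results can feed a printed report, an assertion, or a skipped classification step.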

Outcome Classification

This is your automated oracle: a lightweight classifier that categorizes the control response. In most cases, you’ll likely get “GENERIC FALLBACK (MIXED).” The model likely defaulted to talking about Planck scales (reasonable domain knowledge) but likely didn’t explicitly request clarification about “those values” (which would have been ideal epistemic behavior).

Lots of “likely” in what I just said. Keep in mind something we talked about in the previous post: the classification isn’t pass/fail; it’s descriptive. “MIXED” means “this behavior is acceptable but not optimal.” A “GOOD” classification would mean the model asked “Which values?” or said “I don’t have enough context.” A “SUSPICIOUS” classification would mean the model confidently asserted it had discussed specific values when it hadn’t.

Why This Matters

This output structure (experiment, inspection, control, validation, classification) is reproducible. You can run this same harness against different local models and, in fact, against distributed models (GPT-4, Claude, Grok, etc.) and compare their classifications. You can modify the prompts and see how behavior changes. You can add more invariants or refine your classification patterns.
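A comparison across models might be tallied along these lines. Everything in this sketch is illustrative: the model names and responses are invented rather than recorded results, and `simple_classify()` is a much-reduced stand-in for `classify_control_response()`. In a real run, each response would come from sending the third prompt to a fresh session on that model.

```python
from collections import Counter

# Invented control responses keyed by made-up model names (not real results).
CONTROL_RESPONSES = {
  "model-a": "Which values are you referring to?",
  "model-b": "The Planck length and Planck time are often treated as lower bounds.",
  "model-c": "As we discussed, those values are the limits.",
}

def simple_classify(response: str) -> str:
  """Much-reduced stand-in for classify_control_response()."""
  r_low = response.lower()
  if r_low.rstrip().endswith("?") and "which values" in r_low:
    return "GOOD"
  if "as we discussed" in r_low:
    return "SUSPICIOUS"
  if "planck" in r_low:
    return "MIXED"
  return "UNCLASSIFIED"

# Classify every model's control response, then count labels across models.
labels = {name: simple_classify(text) for name, text in CONTROL_RESPONSES.items()}
tally = Counter(labels.values())

for name, label in labels.items():
  print(f"{name}: {label}")
print(dict(tally))
```

Because model output can vary from run to run, a tally like this is more informative when each model is sampled several times rather than once.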

So, again, I will say that this isn’t just output. It’s a test report based on reproducible evidence. You’re not just running code. You’re systematically probing how conversational AI systems handle context, ambiguity, and missing information.

Next Steps!

This was a bit of an interlude post to take us from our initial testing example to scaling that example up. In the next post, we’ll do exactly that type of scaling and we’ll start with the refactored test we ended up with in this post.

Stories from a Software Tester
