AI and Testing: Personal Marketability

In the posts in this series, I’ve been taking you through a lot of concepts and tooling. That’s going to continue but, for this post, it felt prudent to take a little break and talk about why doing all this can matter. That gets into interviewing and potentially being hired.

Right now it’s hard not to notice that many opportunities are touting the desire for testers to have some knowledge of AI. This can be as simple as using AI to help write tests, or using AI-based testing tools (such as Functionize, Mabl, Testim, Applitools, TestRigor, and Virtuoso QA), or actually being able to test an AI that the company itself is using or creating.

Two Paths to AI Testing Fluency

When job descriptions mention “AI experience in testing,” they’re usually referring to one of two distinct skill sets, and it’s worth understanding the difference.

AI as Your Testing Assistant

The first, and often more immediately accessible, path is using AI tools to make your existing testing work ostensibly more efficient. This is where you’re still writing the tests, but AI is helping you write them faster and (allegedly) better.

Tools like GitHub Copilot or Cursor can generate boilerplate test code, suggest assertions based on your function signatures, or help you write complex selectors for web automation. Playwright now integrates with AI assistants to help debug flaky tests or generate page object models from your application’s actual DOM structure. Model Context Protocol (MCP) servers can connect AI assistants directly to your testing infrastructure, letting you ask questions like “why is this test failing in CI but passing locally?” and get answers grounded in your actual logs and system state.

This tier of AI usage doesn’t require you to understand how large language models or generative AI work internally. You’re using AI the same way you would use any other productivity tool: it suggests, you evaluate, you accept or reject. The AI handles the mechanical work (remembering syntax, generating repetitive code, searching documentation) while you handle the intellectual work of deciding what needs testing and whether the AI’s suggestions are correct.

Interview questions at this level tend to focus on practical workflow: “How have you incorporated AI tools into your test development process?” or “What challenges have you encountered when using AI-generated test code?” The key is demonstrating that you can leverage these tools effectively while maintaining quality standards. You still need to understand testing principles; the AI just makes you potentially faster at applying them.

As such, I consider this kind of thing very low-hanging fruit and anyone who specializes in testing should have no problems accommodating this part of the evolving technology landscape.

This is an area where I’m not sure if it eventually makes sense to take this series. It’s clearly one of the more immediately relevant short-term topics; but it’s also (1) the easiest path for people to explore and (2) likely not as impactful in the relevant long-term.

AI as Your Testing Target

The second path, and the one this series is primarily focusing on for now, is fundamentally different. Here, you’re not using AI as a tool; you’re treating AI as the system under test. The application you’re validating is an LLM or incorporates AI-driven features, and you need to determine whether it behaves appropriately.

This requires a different skillset entirely. You need to understand how AI systems behave: they’re probabilistic rather than deterministic, they can be influenced by conversation history, they may hallucinate plausible-sounding nonsense, they respond differently to semantically identical prompts. Testing them means designing experiments that expose these behaviors, building classification systems to evaluate outputs that don’t have simple pass/fail criteria, and measuring qualities like “epistemic appropriateness” that don’t exist in traditional software testing.

The interview questions shift accordingly. Instead of “how do you use AI tools?” you get “how do you design KPIs for an LLM?” or “how do you test whether a model maintains factual accuracy when given misleading context?” These questions probe whether you can think systematically about AI behavior as a quality assurance problem.

Why Both Matter for Marketability

Here’s the practical reality: most testing roles that mention AI are looking for Tier 1 skills: people who can use AI assistants to write better tests faster. These are immediate productivity gains that don’t require organizational transformation.

But the market is shifting. As more companies deploy AI features in their products, they need people who can do Tier 2 work: actually test whether those AI features behave reliably. And here’s the thing: Tier 2 skills are rarer and harder to develop, which makes them more valuable.

The good news is that developing Tier 2 skills naturally makes you better at Tier 1. Once you understand how LLMs and generative AI actually work (what they’re good at, what they struggle with, how they fail) you become much more effective at using them as coding assistants. You know when to trust their suggestions and when to be skeptical. You can craft better prompts because you understand what information the model actually needs.

So, while this series focuses on Tier 2 skills, you’re also building Tier 1 fluency as a side effect. And that combination of using AI tools effectively while being able to test AI systems rigorously is increasingly what “AI experience in testing” means in practice.

With that context established, let’s look at the kinds of interview questions you’re likely to encounter, particularly for roles that need Tier 2 capabilities.

Tier 2 Interviewing

Here are some common interview questions for Tier 2 roles, or at least variants of them:

Our customer is implementing an LLM and you have to test it once the LLM is created with local data. Discuss five top use cases and KPIs that come to mind to test a language model.
We have a customer who is trying to redefine the way they do QA and are looking for an architecture strategy and an AI strategy. Can you tell me how you would approach these two?
Describe your understanding of LLMs, SLM, and DLM.
How do you evaluate whether a QA use case is a good fit for AI or Generative AI?
How do you ensure responsible, secure, and compliant use of AI models in QA workflows?
How do you design data pipelines for AI-driven QA use cases like defect prediction or test optimization?

Rather than just answering these questions directly, I want to show you how working through the examples in this series builds the fluency you need to formulate your own thoughtful responses: whether you’re working through my tutorials or anyone else’s.

This is all about personal marketability. Even in the best of times, it can be difficult to land the opportunity. At the time I write this post, we are far from the best of times. There are far more people looking for opportunities than there are opportunities to be filled. Having one more element of marketability thus rarely hurts. Arguably, that’s always true.

From Test Harness to Testing Strategy

The code we’ve built so far isn’t just a technical exercise. It’s a framework for thinking systematically about AI quality assurance. Each experiment revealed principles that scale beyond Planck constants and conversation history. And, sure, that sounds great, but let’s explore how these principles apply to real-world testing scenarios.

What Makes a Good AI Test Case?

Notice what made our experiments effective: they weren’t random prompts. Each test isolated a specific behavior (context tracking, epistemic resistance, response variance) and created conditions where that behavior became observable.

So, in an interview, when someone asks “what are the top use cases and KPIs for testing an LLM?”, they’re really asking: what behaviors matter most, and how do you make them measurable? Our experiments suggest several principles.

Test for consistency under contradiction. Our contradiction sandwich experiments revealed that models can give different answers to identical prompts depending on conversation history. In production, this means testing whether your LLM maintains factual accuracy when users provide misleading context. A KPI example might be “contradiction resistance rate,” referring to the percentage of times the model challenges obviously false claims rather than incorporating them into responses.

Test for appropriate uncertainty. Our control experiment showed whether models request clarification when given ambiguous referents. In production, this translates to testing whether your LLM appropriately says “I don’t know” or asks for clarification rather than hallucinating. A KPI example might be “ambiguity detection rate,” referring to the percentage of queries with missing context that trigger clarification requests rather than confident guesses.

Test for variance under identical conditions. Our clean trials experiment measured response stability. In production, this matters for user trust. If the same question gets wildly different answers on different days, users lose confidence. A KPI example would be “response stability score,” referring to a similarity metric across N trials with identical inputs.

Test the boundaries between knowledge and inference. Our experiments distinguished numerical facts (Planck constants) from conceptual reasoning (microscope observability). In production, you need to know where your model’s reliable knowledge ends and its probabilistic inference begins. A KPI example would be “domain boundary accuracy,” referring to correctness rate when answering questions at the edge of training data coverage.

Architecture and AI Strategy for QA Transformation

The architecture we built (session management, classification oracles, invariant checking) was, granted, very simple. Yet, even in that simplicity, it revealed how to think about AI-augmented testing infrastructure.

Separate mechanism from evaluation. Notice how our code distinguishes between running experiments (the invoke calls) and evaluating outcomes (the classification functions). This separation is critical for AI QA strategy. Your testing infrastructure needs, at minimum, two layers: (1) execution harnesses that interact with AI systems programmatically, and (2) evaluation frameworks that classify outcomes systematically.

Why does this matter? Because AI outputs are probabilistic. Traditional testing asks “does this function return 42?” and gets a binary yes/no (as well as the meaning of life, the universe and everything if you’re a Hitchhiker’s Guide to the Galaxy fan). AI testing asks “does this response demonstrate appropriate epistemic caution?” and gets a distribution of outcomes that need classification. Your architecture, and the test thinking that goes into it, must accommodate this fundamental difference.

Design for observability, not just correctness. Our experiments included extensive inspection sections: viewing session history, detecting markers, and sampling full responses. This isn’t defensive programming; it’s recognizing that AI testing requires understanding why a model behaved a certain way, not just whether it passed or failed.

This is important! An AI testing strategy should include: (1) structured logging of model inputs and outputs, (2) automated classification of response patterns, (3) variance tracking across trials, and (4) diagnostic tools to inspect edge cases. Our code demonstrates all four at a small scale.

Build reusable test primitives. Functions like our seed_contradiction_sandwich() and check_role_alternation() are composable building blocks. A mature AI QA architecture needs a library of such primitives: prompt templates for different test scenarios, classification functions for common response patterns, session setup utilities for various context conditions.

Understanding Model Types and Their Testing Implications

Our experiments focused on conversational AI (LLMs with memory), but the principles extend to other model types as well.

Large Language Models (LLMs) like GPT-4 or Claude handle general-purpose text generation. Testing here focuses on factual accuracy, contextual reasoning, instruction following, and hallucination resistance, to name just a few. Our contradiction sandwich directly tests hallucination resistance.

Small Language Models (SLMs) are optimized for specific domains or resource constraints. Testing here emphasizes domain-specific accuracy, inference latency, resource efficiency under load, and graceful degradation when out-of-domain. You could adapt our variance experiments to measure consistency across different computational budgets.

Domain Language Models (DLMs) are fine-tuned for specialized tasks (legal, medical, code). Testing here requires domain expert validation, specialized benchmarks, regulatory compliance verification, and bias detection within domain context. Our epistemic resistance tests become critical: does a medical DLM defer to incorrect patient history, or does it challenge obvious errors?

Clearly, the testing approach in the details differs for each, but the underlying question remains: how does this model behave when its training data conflicts with its runtime context?

Evaluating AI Fitness for QA Use Cases

Our experiments also revealed criteria for when AI augmentation makes sense in QA.

High variance in expected outputs. Our variance experiments showed that LLMs don’t produce identical outputs for identical inputs. This is fine for tasks like “generate test scenarios” or “write bug descriptions” where a certain amount of variety is acceptable. It’s problematic for tasks like “validate this calculation” where consistency is essential. An operative question to be asking: Does this QA task benefit from creative variety, or does it require deterministic correctness? If the latter, traditional automation may be more appropriate than Generative AI.

Tolerance for approximate correctness. Our classification oracles used categories like “LIKELY OK (BUT WATCH CONFIDENCE)” rather than binary pass/fail. This reflects AI’s probabilistic nature. Some QA tasks can tolerate this (exploratory testing, edge case generation), while others cannot (compliance verification, security validation). An operative question to be asking: Can this QA workflow tolerate occasional errors that require human review, or does it need guaranteed correctness?

Availability of ground truth for evaluation. Our experiments worked because, going with our last examples, we knew the correct values of the Planck constants. We could build oracles to detect when models got them wrong. In actual usage, you need similar ground truth to validate AI-generated test cases, bug reports, or coverage analyses. An operative question to be asking here is: Do we have reliable oracles to evaluate this AI’s QA outputs, or are we just trusting probabilistic outputs without validation?

Responsible and Secure AI in QA

Our experiments included safety checks: invariant verification, marker detection, manual review categories. These aren’t optional niceties; they’re essential for responsible AI deployment.

Auditability: Every experiment logged full conversation history, classifications, and raw responses. In production QA, you need similar audit trails. If an AI-generated test case misses a critical bug, you need to reconstruct why the AI missed it. Our inspection sections modeled this requirement.

Human oversight: We classified responses into categories requiring manual review. Production AI QA systems need similar escalation paths. Not every AI output should go directly into your test suite without human validation.

Bias and fairness: Our recent experiments focused on physics facts, but the same framework applies to bias testing. Seed conversations with demographic information, vary it systematically, measure whether model responses change inappropriately. The contradiction sandwich structure works perfectly for this: seed a stereotype, correct it, reintroduce it and then check if the model maintains the correction or reverts to bias.

Data privacy: We used public domain physics facts. Production QA often involves customer data, PII, or proprietary information. Your AI testing infrastructure needs data governance: what data can be sent to which models? How do you sanitize test data? Our session management code would need encryption, access controls, and compliance logging in production. If we wanted to test realistic data but not actually use real data, we need to be thinking about synthetic data generation.

Designing AI-Ready Data Pipelines

Our experiments hinted at data pipeline requirements without making them explicit. So let me surface them here.

Versioned session storage: We used in-memory dictionaries, but production needs persistent, versioned storage. Something similar to our SQLite implementation. When an AI misclassifies a defect, you need to retrieve the exact conversation history that led to that classification. This requires session databases with timestamps, conversation versioning, and the ability to replay historical sessions.

Structured classification outputs: Our classifiers returned tuples in the form of (label, rationale). Production pipelines need richer schemas: confidence scores, detected markers, uncertainty flags, timestamps. This data feeds downstream analytics, tracking which classifications are most common, whether accuracy improves over time, which edge cases cause failures, and so on.

Feedback loops: Our experiments were one-directional: run test, get classification, done. Production AI QA needs feedback loops. When humans override an AI classification, that becomes training data for improving the classifier. This requires human annotation interfaces, disagreement tracking, and periodic retraining of evaluation models.

Multi-model orchestration: We tested one model (Qwen 2.5), at least for our core test example project. Production might use different models for different tasks: one for test generation, another for classification, a third for bug description. Your pipeline architecture needs model routing logic, result aggregation across models, and cost/latency tradeoffs per model.

The Underlying Pattern

Every interview question above asks some version of: How do you think systematically about AI behavior?

Our test harness demonstrates the start of the answer: isolate specific behaviors, create conditions where they’re observable, classify outcomes into meaningful categories, measure variance, and inspect edge cases. You might notice that this isn’t specific to conversational AI: it’s a general framework for making AI quality measurable.

You might also notice that what I described there is really just Test Thinking 101; not specifically AI Test Thinking 101.

So, when you’re asked “how would you approach an AI strategy for QA?”, you’re being asked to demonstrate all of this systematic thinking at organizational scale. The experiments we built are the atoms; the strategy is the molecule built from those atoms.

Quality Thinking Remains the Foundation

There’s understandable anxiety in the testing community about AI replacing testers. (Just as there is for developers.) But here’s what that concern misses: the skills that make you good at testing are exactly the skills that make you valuable in an AI-augmented world. That’s the case whether you’re using AI to assist your testing or testing AI systems themselves.

Quality and test specialists have always needed an eye for spotting ambiguity, inconsistency, and contradiction. You look at a requirements document and notice where two statements can’t both be true. You examine a user interface and spot where the behavior contradicts the stated intent. You read test results and detect where the data doesn’t align with expectations. This isn’t a skill AI replaces. It’s a skill AI desperately needs applied to it.

When you test an LLM, you’re doing the same fundamental work: spotting where the model’s response contains internal contradictions, identifying ambiguity in how it interprets prompts, detecting inconsistency across similar queries. Our contradiction sandwich experiments were exactly this: using a tester’s instinct for inconsistency to expose how models can be led astray by misleading context. Every classification oracle we built was an exercise in spotting ambiguity: does this response show appropriate uncertainty, or is it confidently wrong?

And crucially, good testers know you have to look for confirmation, falsification, and implausification in equal measure. You don’t just ask “does this prove the system works?” (confirmation). You also ask “what would prove it doesn’t work?” (falsification) and “what would make this explanation less plausible?” (implausification). This balanced skepticism is what our experiments modeled: we didn’t just confirm that models could answer questions correctly; we actively tried to falsify their reliability under contradiction, and we looked for conditions that would make their confident responses implausible.

I cannot stress enough that AI doesn’t eliminate the need for this kind of thinking. Not even a little bit. Instead, it amplifies the need for it. Probabilistic systems that can hallucinate plausible-sounding nonsense require testers who can spot when something sounds right but isn’t. Models that behave differently based on subtle prompt variations need testers who notice inconsistency. Systems that confidently extrapolate beyond their training data need testers asking “what would falsify this claim?”

So, yes, the tools are changing. The systems under test are changing. But the fundamental skill (systematic, skeptical, rigorous thinking about risks related to various qualities) remains as valuable as ever. Perhaps more so. The testers who struggle in an AI-augmented world won’t be the ones replaced by AI; they’ll be the ones who never developed that underlying experimentation and testing instinct in the first place. If you have it, you’re not being replaced. You’re being given more powerful systems to apply it to.

Next Steps!

Obviously this post was not a deep dive into all the possible questions you can get or the possible (and relevant) answers you could give. I just wanted to show that there was thought behind the examples I’ve provided so far in this series. Time is money, in a sense, and I want to make sure you feel your time is well spent. I’ve been crafting my examples to draw out specific lessons or, at least, to start generating certain ways of thinking.

All of this is critical because test thinking, and the quality emphasis that lies behind test thinking, must be in place before too much tooling gets introduced. Speaking of tooling, I’m going to jump into exactly that in the next post.

One thought on “AI and Testing: Personal Marketability”

Phil Kirkham says:

2 February 2026 at 9:25 am

“I want to make sure you feel your time is well spent” – this has been one of THE best uses of my time the last few weeks – and all for free.
This post was fantastic at tying everything together

Stories from a Software Tester

Twice upon a time, in another space, no distance in any direction from here …