In the previous post, we looked at a simple web app to see whether a model could generate test cases from the app, analyze the code of that app, and generate automation based on those test cases. Here we’ll refine that process a bit by introducing a source of truth and having different models work together as a pipeline. We’ll even sneak DeepEval back in.

I should note that this is a post where things have been set up to generate certain variability in what you see as output and what passes or what fails. This is aside from even the standard non-determinism you tend to get with generative AI. Just go with the flow and, at the end, you’ll see where this is leading.
A New Model
One of the things I’m going to ask you to do in this post is to grab yet another model. This is another of my custom models (ts-coder), and it is designed specifically with some code style knowledge.
ollama pull jeffnyman/ts-coder
As we did in previous posts, you can fire up the model and ask it to self-identify.
ollama run jeffnyman/ts-coder
Once you get the prompt, type in this:
Who are you?
As a reminder, you can exit the model with:
/bye
This model will come in handy for us a bit later.
A Specification
One of the core differences in this post is the idea of having a specification, which I didn’t provide for you in the previous post. I have two versions of the spec you can download and put in your project folder:
Which one you use is entirely up to you. They have the exact same contents. For purposes of this post, I’m going to go with the Markdown version.
Reframing Our Script to Use the Spec
Let’s get a script in place that will be largely similar in structure to what we did in the previous post, but this time using the specification.
```python
import ollama
import requests
from pathlib import Path

spec_path = Path(__file__).parent / "overlord-spec.md"
specification = spec_path.read_text(encoding="utf-8")

url = "https://testerstories.com/files/ai_testing/overlord-003.html"
response = requests.get(url)
html_content = response.text

test_case_prompt = f"""
You are a software tester. You have been given a functional specification
for a web application called "Project Overlord". This specification is your
source of truth. Use it to generate test cases that verify the bomb defusal
logic works correctly.

=== FUNCTIONAL SPECIFICATION ===
{specification}

For each test case, specify:
- Test case ID and name
- Initial setup (activation code, deactivation code, countdown)
- Preconditions
- Steps to perform
- Expected outcome
- What risk or behavior this test is verifying

Focus especially on:
- State transitions between Inactive, Active, and Detonated
- Boundary conditions around the deactivation attempt counter
- Both paths to detonation (timer expiry and attempt exhaustion)
- Edge cases documented in Section 9 of the specification
"""

test_case_response = ollama.chat(model="jeffnyman/ts-reasoner", messages=[
    {"role": "user", "content": test_case_prompt}
])

print("=== GENERATED TEST CASES ===")
print(test_case_response["message"]["content"])
```
The script from the previous post used a two-turn conversation where the first turn was discovery (figure out what this app does) and the second turn was generation (now write tests). By introducing the spec, we’ve externalized the discovery step. A human (namely, me!) did that work. And I did that work carefully and deliberately, or so one hopes. Thus, the model’s entire budget now goes toward generation. This is the core idea behind spec-driven AI testing: you’re not asking the model to be smart about understanding the app, you’re asking it to be thorough about exercising a known behavioral model.
The analogy I would draw is to how a forensic expert operates: they don’t reconstruct the crime from raw evidence every time they write a report. They work from a case file that someone assembled. The spec is our case file.
One other thing worth noting: by pointing the model at specific sections in the prompt (Section 9, the state transition table), we’re doing a kind of prompt-level indexing. We’re guiding attention toward the parts of the spec most likely to yield non-obvious test cases.
Adding the Coder Model
Add the following to the bottom of your script:
```python
test_cases = test_case_response["message"]["content"]

validation_prompt = f"""
You are a code analyst reviewing generated test cases against an actual
implementation. You have two inputs:

1. Test cases generated from a functional specification
2. The HTML and JavaScript source code of the application

Your job is to assess whether each test case is viable, meaning the behavior
it expects is actually implemented in the code as described.

For each test case, identify:
- Whether the expected behavior is supported by the code
- Any discrepancies between what the test expects and what the code actually does
- Any test cases that appear to test behavior the code does not implement,
  or implements differently

=== GENERATED TEST CASES ===
{test_cases}

=== APPLICATION SOURCE CODE ===
{html_content}
"""

validation_response = ollama.chat(model="jeffnyman/ts-coder", messages=[
    {"role": "user", "content": test_case_prompt},
    {"role": "assistant", "content": test_cases},
    {"role": "user", "content": validation_prompt}
])

print("=== TEST CASE VALIDATION ===")
print(validation_response["message"]["content"])
```
Notice the conversation history carries forward again. This means ts-coder sees the original prompt and the test cases before it reads the validation request. That context matters because the model can then refer back to why a test case was written, not just what it says.
Now, here I do want to show you some initial output I got because it’s instructive. Solely for TEST CASE VALIDATION (thus, the second bit of code we just added), here is what I got during one run:
=== TEST CASE VALIDATION ===
This HTML file appears to be a simulation of a bomb device, complete with a user interface and backend logic. Here's a breakdown of its key components:
1. **HTML Structure**:
- The document includes three main screens: `setup-screen`, `bomb-screen`, and `detonation-screen`.
- Each screen contains specific elements for inputting codes, displaying status, timers, and messages.
2. **CSS Styling**:
- Basic styles are applied to the UI elements such as buttons, inputs, and messages.
- The `.active` class is used to show/hide screens dynamically.
3. **JavaScript Logic**:
- **Bomb Classes**: `Timer`, `Trigger`, and `Bomb` classes are defined to manage the bomb's state, timing, and activation/deactivation logic.
- **UI Controller**: `BombUI` class handles all user interactions with the UI:
- Initializes event listeners for buttons and input fields.
- Updates the display based on the bomb's state (active/inactive, timer countdown).
- Handles user input to either activate or deactivate the bomb.
- Shows/hides screens based on the current state of the bomb.
4. **Initialization**:
- An instance of `BombUI` is created at the end of the script to start the application.
### Key Features:
- **Timer Management**: The `Timer` class handles the countdown, starting and stopping based on user actions.
- **Activation/Deactivation Logic**: The `Trigger` class manages whether the bomb is activated or deactivated using a simple code system. It also counts incorrect attempts for deactivation.
- **User Interaction**: The UI allows users to input codes directly into an input field or use buttons to enter digits, which are then processed by the logic.
- **Dynamic Updates**: The display updates in real-time based on user actions and the state of the bomb (active/inactive, time remaining).
### Potential Improvements:
1. **Security**:
- Current code validation is basic. Consider adding more robust input sanitization to prevent injection attacks.
2. **Enhanced User Experience**:
- Add animations for better visual feedback.
- Provide sound effects for detonation and other events.
3. **Performance**:
- Ensure that the UI updates efficiently, especially under load or with complex interactions.
4. **Error Handling**:
- Improve error messages and provide more context when something goes wrong.
5. **Accessibility**:
- Add keyboard navigation and ARIA attributes for better accessibility.
Overall, this code provides a functional simulation of a bomb device with basic user interaction and state management.
Hmm. Do you notice something there? The model completely ignored the test cases and just analyzed the HTML as if it had never seen them. It essentially responded to the largest, most prominent piece of context in the prompt (the raw HTML) and treated the whole thing as a code review request. The test cases got lost.
There are two problems working against us here.
The first is prompt structure. The validation_prompt puts the test cases before the HTML, but the HTML is enormous compared to the test cases. Large language models tend to anchor on the most voluminous content, especially when the instruction framing isn’t strong enough to counteract that pull. The model drifted toward “analyze this code” because the code dominated the context window.
The second is role clarity. The instruction “assess whether each test case is viable” is a bit ambiguous; the model likely interpreted “viable” as an invitation to do a general code quality review rather than a structured test case audit.
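To make the first problem a bit more concrete, here is a minimal sketch that measures how much of a prompt each labeled section takes up. The `section_shares` helper and the sample sizes are hypothetical stand-ins, not something from the script above, and character counts are only a rough proxy for tokens.

```python
def section_shares(sections: dict[str, str]) -> dict[str, float]:
    """Return each section's share of the total character count."""
    total = sum(len(text) for text in sections.values())
    if total == 0:
        return {name: 0.0 for name in sections}
    return {name: len(text) / total for name, text in sections.items()}


# Hypothetical stand-ins for the instructions, test_cases, and html_content.
sample = {
    "instructions": "x" * 500,
    "test_cases": "x" * 3_000,
    "html": "x" * 16_500,
}

shares = section_shares(sample)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

If you run something like this against your real prompt pieces and the source dominates the total, that is at least a hint that the instructions are being crowded out.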
A possible fix is to make the test cases the explicit anchor and reduce the HTML to a reference document. Let’s try that out by changing the validation prompt to this:
```python
validation_prompt = f"""
You are a test case reviewer. Your sole job is to evaluate the test cases
provided below.

For each test case, you must answer three questions:
1. Is the behavior this test expects actually implemented in the source code?
2. Are the steps described accurate given how the code works?
3. Is the expected outcome correct?

Do not summarize or analyze the application generally.
Do not suggest improvements to the code.
Respond only with a structured review of each test case.

=== TEST CASES UNDER REVIEW ===
{test_cases}

=== SOURCE CODE (for reference only) ===
{html_content}
"""
```
The key changes here are relabeling the test cases as the thing under review (they still come before the HTML, as they did in the first version), using “for reference only” to explicitly demote the HTML’s role, and closing off the escape hatches (“do not summarize”, “do not suggest improvements”) that the model may have used to drift into a general review. Negative constraints like those are underused and can be effective at keeping a model on task.
Yet, I’m going to tell you, if you run with this change, you will likely find roughly the same situation. This is very intentional because this tells you something important about ts-coder specifically. The model I built for code generation is fine-tuned or prompted in a way that biases it heavily toward code analysis and explanation. When it sees a large HTML/JavaScript source, that bias overrides other instructions. It’s doing what it was trained to do, just not what we asked it to do in this specific circumstance.
There are two practical options worth thinking about here. The first option is to simply use ts-reasoner for this step instead. But, wait, earlier didn’t I say not to do that? Well, yeah, but that was to lead into this situation. A key thing to think about is that the validation task is actually a reasoning task: compare these test cases against this code and tell me what matches and what doesn’t. That’s closer to what ts-reasoner was built for than ts-coder.
There is another option to consider. We could strip the HTML down before passing it. Rather than feeding the full source, extract just the JavaScript logic (the three classes and the UI controller) and discard the HTML structure and CSS entirely. The model doesn’t need the styling to validate test cases; it only needs the behavioral logic.
I would actually recommend both options here, but maybe one at a time. First let’s import BeautifulSoup into our script, which you should have from the previous post. If you don’t, just do this:
python -m pip install bs4
Then import it:
```python
import ollama
import requests
from pathlib import Path
from bs4 import BeautifulSoup

...
```
Then add/modify the following:
```python
validation_prompt = f"""
You are a test case reviewer. Your sole job is to evaluate the test cases
provided below.

For each test case, you must answer three questions:
1. Is the behavior this test expects actually implemented in the source code?
2. Are the steps described accurate given how the code works?
3. Is the expected outcome correct?

Do not summarize or analyze the application generally.
Do not suggest improvements to the code.
Respond only with a structured review of each test case.

=== TEST CASES UNDER REVIEW ===
{test_cases}

=== SOURCE CODE (for reference only) ===
{html_content}
"""

soup = BeautifulSoup(html_content, "html.parser")
scripts = soup.find_all("script")
js_only = "\n".join(script.string for script in scripts if script.string)

validation_response = ollama.chat(model="jeffnyman/ts-coder", messages=[
    {"role": "user", "content": test_case_prompt},
    {"role": "assistant", "content": test_cases},
    {"role": "user", "content": validation_prompt.replace(html_content, js_only)}
])

...
```
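The `.replace()` call works because `html_content` was interpolated verbatim into the prompt, but it quietly depends on that staying true. One alternative, sketched here under the assumption that you are happy to restructure the script a little, is to build the prompt from a function that takes the source text as a parameter. The `build_validation_prompt` name and the sample values are hypothetical.

```python
def build_validation_prompt(test_cases: str, source: str) -> str:
    """Assemble the reviewer prompt around whichever source text is passed in."""
    return f"""
You are a test case reviewer. Your sole job is to evaluate the test cases
provided below.

For each test case, you must answer three questions:
1. Is the behavior this test expects actually implemented in the source code?
2. Are the steps described accurate given how the code works?
3. Is the expected outcome correct?

Do not summarize or analyze the application generally.
Do not suggest improvements to the code.
Respond only with a structured review of each test case.

=== TEST CASES UNDER REVIEW ===
{test_cases}

=== SOURCE CODE (for reference only) ===
{source}
"""


# Hypothetical sample values; in the script you would pass test_cases and js_only.
prompt = build_validation_prompt(
    test_cases="TC-1: (sample test case text)",
    source="class Bomb { /* ... */ }",
)
print(prompt)
```

With this shape, switching between the full HTML and the JavaScript-only text is just a matter of which string you pass as `source`.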
One instructive thing to do is to consider how this logic compares to the logic you ran in the last post, where you asked the same model to generate test cases but without the benefit of a specification.
Here I’m not going to show you any output I got because the specific output doesn’t matter as much for this post’s pedagogical purpose. What matters is the output you get, and I encourage you to run the script as we add to it and see what it gives you.
Checking Our Test Case Logic
Now, let’s use the model we just downloaded to do a cross-reference check. By this I mean: given the test cases and the actual implementation, does the logic described in the tests reflect what the code actually does? This is a legitimate and valuable step because the spec describes intended behavior, but the code describes actual behavior. Those two things should match, but as any tester knows, they sometimes don’t. Finding that gap is the whole point.
But, why bring in the ts-coder model here? We’re now asking a model to read and reason about JavaScript class logic: the Trigger, Timer, and Bomb classes, their methods, their state transitions. That’s code comprehension, not behavioral reasoning. The ts-reasoner got us from spec to test cases. The ts-coder model is perhaps better positioned to look at those test cases and say “yes, this is consistent with what the code actually does” or “no, this test expects behavior the code doesn’t implement.”
I want to linger on this point for a moment. In the previous post, we used ts-reasoner for everything: analysis, test case generation, and automation. Was that a good choice? Well, in one sense, it depends on what you got back, but there’s a more principled way to think about it.
General-purpose reasoning models are good at understanding context, following structured instructions, and producing coherent prose. That makes them well-suited for the early, exploratory parts of a testing workflow: reading a spec, identifying scenarios, articulating expected behaviors. The output is meant to be read by a human.
However, when the output is meant to be executed by a machine (a Playwright script, for instance), the requirements shift. You want a model that has been trained heavily on code syntax, API patterns, and the conventions of testing frameworks. A model optimized for code generation tends to produce automation that actually runs, uses idiomatic patterns, and handles edge cases like async timing in ways that a general reasoning model may approximate but get subtly wrong.
This is why ts-coder is worth reaching for in the later stages of the pipeline. Think of it like the difference between a consultant who writes a clear requirements document and an engineer who implements it. Both are doing intellectual work, but the skills that make someone excellent at one don’t automatically transfer to the other. Using the right model for the right stage of the pipeline is one of the small but meaningful ways augmented testing differs from just prompting a single model and hoping for the best.
That might have done a little better for you! With this, one of the outputs I got was this:
=== TEST CASE VALIDATION ===
This is a fantastic start to a comprehensive test suite for your BombUI application! You've covered a lot of ground with your initial test cases, and the structure you've established is excellent. Here's a breakdown of your tests, along with suggestions for expanding and improving them, categorized for clarity:
...
By expanding on these tests and incorporating the suggestions above, you'll create a robust and reliable test suite for your BombUI application. This will significantly reduce the risk of bugs and make it easier to maintain and evolve your code. Keep up the excellent work!
The ellipses indicate where it gave me a whole bunch of material that actually was fairly relevant. Another time I got output like this for one of the test cases in the test case generation step:
5. **Test Case ID:** TC-3
* **Test Case Name:** “Timer Expiry - Deactivation”
* **Initial Setup:**
* Activation Code: `1234`
* Deactivation Code: `0000`
* Countdown Duration: `30`
* **Preconditions:** Bomb is Active.
* **Steps:**
1. Observe the timer counting down from 30 seconds.
2. Wait for the timer to expire (00:00).
* **Expected Outcome:** The bomb detonates (screen transitions to the Detonation Screen).
* **Verification:** Observe the transition to the detonation screen.
* **Risk Mitigation:** Explicitly targets the timer expiry condition.
Then the validation output from the second step had this:
#### **Test Case ID:** TC-3
**Test Case Name:** “Incorrect Deactivation Code”
**Initial Setup:**
- Activation Code: `1234`
- Deactivation Code: `0000`
- Countdown Duration: `30`
**Preconditions:** Application loads into the Setup screen and the bomb is activated.
**Steps:**
1. Enter `5678` into the Deactivation Code field.
2. Click "Deactivate".
3. Observe the error message indicating that the entered code is incorrect, and the number of attempts remaining.
**Expected Outcome:** An error message should be displayed indicating that the entered code is incorrect, and it should show the remaining attempts.
**Verification:** Visually confirm the display of an error message with the number of attempts left.
**Risk Mitigation:** Ensuring the UI and logic correctly handle incorrect deactivation codes.
**Evaluation:**
- **Behavior Analysis**: The code includes a check for valid deactivation codes in `BombUI.toggleBombState()`.
- When an invalid code is entered, it displays an error message indicating that the code must be 4 digits and shows the remaining attempts.
- **Validation**: The error handling in `BombUI.showError()` correctly shows the error message with the number of attempts left when an incorrect deactivation code is detected.
**Conclusion:** This test case passes as expected.
Not too shabby!
Evaluating Our Responses
Let’s try to bring back DeepEval for this part. The previous validation step tells you whether the test cases are faithful to the implementation. DeepEval could perhaps tell you whether the test cases are faithful to the spec. If both pass, you have high confidence. If they diverge, you’ve found something worth investigating: either a spec that doesn’t match the code, or a model that hallucinated behavior that doesn’t exist in either.
The usual pipeline treats the model’s output as the end product, but now we’re asking: how do we know the generated test cases are actually grounded in the specification? That’s a quality gate, not just a generation step.
The DeepEval metric that fits this precisely is one we looked at already: Faithfulness, which, as you may remember, is borrowed from RAG (Retrieval-Augmented Generation) evaluation. The analogy is apt: in a RAG pipeline, you retrieve context and generate an answer, then ask “did the answer stay faithful to the retrieved context?” Our pipeline here can be structurally identical to that: the spec is the retrieved context, the test cases are the generated answer.
So here we need to add a few imports to our script:
```python
import ollama
import requests
from pathlib import Path
from bs4 import BeautifulSoup
from deepeval import evaluate
from deepeval.models import OllamaModel
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase
```
Then, add the following to the bottom of your script:

```python
evaluator_model = OllamaModel(model="jeffnyman/ts-evaluator")

faithfulness_metric = FaithfulnessMetric(
    threshold=0.7,
    model=evaluator_model,
    include_reason=True
)

deepeval_test_case = LLMTestCase(
    input=test_case_prompt,
    actual_output=test_cases,
    retrieval_context=[specification]
)

results = evaluate(
    test_cases=[deepeval_test_case],
    metrics=[faithfulness_metric]
)

for test_result in results.test_results:
    for metric_data in test_result.metrics_data or []:
        print(f"\n=== {metric_data.name.upper()} ===")
        print(f"Score: {metric_data.score}")
        print(f"Passed: {metric_data.success}")
        print(f"Reason: {metric_data.reason}")
```
If, in the previous posts, you set up your Confident AI keys in your .env file, do remember that this call to evaluate() will send the data to the cloud, which you might not want. You can always comment out those keys temporarily in the file.
Let’s remind ourselves what the Faithfulness metric is actually doing. It’s not a simple string match. Under the hood, DeepEval decomposes the actual_output into individual claims, which are discrete factual assertions like “the deactivation attempt counter increments on each failed attempt.” It then checks each claim against the retrieval_context to see if it’s supported. The final score is the proportion of claims that are grounded. So, as an example, a score of 0.7 means 70% of the claims in the generated test cases can be traced back to the spec, and our threshold of 0.7 means at least 70% must be traceable for the metric to pass.
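The arithmetic behind that score can be sketched in a few lines. To be clear, this is a simplified illustration of the proportion described above, not DeepEval’s actual implementation; the `faithfulness_score` function, the verdict list, and the handling of an empty claim list are all assumptions made for the example.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Proportion of extracted claims judged as supported by the context."""
    if not verdicts:
        # Assumption for this sketch only: no claims means nothing was grounded.
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts: seven extracted claims, six of them supported by the spec.
verdicts = [True, True, True, True, True, True, False]
threshold = 0.7

score = faithfulness_score(verdicts)
print(f"score={score:.2f} passed={score >= threshold}")
# prints: score=0.86 passed=True
```

The useful thing to notice is how coarse the score is when there are only a handful of claims: with seven claims, a single unsupported one costs about 0.14.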
This maps to something intuitive in testing: a test case that asserts behavior not described in the spec is either testing the wrong thing, or it’s hallucinated. Both are failures of a different kind than a test that simply fails to run.
Keep in mind with this we’re running the ts-evaluator model for this step, which means we have three models acting in tandem.
Here’s one bit of output I got for this, in terms of the Faithfulness results:
=== FAITHFULNESS ===
Score: 0.8571428571428571
Passed: True
Reason: The score is 0.86 because the actual output implies that an error message appears for incorrect code submissions, which contradicts the retrieval context's silence on this point
When I checked my retrieval context, I found that the metric was penalizing a claim in the generated test cases on the grounds that the spec doesn’t explicitly mention error messages appearing for incorrect activation code attempts, only for incorrect deactivation attempts and for submitting fewer than 4 digits. Look back at Section 5.5 of the spec.
The error message for incorrect activation is in the spec, so this is actually a false negative from the metric, not a genuine faithfulness violation. The model generated a valid claim that happens to be in the spec, but the evaluator didn’t successfully trace it back.
This particular output is worth highlighting because it demonstrates two things simultaneously. First, 0.86 is not a perfect score, and the reason tells you exactly where to look to decide whether the deduction is legitimate. In this case it isn’t (the spec does cover that behavior), which means the spec language may not be explicit enough for the evaluator to make the connection reliably. Second, the metric is a signal, not a verdict. The right workflow is: run the evaluation, read the reason, then apply human judgment to determine whether the flagged claim is a genuine hallucination or a retrieval miss.
That human-in-the-loop step is what separates augmented testing from automated testing. That right there is probably the central thesis of my whole blog series.
Yet another output I got during another run was this:
=== FAITHFULNESS ===
Score: 0.5
Passed: False
Reason: The score is 0.50 because there are no contradictions provided to explain the faithfulness score.
Again, I had to look at the specific retrieved context to see what was going on, but the upshot is this: the reason is unhelpful because the evaluator model is itself struggling. “There are no contradictions provided to explain the faithfulness score” is the model essentially saying it couldn’t complete the reasoning task, not that it found no contradictions. A 0.5 with that reason is a model confusion artifact, not a genuine faithfulness measurement.
The retrieval context I got looked correct: it was the full specification being passed in as expected. So the problem wasn’t what I was feeding DeepEval, it’s that ts-evaluator wasn’t, in this case, robust enough to perform the claim decomposition that FaithfulnessMetric requires internally. DeepEval’s faithfulness metric works by asking the evaluator model to first extract atomic claims from the output, then verify each claim against the retrieval context. That’s two cognitively demanding subtasks chained together, and smaller local models often lose coherence partway through.
This is actually an important point to be honest about for my purposes here: the evaluator model is itself a variable in the quality of your evaluation. It’s easy to treat DeepEval as an objective oracle, but it’s only as reliable as the model powering it. A weak evaluator produces scores that are noise rather than signal.
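One cheap guard against this kind of noise is to flag results whose reason doesn’t actually explain the score. What follows is only a heuristic sketch: the `looks_suspect` function and the phrase it searches for are my own assumptions, based on the reason string shown above, and are not part of DeepEval.

```python
def looks_suspect(score: float, reason: str) -> bool:
    """Flag a result that reports no contradictions yet has an imperfect score."""
    reports_no_contradictions = "no contradictions" in reason.lower()
    return reports_no_contradictions and score < 1.0


# The reason string from the run shown above.
confused = looks_suspect(
    0.5,
    "The score is 0.50 because there are no contradictions provided "
    "to explain the faithfulness score.",
)

# A reason that does name a specific problem is left alone.
explained = looks_suspect(
    0.86,
    "The score is 0.86 because the actual output implies that an error "
    "message appears for incorrect code submissions.",
)

print(confused, explained)
# prints: True False
```

A flagged result doesn’t tell you the test cases are fine. It tells you the evaluation itself needs a second look before you trust the number.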
Yet, there’s actually an even more relevant point here, one that suggests the model is not so much the issue as the nature of the task. Yet another output I got was this:
=== FAITHFULNESS ===
Score: 0.6666666666666666
Passed: False
Reason: The score is 0.67 because there are no contradictions provided in the 'contradictions' list, indicating that the actual output aligns well with the retrieval context.
So, it failed but there are no contradictions and the output “aligns well” with the context that was pulled. Which leads one to question: why did this fail? To show you how variable this can be, another time I got this:
=== FAITHFULNESS ===
Score: 0.8
Passed: True
Reason: The score is 0.80 because there are no contradictions provided in the 'contradictions' list, indicating that the actual output aligns well with the retrieval context.
Same message, but this time I made it past the 0.7 threshold (0.80 versus 0.67).
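Given that much run-to-run variation, it can help to look at several runs together rather than at any single score. Here is a minimal sketch using only the standard library and the four faithfulness scores shown so far in this post. Four runs is a tiny sample, so treat the numbers as illustrative rather than as a real measurement.

```python
from statistics import mean, pstdev

THRESHOLD = 0.7

# Faithfulness scores from the four runs shown so far in this post.
scores = [0.8571428571428571, 0.5, 0.6666666666666666, 0.8]

passes = [score >= THRESHOLD for score in scores]

print(f"runs: {len(scores)}")
print(f"mean score: {mean(scores):.2f}")
print(f"spread (population std dev): {pstdev(scores):.2f}")
print(f"pass rate: {sum(passes)}/{len(passes)}")
```

A mean sitting right around the threshold with a wide spread is a sign that pass or fail on any one run is close to a coin flip.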
Are we perhaps seeing that DeepEval simply isn’t good for this kind of context? Well, let’s add one more metric to this. You can add the following bits:
```python
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric

...

faithfulness_metric = FaithfulnessMetric(
    threshold=0.7,
    model=evaluator_model,
    include_reason=True
)

relevancy_metric = AnswerRelevancyMetric(
    threshold=0.7,
    model=evaluator_model,
    include_reason=True
)

deepeval_test_case = LLMTestCase(
    input=test_case_prompt,
    actual_output=test_cases,
    retrieval_context=[specification]
)

results = evaluate(
    test_cases=[deepeval_test_case],
    metrics=[faithfulness_metric, relevancy_metric]
)

...
```
One of the outputs I got when I ran this:
=== FAITHFULNESS ===
Score: 0.45454545454545453
Passed: False
Reason: The score is 0.45 because the actual output focuses on specific areas for testing, while the context emphasizes application behavior and negative tests without specifying these details.
=== ANSWER RELEVANCY ===
Score: 1.0
Passed: True
Reason: The score is 1.00 because the output strictly adheres to the provided functional specification, focusing solely on generating test cases as requested. There are no irrelevant statements that detract from this task.
Notice how relevancy did well, but faithfulness was still having issues. Running this multiple times, I found that was a common theme.
What this suggests is that it’s not so much DeepEval, per se, but rather the specific type of metric being used in this specific kind of task.
Pipeline vs Agents
What you have in this post is best described as an orchestrated pipeline rather than a true agentic system. The difference is worth being precise about because you will encounter both terms and often they get conflated.
In a true agentic system, the agent decides at runtime which tools to call, in what order, and whether to loop back based on intermediate results. There’s autonomy in the sequencing. The agent might look at a low faithfulness score and decide to regenerate the test cases before proceeding to some other task (say, generating Playwright automation) without you hardcoding that logic.
In this pipeline, you are the orchestrator. You’ve decided the sequence in advance: spec → ts-reasoner → ts-coder → ts-evaluator. The models don’t influence each other’s invocation. They’re more like specialists in an assembly line than collaborating agents. But here’s what you should consider: this pipeline has the same logical structure as an agentic system. We have specialized actors (ts-reasoner, ts-coder, ts-evaluator) each with a defined role, structured handoffs where output from one stage becomes input to the next, and a validation step (DeepEval) that could theoretically gate whether the pipeline continues.
That last point is the bridge to actual agency. Right now DeepEval reports a score and you read it. But if you added logic like this:
```python
if not passed(metric):
    # regenerate test cases before proceeding
    ...
```
Well, then you would have crossed the line from pipeline into proto-agent. The system would be making a decision based on evaluated output rather than just executing a fixed sequence.
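To flesh that idea out slightly, here is a minimal sketch of a bounded regeneration loop. Everything in it is hypothetical: `generate_with_gate` is not part of the script, and the two callables are stubs standing in for the `ollama.chat()` generation call and a pass/fail check built from the DeepEval results.

```python
from typing import Callable, Optional


def generate_with_gate(
    generate: Callable[[], str],
    evaluate_passes: Callable[[str], bool],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Regenerate until the evaluation passes or the attempts run out.

    Returns the accepted output (or None) and the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if evaluate_passes(candidate):
            return candidate, attempt
    return None, max_attempts


# Hypothetical stubs: the second draft is the first one to "pass" evaluation.
drafts = iter(["draft one", "draft two", "draft three"])

result, attempts = generate_with_gate(
    generate=lambda: next(drafts),
    evaluate_passes=lambda text: text.endswith("two"),
)
print(result, attempts)
# prints: draft two 2
```

The cap on attempts matters. Given how noisy the faithfulness scores have been, an unbounded loop could spin for a long time on a result that was only failing because of the evaluator.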
Key takeaway: This pipeline demonstrates the conceptual foundations of agentic AI (specialization, handoffs, and evaluation) without being agentic itself. It’s a useful teaching moment precisely because it lets us understand what agency would add, rather than treating it as magic.
Just to give you an idea of what this would look like as gating criteria, consider this logic (which you could add to the bottom of your script, should you wish):
```python
faithfulness_passed = any(
    m.name == "Faithfulness" and m.success
    for test_result in results.test_results
    for m in (test_result.metrics_data or [])
)

relevancy_passed = any(
    m.name == "Answer Relevancy" and m.success
    for test_result in results.test_results
    for m in (test_result.metrics_data or [])
)

if faithfulness_passed and relevancy_passed:
    print("\n=== METRICS PASSED: GENERATING PLAYWRIGHT SCRIPT ===")

    playwright_prompt = f"""
You are a test automation engineer. Convert the following test cases into
a working Playwright script using Python.

You have three sources of context:
1. The functional specification: your source of truth for what the
   application is supposed to do
2. The raw HTML: your reference for element IDs, selectors, and page structure
3. The test cases: the scenarios you must automate

=== FUNCTIONAL SPECIFICATION ===
{specification}

=== HTML SOURCE ===
{html_content}

=== TEST CASES TO AUTOMATE ===
{test_cases}

Generate a single Playwright Python script that:
- Uses pytest-playwright as the test framework
- Has one test function per test case
- Uses the exact element IDs and selectors from the HTML
- Includes assertions that match the expected outcomes in the specification,
  not just the test cases
- Handles timing correctly for the countdown timer
- Has comments referencing the relevant spec section for each test
"""

    playwright_response = ollama.chat(model="jeffnyman/ts-coder", messages=[
        {"role": "user", "content": test_case_prompt},
        {"role": "assistant", "content": test_cases},
        {"role": "user", "content": playwright_prompt}
    ])

    playwright_script = playwright_response["message"]["content"]

    if "```python" in playwright_script:
        code_start = playwright_script.find("```python") + 9
        code_end = playwright_script.find("```", code_start)
        playwright_script = playwright_script[code_start:code_end].strip()

    print("\n=== GENERATED PLAYWRIGHT SCRIPT ===")
    print(playwright_script)

    output_dir = Path(__file__).parent
    (output_dir / "test_cases.md").write_text(test_cases, encoding="utf-8")
    (output_dir / "test_overlord.py").write_text(playwright_script, encoding="utf-8")

    print("\n=== FILES SAVED ===")
    print("Test cases: test_cases.md")
    print("Playwright script: test_overlord.py")
else:
    print("\n=== METRICS FAILED: PLAYWRIGHT GENERATION SKIPPED ===")
    print("Resolve the following before proceeding:")
    if not faithfulness_passed:
        print(" - Faithfulness metric did not pass")
    if not relevancy_passed:
        print(" - Answer Relevancy metric did not pass")
```
Notice that this brings the ts-coder model back in, since we’re now dealing with a coding task. With that, you might get output like this:
=== FAITHFULNESS ===
Score: 0.6
Passed: False
Reason: The score is 0.60 because several contradictions exist in the actual output, such as accepting a deactivation code with an incorrect format ('1234' instead of '0000'), processing non-integer countdown values incorrectly, and not validating invalid characters in the input fields.
=== ANSWER RELEVANCY ===
Score: 1.0
Passed: True
Reason: The score is 1.00 because the output strictly adheres to the provided functional specification, focusing solely on generating test cases as requested. There are no irrelevant statements that detract from this task.
=== METRICS FAILED: PLAYWRIGHT GENERATION SKIPPED ===
Resolve the following before proceeding:
- Faithfulness metric did not pass
Or you might get something like this:
=== FAITHFULNESS ===
Score: 0.5714285714285714
Passed: False
Reason: The score is 0.57 because the actual output introduces several unmentioned test scenarios, such as invalid code entry handling, keypad input, and same code activation/deactivation, which are not supported by the retrieval context.
=== ANSWER RELEVANCY ===
Score: 0.6666666666666666
Passed: False
Reason: The score is 0.67 because the test cases provided are generally relevant and cover key aspects of the application's functionality, including state transitions and edge cases. However, there are multiple irrelevant statements that do not contribute to the test cases or the core functionality described in the specification. These include general descriptions about the application setup and user interaction that are not necessary for defining specific test scenarios.
=== METRICS FAILED: PLAYWRIGHT GENERATION SKIPPED ===
Resolve the following before proceeding:
- Faithfulness metric did not pass
- Answer Relevancy metric did not pass
See what’s happening? The conditional block at the bottom of the logic is worth explaining because it’s doing something conceptually significant: it’s using evaluation results as executable decisions, not just diagnostic output. Most testing workflows treat metrics as things you read and act on manually. Here the script acts on them directly.
The failure branch is just as important as the success branch. Telling the user which metric failed, and implying they should look at the reasons printed earlier, gives the pipeline a useful feedback loop even when it doesn’t proceed. That connects back to the broader point about augmented testing: the human is still in the loop, but the system is doing the triage work of telling them where to look.
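To make the "metrics as decisions" idea concrete outside of DeepEval, here is a minimal sketch of a gate that returns both the decision and the reasons for it. Treat it as an illustration under my own assumptions: `MetricResult` and `gate` are hypothetical names I made up for this example, not DeepEval classes, and the reason text is invented sample data.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical stand-in for one metric's outcome (not a DeepEval class)."""
    name: str
    success: bool
    reason: str = ""


def gate(metrics, required):
    """Return (proceed, blockers).

    proceed is True only when every required metric is present and passed;
    blockers lists, per metric, what stopped the pipeline.
    """
    by_name = {m.name: m for m in metrics}
    blockers = []
    for name in required:
        metric = by_name.get(name)
        if metric is None:
            blockers.append(f"{name}: no result recorded")
        elif not metric.success:
            blockers.append(f"{name}: {metric.reason or 'did not pass'}")
    return (not blockers, blockers)


# Invented sample results, shaped like the output shown earlier.
results = [
    MetricResult("Faithfulness", False, "contradictions found in the output"),
    MetricResult("Answer Relevancy", True),
]

proceed, blockers = gate(results, ["Faithfulness", "Answer Relevancy"])
print("Proceed:", proceed)
for blocker in blockers:
    print(" -", blocker)
```

Returning the blockers alongside the boolean is the design choice that matters here: the same call that stops the pipeline also tells you where to look.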
I should note you may also see this!
=== METRICS PASSED: GENERATING PLAYWRIGHT SCRIPT ===
With that, you would get two files generated in your project: test_cases.md and test_overlord.py.
If you want to see that path actually work, I recommend commenting out the Faithfulness check because, chances are, even over many runs you will not see a pass with it in place.
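One detail of the success path is worth isolating: pulling the code out of the model's reply before saving it. The sketch below puts that step in a small helper you can test without calling a model. It's a hedged illustration, assuming a reply can contain a complete fenced block, an unclosed one, or no fence at all; `extract_python` is a name I'm introducing here, not part of the script.

```python
FENCE = "`" * 3  # built this way so the example contains no literal fence


def extract_python(reply):
    """Return the body of the first python-fenced block in reply.

    Falls back to the whole reply when there is no opening fence, and to
    everything after the opener when the closing fence is missing.
    """
    opener = FENCE + "python"
    start = reply.find(opener)
    if start == -1:
        return reply.strip()
    body_start = start + len(opener)
    end = reply.find(FENCE, body_start)
    if end == -1:
        # Reply was cut off before the closing fence.
        end = len(reply)
    return reply[body_start:end].strip()


complete = "Here you go:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nDone."
truncated = FENCE + "python\nprint('hi')"
bare = "print('hi')"

for reply in (complete, truncated, bare):
    print(extract_python(reply))
```

All three sample replies reduce to the same single line of code, which is what you want written to test_overlord.py.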
Agentic Logic
The key ingredient that makes something truly agentic, rather than a pipeline, is a decision loop: the system evaluates its own output and decides what to do next based on that evaluation, without you hardcoding the sequence. The simplest honest implementation would look like this (this would be a separate script that you could try, if you want):
import json
import ollama
from pathlib import Path

spec_path = Path(__file__).parent / "overlord-spec.md"
specification = spec_path.read_text(encoding="utf-8")

# Safety ceiling: the loop never runs more than this many generations.
MAX_ATTEMPTS = 3


def generate_test_cases(conversation_history):
    response = ollama.chat(
        model="jeffnyman/ts-reasoner",
        messages=conversation_history
    )
    return response["message"]["content"]


def evaluate_quality(test_cases):
    evaluation_prompt = f"""
    Review these generated test cases against the specification.

    === SPECIFICATION ===
    {specification}

    === TEST CASES ===
    {test_cases}

    Respond with a JSON object only:
    {{
        "acceptable": true or false,
        "missing_coverage": ["list any spec sections not covered"],
        "feedback": "one sentence of actionable feedback if not acceptable"
    }}
    """

    response = ollama.chat(
        model="jeffnyman/ts-evaluator",
        messages=[{"role": "user", "content": evaluation_prompt}]
    )

    # Assumes the evaluator replies with bare JSON and nothing else.
    return json.loads(response["message"]["content"])


def run_agent():
    conversation_history = [
        {"role": "user", "content": f"""
        You are a software tester. Generate test cases for the
        Project Overlord application based on this specification:

        === SPECIFICATION ===
        {specification}

        For each test case specify:
        - Test case ID and name
        - Initial setup
        - Steps to perform
        - Expected outcome
        """}
    ]

    attempt = 0

    while attempt < MAX_ATTEMPTS:
        attempt += 1
        print(f"\n=== GENERATION ATTEMPT {attempt} ===")

        test_cases = generate_test_cases(conversation_history)
        print(test_cases)

        print(f"\n=== EVALUATING ATTEMPT {attempt} ===")
        evaluation = evaluate_quality(test_cases)

        # Conditional exit: stop as soon as the evaluator accepts the output.
        if evaluation["acceptable"]:
            print("\n=== AGENT DECISION: OUTPUT ACCEPTED ===")
            return test_cases

        print("\n=== AGENT DECISION: REGENERATING ===")
        print(f"Missing coverage: {evaluation['missing_coverage']}")
        print(f"Feedback: {evaluation['feedback']}")

        # Feed the rejected attempt and the critique back in as context.
        conversation_history.append(
            {"role": "assistant", "content": test_cases}
        )
        conversation_history.append({
            "role": "user",
            "content": f"""
            Your test cases were reviewed and found incomplete.

            Missing coverage: {evaluation['missing_coverage']}
            Feedback: {evaluation['feedback']}

            Please regenerate the test cases addressing these gaps.
            """
        })

    print("\n=== AGENT DECISION: MAX ATTEMPTS REACHED ===")
    return test_cases


final_test_cases = run_agent()
The thing that makes this genuinely agentic rather than just a pipeline is the while loop with a conditional exit. The agent doesn’t know in advance how many iterations it will take. It keeps going until its own evaluator says the output is good enough, or until it hits the safety ceiling of MAX_ATTEMPTS. That ceiling is important to call out: a true agent without guardrails could loop indefinitely, so bounding the autonomy is a design decision, not an afterthought.
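If you want to see that loop shape without waiting on any models, here is a stripped-down sketch with the generator and evaluator passed in as plain functions. The names `refine`, `fake_generate`, and `fake_evaluate` are hypothetical stand-ins I'm using for illustration. The fakes just simulate an evaluator that rejects the first draft.

```python
MAX_ATTEMPTS = 3


def refine(generate, evaluate, max_attempts=MAX_ATTEMPTS):
    """Generate, evaluate, and retry with feedback until the evaluator
    accepts the draft or the attempt ceiling is reached.

    Returns (draft, accepted, attempts_used).
    """
    feedback = None
    draft = None
    for attempt in range(1, max_attempts + 1):
        draft = generate(feedback)
        accepted, feedback = evaluate(draft)
        if accepted:
            return draft, True, attempt
    # Ceiling reached: hand back the last draft, flagged as not accepted.
    return draft, False, max_attempts


# Hypothetical stand-ins for the two models, so the loop runs offline.
def fake_generate(feedback):
    if feedback is None:
        return "happy path"
    return f"happy path + {feedback}"


def fake_evaluate(draft):
    if "timer expiry" in draft:
        return True, None
    return False, "timer expiry"


print(refine(fake_generate, fake_evaluate))
```

Swap in an evaluator that never accepts and the function still returns after three attempts, with the accepted flag set to False. That is the bounded-autonomy guardrail in its smallest form.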
The conversation history accumulating across iterations is also worth highlighting. Each failed attempt becomes context for the next one. The agent isn’t starting fresh; it’s learning within the session what it got wrong. That’s the local analog of what more sophisticated agentic frameworks do with memory.
What keeps this honest and teachable is that the “intelligence” is distributed across two models with different roles (ts-reasoner generates, ts-evaluator judges) and neither model knows about the other. The orchestration logic in run_agent() is what creates the agentic behavior. That’s a useful distinction for you to keep in mind: agency lives in the loop, not in any individual model.
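One practical wrinkle with the handoff between those two roles: evaluate_quality() passes the evaluator's reply straight to json.loads, which raises an exception if the model adds any text around the JSON. Here is a hedged sketch of a more forgiving parser. `parse_verdict` is a hypothetical helper, and treating an unparseable reply as "not acceptable" is my assumption about what you'd want, not something the script does.

```python
import json


def parse_verdict(reply):
    """Parse an evaluator reply into a verdict dict.

    Tolerates text around the JSON object by slicing from the first "{"
    to the last "}". Anything unparseable is treated as "not acceptable"
    so the loop retries instead of crashing.
    """
    fallback = {
        "acceptable": False,
        "missing_coverage": [],
        "feedback": "Evaluator reply was not valid JSON.",
    }
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return fallback
    missing = data.get("missing_coverage")
    return {
        # Only a literal JSON true counts as acceptance.
        "acceptable": data.get("acceptable") is True,
        "missing_coverage": missing if isinstance(missing, list) else [],
        "feedback": str(data.get("feedback") or ""),
    }


chatty = (
    "Sure! Here is my review:\n"
    '{"acceptable": false, "missing_coverage": ["Section 9"], '
    '"feedback": "Add the documented edge cases."}\n'
    "Hope that helps."
)
print(parse_verdict(chatty))
print(parse_verdict("Looks fine to me."))
```

The result always has the same three keys, so the lookups in run_agent() would not need to change if you swapped this in for the bare json.loads call.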
What This Pipeline Actually Taught You
The messiness you encountered here was not incidental. It was the lesson. When the evaluator model produced a 0.5 score and explained it by saying there were “no contradictions to explain,” that wasn’t a failure of DeepEval. It was a demonstration that your evaluator is itself a variable, not an oracle. Every layer of this pipeline (generation, validation, evaluation) is staffed by a model with its own biases, failure modes, and blind spots. Knowing that changes how you read the output.
That’s the core of what I would call augmented testing: the system does the triage, a human supplies the judgment. The faithfulness metric flagged a claim about error messages; a human recognized it as a retrieval miss rather than a hallucination. No automated threshold catches that distinction. The pipeline narrows the search space. A human closes the loop.
None of what commercial tools do with AI-assisted testing is categorically different from what you built here. The models are larger, the prompts more refined, the failure modes better hidden, but the underlying structure is the same: specialized actors, structured handoffs, evaluation gates, and a human deciding what the scores actually mean. Understanding the failure modes in a transparent system like the one I showed you in this post gives you the vocabulary to interrogate the opaque ones. There is no magic. There is only the loop, and knowing where you stand in it.
Next Steps
This post gave you some of the basis for agentic AI. There’s certainly more to talk about in this arena but, in the next post, I think it will be worth pivoting to yet another key area in AI implementation: knowledge graphs and ontologies.