Writing comprehensive test cases means understanding every component, state transition, and edge case in your application. Can an AI model look at a web application and figure out what needs testing? Well, let’s find out. We’ll give a local AI model the HTML for a bomb defusal simulator, ask it to analyze the code, and see if it can generate meaningful test cases, then convert those into working Playwright scripts.

Test Supporting Tools
One thing I want to head off at the start is the question of tooling. Some of you may be aware of tools like Testim, Applitools, TestRigor, Virtuoso QA, Mabl, and Functionize, to name just a few.
Is what I’m going to do in this post the same as those tools? Yes and no. I’m going to be showing the conceptual foundation, but these commercial tools have added significant layers on top. Let me break down what I’m showing here versus what they’re selling.
I’m going over the core concept.
- AI reads application structure (HTML/DOM)
- AI reasons about functionality (what does this app do?)
- AI generates test scenarios (what should we test?)
- AI produces executable code (Playwright commands)
This is fundamentally what’s happening under the hood in all these tools. What those commercial tools add, however, is potentially significant.
- So-called “self-healing” selectors (as seen in Testim, Mabl, and Functionize). The idea is that when your script says click("#submit-btn") and the ID changes to #submit-button, traditional tests break. These tools ostensibly use AI to figure out “oh, that’s probably the same button” based on visual position, text content, surrounding elements, and historical patterns.
- Visual AI testing (this is Applitools specifically). The idea here is to take screenshots and use computer vision to detect visual regressions, like “the button moved 3 pixels” or “the color changed.”
- Natural language test creation (such as in TestRigor and Virtuoso). Instead of writing code, you write: “click Login, enter username as ‘test@example.com’, verify dashboard appears.” The AI translates that to actual automation.
- Learning from usage (which is what most of the tools purport to do). This is the idea of watching how real users interact with your app and suggesting tests based on actual usage patterns.
- Maintenance and reporting infrastructure (again, most of the tools purport to do at least some of this). Here is where you get test result dashboards, CI/CD integration, team collaboration tools, historical trend analysis, and smart test prioritization.
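To make the “self-healing” idea a bit more concrete, here is a toy sketch of the kind of fallback logic such tools might implement internally. Everything in it (the element dictionaries, the heal_selector helper) is hypothetical illustration, not any vendor’s actual algorithm:

```python
def heal_selector(elements, wanted_id, wanted_text):
    """Toy 'self-healing' lookup: prefer the recorded ID, but fall back
    to matching on visible text if the ID has changed."""
    for el in elements:
        if el.get("id") == wanted_id:
            return el
    # ID not found: the developer probably renamed it. Fall back to
    # looser cues -- here, just the button's visible text.
    for el in elements:
        if el.get("text") == wanted_text:
            return el
    return None

# The recorded test knew the button as #submit-btn...
page_elements = [
    {"id": "submit-button", "text": "Submit"},  # ...but the ID changed
    {"id": "cancel-btn", "text": "Cancel"},
]

healed = heal_selector(page_elements, "submit-btn", "Submit")
print(healed["id"])  # prints "submit-button"
```

Real tools weigh many more signals (position, DOM neighborhood, history), but the shape of the trick is the same: degrade gracefully from exact locators to fuzzier evidence.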
The Pedagogical Approach
What I’m doing here is more of the pedagogical approach. I’m trying to help understand the black box, as it were. Commercial tools often say: “Just click this, magic happens!” What I’m trying to show is, “Here’s exactly how the magic works.” This is like the difference between using a calculator versus understanding arithmetic or, apropos to our context, using a framework versus understanding the underlying patterns.
As with any such tooling, when you understand the mechanism, you can adapt it to your specific context, combine it with domain-specific knowledge, and build custom workflows the tools may not support.
My goal with a pedagogical approach is that readers can then evaluate those commercial tools and think: “Oh, they’re doing X better than my script, but Y is just marketing fluff.” Think of it like teaching someone to build a simple search engine. The pedagogical approach is, “Here’s how to crawl pages, index words, and rank results.” What Google would say is, “Here’s a search engine with twenty-five years of refinement, billions in infrastructure, and thousands of engineers.” Both are “search engines,” but one teaches principles, the other is a production system.
The Application
To check out the application, you can go to the first version of Project Overlord.
Let’s say this is all you’re given. You have no requirements docs and all the developers and business analysts you could ask about the app have taken a siesta. So, you’re on your own with some exploratory testing.
Starting to Script
Let’s say you’ve played around a little bit but now you want to give AI a chance to do likewise. One thing you’re going to need as we do this is another Python dependency, which you can install. (Reminder to be in your virtual environment.)
python -m pip install bs4
What we’re installing here is something called Beautiful Soup and this is a Python library that parses HTML documents into a structured format. While we could send the raw HTML directly to our AI model, Beautiful Soup gives us the option to clean up or extract specific sections if needed.
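As a sketch of that cleanup option, here is one way you might use Beautiful Soup to pull out just the embedded JavaScript rather than shipping the whole page to the model. The sample HTML string here is made up for illustration; the real script fetches the actual application page:

```python
from bs4 import BeautifulSoup

# A stand-in page; the real script fetches the application's HTML.
sample_html = """
<html>
  <body>
    <div id="bomb-status">INACTIVE</div>
    <script>function activate(code) { return code === '1234'; }</script>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Extract only the <script> blocks, e.g. to focus the model on behavior
# (or to shrink the prompt when the page is large).
scripts = "\n".join(tag.get_text() for tag in soup.find_all("script"))
print(scripts)
```

For our small application, sending the whole page is fine, but trimming like this becomes useful when a page would otherwise blow past the model's context window.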
Now let’s get a script in place:
```python
import ollama
import requests
from bs4 import BeautifulSoup

url = "https://testerstories.com/files/ai_testing/overlord-001.html"
response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")

prompt = f"""
Here is a web application that simulates a bomb defusal system:

{html_content}

Please analyze this application and describe:
1. What are the main components?
2. What is the sequence of actions a user must take?
3. What are the different states the application can be in?
"""

response = ollama.chat(model="jeffnyman/ts-reasoner", messages=[
    {"role": "user", "content": prompt}
])

print(response["message"]["content"])
```
We start by fetching the HTML content of my bomb defusal application using the requests library. Then we create a prompt that embeds the entire HTML and asks the AI model three specific questions: what are the main components, what sequence of actions must a user take, and what states can the application be in. We send this prompt to our local ts-reasoner model via Ollama’s API, and the model returns its analysis based purely on reading the HTML structure and embedded JavaScript.
Evaluating the Output
You’re going to get some output from this, of course, but what do you do with it? We just spent a lot of time looking at DeepEval and you might wonder if that could help here.
Well, not really. DeepEval (and similar tools) are designed to evaluate AI outputs when there’s a known correct answer or clear quality metrics. Here, however, human judgment really matters. It’s like asking someone to critique a painting. The DeepEval approach would be “Measure color balance, symmetry, rule-of-thirds compliance.” The human approach might be that, but also “Does this evoke the intended emotion? Is the composition effective?” The second requires human judgment because there’s no objective answer.
In this case, the human tester needs to evaluate a few things:
- Completeness: “Did it catch the major pieces?”
- Accuracy: “Are the state transitions it described actually correct?”
- Usefulness: “Does this analysis help me write better tests?”
- Missing insights: “Did it notice the 3-attempt limit on deactivation?”
The model’s output isn’t “right” or “wrong” in an absolute sense. It’s effectively an interpretation. Your job as a tester is to evaluate whether this analysis gives you useful testing insights. Did it identify the key components? Did it understand the state machine? Would its analysis help someone new to this app write good tests?
This is why AI augments testing rather than replacing it. The AI can rapidly analyze structure and suggest possibilities, but the human tester brings domain knowledge and critical thinking to evaluate those suggestions.
One thing I want you to think about here is when a tool like DeepEval could help in this context. As you go through this post, keep that thought in the back of your mind.
I recommend digging into whatever output you got. Beyond looking in your browser, you can look at the source for this first Project Overlord example in my repo.
Now, let’s modify our script a bit. The modifications start at line 22.
```python
import ollama
import requests
from bs4 import BeautifulSoup

url = "https://testerstories.com/files/ai_testing/overlord-001.html"
response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")

prompt = f"""
Here is a web application that simulates a bomb defusal system:

{html_content}

Please analyze this application and describe:
1. What are the main components?
2. What is the sequence of actions a user must take?
3. What are the different states the application can be in?
"""

analysis_response = ollama.chat(model="jeffnyman/ts-reasoner", messages=[
    {"role": "user", "content": prompt}
])

first_response = analysis_response["message"]["content"]

test_case_prompt = """
Based on your analysis, generate test cases that would verify the
bomb defusal logic works correctly. For each test case, specify:
- Initial setup (activation code, deactivation code, countdown)
- Steps to perform
- Expected outcome
"""

test_case_response = ollama.chat(model="jeffnyman/ts-reasoner", messages=[
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": first_response},
    {"role": "user", "content": test_case_prompt}
])

print(test_case_response["message"]["content"])
```
Before we move on, here’s something to consider: there are some specific bugs lurking in this version of the bomb defusal application. You might have already discovered some while exploring the app yourself.
The interesting question is: did the AI generate test cases that would help you find these bugs? Look through the test cases it produced. Do any of them exercise the faulty behavior? Or did the AI generate “reasonable-sounding” tests that would all pass, missing the bugs entirely?
This reveals something important about AI-assisted testing: the model can analyze structure and suggest scenarios, but it’s reasoning from the code as written. If the bug is a mismatch between intent and implementation, static analysis alone won’t catch it.
In my output, these were the test cases I got:
1. Test Case: Successful Deactivation Before Detonation
2. Test Case: Deactivation Just Before Detonation
3. Test Case: Incorrect Deactivation Code (During Countdown)
4. Test Case: Multiple Activation/Deactivation Cycles
5. Edge Case: Zero-Second Countdown
Test Case 3 was instructive:
**3. Test Case: Incorrect Deactivation Code (During Countdown)**
* **Initial Setup:**
* Activation Code: 2468
* Deactivation Code: 9012
* Countdown: 30 seconds
* **Steps to Perform:**
1. Enter '2468' (activation code) and press "Activate."
2. Observe the countdown.
3. Enter '9012' (incorrect deactivation code) *before* the countdown reaches zero.
* **Expected Outcome:** The countdown stops at 0:00. The bomb status should change to "Detonated." A confirmation message should be displayed ("Bomb detonated due to incorrect code").
* **Risk:** Misinterpretation of the deactivation process – incorrect code entry should lead to detonation.
* **Falsification:** If the bomb status remains “Active” or “Inactive” after an incorrect deactivation code is entered, or if it doesn’t detonate, the test fails.
This is actually wrong. Per its own setup, the deactivation code is not incorrect; the test is providing the correct one. Also, even if an incorrect deactivation code were used once, that alone would not detonate the bomb. That said, this test case, if carried through correctly, would expose one of the bugs.
Let’s first try a detective mode. Add the following to the bottom of your script:
```python
# Detective Mode
bug_hunting_prompt = """
Now, analyze the JavaScript code in the web app more carefully for
potential bugs or edge cases. Pay special attention to:

- Error handling paths - what happens when user actions fail?
- State cleanup - is data properly reset after failures?
- User experience after errors - can users recover and try again?
- Missing cleanup or reset logic in failure scenarios

Describe any potential issues you find and suggest test cases that
would expose them.
"""

bug_hunt_response = ollama.chat(model="jeffnyman/ts-reasoner", messages=[
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": first_response},
    {"role": "user", "content": test_case_prompt},
    {"role": "assistant", "content": test_case_response["message"]["content"]},
    {"role": "user", "content": bug_hunting_prompt}
])

print("\n=== BUG HUNTING ANALYSIS ===")
print(bug_hunt_response["message"]["content"])
```
My output here was interesting. One part I got was:
1. **Insufficient Error Handling:** The code’s error handling appears rudimentary. It primarily checks for non-numeric input but doesn't handle the situation where the code doesn’t match the activation code *after* it's been entered. This could lead to an unresponsive UI or unexpected behavior when the user tries to deactivate.
Here, the model’s response is partially correct but imprecise. The model correctly identified that there’s insufficient error handling, specifically around the failure paths. The code does handle errors (shows error messages), but doesn’t properly clean up state afterward. However, what the model got wrong is in the specific claim: “doesn’t handle the situation where the code doesn’t match the activation code after it’s been entered.” Actually, looking at the code:
- Lines 765-767: It does handle incorrect activation (shows error message)
- Lines 774-776: It does handle incorrect deactivation (shows error with remaining attempts)
The model recognized there’s a problem in the error handling area, but couldn’t precisely articulate what the bug actually is. It’s like a student who knows something is wrong with a math problem but can’t quite explain what. They’re in the right neighborhood, just not at the exact address.
The real bug is more subtle and I’ll come to that in a moment.
I also got this in my analysis from the model:
2. **Lack of State Reset on Failure:** A critical issue is the lack of a robust mechanism to reset the bomb’s state (timer, activation code) if an error occurs – particularly during input. If the user enters an invalid code, the timer might continue running, leading to a confusing user experience and potential game over scenarios.
This one is more problematic. It’s mostly incorrect with a grain of truth buried in confusing language. Claim 1 on the “lack of a robust mechanism to reset the bomb’s state” is false. Looking at the code:
- Line 623: this.reset() exists and is called on successful deactivation
- Lines 641-644: The reset() method properly clears active and attempts
- The timer has its own stop() method (lines 575-580) that works correctly
The reset mechanisms exist and work fine when called. Notice Claim 2: “if the user enters an invalid code, the timer might continue running.” Well, that’s correct behavior, not a bug. Again, look at the code:
- Line 766: Invalid activation code – bomb stays inactive, timer never starts
- Line 775: Invalid deactivation code – timer keeps running (this is intentional! You’re trying to defuse a bomb; it doesn’t pause while you think)
The timer continuing after a failed deactivation attempt is by design. That’s the whole tension of the game! You have limited time and limited attempts.
Claim 3 (“leading to a confusing user experience”) is subjective and arguable. This is actually standard bomb-defusal game logic. The timer counting down while you frantically try codes is the core mechanic.
So, what’s going on here? Well, the model is pattern matching from common programming issues.
- “Error handling without cleanup” is a common bug pattern ✔
- “State not reset properly” is a common bug pattern ✔
- But it applied these patterns where they don’t actually fit ✘
It’s like a doctor who knows “chest pain + shortness of breath = heart problem” but applies it to someone who just ran up stairs.
Notice here that it’s not just about missing bugs; AI can also invent bugs that don’t exist by over-applying patterns. Domain knowledge also matters. A human tester who understands bomb-defusal game mechanics would know: “Of course the timer keeps running during failed attempts. That’s the game!” Absent a specification, the AI has no such context.
However, there is a tiny grain of truth here. There is a state management issue, but it’s not what the model described. The model completely missed that and instead invented a non-existent problem about timer/state reset.
The Lurking Bug
Here’s the bug: After entering an incorrect deactivation code (or activation code), the display doesn’t clear. Try typing a new code. Nothing happens. You’re stuck until you manually hit Delete. So, for example:
- Click “Provision Bomb” to get to the bomb interface
- Enter 1234 and press “Activate”
- Enter 4444 and press “Deactivate” (invalid code)

If the user now tries to enter another code, they can’t. They have to first delete the existing code.
So, now let’s add a “consultant mode” part to our script.
```python
# Consultant Mode
bug_fix_prompt = """
There's a bug in this application: After entering an incorrect
activation or deactivation code, the code display doesn't clear.
Users cannot type a new code until they manually press Delete.

Looking at the JavaScript code in the web app:
1. Identify where this bug occurs
2. Explain why it happens
3. Suggest a fix
4. Propose a test case that would verify the fix works
"""

bug_fix_response = ollama.chat(model="jeffnyman/ts-reasoner", messages=[
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": first_response},
    {"role": "user", "content": test_case_prompt},
    {"role": "assistant", "content": test_case_response["message"]["content"]},
    {"role": "user", "content": bug_hunting_prompt},
    {"role": "assistant", "content": bug_hunt_response["message"]["content"]},
    {"role": "user", "content": bug_fix_prompt}
])

print("\n=== BUG DIAGNOSIS AND FIX ===")
print(bug_fix_response["message"]["content"])
```
I want to show one output I got for this in its entirety because it’s instructive.
=== BUG DIAGNOSIS AND FIX ===
Okay, let’s tackle this specific bug – the persistent code display after an incorrect code entry.
**1. Identification of the Bug Location:**
Based on the description, the bug likely resides within the `BombUI` class's `handleCodeInput` method (or a closely related function responsible for updating the code display). This is where the code is appended to the `codeStack` array and then rendered to the HTML element. The failure to clear the display happens after an incorrect code is validated.
**2. Explanation of Why It Happens:**
The most probable cause is a missing or incomplete cleanup operation within the `handleCodeInput` method *after* a code is validated as incorrect. Specifically, there's no mechanism to clear the content of the HTML element that displays the code – likely a `<div>` or `<span>` element – when the code is deemed invalid. The code likely just appends the new invalid character to the `codeStack` array without removing the existing content.
**3. Suggested Fix:**
The fix involves adding a clear function to the `handleCodeInput` method. This function should:
1. **Clear the Display:** Immediately clear the content of the HTML element displaying the code using `textContent = '';` or `innerHTML = '';` (depending on how the code is rendered).
2. **Handle Invalid Input:** If the code is invalid (e.g., wrong length, non-numeric characters), trigger the error handling logic (show an error message, etc.).
3. **Update `codeStack`:** Only append the valid code to the `codeStack` array if the code is valid.
Here's a conceptual outline of the modification:
```javascript
handleCodeInput(event) {
const input = event.target;
const code = input.value;
// Clear the display first
this.clearCodeDisplay();
// Validate the code...
if (isValidCode(code)) {
// Append to codeStack and update UI
this.appendCode(code);
input.value = ''; // clear the input field
} else {
// Handle invalid code
this.showError("Invalid code. Please try again.");
}
}
```
**4. Proposed Test Case to Verify the Fix:**
* **Test Case Name:** "Persistent Code Display - Fixed"
* **Steps:**
1. Provision the bomb.
2. Enter an incorrect code (e.g., "123" when the expected format is "12345").
3. Observe the code display – it should remain populated with the incorrect code.
4. **Crucially:** Manually press the "Delete" key within the code display element. The code display should now be cleared.
5. Enter a valid code. Verify that the code displays correctly, and the “Delete” key no longer has an effect.
**Additional Notes for Verification:**
* Inspect the HTML source code to confirm the content of the display element after the error.
* Use browser developer tools (e.g., the console) to check for any error messages or unusual behavior.
To refine the analysis further, providing snippets of the relevant JavaScript code would be extremely helpful. However, this response addresses the specific bug and proposes a focused solution and test case to resolve it.
This response is fascinating: it’s a mix of correct intuition and complete hallucination. Before reading on, see if you can spot the hallucination.
Look at what the model suggested: it invented a handleCodeInput method that doesn’t exist in our code! It assumed we have a text input field that users type into, when we actually have clickable number buttons. The model got the conceptual problem right (“missing cleanup in error paths”) but hallucinated the implementation details. It’s reasoning from generic web form patterns rather than our actual code structure.
It’s like saying to a mechanic: “My car won’t start after I try jumping it incorrectly.” The mechanic responds: “The issue is with your flux capacitor relay. You need to recalibrate the quantum alternator.” Your car doesn’t have those parts. But the general underlying idea (electrical system problem) is somewhat vaguely correct.
If you do your own analysis of the JavaScript code, you might ultimately come to lines 759 through 777 and the codeStack variable. Let me trace through what’s happening in the code. The bug flow is:
- User types 1234 → codeStack = ['1', '2', '3', '4']
- User clicks “Activate” → triggers successfully, codeStack = [] is cleared
- User types 4444 → codeStack = ['4', '4', '4', '4']
- User clicks “Deactivate” → fails validation, shows error BUT…
- The codeStack is never cleared! It still contains ['4', '4', '4', '4']
- User tries to type a new digit → addDigit() checks if (this.codeStack.length < 4) → FALSE, so nothing happens
The root cause is that in toggleBombState(), the codeStack is only cleared on success:
```javascript
if (this.bomb.trigger.activate(code)) {
    this.bomb.timer.start();
    this.codeStack = []; // <-- Only clears on success
    this.updateDisplay();
} else {
    this.showError('Incorrect activation code');
    // Missing: this.codeStack = [];
}
```
The same issue exists in the deactivation path. The fix is that we should clear the codeStack after any activation/deactivation attempt (success or failure), not just on success. This matches expected behavior. After you submit a code, the display should clear so you can try again.
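To see why that one-line difference matters, here is a small Python model of the code-entry logic with the fix applied. The names loosely mirror the app’s JavaScript, but this is an illustrative sketch, not the actual application code:

```python
class CodeEntry:
    """Minimal model of the keypad state: four digits max, cleared on
    every submission attempt (the fix), not just on success."""

    MAX_DIGITS = 4

    def __init__(self, activation_code):
        self.activation_code = activation_code
        self.code_stack = []
        self.active = False

    def add_digit(self, digit):
        # Mirrors the app's guard: ignore input once the display is full.
        if len(self.code_stack) < self.MAX_DIGITS:
            self.code_stack.append(digit)

    def submit(self):
        code = "".join(self.code_stack)
        if code == self.activation_code:
            self.active = True
        # The fix: clear on success *and* failure so the user can retype.
        self.code_stack = []


entry = CodeEntry("1234")
for d in "4444":
    entry.add_digit(d)
entry.submit()          # wrong code: activation fails, stack still clears

entry.add_digit("1")    # with the fix, new input is accepted immediately
print(entry.code_stack)  # prints ['1']
```

Drop the `self.code_stack = []` down into only the success branch and the last `add_digit("1")` silently does nothing, which is exactly the stuck-display behavior in the app.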
I do have a second version of Project Overlord you can check out (along with the code in the repo).
With the fix in place, the test flow now is:
- Type 1234 → Activate (success) → Display clears ✔
- Type 4444 → Deactivate (fail) → Display clears ✔
- Type new code immediately → Works! ✔
A (Potential) Usability Issue
Throughout all the model’s analysis, from test case generation to bug diagnosis, you will likely find that it consistently assumed users would type codes into a text field. But if you actually use the application, you’ll discover you can’t type anything. The interface uses clickable number buttons, like an ATM or phone keypad. The code display is just that: a display. Users click buttons, JavaScript updates the array, and the display reflects it.
What likely happens here is that the model analyzed the HTML structure (<input> elements exist) but didn’t understand the interaction model (buttons → array → display update). This is a fundamental limitation: AI reads code syntax, not user experience. This is why hands-on testing is irreplaceable. Within ten seconds of using the app, a human tester would think: “Wait, why can’t I type? Oh, it’s only button-based input.” The AI never had that realization because it never actually used the application.
And if it never actually used the application, it most definitely didn’t test the application.
I do have a third version of Project Overlord that does allow typing (along with the code in the repo).
Generating Automation Logic
Let’s see if the model can generate some automated execution logic using Playwright. Let’s remove the detective and consultant logic from the script. The modified logic starts at line 43.
````python
import ollama
import requests
from bs4 import BeautifulSoup

url = "https://testerstories.com/files/ai_testing/overlord-001.html"
response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")

prompt = f"""
Here is a web application that simulates a bomb defusal system:

{html_content}

Please analyze this application and describe:
1. What are the main components?
2. What is the sequence of actions a user must take?
3. What are the different states the application can be in?
"""

analysis_response = ollama.chat(model="jeffnyman/ts-reasoner", messages=[
    {"role": "user", "content": prompt}
])

first_response = analysis_response["message"]["content"]

test_case_prompt = """
Based on your analysis, generate test cases that would verify the
bomb defusal logic works correctly. For each test case, specify:
- Initial setup (activation code, deactivation code, countdown)
- Steps to perform
- Expected outcome
"""

test_case_response = ollama.chat(model="jeffnyman/ts-reasoner", messages=[
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": first_response},
    {"role": "user", "content": test_case_prompt}
])

print("=== GENERATED TEST CASES ===")
print(test_case_response["message"]["content"])

playwright_prompt = """
Now, take your first test case (the "happy path" - successful bomb
activation and deactivation) and convert it into executable Python
code using the Playwright library.

Important details about the interface:
- Users cannot type directly into the code display
- Users must click numbered buttons (with class "num-btn" and
  data-value="0" through "9")
- There's a Delete button (id="delete-btn") to remove digits
- The "Provision Bomb" button (inside a form with id="setup-form")
  starts the bomb
- The "Activate/Deactivate" button (id="action-btn") submits the code

Use playwright.sync_api and include:
- All necessary imports
- Clear comments explaining each step
- Proper selectors based on the actual HTML structure
- Assertions to verify expected outcomes (check status text, timer, etc.)

Format it as a complete, runnable Python script.
"""

playwright_response = ollama.chat(model="jeffnyman/ts-reasoner", messages=[
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": first_response},
    {"role": "user", "content": test_case_prompt},
    {"role": "assistant", "content": test_case_response["message"]["content"]},
    {"role": "user", "content": playwright_prompt}
])

print("\n=== PLAYWRIGHT AUTOMATION CODE ===")
print(playwright_response["message"]["content"])

playwright_code = playwright_response["message"]["content"]

# Extract just the code (remove any markdown formatting)
if "```python" in playwright_code:
    code_start = playwright_code.find("```python") + 9
    code_end = playwright_code.find("```", code_start)
    clean_code = playwright_code[code_start:code_end].strip()
else:
    clean_code = playwright_code

with open("test_bomb_defusal.py", "w") as f:
    f.write(clean_code)

print("\n=== CODE SAVED TO test_bomb_defusal.py ===")
````
What this should do is generate some Playwright logic and write test_bomb_defusal.py to the same directory as the model script we’ve been writing.
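The fence-stripping logic in the script is deliberately naive. If you want something slightly sturdier, a small regex-based helper (hypothetical, not part of the script above) can handle responses with or without a fenced code block:

````python
import re

def extract_code_block(text):
    """Pull the first fenced code block out of a model response, falling
    back to the raw text when no fence is present (models don't always
    wrap code in markdown)."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()
````

This copes with a bare ``` fence as well as a ```python one, and with responses that skip the fence entirely.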
To run this, you will need to install Playwright:
python -m pip install playwright
Then install the browsers:
playwright install
This downloads Chromium, Firefox, and WebKit. If you only want, say, Chromium, you can do this:
playwright install chromium
Note that you might have to tell it the page you want to load as it may not get that part correct. For example, my script generated this:
```python
page.goto("file:///path/to/your/bomb_app.html")
```
I’m going to show you the script that was generated for me:
```python
from playwright.sync_api import sync_playwright
import time

# Function to execute the test case
def run_test_case_1():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # Keep browser open for inspection
        page = browser.new_page()

        # Navigate to the application's home page (replace with actual URL)
        page.goto("https://testerstories.com/files/ai_testing/overlord-001.html")

        # 1. Enter Activation Code ("1234")
        page.locator("#activation-code").click()  # Click on the input
        page.locator("text=1").click()
        page.locator("text=2").click()
        page.locator("text=3").click()
        page.locator("text=4").click()

        # 2. Enter Deactivation Code ("5678")
        page.locator("#deactivation-code").click()
        page.locator("text=5").click()
        page.locator("text=6").click()
        page.locator("text=7").click()
        page.locator("text=8").click()

        # 3. Click "Provision Bomb"
        page.locator("#setup-form").click()

        # Wait for bomb to start (check that the status text changes)
        time.sleep(2)  # Adjusted wait time for better reliability
        assert page.locator("#bomb-status").text() == "Bomb is Active", "Bomb did not activate correctly."

        # 4. Enter Deactivation Code again ("5678")
        page.locator("#deactivation-code").click()
        page.locator("text=5").click()
        page.locator("text=6").click()
        page.locator("text=7").click()
        page.locator("text=8").click()

        # Wait for bomb to detonate
        time.sleep(5)  # Wait for the timer to reach 0

        # Assert that bomb has detonated
        assert page.locator("#bomb-status").text() == "Bomb Defused!", "Bomb did not detonate correctly."

        # Close the browser
        browser.close()

if __name__ == "__main__":
    run_test_case_1()
```
This produces some errors when it runs. I won’t elaborate on all of those here, but what I did to fix it was this:
```python
from playwright.sync_api import sync_playwright

def run_test_case_1():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        page.goto("https://testerstories.com/files/ai_testing/overlord-001.html")

        page.fill("#activation-code", "1234")
        page.fill("#deactivation-code", "5678")
        page.fill("#countdown-duration", "30")

        page.click("button.btn-danger")  # or page.click("text=Provision Bomb")

        page.wait_for_selector("#bomb-status")

        page.click("button[data-value='1']")
        page.click("button[data-value='2']")
        page.click("button[data-value='3']")
        page.click("button[data-value='4']")

        page.click("#action-btn")

        page.wait_for_selector("#bomb-status.active")
        assert "ACTIVE" in page.locator("#bomb-status").inner_text()

        page.click("button[data-value='5']")
        page.click("button[data-value='6']")
        page.click("button[data-value='7']")
        page.click("button[data-value='8']")

        page.click("#action-btn")

        page.wait_for_selector("#bomb-status.inactive")
        assert "INACTIVE" in page.locator("#bomb-status").inner_text()

        print("✓ Test passed!")

        browser.close()

if __name__ == "__main__":
    run_test_case_1()
```
When AI-Generated Code Meets Reality
The AI produced what looked like a reasonable Playwright script: proper syntax, appropriate methods, even helpful comments. But when I ran it? Immediate failure. This is where human testing expertise becomes essential. The AI made assumptions about the application that seemed logical but were wrong:
- It assumed the setup inputs worked like the bomb interface (they don’t; one uses form fields, the other uses clickable buttons)
- It chose vague selectors that matched multiple elements
- It tried to interact with UI elements before they existed on screen
A human tester, after spending just thirty seconds actually using the application, would immediately notice: “Oh, I can’t type in this field; I have to click buttons.” The AI analyzed the HTML structure but never experienced the actual user flow.
A logical question is: Could better prompting have helped? Possibly. We could have been more explicit: “The setup screen uses input fields you type into, but the bomb screen uses clickable number buttons.”
Another logical question is: Would a more advanced model do better? Maybe. Vision-capable models that can see screenshots might recognize the button-based interface. Models with better reasoning might catch the phase transitions.
Yet, here’s the key insight: the model got us approximately 80% of the way there in seconds. It understood Playwright syntax, identified the main workflow, and structured a test. What it couldn’t do was bridge the gap between code structure and actual behavior. That’s where human judgment turned a broken script into a working test.
This, I would argue, is augmented testing in action: AI provides the scaffolding, humans provide the understanding. Neither replaces the other; each amplifies what the other does well.
A Crucial Caveat: We’re Working with Training Wheels
It’s worth emphasizing: this bomb defusal application is deliberately simple. All the HTML, CSS, and JavaScript live in a single file. There are no external dependencies, no build processes, no API calls, no database interactions. The entire state machine fits in a few hundred lines of readable code.
And even with this simplicity, the AI struggled with:
- Understanding the multi-phase UI flow
- Choosing correct selectors
- Recognizing interaction patterns
Real-world applications are vastly more complex:
- Styles split across multiple CSS files and frameworks
- JavaScript bundled and minified across dozens of modules
- Dynamic content loaded from APIs
- Authentication flows and session management
- Shadow DOM, iframes, and other encapsulation
- Third-party widgets and components
- State management libraries (Redux, Vuex, etc.)
When you feed an AI a modern single-page application with thousands of lines of TypeScript, React components, and asynchronous data fetching, the analysis becomes exponentially harder. The AI might correctly identify some components but completely miss how they interact, when they render, or what state changes trigger which behaviors.
This particular article uses a toy application intentionally: not to make AI look good, but to isolate the core concepts of AI-assisted testing in a controlled environment. If the AI struggles here, imagine the challenges with production code.
The skills you can learn from this (prompting effectively, critically evaluating AI output, debugging generated code, understanding when to trust and when to verify) become even more crucial as application complexity increases.
Next Steps
In the next post, we can iterate on this example by considering how, or whether, things change when we have a specification to go off of. We can also switch things up a bit and see if changing the model helps.