If you are going to have an AI that “does testing” — as opposed to some other activity like analysis or pattern recognition — you are going to have to move from a focus solely on perception and add the dimension of actions. I’m finding a lot of folks promising “AI-based testing tools” or those eagerly hoping for them are very much confusing this distinction. So let’s talk about it and, as we do, let’s create an illustrative example.
Before getting started, I hope you’ll forgive me standing on a soapbox for one second. We need more people exploring these ideas out in the open, making open source solutions available. Too much of this work is being done behind closed doors and that’s counter-productive to all of us learning as a community. I realize tool vendors have no incentive to do this publicly since they want to make money on their solutions. That’s totally understood. But some of the core principles need to be explored and provided so that we can all experiment together.
With that digression, let’s focus on why I said we need to move from mere perceptions (of data, for example) to actions.
Agents Taking Actions
Actions are something that influence an environment. That is what testing does. Testing is an active process of exploring an environment, determining how it responds to various conditions. Testers learn from those interactions. So if we are talking about an AI tool that “does testing,” it must, by definition, learn. And it must do so by taking action and being able to process feedback from those actions.
This is where “machine learning” comes into the picture, ostensibly to mirror some aspect of human learning.
But what kind of learning?
Types of Learning
As their first port of call, a lot of people travel to supervised and unsupervised learning. In supervised learning, the overall goal is to use an existing mapping between input and desired output to better apply that mapping to situations where the output is missing. Unsupervised learning tries to find similarities and differences between data points and, rather than using a pre-existing mapping, works to discover the mapping.
Reinforcement learning, on the other hand, is a type of machine learning that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. Unlike supervised or unsupervised learning, reinforcement learning more closely models how a human responds because it uses a concept of rewards, both positive and negative, as feedback to modify behavior.
Reinforcement learning, rather than using or discovering mappings, tries to find a suitable action model — a policy — that maximizes the total cumulative reward of the agent.
So let’s agree for now that if we want an AI tool to truly model testing, what we want to do is focus on an agent using some sort of process to make decisions in an environment.
Formalizing the Learning Process
There is a concept called a Markov decision process (MDP) which is an approach in reinforcement learning to take decisions in a specific type of environment, often called a gridworld.
A gridworld environment consists of states in the form of grids. An MDP tries to capture an environment in the form of a grid by dividing it into states, actions, transitions (between states), and rewards. The solution to an MDP is called a policy and the objective is to find the optimal policy for whatever the MDP task is.
So the question might be: can we treat an application that we’re going to test as a gridworld? Can we model this such that an AI test agent is taking actions in this gridworld and trying to find a policy that will allow them to accomplish a task?
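To make that question concrete, here is a minimal sketch of an MDP for a one-dimensional, ten-state gridworld. The function names, positions, and reward values are illustrative assumptions for this article's example, not part of any library:

```python
# A minimal MDP sketch for a one-dimensional gridworld.
# All names and positions here are illustrative assumptions.

# States: ten numbered grid positions.
states = list(range(1, 11))

# Actions: the only moves available in one dimension.
actions = ["Left", "Right"]

def transition(state, action):
    """Deterministic transitions: moving off the grid leaves you in place."""
    if action == "Left":
        return max(state - 1, 1)
    return min(state + 1, 10)

def reward(state, action):
    """Rewards: +500 for reaching the goal tile, -500 for the trouble tile.
    Tile 7 as goal and tile 1 as trouble are assumed positions."""
    successor = transition(state, action)
    if successor == 7:
        return 500
    if successor == 1:
        return -500
    return 0
```

The point of the sketch is just that four small pieces (states, actions, transitions, rewards) are enough to describe the whole environment.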
My Plan of Attack
I’m going to take you through a gridworld here. First I’ll talk about it purely conceptually. Then I’ll show it to you visually. Finally we’ll look at some code that implements it. As we do this, I’m going to risk complicating this a bit by asking you to think of the application as a type of game. Not an actual game, per se, but at least subject to the same kind of thinking as games.
To foster that, I’m going to describe the application and how that maps to another concept: that of Pac-Man, a very traditional grid-style environment.
This may seem like an odd approach. I’m doing this because a lot of reinforcement learning these days is focused on agents playing games, such as AlphaGo playing Go, or DeepMind playing Atari 2600 games, or CherryPi playing Starcraft. This can make it seem like reinforcement learning is limited to that venue. I’m hoping that by drawing a direct correlation between an application and a game you can see why the reinforcement learning approach is relevant.
Let’s Get Conceptual
Let’s say we have a one-dimensional gridworld. This gridworld is made up of spaces that are numbered. This will be our application. We’re going to explore this environment. That exploration will be fueled by testing.
- Application: You will start in this environment as a User. In this environment is a Feature that you can use. If you do, you get value from the application. However, there is a Bug in this environment. If you run into this Bug, you get booted from the application, deriving no value from the application.
- Pac-Man: You will start in this environment as Pac-Man. In this environment is a solitary Food Pellet that you have to eat. If you do, you win the game. However, there is a Ghost in this world. It doesn’t move around, luckily, but if you run into it, you will lose the game.
So there’s our setup. Let’s add a bit more to this:
- If you eat the pellet or find the feature, you get a reward of 500.
- If you run into the ghost or the bug, you get a negative reward (punishment) of -500.
You are building a test AI agent that will explore each environment and try to do the right thing.
Now, I haven’t shown you the gridworld environment. But I’ve said that it’s one dimensional so let’s say you only have the option of going left or right. Given that, which way will you have your agent go? Which way should the agent explore?
Well, in lieu of any guidance, clearly your agent simply has to try it out, right?
I’m going to introduce a constraint here. Until the agent encounters the Ghost / Bug or the Food Pellet / Feature, it doesn’t know of any rewards. Thus what we have is a case of delayed rewards. Delayed rewards are, of course, different from immediate rewards. In the case of this environment, nothing tells the agent the “right” direction: after every step there is no reward at all, so the agent gets no hint as to whether it is heading in the right or wrong direction.
That constraint could be relaxed, of course, but for now let’s go with that.
Given all that, how do we model this situation as something that an algorithm can learn?
Well, consider this: all of the gridworld locations (states), the actions (of which there are only two: left, right), and the rewards (positive and negative) can be modeled by something called a Markov process, which I talked about earlier.
A Markov process can be understood as a collection of states S with some actions A possible from every state with some probability P. Each such action will lead to some reward R. It should be noted that the reward can be zero. The probability part comes in because in some gridworlds there may be only a probability, not a certainty, that a given action will take place the same way each time.
For example, let’s say that moving right will in fact cause Pac-Man to move right 90% of the time, but 10% of the time he will move left. This is a randomness inherent in the agent. Another example might be that the environment states periodically become “holes” that Pac-Man can fall into. So even if moving right would lead to a safe tile, in some cases it might not. This is a randomness inherent in the environment.
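That 90/10 slip can be sketched as a stochastic transition function. The function name and the probabilities are just this example’s assumptions:

```python
import random

def noisy_move(position, intended_direction):
    """Move in the intended direction 90% of the time; otherwise slip
    the opposite way. Directions: -1 for left, +1 for right."""
    if random.random() < 0.9:
        return position + intended_direction
    return position - intended_direction
```

Run this from tile 5 with an intended direction of right many times and roughly nine out of ten moves land on tile 6, with the rest slipping back to tile 4.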
In the case of an application, we can imagine cases where what the user action does doesn’t actually get recognized. They swipe the screen, but it doesn’t work. Or they click a button but it doesn’t submit. We can also imagine a workflow being stopped up by an error, such as 404 or an application crash.
The key point here is that a Markov decision process is used for modeling decision making in situations where the outcomes can be partly random and partly under the control of a decision maker, i.e., the agent. And that can happen whether the agent is a machine or a human.
Let’s Get Visual
So we want to build an agent to traverse our gridworld of ghosts / bugs and food pellets / features and we want to do this as a human would. What would a human do? Well, a human would presumably just try things out. Here’s an example of what your gridworld would look like visually:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|   |   |   |   |   |   |   |   |   |    |
This is the case for either Pac-Man or the application. The idea is that each numbered location is a particular state.
Note here I’m not saying where you, as Pac-Man or User, start off in the world. I’m also not showing where the Ghost / Bug is. Nor am I showing where the Food Pellet / Feature is. In fact, as Pac-Man or a User, you wouldn’t necessarily even have the above map of the environment. Yet, you certainly could, in some situations. A human playing a Pac-Man game, for example, would see the entire gridworld. A human using an application may know the workflow that a given set of actions would provide.
But here’s the key thing if you are building an AI test tool that learns: the agent would not know all this initially. That’s an interesting interplay when you try to get algorithms to learn like a human. You have to consider how the agent observes the environment and whether they have the same observations that the human does. The same might apply to testing an application. I, as a tester, may simply be exploring to see what I can find without having any sort of “map” to guide me.
To make this simple, I’m going to streamline a few indicators here:
- I’ll refer to the agent as A and that can refer to Pac-Man or a User, as you prefer.
- I’ll refer to the Food Pellet or Feature as F.
- I’ll refer to the Ghost and the Bug as T, for “trouble.”
So our agent might start off in, say, space 2. The agent’s view of the environment would be just this:
| 2 |
| A |
The agent wouldn’t even know how big the environment is. And that’s important! After all, you could spend a long time trying to find that F, right? So long that you might wonder if it even exists.
TESTER! Are you starting to think that my F (“feature”) and T (“bug”) are being reversed here? Is the agent “looking for” the feature or the bug? If you have been feeling that, excellent! Your instincts are spot on. I’ll revisit this.
For now, how does our agent act? Well, how would a human act in this situation? A human would simply have to try things out, going left and right. Implicitly what a human is doing is deriving a policy. This policy tells them what to do and when. So the human might say “I’m going to go right.” Here’s the map.
| 2 | 3 |
|   | A |
The human says “Okay, I’m going to go right again.”
| 2 | 3 | 4 |
|   |   | A |
So far the policy of “going right” is working. Let’s try it again.
| 2 | 3 | 4 | 5 |
|   |   |   | A |
Here the human agent is exploiting a policy that seems to work. But it’s also not leading to much.
Of course, that depends on what you’re looking for, right? If what those steps above did was go through a workflow in the application, then that is leading to a lot: it’s showing me a successful interaction with the feature. For Pac-Man who just wants to win the game, you could argue it hasn’t done much since the food pellet hasn’t been found but, on the other hand, Pac-Man hasn’t gotten killed yet and he does know more of where the food pellet is not.
Going back to our human, who perhaps feels they are wasting time, maybe the human wants to switch to explore mode instead. So the human decides to go left instead. Now, of course, the human has to traverse back over the ground already covered. So let’s say the human goes left three times to end up back at the start.
| 2 | 3 | 4 | 5 |
| A |   |   |   |
Now the human says “I’ll go left again.”
| 1 | 2 | 3 | 4 | 5 |
| T |   |   |   |   |
And lo and behold the Trouble (Ghost or Bug) is there! The agent has a (negative) reward of -500.
TESTER! See what happened? In some situations, finding the “Trouble” leads to a negative reward. For Pac-Man that would be a situation to avoid. But for a tester, that’s a situation to seek out, right? This is a crucial aspect of understanding what an “AI test tool” would have to consider.
Think of the policy as a revealed map of the environment. The human agent may now think, “Well, it looks like the Trouble spot exists on grid state 1. So if I’m in grid state 2, I should definitely not go left.” Maybe that’s a good policy. Unless, of course, with each gridworld the position of the Trouble is randomized, transient, or changes due to other factors. But, in general, given the first training episode, the agent has figured out a policy.
Clearly the better the policy, the better are the chances of the agent “doing the right thing.” Assuming that we state what the “right thing” is, of course. For Pac-Man that’s winning the game and thus not running into the Trouble (Ghost). For a tester, that’s “winning the game” by finding bugs and thus most definitely running into the Trouble (Bug).
TESTER! But there is also value in exploring the application and going to areas where there are no bugs. The reward there is finding no bugs (regression). See how this can get complicated?
For any agent, the quality of its policy will improve with training and keep on improving. Thus, as agent designers, we want our agent to keep learning a more refined policy. What we want is a particular type of quality learning. Let’s call that Q Learning.
From Theory to Model
Artificial intelligence and machine learning are often about turning theory, like the above, into a model. One part of that model was the Markov process. But another part of that model is about the learning. Here we’ll consider a particular algorithm that is based on a Bellman equation, of which I’ll provide a very simplified account:

Q(s,a) = r + γ × max(Q(s',a'))

Here Q(s,a) is the current policy value of taking action a from state s, while r is the reward for the action. But what is that max(Q(s',a')) part?
This defines the maximum future reward. So say the agent took action a at state s to reach state s’. From here we may have multiple actions, each corresponding to some rewards. The maximum of that reward is computed.
Also, γ is what’s called a discount factor. This value varies from 0 to 1. The nearer the value is to 0, the more immediate rewards are preferred. The nearer the value is to 1, the more important future rewards become. If the value is exactly 1, immediate rewards and future rewards are valued equally.
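As a quick worked example of the discount factor, here is the present value of a reward of 500 that sits some number of steps in the future (the helper name is my own, for illustration):

```python
def discounted(reward, gamma, steps):
    """Present value of a reward received `steps` steps in the future."""
    return reward * gamma ** steps

# With gamma = 0.9, a reward of 500 three steps away is worth about 364.5 now.
print(discounted(500, 0.9, 3))
# With gamma = 0.0, any future reward is worth nothing.
print(discounted(500, 0.0, 3))
# With gamma = 1.0, distance doesn't matter at all.
print(discounted(500, 1.0, 3))
```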
Now, why am I showing you this?
Our AI test agent has to be running that equation in its “head” for each action it takes. And for each action it takes, it stores up the result of the calculation. This creates a sort of policy table. In the context of Q Learning we’ll call it a Q Table and that table holds Q Values. Those values are simply the result of the calculation carried out each time the agent takes an action and learns something about its environment.
Let’s consider our above case. There are no points given for the agent moving around. So what table might have been built up?
| s |    | s' | r    |
|---|----|----|------|
| 2 | -> | 3  | 0    |
| 3 | -> | 4  | 0    |
| 4 | -> | 5  | 0    |
| 5 | -> | 4  | 0    |
| 4 | -> | 3  | 0    |
| 3 | -> | 2  | 0    |
| 2 | -> | 1  | -500 |
This shows the agent moving from one state (s) to another (s’) and getting a reward (r).
While I’m showing rewards of 0 here, do keep in mind the max() part of the equation we looked at before. This is taking the maximum value over a series of other actions. So the immediate reward from any state will include some factor — determined by the discount — of the maximum reward of all the actions possible from that state.
This doesn’t help the agent at all when the map is unknown. But remember that carrying out the policy reveals the map. So eventually the agent will start to learn that actions that move in the “right” direction do have a slight future reward (beyond 0) even though the immediate reward is 0. This is how agents “learn” how to distinguish better and worse actions. This is exactly how humans deal with delayed rewards and it is very much how testers have to “buffer” rewards as they are testing an application.
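To see that propagation concretely, here is a toy sketch, separate from the full script: repeated value backups over a five-tile strip where only the final step into the goal pays 500 and every other step pays 0. With a discount of 0.9, the reward leaks backward through the zero-reward tiles (all values here are illustrative):

```python
# A strip of five tiles; only the move off the last tile pays 500.
gamma = 0.9
values = [0.0] * 5   # the agent's estimated value of standing on each tile

# Repeat the backup until values settle. Each tile's value becomes its
# immediate reward plus the discounted value of the tile to its right.
for _ in range(100):
    for tile in range(5):
        immediate = 500 if tile == 4 else 0
        future = values[tile + 1] if tile < 4 else 0
        values[tile] = immediate + gamma * future
```

After convergence the last tile is worth 500, the one before it 450, and so on down to roughly 328 at the far end: every tile now “knows” which direction the eventual reward lies in, even though no interior step ever paid anything.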
I should note that in the case of our gridworld, the probability of all actions is 1. Meaning, whatever action the agent takes, that is the action that occurs. Further, states themselves do not change. The Trouble spot does not move and the environment is static. So there is no randomness or uncertainty in the agent or the environment.
A Working Implementation
Let’s actually make a program out of this. If you have Python, you can create the following script and run it. I’ve kept it reasonably well commented for learning purposes.

```python
import random

# Here is an example environment.
#
# | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
# | G |   |   |   |   |   | . |   |   |    |
#
# The environment indicates the number of states, so 10 entries in the
# array. It also indicates the number of possible actions for each state.
# Thus each element of the array is an array of 2 entries. These two
# entries correspond to the actions of "Left" and "Right". So an action
# of "Right" from tile 6 or an action of "Left" from tile 8 has a high
# reward of 500 because those both lead to tile 7, which is where the
# one and only dot is. An action of "Left" from tile 2 leads to the
# ghost and so it has a large negative reward. Going "Left" from tile
# 1 or "Right" from tile 10 has a value of None as there are no tiles
# in either direction.
environment = [
    [None, 0],   # 1 (the ghost is here)
    [-500, 0],   # 2
    [0, 0],      # 3
    [0, 0],      # 4
    [0, 0],      # 5
    [0, 500],    # 6
    [0, 0],      # 7 (the dot is here)
    [500, 0],    # 8
    [0, 0],      # 9
    [0, None],   # 10
]

# The q table indicates the knowledge of the agent regarding the moves
# from each state. We know the rewards based on the actions because we
# created the environment table. But the agent will not have that
# knowledge at the start, so every entry begins at zero.
q_table = [[0, 0] for _ in environment]


def game_over(position):
    # The episode ends at the ghost (state 1) or the dot (state 7).
    win_loss_positions = [0, 6]
    return position in win_loss_positions


def get_possible_actions(position):
    # An action is possible only if it doesn't lead off the grid.
    step_matrix = [x is not None for x in environment[position]]
    possible = [action for action, ok in enumerate(step_matrix) if ok]
    return possible


def get_successor_state(position, agent_action):
    # Action 0 is "Left"; action 1 is "Right".
    if agent_action == 0:
        return position - 1
    return position + 1


# A discount factor is always set between 0 and 1. This models the fact that
# future rewards are worth less than immediate rewards. This value represents
# how much future events lose their value according to how far away in time
# they are. To prioritize rewards in the distant future, keep the value
# closer to one. A discount factor closer to zero indicates that only
# rewards in the immediate future are being considered.
discount = 0.9

# A learning factor is always set between 0 and 1. Setting it to 0 means that
# the Q values are never updated, hence nothing is learned. Setting a high
# value such as 0.9 means that learning can occur quickly.
learning_rate = 0.1

# Here is where the agent trains over a series of episodes. One episode lasts
# until the game has ended in either victory or defeat.
for episode in range(1000):
    # Get a starting location, at random.
    state = random.choice([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

    # Loop while the goal state has not been reached.
    while not game_over(state):
        # Get all possible actions from the current state.
        possible_actions = get_possible_actions(state)

        # Select any one action randomly. Remember: the agent has no idea
        # which action is better than any other at the start. So a random
        # strategy is no better or worse than any other. This agent is not
        # guided by a heuristic.
        action = random.choice(possible_actions)

        # Find the next state corresponding to the action selected.
        successor = get_successor_state(state, action)

        # Update the q-table based on what has been learned. This is the
        # simplified Bellman equation from earlier, tempered by the
        # learning rate.
        q_table[state][action] = (q_table[state][action] + learning_rate *
                                  (environment[state][action] + discount *
                                   max(q_table[successor]) -
                                   q_table[state][action]))

        # Go to that next state.
        state = successor

    print("Episode", episode, "done.")
```
Here is the result of that execution:
I’ve stylized the results a bit by providing colors.
This output should be interesting to you. In a very simple bit of code, we have had an agent create a policy. This policy essentially figured out that if you find yourself in any particular state, choose the action that has the higher value. In the graph, that would be anything with a darker shade of green. But the agent has also formed a policy that says to be wary of those states that are shading away from green and outright avoid those states that are red.
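If you want to read the policy out of the table without the colors, you can simply take, for each state, the action with the higher Q value. Here is a sketch using illustrative Q values, not the actual numbers from the run above:

```python
# Reading a greedy policy out of a Q table. Each entry below is an
# illustrative [value of Left, value of Right] pair for one state.
q_table = [
    [0.0, 328.0],     # state 1: Right looks better
    [-500.0, 364.0],  # state 2: Left leads to Trouble, so avoid it
    [328.0, 405.0],   # state 3
    [364.0, 450.0],   # state 4
    [405.0, 500.0],   # state 5: Right reaches the goal
]

def greedy_policy(q_table):
    """For each state, choose the action with the higher Q value."""
    names = ["Left", "Right"]
    return [names[row.index(max(row))] for row in q_table]

print(greedy_policy(q_table))
```

With values like these, the policy comes out as “Right” everywhere, which is exactly the darker-green column you would see in the colored display.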
This was a very simple example, along a one-dimensional gridworld. Imagine the complexities of moving up to a two-dimensional gridworld. Or a three-dimensional one. Or where there were more conditions.
- In the case of Pac-Man, perhaps there were more ghosts. And perhaps they moved around. Perhaps there were more food pellets. But there are also perhaps power pellets that Pac-Man can eat that temporarily render the ghosts harmless.
- In the case of an application, perhaps there are many features that intersect. Any interaction can lead to a possible bug. Some bugs may only occur with certain data conditions. Some bugs may be gated by state or by time. Maybe certain configuration settings can influence how the environment responds.
TESTER! Again, please note, the above visual display of colored numbers may be reversed if you consider finding the bug a good thing. But is it then equally true that not finding a bug is a negative reward? After all, not finding a bug could be because the feature is working. And we do want the agent to verify that right? So what does an AI agent do with such rewards? How does an AI testing agent modulate its policy?
The Promise of AI Testing?
What I hope you can see is that those promising “AI-based testing tools” had better be able to reason about exactly what I’m talking about here.
Usually what they’re talking about is not “AI-based testing” but rather aspects of testing, such as using machine learning algorithms to read logs, looking for anomalies or correlations. Others are suggesting visual recognition such that an agent can learn the states of an application by looking at screen images and seeing what is and isn’t present on the image.
Note, of course, that learning what states an application can be in visually is not really learning the application.
A key point for me here is not to lambast the above ideas but rather to suggest that “AI-based testing” has a much wider remit than those aspects.
Fundamentals of Testing
We start getting to some fundamentals here because, if you think about it, the discipline of testing is about recognizing that we use agents to act in environments. Those agents use techniques to understand that environment. Sometimes those agents are humans; sometimes those agents are tools that support the humans.
Agents and their techniques have certain limitations. Sometimes those limitations are imposed by the agent, sometimes by the technique itself, or sometimes by the environment. Sometimes all three at once! Sometimes certain techniques or tools, when used alone, are insufficient but gain when they are combined.
All of this work is exploration, experimentation, and investigation. And it leads to discoveries, some of which can become insights. And there is a feedback loop. Those discoveries and insights can feed back into the agents and the techniques, refining both.
There is a healthy intersection between the discipline of testing as carried out by humans and the way AI tools could support what humans do. We do, however, have to overcome some of the hype and hyperbole and start asking serious questions about what that intersection looks like.
We also have to better start challenging tool vendors who are in danger (intentionally or not) of inflating their claims as they try to ride a wave of interest in how AI will complement humans in the art of delivering software that provides value.