I recently talked about a focus on being able to test an AI before you trust an AI to test for you. Here I want to focus a bit more on how worthwhile that idea might be. But my goal here is not to dampen the spirits of those who want to build such tools; rather I want to suggest some of the challenges and provide a bit of the vocabulary. I want to give you a way to frame the current situation with AI and its value as a test-supporting technology.
This framing is important because we are getting to a point where tool vendors are starting to promise quite a bit about what AI can do in the context of testing. A lot of AI talks are popping up at conferences and, quite frankly, it's clear from many of these that those giving the talks have only a passing knowledge of the complexities of AI and machine learning.
There's one thing I need to make very clear: when we talk about an AI actually testing your application, we can mean one of three contexts: (1) testing as an analysis activity, (2) testing as a design activity or (3) testing as an execution activity. Here I am solely talking about the execution part. This is not about designing test cases or the much more problematic idea of an AI putting pressure on design. This is also not about having an AI look at test results.
So in the context of execution, the test-supporting part of this has to be about the AI actually learning the application. If there isn't a learning component, then the AI is arguably no better than current automation approaches. And that's a key point I think you need to keep in mind. As you read this post, ask yourself if it even makes sense to develop an AI to do certain of these tasks. Or do our current test-supporting tools do fine as they are?
Terms and Concepts
So first of all let's get the terminology right. Artificial Intelligence is the science concerned with intelligent algorithms, whether or not they learn from data. Machine Learning is a subfield of AI devoted to algorithms that learn from data. Deep Learning is a subset of ML that puts a focus on neural networks.
All of this learning can fall into multiple categories. For example, in “supervised” learning, an algorithm is presented with a set of prior, labeled examples and has the benefit of identifying associations between the data and the labeled outcome, or classification. In “unsupervised” learning, prior sets of labeled examples are not available, but unlabeled, or uncategorized, data are.
I’m going to restate these areas a bit differently here.
Supervised and Unsupervised
Supervised learning is learning from a training set of labeled examples provided by a knowledgeable external supervisor. Each example is a description of a situation together with a specification — the label — of the correct action the system should take in that situation, which is often to identify a category to which the situation belongs. The object of this kind of learning is for the system to extrapolate, or generalize, its responses so that it acts correctly in situations not present in the training set.
Unsupervised learning, by contrast, is typically about finding structure hidden in collections of unlabeled data. Put another way, this form of learning does not rely on examples of correct behavior.
I bring these up because a lot of people touting AI for test tooling suggest supervised learning and unsupervised learning. For learning an application — its GUI widgets, for example — does that really make a lot of sense? Currently we use various types of automation to recognize a page or screen by its widgets. And none of that requires AI at all.
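To make that concrete, here is a minimal sketch of the kind of widget-based recognition our existing tools already do. It assumes Selenium, a placeholder URL, and hypothetical element IDs rather than the real locators of any particular application:

```python
# A sketch of recognizing a screen by its widgets with no AI involved.
# The URL and element IDs below are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

def looks_like_login_screen(driver):
    """Treat the screen as 'login' when all of its expected widgets are present."""
    expected_ids = ("username", "password", "login-button")
    return all(
        len(driver.find_elements(By.ID, element_id)) > 0
        for element_id in expected_ids
    )

driver = webdriver.Chrome()
driver.get("https://example-app.test/login")
print("Recognized login screen:", looks_like_login_screen(driver))
driver.quit()
```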
What we really want is something that can operate like a human. After all, when we test, we test like a human. We test via usage patterns that match how a human would use our application or service. Humans will learn to use our application, sometimes correctly, sometimes incorrectly. And so that’s likely what we want to mimic.
While both supervised and unsupervised approaches could arguably be called learning, you'll note in many of my articles that I've focused on aspects of reinforcement learning instead. And there's a reason for that. What sets reinforcement learning apart from supervised and unsupervised learning is the emphasis on maximizing a reward signal. Another key difference is that reinforcement learning explicitly considers the whole problem of a goal-directed agent interacting with a (possibly uncertain) environment.
I’ll come back to this in a bit.
Black Boxes, Models and Learning
Let's take a quote from The Sentient Machine:
Today’s most successful machine learning systems—deep learning systems—make use of neural networks, massive collections of statistical weights, and activation functions. To the human eye, these are essentially jumbles of numbers that are constantly adjusted to account for new data. In these structures, knowledge and learning are represented in ways mostly indecipherable to human observers. Thus, the criticism follows, these systems appear to present a sort of “black box” that is immune to human introspection and analysis.
That should be worrisome. After all, introspection and analysis are a large part of what testing is all about. So we don't want aspects that are immune to those very things. This applies whether we're testing an AI or using an AI to help us with testing.
If you read my previous post with the Flappy Tester, you'll see that I was able to provide a model for you that generated the behavior you wanted. Or, alternatively, you could have started from scratch and done the training yourself, thus building the model. But the fact that you can go from no model at all to a complete model should be a bit concerning, particularly if you have no idea how the system made that leap so easily.
So let's go back to my Flappy Tester approach. Suppose you want to teach a neural network to play that game. That's in fact what that post showed you. Input to your network was screen images and output was basically just "flap." Yes, the output was a decision: whether or not to flap, which means to accelerate the bird by flapping its wings. It would make sense to treat that as a classification problem – for each game screen you have to decide whether you should flap. Or, rather, when you should flap versus when you shouldn't. Sound straightforward?
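Straightforward enough to sketch, at least. Here is roughly what that classification framing could look like, as a minimal illustration. It assumes PyTorch, 80x80 grayscale screens, and a supply of already-labeled examples; it is not the actual Flappy Tester code:

```python
# A sketch of the classification framing: each screen is an input and the
# label is flap (1) or don't flap (0). Assumes PyTorch and labeled screens.
import torch
import torch.nn as nn

# Pretend each screen arrives as an 80x80 grayscale image.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(80 * 80, 256),
    nn.ReLU(),
    nn.Linear(256, 2),  # two logits: don't flap, flap
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(screens, labels):
    """One supervised step: screens (batch, 1, 80, 80), labels (batch,) of 0/1."""
    logits = model(screens)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```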
Sure, but then you need training examples, and lots of them. Of course you could go and record game sessions using expert players, but that's not really how we learn. We don't need somebody to tell us a million times which move to choose at each screen. We just need occasional feedback that we did the right thing and can then figure out everything else ourselves. And that's largely what my Flappy Tester approach showed you. That was learning.
This is the task reinforcement learning tries to solve. Reinforcement learning lies somewhere in between supervised and unsupervised learning. Whereas in supervised learning you have a target label for each training example and in unsupervised learning you have no labels at all, in reinforcement learning you have often-sparse and time-delayed labels. Those are the rewards. Based only on those rewards the agent has to learn to behave in the environment.
Learning Problems
There are some interesting problems that crop up in this context and they would certainly impact a test tool based on AI. One is what is known as the credit assignment problem. Basically what it means is this: which of the preceding actions was responsible for getting the reward and to what extent? The other is known as the explore-exploit dilemma, which breaks down to this: should you exploit the known working strategy or explore other, possibly better strategies?
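The explore-exploit dilemma, at least, is simple to sketch. Here is a hypothetical epsilon-greedy rule for the two Flappy moves; the q_values structure, holding estimates of how good each action is in a given state, is an assumption for illustration:

```python
import random

def choose_action(state, q_values, epsilon=0.1):
    """Epsilon-greedy: mostly exploit the best known action, sometimes explore.

    q_values is assumed to map (state, action) pairs to estimated values.
    """
    actions = ["flap", "do_nothing"]
    if random.random() < epsilon:
        return random.choice(actions)  # explore: try something else
    # exploit: pick the action currently believed to be best in this state
    return max(actions, key=lambda action: q_values.get((state, action), 0.0))
```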
Okay, so suppose you are an agent, situated in an environment (e.g. Flappy game). The environment is in a certain state (e.g., location of the bird, location of an upper pipe, location of a lower pipe, location of a gap between pipes, etc). The agent can perform certain actions in the environment (flap or not). These actions sometimes result in a reward (e.g. increase in score). Actions transform the environment and lead to a new state, where the agent can perform another action, and so on. The rules for how you choose those actions are called a policy.
The set of states and actions, together with rules for transitioning from one state to another, make up what’s called a Markov decision process. This is how you formalize a reinforcement learning problem, so that you can reason about it. And this is a really important point! If you are going to come up with an AI testing tool — and thus have AI support your testing — you have to understand the problem and how that problem is formalized.
In the case of Flappy Tester, one episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards. The episode ends with a terminal state (e.g. the bird dying). A Markov decision process relies on the Markov assumption: the probability of the next state depends only on the current state and action, not on preceding states or actions.
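Here is a minimal sketch of that formalization. The env and policy objects are stand-ins with an assumed interface, not any particular library:

```python
def run_episode(env, policy):
    """One episode: a finite sequence of (state, action, reward) transitions.

    env.reset() returns an initial state; env.step(action) returns
    (next_state, reward, done); policy(state) returns an action. The loop
    embodies the Markov framing: the next state depends only on the
    current state and the action taken.
    """
    trajectory = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state  # a terminal state (the bird dying) ends the episode
    return trajectory
```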
Yet, how true is that in many applications? Think about that. Does a Markov property hold true for applications that you test?
I’ll revisit these ideas shortly but for now let’s talk about another application.
A Web-Based UI Example
Let's take a look at my Veilus application. This is nothing more than a sample web site I put up to test automation tools against. I should note that the page can take a little bit to come up since my app has to spin up on a Heroku cluster.
Once the app is up, go to my Stardate Calculator page. Of course, to do that, you have to login. And to do that, you need to know the credentials. (The user name and password are both admin.) Once you have the credentials, you have to actually find where to login. Go ahead and do that.
Now let’s think about what we did there and how we would automate that with an AI.
First, an AI has to recognize application state. That’s what Flappy Tester had to do, for example. For training, you can give your AI many screens. Then label them: Login Screen, Stardate Calculator, etc. This would be teaching the machine to recognize what state the application is in. But don’t we sort of do that now with test automation? We check for a given object on the screen, for example. It’s somewhat akin to “labeling” the screens for a machine learning algorithm.
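Here is a small, hypothetical sketch of what that "labeling" amounts to in today's automation: a signature of expected elements per state. The element IDs are placeholders, not the real locators from Veilus:

```python
# A sketch of "labeling" screens the way current automation already does:
# each state is identified by a signature of elements we expect to find.
# The element IDs are hypothetical placeholders.
STATE_SIGNATURES = {
    "Login Screen": {"username", "password", "login-button"},
    "Stardate Calculator": {"stardate-value", "convert-button"},
}

def recognize_state(visible_element_ids):
    """Return the first known state whose signature is fully present."""
    for state, signature in STATE_SIGNATURES.items():
        if signature <= set(visible_element_ids):
            return state
    return "Unknown"

# recognize_state({"username", "password", "login-button", "logo"})
# -> "Login Screen"
```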
Then there's knowing what data to use in a given state. Here that was login. But clearly I'm not going to have an AI tool sit there and try login credentials until it gets them right, am I? Of course not. I'm going to do what we do now with automation: provide that as part of test data.
Stardate
So let's go to the Stardate example page.
Now calculate a TNG (The Next Generation) stardate. To do that you have to know to enable the form. Once that’s done, click “TNG era”. Decide whether to use the Leap Year calculation. In this case, let’s not. A default stardate is provided but we don’t want to use that. Put in this number instead: 56844.9. Then click Convert. And you’ll get some calendar date that pops up.
Again we have to know what types of inputs can be applied to a particular state. In automation, this would usually come down to an application element and a value. The element would be where to apply the value and perhaps how. The value may be a click. But it also may be some text.
There's also recognizing that this simple action was completed. This is a form of output. It's not an outcome, but an output. I entered text and, in fact, the text field is no longer empty. I clicked a checkbox and a form became available to me. An AI could eventually learn those actions. But, of course, I can just tell my automation what to do. Yet it's not just about telling our tools what to do. We need a context in which to apply these inputs. This is like a workflow. And then within that workflow we need to know outcomes.
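As a point of comparison, here is the Stardate workflow expressed the way current automation might capture it: an ordered context of element-plus-input steps with an expected outcome at the end. The locators and the outcome check are hypothetical:

```python
# The Stardate workflow as current automation might express it: a context
# (ordered steps), each step pairing an element with an input, plus an
# expected outcome at the end. Locators and the check are hypothetical.
workflow = [
    {"element": "enable-form-checkbox", "input": "click"},
    {"element": "tng-era-button",       "input": "click"},
    {"element": "stardate-field",       "input": "56844.9"},
    {"element": "convert-button",       "input": "click"},
]

expected_outcome = {"element": "calendar-date", "check": "is not empty"}
```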
Now, we can try to train a neural network to interact with the Veilus application just as we did with Flappy Bird. This would mean the network needs to learn how to recognize screens. Those screens may appear slightly differently at different times, so it would not be enough to train on image recognition alone; it would have to rely on element recognition and perhaps even some heuristics. Then the network has to figure out actions to take based on what elements are available. Of course, some of those actions will be useless. Some will be out of order. So it has to figure all that out.
The goal is to have the AI start interacting with the application as a human would. But what happens when actions on one screen impact actions that are possible or useful on another? What happens when it’s a series of combined inputs that lead to certain outputs that ultimately deliver an outcome?
Overlord
Let's try another example. Go to my Overlord.
Here you have a way to provision a mad scientist bomb. You can provision with the default codes and countdown or set your own. Try it out. Play around with it. You might find something interesting from a learner perspective there. See if you can. I'll revisit it at the end of the post. But, for now, think about the difference between this application and the Stardate one. Stardate was entirely contained on one page. Overlord stretches across two. Now think about that Amazon example that I gave you in the previous post.
You can start to see, hopefully, the challenges involved in getting an AI agent to deal with this. That still leaves open the question of whether we should bother.
Environments and Uncertainty
All ideas of reinforcement learning involve interaction between an active decision-making agent and its environment, within which the agent seeks to achieve a goal despite uncertainty about that environment.
Uncertainty? What uncertainty? With my application above, once you knew what actions to take there was no uncertainty, right? Sure, that’s (mostly) true. But what if there are bugs? What if there is a 404 page? What if the output is delayed? Or there is a race condition? What if the server crashes or there is some internal error? What if there is a JavaScript bug that only shows up in the console?
Critical to all of this, from an AI perspective, is that the agent's actions — just like those of a human — are permitted to affect the future state of the environment, thereby affecting the options and opportunities available to the agent at later times. Correct choice requires taking into account indirect, delayed consequences of actions, and thus may require foresight or planning. Think of how Overlord worked.
At the same time, in all these examples the effects of actions cannot be fully predicted; thus the agent must monitor its environment frequently and react appropriately.
Goals
All these examples involve goals that are explicit in the sense that the agent can judge progress toward its goal based on what it can sense directly. For example, in Stardate, the agent expects some calendar date to be generated. In Overlord, the agent expects a bomb to be provisioned with the settings that were chosen or the defaults if no specific changes were made.
In all of these examples the agent can use its experience to improve its performance over time. The knowledge the agent brings to the task at the start — either from previous experience with related tasks or built into it by design — influences what is useful or easy to learn, but interaction with the environment is essential for adjusting behavior to exploit specific features of the task.
That's why supervised and unsupervised learning could potentially have very little to offer us in a testing tool based on AI. But, as you've seen in some of my other posts, reinforcement learning can be a little more problematic.
Rewards
The central idea behind Reinforcement Learning is that an agent will learn from the environment by interacting with it and receiving rewards for performing actions. Reinforcement Learning is just a computational approach to learning from action.
Going back to Flappy Tester, our agent (flappy bird) receives state S0 from the environment (the game screen). Based on that state S0, the agent takes an action A0 (agent will flap). The environment transitions to a new state S1 (a new game screen). The environment gives some reward R1 to the agent (not dead: +1).
This reinforcement loop outputs a sequence of state, action and reward. The goal of the agent is to maximize the expected cumulative reward. But now ask yourself a simple question … and this is one many people touting AI don’t ask.
Intention and Agency
Why is the goal of the agent to maximize the expected cumulative reward?
Well, some would say the answer to that is that reinforcement learning is based on the idea of the reward hypothesis: all goals can be described by the maximization of the expected cumulative reward. But now ask a further question:
Why would the AI agent try to do this?
We know why a human would. But how do we impart that goal-seeking to the agent? I won't tackle that here in this post but you should be thinking about it. This gets into the heart of intention: the desire to achieve a goal, to find something out. That's how we, as humans, learn. That's how we, as humans, make value judgments about what we are doing or getting. That's how we, as humans, come to some idea of quality.
I want to call attention to one part that is easy to overlook: the cumulative part.
Cumulative rewards might make you think you can just add up the rewards. In very simple systems, you probably could. But rewards that come sooner are more likely to actually happen, since they are more predictable than long-term future rewards.
This means there is an idea of a discount applied to rewards. This discount is often referred to as gamma. The larger the gamma, the smaller the discount. This means the learning agent cares more about the long term reward. The smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward.
So what you really have is discounted cumulative expected rewards.
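Here is a minimal sketch of that computation, just to make the gamma idea concrete:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative reward: later rewards count for less.

    A gamma close to 1 (smaller discount) weights long-term reward heavily;
    a gamma close to 0 (bigger discount) cares mostly about the next reward.
    """
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

# discounted_return([1, 1, 1], gamma=0.9) -> 1 + 0.9 + 0.81 = 2.71
```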
You might think the larger gamma would work well for the Stardate example and the smaller gamma for the Overlord example. The agent doesn't know any of that to start off with, of course. In reality, though, we want the reverse. The smaller gamma (bigger discount) should be for Stardate, where the outcome shows up right away. The larger gamma (smaller discount) should be for Overlord, where the outcome plays out across more than one screen.
The idea of having intentions, setting goals based on those intentions, making decisions to achieve those goals, and looking for reward signals in the environment is critical to how humans learn and decide whether they are getting value or not from their experiences.
Aside: The Overlord Testing
I promised above to return to the Overlord example, just to see how your testing of it went.
One thing to note that can be interesting from a learning perspective is that once the bomb is detonated, that's it. At least for that session. If you go back and try to provision a bomb after one has detonated, you will always see the detonation animation playing. The same applies to setting the custom codes. Once you've set those, that's it. All your settings, including the bomb's state (such as whether it blew up and its countdown timer), are locked in for the remainder of a session.
As a human you might have figured that out. Now ask how you would have your AI figure it out. Certainly you could just say, "Well, I would tell it that if it sees the same thing, X, after doing some action Y, then that's a problem." But therein lies the problem: you had to tell the agent that. You effectively would be putting in some conditionals. And we can do all of that perfectly well with our current non-AI automation tools.
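To put a point on it, "telling the agent" looks something like this hypothetical check, which is plain old conditional logic rather than anything learned:

```python
# "Telling the agent" amounts to a hand-written conditional, the kind of
# check our current non-AI automation tools already express perfectly well.
def check_for_stuck_state(screen_before, screen_after, action):
    """Flag a potential problem if an action leaves the screen unchanged."""
    if screen_before == screen_after:
        return f"Possible issue: '{action}' had no visible effect"
    return None
```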
And, of course, you also get into the question of how the AI-based testing tool would not only learn this situation but learn enough to know that it should report on it. Keep in mind, it acts like a human. That's part of the point of it. Yet some humans will blissfully ignore bugs. In fact, you don't even know if the session thing is a bug or not. Maybe that was intentional on my part.
Some human testers, however, wouldn't even think to ask. Or might not notice it. Or might not care if they did notice it. And this was for a relatively "in your face" kind of example. Now imagine examples that are not as clear-cut. And where there are many such examples. And they stretch over the course of workflows that are used as part of a UI, such that there might be a time delay between the action that causes the problem and the observable that shows there is a problem.
Does an AI-Based Test Tool Make Sense?
This seems like a good point to close off this post. Regardless of whether we can import these concepts into an AI-style tool, the question really is this: should we bother?
Consider how you would automate the pages I showed above with non-AI-based tools. Now consider how you would do so with AI-based tools. Does the effort seem worth it? This is not a leading question. It may very well be worth it. But you should be asking the question rather than just assuming that the answer is foreordained.
Consider the difference in the UI example I showed here and that of Flappy Bird from the previous post. Would any sort of automation with an AI be able to tell you that the UI is too cluttered with info? Or too confusing? Or has misleading directions? Would any sort of AI be able to tell you that Flappy Bird, as it is constructed, is too hard? Too unfair? Too easy? Too forgiving?
Yet, while you're thinking about that, keep in mind that, as I stated earlier, I only really talked about the execution part of testing here. There are still the analysis and design components of testing to consider and I'll tackle those in a future post.
Right now my hope is that this post got you thinking about some of the challenges but also about whether the alleged benefits are there for an AI-based testing tool. And that alone should arm you with some pragmatic skepticism when you hear tool vendors touting how they use AI in their tools. Also, with this post and the previous, you perhaps now have a bit of the vocabulary, in context, not only to challenge certain claims but also to engage with your fellow testers about this technology.