The Tester Role in Machine Learning, Part 2

This post continues on directly from the first one in the series. We’ll take the CartPole example we started with and continue our journey into how testing — particularly that done by a specialist tester — intersects with the domains of data science and machine learning.

First, let’s make sure you have an example that we can play around with. Here’s the basis of what we started with in the previous post:

We ended the last post talking about tests and the idea of a test scenario in this particular context. So let’s pick it up from there.

The (Very General) Test Scenario

In the CartPole environment, the agent gets a reward as long as the pole is still somewhat upright and the cart is still within the area of the screen that we can see. That’s an imprecise way of saying what I said more precisely in the first post. Recognizing that shift from precision to imprecision is important in these contexts.

Also in the CartPole environment, an episode is over as soon as the pole falls beyond a certain angle or the cart strays too far off to the left or right. Again, an imprecise way to say what was previously stated more precisely.
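Those imprecise descriptions do have precise counterparts. As a sketch, here’s the termination condition in code; the threshold values are an assumption on my part, drawn from the standard CartPole-v0 definition (a 12-degree pole angle and a cart displacement of 2.4 units):

```python
import math

# Termination thresholds (assumption: the values from the
# standard CartPole-v0 definition).
X_THRESHOLD = 2.4                      # cart distance from center
THETA_THRESHOLD = 12 * math.pi / 180   # pole angle from vertical, in radians

def episode_over(x, theta):
    """The imprecise 'pole falls beyond a certain angle or the
    cart strays too far' made precise."""
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD
```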

STOP! As a tester, I’m sure you can see the possibility for issues when ambiguity occurs. That’s a skill you’ve likely already practiced in any environment where there is information about how a system works. We often have to shift between these polarities of description.

Tester Hat On!

As a tester, assuming you went through the first post and the slight recap above, take a moment here before you read on. Make sure you verbalize what success looks like. In fact, try writing down what success looks like.

How we think about something, how we verbalize it, and how we write it down are often very different things. As you’re thinking about this, also take a moment to consider: what is a test scenario here? When will that test scenario be said to have passed?

Go ahead. Take a moment; I’m not going anywhere.

Hopefully you came up with something like: “Success looks like the agent learning how to keep the pole upright for over 195 time steps, and it has to do this 100 times consecutively.”

That, basically, is our test scenario, right? Keep the pole upright and within the bounds and do so until the reward is at least 195. And then do that for 100 consecutive episodes. Is that the test scenario? Is that it? Just one scenario? But what about the domain-to-range issues I talked about in the previous post? What about the risks and the many tests that I also talked about?
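For reference, the official Gym criterion is phrased over rewards: CartPole-v0 is considered solved when the average total reward over 100 consecutive episodes is at least 195. A checker for that might look like this (a sketch; the function name is mine):

```python
def cartpole_solved(episode_rewards):
    """True if any window of 100 consecutive episodes has an
    average total reward of at least 195."""
    if len(episode_rewards) < 100:
        return False
    return any(
        sum(episode_rewards[i:i + 100]) / 100.0 >= 195
        for i in range(len(episode_rewards) - 99)
    )
```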

If you’re struggling to reconcile a few of my statements up to this point, perhaps the cognitive friction you’re feeling is about what exactly a “test” is in this context.

And if that’s the case, then, as a tester, are you wondering what exactly you will do in this situation? It sounds like you need a developer to do the implementation work. But then the algorithm will do the rest of the work, which a developer can verify. What do you do? Think about how you want to start checking this system in the first place. What are you going to be looking for? What do you need the system to provide for you? And how much of this is automated?

In fact, can any of it even be “manual” testing?

Well, yes! With these posts, you’ve been doing “testing as a design activity” up to this point. Learning about how to reason about and describe the system is part of designing tests for the system. We’re going to get into “testing as an execution activity” next. And, yes, that is largely going to be of the automated sort of execution.

Run CartPole Episodes

So let’s try something really simple. Let’s create a very basic script that will run the CartPole environment in an episode consisting of 500 timesteps. Modify that starting script so that it looks like this:
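The sketch below is one way that script plausibly looks. To keep it self-contained and runnable even without Gym installed, a toy stand-in class takes the place of gym.make('CartPole-v0'); the class name and its physics constants are assumptions on my part, mirroring the classic cart-pole equations of motion, and rendering is omitted. With Gym installed, you’d create the environment with gym.make('CartPole-v0') and call env.render() inside the loop.

```python
import math
import random

class ToyCartPole:
    """Self-contained stand-in for gym.make('CartPole-v0'). The
    dynamics constants are assumptions mirroring the classic
    cart-pole equations; rendering is omitted."""

    class _ActionSpace:
        def sample(self):
            return random.choice([0, 1])  # 0 = push left, 1 = push right

    def __init__(self):
        self.action_space = self._ActionSpace()
        self.state = None

    def reset(self):
        # Start near upright, with small random perturbations.
        self.state = [random.uniform(-0.05, 0.05) for _ in range(4)]
        return list(self.state)

    def step(self, action):
        x, x_dot, theta, theta_dot = self.state
        force = 10.0 if action == 1 else -10.0
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # Equations of motion: cart mass 1.0, pole mass 0.1, half-length 0.5.
        temp = (force + 0.05 * theta_dot ** 2 * sin_t) / 1.1
        theta_acc = (9.8 * sin_t - cos_t * temp) / (
            0.5 * (4.0 / 3.0 - 0.1 * cos_t ** 2 / 1.1))
        x_acc = temp - 0.05 * theta_acc * cos_t / 1.1
        tau = 0.02  # seconds of simulated time per step
        self.state = [x + tau * x_dot, x_dot + tau * x_acc,
                      theta + tau * theta_dot, theta_dot + tau * theta_acc]
        done = abs(self.state[0]) > 2.4 or abs(self.state[2]) > 12 * math.pi / 180
        return list(self.state), 1.0, done, {}

env = ToyCartPole()
observation = env.reset()
for timestep in range(500):
    # With Gym installed, env.render() would go here.
    action = env.action_space.sample()  # choose a random action
    observation, reward, done, info = env.step(action)
```

Note that the loop keeps right on stepping even after done comes back True; that matters in a moment.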

Hopefully it’s fairly clear that what we’re doing here is simply setting up the environment, having the agent operate in the environment for 500 “turns” or “steps” and, at each such step, having the agent choose a random action from its possible action space.

Calling the step method on the environment is essentially just asking the agent to take one of its possible actions from the action_space. Here we’ve chosen to “sample” the action space to get a random action, of which, remember, there are only two: move left or move right.

The render method simply draws the environment on the screen so you can see it. For some environments, rendering may not be something you want to do in all cases.

Understand the Feedback

Testing is about getting feedback from systems under test. So let’s break down what’s happening, which will help you better understand the context. Calling that reset method on the environment, as we do outside the loop, provides something like this:

array([-0.04703739, -0.01945379,  0.02304621, -0.00838042])

This is returning an initial observation, which corresponds to a start state. This is what the agent will observe as it starts out. Each call to step within the loop will return something like this:

(array([-0.02820683, -0.16418285, -0.04160898,  0.29520379]), 1.0, False, {})

You’re getting four values back.

  • The first value is an observation. This is an environment-specific object representing the agent’s observation of the environment, given here as a NumPy array. Here you can see it’s just an observation of the four aspects of the CartPole environment that we talked about before: x, x_dot, theta, and theta_dot.
  • The second value is the reward. This is the amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward. As mentioned before, CartPole gives a reward of 1.0 for each action taken.
  • The third value is the value for “done” and this will be either true or false. When “done” is false, the environment is still operating. When done is true, it’s time to reset the environment again. “Done” means the environment reached some sort of termination condition. Note that “done” is not indicating success or failure.
  • The fourth value that you get returned is called “info” and it basically provides diagnostic information, if any is available. This would be a perfect place to coordinate with developers to return information that might be helpful in understanding what’s going on during execution. This will save you from doing a lot of after-the-fact analysis, should that be necessary.
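Those four aspects map onto the observation positionally. Unpacking the step return shown above makes the feedback easier to read (the names follow the first post; the ordering is the standard CartPole observation layout):

```python
# The four-value return from step, as shown above.
observation = [-0.02820683, -0.16418285, -0.04160898, 0.29520379]
reward, done, info = 1.0, False, {}

# Positional meaning of the observation:
x, x_dot, theta, theta_dot = observation
# x         : cart position
# x_dot     : cart velocity
# theta     : pole angle (radians)
# theta_dot : pole angular velocity
```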

Now, if you ran the above, you saw a message in your console saying something like this:

You are calling 'step()' even though this environment has already returned done = True.
You should always call 'reset()' once you receive 'done = True' -- any further steps
are undefined behavior.

The environment is “done” because you either achieved a goal (solved the environment) or you reached a situation where you failed to achieve the goal. For example, if the pole tips too far, the environment terminates. That’s what that message is telling you: done was set to true, so the episode finished, but you just kept on stepping. We’ll deal with that in a bit.

Understand the Observation

Those observations are important. Keep in mind that the environment starts in some state, with some set of values for those four numbers that the agent observes. At each step, another state can be observed by the agent, and that state is made up solely of another set of values for those same four numbers.

Visually, as the above script runs, you’ll likely see the cart jitter back and forth while the pole quickly tips over.

It might take a second for the Gym environment to clear away when this is done, so don’t panic if you don’t see the application close right away. What you want to see, eventually, is the cart shifting back and forth underneath the pole so as to keep it balanced upright.

Clearly our agent isn’t there yet but that shouldn’t come as a huge shock.

STOP! That doesn’t come as a shock, right? Remember your execution context here. What was it?

Answer: we’re having the agent choose random actions. In this simple example, that’s probably easy to remember but as these environments get more complex and more things are going on, understanding why you are seeing the output — visual or otherwise — that you are seeing becomes fairly important. This is the equivalent of observing what you are testing and analyzing what you are seeing as a series of continuous results.

Gathering Data

So now let’s refine our example a bit:
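As a sketch of that refinement, assuming the change is to print each observation and to stop stepping once done comes back True: the stub environment below stands in for gym.make('CartPole-v0') so the sketch runs on its own, which means the printed numbers are illustrative rather than real CartPole observations.

```python
import random

class StubEnv:
    """Trivial stand-in for gym.make('CartPole-v0'): episodes end
    after a random number of steps (an assumption for illustration)."""
    def reset(self):
        self._remaining = random.randint(10, 20)
        return [random.uniform(-0.05, 0.05) for _ in range(4)]

    def step(self, action):
        self._remaining -= 1
        observation = [random.uniform(-2.4, 2.4) for _ in range(4)]
        return observation, 1.0, self._remaining <= 0, {}

env = StubEnv()
observation = env.reset()
completed_at = None
for timestep in range(500):
    action = random.choice([0, 1])  # random action, as before
    observation, reward, done, info = env.step(action)
    print(observation)
    if done:
        completed_at = timestep + 1
        print("Completed after {} timesteps.".format(completed_at))
        break
```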

Run that and you’ll get something like this as output:

[2017-12-01 02:46:29,606] Making new env: CartPole-v0
[-0.00212347  0.02898076 -0.04581049  0.04440929]
[-0.00154386 -0.16545538 -0.04492231  0.32229394]
[-0.00485297  0.03027653 -0.03847643  0.01578956]
[-0.00424743  0.22592855 -0.03816064 -0.28878054]
[ 0.00027114  0.03137096 -0.04393625 -0.00837313]
[ 0.00089856  0.22709458 -0.04410371 -0.31458834]
[ 0.00544045  0.42281613 -0.05039548 -0.62084747]
[ 0.01389677  0.61860431 -0.06281243 -0.92896688]
[ 0.02626886  0.81451538 -0.08139176 -1.24070868]
[ 0.04255916  1.01058253 -0.10620594 -1.55773856]
[ 0.06277081  1.20680391 -0.13736071 -1.88157807]
[ 0.08690689  1.40312883 -0.17499227 -2.21354881]
Completed after 12 timesteps.

Running another time, you might get something like this:

[2017-12-01 04:13:14,822] Making new env: CartPole-v0
[ 0.04968516  0.03100998  0.04498377  0.04095471]
[ 0.05030536 -0.1647272   0.04580286  0.34748424]
[ 0.04701081  0.02971436  0.05275255  0.06958912]
[ 0.0476051   0.22404188  0.05414433 -0.20599461]
[ 0.05208594  0.02818915  0.05002444  0.10326462]
[ 0.05264972  0.22255983  0.05208973 -0.17322563]
[ 0.05710092  0.41689907  0.04862522 -0.44903163]
[ 0.0654389   0.61130072  0.03964458 -0.72599926]
[ 0.07766491  0.80585272  0.0251246  -1.00594551]
[ 0.09378197  1.00063031  0.00500569 -1.29063382]
[ 0.11379457  1.19568825 -0.02080699 -1.5817454 ]
[ 0.13770834  1.39105154 -0.0524419  -1.88084373]
[ 0.16552937  1.1965385  -0.09005877 -1.60488613]
[ 0.18946014  1.00258995 -0.12215649 -1.34158378]
[ 0.20951194  1.19901905 -0.14898817 -1.66985655]
[ 0.23349232  1.00590985 -0.1823853  -1.42704236]
Completed after 16 timesteps.

What those two episodes are doing is consistently showing us the data that we need to be reasoning about. Or, rather, that our agent needs to be reasoning about. This data output matches the data science input we started with. In the first case, it took only twelve timesteps before the episode ended. In the second case, it took sixteen timesteps.

But — did it succeed or fail?

Well, as a tester you had best consider how you’re going to recognize that. Gathering data is one thing. Being able to make decisions about that data is another thing entirely. Also, as a tester, you should probably be thinking of a further refinement to the test execution.

Stop a moment. Can you think what that might be? I’m running 500 timesteps in the episode. What should I refine here about the test execution?

Many Scenarios

We’re only running one episode here. It’s hard for an agent to learn from a single episode because it has no chance to refine its learning based on experience. So a slight further refinement would be to have many episodes. When you have a learning algorithm in place, that algorithm will need multiple episodes to be put through its paces. Here’s a simple way to add an episode loop:
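As a sketch, with the episode loop pulled into a function so that any environment exposing Gym’s reset/step interface can be dropped in (the stub class is a self-contained stand-in for gym.make('CartPole-v0'), with arbitrary episode lengths; the function name is mine):

```python
import random

class StubEnv:
    """Stand-in with Gym's episode interface; each episode ends
    after a random number of steps (illustrative only)."""
    def reset(self):
        self._remaining = random.randint(10, 20)
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self._remaining -= 1
        return [0.0, 0.0, 0.0, 0.0], 1.0, self._remaining <= 0, {}

def run_episodes(env, episodes=20, max_timesteps=500):
    """Run many episodes, resetting the environment before each
    one, and return the length of every episode."""
    lengths = []
    for episode in range(episodes):
        env.reset()
        for timestep in range(max_timesteps):
            action = random.choice([0, 1])  # still a random agent
            observation, reward, done, info = env.step(action)
            if done:
                lengths.append(timestep + 1)
                break
    return lengths

episode_lengths = run_episodes(StubEnv())
```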

Go ahead and run that. But do note that even with all the visuals and data you are getting, determining if you succeeded (test passed) or not (test failed) is somewhat difficult.

At bare minimum, as a technical tester in this kind of environment, being able to construct something like the above is necessary. This is essentially a test harness to start getting information from an environment. We are fully into testing as an execution activity. And, yes, some would call this “checking” to distinguish it from “testing.” But no scientist or experimentalist would. And testers are, ultimately, a form of scientist and experimentalist.

I already covered my thoughts on the “checking vs. testing” debate. We don’t have to revisit all that here, but bottom line: if you want to be taken seriously as a tester in data science, machine learning, or artificial intelligence environments (or, ideally, anywhere), don’t bring up this distinction.

Your Tester Role

So again, I’ll go back to what I’ve been asking a few times now: as a tester, what role are you playing in this, if any? What value are you actually contributing here?

I’ve given you probably one of the simplest types of machine learning environments — only four variables to keep track of and two simple actions — and it even has a nice visualization to boot. What you will actually be dealing with is often massively more complex and sometimes without any sort of ready visualization.

You need working implementations but you also need some intuition behind how those implementations work. You also need to understand some of the data science behind why they work. Data is rapidly becoming key, with the algorithms being secondary. There’s a reason machine learning is sometimes referred to as computational statistics.

There’s much to what a tester should be doing and thinking about in these contexts. I covered a bit of the distinction here between testing being used as a way to put pressure on design as well as testing being used as an execution mechanism to verify the implementation of the design. I haven’t at all covered testing as a framing activity, which I’ll get to in the next post.

Barely Scratching the Surface

This isn’t even close to done. The first part of any machine learning problem is gathering the data. Well, we did that. Our environment provides a very straightforward way of gathering data in that we basically just ran through the simulation many times and took random steps every time. So we accumulated the actions and corresponding observations.

The second part of any machine learning problem is, once you have the data, you need to define a model. We started down this path, considering, as we did, what the expected inputs and the desired outputs are. But we didn’t really model anything about the agent. We just took some random actions.

But even then we come to the third part of a machine learning problem: the prediction. We need our agent to predict what is the better action to take given a certain state that the environment is in.

So we have our data, and we have our model (sort of). Now we have to train our model. This means having the agent iterate through many episodes to see how well the model performs. But there are certainly different ways we could train the model. There are many different types of algorithms to which the data we have could be applied. We haven’t even really touched any of this.

So if this post and the previous one seemed a little long and a little complicated, consider this: we didn’t even get to the complicated part yet. If you are questioning what your role is just based on what you’ve seen so far, it’s going to be harder for you to conceptualize your role as things get even more involved.

In the third post in this series, which will be the most in-depth of this series, I’ll break down this example a bit more and we’ll get into how you recognize if what you are testing is, in fact, providing the value that it is supposed to.


This article was written by Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.
