The Tester Role in Machine Learning, Part 4

This is the last of a four-part series (see parts 1, 2 and 3). The goal has been to investigate whether specialist testers have a role in machine learning environments that is distinct from development roles in those same environments. These posts have been getting you up to speed on what that might look like. Here we finish off that journey.

In most of these posts, I’ve spent a lot of time going over details and taking you somewhat step-by-step through script creation because those scripts were acting as test harnesses. Along the way, I asked you to consider how much of that work you were actually doing, as a tester, as opposed to just having it handed to you.

For this final part, I’m actually going to give you the full scripts. How you got them is less important than what you do with them. Or, rather, how you consume the results from them. We already did testing as a design activity in a few of these posts. We did some testing as an execution activity as well, which we’ll continue a bit here. We also did a little — very little — of testing as a framing activity.

So you can run the provided scripts. You can see the results. And then we’ll regroup and think about what that part of the testing process means for you, as a specialist tester, and why this is testing as a framing activity.

Some Dependencies

For these scripts you will need another dependency beyond those we used already, which is matplotlib. You can get this library with the following:

pip install matplotlib

This will be used so that we can draw some graphs. So far we've been relying on text output but, as we've seen, getting a whole bunch of data thrown back at us doesn't really help us reason about the system all that much. In fact, as we saw, we couldn't even necessarily tell if a bit of data indicated success or failure.

Finally, for the last of these examples, you will need TensorFlow installed. TensorFlow is a library that allows for symbolic math. Its name comes from the fact that it can perform computations with multi-dimensional data arrays, which are commonly called tensors. Tensors take you into linear algebra and vector calculus, subjects I will tackle not at all here. To install it:

pip install tensorflow

Note that this will work just fine on a Mac or Linux. It should work okay on Windows — if you are using a 64-bit version of Python, as recommended in the first post. On a 32-bit version of Python, that might not work. If on Windows, your best bet is usually to use Python as part of a distribution like Anaconda or the Enthought distribution. One thing you can do is try to download one of the TensorFlow wheels. For example, download “tensorflow-1.4.0-cp36-cp36m-win_amd64.whl”, although the version number of TensorFlow might differ. Once you have that wheel downloaded, try this:

pip install tensorflow-1.4.0-cp36-cp36m-win_amd64.whl

If you are told that it’s “not a supported wheel on this platform”, you can try this next bit. Rename the wheel as follows:


Note the “-py3-none-any” which essentially replaces all other bits on that filename. Then try the pip installation again against this newly named file. That worked for me on Windows 8 and Windows 10. Keep in mind, even if all this works, this relies on certain DLL files in Windows and during execution you might find something reported like this: “DLL load failed: %1 is not a valid Win32 application.” If that happens, it probably means you are using a 32-bit Python while the binaries being called upon are 64-bit.

If all that seems a bit much, just try one of those distributions I mentioned or make sure you have a 64-bit Python installed.

Test Scenarios

We talked a few times about our test scenarios and what a “scenario” means in the context of this example. The notion of a scenario can take on varying levels of granularity in these kinds of environments. Here we’ll treat each algorithm — the strategy the agent uses to find a policy — as a test scenario. It is the outputs of those scenarios that we will be reasoning about.

The scenarios below are what I briefly covered in the last post from a conceptual point of view. Here we put them into action. In all cases, I’ll present the scripts with a few explanatory notes. Technically you don’t have to run them if you don’t want to. I say that because I am going to come back and present their output to you. That said, I would certainly recommend running them and perhaps even annotating them with print statements if you want to see how things are operating.

Test Scenario: Random Policy

Here is the script for the random policy.

Here our agent is being trained, and the train_agent method calls out to a create_policy method to get the specific algorithm that will handle this. In this case it's the random algorithm we talked about in the last post. Each episode is a single execution of our learning algorithm; that's what the run_episode method is doing. As you can see, as many as 200 actions will be taken in the environment for each episode. The agent will train like this up to 10,000 times.

Getting results is handled at the bottom. We're training the agent 100 times, and we're doing this just to get some data for the graphs. But notice what we have here. The agent training is being done 100 times. For each training, the agent runs an episode up to 10,000 times. And each episode consists of up to 200 steps.

That run_episode method can run multiple episodes with multiple random policies. That’s really important to understand. I say that because while this “random policy” approach may be one test scenario, there are actually thousands of test cases being executed! Remember that discussion we had previously about the domain-to-range ratio? Now you’re seeing a bit more of how that comes into play.

An episodes per search value is found and this is the number of episodes that it took to find each good policy. So what we’re asking here, as part of our results, is: “On average, it takes how many episodes to find a good policy that can get, in this case, a total reward of 200?”
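To make the shape of that search concrete, here is a minimal sketch of a random policy search. It is not the script from this post. The method names train_agent, create_policy and run_episode follow the description above, but their bodies are my assumption, and ToyPoleEnv is a hypothetical, crudely simplified stand-in for the real CartPole environment so that the example has no dependency beyond NumPy.

```python
import numpy as np


class ToyPoleEnv:
    """Hypothetical stand-in for CartPole (NOT the Gym environment)."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.state = None
        self.steps = 0

    def reset(self):
        # Observation: [position, velocity, angle, angular velocity]
        self.state = self.rng.uniform(-0.05, 0.05, size=4)
        self.steps = 0
        return self.state.copy()

    def step(self, action):
        x, x_dot, theta, theta_dot = self.state
        force = 1.0 if action == 1 else -1.0
        # Crude linearised dynamics with a 0.02 time step (an assumption).
        x_dot += 0.02 * force
        x += 0.02 * x_dot
        theta_dot += 0.02 * (9.8 * theta - force)
        theta += 0.02 * theta_dot
        self.state = np.array([x, x_dot, theta, theta_dot])
        self.steps += 1
        done = abs(theta) > 0.21 or abs(x) > 2.4 or self.steps >= 200
        return self.state.copy(), 1.0, done, {}


def create_policy(rng):
    # A "policy" here is just four random weights, one per observable.
    return rng.uniform(-1.0, 1.0, size=4)


def run_episode(env, params, max_steps=200):
    # One episode: act until the pole falls or max_steps is reached.
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = 0 if np.dot(params, obs) < 0 else 1  # left if < 0, else right
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def train_agent(env, rng, max_episodes=10_000, target=200):
    # Try fresh random policies until one earns the target reward.
    # Returns the number of episodes that search took.
    for episode in range(1, max_episodes + 1):
        if run_episode(env, create_policy(rng)) >= target:
            return episode
    return max_episodes


env = ToyPoleEnv(seed=0)
rng = np.random.default_rng(0)
# The post's script trains 100 times; 10 keeps this sketch quick.
results = [train_agent(env, rng, max_episodes=1_000) for _ in range(10)]
print("Average number of episodes:", sum(results) / len(results))
```

Each call to create_policy is one test case in the sense used above, which is why a single scenario fans out into thousands of executions.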

Try it out!

Test Scenario: Noisy Policy

Here’s the next script, based on a hill-climbing style algorithm. This might take a little longer to run; perhaps about two to three minutes.

The basic idea here is that the reward derived is a function of those observation parameters. When you start with a policy that’s random, you are basically at what’s called a “local minimum” for the function.

Basically, functions can have “hills” and “valleys”, meaning places where the function reaches a maximum or minimum value. There is also the concept of an interval, which is some segment of the overall values of a function. Within such an interval, a local maximum is a point where the “height” of the function is greater than (or equal to) the height anywhere else in that interval, and a local minimum is a point where the height is less than (or equal to) the height anywhere else in that interval.

So the local minimum for a random policy means that since you’ve started off with a random value it is, by definition, the minimum. So you are in the “valley” of the function and you want to go up to the “hills” — meaning go “up” the function, to higher values. Higher values, in this context, means a higher reward. That’s the “hill climbing” part. But if it’s all still random how does this work?

What the above does is add small changes to the parameters each time. These small changes are referred to as “noise.” If that small change leads to a better reward (moving up the function), the parameters are updated to incorporate those small changes. If, however, the change leads to less reward or even the same reward, then you try with different small changes.
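The accept-or-retry loop just described can be sketched in a few lines. This is my own minimal version, not the post's script. The name noise_scaler and its value of 0.1 come from the script, while hill_climb, evaluate and toy_reward are hypothetical names. In particular, toy_reward is a made-up smooth function standing in for "run an episode and return its total reward", so the numbers it produces say nothing about CartPole.

```python
import numpy as np


def toy_reward(params):
    # Hypothetical stand-in for an episode's reward: peaks at 200
    # when params equal an arbitrary goal vector.
    goal = np.array([0.5, -0.5, 0.5, -0.5])
    return 200.0 - 100.0 * float(np.sum((params - goal) ** 2))


def hill_climb(evaluate, rng, n_params=4, noise_scaler=0.1,
               max_episodes=10_000, target=200.0):
    # Start from a random policy, as in the random scenario.
    params = rng.uniform(-1.0, 1.0, size=n_params)
    best_reward = evaluate(params)
    episodes = 1
    while best_reward < target and episodes < max_episodes:
        # Add a small random change ("noise") to the current parameters.
        noise = noise_scaler * rng.uniform(-1.0, 1.0, size=n_params)
        candidate = params + noise
        reward = evaluate(candidate)
        episodes += 1
        # Keep the change only if it strictly improves the reward;
        # on equal or lower reward, try a different small change.
        if reward > best_reward:
            params, best_reward = candidate, reward
    return params, best_reward, episodes


rng = np.random.default_rng(0)
params, reward, episodes = hill_climb(toy_reward, rng, target=199.0)
print("episodes used:", episodes, "final reward:", round(reward, 2))
```

Because a change is kept only when it strictly improves the reward, a search that starts on a flat stretch of the function can spend many episodes going nowhere, which is worth keeping in mind when reading the results for this scenario.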

Test Scenario: Gradient Policy

Finally, here is the script for the gradient policy. The complexity in this approach compared to the others should be fairly obvious just by the nature of the script. I should note that this script will take longer to execute than the others. Depending on your computer, whether you are using CPU or GPU, and the version of TensorFlow, this can easily take anywhere from 10 to 25 minutes to run.

Whoa, huh? That’s a lot of stuff. In order to implement a policy gradient, we need a policy that can change in small increments. In practice, this means switching from an absolute limit (move left if the total is < 0, otherwise move right) to probabilities. Before, the policy would provide a number: if that number was below 0, the agent would move left, and if it was 0 or more, the agent would move right. The difference with a gradient approach is that the policy provides not a number but a probability. If a randomly drawn number is below that probability, the agent moves one way; if it is above, the agent moves the other way.

This is a vast oversimplification, and I don't plan to explain all of it in full, but it should give you some idea of what's going on. We have a policy and we take a series of actions. Specifically, we generate a set of actions to play out the CartPole problem and record the information from those actions (produce_observations), and then find a way to estimate the values of the particular states that the CartPole world is in (produce_action_values). This provides what’s called an approximate action-value, sometimes known as an advantage function.
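The switch from a hard threshold to a probability can be sketched without TensorFlow. The following uses plain NumPy and a softmax over two actions; that choice, along with the names action_probabilities and choose_action and the (4, 2) shape of the weights, is my assumption for illustration and not necessarily how the post's script builds its policy.

```python
import numpy as np


def action_probabilities(params, obs):
    # params has shape (4, 2): one column of weights per action.
    logits = obs @ params
    # Softmax, shifted by the max for numerical stability.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def choose_action(params, obs, rng):
    # Draw a random number; below P(left) means left (0), else right (1).
    p_left = action_probabilities(params, obs)[0]
    return 0 if rng.uniform() < p_left else 1


rng = np.random.default_rng(0)
params = rng.normal(size=(4, 2))
obs = np.array([0.01, -0.02, 0.03, 0.04])
print(action_probabilities(params, obs), choose_action(params, obs, rng))
```

The point of the probabilities is that a small change in the weights produces a small change in behavior, which is what lets the policy be nudged in increments rather than flipped outright.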

The “advantage” in this context refers to a value that we rely on to increase the probability of actions that are returning higher rewards than our value function says they probably should. In other words, those actions gave us an advantage.

There are two specific components: a value function (value_function) and a policy function (policy_function). We also have means to update both of those (update_value_function, update_policy_function). The basic idea is that we want our agent to learn a value function that gets to the point where it is able to determine which states of the CartPole world are “better” and that, in turn, leads to a policy that is better at maximizing rewards.

Key to all of this is that the produce_action_values method calculates the difference between rewards the agent experienced when producing the observations (via the produce_observations method) and the rewards that the value function said the agent should have gathered. If the difference is a positive number, the agent updates its policy to perform actions that lead to those states more. If the difference is negative, the agent updates its policy to perform actions that lead to those states less.
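That difference is simple to express on its own. Below is a minimal sketch, assuming the experienced rewards are summed as discounted returns; the discount factor gamma and the function names are my assumptions, and the predicted values are passed in as plain numbers rather than coming from a learned value function.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.97):
    # Return at step t = reward at t + gamma * return at step t + 1.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(rewards, predicted_values, gamma=0.97):
    # Positive: the actions did better than the value function expected.
    # Negative: they did worse.
    return discounted_returns(rewards, gamma) - np.asarray(predicted_values)


# Three steps with reward 1 each; the predictions are made-up numbers.
print(advantages([1.0, 1.0, 1.0], [1.0, 2.0, 1.0], gamma=0.5))
# → [ 0.75 -0.5   0.  ]
```

In the example, the first step earned more than predicted (so the action taken there would be made more likely), the second earned less (made less likely), and the third matched the prediction exactly.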

This may seem a bit involved as far as explanations go, but I assure you: I’m giving you the very abbreviated version.

Testers Still With Me?

A question I’ve asked repeatedly in these posts is one I’ll ask now: how much of any of this do you find relevant? Keep in mind where we started. We had some data science to provide the basis of our observables. We had the business requirements for the basis of the problem, which was having an agent solve a problem where those observables were the basis for learning how to control a system. We then had the idea of some algorithms that would allow the agent in question to apply strategies that would lead to better or worse policies.

We also created some test harnesses to get a feel for the mathematics involved as well as determine how to run against the environment such that we can determine what success or failure looks like.

But how much of this was tester activity? We are in an industry that already conflates developers with testers to a disturbing degree. We are already in an industry that feels much, if not all, of what testers do can be automated. So it’s necessary to have some thoughts about how test techniques, test thinking, and the specialty of testing itself provides value in this context.

And speaking of that context, while I’ve provided all of this in the context of the relatively simple CartPole problem, the basis of everything discussed here — literally everything — is effectively the basis of data science, machine learning and artificial intelligence environments.

Results and Analysis

Each of the above scripts should have generated some graphs for you. Even if you think very little of the previous material had much to do with testers, surely this is where you will shine. After all, testing is partly about gathering information, analyzing the information, and providing a narrative — a framing context — for what was observed.

So let’s check out the graphs you should have gotten for each script. Here’s the graph for the random policy:

And here’s the graph for the random-noisy policy:

And finally here’s the graph for the gradient policy:

You might have noticed a particular print statement near the end of each script. As a tester, looking at what is going to run and understanding the output is critical. For each execution you probably got something along these lines:

  • Random: Average number of episodes: 14.75
  • Noisy: Average number of episodes: 5080.63
  • Gradient: Average number of episodes: 461.13

The numbers will almost certainly have differed a bit but they should be in the ballpark of the above.

Frame the Testing

Okay so — tester! — it’s your time to shine. You’ve had four posts explaining the basis of this material to you. You have scripts that have executed some strategies (algorithms). And those executions have resulted in some very specific graphs.

What’s your analysis? What do you tell the business team and the developers?

Keep in mind that each script is running 10,000 test cases of its respective policies. Each policy variation is, in fact, a test case. Given that, it should probably stand out to you that only one of those graphs actually gets to 10,000 on the x-axis.
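One way to go beyond eyeballing the graphs is to summarize the episode counts directly. This is a minimal sketch under my own assumptions: summarize and its field names are hypothetical, the cap of 10,000 matches the training limit described earlier, and the sample list is made-up illustrative data, not output from any of the scripts.

```python
from statistics import mean, median


def summarize(episodes_per_search, cap=10_000):
    # A search that reports the cap never found a good policy,
    # so count those separately instead of letting them hide in the mean.
    hit_cap = sum(1 for e in episodes_per_search if e >= cap)
    return {
        "mean": mean(episodes_per_search),
        "median": median(episodes_per_search),
        "max": max(episodes_per_search),
        "share_at_cap": hit_cap / len(episodes_per_search),
    }


# Made-up counts for illustration only.
print(summarize([10, 20, 10_000, 30]))
```

In the made-up sample, a single capped search drags the mean far above the median. That is exactly the kind of effect to check for before comparing the averages of the three scenarios.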

At a glance it might seem that noisy does the absolute worst. Keep in mind that it’s noise based on randomness. And there is a noise_scaler variable set at 0.1 in the script. As a tester, hopefully you were wondering what would happen if that scaling value was changed. Since the random script has no noise and the noisy script is random-with-some-noise, it should have struck you that there can be an equivalence class here if the noise scaling is set to such a value that the search basically becomes purely random.

The graph of the policy gradient seems better than noisy, certainly, but it also took a much longer time to execute. Does that matter? Is time to execute part of the qualities that we are looking for? Oh, wait, we didn’t ask that at all yet, did we? But then again perhaps we didn’t even know to ask until we saw the complexity of the scripts and/or saw their execution. So test execution might inform some of our test design, which in turn may suggest how we want to frame activities and results.

At first glance, and correct me if you think this is wrong, it would almost seem that the random approach worked the best. This is going by not just time to execute, but by the average number of episodes (14.75) compared to the other scenarios and the look of the graph distribution. But … can that be right? How could just taking a series of random actions have been better than approaches that weighted those actions based on the rewards achieved?

Or is this to be expected given the nature of the problem? But if this CartPole problem is translatable to other problems that have continuous observables, that would imply that randomness is a sound strategy for those as well.

And So It Goes!

And thus do we have the basis for more exploration and more experimentation, working with our teams to figure out different models, different ways to craft policies, different strategy implementations, and so on.

I realize this series of posts can appear to have been a series of unfair statements that are somehow suggesting that testers aren’t any good in these environments. In fact, quite the opposite. I believe testers are critical in these environments. But I am suggesting that many testers may not be ready to translate their existing techniques in such a way that they can seem value-adding in such environments.

My goal with this series was to get testers thinking about their value, what they have to do to provide that value, and what skills they may need to consider leveling up a bit. In particular I tried to show that many of the techniques that we testers currently use, including social relationships, are no different at a fundamental level from what we deal with in any environment. But certainly there is a need for refining those techniques and those relationships to meet specific contexts.

And so that’s it folks! It’s probably too much to hope that this was a fun ride; but I do hope it was interesting and, ideally, a little bit enlightening.


This article was written by Jeff Nyman


2 thoughts on “The Tester Role in Machine Learning, Part 4”

  1. Thanks for a great blog post series! It really got me thinking about the role of the tester in an AI context. Are we testing the algorithm, the learning data sets and/or the finished product once it has learned? I guess the answer is: a little bit of each.

    Overall, I think the level of the posts was just right for me (computer science background, but more than 15 years since I studied maths). There were some holes I had to fill in for myself, but I felt like that was part of the challenge. E.g. how the returned theta value was represented.

    And, since I am a tester, I have some potential nitpicks to bring up 🙂 From what I learned from the code, the termination condition for the theta angle is 12, not 15 degrees. Also, it is stated in the first post that an unsuccessful termination would be when “the episode length is greater than 200”. Is that really unsuccessful? In that case we have managed to balance the pole for more than 195 time steps. The code seems to return 0 in reward once you go over 200, but my guess is that is just for not skewing the average value for the reward. Feel free to correct me if I misunderstood anything.

    Last thing, do you have any reading recommendations that introduce the concepts of machine learning, deep learning, etc. in a good way?

    Thanks again!



    1. You are absolutely right on 12 versus 15. The “15” actually came from the original implementation (still documented here), but I used the updated implementation. I thank you for pointing this out; it is now corrected and was a silly mistake on my part.

      Yes, the going over 200 comes from this bit of code in the CartPole logic (envs/

      tags={'wrapper_config.TimeLimit.max_episode_steps': 200},

      So the threshold is 195 for the rewards, keeping the pole balanced, while going over 200 is a termination of the episode itself. You need a termination condition for the episode no matter what because you could just stay there being successful forever. And, yes, as you note rewards have to stop after that point. But I do see your point in that this actually could be made a bit clearer in the post.
