The Tester Role in Machine Learning, Part 1

As a tester, are you ready to work in environments that are based in or around data science and machine learning? What will you actually do in these environments? How will you interact with developers? How technical do you have to be? Is it all just automated testing? Or do we still have room for a human in there somewhere? Let’s dig into this a little bit by going through a scenario.

I already provided a running example of this with algorithmic searching using Pac-Man as well as a simplified Q-learning demonstration. Here I’ll dig into some specifics of thinking around these problems, and I’ll do this via resources from OpenAI, an organization that’s attempting to democratize the knowledge of artificial intelligence and machine learning.

The OpenAI Context

So first there’s OpenAI and then there’s something called the OpenAI Gym and there’s even something called the OpenAI Universe. The main difference is that the Gym is where you can develop and test out various machine learning algorithms. The Universe gives you the ability to have your algorithms interact more with the real world. In this post, I’ll just take you through using the Gym. Do note that these are resources you can use to level up your own skills and gain experience.

To play along, you will need Python installed. You can use Python 2 or 3, but I always recommend 3 because we need to stop relying on Python 2 already. If on Windows, try to make sure you get a 64-bit version of Python. In general, the easiest way to get the Gym is simply this:

pip install gym

This should work on all platforms, even Windows. If you get much more involved with Gym, there are elements that require a good development tool chain and Windows often doesn’t have it by default. But for this post, you should be okay with the above command.

The Gym Platform

The Gym platform is a Python library that is designed to provide environments. An environment in this context represents a problem that an agent can interact with. The idea is that this agent can learn to fulfill different tasks based on the environment it observes. The agent will use various algorithms in service to that goal. These algorithms will generally be of the category known as reinforcement learning algorithms. Two other categories are supervised learning algorithms and unsupervised learning algorithms.

Agents Learning In Environments

In the reinforcement learning scenarios, which I’ll focus on here, the agent starts by trying random actions, and as a consequence of each action it gets rewarded. Sometimes those “rewards” are negative and thus are rather more like punishments. In my Pac-Man example, there was a “living reward” which was -1. The idea was that for every step Pac-Man spent “living” in the world (and thus hadn’t solved his maze), he lost a point.

Based on the rewards, the agent continuously learns which action is good in which specific situation. Doing so, the agent learns how to “get good” at the environment, i.e., at solving the problem the environment poses. For Pac-Man, for example, this was finding the quickest path to eating all the available dots. Importantly, this agent-based learning can happen without the agent ever being told how the environment actually works.
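To make “learning from rewards” a little more concrete, here is a minimal sketch of one common reinforcement learning technique, tabular Q-learning. The state names, action names, learning rate (alpha) and discount (gamma) are illustrative assumptions and are not taken from the Pac-Man example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Nudge the value of (state, action) toward reward + discounted best next value."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)        # every (state, action) value starts at 0.0
actions = ["left", "right"]   # hypothetical action names

# One step that only earns the "living reward" of -1.
new_value = q_update(q, "s0", "right", -1, "s1", actions)
```

After that single update, the value of taking “right” in state “s0” has dropped below zero, which is the agent’s first hint that lingering costs points.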

The context for all of this can vary widely. This can be an agent learning how to play video games. This can be an agent learning how to recognize certain faces in picture frames captured from video cameras. This can be an agent learning to figure out the sentiment of a particular statement culled from a social media posting. This can be an agent determining what genomic sequences to focus on given apparent markers of progression for diseases like cancer.

Generalized Execution … And Testing

One of the advantages of this type of learning is that it’s completely generic and not bound to a specific problem. This means you can apply your algorithms to different environments without changing the algorithm at all. As a tester, you might have started thinking about “equivalence classes” with what I just said. As a tester in a machine learning environment, that notion will need to be refined a bit.

Another notion to consider is the conditions you are dealing with. The algorithm might be the test condition and the various environments it’s used in are data conditions. Or is it the other way around? And does it actually matter? Ponder upon that.

All this being said, as a tester, you first have to understand the context. There is data science here. And there are algorithms. And there is machine learning. And those things are paired up to provide software agents that do valuable things and help us gain insights. Companies that do better at this than others will have more competitive advantage. So let’s look at an example of this in action.


CartPole is one of those environments that I just talked about. In this environment, a pole is attached to a cart. The cart moves along a frictionless track. The pole starts upright, and the goal is to prevent the pole from falling over by increasing and/or decreasing the cart’s velocity in a coordinated way. Think of the idea of balancing a pencil or ruler upright on your finger as you try to keep it from tipping over.

Obviously moving your hand too fast will cause the pencil to tip. But not moving your hand at all will allow (not cause) the pencil to tip. So you have to take some actions, but no one can provide you a specific script of exactly what action to take at each point in time. You have to learn as you go based on experience.

So let’s play around with this and, along the way, we’ll investigate the tester role in machine learning. Granted, CartPole isn’t like figuring out cancer or helping spot bad guys, but the very simplicity of this problem will, I hope, make it easier to convey some of the ideas without getting lost in the complexity of the problem itself. What I talk about here scales to those larger concerns.

Spin Up Your (Test) Environment

To get started, create a Python script. We’ll start it off by loading up the CartPole environment.
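A minimal sketch of such a script, assuming Gym was installed with the pip command shown earlier. The import guard is an addition of this sketch: it turns a missing library into a readable message instead of a traceback.

```python
# Assumes Gym was installed with `pip install gym`.
try:
    import gym
except ImportError:
    gym = None
    print("Gym is not installed correctly; try `pip install gym` again.")

if gym is not None:
    # make() looks up an environment by its ID and returns a new instance.
    # "CartPole-v0" is version 0 of the CartPole environment.
    env = gym.make("CartPole-v0")
```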

You can run that script, although not much will happen. However if you get an error running it, that will tell you that your Gym library was not installed correctly. What the make method does is use the gym to create a new environment. In this case, I’m using version 0 of CartPole. What you are doing here, as a tester, is equivalent to spinning up an environment that has been provided as part of a sprint. Gym is basically acting as a container for these environments.

Incidentally, if you’re one of those testers that likes some insight into the code behind what you are testing, you can check out the source code of CartPole.

What Can I Do? What Can I See?

“What can I do?” (actions) and “What can I see?” (observations) are two things testers are always thinking about. And there’s a nice parallel with that because every environment in the Gym platform provides mechanisms that describe valid actions and states. The states are treated as observations. Specifically, an observation is “the state that an agent observes.” So now let’s add some statements that let us check the action and state (observation) space:
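A minimal sketch of those statements, assuming Gym is installed and that the two spaces are read straight off the environment object as `action_space` and `observation_space`. The import guard is an addition of this sketch so that it stays quiet when Gym is missing, and the exact text printed varies between Gym versions.

```python
# Assumes Gym was installed with `pip install gym`.
try:
    import gym
except ImportError:
    gym = None

if gym is not None:
    env = gym.make("CartPole-v0")

    print(env.action_space)        # "What can I do?"
    print(env.observation_space)   # "What can I see?"

    # The same information in a form a test can assert on.
    n_actions = env.action_space.n
    observation_shape = env.observation_space.shape
```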

The output names two spaces: a Discrete space for the actions and a Box space for the observations.

But what does that mean?

  • The Discrete action space allows a fixed range of non-negative numbers, so in this case valid actions are a total of two: either 0 or 1. In the CartPole environment, 0 corresponds to pushing the cart to the left while 1 corresponds to pushing the cart to the right.
  • The Box observation space represents an n-dimensional box. Here what this is telling us is that valid observations will be an array of four numbers. Why four? Because that’s what this environment, CartPole, provides just as it provides only two actions.

STOP! As a tester, what are you thinking right now?

What’s the (Relevant) Business Domain?

Well, hopefully you’re thinking: “I need to know exactly what all those observation numbers are!” And you’re right. You do. This is part of the data science that is going into this particular problem. In the CartPole context, those numbers are:

  • 0 – Cart Position
  • 1 – Cart Velocity
  • 2 – Pole Angle
  • 3 – Pole Velocity at Tip

This is the equivalent of gathering the requirements for understanding a problem. These are essentially business domain rules. They also serve as constraints on the nature of the domain, in terms of what’s worth considering. The color of the pole, for example, has no relevance whatsoever. Nor does its height or its girth. But we don’t know all that unless we establish the domain.

What’s My Domain-to-Range Ratio?

Testers should always be thinking about their domain-to-range ratio in any testing environment. It’s absolutely critical, however, in data science or machine learning contexts. The domain-to-range ratio is the quotient of the number of possible inputs over the number of different outputs. This isn’t the post to go into that since it can get kind of involved. For now, I’ll just state that we can add two more statements which will allow us to investigate the range of the valid observations.
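A minimal sketch of those two statements, assuming Gym is installed and that the bounds are exposed as the `high` and `low` attributes of the observation space. As before, the import guard is an addition of this sketch.

```python
# Assumes Gym was installed with `pip install gym`.
try:
    import gym
except ImportError:
    gym = None

if gym is not None:
    env = gym.make("CartPole-v0")

    print(env.observation_space.high)   # upper bound for each of the four observations
    print(env.observation_space.low)    # lower bound for each of the four observations
```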

You should get something like this:

[  4.80000000e+00   3.40282347e+38   4.18879020e-01   3.40282347e+38]
[ -4.80000000e+00  -3.40282347e+38  -4.18879020e-01  -3.40282347e+38]

What this is showing is the maximum and minimum values for the four observations we can make. This is providing you with some information about how complex your space is. And, at a glance, it would seem really complex given that the range of possible states is so large. But perhaps we can bound that as part of what we are testing for. After all, this environment and an agent within it presumably fall within some parameters that we consider acceptable. (Parameters. Keep that word in mind. But you’ll have to wait for the third post for me to dig in to it.)
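Those raw numbers are easier to reason about once they are converted. The sketch below uses only the values printed above and the Python standard library; the variable names are my own.

```python
import math
import struct

# Upper bounds copied from the output above:
# cart position, cart velocity, pole angle, pole velocity at tip.
high = [4.8, 3.40282347e+38, 4.18879020e-01, 3.40282347e+38]

# Largest finite 32-bit float, decoded from its IEEE 754 bit pattern.
FLOAT32_MAX = struct.unpack("<f", bytes.fromhex("ffff7f7f"))[0]

# The pole angle bound is expressed in radians; convert it to degrees.
angle_bound_deg = round(math.degrees(high[2]), 3)

# The velocity bounds are just "the biggest float there is", i.e. effectively unbounded.
velocity_is_unbounded = math.isclose(high[1], FLOAT32_MAX, rel_tol=1e-6)

print(angle_bound_deg, velocity_is_unbounded)
```

So only two of the four observations are meaningfully bounded: the cart position at 4.8 and the pole angle at about 24 degrees. Both happen to be exactly twice the termination thresholds (2.4 and 12°) that come up further on, which is the kind of detail worth asking a developer about.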

So let’s consider these parameters because that, in essence, becomes part of the interface for testing.

What’s My Interface?

What Gym provides for each environment is a consistent and common interface, and this interface essentially provides the classic “agent-environment loop”.

The idea is that for each timestep, the agent chooses an action and the environment returns an observation and a reward as a result of that action being taken. When I say “returns an observation” just think of that as a new state. So basically you have state1 --> action --> state2. And so on. Each new state, called a successor state, is something the agent can observe, hence the synonymous naming.
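The loop can be sketched without Gym at all. The toy environment below is hypothetical and exists only to show the shape of the interaction; Gym’s own `step` method returns additional values that are left out here.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment: the episode ends after a fixed number of steps."""

    def __init__(self, length=5):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t                     # initial observation (state)

    def step(self, action):
        self.t += 1
        observation = self.t              # successor state
        reward = 1.0                      # one point per timestep survived
        done = self.t >= self.length      # termination condition
        return observation, reward, done


def run_episode(env, choose_action):
    """Run state -> action -> state until the environment ends the episode."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = choose_action(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
    return total_reward


# A purely random agent: it ignores the observation entirely.
total = run_episode(CountdownEnv(length=5), lambda obs: random.choice([0, 1]))
```

Swapping in a smarter `choose_action` changes nothing about the loop itself, which is the point of having a common interface.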

An environment will have termination conditions, either successful (usually positive) or unsuccessful (usually negative). A sequence of actions leading to a termination is called an episode. For CartPole, any given episode terminates unsuccessfully when one of the following conditions holds true:

  • The pole angle is more than ±12° from vertical.
  • The cart position is more than ±2.4.
  • The episode length is greater than 200.
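Written as a predicate, those three conditions look like this. It is a sketch that mirrors the bullets exactly as worded above, with the angle taken in degrees; the function and constant names are my own and are not Gym’s.

```python
MAX_POLE_ANGLE_DEG = 12.0   # degrees from vertical, either direction
MAX_CART_POSITION = 2.4     # distance from the center of the track, either direction
MAX_EPISODE_STEPS = 200     # episode length limit


def episode_over(cart_position, pole_angle_deg, steps_taken):
    """Return True when any one of the three termination conditions holds."""
    return (
        abs(pole_angle_deg) > MAX_POLE_ANGLE_DEG
        or abs(cart_position) > MAX_CART_POSITION
        or steps_taken > MAX_EPISODE_STEPS
    )
```

Notice that each threshold is a boundary, so as a tester you would immediately want to probe values sitting exactly on it, such as an angle of 12° or a position of 2.4.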

What Does Success Look Like?

What do we have there? I said the above is about terminating unsuccessfully but that gave us some ideas of what success looks like, right? And that understanding, I think we can all agree, is critical for testing. But make sure you know what “success” actually means in each case. That second point, for example, is really saying that the center of the cart reaching the edge of the display counts as termination of the unsuccessful variety. Not necessarily obvious from that description, right?

The above are the termination conditions. What is the ultimate success condition? The idea is that the CartPole problem is considered solved when the average reward is greater than or equal to 195 over 100 consecutive episodes. So our agent may work out how to solve the CartPole problem after, say, 88 episodes. But to prove success, there has to be a further set of 100 episodes (so at least 188 episodes in total) across which the returned reward averages at least 195 per episode.
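That success condition is simple enough to check mechanically. The sketch below assumes you have recorded the total reward of every episode in a list; the function name and the sample rewards are made up for illustration.

```python
def first_solved_episode(rewards, window=100, threshold=195.0):
    """Return the 1-based episode count at which the most recent `window`
    episodes first average at least `threshold`, or None if that never happens."""
    running = 0.0
    for i, reward in enumerate(rewards):
        running += reward
        if i >= window:
            running -= rewards[i - window]   # drop the episode that left the window
        if i + 1 >= window and running / window >= threshold:
            return i + 1
    return None


# Hypothetical history: 88 poor episodes, then 100 episodes scoring exactly 195.
rewards = [20.0] * 88 + [195.0] * 100
solved_at = first_solved_episode(rewards)
```

With that made-up history, the criterion is first met at episode 188, matching the arithmetic in the paragraph above.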

As with any testing situation, being able to understand success is critical.

Let’s Recap Our Environment

CartPole is what’s called a binary classification problem. There are four features as inputs. In machine learning, “features” are the things you care about to solve the problem. In this case, the features refer to those four observables we already talked about.

But, actually, let’s talk about those again. Those features include:

  • the cart position (x)
  • the cart velocity (x_dot)
  • the pole’s angle to the cart (theta)
  • the pole’s angular velocity, i.e., the derivative of its angle, or how fast the pole is “falling” (theta_dot)

You might notice how I gave some other descriptors there (x, x_dot, etc.) and refined the previous descriptions of the observations. That’s important because that language will infuse a lot of how developers and data scientists talk about these situations.

STOP! What I just said there is important. As testers we are once again called upon to be a medium for communication that ties the business level terms with the underlying technical terms.

All four features are, as we saw, continuous values (floating point numbers), which certainly implies an infinitely large feature space. It’s not quite infinite as you saw from the observation space output above, but it’s still incredibly large by any rational measure. Yet while the inputs are continuous, the outputs are binary. The outputs are either 0 or 1, corresponding to “left” or “right”.
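To see that continuous-in, binary-out shape in code, here is a deliberately naive, hand-written policy. It is hypothetical and not something an agent has learned, and it assumes a positive theta means the pole is leaning to the right.

```python
def naive_policy(observation):
    """Hypothetical rule: push the cart toward the side the pole is leaning."""
    x, x_dot, theta, theta_dot = observation   # four continuous inputs
    return 1 if theta > 0 else 0               # one binary output: 1 = right, 0 = left


action = naive_policy((0.01, -0.2, 0.05, 0.3))
```

Four floating point numbers go in and a single 0 or 1 comes out, no matter how much the inputs vary.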

But notice how our outputs are very easy to conceptualize (actions to take) whereas the inputs are easy to visualize at a high level but don’t as easily map to what you might actually see. Well, unless you are visually able to correlate a specific theta_dot and theta value to the exact observed position of a moving pole. In which case I salute you.

What this further dive into our inputs and outputs means is that our domain-to-range ratio is really high! Again, I don’t want to get too much into that but you can consider the domain-to-range ratio to be an indicator of risk. The higher the ratio, the worse it is to have few tests.
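A rough way to put numbers on that ratio is to pretend each continuous feature is bucketed into a fixed number of bins. The bin counts below are purely illustrative assumptions; counting the raw floating point values would give something far larger.

```python
def domain_to_range_ratio(bins_per_feature, n_features=4, n_outputs=2):
    """Number of distinguishable inputs divided by number of distinct outputs."""
    n_inputs = bins_per_feature ** n_features
    return n_inputs / n_outputs


coarse = domain_to_range_ratio(10)    # 10 bins per feature
fine = domain_to_range_ratio(100)     # 100 bins per feature
print(coarse, fine)
```

Even the coarse bucketing gives thousands of input combinations per output, and refining the buckets by a factor of ten multiplies the ratio by ten thousand.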

STOP! What does “few tests” versus “many tests” even mean in this kind of context? As a tester, be thinking about that as we continue on here.

Next Steps

Okay, that’s a fairly gentle start to what I want to talk about. I was able to introduce a data science and machine learning context as well as some focal points for thinking about testing in that context. In the second post in this series, we’re actually going to exercise the CartPole idea a bit.


This article was written by Jeff Nyman
