This post continues directly from the first and second parts. I covered a lot of material in those posts and can’t easily recap it all here, so definitely read them before reading this one. Here we’ll dig more into how a tester actually *tests* in this context, but we’ll also look at testing as a framing activity.

Let’s consider what we’ve been doing here. We’re not training an agent to simply figure out one set of actions and then apply them every single time it’s in a given environment. That would be akin to memorization in a human. And that *can* work if your environment has no randomness to it or no variables that are outside of your control. That’s not the case with CartPole or, in fact, with most environments. What we really want to do is train an agent to find a policy.

And this is getting into the meat of what we’re actually testing for here.

## Policies — Guided By Heuristics

A *policy* is a set of actions an agent can take in any possible state. These policies are more like heuristics instead of sets of instructions or scripts. Much of machine learning is about the process of finding these heuristics or, rather, the parameters that go into a heuristic. In the first post, I asked you to keep the term “parameter” in mind. Now you’ll see why.

**STOP!** But wait! Isn’t that just about exactly what we do as testers? Don’t we use heuristics to guide how we explore an application or service? Yes, we do! So there’s a great conceptual interplay of what testers do and what these kinds of environments must do. In fact, we’ve been building up to using an agent that explores its environment so it can better learn how to apply actions and see if those actions are good or bad.

Uh, wait. Does that mean testers are basically like these agents? No, it doesn’t, but you might want to check out my Testing and AI post for an interesting take on this.

Okay, so we want to train our agent to find a good policy for the CartPole problem. Specifically, we want our agent to learn an ideally *optimal* policy that takes the four observation values (which we looked at in the last post) and then makes a decision as to what action to take (move right or move left) given the values the agent is observing at any given time.

## Getting Set Up

We’re actually going to revisit the script we ended up with in the last post but for now create a new script called **cartpole-policy.py** and put the following in it:

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")
observation = env.reset()
```

Consider this new script the equivalent of a development spike that developers tend to do as they learn more about what they want to build. As a technical tester working in such environments, you might find yourself pairing with developers in this context for your own learning or also to work together to create scripts that help probe the problem space.

The above is a very simple start and almost matches what we started the first post with. One difference here is that, as part of this post, we’re going to use NumPy. NumPy is a numerical computation library that is very popular in the Python world.

It’s hard to work in a data science, machine learning, or artificial intelligence environment in a Python context and *not* work with NumPy to some extent. I certainly won’t be providing a tutorial to this very large library here so if you’re not familiar with it, treat this as just a real-life test exercise where there are bits you don’t fully understand. That said, I won’t be covering massively complicated stuff with NumPy anyway so I think we’ll all have smooth sailing.

You should have NumPy already if you installed Gym, but if for whatever reason you didn’t, you could just do this:

pip install numpy

## Policy Parameters

So let’s talk about this policy stuff. Keep in mind that in CartPole’s environment there are four observations at any given state, representing information such as the angle of the pole and the position of the cart. Using these observations, the agent needs to decide on one of two possible actions: move the cart left or right.

Well, how do we do this? How do we map these observations to an action choice? One way is via a technique called linear combination. Being really simplistic here, linear combination involves constructing expressions from a set of terms by multiplying each term by a constant and then adding the results. A very common approach in this kind of problem is to define what’s called a “vector of weights.” Each weight, in our case, would correspond to one of the observations, thus we would have four weights.

More specifically, each weight is multiplied by its respective observation and the products are then summed up. This is equivalent to performing what’s called an inner product, or matrix multiplication, of the two vectors.

Just to cover terms here: vectors, mathematically speaking, are really just special types of matrices, which themselves are just arrays of numbers. Also, if you are somewhat into math, you may think I meant to say “dot product” rather than “inner product.” A dot product is a type of inner product. For one-dimensional real vectors like ours, the two are the same thing.

So let’s say we had some parameters (a vector) like this:

array([-0.91841861, 0.99507566, -0.90386279, -0.03262479])

And we have an observation (another vector) like this:

array([-0.83330851, -0.1958385 , 0.78598085, 0.46157554])

Matrix multiplication would give us this:

-0.15502573901193228

Here’s how we might do this in Python with NumPy:

```python
policy_parameters = np.random.rand(4) * 2 - 1
action = 0 if np.matmul(policy_parameters, observation) < 0 else 1
```

This uses NumPy to generate random weight values between -1 and 1 and then derives a value for the `action` variable via matrix multiplication of the weights with the observation. If that value is less than 0, the agent would move left. Otherwise, the agent would move right.
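If you want to convince yourself that this “matrix multiplication” really is just the weighted sum described earlier, here’s a quick sketch using the example vectors from above (the variable names are mine):

```python
import numpy as np

# The example policy parameters (weights) and observation from above.
weights = np.array([-0.91841861, 0.99507566, -0.90386279, -0.03262479])
observation = np.array([-0.83330851, -0.1958385, 0.78598085, 0.46157554])

# Manual linear combination: multiply each weight by its observation, then sum.
manual = sum(w * o for w, o in zip(weights, observation))

# NumPy's matrix multiplication of two 1-D vectors is that same inner product.
via_matmul = np.matmul(weights, observation)

print(manual)  # matches the result shown above
```

Both computations give the same value, which is exactly the result shown above.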

## Hey! Who Invited Math to the Party?

Yeah, you have to understand some math and some statistics when you get into these contexts. There’s absolutely no way around that. The good news is that the libraries — like NumPy — do much of the work for you. But, just like with anything automated, they don’t do the thinking for you or choose the right techniques for you. You have to know what to apply to those tools and so the mathematical basis of this kind of work is important to level yourself up in.

Given that we just did this bit of math, let’s run our script just to see that you get values:

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")
observation = env.reset()

policy_parameters = np.random.rand(4) * 2 - 1
action = 0 if np.matmul(policy_parameters, observation) < 0 else 1

print(action)
```

Running this repeatedly will output a 0 or a 1 depending on the random policy chosen by `np.random`. Wait. Policy? What policy? In this case, the policy *is* those policy parameters. But this is only being done for one observation, and the initial one at that, so calling this a “policy” is a bit generous.

So we want to go back to what we did in the previous post and create a way to run multiple steps in the environment until the problem is solved or not. In other words, we want to run episodes. After each episode we reset the environment and run more episodes. Again, that’s what we did in the previous post. But now we’re adding the idea of a policy to the mix.

## Incorporate the Policy

So now let’s abandon our development spike script and go back to our **cartpole.py** script, which is what we ended up with at the end of the second post. Here’s that script in full, with the addition of NumPy:

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")

for episode in range(200):
    observation = env.reset()

    for timestep in range(500):
        env.render()
        print(observation)

        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)

        if done:
            print("Completed after {} timesteps.".format(timestep + 1))
            break
```

Now let’s add our development spike work to that. I’m going to reproduce the entire script here just to make sure you have full context.

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")

for episode in range(200):
    policy_parameters = np.random.rand(4) * 2 - 1
    total_reward = 0
    observation = env.reset()

    for timestep in range(500):
        # The policy: the weighted sum of the observations picks the action.
        action = 0 if np.matmul(policy_parameters, observation) < 0 else 1

        env.render()
        print(observation)
        print(action)

        observation, reward, done, info = env.step(action)
        total_reward += reward

        if done:
            print("Completed after {} timesteps.".format(timestep + 1))
            print("Total Reward: {}".format(total_reward))
            break
```

Note that the random `env.action_space.sample()` call from the previous script is gone: the action now comes from the policy parameters rather than being sampled at random.

What you’re going to end up with here actually isn’t that much more sophisticated than what we did in the second post. You’ll have a bunch of these data points:

```
[-0.01993485  0.02136037 -0.03918042  0.02794792]
0
[-0.01950765 -0.17317844 -0.03862146  0.3080161 ]
1
[-0.02297122  0.02247194 -0.03246114  0.0034075 ]
1
[-0.02252178  0.21804403 -0.03239299 -0.29933799]
0
[-0.0181609   0.02339842 -0.03837975 -0.01704462]
1
[-0.01769293  0.21904917 -0.03872064 -0.32158549]
1
[-0.01331194  0.41470051 -0.04515235 -0.62622356]
1
[-0.00501793  0.61042269 -0.05767682 -0.93277789]
1
[ 0.00719052  0.80627343 -0.07633238 -1.24301307]
1
[ 0.02331599  1.00228749 -0.10119264 -1.55859794]
1
[ 0.04336174  1.1984649  -0.1323646  -1.8810586 ]
0
[ 0.06733104  1.00501012 -0.16998577 -1.63221642]
0
[ 0.08743124  0.81224328 -0.2026301  -1.39696912]
1
Completed after 13 timesteps.
Total Reward: 13.0
```

What that’s showing you is that we’ve annotated our previous output a bit, adding the action that was taken (0 or 1) and the total reward. Given that the agent gets one point for each timestep survived, it’s probably not a surprise to you that the total reward (13, above) matches the number of timesteps (also 13).

This is important though if you keep in mind that we started with some initial data science: the parameters that go into what makes an upright pole tip. We’ve now turned that data science into an actionable set of parameters that can be observed to change and, further, can change based on some actions. And that leads us to ask: what have we really done here so far?

## Creating a Model

Remember at the end of the last post I talked about models and said we hadn’t done much with that idea yet? Well, now we have. We’ve got a basic model for choosing actions based on observations.

Great, but how does this help us? It helps us because now our problem is that our agent needs to know how to modify these weights to keep the pole standing up. How does the agent know that? To know *that*, the agent needs some concept of how well it’s doing. And, lucky us, we just said how well it’s doing in our output. For every timestep the agent keeps the pole straight, it gets a positive 1 reward. Therefore, to estimate how good a given set of weights is, the agent can just run an episode until the pole drops and see how much reward it got.
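That “run an episode and see how much reward a set of weights earns” idea can be sketched as a small helper. A couple of assumptions to flag: `run_episode` is my own name, and to keep the sketch runnable without Gym installed I’ve swapped in a toy stand-in environment that only mimics the classic Gym API shape (reset/step); it has none of CartPole’s physics. With the real thing you’d pass in `gym.make("CartPole-v0")` instead:

```python
import numpy as np

class ToyEnv:
    """Stand-in with the classic Gym API shape (reset/step). NOT real CartPole
    physics: it just returns a 4-value state and ends after 10 steps."""
    def __init__(self):
        self.rng = np.random.default_rng(0)

    def reset(self):
        self.steps = 0
        self.state = self.rng.uniform(-0.05, 0.05, size=4)
        return self.state

    def step(self, action):
        self.steps += 1
        self.state = self.state + (0.01 if action == 1 else -0.01)
        done = self.steps >= 10
        return self.state, 1.0, done, {}

def run_episode(env, policy_parameters):
    """Run one episode with fixed weights; return the total reward earned."""
    observation = env.reset()
    total_reward = 0
    for _ in range(500):
        # The linear policy: sign of the weighted sum picks the action.
        action = 0 if np.matmul(policy_parameters, observation) < 0 else 1
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

print(run_episode(ToyEnv(), np.random.rand(4) * 2 - 1))  # total reward: 10.0
```

The returned total reward is exactly the “how good are these weights?” score the agent needs.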

**AH HA!** Now our reward count makes a little more sense. The higher the reward, the more timesteps we ran. And remember from the first post that the goal is to keep the pole upright for at least 195 timesteps in a given episode.

This is really important to understand. We now have a basic model and we can run episodes to test *how well* that model performs. The problem we have is now much simpler: how can the agent *select* these weights (policy parameters) to receive the highest amount of average reward? That’s equivalent to asking: how does the agent choose what action to take to modify the weights given what it’s observing?

## Tester! How ya doin’?

Take a moment to stop here and reflect. Does the progression of activities make sense to you? I’m not trying to insult anyone’s intelligence. On the other hand I’m also specifically not hand-holding through every single nook and cranny that we discuss.

There’s a delicate balance here. What is largely true, however, is that if you are feeling cognitive friction reading these posts, that’s likely the same cognitive friction you will feel in an environment where you have to work on these things. That’s by no means a bad thing!

I just wanted to put this little aside in here as a reality check on myself. If I was presenting this in person, this would have been my check to see how many people had fallen asleep or were otherwise entirely disengaged.

## Taking Action

Now, above, I sort of slipped something in that is really important. I said: How can the agent *select* these weights (policy parameters) to receive the highest amount of average reward?

That’s not just an idle question to let us move ahead to the next step. Rather, that’s the *whole point* of this project! A learning agent is one that selects parameters relevant to a policy such that, by following that policy, the agent, on average, gets a high reward. Getting a high reward means that the agent is doing better. “Doing better” means that the agent is learning. And, putting on my Captain Obvious hat, learning is really the whole point of machine learning!

That is what we’re trying to achieve here. Ideally the agent finds an “optimal” enough policy that these parameters — and the resulting actions taken based on them — allow the pole to stay upright for at least 195 steps and do that for 100 consecutive episodes.
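One way to encode that success criterion is a small check over a history of episode rewards. Gym’s official phrasing for CartPole-v0 is an *average* reward of at least 195.0 over 100 consecutive episodes; the helper name here is my own:

```python
def is_solved(episode_rewards, threshold=195.0, window=100):
    """Return True if the average reward over the last `window` consecutive
    episodes meets or beats `threshold`."""
    if len(episode_rewards) < window:
        return False
    return sum(episode_rewards[-window:]) / window >= threshold

print(is_solved([200.0] * 100))  # True
print(is_solved([100.0] * 100))  # False
```

A check like this is exactly the kind of thing a tester might build to decide, at a glance, whether a training run actually met the stated goal.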

## Finding a Strategy

What we’re doing here is talking about the strategy that the agent will follow. And this is now where we do a bit of testing as a design activity, testing as an execution activity, and testing as a framing activity. As you will see, the strategy largely amounts to the type of algorithm we want the agent to use.

So what’s one strategy we can use for having our agent choose weights? We already looked at it a bit in the previous post. The agent could simply keep trying random weights then pick the one that works best. So, basically, an algorithm based on choosing some random numbers.
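The shape of that random-search strategy is worth seeing in code. In this sketch I’ve swapped in a toy scoring function of my own invention (closer to a fixed target scores higher) where the real version would run a CartPole episode and return its total reward:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(weights):
    # Toy stand-in for "run an episode with these weights, return total reward".
    target = np.array([0.5, -0.25, 0.75, 0.1])
    return -np.sum((weights - target) ** 2)

best_weights, best_score = None, -np.inf
for _ in range(200):
    candidate = rng.uniform(-1, 1, size=4)  # try completely random weights...
    if score(candidate) > best_score:       # ...and keep the best seen so far
        best_weights, best_score = candidate, score(candidate)

print(best_weights, best_score)
```

Each trial is independent: the algorithm never builds on what it found, it just remembers its best result.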

There’s another popular approach, which is the use of a so-called hill-climbing algorithm. The idea here is that you start with some randomly chosen initial weights but then, as the episodes move on, some noise is added to those weights. The agent then checks if those modified weights led to a better performance and, if so, uses those. Hill climbing is one of many, many numerical analysis approaches.
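A hill-climbing sketch looks almost the same, except the next candidate comes from adding noise to the best weights so far rather than from fresh random draws. As before, the scoring function here is a toy of my own, standing in for actually running episodes:

```python
import numpy as np

rng = np.random.default_rng(0)

def score(weights):
    # Toy stand-in for "run an episode with these weights, return total reward".
    target = np.array([0.5, -0.25, 0.75, 0.1])
    return -np.sum((weights - target) ** 2)

noise_scale = 0.1
weights = rng.uniform(-1, 1, size=4)  # randomly chosen initial weights
best = score(weights)
start_score = best
for _ in range(500):
    candidate = weights + noise_scale * rng.normal(size=4)  # perturb with noise
    if score(candidate) > best:  # keep the noisy weights only if they did better
        weights, best = candidate, score(candidate)

print(weights, best)
```

Because each step builds on the current best, hill climbing can home in on good weights faster than pure random search, though it can also get stuck on a local peak.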

Then there are more complicated approaches, like that of policy gradients. There is a *lot* of detail to this but basically we could represent a policy as focusing on the agent picking actions by adjusting its weights through — hang on to your socks! — gradient descent using feedback from the environment. Gradient descent is an iterative method for finding a minimum of some function. So if you’re given a function defined by a set of parameters (the weights, in our case), gradient descent starts with an initial set of parameter values and moves, in an iterative fashion, toward a set of parameter values that minimize the function.
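Gradient descent itself is easy to show on a toy function, separate from any policy machinery. Here the function f(x) = (x - 3)^2 has its minimum at x = 3, and the loop iteratively nudges the parameter against the slope (this illustrates the optimizer alone, not a full policy-gradient method):

```python
def f(x):
    return (x - 3) ** 2  # minimized at x = 3

def f_prime(x):
    return 2 * (x - 3)   # the gradient (slope) of f

x = 0.0                  # initial parameter value
learning_rate = 0.1
for _ in range(100):
    x = x - learning_rate * f_prime(x)  # step against the gradient

print(x)  # converges to approximately 3.0
```

The same “step against the gradient” idea is what policy-gradient methods apply to a policy’s weights, using reward feedback from the environment in place of a known formula for the slope.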

You can see the mathematics creeping in more and more beyond just the matrix multiplication that I mentioned above. This can be a struggle for testers who don’t necessarily have a background in math or statistics. I certainly fall into that category and I’ve had to level up my skills there a bit. Actually, more than just a bit.

## Testing as Framing Activity

All of what I just described here is a bit of a framing activity. It’s how testing frames the problem space and the solution space. What ties those two spaces together, of course, is the strategies that we just talked about.

Those strategies will utilize the various elements — policies, parameters, algorithms, data, etc — and provide a narrative around what’s actually happening and provide the basis for whether or not what we observe happening is or is not valuable.

## Testers In the Mix?

So let’s say that what I just talked about is roughly your situation. The developers say: “Okay, business wants us to figure out some strategy mechanism here. We have a few options. We could just go full random. We could go with some random but with noise sampling. Think of hill climbing. We might even want to consider policy gradients.”

As a tester, let’s ask some questions here. What has been *your* contribution so far? Were you aware of these algorithm choices and worked with business and development to decide on them? Or are you hearing about them for the first time when a developer tells you what the options are? And once you have those options, what are your next steps? Have you built anything to test these out? Or is the developer going to be doing that for you?

These can be a painful set of questions to answer if your role is fairly minimal and that’s even more so the case if your role will continue to be minimal as the process moves on.

And that takes us into the final post in this series, which will have us actually executing those strategies and then determining what we’re going to report.