This is the last of a four-part series (see parts 1, 2 and 3). The goal has been to investigate whether specialist testers have a role in machine learning environments that is distinct from development roles in those same environments. These posts have been getting you up to speed on what that might look like. Here we finish off that journey.
In most of these posts, I’ve spent a lot of time going over details and taking you somewhat step-by-step through script creation because those scripts were acting as test harnesses. Along the way, I asked you to consider how much of that work you were actually doing, as a tester, as opposed to just having it handed to you.
For this final part, I’m actually going to give you the full scripts. How you got them is less important than what you do with them. Or, rather, how you consume the results from them. We already did testing as a design activity in a few of these posts. We did some testing as an execution activity as well, which we’ll continue a bit here. We also did a little — very little — of testing as a framing activity.
So you can run the provided scripts. You can see the results. And then we’ll regroup and think about what that part of the testing process means for you, as a specialist tester, and why this is testing as a framing activity.
Some Dependencies
For these scripts you will need one dependency beyond those we used already, which is matplotlib. You can install it as follows:
pip install matplotlib
This will be used so that we can draw some graphs. Currently we’ve been relying on text output but as we’ve seen, getting a whole bunch of data thrown back at us doesn’t really help us reason about the system all that much. In fact, as we saw, we couldn’t even necessarily tell if a bit of data indicated success or failure.
Finally, for the last of these examples, you will need TensorFlow installed. TensorFlow is a library that allows for symbolic math. Its name comes from the fact that it can perform computations with multi-dimensional data arrays, which are commonly called tensors. Tensors take you into linear algebra and vector calculus, subjects I will not tackle at all here. To install it:
pip install tensorflow
Note that this will work just fine on a Mac or Linux. It should work okay on Windows — if you are using a 64-bit version of Python, as recommended in the first post. On a 32-bit version of Python, that might not work. If on Windows, your best bet is usually to use Python as part of a distribution like Anaconda or the Enthought distribution. One thing you can do is try to download one of the TensorFlow wheels. For example, download “tensorflow-1.4.0-cp36-cp36m-win_amd64.whl”, although the version number of TensorFlow might differ. Once you have that wheel downloaded, try this:
pip install tensorflow-1.4.0-cp36-cp36m-win_amd64.whl
If you are told that it’s “not a supported wheel on this platform”, you can try this next bit. Rename the wheel as follows:
tensorflow-1.4.0-py3-none-any.whl
Note the “-py3-none-any”, which essentially replaces all of the platform-specific bits in that filename. Then try the pip installation again against this newly named file. That worked for me on Windows 8 and Windows 10. Keep in mind, even if all this works, this relies on certain DLL files in Windows and during execution you might find something reported like this: “DLL load failed: %1 is not a valid Win32 application.” If that happens, it probably means you are using a 32-bit Python while the binaries being called upon are 64-bit.
If all that seems a bit much, just try one of those distributions I mentioned or make sure you have a 64-bit Python installed.
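If you are not sure which Python you have, a quick check from within Python itself will tell you. This is just a diagnostic sketch; both calls are standard library:

import platform
import struct

print(struct.calcsize("P") * 8)    # prints 64 for a 64-bit Python, 32 for 32-bit
print(platform.architecture()[0])  # prints something like '64bit' or '32bit'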
Test Scenarios
We talked a few times about our test scenarios and what a “scenario” means in the context of this example. The notion of a scenario can take on varying levels of granularity in these kinds of environments. Here we’ll treat each algorithm — the strategy the agent uses to find a policy — as a test scenario. It is the outputs of those scenarios that we will be reasoning about.
The scenarios below are what I briefly covered in the last post from a conceptual point of view. Here we put them into action. In all cases, I’ll present the scripts with a few explanatory notes. Technically you don’t have to run them if you don’t want to. I say that because I am going to come back and present their output to you. That said, I would certainly recommend running them and perhaps even annotating them with print statements if you want to see how things are operating.
Test Scenario: Random Policy
Here is cartpole-policy-random.py.
import gym
import numpy as np
import matplotlib.pyplot as plt

env = gym.make('CartPole-v0')

def create_policy():
    policy_parameters = np.random.rand(4)
    return (policy_parameters) * 2 - 1

def execute_policy(policy_parameters, observation):
    return 0 if np.matmul(policy_parameters, observation) < 0 else 1

def run_episode(env, policy_parameters):
    observation = env.reset()
    total_reward = 0
    for _ in range(200):
        action = execute_policy(policy_parameters, observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

def train_agent(env):
    all_rewards = []
    num_episodes = 0
    optimal_params = None
    optimal_reward = 0
    for _ in range(10000):
        num_episodes += 1
        policy_params = create_policy()
        reward = run_episode(env, policy_params)
        all_rewards.append(reward)
        if reward > optimal_reward:
            optimal_reward = reward
            optimal_params = policy_params
        if reward == 200:
            break
    return num_episodes

episodes_per_search = []

for _ in range(100):
    episodes_per_search.append(train_agent(env))

avg = np.mean(episodes_per_search)
print("Average number of episodes: {}".format(avg))

color = plt.get_cmap('bone')

plt.hist(episodes_per_search, facecolor='g', color=color(0.3), edgecolor='black', linewidth=1.1)
plt.plot([avg for _ in range(40)], np.linspace(0, 40, 40), color=color(0.7), linewidth=5)
plt.title('Histogram of Random Search Policy', fontsize=20)
plt.xlabel('Number of Episodes to Reach 200', fontsize=15)
plt.ylabel('Frequency of Episodes Reaching 200', fontsize=15)
plt.show()
Here our agent is being trained and the train_agent method calls out to a create_policy method to get the specific algorithm that will handle this. Here it’s the random algorithm we were talking about in the last post. Each episode is a single execution of our learning algorithm. That’s what the run_episode method is doing. As you can see, up to 200 actions will be taken in the environment for each episode. The agent will train like this for up to 10,000 episodes.
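As a quick illustration of how a policy turns an observation into an action, here is the execute_policy logic in isolation. The numbers are made up for the example; only the rule itself comes from the script:

import numpy as np

# Hypothetical policy parameters and a single observation
# (cart position, cart velocity, pole angle, pole tip velocity).
policy_parameters = np.array([0.5, -0.2, 0.8, 0.1])
observation = np.array([0.03, -0.4, 0.02, 0.3])

# The same rule the script uses: push left (0) if the weighted
# sum is negative, otherwise push right (1).
weighted_sum = np.matmul(policy_parameters, observation)
action = 0 if weighted_sum < 0 else 1
print(weighted_sum, action)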
Getting results is handled at the bottom. We’re training the agent 100 times and we’re doing this just to get some data for the graphs. But notice what we have here. We have the agent training being done 100 times. For each training, the agent is running an episode up to 10,000 times. And each episode will consist of up to 200 steps.
That run_episode method ends up being run over and over, each time with a different random policy. That’s really important to understand. I say that because while this “random policy” approach may be one test scenario, there are actually thousands of test cases being executed! Remember that discussion we had previously about the domain-to-range ratio? Now you’re seeing a bit more of how that comes into play.
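Some rough arithmetic, based purely on the loop bounds in the script, gives a sense of the scale. Keep in mind most training runs stop well short of these caps, so these are upper bounds only:

# Rough upper bounds taken from the loop limits in the script.
trainings = 100       # outer loop: how many times train_agent(env) is called
max_episodes = 10000  # episodes per training, each with its own random policy
max_steps = 200       # actions per episode

print(trainings * max_episodes)              # up to 1,000,000 policies tried
print(trainings * max_episodes * max_steps)  # up to 200,000,000 individual actions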
An episodes-per-search value is recorded for each training run, and this is the number of episodes it took to find a good policy. So what we’re asking here, as part of our results, is: “On average, how many episodes does it take to find a good policy that can get, in this case, a total reward of 200?”
Try it out!
Test Scenario: Noisy Policy
Here’s the next script, based on a hill-climbing style algorithm. This is cartpole-policy-noisy.py. This might take a little longer to run; perhaps about two to three minutes.
import gym
import numpy as np
import matplotlib.pyplot as plt

env = gym.make('CartPole-v0')

def create_policy():
    policy_parameters = np.random.rand(4)
    return (policy_parameters) * 2 - 1

def execute_policy(policy_parameters, observation):
    return 0 if np.matmul(policy_parameters, observation) < 0 else 1

noise_scaler = 0.1

def noise_params(policy_params, noise_scaler):
    return policy_params + (np.random.rand(4) * 2 - 1) * noise_scaler

def run_episode(env, policy_parameters):
    observation = env.reset()
    total_reward = 0
    for _ in range(200):
        action = execute_policy(policy_parameters, observation)
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

def train_agent(env):
    num_episodes = 0
    optimal_reward = 0
    all_rewards = []
    policy_params = create_policy()
    for _ in range(10000):
        num_episodes += 1
        new_policy_params = noise_params(policy_params, noise_scaler)
        reward = run_episode(env, new_policy_params)
        all_rewards.append(reward)
        if reward > optimal_reward:
            optimal_reward = reward
            policy_params = new_policy_params
        if reward == 200:
            break
    return num_episodes

episodes_per_search = []

for _ in range(100):
    episodes_per_search.append(train_agent(env))

episodes_per_search = np.array(episodes_per_search)

avg = np.mean(episodes_per_search)
print("Average number of episodes: {}".format(avg))

color = plt.get_cmap('bone')

plt.hist(episodes_per_search, facecolor='g', color=color(0.3), edgecolor='black', linewidth=1.1)
plt.plot([avg for _ in range(40)], np.linspace(0, 40, 40), color=color(0.7), linewidth=5)
plt.title('Histogram of Noisy Search Policy', fontsize=20)
plt.xlabel('Number of Episodes to Reach 200', fontsize=15)
plt.ylabel('Frequency of Episodes Reaching 200', fontsize=15)
plt.show()
The basic idea here is that the reward derived is a function of those observation parameters. When you start with a policy that’s random, you are basically at what’s called a “local minimum” of the function.
Basically, functions can have “hills” and “valleys”, meaning places where the function reaches a minimum or maximum value. There is also the concept of an interval, which can be some segment of the overall values of a function. Within each such interval the function can have a local maximum or minimum, where a local maximum means the “height” of the function at some point is greater than (or equal to) the height anywhere else in that interval, and a local minimum means it is less than (or equal to) it.
So being at the local minimum with a random policy just means that, since you’ve started off with a random value, it’s the lowest point you know about so far. You are in the “valley” of the function and you want to go up to the “hills” — meaning go “up” the function, to higher values. Higher values, in this context, mean a higher reward. That’s the “hill climbing” part. But if it’s all still random, how does this work?
What the above does is add small changes to the parameters each time. These small changes are referred to as “noise.” If that small change leads to a better reward (moving up the function), the parameters are updated to incorporate those small changes. If, however, the change leads to less reward or even the same reward, then you try with different small changes.
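As a minimal illustration of that perturbation step, using made-up parameter values, this mirrors what the script’s noise_params does:

import numpy as np

# Hypothetical starting parameters; noise_scaler controls how far each
# "step" on the hill can move from the current parameters.
policy_params = np.array([0.2, -0.5, 0.7, 0.1])
noise_scaler = 0.1

# Add a small random nudge in the range [-noise_scaler, noise_scaler) to each parameter.
new_policy_params = policy_params + (np.random.rand(4) * 2 - 1) * noise_scaler
print(new_policy_params)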
Test Scenario: Gradient Policy
Finally, here is the script for cartpole-policy-gradient.py. The complexity in this approach compared to the others should be fairly obvious just by the nature of the script. I should note that this script will take longer to execute than the others. Depending on your computer, whether you are using CPU or GPU, and the version of TensorFlow, this can easily take anywhere from 10 to 25 minutes to run.
import gym
import random
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

env = gym.make('CartPole-v0')

def produce_observations(world_state, p_probs, p_state, session):
    total_reward = 0
    states = []
    actions = []
    transitions = []
    for _ in range(200):
        state_vector = world_state.reshape(1, -1)
        result = session.run(p_probs, feed_dict={p_state: state_vector})
        probabilities = result[0][0]
        action = create_policy(probabilities)
        states.append(world_state)
        action_taken = np.zeros(2)
        action_taken[action] = 1
        actions.append(action_taken)
        old_state = world_state
        world_state, reward, done, info = env.step(action)
        transitions.append((old_state, action, reward))
        total_reward += reward
        if done:
            break
    return transitions, states, actions, total_reward, world_state

def produce_action_values(v_vals, v_state, world_state, transitions, session):
    advantages = []
    update_vals = []
    for i, transition in enumerate(transitions):
        obs, act, rew = transition
        discount = 0.03
        decrease = 1
        future_reward = 0
        for future_transition in transitions[i:]:
            future_reward += future_transition[2] * decrease
            decrease = decrease * (1 - discount)
        state_vector = world_state.reshape(1, -1)
        values_current = session.run(v_vals, feed_dict={v_state: state_vector})
        update_vals.append(future_reward)
        advantages.append(future_reward - values_current[0][0])
    return update_vals, advantages

def create_policy(probabilities):
    return 0 if random.uniform(0, 1) < probabilities else 1

def update_value_function(update_vals, v_optimizer, v_state, v_new_vals, states, sess):
    update_vals_vector = np.array(update_vals).reshape(-1, 1)
    sess.run(v_optimizer, feed_dict={v_state: states, v_new_vals: update_vals_vector})

def update_policy_function(advantages, p_optimizer, p_state, p_advantages, p_actions, states, actions, sess):
    advantages_vector = np.array(advantages).reshape(-1, 1)
    sess.run(p_optimizer, feed_dict={p_state: states, p_advantages: advantages_vector, p_actions: actions})

def policy_function():
    with tf.variable_scope("policy"):
        params = tf.get_variable("policy_parameters", [4, 2])
        state = tf.placeholder(tf.float32, [None, 4])
        actions = tf.placeholder(tf.float32, [None, 2])
        advantages = tf.placeholder(tf.float32, [None, 1])
        policy_function = tf.matmul(state, params)
        probs = tf.nn.softmax(policy_function)
        good_probs = tf.reduce_sum(tf.multiply(probs, actions), reduction_indices=[1])
        log_probs = tf.log(good_probs) * advantages
        loss = -tf.reduce_sum(log_probs)
        optimizer = tf.train.AdamOptimizer(0.01).minimize(loss)
        return probs, state, actions, advantages, optimizer

def value_function():
    with tf.variable_scope("value"):
        state = tf.placeholder(tf.float32, [None, 4])
        update_vals = tf.placeholder(tf.float32, [None, 1])
        w = tf.get_variable("w1", [4, 10])
        b = tf.get_variable("b1", [10])
        logits = tf.matmul(state, w) + b
        h = tf.nn.relu(logits)
        w = tf.get_variable("w2", [10, 1])
        b = tf.get_variable("b2", [1])
        calc_vals = tf.matmul(h, w) + b
        diffs = calc_vals - update_vals
        loss = tf.nn.l2_loss(diffs)
        optimizer = tf.train.AdamOptimizer(0.1).minimize(loss)
        return calc_vals, state, update_vals, optimizer, loss

def run_episode(env, session, policy_params, value_params):
    p_probs, p_state, p_actions, p_advantages, p_optimizer = policy_params
    v_vals, v_state, v_new_vals, v_optimizer, v_loss = value_params
    world_state = env.reset()
    transitions, states, actions, total_reward, world_state = produce_observations(world_state, p_probs, p_state, session)
    update_vals, advantages = produce_action_values(v_vals, v_state, world_state, transitions, session)
    update_value_function(update_vals, v_optimizer, v_state, v_new_vals, states, session)
    update_policy_function(advantages, p_optimizer, p_state, p_advantages, p_actions, states, actions, session)
    return total_reward

def train_agent():
    tf.reset_default_graph()
    policy_params = policy_function()
    value_params = value_function()
    session = tf.InteractiveSession()
    session.run(tf.global_variables_initializer())
    num_episodes = 0
    all_rewards = []
    for _ in range(2000):
        num_episodes += 1
        reward = run_episode(env, session, policy_params, value_params)
        all_rewards.append(reward)
        if reward == 200:
            break
    return num_episodes

episodes_per_search = []

for _ in range(100):
    episodes_per_search.append(train_agent())

avg = np.mean(episodes_per_search)
print("Average number of episodes: {}".format(avg))

color = plt.get_cmap('bone')

plt.hist(episodes_per_search, facecolor='g', color=color(0.3), edgecolor='black', linewidth=1.1)
plt.plot([avg for _ in range(40)], np.linspace(0, 40, 40), color=color(0.7), linewidth=5)
plt.title('Histogram of Gradient Search Policy', fontsize=20)
plt.xlabel('Number of Episodes to Reach 200', fontsize=15)
plt.ylabel('Frequency of Episodes Reaching 200', fontsize=15)
plt.show()
Whoa, huh? That’s a lot of stuff. In order to implement a policy gradient, we need a policy that can change in certain increments. In practice, this means switching from an absolute limit (move left if the total is < 0, otherwise move right) to probabilities.
So before, we had a situation where a policy would provide a number: if that number was below 0, the agent would move one direction; if it was 0 or more, the agent would move the other direction. The difference with a gradient approach is that the policy will provide not a number, but a probability. And then if a randomly drawn number is below that probability, the agent will move one way and if above that probability, the agent will move the other way.
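As a quick illustration of that draw, with a made-up probability value rather than one produced by the script:

import random

# Hypothetical probability produced by the policy for "push left" (action 0).
probability_of_left = 0.7

# The same style of draw used in the gradient script's create_policy: the higher
# the probability, the more often the random draw falls below it and action 0 is taken.
action = 0 if random.uniform(0, 1) < probability_of_left else 1
print(action)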
This is a vast oversimplification, of course. I don't plan to explain all of this in full; the goal is just to give you some idea of what's going on.
We have a policy and we take a series of actions. Specifically, we generate a set of actions to play out the CartPole problem and record any information from those actions (produce_observations) and then find a way to get the values of particular states that the CartPole world is in (produce_action_values). This provides what’s called an approximate action-value, or what’s sometimes known as an advantage function.
The “advantage” in this context refers to a value that we rely on to increase the probability of actions that are returning higher rewards than our value function says they probably should. In other words, those actions gave us an advantage.
There are two specific components: a value function (value_function) and a policy function (policy_function). We also have means to update both of those (update_value_function, update_policy_function). The basic idea is that we want our agent to learn a value function that gets to the point where it is able to determine which states of the CartPole world are “better” and that, in turn, leads to a policy that is better at maximizing rewards.
Key to all of this is that the produce_action_values method calculates the difference between the rewards the agent experienced when producing the observations (via the produce_observations method) and the rewards that the value function said the agent should have gathered. If the difference is a positive number, the agent updates its policy to perform actions that lead to those states more. If the difference is negative, the agent updates its policy to perform actions that lead to those states less.
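Here is a minimal, plain-Python sketch of that calculation with made-up numbers. The rewards and predicted values are hypothetical; the discounting loop mirrors what produce_action_values does:

# Rewards actually collected at each step of a (very short) hypothetical episode:
rewards = [1.0, 1.0, 1.0, 1.0]
# What the value function predicted each of those states was worth:
predicted_values = [3.5, 2.2, 1.4, 0.8]

discount = 0.03
advantages = []
for i in range(len(rewards)):
    # Discounted sum of the rewards from step i onward.
    decrease = 1.0
    future_reward = 0.0
    for r in rewards[i:]:
        future_reward += r * decrease
        decrease *= (1 - discount)
    # Positive advantage: the action did better than predicted, so the policy
    # update makes it more likely; a negative advantage does the opposite.
    advantages.append(future_reward - predicted_values[i])

print(advantages)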
This may seem a bit involved as far as explanations go, but I assure you: I’m giving you the very abbreviated version.
Testers Still With Me?
A question I’ve asked repeatedly in these posts is one I’ll ask now: how much of any of this do you find relevant? Keep in mind where we started. We had some data science to provide the basis of our observables. We had the business requirements for the basis of the problem, which was having an agent solve a problem where those observables were the basis for learning how to control a system. We then had the idea of some algorithms that would allow the agent in question to apply strategies that would lead to better or worse policies.
We also created some test harnesses to get a feel for the mathematics involved as well as determine how to run against the environment such that we can determine what success or failure looks like.
But how much of this was tester activity? We are in an industry that already conflates developers with testers to a disturbing degree. We are already in an industry that feels that much, if not all, of what testers do can be automated. So it’s necessary to have some thoughts about how test techniques, test thinking, and the specialty of testing itself provide value in this context.
And speaking of that context, while I’ve provided all of this in the context of the relatively simple CartPole problem, the basis of everything discussed here — literally everything — is effectively the basis of data science, machine learning and artificial intelligence environments.
Results and Analysis
Each of the above scripts should have generated some graphs for you. Even if you think very little of the previous material had much to do with testers, surely this is where you will shine. After all, testing is partly about gathering information, analyzing the information, and providing a narrative — a framing context — for what was observed.
So let’s check out the graphs you should have gotten for each script. Here’s the graph for the random policy:
And here’s the graph for the random-noisy policy:
And finally here’s the graph for the gradient policy:
You might have noticed a particular print statement near the end of each script. As a tester, looking at what is going to run and understanding the output is critical. For each execution you probably got something along these lines:
- Random: Average number of episodes: 14.75
- Noisy: Average number of episodes: 5080.63
- Gradient: Average number of episodes: 461.13
Your numbers will almost certainly differ a bit, but they should be in the ballpark of the above.
Frame the Testing
Okay so — tester! — it’s your time to shine. You’ve had four posts explaining the basis of this material to you. You have scripts that have executed some strategies (algorithms). And those executions have resulted in some very specific graphs.
What’s your analysis? What do you tell the business team and the developers?
Keep in mind that the random and noisy scripts allow each training run up to 10,000 episodes of their respective policies (the gradient script caps this at 2,000). Each policy variation is, in fact, a test case. Given that, it should probably stand out to you that only one of those graphs actually gets to 10,000 on the x-axis.
At a glance it might seem that noisy does the absolute worst. Keep in mind that it’s noise based on randomness. And there is a noise_scaler variable set at 0.1 in the script. As a tester, hopefully you were wondering what would happen if that scaling value was changed. Since the random script is no noise and the noisy script is random-with-some-noise, it should have struck you that there can be an equivalence class here if the noise scaling is set to such a value that it basically becomes purely random.
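As a sketch of how you might explore that question, assuming you refactor the noisy script so that train_agent accepts the scaling value as a parameter (as written, it reads the global noise_scaler):

import numpy as np

# Hypothetical exploration: sweep the noise scaling value and compare the
# average episodes-per-search for each. Assumes a refactored
# train_agent(env, noise_scaler) from the noisy script.
for noise_scaler in [0.01, 0.1, 0.5, 1.0, 2.0]:
    episodes_per_search = [train_agent(env, noise_scaler) for _ in range(20)]
    print(noise_scaler, np.mean(episodes_per_search))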
The graph of the policy gradient seems better than noisy, certainly, but it also took a much longer time to execute. Does that matter? Is time to execute part of the qualities that we are looking for? Oh, wait, we didn’t ask that at all yet, did we? But then again perhaps we didn’t even know to ask until we saw the complexity of the scripts and/or saw their execution. So test execution might inform some of our test design, which in turn may suggest how we want to frame activities and results.
At first glance, and correct me if you think this is wrong, it would almost seem that the random approach worked the best. That’s going not just by time to execute, but by the average number of episodes (14.75) compared to the other scenarios and the look of the graph distribution. But … can that be right? How could just taking a series of random actions have been better than approaches that weighted those actions based on the rewards achieved?
Or is this to be expected given the nature of the problem? But if this CartPole problem is translatable to other problems that have continuous observables, that would imply that randomness is a sound strategy for those as well.
And So It Goes!
And thus do we have the basis for more exploration and more experimentation, working with our teams to figure out different models, different ways to craft policies, different strategy implementations, and so on.
I realize these posts can appear to have been a series of unfair statements somehow suggesting that testers aren’t any good in these environments. In fact, quite the opposite. I believe testers are critical in these environments. But I am suggesting that many testers may not be ready to translate their existing techniques in such a way that they are seen as value-adding in such environments.
My goal with this series was to get testers thinking about their value, what they have to do to provide that value, and what skills they may need to consider leveling up a bit. In particular, I tried to show that many of the techniques we testers currently use, including our social relationships, are no different at a fundamental level from those we apply in any other environment. But certainly there is a need for refining those techniques and those relationships to meet specific contexts.
And so that’s it folks! It’s probably too much to hope that this was a fun ride; but I do hope it was interesting and, ideally, a little bit enlightening.
Thanks for a great blog post series! It really got me thinking about the role of the tester in an AI context. Are we testing the algorithm, the learning data sets and/or the finished product once it has learned? I guess the answer is: a little bit of each.
Overall, I think the level of the posts was just right for me (computer science background, but more than 15 years since I studied maths). There were some holes I had to fill in for myself, but I felt like that was part of the challenge. E.g. how the returned theta value was represented.
And, since I am a tester, I have some potential nitpicks to bring up 🙂 From what I learned from the code, the termination condition for the theta angle is 12, not 15 degrees. And also, it is stated in the first post that an unsuccessful termination would be when “the episode length is greater than 200”. Is that really unsuccessful? In that case we have managed to balance the pole for more than 195 time steps. The code seems to return 0 in reward once you go over 200, but my guess is that is just for not skewing the average value for the reward. Feel free to correct me if I misunderstood anything.
Last thing, do you have any reading recommendations that introduces the concepts of machine learning, deep learning, etc in a good way?
Thanks again!
You are absolutely right on 12 versus 15. The “15” actually came from the original implementation (still documented here), but I used the updated implementation. I thank you for pointing this out; it is now corrected and was a silly mistake on my part.
Yes, the going over 200 comes from the CartPole registration in the gym logic (envs/__init__.py), which sets a reward threshold of 195 and an episode limit of 200. So the threshold is 195 for the rewards, keeping the pole balanced, while going over 200 is a termination of the episode itself. You need a termination condition for the episode no matter what because you could just stay there being successful forever. And, yes, as you note, rewards have to stop after that point. But I do see your point in that this actually could be made a bit clearer in the post.
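For reference, the CartPole-v0 registration in gym versions from around that time looked something like this (paraphrased from memory, so check your installed copy of envs/__init__.py for the exact form):

register(
    id='CartPole-v0',
    entry_point='gym.envs.classic_control:CartPoleEnv',
    max_episode_steps=200,     # episode terminates after 200 steps no matter what
    reward_threshold=195.0,    # the "solved" threshold for average reward
)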