This post continues on from the first part where I went over the high-level details of a tester getting involved in a machine learning context. I left off just at the point of introducing the algorithm and letting us get to work. So here, in this post, we’re going to dig right in.
The Algorithm
Our Pac-Man agent is going to use the Q learning algorithm to build up his Q table. As a reminder, here is the Q learning formula we talked about:

Q(state, action) = R(state, action) + γ * Max[Q(next state, all actions)]
Before I get into the algorithm, I want to talk really briefly about the discount rate parameter I mentioned in the previous post, which is the part of the algorithm symbolized by the γ (gamma) value. The gamma parameter has a range of 0 to 1 but there are various ways to interpret what this means.
If gamma is closer to zero, Pac-Man will tend to consider only immediate rewards. Whereas if gamma is closer to one, Pac-Man will consider future rewards as having greater weight. This would mean Pac-Man is willing to delay certain rewards. This interpretation will have less meaning for us in this context.
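To make the discounting idea concrete, here is a tiny illustration of my own (not part of the original example): a reward that sits several moves away is worth γ^n of its face value right now, so a low gamma makes distant rewards nearly invisible while a high gamma keeps them attractive.

```python
# A small, made-up illustration of discounting: how much is a reward of 100,
# sitting three moves in the future, worth right now?
future_reward = 100
steps_away = 3

for gamma in (0.1, 0.9):
    present_value = (gamma ** steps_away) * future_reward
    print(f"gamma={gamma}: worth {present_value:.1f} now")

# gamma=0.1: worth 0.1 now   (Pac-Man barely cares about distant rewards)
# gamma=0.9: worth 72.9 now  (Pac-Man is willing to wait for them)
```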
Speaking more generally and much more conceptually here, I’m using the discount rate and conflating it a bit with something called the learning rate. The learning rate is how quickly a network or an algorithm abandons old beliefs for new ones. A higher learning rate means the network changes its mind more quickly, thus abandoning older beliefs quicker. That’s a bit anthropomorphic. How that works in practice is that a learning rate controls how much of the difference between a previous Q-value and a newly proposed Q-value is taken into account. That interpretation does have some relevance for us here.
However, put yet another way, you can consider a higher learning rate as a focus on more exploration and thus not relying as much on what has been learned already. A lower learning rate tends to focus more on exploitation of what has already been learned.
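For reference — and this is the textbook presentation rather than something we need for this example — when the learning rate is kept as its own parameter (usually written α), the Q learning update looks like this:

```
Q(state, action) = Q(state, action) + α * (R(state, action) + γ * Max[Q(next state, all actions)] - Q(state, action))
```

Setting α to 1 collapses this into the simpler formula we're using in these posts, which is part of why the two terms end up blurred together in write-ups like this one.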
My goal here was to show that sometimes in these machine learning contexts, there are some elements of interpretation that will come your way, depending on how data scientists or developers or subject matter experts present the information. You’ll also have certain terms that you see treated as distinct entities in some contexts (learning rate, discount rate) treated as the same in others. It’s important to have some facility with understanding how the definitions morph a bit.
Breaking Down the Algorithm
Okay, so the breakdown of the Q learning algorithm is as follows:
- Set the environment rewards in matrix R. This establishes the world.
- Set up an agent and initialize its matrix Q to zero. This establishes the agent’s knowledge of the world.
- Set a gamma (learning) parameter.
Then this bit of pseudo-code applies for how the algorithm works:
```
FOR EACH episode
    Select a random initial state.
    DO WHILE the goal state hasn't been reached
        Select one among all possible actions for the current state.
        Using this possible action, consider going to the next state.
        Get the maximum Q value for this next state based on all possible actions.
        Perform computation using the formula.
        Set the next state as the current state.
    END WHILE
END FOR
```
For Pac-Man to use his matrix Q — essentially to build up his memory based on experiences — he traces the sequence of states, from the initial state to the goal state. The algorithm we just looked at, that Pac-Man is using, finds the actions with the highest reward values recorded in matrix Q for whatever the current state is.
Upon starting out, this formula is basically useless because everything is 0 according to Pac-Man. But as he learns, that Q matrix is going to have its 0 values replaced with reward values. The algorithm constantly runs against that Q matrix, figuring out, based on Pac-Man’s experiences, what seem to be the best actions to take in certain states in order to get the best rewards. And where the Q matrix is lacking information, it gets filled in as new information is discovered.
So “Pac-Man using the Q learning algorithm to traverse the Q matrix” breaks down to this:
- Set the current state to the initial state.
- From the current state, find the action with the highest Q value.
- Set the current state to the next state.
- Repeat Steps 2 and 3 until the current state is the goal state.
There may be many paths to the goal but keep in mind that step 2 is trying to find the highest Q value. The algorithm will return the sequence of states from the initial state to the goal state but eventually it converges on the best solution given the rewards that Pac-Man has learned along the way. In this case, the “best solution” is the one that maximizes the rewards given along the way.
And keep in mind that what Pac-Man learns along the way will be indicative of where he started out. What Pac-Man “should” do if starting in state 4 is different than what he “should” do if starting out in state 0.
Try It Out
Let’s start by setting the value of the discount parameter gamma to 0.8. Under the looser “learning rate” reading above, that’s a fairly exploratory setting; under the stricter reading, it means later rewards carry nearly as much weight as immediate ones. We’ll also set the initial state for Pac-Man as state 1. Finally, it’s necessary to initialize matrix Q as a zero matrix.
Episode 1
Let’s go through the algorithm statements, assuming a start at state S1.

Select one among all possible actions for the current state.

Using this possible action, consider going to the next state.
Well, the actions possible from the starting state 1 are to go to state 3 or state 5. So we have:
S1 --> S3, S5
Here we’ll say Pac-Man chose to go to state 5, just by random pick.
Get the maximum Q value from this NEXT state (S5) based on all possible actions FROM that state (S5).
This is a critical part, so make sure you see what’s happening here. The actions possible from state 5 are state 1, state 4, and state 5. So we have:
S5 --> S1, S4, S5
Perform computation using the formula.
Okay, so we’re solving for Q(1, 5). That’s where Pac-Man started (1) and where he chose to go (5). Let’s plug in for R(state, action). This is the corresponding reward — R(1, 5) — for the state-action pair we’re solving for. Pac-Man gets an instant reward of 100 for taking that path.
Now we have to consider all the actions that can be taken from this state 5 and the values they might have. That goes into our equation as well:
```
Q(next state, all actions)
Q(5, 1), Q(5, 4), Q(5, 5)
```
Pac-Man hasn’t traveled these paths yet so he has no rewards. Remember, his Q table marks everything as zero initially. So we end up with this:
```
100 + 0.8 * Max[0, 0, 0]
100 + 0.8 * 0
100
```
So let’s put it all together:
```
Q(state, action) = R(state, action) + γ * Max[Q(next state, all actions)]
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)]
Q(1, 5) = 100 + 0.8 * Max[0, 0, 0]
Q(1, 5) = 100 + 0.8 * 0
Q(1, 5) = 100
```
At this point, Pac-Man is in state 5 because that’s what he chose. That happens to be the goal state and thus there’s nothing more to do. So we’ve finished an episode. Pac-Man’s brain now has an updated Q matrix based on what he has learned from his experiences — which, at this point, amounts to a single non-zero entry.
So we end up with a computation for Q(1, 5) — i.e., if Pac-Man starts in 1 and goes to 5 — of 100.
That will work well — as long as Pac-Man starts in state 1. But he might not always do so. For machine learning purposes, we want Pac-Man to be able to solve this maze no matter where he starts out. Further, remember that Pac-Man randomly chose state 5. He could have chosen state 3, in which case he would have had to do more learning since he would not have landed in the goal state.
Episode 2
Now let’s start a new episode, this time with Pac-Man randomly placed in state 3.
Let’s go through the algorithm again.
Select one among all possible actions for the current state.
Using this possible action, consider going to the next state.
From state 3, Pac-Man can go to states 1, 2, and 4. Here we’ll say Pac-Man arbitrarily chose to go to state 1.
Get the maximum Q value from this NEXT state (S1) based on all possible actions FROM that state (S1).
The actions possible from state 1 are state 3 and state 5.
Perform computation using the formula.
So we’re solving for Q(3, 1). Let’s plug in for R(state, action). This is the corresponding reward, R(3, 1), for the state-action pair we’re solving for. Pac-Man gets an instant reward of 0. The next state Pac-Man chose was 1, and he looks at all the actions available from it (3 and 5).
```
Q(next state, all actions)
Q(1, 3), Q(1, 5)
```
Pac-Man is using his Q table (memory) from the last episode. He hasn’t traveled to (1, 3) yet so that’s 0. But he has traveled to Q(1, 5). That’s 100. So we end up with this:
```
0 + 0.8 * Max[0, 100]
0 + 0.8 * 100
80
```
So let’s put it all together:
```
Q(state, action) = R(state, action) + γ * Max[Q(next state, all actions)]
Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)]
Q(3, 1) = 0 + 0.8 * Max[0, 100]
Q(3, 1) = 0 + 0.8 * 100
Q(3, 1) = 80
```
So we have yet another updated Q matrix.
It’s important to see what happened here. Pac-Man didn’t earn an immediate reward of 80 for going from 3 to 1 — the reward for that move is zero. What the 80 reflects is that Pac-Man now knows he’s probably on the right track: from his previous episode, he knows that once he’s in state 1, he can get to state 5 and collect a big reward.
But, as Pac-Man notes, he’s not in the goal state yet. So he needs to run the algorithm again. We know from state 1 he can go back to 3 or he can go to 5. Let’s say Pac-Man chooses a random action — remember his learning rate is high, so he’s exploring — but it just so happens that he chooses 5.
And state 5 has three possible actions to choose from: S1, S4, or S5. We can again compute the Q value as above using that.
```
Q(state, action) = R(state, action) + γ * Max[Q(next state, all actions)]
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)]
Q(1, 5) = 100 + 0.8 * Max[0, 0, 0]
Q(1, 5) = 100 + 0.8 * 0
Q(1, 5) = 100
```
If you are having déjà vu, that’s okay. We solved this before. This means the Q matrix is not updated in this case, so Pac-Man is left with exactly the values he already had — except that now he is in the goal state and so he stops.
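If you want to sanity-check those hand calculations, a few lines of Python will do it. This is just scratch work of my own — the reward values are the ones from the R matrix we’ll define properly in the code section below.

```python
# Scratch check of the episode calculations above.
gamma = 0.8

# Episode 1: Q(1, 5). R(1, 5) is 100; Q(5, 1), Q(5, 4), Q(5, 5) are all still 0.
q_1_5 = 100 + gamma * max(0, 0, 0)
print(q_1_5)  # 100.0

# Episode 2: Q(3, 1). R(3, 1) is 0; Q(1, 3) is 0 and Q(1, 5) is 100 from episode 1.
q_3_1 = 0 + gamma * max(0, 100)
print(q_3_1)  # 80.0
```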
Learning Is Reflected In the Artifacts
Eventually, if you do this enough — or, rather, have Pac-Man do it enough — you’ll start getting a Q matrix whose values have converged (we’ll see a concrete example of one when we get to the code below).
From this Pac-Man will, assuming convergence, be able to trace out the best sequences of states via a simple process of following the links with the highest values at each state.
For example, let’s say Pac-Man starts out at state 2 and, from previous episodes, has built up a converged matrix like that. From state 2 the maximum Q value points to state 3. It’s really the only choice. From state 3, Pac-Man has three choices. But one of them, state 2, has a lower value than the others, so we can rule that out. So the choice is between states 1 and 4. Pac-Man needs some way to break the tie. There are many ways to do that, but let’s stick with random choice and play this out.
- If Pac-Man randomly chooses state 4, he has three choices. The clear winner is state 5.
- If Pac-Man randomly chooses state 1, he has two choices. The clear winner is state 5.
So we have two sequences here:
- 2 -> 3 -> 4 -> 5
- 2 -> 3 -> 1 -> 5
And given our state diagram, that makes sense. When at state 3, going to state 1 or state 4 can both lead to state 5.
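For what it’s worth, the tie-breaking makes sense numerically too: assuming Pac-Man’s converged values look like the normalized matrix shown later in this post, Q(3, 1) and Q(3, 4) both sit at 80 while Q(3, 2) is only 51.2, so states 1 and 4 really are equally attractive from state 3 — and from either of them, the best next move is the 100-valued hop into state 5.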
Representing as Code
Here’s how we might start to represent all this in Python, using the popular NumPy library. Feel free to start a new Python script and put my code snippets in. This will give you something runnable by the time we’re done.
```python
import random
import numpy as np

# The R matrix. (The world as it actually is.)
R = np.matrix([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [-1,  0,  0, -1, -1, 100],
    [-1,  0, -1, -1,  0, 100]
])

# The Q matrix. (The world as the agent knows it to be.)
Q = np.matrix(np.zeros([6, 6]))

# The learning rate parameter.
gamma = 0.8

# The initial state of the agent.
initial_state = random.randint(0, 5)
```
All of those elements there should make sense to you even if you are not familiar with Python or some of the specifics of creating matrices in NumPy. And that’s an important point. You, as a tester, will be moving between abstractions. We started off with a high-level abstraction in the first post and now we’re moving down to a lower one.
The above is just some setup code. Now we need a function that returns all of the actions available from whatever state is provided to it. How about this:
```python
def available_actions(state):
    # Any action with a non-negative reward in the R matrix is available.
    current_state_row = R[state,]
    possible_actions = np.where(current_state_row >= 0)[1]
    return possible_actions
```
With this it’s possible to get the available actions from the initial state, as such:
```python
valid_actions = available_actions(initial_state)
```
Now let’s put in place a function that chooses a random valid action to be performed. Valid means the action chosen must be within the range of all the available actions. Further, it must be an action that actually can occur. Remember not all states in Pac-Man’s maze were reachable from each other. Here’s one way to do that:
```python
def sample_next_action(action_range):
    # Pick one of the available actions at random.
    next_action = int(np.random.choice(action_range, 1))
    return next_action
```
With this, we can sample the next action to be performed, as such:
```python
action = sample_next_action(valid_actions)
```
Finally, we need a function that updates the Q matrix according to the path selected and as a result of executing the Q learning algorithm. Here’s an example of that:
```python
def update(current_state, action, gamma):
    # Find the action(s) with the highest Q value from the state we're moving to.
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]

    # If several actions tie for the maximum, pick one of them at random.
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index, size = 1))
    else:
        max_index = int(max_index)

    max_value = Q[action, max_index]

    Q[current_state, action] = R[current_state, action] + gamma * max_value
```
The last line there is the Q learning formula. With this in place, you can update the Q matrix from the initial state, as such:
```python
update(initial_state, action, gamma)
```
So let’s look at two examples from this, with randomly selected starting points. Let’s say the luck of the draw had us starting in state 4 and we choose, as our next state, to go to state 5. The above logic would produce something like this:
```
Initial state: 4
Current row state: [[ -1   0   0  -1  -1 100]]
Available actions: [1 2 5]
R[current_state, action]: 100
Q[current_state, action]: 100.0
```
On the other hand, let’s say we randomly started off in state 3 and we choose, as our next state, to go to state 1. In that case, we get something like this:
```
Initial state: 3
Current row state: [[-1  0  0 -1  0 -1]]
Available actions: [1 2 4]
R[current_state, action]: 0
Q[current_state, action]: 0.0
```
What we have here is essentially our data. That’s a large part of data science: figuring out the data you want to use and making sure it’s valid. But another core part of that is then choosing the algorithms that you want to apply to that data.
Training the Agent
Now let’s set up the training. The agent should be allowed to train over a certain number of iterations, which essentially means calling the functions I just showed you over and over. The goal is to generate that Q matrix. Here’s an example of what that might look like, with a key part deliberately missing:
```python
for i in range(???):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)
```
If you put this script together, what would be the range you would put in there in place of the question marks? Take a guess. Trying to build up intuitions for values in these contexts can be good practice. In this case, you can clearly do too few iterations but you can also clearly do way too many.
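One way to replace that guess with a measurement — and this is a suggestion of mine, not something from the original script — is to track how much the Q matrix actually changes on each iteration and watch for the changes to flatten out:

```python
# Track how much Q changes per training iteration. Once the deltas sit at
# (or near) zero for a long stretch, extra iterations aren't buying much.
# The 1,000 here is only a placeholder; choosing the count is still your exercise.
deltas = []

for i in range(1000):
    q_before = Q.copy()
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)
    deltas.append(np.abs(Q - q_before).max())

print(deltas[-10:])
```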
Normalizing Data
One thing I didn’t cover in the first post is that you’ll want to normalize the Q table, since its values keep growing as the iterations add up. For example, with the current setup, you might get a Q matrix like this:
```
[[   0.      0.      0.      0.    399.97    0.  ]
 [   0.      0.      0.    319.97    0.    499.96]
 [   0.      0.      0.    319.98    0.      0.  ]
 [   0.    399.97  255.98    0.    399.96    0.  ]
 [   0.    399.95  255.98    0.      0.    499.96]
 [   0.    399.97    0.      0.    399.96  499.95]]
```
So you’ll normalize the data by dividing every value by the maximum value in the matrix and, in this case, multiplying by 100. So, for example:
```python
Q / np.max(Q) * 100
```
That would get you this matrix:
```
[[  0.    0.    0.    0.   80.    0. ]
 [  0.    0.    0.   64.    0.  100. ]
 [  0.    0.    0.   64.    0.    0. ]
 [  0.   80.   51.2   0.   80.    0. ]
 [  0.   80.   51.2   0.    0.  100. ]
 [  0.   80.    0.    0.   80.  100. ]]
```
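As a quick check on that arithmetic: the largest raw value was 499.96, and 399.97 / 499.96 * 100 works out to (roughly) 80, while 255.98 / 499.96 * 100 works out to 51.2 — which is exactly where those entries come from. And since every value is scaled by the same positive factor, normalizing never changes which action holds the highest Q value in a row; it just makes the matrix easier to read.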
But that’s just the start. Our agent — Pac-Man, in our case — has now explored a lot. But has he learned? Will he be able to choose the optimal path through the maze given some initial starting point? Pac-Man is out of training mode and now has to go into testing mode.
Testing the Agent
Here is where we figure out if the agent, due to our data and algorithm choice, can find the best path given the fact that it has trained and updated its memory (the Q matrix) with the rewards gained by traveling via certain states.
If you have a coding background feel free to try this out on your own. Keep in mind, this would not be developer code. This would be test code to see if the developer code works. But even if you don’t want to code it, you should be able to write down the test data and test cases that you expect to see if the code is working as planned.
Take some time to think this through before reading on. I’m going to cover the code implementation first but, for you testers, the important part is to see if you can think through the test cases. I’ll cover those after the code.
Coding the Test
Keep in mind the algorithm I provided earlier for checking the Q matrix:
- Set the current state to the initial state.
- From the current state, find the action with the highest Q value.
- Set the current state to the next state.
- Repeat Steps 2 and 3 until the current state is the goal state.
One possible implementation would be this:
```python
# Start from a random state and greedily follow the highest Q values
# until the goal state (5) is reached.
current_state = random.randint(0, 5)
steps = [current_state]

while current_state != 5:
    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]

    # Break ties between equally good next states at random.
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index, size = 1))
    else:
        next_step_index = int(next_step_index)

    steps.append(next_step_index)
    current_state = next_step_index
```
Feel free to put that in place in the script you’ve been building up and see what you get. What you should get is output that matches your tests. Let’s finish off this post by talking about those.
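If you want to see what a given run produced, a simple print at the end will do it (my addition, not part of the original snippet):

```python
print("Selected path:", steps)
```

Depending on the random start state, you’d see something like `Selected path: [2, 3, 1, 5]`.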
Writing the Tests
These tests will have an initial data condition, which is the start state. The output of the test is then the path that should be expected from that start state. Here’s what I would have come up with:
- Start State: 0
- Expected Path: [0, 4, 5]
- Start State: 1
- Expected Path: [1, 5]
- Start State: 2
- Expected Path: [2, 3, 1, 5]
- Start State: 3
- Expected Path: [3, 1, 5]
- Start State: 4
- Expected Path: [4, 5]
- Start State: 5
- Expected Path: [5]
Is that accurate? Didn’t we miss something? Think about it from a coverage standpoint. This has nothing to do with the code, but rather making sure you understand what has been presented such that you can craft effective tests.
The fact is we did miss something. The test for “Start State 2” actually should have two conditions applied to it. It’s also possible to use the path [2, 3, 4, 5]. Likewise, for “Start State 3” where you can also use the path [3, 4, 5].
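If you did want to automate those checks, here is a rough sketch of what the test might look like in pytest. To keep it self-contained I’ve pasted in the normalized Q matrix from above and wrapped the path-tracing loop in a helper I’m calling `trace_path` — both the helper name and the pytest framing are my own choices, not something from the original post. Note how start states 2 and 3 map to a set of acceptable paths rather than a single one.

```python
import numpy as np
import pytest

# The normalized Q matrix from earlier in the post.
Q = np.matrix([
    [0,  0,  0,    0, 80,   0],
    [0,  0,  0,   64,  0, 100],
    [0,  0,  0,   64,  0,   0],
    [0, 80, 51.2,  0, 80,   0],
    [0, 80, 51.2,  0,  0, 100],
    [0, 80,  0,    0, 80, 100]
])

# The path-tracing loop from above, wrapped as a (hypothetical) helper.
def trace_path(Q, start_state, goal_state=5):
    current_state = start_state
    steps = [current_state]
    while current_state != goal_state:
        next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
        if next_step_index.shape[0] > 1:
            next_step_index = int(np.random.choice(next_step_index, size=1))
        else:
            next_step_index = int(next_step_index)
        steps.append(next_step_index)
        current_state = next_step_index
    return steps

# Start states 2 and 3 each have two equally good paths, so the expectation
# is a collection of acceptable paths rather than a single one.
EXPECTED_PATHS = {
    0: [[0, 4, 5]],
    1: [[1, 5]],
    2: [[2, 3, 1, 5], [2, 3, 4, 5]],
    3: [[3, 1, 5], [3, 4, 5]],
    4: [[4, 5]],
    5: [[5]],
}

@pytest.mark.parametrize("start_state", sorted(EXPECTED_PATHS))
def test_agent_finds_an_optimal_path(start_state):
    assert trace_path(Q, start_state) in EXPECTED_PATHS[start_state]
```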
In The End, We Are Left with What?
These two posts took you through a machine learning example, one that utilized a bit of data analysis (reward matrices) and that had a particular algorithm (Q learning) chosen for use against that data. Thus I showed you a working example of what is talked about in an article about machine learning paired with data science.
But my focus here was on the testers who have to operate in this context, often shifting between abstraction levels. Here we went from business descriptions to visuals to some mathematics to code to data and test conditions. There’s an exciting future for testing in this arena. But it is incumbent upon testers to be ready for these opportunities.