I’ve talked before about the intersection of testing and AI, and I’ve provided a series of posts using a Pac-Man clone to introduce testers to algorithmic search. Here I’ll consider a really simple example of engaging with machine learning. I’ll focus on reinforcement learning, which often isn’t talked about as much.
Given how much material there is to cover, I’ll break this post into two parts. This first part will serve as the set up to get us thinking about the context of machine learning. The second post will then break down one specific algorithm as well as some code to show it in action.
In these posts, I’ll specifically focus on one type of algorithm, known as a Q-learning algorithm. I’ve chosen this because it’s essentially a simplification of the kind of reinforcement learning that takes place in the human brain when it learns. So here we’re combining a little bit of artificial intelligence with machine learning. I should note that while that’s my focus, what I say here generally translates to any technique where learning systems are based on trial-and-error exploration.
I should also note that I’m basing some of my descriptive aspects off of a relatively well-known introduction to Q-learning post. There were, however, a few errors in that post and a bit of unclear language. Some of the mistakes were cleared up by Q-Learning by Examples but the unclear language persisted, in my opinion, and in fact got slightly worse. So this is my attempt to provide some clarity around this, along with some more focused aspects of the examples.
The Basis of Testing Learning Systems
The basic idea of reinforcement learning is that you have a representation of some environment and then states within that environment. You also have a set of possible actions that can be taken in each of those states. Finally, you have an agent that acts in the environment and learns the value of each of those actions in each of those states. When Q-learning is applied to such a situation, you are solving for some value Q. This Q value is referred to as a state-action value and the “Q” part means “quality”, as in “What is the quality of this action from this state?”
For our purposes here, our agent will be Pac-Man and his environment will be his game board or maze. However, I’ll drastically simplify this a bit for pedagogical purposes. Here’s what I mean by that:
From this graph, we can put Pac-Man in any location (the start), and from that location, Pac-Man can go to a particular other location. One of those locations is the target. The starting location above is 2 and the target location is 5. This is a state diagram. Each link from one node to the next has an instant reward value associated with it. This means that going along that link or path will give that reward value to the agent. Here’s that same visual but annotated with those rewards:
The links that lead immediately to the target have reward values of 100. Other links not directly connected to the target have reward values of 0. The target node loops back on itself with a reward of 100. In a general scenario like this, we want the agent to reach the state with the highest reward so that if the agent arrives at the goal (target), it will stay there.
In other words, in this case, once the agent has reached node 5, the reward for going anywhere else is 0 while the reward for staying put is 100, so it just stays at node 5. In a Pac-Man game, of course, this would simply be the state where the game ends, probably corresponding to Pac-Man eating all the dots. Here the goal is just for Pac-Man to get to the right location.
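Since the annotated diagram may be easier to reason about in code, here’s one way to sketch it as a plain Python structure. Note that the exact set of links is an assumption based on the diagram; the text only pins down a few of them, such as the 2-to-3 link, the missing 2-to-1 and 3-to-0 links, and the links into the target.

```python
# A plain-Python sketch of the annotated state diagram: each state maps
# to the states it links to, along with the instant reward for that link.
# Links into the target state (5) are worth 100; everything else is 0.
links = {
    0: {4: 0},
    1: {3: 0, 5: 100},
    2: {3: 0},                 # the start state: its only link is to 3
    3: {1: 0, 2: 0, 4: 0},
    4: {0: 0, 3: 0, 5: 100},
    5: {1: 0, 4: 0, 5: 100},   # the target loops back on itself for 100
}

# There is no direct link from 2 to 1, so state 1 simply doesn't appear
# under state 2.
print(1 in links[2])  # False
```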
This Seems Really … Simple
It is. It always is at the start. But, as a tester, you’re going to be working in environments where algorithms are applied in contexts like these and you need to be able to start thinking about what that means. A good start at that is being able to translate the algorithms into a visual narrative. That’s what we’ll do here with the Pac-Man example. As we go through this, we’ll start to get into some math. And then we’ll translate this problem, and the math, into some working code.
Experience-Based Learning Systems
Pac-Man has to learn through experience. Pac-Man can pass from one node to another but, in this case, we can assume Pac-Man has little or even no knowledge of the environment. Thus Pac-Man doesn’t know which sequence of links leads to the target, even though we clearly do.
To start putting this into terms you’ll hear more often, each node is a state. Pac-Man’s movement from one state to another is an action. It’s clear from the diagram that some states can only be reached from certain other states. For example, Pac-Man can’t go directly from 2 to 1. Pac-Man also can’t go directly from 3 to 0.
So now let’s put the state diagram and the instant reward values into a reward table, called R.
This forms a matrix of (state, action) pairs. The -1 values in the table represent cases where there isn’t a link between states. Again, Pac-Man can’t go directly from 2 to 1 so the cell for (2, 1) is -1. Likewise, Pac-Man can’t go directly from 3 to 0 so the cell for (3, 0) is -1. But Pac-Man can go from state 2 to 3 and so the cell for (2, 3) is 0 because the reward for going that path is 0.
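As a sketch, the reward table R can be written as a NumPy matrix. The full link structure here is assumed from the diagram; the text explicitly confirms only a few cells, such as (2, 1) = -1, (3, 0) = -1, and (2, 3) = 0.

```python
import numpy as np

# The reward table R as a matrix: row = current state, column = action
# (i.e., the state moved to). -1 marks "no link between these states".
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

print(R[2, 1])  # -1: Pac-Man can't go directly from 2 to 1
print(R[2, 3])  #  0: the link exists, with an instant reward of 0
print(R[1, 5])  # 100: a link straight into the target
```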
Now keep in mind that this R matrix is something that we know. It’s a state of the world that objectively exists. The Pac-Man agent does not know this. The only way Pac-Man can learn about those rewards is to explore the world and see what he gets as a result of taking actions.
Ultimately what we want to see, from a machine learning perspective, is if our agent can learn better rather than worse ways of taking actions such that the target or goal is reached. When stripped of all the details, this is pretty much what most machine learning environments involve.
Okay, so R represents the world. But we need a way to represent what our agent, Pac-Man, knows about the world. We’ll add a similar table or matrix, called “Q”, that can be thought of as the “brain” of Pac-Man. More specifically, this matrix represents Pac-Man’s memory of what he has learned through experience of navigating the maze. Let’s consider what all this looks like:
I can’t stress enough how much I believe it’s important to have the ability to break down these concepts into a visual-narrative structure.
Just like matrix R, each row of matrix Q represents a possible state of Pac-Man and each column represents a possible action from that state. Pac-Man starts out knowing very little, so the matrix Q is initialized to zero. I say Pac-Man knows “very little” as opposed to “nothing at all” because, given the matrix, Pac-Man clearly knows that the number of states is six.
If Pac-Man didn’t even know that much, meaning didn’t know how many states were involved, then the matrix Q would start out with only one element, corresponding to wherever Pac-Man starts. What would happen then is that more columns and rows would be added in matrix Q as and when new states were found. But, again, here we’ll assume that, like a human player, Pac-Man has full knowledge of the possible states of his world.
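Given that full knowledge, initializing Pac-Man’s “brain” is a one-liner sketch: a six-by-six matrix of zeros.

```python
import numpy as np

# Matrix Q: Pac-Man's "brain". Same shape as R (rows are states, columns
# are actions), initialized to zero because Pac-Man has learned nothing
# yet. The 6x6 shape itself encodes the one thing he does know: that
# there are six states.
Q = np.zeros((6, 6))

print(Q.shape)  # (6, 6)
```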
So our agent, Pac-Man, has a knowledge representation but must learn through experience. In our machine learning context, we want Pac-Man to reason about the representation. As such, Pac-Man will explore from state to state until he reaches the goal. We’ll call each exploration an episode. Each episode consists of Pac-Man moving from the initial state to the goal state. Each time Pac-Man arrives at the goal state, a new episode is started.
There are multiple episodes because, keep in mind, Pac-Man has to figure things out. And he’s probably going to need multiple tries to do that, even with a small maze like this. You can imagine how much more work is involved as the states increase.
Transition Models and Rules
Something has to happen as Pac-Man goes through the states and starts to learn. This is often referred to as building a transition model, which is just a way to encode how to transition from state to state. A transition model is based on a transition rule. The transition rule of Q-learning is a relatively simple formula:

Q(state, action) = R(state, action) + γ × max[Q(next state, all actions)]

According to this formula, you are solving for a value assigned to a specific element of matrix Q. For example, that element might be Q(1, 5), which means “from state 1 take an action to go to state 5.” This value is equal to the corresponding value in matrix R (in this case, R(1, 5)) plus the maximum value of Q over all possible actions that can be taken in the next state, discounted by a parameter symbolized by the Greek letter gamma, γ. That last bit can be a little confusing. I’ll revisit it.
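The transition rule can be sketched as a small function. The discount value of 0.8 is just an assumed illustrative choice, not something fixed by the algorithm.

```python
import numpy as np

GAMMA = 0.8  # discount parameter gamma (an assumed illustrative value)

def update_q(Q, R, state, action):
    # Transition rule: the new Q(state, action) is the instant reward
    # R(state, action) plus gamma times the best Q value available from
    # the state we land in. In this encoding the state we land in *is*
    # the action, so "the next state" is just row Q[action].
    Q[state, action] = R[state, action] + GAMMA * Q[action].max()

# Demonstration with the Q(1, 5) example from the text. The instant
# reward is 100 and nothing has been learned about state 5 yet, so the
# update gives 100 + 0.8 * 0 = 100.
R = np.full((6, 6), -1)
R[1, 5] = 100
Q = np.zeros((6, 6))
update_q(Q, R, 1, 5)
print(Q[1, 5])  # 100.0
```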
This algorithm (which I’ll break down in the second post) is used by Pac-Man to learn from his experiences. Each episode is equivalent to one training session. In a training session, Pac-Man explores the environment (represented by matrix R), receiving rewards (if any) as he takes actions, until he reaches the goal state. The purpose of the training is to help Pac-Man learn and build up a model of the world, represented by matrix R and updated in matrix Q. More training (hopefully) results in a more optimized matrix Q, which means the agent moves in a more optimal way to achieve the goal.
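As a rough preview of what those training sessions look like in code, here’s a minimal sketch. It assumes the six-state reward matrix from the diagram, pure random exploration, a discount of 0.8, and episodes that each start from a random state; all of those are illustrative choices rather than requirements.

```python
import random
import numpy as np

GAMMA = 0.8  # discount parameter (an assumed illustrative value)
GOAL = 5     # the target state

# Reward matrix for the six-state example; the full link structure is
# assumed from the diagram described in the post.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

def train(R, episodes=500):
    n = R.shape[0]
    Q = np.zeros((n, n))
    for _ in range(episodes):
        state = random.randrange(n)           # each episode starts somewhere
        while state != GOAL:
            # Actions available from this state: any column that isn't -1.
            actions = [a for a in range(n) if R[state, a] != -1]
            action = random.choice(actions)   # pure random exploration
            # Transition rule: instant reward plus discounted best future Q.
            Q[state, action] = R[state, action] + GAMMA * Q[action].max()
            state = action                    # the action *is* the next state
    return Q

random.seed(0)  # fixed seed just so repeated runs are reproducible
Q = train(R)
print(Q[2].argmax())  # with enough episodes, the best first move from 2
```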
We also say that matrix Q converges toward an accurate picture of matrix R, meaning what Pac-Man starts to figure out about the world reflects how the world really is.
This Still Seems … Simple
And it still, basically, is. The foundations of a discipline tend to be much less difficult than they are given credit for. In this case, for example, there is an R (state of the world) and Pac-Man learns to build up a Q (view of the world), with the idea that Q comes to accurately reflect R.
Not too bad when you break it down, right? This is the kind of context you’ll often be dropped into on projects, so you’ll frequently have to backfill this basic knowledge if you don’t already have it.
At this point we have the basis to get into actually solving things. That is covered in the second post in this series.