In the previous post in this series we implemented a neural network from top to bottom, essentially allowing you to see how everything works from the start of inputting data to getting results. In this post, we’ll start to create a very similar neural network but I’ll take a bit more time to explain some of the specifics. Fair warning: this is probably going to be the longest post in this series so far because there’s a lot to dig into.
As a reminder, we did a few things at the end of the second post that got us a new script to play around with. So just to level set here, make sure you have a script called mnist.py and have only the following in it:
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
Also, let’s keep in mind what we essentially want at the end of this journey. We want a situation like this:
Here a decision has been made that the number being seen is a 0, although there was some possibility of that number being a 5 or a 6. Even better, how about this:
There was clearly no doubt there that what was being seen was a 5 and so that was the decision that was being made.
These decisions will be made on the basis of the evidence that was provided to the model. That evidence consists of the inputs. And those inputs are going to be operated upon (transformed) such that the model can consider the evidence in different ways — as different representations — and make decisions about it.
Details Of Our Data
The main thing we’ve done in the above script is load the data and store it in some variables. All of these data sets are encoded as Numpy arrays, called ndarrays, which just means n-dimensional arrays. And as you know from the third post, an n-dimensional array is just a tensor. Let’s get some basic information about our data. Here I’ll just use the data stored in train_images for analysis purposes; the test_images would yield a similar analysis.
print(train_images.ndim)   # 3
print(train_images.shape)  # (60000, 28, 28)
print(train_images.dtype)  # uint8
Incidentally, along the way I’ll be adding some print statements like the above to show the outputs. You can delete those statements after you run them if you don’t want them cluttering your script.
The above three attributes describe a cube of numbers or, in other words, a tensor. As you now know from the previous posts, pretty much all machine learning models use tensors as their basic data structure. Let’s break down what we’re dealing with here.
- The ndim is the number of dimensions. You can also think of those dimensions as the number of axes. For instance, a three-dimensional tensor has three axes, a two-dimensional tensor (matrix) has two axes, and a one-dimensional tensor (vector) has one axis.
- While the ndim tells you the total number of dimensions of the tensor, the shape tells you how many elements the tensor has along each axis. The shape of a three-dimensional tensor would look like this: (60000, 28, 28). The shape of a two-dimensional tensor (matrix) would look like this: (28, 28). The shape of a one-dimensional tensor (vector) would look like this: (1,).
- The dtype or data type is just what it sounds like: it indicates the type of data that is contained in the tensor. A tensor’s type could be something like float32, uint8, float64, and so on. You can see all three attributes in the short sketch that follows this list.
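Here’s that sketch, a minimal example using nothing but NumPy on some throwaway arrays (the exact integer dtype you see for the first array may vary by platform):

import numpy as np

vector = np.array([1, 2, 3])                    # one axis
matrix = np.array([[1, 2], [3, 4]])             # two axes
tensor3d = np.zeros((2, 3, 4), dtype='uint8')   # three axes

print(vector.ndim, vector.shape, vector.dtype)        # 1 (3,) int64 (or int32)
print(matrix.ndim, matrix.shape)                      # 2 (2, 2)
print(tensor3d.ndim, tensor3d.shape, tensor3d.dtype)  # 3 (2, 3, 4) uint8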
So let’s consider what we have for train_images: an ndim of 3, a dtype of uint8, and a shape of (60000, 28, 28). What this means is that we have a three-dimensional tensor of 8-bit integers. That tensor is an array of 60,000 matrices, each of which is made up of 28 × 28 integers. Those 28 × 28 matrices hold values for the pixel intensity of the image.
Using the shape data, we can get the total number of pixels, and thus the total dimensional space, as such:
total_pixels = train_images.shape[1] * train_images.shape[2]
Thus the value of total_pixels will be 784. This is an important variable for us to hold onto and you’ll see how it’s used later. But just understand that a 28 × 28 matrix will have 784 elements. Obviously I could have just hard-coded the 784 in there, but it’s good to use the representations of your data to get the values you want.
Visualizing Our Data
So we have 60,000 matrices. Each matrix is a representation of a grayscale image, with coefficients between 0 and 255. That just means that each element of the array is a number between 0 and 255. Keep in mind that in a grayscale image each pixel is black, white or some shade of gray. So each element is a value describing the intensity of the pixel. In this context, values closer to 0 mean background while those closer to 255 mean foreground.
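If you want to confirm that value range for yourself, a quick check like this works (and, like the other print statements, it can be deleted after you run it):

# The raw MNIST pixel values should span the full uint8 range.
print(train_images.min(), train_images.max())  # 0 255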
First, let’s get a look at some representative images:
for i in range(9):
    plt.subplot(3, 3, i+1)
    plt.tight_layout()
    plt.imshow(train_images[i], cmap='gray', interpolation='none')
    plt.title("Digit: {}".format(train_labels[i]))
    plt.xticks([])
    plt.yticks([])

plt.show()
You’ll get output like this:
This visualization isn’t really necessary but it does help us confirm that what we think is in the data set is actually in there. And notice this is also confirming that the data set we’re using from Keras matches up nicely with the data set we had originally downloaded in the second post.
Now let’s also get an understanding of that pixel distribution I talked about above:
plt.subplot(2, 1, 1)
plt.imshow(train_images[0], cmap='gray', interpolation='none')
plt.title("Digit: {}".format(train_labels[0]))
plt.xticks([])
plt.yticks([])

plt.subplot(2, 1, 2)
plt.hist(train_images[0].reshape(784))
plt.title("Pixel Value Distribution")
plt.show()
You should get something like this:
What you can see there is that the pixel values of the background are the majority. The parts of the image that make up the digit — i.e., closer to values around 255 — are a much smaller representation.
Most likely you’ll want to remove the above plotting logic. Like the print statements, these are used solely to get us information. Again, just to level set, here’s the minimum that your script needs to move forward:
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

total_pixels = train_images.shape[1] * train_images.shape[2]
Optimizing Our Input Data
The training data set is structured as a three-dimensional array of the following sort:
(number of samples × image width × image height)
That is what is represented by the (60000, 28, 28) that we got earlier. This is the kind of data that we will be passing into our neural network.
We can optimize this data a bit, however. Specifically, we can preprocess our data by reshaping it and scaling it. It generally makes sense to perform at least some scaling of input values, particularly when you’re going to feed those inputs to a neural network model.
The reason this is a good idea is that it reduces the amount of computation needed. Think of it as streamlining your evidence so that better decisions can be made from it. Beyond that, if you played around with me in the last post, you saw how not reshaping and scaling the data caused some errors when you tried to use certain data with the neural network we wrote.
To get started on our data optimization, we can reduce the images down into a vector of pixels. In this case the 28 × 28 sized images — a matrix — will be flattened into a 784 element vector holding pixel input values.
Add these statements to your script:
train_images = train_images.reshape((60000, total_pixels))
test_images = test_images.reshape((10000, total_pixels))
The data type hasn’t changed but consider what has changed for the training images:
- train_images.ndim = 2
- train_images.shape = (60000, 784)
- test_images.shape = (10000, 784)
Okay, that’s our reshaping. Let’s also change the data type:
train_images = train_images.astype('float32')
test_images = test_images.astype('float32')
Why are we doing this? I’ll come back to that in a second. Finally, let’s do some scaling:
train_images /= 255
test_images /= 255
These statements are why I changed the data type as well: because we have changed the range. A uint8 data type has a range of 0 to 255. But our values now fall within a much smaller interval, between 0 and 1, thus the cast to float32, since fractional values can’t be stored in a uint8. Floats are also very easy to vectorize. Vectorization is used in a lot of scientific computing, where large chunks of data often need to be processed as efficiently as possible.
Even if the details of this aren’t entirely clear to you, just be sure you understand that previously, our training images were stored in an array of shape (60000, 28, 28) of type uint8 with values in the [0, 255] interval. What we did here is transform that into a float32 array of shape (60000, 28 × 28) or (60000, 784) with values in the [0, 1] interval. That means all of our values are between 0 and 1 because we normalized the pixel values by dividing each value by the maximum of 255.
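If you want a quick sanity check on the preprocessing, something like this will do (again, feel free to delete it after running it):

# Confirm the reshape, the cast, and the rescaling all took effect.
print(train_images.shape)                      # (60000, 784)
print(train_images.dtype)                      # float32
print(train_images.min(), train_images.max())  # 0.0 1.0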
Optimizing Our Output Data
That takes care of our input. What about our outputs? There is a bit of work we can do there as well. Consider that we have a more constrained set of values for our outputs, which you can see by executing this statement:
print(np.unique(train_labels, return_counts=True))
You don’t need to keep that line in your script but if you do execute it, you’ll see something like this:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8), array([...]))
That’s a nice set of easy outputs to deal with: essentially just the numbers 0 through 9. So what can we do to make that even easier on us? We can perform an action that is described as “categorically encoding the labels.” Go ahead and add these lines:
train_labels = np_utils.to_categorical(train_labels)
test_labels = np_utils.to_categorical(test_labels)
The above will require you to add the following import at the top of your script:
from keras.utils import np_utils
You’ll also hear this referred to as a “one hot encoding of the class values,” which isn’t terribly informative. You might even hear it described as “converting class vectors to sparse binary class matrices.” That last one is probably the easiest to understand if you realize that a sparse matrix is one where most of the elements are zero. It’s easier to show rather than tell, so let’s consider how the test labels change. Here’s what you end up with before this encoding:
print(test_labels[1])  # 2
print(test_labels[2])  # 1
print(test_labels[3])  # 0
And here’s what you get after the encoding:
print(test_labels[1])  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
print(test_labels[2])  # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
print(test_labels[3])  # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
What that’s telling you is that the result is a vector with a length equal to the number of categories. The vector is all zeros except in the position for the respective category. Thus a ‘5’ will be represented by this vector: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.].
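If it helps to see the idea stripped of the Keras helper, here’s a hand-rolled equivalent of that encoding (purely illustrative; the one_hot function here is my own, not part of Keras):

import numpy as np

def one_hot(label, num_classes=10):
    # Row `label` of an identity matrix has a 1 only at that position.
    return np.eye(num_classes)[label]

print(one_hot(5))  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]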
Incidentally, the shape of train_labels would now be (60000, 10) and the shape of test_labels would be (10000, 10). And this means we can get another important variable:
total_classes = test_labels.shape[1]
The total_classes variable will be set to 10. Again, this value could have simply been hard-coded since it was well known ahead of time. Here I’m just reinforcing that you should reference the shapes of your data to get the information you want.
So What Have We Done?
Hopefully you see that some of what we just did here matches up with what we did with the data set that we downloaded back in the second post in this series. In both cases, we were getting our data ready to be sent into a neural network. This is part of why I took you through two different ways of getting the data; namely, so that you could see that the processes are the same.
In the prior posts I’ve said a couple of times that our goal is to build a model that will learn from the training data. With the additional knowledge you’ve gathered in this post, you can now understand that this model will be learning the relationship between a 784-dimensional vector (the input) and a 10-dimensional vector (the output).
The Neural Network Life Cycle
Let’s remind ourselves of the life cycle that we talked about in previous posts and that we’ll follow.
- Gather and Prepare Data
- Define the Model
- Compile the Model
- Fit the Model
- Evaluate the Model
- Generate Predictions from the Model
We’ve been stuck on that first point for a while now but, in reality, that’s where a lot of your design comes into play. Getting the right data and the right amount of data and then getting that data into the right format is critical for streamlining the other steps. Now, however, we’re at the point where we can move on. Specifically, we’ll move on from designing our data to designing our model.
Reproducible Results
Before we get started defining our neural network model, add this line to your script:
np.random.seed(1337)
The number you use for the seed value matters not at all. What does matter is that you are initializing the random number generator to a constant. I talked about the rationale for this in the previous post, but just to reiterate: the reason you do this is because it helps to ensure that the results of your script are reproducible.
Layers of a Model
Let’s consider a quick view of a neural network model:
The idea is fairly simple: a neural network is made of layers. Let’s consider those layers.
First, you have a set of inputs. While this is often called an “input layer”, it’s not actually part of the model. The inputs are what are fed into the model. Then you have an output layer. This, too, isn’t strictly part of the model; instead, this is what the model will ultimately produce. The model will produce those outputs by the operations of layers between the input and output.
Those “layers between” are the so-called “hidden layers” in the schematic above. I’ll get into more specifics about this but for right now just keep this very basic visualization in mind. You know we have our input data. Now we’re going to define the layers that we pass that data into, taking us to the second of our workflow tasks.
Defining the Model
In Keras, the simplest way to define a model is as a sequential model. The term “sequential” in this context has nothing to do with our data being a sequence; it simply means a linear stack of layers, regardless of whether you visualize them stacked top to bottom or left to right. To get started, add this line:
model = Sequential()
This will require you to add the following import at the top of your script:
from keras.models import Sequential
You can learn a bit more about the Sequential API that Keras provides but all you need to know for now is that this API allows you to create models layer-by-layer.
Remember that the input layer is made up of 60,000 784-dimensional vectors. So what we need now is another layer — a hidden layer — that will do something with those inputs. As suggested above, a “hidden” layer really just means “not an input or output layer.” How many of those hidden layers you have is a measure of how “deep” your network model is. This is where the “deep” in “Deep Learning” comes from.
You can add a layer to your model as such:
model.add(Dense(total_pixels, input_shape=(total_pixels,)))
This will require you to add the following import:
from keras.layers import Dense
We’ve added a “dense” layer as the first hidden layer. Now, as we go on here, think of each layer as being composed of some number of neurons. This number can be configured and it’s one of the more important configuration decisions you make for a neural network.
In our case, our hidden “dense” layer has the same number of neurons as there are inputs: 784. That’s why I had you create that total_pixels variable earlier, so I could make this connection a little more explicit.
What I want to call out is that the first argument to Dense is the dimensionality of the output space, which is to say the number of neurons in the layer. The input_shape argument specifies the shape of the input the layer will receive. When adding the first layer in the Sequential model you need to specify the input shape so Keras can create the appropriate weight matrices. For any remaining layers you end up creating, the shape would be inferred automatically.
Here the dimensionality and the input shape are the same thing but that doesn’t have to be the case. I’ll come back to this point, but for now let’s ask this: what is this “Dense” thing that we’re working with?
The idea of this layer is that each neuron in it receives input from all the neurons in the previous layer, thus the layer is said to be densely connected. A dense layer deals with two-dimensional tensors and outputs a two-dimensional tensor. That output is a new representation of the input tensor. The main thing to note is that our fully connected network would now look something like this:
For now, let’s keep adding to our model. Add the following:
model.add(Activation('relu'))
This will require you to add to your previous import:
from keras.layers import Dense, Activation
What we’ve done here is add an activation function to the dense layer that we just created. As a note, I could have condensed my logic above like this:
model.add(Dense(
    total_pixels,
    activation='relu',
    input_shape=(total_pixels,)
))
Specifically, I could add the activation function as an argument when creating the Dense layer. I wanted to break this up to better be able to discuss some specifics.
Speaking of specifics, I’ll repeat that this last step has added an activation function. The specific type of activation function is one called “relu” which is short for a “rectified linear unit.” But what does that mean? Well, let’s take a step back here.
Artificial Neurons
A neural network contains a model of an artificial neuron. One of the easiest such models to think about is the perceptron. A perceptron is basically just a tiny “machine” that has one or more inputs, does some sort of processing, and produces a single output. What I just described is a single-layer perceptron.
A perceptron takes several binary inputs — in the above case, x1, x2, and x3 — and produces a single binary output. So think of that little “machine” as one that makes decisions by weighing up evidence, which are the inputs, and drawing boundaries that allow it to classify or predict outputs.
In terms of those layers we’ve been talking about and constructing, the bare minimum you need is an input layer and an output layer. But those alone aren’t going to do you much good because all that would do is pass your inputs directly to your outputs. Or, rather, all you would have is a purely linear combination.
To avoid having just a linear combination, you need layers in between the input and the output. When these perceptrons are spread out over more than one layer, you get a multilayer perceptron.
As a note of accuracy, when you have multiple layers like this, the “perceptron” is actually something called a sigmoid neuron. But we don’t have to get into that right now. For now, simply notice that the visual I just showed you looks very much like the schematic I showed you earlier.
You can think of each layer as a part of the decision-making process. Each subsequent layer in the neural network will be using the results of the decision making process from the layer immediately preceding it.
But what does any of this have to do with that activation function?
Here is a fairly uncomplicated view of what a neuron means in this context:
Perhaps I can somewhat combine a few of the visuals here to make the structure clear:
So what do these artificial neurons do? They essentially calculate what’s called a weighted sum of their inputs. They then add what’s called a “bias factor.” Without an activation function, this is basically what a neuron “looks like” mathematically:
But we can add a specific function to be called that determines if the neuron is “fired” (activated) or not. The function that does this is — you guessed it — the activation function.
In effect, what this function is doing is making a determination as to whether the information (evidence) that the neuron is receiving is relevant for making a decision or should be ignored. Practically speaking, that means this function is determining the output of the layer it is part of.
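To make that concrete, here is a minimal sketch of a single artificial neuron, with made-up input, weight, and bias values (just for illustration; the real model works with whole tensors of these at once):

import numpy as np

def relu(z):
    # The activation: pass positive values through, zero out the rest.
    return max(0.0, z)

inputs = np.array([0.5, 0.2, 0.8])    # the evidence
weights = np.array([0.4, -0.6, 0.9])  # how much each input matters
bias = 0.1

weighted_sum = np.dot(weights, inputs) + bias
print(relu(weighted_sum))  # roughly 0.9, so this neuron "fires"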
The Mathematics of Our Layer
All of that was conceptually what the rectified linear function is doing. Now let’s put it into an actual context. Hopefully the third post in this series, with its math exploration, provided a bit of comfort for this part.
A simple way to think about our rectified linear activation function would be to consider what the actual relu function does. It’s quite simple:
relu(x) = max(0, x)
Here that function returns x when x is positive; otherwise it returns 0. Put in the context of our neural network, that function for x looks like this:
output = relu(dot(w, input) + b)
Here w refers to a two-dimensional tensor representing the weights that I mentioned in the previous post, and b refers to the bias factor that I also mentioned there. The input corresponds to whatever input is being dealt with at the moment and “dot” refers to a dot product. So here the dot product between the input tensor and the weight tensor is being calculated and a bias factor is being added to that.
Hopefully all that sounds very familiar to what I discussed in the third post.
After all that, as per the definition of the relu function, if the final value is positive, that’s the value that will be returned; otherwise zero will be returned. Think of any positive number as “the neuron activates (fires)” while zero would mean “the neuron does not activate.”
This is where I can now make the general math I showed you in the third post very specific to exactly what we’re doing here.
The Mathematics of a Model
So let’s say we define our layer like this:
model.add(Dense(784))
If the relu activation isn’t provided, the equation I showed you above would break down to this:
output = dot(w, input) + b
This is simply two linear operations: a dot product and an addition. This would mean that the layer could only learn linear transformations of the input data. This would mean that the hypothesis space of the layer would be the set of all possible linear transformations of the input data into a 784-dimensional space.
Such a hypothesis space, while seeming quite large perhaps, is actually way too restricted for learning purposes given the dimensionality of our input data. This kind of space wouldn’t benefit from having multiple layers of representations. The reason for that is because even a deep stack of linear layers — i.e., adding more layers to our model — would still implement nothing more than a linear operation.
Putting that another way, adding more layers would not extend the hypothesis space at all. But for machine learning you often want a large hypothesis space because it provides more “room” for learning by providing more “space” for representations and transformations of data.
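You can see this collapse for yourself with a small NumPy sketch. The matrices below are made up purely for illustration, but the point holds for any weights: two linear layers applied in sequence are equivalent to a single linear layer.

import numpy as np

rng = np.random.default_rng(0)
x = rng.random(4)                           # a pretend input
w1, b1 = rng.random((3, 4)), rng.random(3)  # "layer 1" weights and bias
w2, b2 = rng.random((2, 3)), rng.random(2)  # "layer 2" weights and bias

two_layers = w2 @ (w1 @ x + b1) + b2

# The same computation folded into one set of weights and one bias.
one_layer = (w2 @ w1) @ x + (w2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))  # True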
A lot of this gets into the design of neural networks by considering the dimensionality of the problems being dealt with. Digging too far into that would take me out of our groove here so, for now, know that in order to get access to a much larger hypothesis space that would benefit from deeper representations, you need something that breaks the linearity. You need a non-linearity and that’s what an activation function provides.
So now consider that we define our layer like this:
model.add(Dense(784, activation='relu'))
Now the equation for getting the output becomes this:
output = relu(dot(w, input) + b)
This layer is a function that takes as input a two-dimensional tensor and returns another two-dimensional tensor. That returned two-dimensional tensor is a new representation of the input tensor. Schematically here’s how that breaks down:
output = relu(dot(matrix, matrix) + vector)
output = relu(matrix + vector)
There are three tensor operations here:
- A dot product between the input tensor and a tensor named w.
- An addition between the resulting 2D tensor and a vector b.
- A relu operation.
The relu operation and the addition are element-wise operations, meaning they are applied independently to each entry in the tensors being considered. The dot operation, by contrast, combines entries from the input tensors, which I showed you how to do in the third post.
Specifically, if you went through that post, you saw that the dot product between two vectors is a scalar, and only vectors with the same number of elements are compatible for a dot product. The dot product between two matrices produces another matrix whose coefficients are the dot products between the rows of x and the columns of y.
See how that math creeps up on you? It’s been a while since that third post, so let’s play out that operation I just showed you. The relevance here is I’m showing you exactly what our two model statements are doing behind the scenes.
Coding Our Mathematics
To play along, create a script called mnist_math.py and add the following to it to get Numpy:
import numpy as np
What I want to do here is take you through some of the calculations that our model will be doing as it applies the relu activation function. I’m doing this in the interests of reducing the opaqueness of these models. So, first, let’s declare two vectors:
vector1 = np.array([5, -2])
vector2 = np.array([-3, 0])
What we have there are two vectors, each of shape (2,). Let’s create our own dot product implementation:
def vector_dot(x, y):
    z = 0.0

    for i in range(x.shape[0]):
        z += x[i] * y[i]

    return z
Now let’s see how that works with our vectors:
print(np.dot(vector1, vector2))      # -15
print(vector_dot(vector1, vector2))  # -15.0
Here this just proves to you that our vector_dot implementation matches exactly what Numpy’s dot calculation does. Now let’s add some matrices:
matrix1 = np.array([[2, 3], [3, 5]])
matrix2 = np.array([[1, 2], [5, -1]])
Each matrix there is of shape (2, 2).
So now we have some data for our calculation. Keep in mind what I showed you above:
output = relu(dot(w, input) + b)
output = relu(dot(matrix, matrix) + vector)
output = relu(matrix + vector)
I’ll take you through that step by step.
Get the Dot Product
Let’s handle the first part of that: a dot product between the input tensor and a tensor named w. To do that, let’s add a custom matrix dot product function:
def matrix_dot(x, y):
    z = np.zeros((x.shape[0], y.shape[1]))

    for i in range(x.shape[0]):
        for j in range(y.shape[1]):
            row = x[i, :]
            column = y[:, j]
            z[i, j] = vector_dot(row, column)

    return z
Notice here that I reuse the vector_dot function, which is also why I wanted to create that function. As with the vectors, let’s confirm that our custom dot product works according to how Numpy would do it:
print(np.dot(matrix1, matrix2))
print(matrix_dot(matrix1, matrix2))
You should get this output:
[[17  1]
 [28  1]]
[[17.  1.]
 [28.  1.]]
Again, that’s just confirming that our matrix_dot function is doing the same thing as the Numpy dot product operation.
Add Matrix and Vector
So from that relu equation, we just did the dot product between the matrices. Now we have to add our resulting matrix and our vector. Let’s add a custom function for that:
def add_matrix_and_vector(x, y):
    x = x.copy()

    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[j]

    return x
As before we can check if things make sense:
print(matrix1 + vector1)
print(add_matrix_and_vector(matrix1, vector1))
The output you get:
[[7 1]
 [8 3]]
[[7 1]
 [8 3]]
As you can see, we don’t really need a custom function for this operation, since Numpy’s broadcasting handles adding a vector to each row of a matrix with a plain +. That said, you can at least see that our implementation matches what actually happens.
Get the Relu
Finally, let’s add our own relu function:
def custom_relu(x):
    x = x.copy()

    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0)

    return x
So now we just have to add our matrix and vector like we did above and pass that value to our custom relu function:
value = add_matrix_and_vector(matrix1, vector1)
result = custom_relu(value)
print(result)
And what you get back from that is the resulting matrix we already calculated, unchanged, since every element was already positive.
And that’s it! That’s quite a bit of material to throw at you there, particularly since it was a digression. Even given that, I hope you see that I just showed you exactly what kind of calculation is going on behind the scenes with those model statements we have in our script.
Introducing Non-Linearities
This activation function still might seem a little odd to you. Let me give you a visual:
The neural network uses linear operations like the ones I’ve been showing you. On their own, however, those can only produce an output that is a linear combination of the inputs.
But you can use activation functions to add non-linearities to the model and this allows you to model arbitrary functions. Put another way, non-linearity does just what it sounds like: it breaks linearity and allows for the representation of more complicated relationships in your data.
Don’t worry if all of the math went over your head. Your understanding of every detail truly doesn’t matter for what we’re doing. What does matter is why I’m using an activation function like this. Specifically, what this is doing is adding non-linearities into the neural network. And, again, why do we want to add non-linearities? Mainly because this allows us to better fit our data. This is what elevates our model beyond the capabilities of a simple perceptron.
Adding Another Layer
Back to our mnist.py file, let’s add one more layer and a corresponding activation function as such:
model.add(Dense(total_classes))
model.add(Activation('softmax'))
Again, that could be simplified a bit if you prefer:
model.add(Dense(total_classes, activation='softmax'))
Here this layer consists of connections for our 10 classes, which shows why I had you create the total_classes variable earlier. For this layer, a softmax activation function is used. Just like relu, softmax is yet another operation or function that is being carried out.
Remember that I said the layers are made of neurons. Each neuron in the layer is carrying out some task — some function — on the inputs that it receives.
What our new layer does with its softmax function is turn the outputs for this layer into values that can be associated with probabilities. Specifically, the outputs will be between 0 and 1.
The output of the softmax function is equivalent to what’s called a categorical probability distribution. For example, consider rolling a ten-sided die, where there are ten possible outcomes: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. That’s a categorical distribution. Let’s say you rolled the die once, equivalent to passing an image through our network once. This would be a single trial or training episode and the categorical distribution would be equivalent to what’s called a multinomial distribution.
A multinomial distribution is used to find probabilities in experiments where there are more than two outcomes. In the case of our die, whose faces correspond to the possible digits in our images, each outcome has a probability of 1/10. So what softmax is ultimately doing is telling you the probability that each of the classes, a digit from 0 to 9, is true given the data it was looking at.
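Here’s a minimal sketch of what a softmax computes, nothing more than exponentiating and normalizing so the outputs sum to 1 (the scores below are made up, and this is not how Keras implements it internally):

import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))  # subtracting the max keeps things numerically stable
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])
probabilities = softmax(scores)

print(probabilities)        # roughly [0.659 0.242 0.099]
print(probabilities.sum())  # 1.0 (or extremely close to it)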
Layers Interacting
Let’s consider our layers and here I’ll simplify their expression a bit just to make a point:
model.add(Dense(784, input_shape=(784,)))
model.add(Dense(10))
Here we’re creating a layer that will only accept as input two-dimensional tensors whose last axis has 784 elements. (The first axis is the batch dimension and is left unspecified.) This layer will return a tensor whose last axis also has 784 elements. What that means is that this layer can only be connected to a downstream layer that expects 784-dimensional vectors as its input.
Okay, but our second layer doesn’t seem to be accepting a 784-dimensional vector, does it? This goes back to what I mentioned earlier: in Keras, the layers you add to your models are dynamically built to match the shape of the incoming layer. So consider that second layer above, which doesn’t receive an input shape argument. Instead, it automatically inferred its input shape as being the output shape of the layer that came before.
Okay, what about something like this, however:
model.add(Dense(512, input_shape=(784,)))
Here what this is doing is passing in an input shape of 784 — to reflect the input size — but then that dimension is transformed to be 512. Remember earlier that I said the dimensionality and the input shape do not have to be the same? Well, here’s an example of that. I’ll get into that variation in the next post but I just wanted to show you that this is possible.
You can get a rough schematic of what I’m showing you by adding this to your script:
for layer in model.layers:
    print(layer.name, layer.input_shape, '==>', layer.output_shape)
That will get this:
dense_1 (None, 784) ==> (None, 784)
activation_1 (None, 784) ==> (None, 784)
dense_2 (None, 784) ==> (None, 10)
activation_2 (None, 10) ==> (None, 10)
You can even generate an image for your model. Add the following:
plot_model(
    model,
    to_file='model_plot.png',
    show_shapes=True,
    show_layer_names=True
)
This will require adding the following import:
from keras.utils.vis_utils import plot_model
This will generate a diagram of the model in the file model_plot.png.
But we can do a little better than that by summarizing our model.
Summarize Our Model
To get a summary of the model, add this final line to your script:
print(model.summary())
That should get you something like this:
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 784)               615440
_________________________________________________________________
activation_1 (Activation)    (None, 784)               0
_________________________________________________________________
dense_2 (Dense)              (None, 10)                7850
_________________________________________________________________
activation_2 (Activation)    (None, 10)                0
=================================================================
Total params: 623,290
Trainable params: 623,290
Non-trainable params: 0
This can get a little confusing. Let’s first consider parameters.
Parameters
A parameter in a model refers to a variable that is internal to the model and whose value can be estimated from the data the model is dealing with. The parameters of a neural network are typically the weights and the biases. These parameters are learned during the training stage: the training data determines what values they should take and the learning algorithm tunes them.
There’s also a concept of a hyperparameter. A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated or learned from the input data. Hyperparameters influence how your parameters will be learned. Probably one of the more common examples of a hyperparameter is the learning rate.
From the above summary you can see that there is a breakdown of trainable versus non-trainable parameters. In Keras, non-trainable parameters refers to the number of weights that you have chosen to keep constant during training. If weights are constant, Keras won’t update them during training. And weights that aren’t updated are weights that aren’t learning.
Weights in the Model
I’ve talked about weights throughout this series. Let’s revisit briefly. Weights are the values inside the network that perform the operations and can be adjusted to result in whatever it is you want the network to do. Weights can be thought of as connection strengths between neurons. A higher connection strength — a higher weight value — means the neuron is more likely to activate or fire.
Keep in mind that the output of one neuron is input to another so one way to think of “weight” is how much influence the input has on the output. Weights near zero would mean that changing the input will not change the output, thus no influence at all.
When you create layers, internally Keras will create its own weights and all of those weights will, by default, be trainable. In other words, Keras will make sure that all of your weights can be modified so that learning can occur. This is something that could, however, be configured as part of how the model operates.
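For example, if you ever wanted to freeze a layer, Keras lets you mark it as not trainable. A minimal sketch (not something our model needs):

# Freeze the first layer; its weights would then show up in the
# summary as non-trainable parameters and stay fixed during training.
model.layers[0].trainable = False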
Summary Numbers
If you’re curious how those numbers in the summary are being calculated, essentially it’s just this:
Dense 1: 784 * 784 + 784 = 615,440
Dense 2: 784 * 10 + 10 = 7,850

Total params: Dense 1 + Dense 2 = 615,440 + 7,850 = 623,290
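If you’d rather have the model do that arithmetic for you, each Keras layer exposes a count_params() method; a quick sketch you can delete after running it:

# Each Dense layer holds (inputs * neurons) weights plus one bias per neuron;
# the Activation layers hold no parameters at all.
for layer in model.layers:
    print(layer.name, layer.count_params())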
Getting into all of the details of this would take us way far afield. Hopefully the above at least removes enough mysteries into what you are seeing and shows you some consistency with the details being presented.
What We Accomplished
This was a lot of material! If you stuck with me this far, kudos to you. I know a lot of this can get tedious, especially when people like me use repetition to keep driving home some of the concepts.
You probably feel that you are getting varying levels of explanation, particularly around some of the math or what’s actually happening in these layers. That, in fact, is by design. I want to show you one of the dangers of artificial intelligence and machine learning, which is that it can seem very much like a black box.
Now consider that entire industries are starting to be run by these black boxes! We are building an industry that uses algorithms to make decisions about information existing in an abstract, high-dimensional hypothesis space. The degree to which we, collectively, don’t understand what these systems are doing is the degree to which we are in danger of abdicating our responsibility, accountability, and choices to a technocracy.
Beyond that cautionary bit, you defined a neural network, as you did in the fourth post. But this time you got some more basis as to why the neural network was being coded up the way it was. What we have to do now is actually get that network to run based on the input we send it. We’ll do that in the next post, which will be the final post in this series!