    import numpy as np
    import matplotlib.pyplot as plt

    from keras.datasets import mnist

    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()
Details Of Our Data
The main thing we’ve done in the script above is load the data and store it in some variables. All of these data sets are encoded in Numpy arrays, called ndarrays, which just means n-dimensional arrays. And as you know from the third post, an n-dimensional array is just a tensor. Let’s get some basic information about our data. Here I’ll just use the data stored in train_images for analysis purposes. The test_images data would give similar results.
    print(train_images.ndim)   # 3
    print(train_images.shape)  # (60000, 28, 28)
    print(train_images.dtype)  # uint8
- The ndim is the number of dimensions. You can also think of those dimensions as the number of axes. For instance, a three-dimensional tensor has three axes, a two-dimensional tensor (matrix) has two axes, and a one-dimensional tensor (vector) has one axis.
- While the ndim tells you the total number of dimensions of the tensor, the shape tells you how many elements the tensor has along each axis. The shape of a three-dimensional tensor would look like this: (60000, 28, 28). A two-dimensional tensor (matrix) would look like this: (28, 28). A one-dimensional tensor (vector) would look like this: (784,).
- The dtype, or data type, is just what it sounds like: it indicates the type of data that is contained in the tensor. A tensor’s type could be something like float32, uint8, float64, and so on.
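To make those three attributes concrete without downloading anything, here is a small sketch using made-up arrays rather than the MNIST data. The names vector, matrix and batch are my own illustrative choices, not part of the script we are building.

```python
import numpy as np

# Hypothetical arrays standing in for real data.
vector = np.zeros(784, dtype=np.float32)        # one axis
matrix = np.zeros((28, 28), dtype=np.uint8)     # two axes
batch = np.zeros((5, 28, 28), dtype=np.uint8)   # three axes

# Print the number of axes, the size along each axis, and the data type.
for name, tensor in [("vector", vector), ("matrix", matrix), ("batch", batch)]:
    print(name, tensor.ndim, tensor.shape, tensor.dtype)
```

Notice that ndim is always just the length of the shape tuple.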
    total_pixels = train_images.shape[1] * train_images.shape[2]
Here total_pixels will be 784. This is an important variable for us to hold onto, and you’ll see how it’s used later. For now, just understand that a 28 × 28 matrix has 784 elements. Obviously I could have just hard-coded the 784 in there, but it’s good to use the representations of your data to get the values you want.
Visualizing Our Data
So we have 60,000 matrices. Each matrix is a representation of a grayscale image, with coefficients between 0 and 255. That just means that each element of the array is a number between 0 and 255. Keep in mind that in a grayscale image each pixel is black, white or some shade of gray. So each element is a value describing the intensity of the pixel. In this context, values closer to 0 mean background while those closer to 255 mean foreground. First, let’s get a look at some representative images:
    for i in range(9):
        plt.subplot(3, 3, i+1)
        plt.tight_layout()
        plt.imshow(train_images[i], cmap='gray', interpolation='none')
        plt.title("Digit: {}".format(train_labels[i]))
        plt.xticks([])
        plt.yticks([])

    plt.show()
    # Show the first digit together with a histogram of its pixel values
    plt.subplot(2, 1, 1)
    plt.imshow(train_images[0], cmap='gray', interpolation='none')
    plt.title("Digit: {}".format(train_labels[0]))
    plt.xticks([])
    plt.yticks([])

    plt.subplot(2, 1, 2)
    plt.hist(train_images[0].reshape(784))
    plt.title("Pixel Value Distribution")
    plt.show()
    # The script so far, without the exploratory print and plot statements
    import numpy as np
    import matplotlib.pyplot as plt

    from keras.datasets import mnist

    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()

    total_pixels = train_images.shape[1] * train_images.shape[2]
Optimizing Our Input Data
The training data set is structured as a three-dimensional array of the following sort:

(number of samples × image width × image height)

That is what is represented by the (60000, 28, 28) that we got earlier. This is the kind of data that we will be passing into our neural network.
We can optimize this data a bit, however. Specifically, we can preprocess our data by reshaping it and scaling it. It generally makes sense to perform at least some scaling of input values, particularly when you’re going to feed those inputs to a neural network model.
The reason this is a good idea is that small input values on a common scale are generally easier for a network to learn from than raw values ranging from 0 to 255. Think of this as streamlining your evidence so that better decisions can be made from it. Beyond that, if you played along with me in the last post, you saw how failing to shape and scale the data caused errors when you tried to use certain data with the neural network we wrote.
To get started on our data optimization, we can reduce the images down into a vector of pixels. In this case the 28 × 28 sized images — a matrix — will be flattened into a 784 element vector holding pixel input values.
Add these statements to your script:
    train_images = train_images.reshape((60000, total_pixels))
    test_images = test_images.reshape((10000, total_pixels))
- train_images.ndim = 2
- train_images.shape = (60000, 784)
- test_images.shape = (10000, 784)
    # Convert the pixel values from uint8 to float32 so they can be scaled
    train_images = train_images.astype('float32')
    test_images = test_images.astype('float32')
    # Scale the pixel values from the 0-255 range down to the 0-1 range
    train_images /= 255
    test_images /= 255
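If you want to see the reshaping and scaling steps in isolation, here is a minimal sketch that runs on a handful of randomly generated "images" instead of the real data set. The names fake_images, flat and scaled are my own assumptions for illustration only.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 4 random "images" of 28 x 28 uint8 pixels.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

pixels_per_image = fake_images.shape[1] * fake_images.shape[2]

# Flatten each 28 x 28 matrix into a 784-element vector.
flat = fake_images.reshape((fake_images.shape[0], pixels_per_image))

# Convert to float32 first, then scale from the 0-255 range to the 0-1 range.
scaled = flat.astype("float32") / 255

print(flat.shape)     # (4, 784)
print(scaled.dtype)   # float32
print(scaled.min() >= 0.0, scaled.max() <= 1.0)
```

The order matters in the real script: an in-place division such as `train_images /= 255` on a uint8 array raises a casting error in Numpy, which is why the conversion to float32 comes first.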
Optimizing Our Output Data
That takes care of our input. What about our outputs? There is a bit of work we can do there as well. Consider that we have a more constrained set of values for our outputs, which you can see by executing this statement:
    print(np.unique(train_labels, return_counts=True))
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8), ...)

That’s a nice set of easy outputs to deal with: essentially just the numbers 0 through 9. So what can we do to make that even easier on us? We can perform an action that is described as “categorically encoding the labels.” Go ahead and add these lines:
    train_labels = np_utils.to_categorical(train_labels)
    test_labels = np_utils.to_categorical(test_labels)
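    # Import needed for the to_categorical calls. In newer Keras releases
    # np_utils may be unavailable; keras.utils.to_categorical is the
    # direct equivalent there.
    from keras.utils import np_utils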
    # Before the categorical encoding, each label is a single digit
    print(test_labels[1])  # 2
    print(test_labels[2])  # 1
    print(test_labels[3])  # 0
    # After the categorical encoding, each label is a ten-element vector
    print(test_labels[1])  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
    print(test_labels[2])  # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
    print(test_labels[3])  # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    total_classes = test_labels.shape[1]
The total_classes variable will be set to 10. Again, this value could have simply been hard-coded since it was well known ahead of time. Here I’m just reinforcing that you should reference the shapes of your data to get the information you want.
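The categorical encoding described above is often called one-hot encoding. To show the idea without Keras, here is a minimal Numpy sketch. The one_hot helper is hypothetical, written only for illustration, and is not how Keras necessarily implements to_categorical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: turn integer labels into one-hot rows."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    # Put a 1 in the column that matches each label.
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

labels = np.array([2, 1, 0])
encoded = one_hot(labels, num_classes=10)

print(encoded[0])                  # the label 2 becomes a 1 in position 2
print(encoded.shape[1])            # 10 classes
print(np.argmax(encoded, axis=1))  # argmax recovers the original labels
```

Each row contains exactly one 1, so taking the argmax of a row gets you back to the original digit.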
So What Have We Done?
Hopefully you see that some of what we just did here matches up with what we did with the data set that we downloaded back in the second post in this series. In both cases, we were getting our data ready to be sent into a neural network. This is part of why I took you through two different ways of getting the data: so that you could see that the processes are the same.

In the prior posts I’ve said a couple of times that our goal is to build a model that will learn from the training data. With the additional knowledge you’ve gathered in this post, you can now understand that this model will be learning the relationship between a 784-dimensional vector (input) and a 10-dimensional one-hot vector (output).

The Neural Network Life Cycle
Let’s remind ourselves of the life cycle that we talked about in previous posts and that we’ll follow.
- Gather and Prepare Data
- Define the Model
- Compile the Model
- Fit the Model
- Evaluate the Model
- Generate Predictions from the Model
Reproducible Results
Before we get started defining our neural network model, add this line to your script:
    np.random.seed(1337)
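The effect of seeding is easy to see in isolation. Here is a minimal sketch using Numpy alone. Whether this one seed controls every source of randomness in Keras and its backend is a separate question that the sketch does not address.

```python
import numpy as np

np.random.seed(1337)
first = np.random.rand(3)

# Re-seeding with the same value restarts the same pseudo-random sequence.
np.random.seed(1337)
second = np.random.rand(3)

print(np.array_equal(first, second))   # True
```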
Layers of a Model
Let’s consider a quick view of a neural network model:

The idea is fairly simple: a neural network is made of layers. Let’s consider those layers. First, you have a set of inputs. While this is often called an “input layer”, it’s not actually part of the model. The inputs are what are fed into the model. Then you have an output layer. This, too, isn’t strictly part of the model; instead, this is what the model will ultimately produce. The model will produce those outputs by the operations of layers between the input and output. Those “layers between” are the so-called “hidden layers” in the schematic above.

I’ll get into more specifics about this, but for right now just keep this very basic visualization in mind. You know we have our input data. Now we’re going to define those layers that we pass the data into, taking us into the second of our workflow tasks.

Defining the Model
In our case, the data will flow through the network one layer after another, so we’ll create a sequential neural network model. The term “sequential” in this context simply means a linear stack of layers (it says nothing about the input itself being a sequence), regardless of whether you visualize them stacked top to bottom or left to right. To get started, add this line:
    model = Sequential()
    # Import needed for Sequential; place it with the other imports
    from keras.models import Sequential
    model.add(Dense(total_pixels, input_shape=(total_pixels,)))
    # Import needed for Dense; place it with the other imports
    from keras.layers import Dense
This is why I set up the total_pixels variable earlier, so I could make this connection a little more explicit.
What I want to call out is that the first argument to Dense is the dimensionality of the output space, meaning the number of neurons (units) in the layer. The second argument, the input_shape, describes the shape of each input sample that the layer should expect. When adding the first layer in the Sequential model you need to specify the input shape so Keras can create the appropriate matrices. For any remaining layers you end up creating, the shape would be inferred automatically.
Here the output dimensionality and the input shape happen to be the same size, but that doesn’t have to be the case. I’ll come back to this point, but for now let’s ask this: what is this “Dense” thing that we’re working with?
The idea of this layer is that each neuron in it receives input from all the neurons in the previous layer, thus the layer is said to be densely connected. A dense layer deals with two-dimensional tensors and outputs a two-dimensional tensor. That output is a new representation of the input tensor. The main thing to note is that our fully connected network would now look something like this:
For now, let’s keep adding to our model. Add the following:
    model.add(Activation('relu'))
    # Extend the earlier import so that Activation is available too
    from keras.layers import Dense, Activation
    # Equivalent shorthand: pass the activation directly to the Dense layer
    model.add(Dense(
        total_pixels,
        activation='relu',
        input_shape=(total_pixels,)
    ))
Artificial Neurons
A neural network contains a model of an artificial neuron. One of the easiest such models to think about is the perceptron. A perceptron is basically just a tiny “machine” that has one or more inputs, does some sort of processing, and produces a single output.

What I just described is a single-layer perceptron. A perceptron takes several binary inputs (in the above case, x1, x2, and x3) and produces a single binary output. So think of that little “machine” as one that makes decisions by weighing up evidence, which are the inputs, and drawing boundaries that allow it to classify or predict outputs.

In terms of those layers we’ve been talking about and constructing, the bare minimum you need is an input layer and an output layer. But those alone aren’t going to do you much good because all that would do is pass your inputs directly to your outputs. Or, rather, all you would have is a purely linear combination. To avoid having just a linear combination, you need layers in between the input and the output.

When these perceptrons are spread out over more than one layer, you get a multilayer perceptron. As a note of accuracy, when you have multiple layers like this, the “perceptron” is actually something called a sigmoid neuron. But we don’t have to get into that right now. For now, simply notice that the visual I just showed you looks very much like the schematic I showed you earlier.

You can think of each layer as a part of the decision-making process. Each subsequent layer in the neural network will be using the results of the decision-making process from the layer immediately preceding it.

But what does any of this have to do with that activation function? Here is a fairly uncomplicated view of what a neuron means in this context:

Perhaps I can somewhat combine a few of the visuals here to make the structure clear:

So what do these artificial neurons do? They essentially calculate what’s called a weighted sum of their inputs. They then add what’s called a “bias factor.” Without an activation function, this is basically what a neuron “looks like” mathematically:

output = dot(w, input) + b

But we can add a specific function to be called that determines if the neuron is “fired” (activated) or not. The function that does this is, you guessed it, the activation function. In effect, what this function is doing is making a determination as to whether the information (evidence) that the neuron is receiving is relevant for making a decision or should be ignored. Practically speaking, that means this function is determining the output of the layer it is part of.

The Mathematics of Our Layer
All of that was conceptually what the rectified linear function is doing. Now let’s put it into an actual context. The third post in this series, regarding the math exploration, hopefully will have provided a bit of comfort for this part. A simple way to think about our rectified linear activation function is to consider what the actual relu function does. It’s quite simple:

relu(x) = max(0, x)

Here that function returns x when x is positive; otherwise it returns 0. Put in the context of our neural network, that function looks like this:

output = relu(dot(w, input) + b)

Here the w refers to a two-dimensional tensor representing the weights that I mentioned in the previous post, and b refers to the bias factor that I also mentioned in the previous post. The input corresponds to whatever input is being dealt with at the moment, and the “dot” refers to a dot product. So here the dot product between the input tensor and the weight tensor is being calculated and a bias factor is being added to that. Hopefully all that sounds very familiar from what I discussed in the third post.

After all that, as per the definition of the relu function, if the final value is positive, that’s the value that will be returned; otherwise zero will be returned. Think of any positive number as “the neuron activates (fires)” while zero would mean “the neuron does not activate.” This is where I can now make the general math I showed you in the third post very specific to exactly what we’re doing here.
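To see that formula at work on something small enough to check by hand, here is a sketch of a layer with two neurons that each receive three inputs. All of the numbers are made up for illustration, and relu here is my own Numpy version of the function, not the Keras one.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x).
    return np.maximum(x, 0)

# Made-up values: one row of weights per neuron, one bias per neuron.
w = np.array([[0.5, -1.0, 2.0],
              [1.0, 1.0, -3.0]])
b = np.array([0.5, -1.0])
inputs = np.array([1.0, 2.0, 1.0])

# First neuron:  0.5*1 - 1.0*2 + 2.0*1 + 0.5 =  1.0
# Second neuron: 1.0*1 + 1.0*2 - 3.0*1 - 1.0 = -1.0
weighted_sum = np.dot(w, inputs) + b
output = relu(weighted_sum)

print(weighted_sum)   # first neuron 1.0, second neuron -1.0
print(output)         # first neuron 1.0 (fires), second neuron 0.0 (does not)
```

The first neuron’s weighted sum is positive, so it passes through unchanged. The second neuron’s sum is negative, so relu replaces it with zero.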
The Mathematics of a Model
So let’s say we define our layer like this:
    model.add(Dense(784))
output = dot(w, input) + b

This is simply two linear operations: a dot product and an addition. This would mean that the layer could only learn linear transformations of the input data. This would mean that the hypothesis space of the layer would be the set of all possible linear transformations of the input data into a 784-dimensional space.

Such a hypothesis space, while seeming quite large perhaps, is actually way too restricted for learning purposes given the dimensionality of our input data. This kind of space wouldn’t benefit from having multiple layers of representations. The reason is that even a deep stack of linear layers (i.e., adding more layers to our model) would still implement nothing more than a linear operation. Putting that another way, adding more layers would not extend the hypothesis space at all.

But for machine learning you often want a large hypothesis space because it provides more “room” for learning by providing more “space” for representations and transformations of data. A lot of this gets into the design of neural networks by considering the dimensionality of the problems being dealt with. Digging too far into that would take me out of our groove here so, for now, know that in order to get access to a much larger hypothesis space that would benefit from deeper representations, you need something that breaks the linearity. You need a non-linearity and that’s what an activation function provides.

So now consider that we define our layer like this:
    model.add(Dense(784, activation='relu'))
output = relu(dot(w, input) + b)

This layer is a function that takes as input a two-dimensional tensor and returns another two-dimensional tensor. That returned two-dimensional tensor is a new representation of the input tensor. Schematically here’s how that breaks down:

output = relu(dot(matrix, matrix) + vector)
output = relu(matrix + vector)

There are three tensor operations here:
- A dot product between the input tensor and a tensor named w.
- An addition between the resulting 2D tensor and a vector b.
- A relu operation.
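The claim that a stack of purely linear layers collapses into a single linear layer, and that the relu step is what prevents this, can be checked numerically. Here is a minimal sketch with made-up values. The rows of x are samples, so the dot product is written with the input on the left, and w_combined and b_combined are names I introduced for this illustration.

```python
import numpy as np

# Made-up values: 2 samples with 2 features, a 2-unit layer, then a 1-unit layer.
x = np.array([[1.0, -2.0], [3.0, 0.5]])
w1 = np.array([[1.0, -1.0], [0.5, 2.0]])
b1 = np.array([0.0, 1.0])
w2 = np.array([[2.0], [1.0]])
b2 = np.array([-1.0])

# Two stacked layers with no activation between them...
stacked = (x @ w1 + b1) @ w2 + b2

# ...give the same result as one linear layer with combined weights and bias.
w_combined = w1 @ w2
b_combined = b1 @ w2 + b2
single = x @ w_combined + b_combined
print(np.allclose(stacked, single))     # True

# With relu between the layers, the collapse no longer holds.
with_relu = np.maximum(x @ w1 + b1, 0) @ w2 + b2
print(np.allclose(with_relu, single))   # False
```

For these values the purely linear stack gives -5.0 and 4.5, while the version with relu gives -1.0 and 5.5, so no single combined linear layer reproduces it.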
Coding Our Mathematics
To play along, create a script called mnist_math.py and add the following to it to get Numpy:
    import numpy as np
    # Two small vectors to experiment with
    vector1 = np.array([5, -2])
    vector2 = np.array([-3, 0])

    # A hand-written dot product for two vectors
    def vector_dot(x, y):
        z = 0.0

        for i in range(x.shape[0]):
            z += x[i] * y[i]

        return z

    # Compare the Numpy dot product with our own
    print(np.dot(vector1, vector2))      # -15
    print(vector_dot(vector1, vector2))  # -15.0

    # Two small matrices to experiment with
    matrix1 = np.array([[2, 3], [3, 5]])
    matrix2 = np.array([[1, 2], [5, -1]])
output = relu(dot(w, input) + b)
output = relu(dot(matrix, matrix) + vector)
output = relu(matrix + vector)

I’ll take you through that step by step.
Get the Dot Product
Let’s handle the first part of that: a dot product between the input tensor and a tensor named w. To do that, let’s add a custom matrix dot product function:
    def matrix_dot(x, y):
        z = np.zeros((x.shape[0], y.shape[1]))

        for i in range(x.shape[0]):
            for j in range(y.shape[1]):
                row = x[i, :]
                column = y[:, j]
                z[i, j] = vector_dot(row, column)

        return z
    print(np.dot(matrix1, matrix2))
    print(matrix_dot(matrix1, matrix2))
[[17  1]
 [28  1]]
[[17.  1.]
 [28.  1.]]

Again, that’s just confirming that our matrix_dot function is doing the same thing as the Numpy dot product operation.
Add Matrix and Vector
So from that relu equation, we just did the dot product between the matrices. Now we have to add our resulting matrix and our vector. Let’s add a custom function for that:
    def add_matrix_and_vector(x, y):
        x = x.copy()

        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                x[i, j] += y[j]

        return x
    print(matrix1 + vector1)
    print(add_matrix_and_vector(matrix1, vector1))
[[7 1]
 [8 3]]

As you can see, we don’t really need Numpy for this operation. That said, you can at least see that our implementation matches what actually happens.
Get the Relu
Finally, let’s add our own relu function:
    def custom_relu(x):
        x = x.copy()

        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                x[i, j] = max(x[i, j], 0)

        return x
    value = add_matrix_and_vector(matrix1, vector1)
    result = custom_relu(value)
    print(result)
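Because matrix1 and vector1 only produce positive sums, the relu step in that last snippet leaves the values unchanged. As a sketch, here is the whole dot, add, relu chain in vectorized Numpy, with a bias chosen so that relu actually has something to clip. The bias values are hypothetical, picked only for this illustration.

```python
import numpy as np

matrix1 = np.array([[2, 3], [3, 5]])
matrix2 = np.array([[1, 2], [5, -1]])
bias = np.array([-20, 4])   # hypothetical bias, not from the script above

product = np.dot(matrix1, matrix2)   # [[17, 1], [28, 1]]
shifted = product + bias             # [[-3, 5], [8, 5]]
activated = np.maximum(shifted, 0)   # [[0, 5], [8, 5]]

print(activated)
```

The only negative entry, the -3 in the top-left corner, becomes 0. Everything else passes through untouched.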
Introducing Non-Linearities
This activation function still might seem a little odd to you. Let me give you a visual:

The neural network uses linear models like I’ve been showing you. That, however, can only lead to an output that is a linear combination. But you can use activation functions to add non-linearities to the model and this allows you to model arbitrary functions. Put another way, non-linearity does just what it sounds like: it breaks linearity and allows for the representation of more complicated relationships in your data.

Don’t worry if all of the math went over your head. Your understanding of every detail truly doesn’t matter for what we’re doing. What does matter is why I’m using an activation function like this. Specifically, what this is doing is adding non-linearities into the neural network. And, again, why do we want to add non-linearities? Mainly because this allows us to get a better fit to our data. This is what elevates our model beyond the capabilities of a simple perceptron.

Adding Another Layer
Back to our mnist.py file, let’s add one more layer and a corresponding activation function as such:
    model.add(Dense(total_classes))
    model.add(Activation('softmax'))
    # Equivalent shorthand: pass the activation directly to the Dense layer
    model.add(Dense(total_classes, activation='softmax'))
This is why I set up the total_classes variable earlier. For this layer, a softmax activation function is used. Just like relu, softmax is yet another operation or function that is being carried out.
Remember that I said the layers are made of neurons. Each neuron in the layer is carrying out some task (some function) on the inputs that it receives.
What our new layer does with its softmax function is turn the outputs for this layer into values that can be treated as probabilities. Specifically, each output will be between 0 and 1, and the outputs will sum to 1.
The output of the softmax function is equivalent to what’s called a categorical probability distribution. For example, consider rolling a ten-sided die, where there are ten possible outcomes: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. That’s a categorical distribution. Let’s say you rolled the die once, equivalent to passing an image through our network once. This would be a single trial or training episode, and the categorical distribution would be equal to what’s called a multinomial distribution.
A multinomial distribution is used to find probabilities in experiments where there are more than two outcomes. In the case of our die (whose faces correspond to the possible digits in one of our images), each outcome has a probability of 1/10 if the die is fair. So what softmax is ultimately doing is telling you the probability that each of the classes, a digit from 0 to 9, is the correct one given the data it was looking at.
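To make the softmax idea concrete, here is a minimal Numpy sketch. The softmax function below is my own illustrative version, and the scores are made-up raw outputs for the ten digit classes, not values produced by our model.

```python
import numpy as np

def softmax(scores):
    """Sketch of softmax for a 1-D array of raw scores."""
    # Subtracting the max does not change the result but avoids overflow.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Made-up raw outputs, one per digit class.
scores = np.array([1.0, 2.0, 0.5, 0.1, 3.0, 0.2, 0.0, 1.5, 0.3, 0.7])
probabilities = softmax(scores)

print(probabilities.sum())        # 1.0, up to floating-point rounding
print(np.argmax(probabilities))   # 4, the class with the largest score
```

If every score were identical, softmax would return 1/10 for each class, which is the fair die from the example above.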
Layers Interacting
Let’s consider our layers and here I’ll simplify their expression a bit just to make a point:
    model.add(Dense(784, input_shape=(784,)))
    model.add(Dense(10))
    # The number of units does not have to match the input shape
    model.add(Dense(512, input_shape=(784,)))
    for layer in model.layers:
        print(layer.name, layer.input_shape, '==>', layer.output_shape)
dense_1 (None, 784) ==> (None, 784)
activation_1 (None, 784) ==> (None, 784)
dense_2 (None, 784) ==> (None, 10)
activation_2 (None, 10) ==> (None, 10)

You can even generate an image for your model. Add the following:
    plot_model(
        model, to_file='model_plot.png',
        show_shapes=True, show_layer_names=True)
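    # Import needed for plot_model. In newer Keras releases the function
    # is importable directly from keras.utils instead.
    from keras.utils.vis_utils import plot_model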
Summarize Our Model
To get a summary of the model, add this final line to your script:
    # summary() prints the table itself, so no print() call is needed
    model.summary()
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 784)               615440
_________________________________________________________________
activation_1 (Activation)    (None, 784)               0
_________________________________________________________________
dense_2 (Dense)              (None, 10)                7850
_________________________________________________________________
activation_2 (Activation)    (None, 10)                0
=================================================================
Total params: 623,290
Trainable params: 623,290
Non-trainable params: 0

This can get a little confusing. Let’s first consider parameters.
Parameters
A parameter in a model refers to a variable that is internal to the model and whose value can be estimated from the data that the model is dealing with. The parameters of a neural network are typically the weights and the biases. These parameters are learned during the training stage. So, the input data provides the configuration for the parameters and the algorithm tunes them.

There’s also the concept of a hyperparameter. A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated or learned from the input data. Hyperparameters influence how your parameters will be learned. Probably one of the more common examples of a hyperparameter is the learning rate.

From the above summary you can see that there is a breakdown of trainable versus non-trainable parameters. In Keras, non-trainable parameters refers to the number of weights that you have chosen to keep constant when training. If weights are constant, Keras won’t update them during training, and weights that are never updated do no learning.

Weights in the Model
I’ve talked about weights throughout this series. Let’s revisit briefly. Weights are the values inside the network that the operations are carried out with, and they can be adjusted to produce whatever it is you want the network to do. Weights can be thought of as connection strengths between neurons. A higher connection strength (a higher weight value) means the neuron is more likely to activate or fire.

Keep in mind that the output of one neuron is input to another, so one way to think of “weight” is how much influence the input has on the output. Weights near zero would mean that changing the input will barely change the output, thus little to no influence.

When you create layers, internally Keras will create its own weights and all of those weights will, by default, be trainable. In other words, Keras will make sure that all of your weights can be modified so that learning can occur. This is something that could, however, be configured as part of how the model operates.

Summary Numbers
If you’re curious how those numbers in the summary are being calculated, essentially it’s just this:

Dense 1: 784 * 784 + 784 = 615,440
Dense 2: 784 * 10 + 10 = 7,850
Total params: Dense 1 + Dense 2 = 615,440 + 7,850 = 623,290

Getting into all of the details of this would take us way far afield. Hopefully the above at least removes enough of the mystery from what you are seeing and shows you some consistency with the details being presented.
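The same arithmetic can be written as a tiny function: a dense layer has one weight for every input-unit pair plus one bias per unit. The dense_params helper is hypothetical, written only to mirror the calculation above.

```python
def dense_params(inputs, units):
    """Hypothetical helper: one weight per input-unit pair, plus one bias per unit."""
    return inputs * units + units

dense_1 = dense_params(784, 784)
dense_2 = dense_params(784, 10)

print(dense_1, dense_2, dense_1 + dense_2)   # 615440 7850 623290
```

The activation layers contribute nothing to the total because they have no weights of their own to learn.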