Demystifying Machine Learning, Part 4

In the previous posts in this series, we got a lot of terminology placed in context, we investigated our data set, we took a dive into some math, and we talked about the life cycle of a neural network. In this post, you’re going to get a rapid-fire tour of creating a neural network from start to finish. The goal won’t be to understand every aspect, but rather just to get functionality working that I can expand on for you.

As a reminder, we did a few things at the end of the second post that got us a new script to play around with. If you followed along with me, that script was called mnist.py and looked like this:
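A minimal sketch of that script, assuming the standard Keras MNIST loader, goes something like this:

# mnist.py -- load the MNIST data set and take a look at its shape
from keras.datasets import mnist

(train_data, train_labels), (test_data, test_labels) = mnist.load_data()

print(train_data.shape)   # (60000, 28, 28)
print(test_data.shape)    # (10000, 28, 28)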

Keep that script handy. I’m going to come back to it in the next post. For now, let’s create a new script called mnist_learn.py. I’m doing this two-script approach for a specific reason.

In the previous post, I talked about thinking in terms of coordinate systems. So here I want to change those variables above to be in line with x and y coordinates. So for your new script, put these lines in place:
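Assuming the same Keras loader, the new script starts out like this:

# mnist_learn.py -- same data, renamed to coordinate-style variables
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()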

Here you can see that train_data and train_labels became x_train and y_train; similarly test_data and test_labels became x_test and y_test. With this we’re going to build a fully functioning neural network model in this post.

Revisit the Context

So let’s consider the most basic machine learning model. This model is an “engine” or function that takes inputs and provides outputs. A model can be trained. Training is the process of how the model learns to make sense out of the inputs it’s receiving. An algorithm shapes the model where the “shaping” means determining the specific model that will classify or predict certain outputs, given certain inputs.

The problem we’ve been talking about in this series is the ability of a model to classify grayscale images of handwritten digits into ten respective categories, 0 through 9. Each image is 28 × 28 pixels in size.

Design the Model Around the Context

A natural way to design a network to handle this classification task is to encode the intensities of the image pixels into the input neurons. So if the images are 28 × 28 grayscale images, then we have 784 input neurons. Thus each data point, each sample, will be an image that has been encoded as 784 numbers, where the numbers are intensity values.

Thus the images themselves can be linearized into 1 × 784 vectors. The labels, or outputs, are a simpler array of digits ranging from 0 to 9. Each label can in turn be one-hot encoded as a 1 × 10 vector.

To perform this task we’ll create a three-layer neural network.

  • The first layer will be the input layer. This layer contains neurons that encode the values of the input pixel intensity.
  • The second layer of the network will be a hidden layer. This will be a layer that does actual processing on the inputs, using mathematical operations to transform the data. How many neurons this layer contains will be configurable.
  • The third layer of the network will be the output layer and it will contain ten neurons.

Regarding that output layer, keeping that simple is important. The ten neurons correspond to our ten categories of 0 through 9. So if the first neuron in the output layer fires, that will indicate that the network thinks the digit is a 0. If the second neuron fires then that will indicate that the network thinks the digit is a 1. And so on.

Define the Model

Now let’s define our network based on what I said above:
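Here's a sketch of that definition. The variable name network and the activation choices here are my particular picks, but the layer sizes follow directly from the design described above:

# The model is a linear stack of layers: a hidden layer and an output layer.
network = Sequential()

# Hidden layer: accepts 784-value input vectors, outputs 32 values.
network.add(Dense(32, activation='relu', input_shape=(784,)))

# Output layer: ten neurons, one per digit category.
network.add(Dense(10, activation='softmax'))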

This will require that you add the following imports:
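Assuming the standalone Keras package, those imports are:

from keras.models import Sequential
from keras.layers import Dense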

Don’t worry about the details of all of these things. As I mentioned, this is going to be a post where we just get something working without necessarily understanding all the details of how it works.

For now, just know that this is providing us with two layers. Our input layer isn't specified above; it's implied by the data we loaded. What you see above is the hidden layer and the output layer. The hidden layer is set up to only accept as input two-dimensional tensors whose second dimension is 784; the first dimension, the batch, is left unspecified. This hidden layer will return a tensor where that dimension has been transformed to be 32. That 32 is a configurable value.

Feed the Model

The workflow will be as follows: first, we’re going to feed the neural network the training data, x_train and target labels, y_train. The network will then learn to associate images (x) and labels (y). Finally, the network will produce predictions for our test data, x_test, and we’ll verify whether these predictions match the labels from y_test.

“Feeding the neural network” means sending the data into the layers. This is the training data (input tensors) and the training labels (target tensors). The layers will map the inputs to the targets. Put another way, the layers take as input one or more tensors and will output one or more tensors. These layers will have their own tensors, called weights. For now just know that these weights will be provided for you.

Compile the Model

To make the network ready for training, it has to be compiled. Why is this necessary? Think of “compiling” as configuring the learning process. There are a couple of things we want to configure.

  1. The network has to be able to measure its performance on the training data via a feedback signal.
  2. The network has to be able to update itself based on the data it sees and its loss function.

The first item is handled by a loss function and the second is handled by an optimizer. These two aspects will provide the means by which the network will be able to modulate itself in terms of getting better at classifying. Along with these components, the network will need to be able to determine the fraction of the images that were correctly classified. This means reporting the loss of the network over the training data and reporting the accuracy of the network over the training data. All of these elements are “compiled” into the model. Add the following:
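Using the network variable from the earlier sketch, that looks like this:

# Configure the learning process: optimizer, loss function, and metric.
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])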

Don’t worry about what the terminology means there, such as “rmsprop” or “categorical_crossentropy.” Those are just specific choices being made for the type of loss function and the optimizer. Again: this is just a fast tour. We’ll investigate these aspects in the next post.

Model Operation

You feed tensors into the network where each layer performs tensor operations on those inputs. Each neural layer transforms its input data based on a mathematical operation. These layers are chained together and use those transformations to map the input data to predictions. The loss function then compares these predictions to the targets. This produces a loss value, which is a measure of how well the network’s predictions match what was expected. The optimizer uses this loss value to update the network’s weights.

Weights contain the information learned by the network from exposure to training data. Weights are gradually adjusted based on a feedback signal. This is what happens during training. Think of weights as the state of the layer. A technique known as gradient descent is used to allow the network to learn from the training data by optimizing the weights in order to minimize the output of the loss function.

Train (Fit) the Model

Once the model has been compiled, it’s possible to fit the model to the training data by training the network. This basically means you have the network iterate those mathematical operations on the training data a certain number of times. Add the following to your script:
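Here's a sketch of that, using a small batch size of 10 and 2 epochs to start:

# Iterate over the training data in mini-batches of 10 samples, 2 times over.
network.fit(x_train, y_train, epochs=2, batch_size=10)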

Execute the Model

We now have enough logic for a working neural network. Let’s try to run the script. When you try to run your script at this point, you will get an error:

Error when checking input: expected dense_1_input to have 2 dimensions,
but got array with shape (60000, 28, 28)

The issue here is that we declared an input shape of input_shape=(784,), which tells the network to expect each sample as a flat vector of 784 values, with the batch dimension left unspecified. But what got passed in were inputs of the shape (60000, 28, 28). We have to preprocess the data by reshaping it into the shape the network expects. If this isn’t done, error conditions like the ones I’m showing you here will occur.

Before we fix this, let’s get some insight into our data. You can put the following statements before the call to fit(), although I’ve included the outputs in comments in case you’d rather just take my word for it:
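Something along these lines will do; the values in the comments are what the raw MNIST arrays report:

print(x_train.shape)   # (60000, 28, 28)
print(x_train.ndim)    # 3
print(x_train.dtype)   # uint8
print(y_train.shape)   # (60000,)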

To fix the issue we just encountered, add this line before the call to fit():
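Here's that line, flattening each 28 × 28 image into a 784-value vector:

x_train = x_train.reshape((60000, 28 * 28))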

If you run the script again, you’ll get a similar but different error:

Error when checking target: expected dense_2 to have shape (10,)
but got array with shape (1,)

To fix this problem, add this line after the one you just added:
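That line, assuming the Keras to_categorical helper, is:

# One-hot encode the labels into 1 x 10 vectors.
y_train = to_categorical(y_train)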

That will require adding the following import:
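In the standalone Keras package, that's:

from keras.utils import to_categorical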

Let’s do another analysis of our data. You can try these statements after the lines you just added and before the call to fit() or, as before, just read the comments and take my word for it:
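A sketch of that check, with the new outputs in the comments:

print(x_train.shape)   # (60000, 784)
print(x_train.ndim)    # 2
print(y_train.shape)   # (60000, 10)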

With these lines in place, your script should now work. Here “work” means that you should get output something like this:

Epoch 1/2
60000/60000 [==============================] - 9s 146us/step - loss: 6.7732 - acc: 0.5740
Epoch 2/2
60000/60000 [==============================] - 8s 140us/step - loss: 6.0214 - acc: 0.6238

Your numbers can certainly differ. They’ll even differ when you rerun the script a few times. This can be a bit problematic.

Introduce Reproducibility

So you’ve run an algorithm on a data set and you’ve built a model. Can you produce the same model again given the same data? You should be able to. We achieve reproducibility in machine learning by using the exact same code, data, and sequence of random numbers. Random numbers are created using a random number generator. These math functions are deterministic. If they use the same starting point, called a seed number, those functions will give the same sequence of random numbers. Add the following import to the top of your script:
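A common choice, and the one I'll assume here, is NumPy's random number generator:

import numpy as np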

Then, after your imports, add this line:
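The exact seed value doesn't matter; any fixed number will do. I'll use 42 here:

# Seed the random number generator so runs are repeatable.
np.random.seed(42)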

This will allow you to have some reproducibility in your executions.

Network Outputs

Let’s break down a little of what’s happening here in the output we received. The operation of fit is the training loop.

Deep learning models don’t process an entire dataset at once; rather, they break the data into small batches. Here that’s 10. The network will start to iterate on the training data in mini-batches of 10 samples, 2 times over (each iteration over all the training data is called an epoch). At each iteration, the network will compute the gradients of the weights with regard to the loss on the batch, and update the weights accordingly.

After these 2 epochs, the network will have performed 12,000 gradient updates: 6,000 per epoch (60,000 samples / batch size of 10), times 2 epochs. The output reports the loss, which you want to be low, and the accuracy, which you want to be high. Above my accuracy for the first epoch was about 57% while for the second it was 62%. That’s pretty bad.

The number of epochs and the batch size are, of course, configurable. Would it help if you added more of both? Change your fit statement as such:
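Here's the adjusted call, bumping the batch size to 128 and the epochs to 5:

# Iterate over the training data in mini-batches of 128 samples, 5 times over.
network.fit(x_train, y_train, epochs=5, batch_size=128)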

Here’s what I got when I did that:

Epoch 1/5
60000/60000 [==============================] - 1s 20us/step - loss: 4.8206 - acc: 0.6875
Epoch 2/5
60000/60000 [==============================] - 1s 18us/step - loss: 3.0667 - acc: 0.8027
Epoch 3/5
60000/60000 [==============================] - 1s 18us/step - loss: 2.8681 - acc: 0.8169
Epoch 4/5
60000/60000 [==============================] - 1s 18us/step - loss: 2.7772 - acc: 0.8232
Epoch 5/5
60000/60000 [==============================] - 1s 18us/step - loss: 2.6606 - acc: 0.8309

Clearly a little better. Here the network iterated on the training data in batches of 128 samples, 5 times over. This time, after these 5 epochs, the network will have performed 2,345 gradient updates (469 per epoch), and the loss of the network is lower and thus the accuracy is higher.

Is it just a case of adding more epochs or a larger batch size? Not really, but you should feel free to experiment and see what happens. Try to get a feel for how the results differ when you change just one, then change the other, and then change both together.

Instead, let’s do this: add the following line before the fit() call, after the lines you previously added:
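Here's the line I have in mind, converting the pixel values to floats and scaling them down:

# Convert from integers in [0, 255] to floats in [0, 1].
x_train = x_train.astype('float32') / 255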

You will see a demonstrable spike in accuracy with this addition, given those 5 epochs and batch size of 128. You should see your final epoch have an accuracy of 95%. Would that apply even with our original batch size of 10 and epochs of 2? Try it out!

If you do try it, you’ll probably see slightly less accuracy but still quite good. Given that we made a change to our data, let’s check our information about it again:
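As before, a sketch of that check, with the outputs in the comments:

print(x_train.shape)   # (60000, 784)
print(x_train.ndim)    # 2
print(y_train.shape)   # (60000, 10)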

Hmm. Well, that’s exactly what we got before. And that’s because the statement we just added did not change the shape of the data at all. It simply changed the data type and rescaled the data from a range of 0 to 255 to a range of 0 to 1.

Again, don’t worry about why this is working. Just note what we’re doing here. We’ll revisit much of this in the next post. Incidentally, I’m going to keep my epochs at 5 and my batch size at 128 for the rest of this post.

With the logic in place, you should quickly reach an accuracy of somewhere in the 94 to 98 percent range on the training data.

Let’s step back and consider what’s going on here.

Calculations During Training

Throughout the creation of our neural network model, we have used simple vector data, stored in two-dimensional tensors of the shape (samples, features). For example, we ended up with a shape for our training data of (60000, 784), which means 60,000 individual samples of handwritten digits, each of which has 784 features. Those features are individual values that account for pixel intensity in the images.

This kind of vector data sequence is processed by what are called densely connected layers. The input to a layer will be a two-dimensional tensor and the layer will output a two-dimensional tensor. The calculation going on behind the scenes of the layer is this:

y = relu(dot(w, x) + b)

Here y is the output and x is the input. I will dig into this more thoroughly in the next post, but for now just understand that the breakdown of the calculation is as follows:

  1. A dot product between the input tensor (x) and a weight tensor named w.
  2. An addition between the resulting two-dimensional tensor and a vector named b.
  3. A specific operation called relu applied to the result of the above calculations. (A small sketch of this follows below.)
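To make that concrete, here's a rough NumPy sketch of what one densely connected layer computes; the array contents are random and purely illustrative:

import numpy as np

def relu(z):
    # Element-wise: negative values become 0, positive values pass through.
    return np.maximum(z, 0)

x = np.random.random((784,))      # one input sample (784 pixel intensities)
w = np.random.random((32, 784))   # the layer's weight tensor
b = np.random.random((32,))       # the layer's bias vector

y = relu(np.dot(w, x) + b)        # the layer's 32 output values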

Again, I’ll dig into the calculation specifics in the next post so now let’s just consider what this training loop actually is.

Training Loop

Let’s consider a visual I showed you in the second post.

Keep in mind the goal for our model: we want a very low loss on the training data. We want a low mismatch between predictions — call them y’ — and expected targets y. Such a low mismatch will mean the network has figured out transformations that allow it to accurately map its inputs to correct outputs. So how does that take place in the training loop we are running?

  1. The network picks out a batch (128) of training samples x and corresponding targets y.
  2. The network runs the batch x through its layers to obtain predictions y’. This is called a forward pass.
  3. The network has to compute the loss of the network on the batch, which measures the distance between y’ and y.
  4. The network updates all the weights in the layers in a way that reduces the loss on the batch.

All of these activities are the neural network shaping the model, like I mentioned earlier. As you can probably guess, that last step is the hard part. How does the network decide which weights get updated and by how much?

I haven’t shown you much of the actual calculations going on (that’s for the next post) but know this: all of the mathematical operations that are being used in the model are differentiable functions. If you’re dealing with a differentiable function, you can compute the gradient of that function. A gradient here is the derivative of a function that accepts inputs that are multi-dimensional, i.e., tensors.

Without getting into all of the math, what this means is that our network can find a differentiable function’s minimum, which is the point where the derivative has a value of zero. There can be many such points in a function so the algorithm has to find all of the points where the derivative goes to 0 and then check for which of those points the function has the lowest value. This means we can update the above steps as follows:

  1. The network picks out a batch (128) of training samples x and corresponding targets y.
  2. The network runs the batch x through its layers to obtain predictions y’. This is called a forward pass.
  3. The network has to compute the loss of the network on the batch, which measures the distance between y’ and y.
  4. The network has to compute the gradient of the loss with regard to the network’s weights. This is called a backward pass.
  5. The network updates all the weights in the layers in the opposite direction from the gradient thus reducing the loss on the batch.

Here we added a new step 4 (the backward pass) and modified the last step.

Let’s get back to our network.

Evaluating the Model

Let’s evaluate our model against our test data. Put the following at the end of your script:
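A sketch of that evaluation, printing the accuracy achieved on the test data:

# Measure loss and accuracy against data the network has never seen.
test_loss, test_acc = network.evaluate(x_test, y_test)
print('test_acc:', test_acc)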

If you now run again, the network will still go through the training loop but, at the end of it, you’ll get an error:

Error when checking input: expected dense_1_input to have 2 dimensions,
but got array with shape (10000, 28, 28)

This is very similar to the situation we saw with our training data. As before, we have to make sure we shape our test data accordingly. Add the following line before the fit() call:
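Here's that line, mirroring the reshape we did for the training data:

x_test = x_test.reshape((10000, 28 * 28))

If you also rescaled x_train into the 0 to 1 range earlier, you'll want to apply the same astype('float32') / 255 conversion to x_test so both data sets are treated consistently.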

After running the script again, you’ll get another error:

Error when checking target: expected dense_2 to have shape (10,)
but got array with shape (1,)

Again, somewhat similar to what we dealt with in terms of the training data. Add the following statement after the previous one:
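That statement, again using to_categorical, is:

y_test = to_categorical(y_test)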

Since we made some changes to our test data, you might want to do a bit of analysis on it. Below is an example of that with the comments showing how things have changed since you first checked out the data:
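A sketch of that analysis, with the before and after values in the comments:

print(x_test.shape)   # was (10000, 28, 28), now (10000, 784)
print(y_test.shape)   # was (10000,), now (10000, 10)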

While I’ve sort of been breezing through these checks on the data, it wouldn’t hurt for you to go back over this post after reading it once and make sure you understand how the data is changing in terms of its shape.

Checking Accuracy

Keep in mind what our output is showing us. The output of the training loop is showing us the accuracy achieved for the training data. The final print statement after the evaluation is showing us the accuracy achieved for the test data.

You might notice a difference between the accuracy reported for the last epoch and that returned at the end of the evaluation. Why is that? Shouldn’t they be the same?

Any gap between training accuracy and test accuracy will usually be an example of what’s called overfitting. The idea here is that the model thinks it’s doing better than it actually is on its training data, which means it will tend to perform worse on new data that it hasn’t seen before. To handle this, you can provide a validation data set as part of the fitting process.

Change your fit statement to this:
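Here's the adjusted call, passing the test data as validation data:

network.fit(x_train, y_train, epochs=5, batch_size=128,
            validation_data=(x_test, y_test))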

If you now run your script, you should find that the validation accuracy reported in the last epoch and the accuracy reported as part of the evaluation are identical. This is because the test data was used as the validation data.

In fact, in many cases you would not use your test data for validation. Instead, you would have a third data set that is distinct from your training data and your test data. That third set would be called your validation data. Here, however, I just wanted to show you how that basic process works.

The Full Neural Network

Just so you have it in one place, here’s some example logic that you could end up with:
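This is one possible version, stitched together from the sketches above; your own script may differ in details like the seed value or which print statements you kept:

import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Seed the random number generator for some reproducibility.
np.random.seed(42)

# Load the data, using coordinate-style variable names.
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten each 28 x 28 image into a 784-value vector and rescale
# the pixel intensities from the 0-255 range into the 0-1 range.
x_train = x_train.reshape((60000, 28 * 28))
x_train = x_train.astype('float32') / 255
x_test = x_test.reshape((10000, 28 * 28))
x_test = x_test.astype('float32') / 255

# One-hot encode the labels into 1 x 10 vectors.
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# Define the network: a hidden layer and a ten-neuron output layer.
network = Sequential()
network.add(Dense(32, activation='relu', input_shape=(784,)))
network.add(Dense(10, activation='softmax'))

# Compile the network: configure the optimizer, loss, and metric.
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

# Train (fit) the network, using the test data for validation.
network.fit(x_train, y_train, epochs=5, batch_size=128,
            validation_data=(x_test, y_test))

# Evaluate the network against the test data.
test_loss, test_acc = network.evaluate(x_test, y_test)
print('test_acc:', test_acc)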

Understand Model Operation

At this point, make sure you have no trouble conceptualizing what is going on here.

We feed input (data) to the model. The model uses its loss function to compare the predictions it generates against the expected targets. The model then propagates that loss information back into its training process, varying the weight parameters via optimization. If the model reaches a point where it can no longer optimize, it considers that to be “good enough.”

Being reductionist, a “model” is some function defined by parameters called weights and biases. These are combined with the inputs via mathematical operations that are used to transform the inputs into representations that will allow the output to be predicted and thus classified.

So a large amount of importance has to be placed on finding the optimal loss function. The loss function is what ultimately measures the predictive or classification power of the model. As designers of neural network models, we want our algorithm to find the parameters for which the model has the highest predictive or classification power.

A large amount of importance also goes into finding the best optimizer. The optimizer is what acts on the loss value to determine how the parameters get updated. An optimized model means a model with parameters (weights and biases) that lead to the highest predictive or classification power.

While the model will handle configuring its own internal weight and bias parameters, make sure you consider the configurable parameters in our script. We specified “32” in the hidden layer. But that could be other numbers. For example, try 784 — which matches the input shape. Or try 512. Or 64. Also consider the epochs and the batch size.

What We Accomplished

You took a relatively fast-paced tour here of neural network design and model creation. You ended with a script that does execute against the MNIST data set and provides a measure of accuracy regarding how well an algorithm was able to train on that data set. You also saw how a neural network can fail and throw errors when the data it is working with does not conform to the shape that it expects.

I don’t want an important point to get lost here: you wrote a neural network here! That’s pretty cool.

Along the way, I introduced you to many of the concepts that go into a neural network, such as the various layers, and some of the core steps needed, such as compiling, fitting, and evaluating. What I didn’t do is provide a great deal of context around all that. So you were able to do it but probably not understand everything that you were doing. In the next post, we’ll recreate much of what we did here but with a bit more surrounding context.

