Demystifying Machine Learning, Part 6

In the last post, we defined our neural network by providing it some specific hidden layers that will provide the basis for how the neural network model actually works. We were also able to dig in a bit to what’s happening behind the scenes. In this post, we’ll actually execute the neural network by feeding it the data and evaluating what gets produced as output.

This will be the final post in the series, and many of the points I make here have been covered in some way, shape or form in the previous posts. My goal in this final post is to finish off the neural network we started in the previous post. This will actually be the second implementation of the same neural network you already developed.

I did a few steps at the end of the previous post to help you visualize the model that had been created. Those statements weren’t strictly necessary for the operation of the network. Let’s level set and consider what script you should have in place. In your script, here is what you should have:

Revisiting Our Life Cycle

Let’s do a quick check-in with our life cycle. Here’s the workflow:

  1. Gather and Prepare Data
  2. Define the Model
  3. Compile the Model
  4. Fit the Model
  5. Evaluate the Model
  6. Generate Predictions from the Model

We have completed the first two steps. In this post, we’re going to do the remainder of the list.

Compile the Model

After our model has been defined we have to compile it.

The first thing to understand is that compiling the model uses the numerical libraries that are operating behind the scenes. Think of Keras as our front-end. Now we start to delegate to the back-end. Here that back-end is TensorFlow, although another common library used by many is Theano.

A key point to understand here is that the back-end automatically chooses the best way to represent the network so that it can run on your hardware. This is where considerations of running on just CPUs or using GPUs come into play. This is also where decisions might be made to run in a distributed fashion, such as on various cloud platforms. I won’t get into all this too much here but do note that optimizing how the algorithms execute is a major concern, particularly as the problems you are tackling get more complex.

Keep in mind what’s going to happen here: we’re going to have our model consume the inputs and run those inputs through the layers. That process is going to involve doing this quite a bit, essentially iterating over a set of common steps. That process is referred to as training. We’ll train the model on our inputs.

To make the network capable of being trained, however, it has to be compiled. Compiling a neural network means you are configuring the learning process. When compiling, we have to specify some additional properties that are required when training the network.

Let’s consider a visual I showed you back in the second post:

Two of the things we need for our compilation step are indicated in that visual:

  • A loss function: how the network will be able to measure its performance on the training data, and thus how it will be able to steer itself in the right direction.
  • An optimizer: the mechanism through which the network will update itself based on the data it sees and its loss function.

To get something tangible in place, add this statement to your script:

You’ll see there that I’ve added a loss function and an optimizer. I’ll discuss those momentarily. I’m also adding one extra parameter, metrics, which is a list of things to monitor during training and testing. Here I’m indicating that I only care about accuracy, which is the fraction of the images that were correctly classified by the model.

Let’s spend a little time going over those individual components that are passed in to the compiler.

Loss Functions

Regarding the loss function, the reason you want this is because of those parameters (weights) we talked about in the previous post. A loss function is what a neural network uses to optimize the parameter values. More specifically, this function helps to reduce the error between the actual output and the expected output. That’s why this function is sometimes also called an error function.

Early on in this series, I said that a neural network is ultimately learning a mapping function to make a prediction or perform a classification. As with anything that deals with prediction or classification, you will likely have errors. This is where the mapping function doesn’t quite work; where the output does not, in fact, converge appropriately with the input. Thus a movie review might be assigned a negative sentiment when, in fact, the review was positive. Or a dog is labeled as a cat. Or, in our case, perhaps the number 9 is labeled as an 8.

Here we’re using what’s called a categorical cross-entropy loss function and this is because we are dealing with a classification problem based on categories. If we had a binary classification problem we would use a loss function called binary cross-entropy.

But what does the “cross-entropy” part mean? Cross-entropy, whether binary or categorical, models a logarithmic loss. This kind of loss quantifies how well a classifier is doing by penalizing false classifications, and it penalizes confident wrong answers most heavily. So a number 9 that was classified as a number 8 would be penalized. Minimizing the logarithmic loss generally goes hand in hand with maximizing the accuracy of the classifier.
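To make that penalty concrete, here is a minimal sketch of categorical cross-entropy for a single sample, written in plain Python rather than through Keras. The function name and the probability values are my own illustrative assumptions; they are not output from the network in this series.

```python
import math

def categorical_cross_entropy(true_one_hot, predicted_probs, eps=1e-12):
    """Log loss for one sample: -sum(t * log(p)) over the classes."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(true_one_hot, predicted_probs))

# The true digit is 9, one-hot encoded over the ten classes 0..9.
true_label = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Hypothetical prediction that puts most of its probability on 9.
good_guess = [0.01] * 8 + [0.02, 0.90]
# Hypothetical prediction that puts most of its probability on 8.
bad_guess = [0.01] * 8 + [0.90, 0.02]

loss_good = categorical_cross_entropy(true_label, good_guess)
loss_bad = categorical_cross_entropy(true_label, bad_guess)
print(f"loss when 9 is favored: {loss_good:.3f}")
print(f"loss when 8 is favored: {loss_bad:.3f}")
```

The loss is much larger when the probability mass sits on the wrong digit, and that difference is the feedback signal the training process works with.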

What “entropy” in this context means can be thought of this way: entropy is the average amount of surprise you get when obtaining samples from a data set. The “surprise” bit there probably requires some explanation.

Instead of our handwritten digits, consider instead that I’m doing something with sentiment analysis on movie reviews. Let’s say I have a very biased data set of movie reviews, in favor of positive reviews. Maybe the reviews of, say, Marvel movies were culled from a site that is dedicated to Marvel fans. This kind of data set has a low entropy because you expect a positive review on almost every sample, and so you are rarely surprised by the outcome.

An unbiased data set of movie reviews has a high entropy because there’s really nothing that would give you any sort of means to predict a series of outcomes. Meaning, you have a steady rate of “surprise” upon seeing each new outcome.

So cross-entropy can be thought of as the average amount of “surprise” you get when obtaining samples from the actual data set (D) after having assumed that data set to be something specific (E). With the biased review data set, if you assumed the bias to be toward positive outcomes but it ends up actually being toward negative outcomes, then the expected (assumed) and actual data sets will each have low entropy but the cross-entropy between them will be high. You expect positive sentiments but you keep getting “surprised” by negative sentiments.

So what cross-entropy is doing is quantifying the difference between two probability distributions.
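The movie review example can be sketched numerically. This is a small illustration under stated assumptions: the two-outcome distributions (positive, negative) and their probabilities are made up, and the helper names are mine.

```python
import math

def entropy(p):
    """Average surprise, in bits, of samples drawn from distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(actual, assumed):
    """Average surprise when data follows `actual` but we assumed `assumed`."""
    return -sum(a * math.log2(e) for a, e in zip(actual, assumed) if a > 0)

# Hypothetical distributions over (positive, negative) reviews.
unbiased = [0.5, 0.5]
mostly_positive = [0.95, 0.05]
mostly_negative = [0.05, 0.95]

h_unbiased = entropy(unbiased)
h_biased = entropy(mostly_positive)
h_cross = cross_entropy(actual=mostly_negative, assumed=mostly_positive)

print(f"entropy of the unbiased set:       {h_unbiased:.3f} bits")
print(f"entropy of the biased set:         {h_biased:.3f} bits")
print(f"cross-entropy, wrong assumption:   {h_cross:.3f} bits")
```

The biased set has low entropy on its own, but assuming the wrong direction of bias produces a cross-entropy far higher than either set's entropy.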

In the context of handwritten digits, the bias might come in if you were using images solely drawn by young children versus those drawn by adults although it gets more technical than that since it has to do with patterns of how people write.

In previous posts I had mentioned feedback signals. Here the categorical_crossentropy is the loss function that’s used as a feedback signal for learning the weight tensors, and which the training phase will attempt to minimize. This reduction of the loss happens via something called mini-batch stochastic gradient descent. Don’t worry about that yet; just know that the exact rules governing a specific use of this “gradient descent” are defined by the rmsprop optimizer. So let’s talk about that next.


First, let’s remind ourselves of the reason we want an optimizer. The reason is exactly what it sounds like: it optimizes the learning process. For the optimizer, we’re using something called “rmsprop.” RMSProp stands for “Root Mean Square Propagation.” What this does is divide a hyperparameter called the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.

Huh? Yeah. Getting into gradients is a bit tricky so I’ll come back to that momentarily.

In a previous post, I talked about parameters and hyperparameters. The learning rate is a hyperparameter because it is outside the model. This is as opposed to the weights which are part of the model and, thus, are considered parameters.
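The RMSProp rule described above can be sketched for a single weight. This is a simplified illustration, not the Keras implementation: the function name, the toy loss and the hyperparameter values are assumptions I picked for the example.

```python
import math

def rmsprop_step(weight, grad, avg_sq_grad,
                 learning_rate=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update for a single weight.

    avg_sq_grad is the running average of recent squared gradients.
    """
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    # Divide the learning rate by the root of that running average.
    weight = weight - learning_rate * grad / (math.sqrt(avg_sq_grad) + eps)
    return weight, avg_sq_grad

# Minimize the toy loss f(w) = w**2, whose gradient is 2 * w.
w, avg = 5.0, 0.0
for _ in range(2000):
    w, avg = rmsprop_step(w, 2 * w, avg, learning_rate=0.01)

print(f"weight after training: {w:.4f}")
```

Here the weight is the parameter being learned, while learning_rate and rho are hyperparameters that stay fixed from outside the model.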

I’ll note that another very common optimizer is called Adam, which is short for “Adaptive Moment Estimation.” This is the optimizer that people often use as the default choice, and it combines the idea behind RMSProp with momentum.

Both ADAM and RMSProp are efficient algorithms based on that technique called gradient descent that I just mentioned. Gradient descent is one of the most popular algorithms to perform optimization for neural networks. So let’s dive into that a bit.


Linear functions give straight lines when you plot output against input. Non-linear functions give curved lines when you plot output against input. And remember that when we introduced the activation function, we introduced a non-linearity. So we’re dealing with curves and that takes us a bit into calculus.

Calculus is nothing more than dealing with things changing. We use calculus to understand that change mathematically. If we understand how certain things are related to each other, we can figure out how changes in one will result in changes in the other. And that’s basically what calculus is all about: figuring out how things change as a result of other things changing. The “how things change” is represented by something called the gradient.

The rate of change of a curve at any point is the slope of the curve at that point. The slope of a curve means the slope of the tangent at some specific point on that curve. How do you measure the slope of a line that’s curved?

Basically, you estimate the slope by drawing a straight line, called the tangent, which touches the curved line at an angle that matches the gradient of the curve at that exact point.

This is another area where geometry comes into play.

The mathematical version of this approach is called gradient descent. The idea is that after an algorithm has taken a step along the curve, it looks again at the surrounding area — the rest of the curve — to see which direction takes it closer to some objective. Then the algorithm steps again in that direction. The algorithm keeps doing this until it arrives at the “bottom,” which is often called the minimum. Think of the gradient as referring to the slope of the ground.

If we imagine the little guy in that picture as the algorithm, it steps in the direction where the slope is steepest downwards. Notice, however, that depending on where the goal is, an algorithm can get stuck in the wrong place. The person in position 1 above is likely going to get stuck in A. The person at position 2 looks to be getting stuck in B. Why getting stuck? Because they will calculate that they have reached the lowest point, even though there is another point (C) that is lower.

Keep in mind what gradient descent does: it descends. So getting stuck happens because the algorithm doesn’t try to “go up” if it’s already at a low point. You will sometimes hear this described as “getting stuck in a local minimum.” That basically just means that there are lower points, globally speaking, but the algorithm is stuck at a point that is locally the lowest.

Now imagine that the complex landscape our algorithm has to navigate is a mathematical function, which can make the shape of the space — and thus the gradient — much more complex. What a gradient descent method gives an algorithm is an ability to find the minimum without actually having to understand that complex function enough to work it out mathematically.

If we, as neural network designers, set things up such that the complex difficult function is the error of the neural network, then that would mean that going downhill to find the minimum is minimizing the error. In other words, gradient descent is helping to make the network’s output better by reducing its loss (or error).

There’s a lot more I could say about gradient descent but it does get a bit more involved than we have time for here. For now, just make sure you understand conceptually what’s happening. The hypothesis space that our algorithm is exploring has a certain shape depending on the dimensions of the problem being explored. That shape provides gradients. Those gradients are what the algorithm will work to descend along, ideally finding the lowest point, which would correspond to the lowest loss (and thus lowest error).
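Here is a minimal sketch of gradient descent getting stuck in a local minimum. The one-dimensional “landscape” is a toy function I chose because it has two valleys; it is not the loss of the network in this series, and the starting points and learning rate are likewise assumptions.

```python
def loss(x):
    """Toy landscape with two valleys; not the network's real loss."""
    return x**4 - 3 * x**2 + x

def gradient(x):
    """Slope of the toy loss at x."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=500):
    x = start
    for _ in range(steps):
        # Step downhill: move against the slope.
        x -= learning_rate * gradient(x)
    return x

from_right = gradient_descent(start=2.0)
from_left = gradient_descent(start=-2.0)

print(f"start at  2.0: settles near x = {from_right:.2f}, loss = {loss(from_right):.2f}")
print(f"start at -2.0: settles near x = {from_left:.2f}, loss = {loss(from_left):.2f}")
```

Both runs stop where the slope is flat, but the run that starts on the right settles in the shallower valley. It never “goes up” to discover that the valley on the left is lower.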


I mentioned earlier that you can use various metrics to judge how well your model is doing in the context of the problem it is working with. Because we are dealing with a classification problem, it’s a good idea to collect and report the classification accuracy as the metric. There’s not much more to be said about that right now. When we plot out our results, you’ll see the benefits of tracking this metric.

Fit the Model

Okay, so, we’ve defined our model and compiled it for efficient computation and configured its learning process. Now we have to execute the model on some data. In our case, we’ll fit the model on the loaded training data. But what does it mean to “fit” the model? Here’s something you don’t often get told right away but it’s such a simple point: fitting is synonymous with training. We literally fit the model to its training data.

The idea is that the training process will run for a fixed number of iterations through the data set. These iterations are called epochs. More specifically, one epoch is one complete pass of the training data through the network, and the number of epochs is how many such passes are made during training.

Two quantities will be displayed during training: the loss of the network over the training data and the accuracy of the network over the training data. Those are ultimately going to tell us how well our network is learning. Add the following statement to your script:

I talked about how the batches and epochs work in the fourth post in this series so I won’t repeat that here. I also covered the use of validation data in that post.

One thing to note is that the batch size and the number of training epochs, together with our model architecture, ultimately determine the total training time. Clearly for very complex problems with massive amounts of data, this training time becomes important to consider since you don’t want to be waiting around until the heat death of the universe for your results.

For this problem, I’m running for a small number of epochs (5) and using a relatively small batch size of 128. But how do you choose these values? Well, there’s no manual that explains this. These values can be and should be chosen via your own experimentation, essentially by trial and error. Regardless of your choices, this is where the work happens on your CPU or GPU or distributed platform.
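To get a feel for how batch size and epochs translate into work, here is a quick back-of-the-envelope calculation. It assumes all 60,000 MNIST training images are used for fitting; if you hold some out for validation, the numbers shrink accordingly.

```python
import math

training_samples = 60_000  # assumption: the full MNIST training set
batch_size = 128
epochs = 5

# Each batch produces one weight update; the last batch may be partial.
updates_per_epoch = math.ceil(training_samples / batch_size)
total_updates = updates_per_epoch * epochs

print(f"{updates_per_epoch} weight updates per epoch")
print(f"{total_updates} weight updates in total")
```

Halving the batch size roughly doubles the number of updates per epoch, which is one reason these two values are worth experimenting with together.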

Plotting Loss and Accuracy

Let’s add some logic to plot our loss:

You will get something like this:

And likewise to plot our accuracy:

You should get a graph like this:

These are pretty good! We see the loss is definitely going down. Notice, however, how much further the loss went down on the training data than on the testing data. That gap is a sign of overfitting, which is why we also used validation data. Likewise, notice the accuracy was going up as the epochs went on and, again, the training data showed higher accuracy than the test data.

Life Cycle Check In

Let’s do a quick check-in with our life cycle. Here’s the workflow:

  1. Gather and Prepare Data
  2. Define the Model
  3. Compile the Model
  4. Fit the Model
  5. Evaluate the Model
  6. Generate Predictions from the Model

Not bad! We’re getting there. In fact, we’re at the point where we can evaluate how our network did from its training.

Evaluating the Model

To evaluate the network, we’ll evaluate against the test data. Add the following to your script:

Because we used validation data, you should see that your accuracy from the final epoch is very much in line with the accuracy reported with the above print statement. You can also get a feel for what your baseline error was like, which tells you the approximate percentage of classification errors.

As we had before, you now have a fully functioning neural network that provides a definition for its layers, is compiled in order to configure a learning process with a loss function and optimizer, is fit (or trained) on the training data input, and is evaluated on the test data input.

Generate Predictions

One of the major points of this whole supervised learning journey we’ve been on for these posts is to get a model to learn something so that, ideally, it can be used for predictions and classifications. We’ll work on that here. As we do this I’m going to show you a subtle bug that I introduced.

Saving the Model

To get started, let’s save the model. After the call to your fit() method, add the following statement:

This saves your model as an H5 file, which is an efficient storage mechanism for storing the details of a neural network. Specifically, this file will contain the architecture of the model, the weights of the model, the training configuration (i.e., the loss function and optimizer), and the state of the optimizer.

What this does is allow you to recreate the model and resume the training exactly from where you left off.

Now we can use that model. In fact, we’re going to have to refactor a portion of our script. Before your call to evaluate(), add this line:

This will require the following import:

Now change your evaluate statement to refer to this mnist_model variable as such:

Predicting Class Values

At the end of your current script, do this:

Here you are loading the model and then attempting to predict the class values for each instance in test_images.

Now add these next statements:

This will simply output how many were correctly classified and how many were not. If you run that, I bet you are going to be very surprised. I say that because you are probably going to get output like this:

0  were correctly classified
1  were incorrectly classified

This is due to the subtle bug I introduced. And the only reason I let it go this far is to show you how something like this can happen — and it can be very difficult to reason about what specifically caused the issue. Go back to this line in your script:

Here I categorically encoded test_labels and saved that back into test_labels. That’s actually not what we want to do here. We need for there to be two variables in the case of labels. So change the above line to this:

Now we have to change a few statements to use this variable. Find your total_classes line and change it as such:

Now, change your fit statement to use this variable:

Also, change your evaluate statement to use this variable:

Now run again. You’ll likely see something a bit more sensible, like this:

9793  were correctly classified
207  were incorrectly classified

Much better.
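The bug can be reproduced in miniature without Keras. In this sketch the labels, the predictions and the name test_labels_cat are all hypothetical; the point is only to show why comparing predicted class values against one-hot encoded labels never matches.

```python
def to_one_hot(label, num_classes=10):
    """Categorically encode an integer label as a one-hot list."""
    row = [0] * num_classes
    row[label] = 1
    return row

# Hypothetical integer labels and their categorical encoding,
# kept in two separate variables.
test_labels = [7, 2, 1, 0]
test_labels_cat = [to_one_hot(label) for label in test_labels]

# Hypothetical predicted class values: three right, one wrong.
predicted_classes = [7, 2, 1, 6]

# Buggy comparison: an integer class is never equal to a one-hot row.
buggy_correct = sum(1 for p, t in zip(predicted_classes, test_labels_cat) if p == t)

# Fixed comparison: integer class against integer label.
fixed_correct = sum(1 for p, t in zip(predicted_classes, test_labels) if p == t)

print(buggy_correct, "were correctly classified (buggy comparison)")
print(fixed_correct, "were correctly classified (fixed comparison)")
```

Nothing crashes in the buggy version. It just quietly reports that nothing was classified correctly, which is what makes this kind of mistake hard to reason about.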

Visualizing the Predictions

Now let’s see if we can visualize a bit of what happened here. Let’s look at nine of the correct predictions. To do so, add the following statements:

You should get something like this:

Let’s also look at some of the incorrect predictions. To do that, add the following very similar statements:

You should get something like this:

And there you have it! You can now not only run your network and evaluate how your network performs but you can check its ability to predict.

Of course let’s not forget that subtle bug that I was able to introduce and keep in there for quite some time. That is a very common example of what can happen when you are working with neural network models.

Modifying Our Network

One thing to keep in mind is that we have seen some data conditions that can be altered, such as the number of epochs our model iterates over or the size of the batches that it uses in its training sessions. We also saw that there are different loss functions and optimizers that could be configured for our model’s learning process. All of these can be considered variables, or data conditions, that can be modified to influence how the network actually works.

There’s one other I want to revisit and that’s the definition of the network itself, when we set up the dimensional space. I’ll remind you of a visual I showed you in the previous post for our current definition:

Here our input layer consists of 784 neurons and that matches the number of inputs coming in for each image, where each image, if you remember, is a 28 × 28 matrix that was converted into a 784-dimensional vector.

Let’s instead have the network look like this:

Change your model definition lines in your script to the following:

This will require you adding Dropout to one of your existing imports:

I’m not going to go into too much detail about this new architecture. Clearly you can see that we now have two hidden layers, each of which is made up of 512 neurons. I also added something called a dropout, but what is that?
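To see how much this architecture has to learn, here is a rough count of its trainable parameters. It assumes fully connected layers with biases, going from 784 inputs through two hidden layers of 512 neurons to an output layer of 10 neurons (one per digit); the output size is my assumption based on the ten digit classes. Dropout adds no parameters of its own.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

# Assumed layer sizes: input, two hidden layers, output.
layer_sizes = [784, 512, 512, 10]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)
print(f"{total:,} trainable parameters")
```

If your model summary reports different numbers, that is a good prompt to check which layer sizes your script actually uses.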


Dropout is a regularization technique. The idea is that randomly selected neurons are ignored (“dropped out”) during training. But what effect does that have? It means that any such dropped out neurons will not have a contribution to the activation of neurons in the next layer down. It also means that any weight updates that are applied during optimization will not be applied to the dropped out neurons.

With the above logic, the dropout rate is set to 20%, which means that one in five inputs will be randomly excluded from each update cycle.

But why would you do this? The main reason is to avoid a model that is too specialized to the training data. The network will become less sensitive to the specific weights of neurons, given that there is a random drop out of neurons. The net effect is that the network should be better at generalizing and that should show up in the network being less likely to overfit the training data.
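The mechanics of dropout can be sketched in a few lines of plain Python. This is a simplified illustration and not the Keras layer itself: the function name and the input values are assumptions. It does follow the usual convention of scaling the surviving inputs up by 1 / (1 - rate) during training so the overall signal strength stays about the same, and of doing nothing at all outside of training.

```python
import random

def dropout(inputs, rate=0.2, training=True, rng=random):
    """Randomly zero a fraction `rate` of the inputs during training."""
    if not training:
        # At evaluation and prediction time every input passes through.
        return list(inputs)
    keep = 1.0 - rate
    # Keep each input with probability `keep`, scaling survivors up.
    return [x / keep if rng.random() >= rate else 0.0 for x in inputs]

rng = random.Random(42)  # seeded so the run is repeatable
activations = [1.0] * 1000

dropped = dropout(activations, rate=0.2, rng=rng)
zeroed = dropped.count(0.0)
unchanged = dropout(activations, rate=0.2, training=False)

print(f"{zeroed} of {len(activations)} inputs were dropped during training")
print(f"inputs untouched outside training: {unchanged == activations}")
```

With a rate of 20%, roughly one in five inputs ends up zeroed on any given pass, and a different random set is chosen each time.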

This modification of how the network is defined is yet another data condition that you can take into account as you try to determine how to make better neural networks that can handle the tasks you are applying them to.

And with that we have completed the workflow life cycle that we were following through many of these posts.

What We Accomplished

Wow! This was a long series of detailed posts, huh? But you made it! You’re at the end. Keep in mind that basic workflow we followed:

  1. Define your training data, which are input and output tensors.
  2. Define a network of layers, called a model, that maps your inputs to your outputs.
  3. Configure the learning process by choosing a loss function and an optimizer.
  4. Iterate on the training data by fitting the model to the data.
  5. Evaluate the model on test data.
  6. Check if the model can accurately predict or classify data.

Through these posts, we took a journey of considering the concepts, applying those concepts, and learning the basis behind the concepts, including the mathematics that make it all work. I do realize that these posts will likely not have built up your intuition for what’s actually going on at each point but I do hope that the amount we were able to dig into together provided enough demystification of the concepts around machine learning.

This is a challenging area of the future for all of us. There is a danger that the technocrats in our culture will have us all abdicating our responsibility to machines and algorithms. So while I probably didn’t build up your intuition for how everything works, that was also part of the point. There is a certain opaqueness to a lot of this stuff.

Further, I showed you how a very simple bug could be introduced and that simple bug could impact some aspects of how people are reasoning about the outputs of the machines and algorithms.

I certainly encourage any comments, suggestions and criticisms around the material or my approach to presenting it but, for now, I’ll close by thanking you for sticking with me on the journey.


About Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.