Demystifying Machine Learning, Part 3

In the previous post in this series we were able to dive in a bit and get coding. That was a nice balance with the first post which put more emphasis on theory. In this post we’ll deal with some of what’s going on between the coding and the theory. This is where some of the practice comes in.

In terms of the specific practice, we’re going to hop into some of the math underlying what I’ve been talking about. I don’t, however, want this to be the post where I lose people. So bear with me a bit and see if what I talk about here makes sense. If you find it doesn’t or you just aren’t liking it, jump to the next post in this series where we’ll get back into the coding.

In the last post, we ended with a script that loaded up our MNIST data set from Keras. That data set comes stored in what are called multi-dimensional NumPy arrays. These are also called tensors. Tensors are the basic data structure of machine learning.

But let me back up a bit here.

Reducing to Operations

In the design of computer architecture, or rather in the computing logic upon which those architectures are built, there’s something called a NOR gate. This is basically a logic switch whose output is 1 only if its inputs are both 0. Another way to say it is that a NOR gate’s output is “true” if both inputs to it are “false.” Here’s what that looks like:

Input 1      Input 2      Output
false (0)    false (0)    true (1)
false (0)    true (1)     false (0)
true (1)     false (0)    false (0)
true (1)     true (1)     false (0)

Now, I bring this up because all computers are made of logic gates that are built out of transistors. And, because of this, all computations can be reduced to combinations of logic gates, like AND, OR, and NOT. But NOR gates are referred to as “universal gates” because they can be combined to form any other kind of logic gate. The point of this is that everything can be implemented using NOR gates.
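If you want to see the "universal gate" idea for yourself, here's a minimal sketch in plain Python. The function names (nor, not_, or_, and_) are just names I picked for illustration; the point is that NOT, OR, and AND are each written using nothing but NOR.

```python
def nor(a: int, b: int) -> int:
    """NOR: output is 1 only when both inputs are 0."""
    return int(not (a or b))

def not_(a: int) -> int:
    # Feeding the same input to both sides of a NOR inverts it.
    return nor(a, a)

def or_(a: int, b: int) -> int:
    # OR is just NOR followed by NOT.
    return not_(nor(a, b))

def and_(a: int, b: int) -> int:
    # AND is NOR applied to the inverted inputs.
    return nor(not_(a), not_(b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "NOR:", nor(a, b), "OR:", or_(a, b), "AND:", and_(a, b))
```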

So, just as all of computing, and thus all programming, can ultimately be reduced to a small set of binary operations on binary inputs (AND, OR, and so on), all transformations learned by deep neural networks can be reduced to a handful of tensor operations applied to tensor inputs. Just as everything in computing reduces to NOR gates, everything in machine learning reduces to operations on tensors.

Some of you will have read all that and nodded your head or calmly accepted it. Others might have been utterly baffled. Yet others might have agreed with the ideas but are concerned about how much all of this really matters.

The Need for Math Transparency

Okay, so let’s start in on a little about the math. This is a subject that some people shy away from because they either find it boring or, more often, they are simply concerned about what they don’t remember from early schooling years. But, honestly, some pretty basic math is about all you need. And I find a refresher of just some basic concepts is often more than enough.

But, wait, you might say, isn’t the point of many of the tools we’re going to use to abstract away a lot of that math?

That’s true, actually. For example, TensorFlow does a lot of tensor manipulation and differentiation. Much of that is abstracted away from you. And if you use Keras as a high-level modeling library on top of TensorFlow, you won’t see any math at all.

But therein lies a bit of the problem. These tools we’re working with, and these algorithms and models we’re building, become opaque. It becomes very hard to know what exactly is going on with what we’re constructing. How code and concept meet is critical for us being able to trust these algorithms as we come more and more to value what they are providing.

The Math of Collections

So let’s drag out Numpy and bring some concepts into the code or, if you prefer, fill out the concepts with some code. We’ll think in terms of everything being an array: a collection of something. We’ll consider this from the standpoint of linear algebra and also geometry. You’ll probably remember that I started down that path a little bit in the previous post.

To play along with me, create a script (call it whatever you like) and import NumPy under its conventional alias, np.

As we go on, feel free to type everything in and execute the script periodically when I show you print statements.

Dimensions: Start Nowhere with Nothing

First, let’s consider a single number:

In linear algebra, this is a scalar. Anything from algebra that you would call a “number” is called a scalar in linear algebra. A scalar has zero dimensions. Let’s confirm that:
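A minimal sketch of that check, assuming we wrap the number 42 in a NumPy array and call it s (the variable name is just my choice here):

```python
import numpy as np

s = np.array(42)   # a single number wrapped as a NumPy array
print(s.ndim)      # 0
```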

Since a scalar has no dimensions it, by definition, can’t have a shape. But let’s confirm that as well:
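Here's one way to check, again assuming a scalar named s. NumPy reports the shape as an empty tuple, which is its way of saying there are no dimensions to measure:

```python
import numpy as np

s = np.array(42)
print(s.shape)  # ()
```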

In linear algebra, technically we could say that a scalar is a vector of length 1 and is thus a 1 × 1 array:

[ 42 ]

That said, please note that from a programmatic perspective, np.array(42) is different from np.array([42]).
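A quick sketch of that difference (the variable names are mine, purely for illustration):

```python
import numpy as np

scalar = np.array(42)               # zero dimensions
one_element_vector = np.array([42]) # one dimension, holding one element

print(scalar.ndim, scalar.shape)                          # 0 ()
print(one_element_vector.ndim, one_element_vector.shape)  # 1 (1,)
```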

Now ask yourself this: what would that scalar correspond to in geometry? Well, what’s an object in geometry that is said to have no dimensions and thus no shape? A point!

Individual numbers on their own, much like points, often aren’t terribly interesting. So how do we scale up from that?

Dimensions: Extend Yourself

We can consider a few numbers rather than just one. How many numbers doesn’t really matter; we just need at least more than one. How about this:

In linear algebra, this is a vector. A vector is said to have one dimension.

The single dimension of a vector is the number of elements it has and that is what gives it its shape:
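As a small sketch, assuming a vector v holding four numbers I made up:

```python
import numpy as np

v = np.array([1, 2, 3, 4])
print(v.ndim)   # 1
print(v.shape)  # (4,)
```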

What I’m showing you there is a row vector, which means a 1 × m array. You can also have a column vector, which means an m × 1 array.

To get a column vector programmatically, you have to do something called reshaping. Here’s how you could do that:
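One way to sketch it, assuming the same kind of four-element vector (the values and names are just for illustration):

```python
import numpy as np

v = np.array([1, 2, 3, 4])
column = v.reshape(4, 1)   # v.reshape(-1, 1) lets NumPy work out the 4 for you

print(column)
print(v.shape)       # (4,)
print(column.shape)  # (4, 1)
```

Note that reshape hands back a reshaped array rather than changing v in place, which is why the result is assigned to a new name.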

This new column vector holds the same elements as the row vector, but NumPy reports its shape as (m, 1) rather than (m,), which you can feel free to confirm.

What would this vector correspond to in geometry? I bet you can guess it. A line! Vectors can be thought of as representing movement from a point.

Dimensions: Extend Yourself Further

So what’s the next step up from that? We can imagine multiple lines relative to each other. If we have two lines, we have two vectors. So I could just have this:

But those are two different vectors, each with their own dimensions. To scale up, I have to make sure those lines are relative to each other and part of the same data structure:
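A minimal sketch of that, with two three-element vectors whose values I've made up, placed inside one array:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.ndim)   # 2
print(m.shape)  # (2, 3)
```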

In linear algebra, two vectors form a matrix. More generally, any collection of vectors is a matrix. A matrix is said to have two dimensions and it’s an m × n array.

What would this correspond to in geometry? That would be a plane. To get a two-dimensional plane, we need at least two lines to help us define the plane. Another way to think of it is that we represent planes by initial points and direction vectors.

What this does is form a coordinate system.

Realizing a matrix is basically a coordinate system is the way to understand the transition from linear algebra to geometry.

The Shape of Things

If any of you suspect I was crazy when I wrote about the dimensionality of testing, this post should start confirming that suspicion for you. In the interests of having you not think I’m crazy for this series of posts, let me state why I’m focusing on geometry along with the linear algebra. Consider this:

That’s showing the sum of two vectors in the plane. This is a geometric chaining together of the vector arrows. These are operations that are taking place in a dimensional space. What this means is that transformations of data are basically geometric in nature since all data sits in some sort of dimensional space.
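In code, that chaining together is just element-wise addition. Here's a sketch with two vectors in the plane (the values are arbitrary ones I chose):

```python
import numpy as np

a = np.array([2, 1])   # move 2 along x, 1 along y
b = np.array([1, 3])   # then move 1 along x, 3 along y

print(a + b)  # [3 4]
```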

And this matters when we consider how to scale past our two-dimensional matrix. This is where you get into tensors. In fact, if you’ve been following along, you’ve been staring at tensors for a lot of this post.

Consider that a scalar could be considered a vector of length 1 or a 1 × 1 matrix. But more appropriately each element of a vector is a scalar and the vector is an m × 1 or 1 × m matrix.

In turn, a matrix is a collection of vectors, the dimensions of the matrix being m × n. This means a matrix is really nothing more than a collection of n vectors of dimensions m × 1 or m vectors of dimensions n × 1.

The point here is that everything can be seen as a generalization of something else. So now consider:

  • A 1 × 1 tensor, called a tensor of rank 0, is a scalar.
  • An m × 1 tensor, called a tensor of rank 1, is a vector.
  • An m × n tensor, called a tensor of rank 2, is a matrix.

This is what I meant about you having been looking at tensors for much of this post.

Generalizing Beyond Two

But what’s past the matrix then? (I feel like I should be asking you that while offering you a red pill or a blue pill.) If we want to go past this point, obviously we’re getting into three dimensions. Does that sound like a cube? Well, consider this visualization:

Yeah, it’s a cube. But what does that mean for a matrix? Consider this visualization:

What this means is that a k × m × n tensor, called a tensor of rank 3, is a collection of matrices. What you have to realize is that this same exact operation — embedding matrices within matrices — is what tensors are, up to any dimension you want to consider.


Let’s manually create a tensor:
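A sketch of what that can look like, assuming two 2 × 3 matrices typed out by hand inside one array (the numbers are placeholders of mine):

```python
import numpy as np

t = np.array([[[1, 2, 3],
               [4, 5, 6]],
              [[7, 8, 9],
               [10, 11, 12]]])

print(t.ndim)   # 3
print(t.shape)  # (2, 2, 3)
```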

Here we have a data structure of three dimensions and that is of the shape 2 × 2 × 3. This shape means that it contains two matrices, both of which are 2 × 3.

That’s a little hard to read though so often what we do is construct a tensor of dimensions higher than two by bringing together matrices:
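For instance, assuming two 2 × 3 matrices named m1 and m2 (names and values are mine, for illustration), you can build each matrix on its own and then wrap them together:

```python
import numpy as np

m1 = np.array([[1, 2, 3],
               [4, 5, 6]])
m2 = np.array([[7, 8, 9],
               [10, 11, 12]])

t = np.array([m1, m2])   # np.stack([m1, m2]) would do the same here

print(t.shape)  # (2, 2, 3)
print(t[0])     # the first matrix, m1
```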

Now, keep in mind that what I just showed you are tensors. And earlier I said tensors are the key data structure for neural networks and deep learning. Now that you know that, you can see that when you boil it down even further, we’re really just talking about operations on matrices.

This is, essentially, what’s going on behind the scenes of your models in machine learning. When put in this context, the basis for machine learning is hopefully seeming a little less opaque.

Operations for Models

So now let’s consider some operations. And, again, the reason I’m doing this is because these operations are the basis for how deep learning models essentially work.

Vector Multiplication

One of these is vector multiplication. There are two types of product in this context: a dot (or inner) product and an outer (or tensor) product. Here the only one that really matters is the dot product.

Consider this schematic of two vectors being multiplied:

[ 6 ]   [ 8 ]
[ 5 ] * [ 2 ]
[ 4 ]   [ 3 ]

How this breaks down is really simple:

= (6 * 8) + (5 * 2) + (4 * 3)
= 70

The dot product is just the sum of the products of the corresponding elements. Here’s how to put the above into code:
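A minimal sketch using the two vectors from the schematic (the names a and b are mine):

```python
import numpy as np

a = np.array([6, 5, 4])
b = np.array([8, 2, 3])

print(np.dot(a, b))  # 70
```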

Basically we end up with a scalar.

Matrix Multiplication

There’s also matrix multiplication. There are a couple of interesting points to consider in this context. One is that in order for this operation to be carried out, there is a requirement of compatibility. That requirement is that you can only multiply an m × n matrix with an n × k matrix. The reason for this is that the second dimension of the first matrix has to match the first dimension of the second matrix.
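You can watch NumPy enforce that rule. This is just a sketch using placeholder matrices filled with ones, since only the shapes matter here:

```python
import numpy as np

m1 = np.ones((2, 3))
m2 = np.ones((3, 6))

# Inner dimensions match (3 and 3), so this works.
print((m1 @ m2).shape)  # (2, 6)

# Swap the order and the inner dimensions are 6 and 2, so NumPy refuses.
try:
    m2 @ m1
except ValueError as err:
    print("incompatible shapes:", err)
```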

Consider these matrices that are listed with their dimensions:

m1 (2 × 3) * m2 (3 × 6) = m3 (2 × 6)
m1 (3 × 4) * m2 (4 × 2) = m3 (3 × 2)

Notice the parts that match the constraint I just specified. But also notice something interesting about the resulting matrix: whatever dimension is repeating will disappear in the resulting matrix. Consider this:

[ 6 ]   [ 8 ]
[ 5 ] * [ 2 ]
[ 4 ]   [ 3 ]

Wait! Isn’t that what I just showed you earlier with multiplying two vectors? Yes. But consider them as matrices. Specifically, consider these as two vectors in the form of 3 × 1 matrices. In order to multiply them, they need to have the compatibility requirement satisfied. Sometimes you’ll hear this said as the matrices must have “matching forms.” So how do we do that? To find their dot product, we can transpose the first matrix to a 1 × 3 form.

          [ 8 ]
[6 5 4] * [ 2 ]
          [ 3 ]

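We could have transposed the second matrix instead and put it first; the dot product comes out the same. What we can’t do is transpose the second one and leave it on the right, because a 3 × 1 matrix times a 1 × 3 matrix gives a 3 × 3 matrix (the outer product) rather than a single number.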

We have to do this because any time we have a dot product, we always multiply a row vector by a column vector. Then the procedure is the exact same we looked at above — just multiply the corresponding elements and sum everything up.
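Here's a sketch of the same example with the vectors stored as 3 × 1 matrices (the names a and b are mine). Notice the answer is still 70, but now it comes back wrapped in a 1 × 1 matrix:

```python
import numpy as np

a = np.array([[6], [5], [4]])  # 3 × 1
b = np.array([[8], [2], [3]])  # 3 × 1

result = a.T @ b               # (1 × 3) times (3 × 1)

print(result)        # [[70]]
print(result.shape)  # (1, 1)
```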

Make sure it’s clear to you that I presented the exact same example but in two different contexts. That’s why I kept visualizations to a minimum for this part.

Visualizing Dot Products

Now let’s do that procedure from above with matrices and let’s also do it with code:
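A sketch of the setup, with the values taken from the row-by-column breakdown that follows in this section:

```python
import numpy as np

m1 = np.array([[6, 10, 1],
               [8, 12, 3]])
m2 = np.array([[9, 13, 8],
               [2, 5, 18]])

print(m1.shape, m2.shape)  # (2, 3) (2, 3)
```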

We have two matrices here. To get their dot product, we have to transpose one of them. Keep in mind that m2’s shape is (2, 3) right now. Do this:
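One way to do that in NumPy is the .T attribute. A minimal sketch, assuming m2 holds the 2 × 3 values used in the breakdown later in this section:

```python
import numpy as np

m2 = np.array([[9, 13, 8],
               [2, 5, 18]])

m2 = m2.T   # rows become columns

print(m2.shape)  # (3, 2)
print(m2)
```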

That applies a transpose operation. Now m2’s shape is (3, 2).

We thus have a 2 × 3 matrix and a 3 × 2 matrix, which means, as we saw earlier, that our resulting matrix will be 2 × 2.

Our first matrix has two row vectors and our second matrix has two column vectors. In order to find the dot product of the two matrices, we just have to find the dot product of the vectors they are made of. So let’s go back to visuals. We have a 2 × 3 matrix:

[ 6  10  1 ]
[ 8  12  3 ]

We also have a 3 × 2 matrix:

[  9   2 ]
[ 13   5 ]
[  8  18 ]

Let’s break down the operation:

dot(first row, first column)
(6 * 9) + (10 * 13) + (1 * 8)

dot(first row, second column)
(6 * 2) + (10 * 5) + (1 * 18)

dot(second row, first column)
(8 * 9) + (12 * 13) + (3 * 8)

dot(second row, second column)
(8 * 2) + (12 * 5) + (3 * 18)

The result is:

[ 192   80 ]
[ 252  130 ]

Let’s make sure we did that right by hopping back into code:
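Here's a sketch of that check, using the same numbers as the breakdown above. I'm writing the second matrix already in its transposed 3 × 2 form so the block stands on its own:

```python
import numpy as np

m1 = np.array([[6, 10, 1],
               [8, 12, 3]])
m2 = np.array([[9, 2],
               [13, 5],
               [8, 18]])

result = np.dot(m1, m2)   # same as m1 @ m2

print(result)
# [[192  80]
#  [252 130]]
```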

What you can see here is that row vectors determine the row in the output matrix while column vectors determine the column in the output matrix.

That’s (Basically) Deep Learning

What I just showed you above are the primary types of calculations that underpin all of what’s going on in deep learning. What we were doing above was tensor manipulation.

Now, yes, there are other bits that come into play, such as differentiation to apply techniques like gradient descent. But if you can get a handle on the above and feel comfortable with it, you have taken a very large step toward reducing the opaqueness that exists behind much of deep learning.

How The Math Becomes The Model

In the second post, I talked a bit about linear equations, which you might have recognized from your early school days with math. I want to show how some of the above material translates to specific linear algebra and the linear model. Keep in mind that the linear model is this:

f(x) = xw + b

Or, more commonly, this:

y = xw + b

As a reminder of what we talked about, x is our input, w is our coefficient (called weights, in the context of machine learning) and b is our intercept (called bias, in the context of machine learning).

Here’s an important point: your inputs, weights, and outputs will all be matrices.

That equation is the model for our purpose right now. The calculation of that expression will give us our output, y. It’s what we want our model to classify or predict.

Models in Action

But what are we going to apply this model to? Let’s consider an idea. Let’s say our goal is to predict how bored some blogger will make you. We might do that based on, say, the size of the blog posts. (Hmm. I wonder where I could have gotten this idea?) So the input x will be the size of the blog post and the output, y, will be the level of boredom.

Let’s say the size of the post is 3,000 words. The weight is, let’s say, 20 and the bias is 922. So we have this:

y = xw + b
y = 3000 * 20 + 922
y = 60,922

Let’s take a few more examples, using different inputs:

y = 2900 * 20 + 922
y = 58,922

y = 2500 * 20 + 922
y = 50,922

y = 2000 * 20 + 922
y = 40,922

Knowing the size of a given blog post, we can get a prediction of how bored you will be based on the linear model.
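Those calculations are simple enough to sketch in plain Python. The function name predict_boredom is mine; the weight of 20 and bias of 922 are the values used above:

```python
def predict_boredom(size: int, w: int = 20, b: int = 922) -> int:
    """Linear model y = xw + b for a single input."""
    return size * w + b

for size in (3000, 2900, 2500, 2000):
    print(size, "->", predict_boredom(size))
```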

Wait, we can?! From those numbers? Yeah, that’s the part that might seem strange. Let’s try to understand what we’re dealing with. What actually are those values of 20 and 922? Well, they’re just the values for the weights and bias. Yeah, great. But where did they come from?

The weights and the bias are initialized with values when the network starts operating. But initialized to what?

There are a couple of different initialization patterns for these values, but one simple and very common method is to pull random numbers from a normal distribution and then multiply them by a scaling factor to bring the majority of values into the interval between -1 and 1.
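Here's a rough sketch of that pattern in NumPy. The scaling factor of 0.1, the seed, and the shapes are all assumptions of mine for illustration; real libraries pick these according to their own schemes.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the run is repeatable
scale = 0.1                          # assumed scaling factor

# Two inputs feeding one output: a 2 × 1 weight matrix and a 1 × 1 bias.
weights = rng.normal(loc=0.0, scale=1.0, size=(2, 1)) * scale
bias = rng.normal(loc=0.0, scale=1.0, size=(1, 1)) * scale

print(weights)
print(bias)
```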

This is an important point to understand: these numbers are used as coefficients (weights) and thresholds (biases) but they are essentially “made up” in order to allow a linear equation to be parameterized and thus modulated.

But then what is that output? Well, that depends on what your model is measuring. Clearly the units that come out have to make sense in terms of what you are trying to classify or predict. Here “boredom” might be defined as any output above 53,000, for example.

Simple Models Scaled Up

So what would the simplest linear model be? Consider this visual again:

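Here n is the number of samples, m is the number of outputs, and k is the number of inputs. That makes x an n × k matrix, w a k × m matrix, b a 1 × m matrix, and the output y an n × m matrix.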

The simplest model would be one where n, m and k are 1. This would mean x is a 1 × 1 matrix, w is a 1 × 1 matrix, and b is a 1 × 1 matrix. So let’s say we had this:

x = [2]
w = [4]
b = [6]

Solving the linear model for that is pretty simple, right?

y = [2] * [4] + [6]
y = 14

The next simplest linear model would be where m (outputs) and k (inputs) are still both 1, but n (samples) is greater than one. This would mean x is a 3 × 1 matrix, w is a 1 × 1 matrix, and b is a 1 × 1 matrix. This is basically just adding more samples to be used as part of our calculation:

y = [2] * [4] + [6] = 14
y = [4] * [4] + [6] = 22
y = [6] * [4] + [6] = 30

Finally, the next simplest linear model would be where m (outputs) is still 1 but k (inputs) and n (samples) are both greater than one. This would mean x is a 3 × 2 matrix, w is a 2 × 1 matrix, and b is a 1 × 1 matrix. I’ll trust that at this point you can see how matrices tie into the linear equation.
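To make the last two cases concrete, here's a sketch in NumPy. The three-sample case uses the numbers from above. The two-input case uses values I made up (x2 and w2 are hypothetical), with w2 having one row per input so the shapes line up.

```python
import numpy as np

b = np.array([[6]])                 # 1 × 1 bias, shared by every sample

# Three samples, one input, one output.
x = np.array([[2], [4], [6]])       # 3 × 1
w = np.array([[4]])                 # 1 × 1
y = x @ w + b
print(y)         # [[14] [22] [30]] as a 3 × 1 matrix

# Three samples, two inputs, one output (made-up values).
x2 = np.array([[2, 1],
               [4, 3],
               [6, 5]])             # 3 × 2
w2 = np.array([[4],
               [2]])                # 2 × 1
y2 = x2 @ w2 + b
print(y2.shape)  # (3, 1)
```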

One More Example

I gave you a lot of numbers in that last section, so let’s just get a bit more visual with it here. Say you have two inputs. That means you will have two weights, one for each input. Consider:

x1 = size of post
x2 = word density

Those are the inputs to our model. Applying this to our linear model:

y (boredom) = (size of post * weight of size + word density * weight of word density) + b

Here x would be a 1 × 2 vector and w would be a 2 × 1 vector. Keep in mind that here we treat vectors as matrices and thus we know the shape of our output:

(1 × 2) * (2 × 1) = (1 × 1)

So let’s put some numbers to that. We have our x:

[ 3000  15 ]

And we have our w:

[  20 ]
[ 100 ]

What do we end up with?
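Here's a sketch of that calculation in NumPy, using a post size of 3000, a word density of 15, weights of 20 and 100, and the same bias of 922 as before:

```python
import numpy as np

x = np.array([[3000, 15]])    # 1 × 2: size of post, word density
w = np.array([[20],
              [100]])         # 2 × 1: one weight per input
b = 922

y = x @ w + b

print(y)        # [[62422]]
print(y.shape)  # (1, 1)
```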

This breaks down as:

y = (3000 * 20) + (15 * 100) + 922
y = 62,422

What all of this has shown you is that a linear model can represent relationships that are multi-dimensional.

Pretty cool, huh? What I just showed you here is literally the math that is going to be going on behind the scenes with the neural network we’re going to build starting in the next post.

Choosing Weights and Biases

I know the whole notion of how weights and biases are chosen can still seem very murky. I mentioned that there are different ways to choose the weights and biases, the most common of which by far is simply to initialize them with random values.

Going back to something I started this post with, however, you might find this next bit kind of interesting. You can also set the weights manually to get some certain specific behavior out of your model. As an example of this, you can use the weights and bias to make the model behave like a logic gate.

Here the inputs x1 and x2 would be fed in as either 0 or 1 and weights would likewise be scaled to give an output of 0 or 1. To create an OR gate, you would have:

w1 = 1, w2 = 1, b = 0

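To have an AND gate, you could do this (treating any output above 0 as 1 and anything else as 0, the same reading that makes the OR gate above work):

w1 = 1, w2 = 1, b = -1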
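Here's a minimal sketch of the idea in plain Python. I'm assuming the output is thresholded so that anything above 0 counts as 1 and anything else counts as 0; the function name gate is mine. For AND I use w1 = 1, w2 = 1, b = -1, which is one choice that works under that threshold.

```python
def gate(x1: int, x2: int, w1: int, w2: int, b: int) -> int:
    """Linear model followed by an assumed threshold at 0."""
    y = x1 * w1 + x2 * w2 + b
    return 1 if y > 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

for x1, x2 in inputs:
    or_out = gate(x1, x2, w1=1, w2=1, b=0)
    and_out = gate(x1, x2, w1=1, w2=1, b=-1)
    print(x1, x2, "OR:", or_out, "AND:", and_out)
```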

And if you go back to my table regarding the NOR gate earlier, I bet you can see how to construct that as well. If not, consider that a bit of an exercise.

What We Accomplished

This was a deep dive into some pretty shallow math. This math, however, was the basis for how many neural networks operate. My goal here was to make sure you had some transparency into what’s going on behind the scenes.

In the next post, we’ll go into the rest of the life cycle, getting our data in shape for feeding to the neural network that we’re going to write.


About Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.