AI Testing – Measures and Scores, Part 2

In the first part of this post, I used a simple binary classification task to show some ideas around measures and scores and then provided some running commentary on how the tester mindset and skillset fit into that context. That post was about depth; this post will be more about breadth.

One thing I want to focus on in this post is the data, since that aligns well with a stronger focus on testing. In the first post, my running example of binary classification treated the data purely as data, without worrying about what it actually meant. Here my goal is still to keep the examples as simple as possible but put some context around the data.

As in the first post, I’ll be using Python since that’s often the simplest and most common ecosystem testers will be working in. However, also as in the first post, even if you don’t plan to try and run the examples yourself, I’ll make sure to explain what’s happening and why it matters.

To Spam Or Not To Spam?

To start off, I’ll use the traditional example: classifying whether an email is spam or not. To illustrate this, I’ll use a simple example of what’s called a “Naive Bayes classifier” to demonstrate evaluation measures and scores.

You’ll need to have the “numpy” and “scikit-learn” libraries available to run the example.

So here I have a list of email texts and their corresponding labels. The labels here are 0 (which means “not spam”) and 1 (which means “spam”). I’m using the CountVectorizer to convert the text data into a numerical feature representation using what’s called a bag-of-words approach.

A “bag-of-words” approach is a simple and pretty widely used technique in natural language processing. The idea is to represent text as collections of individual words, disregarding grammar and word order.
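To make that idea concrete, here’s a minimal sketch of a bag-of-words in plain Python. The sample sentences and the `bag_of_words` helper are my own illustrations, not part of the example code, and the tokenization is deliberately crude (lowercase, then split on whitespace).

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each word appears, ignoring order and grammar."""
    return Counter(text.lower().split())


first = bag_of_words("Free money now")
second = bag_of_words("now money free")

# The word order differs, but the two "bags" are identical.
same_bag = first == second
print(first)
print(same_bag)
```

The point to notice is that both sentences end up with exactly the same representation, which is what “disregarding grammar and word order” means in practice.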

Next, I split the data into training and test sets as I showed with the example in the first post. In that previous example I used a Logistic Regression to train. For this example, I’m going to train a Naive Bayes classifier (MultinomialNB) on the training data and make predictions on the test data.

Wait. Let’s make sure we know what a “count vectorizer” is. Count vectorization is a technique used in natural language processing to convert text data into numerical representations that machine learning models can process. In this example, scikit-learn’s CountVectorizer transforms the email text into a matrix of word counts that can be fed into the Multinomial Naive Bayes classifier.

One thing I see a lot of testers do is simply accept terms without having any idea what they mean. It’s really important to make sure that you know the terminology of the domain you are testing within. You may not have to know it as well as an expert (I rarely do!) but you do need to have some idea of what’s being talked about.

Finally, I calculate various evaluation measures using the predicted labels (y_pred) and the true labels (y_test). And thus what the example demonstrates is the calculation of measures — accuracy, precision, recall — as well as the F1 score. Hopefully you can see how this is a slight expansion of the concepts behind the example I used in the first post.
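For readers following along, here’s a minimal sketch of the steps just described: vectorize, split, train, predict, measure. The email texts, the labels, the split size, and the random seed are all assumptions of mine for illustration, so the numbers this prints won’t necessarily match the output discussed in this post.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical emails and labels (0 = not spam, 1 = spam).
emails = [
    "Get rich quick with this guaranteed offer",
    "Free money waiting for you, claim it now",
    "Win a free prize today",
    "Cheap pills, buy now",
    "Meeting agenda for Monday attached",
    "Can we move lunch to tomorrow?",
    "Here are the project notes you asked for",
    "We will never ask for your password",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag-of-words: each email becomes a row of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Hold out a quarter of the emails for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)

print("Accuracy Measure:", accuracy)
print("Precision Measure:", precision)
print("Recall Measure:", recall)
print("F1 Score:", f1)
```

With a data set this small, the test set is only two emails, so a single wrong prediction moves every measure a long way. That’s worth keeping in mind when you reason about the results.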

Just How Naive Is This Bayes Thing?

It might help to know that a Naive Bayes algorithm is based on a statistical method called Bayes’ theorem, which calculates the probability of an event happening given some evidence.

In the context of a Naive Bayes classifier like we’re looking at here, imagine you have a set of labeled examples — data points, essentially — with different features. The algorithm analyzes these examples to learn patterns and relationships between the features and their corresponding labels.

So what’s the “naïve” part? That qualifier comes from the assumption that the features are independent of each other.

This means that the presence or absence of one feature doesn’t affect the presence or absence of another feature. This assumption most certainly doesn’t always hold true in real-world scenarios but it equally certainly simplifies the calculations and makes the algorithm computationally efficient.
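Here’s a small worked sketch of what that independence assumption buys you. Every probability below is made up purely for illustration; none of it comes from the trained model in this post.

```python
# Made-up numbers, purely for illustration.
prior = {"spam": 0.5, "not_spam": 0.5}
word_given_class = {
    "spam": {"free": 0.4, "money": 0.3},
    "not_spam": {"free": 0.1, "money": 0.1},
}


def posterior_spam(words):
    """Return P(spam | words) under the naive independence assumption."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            # The "naive" step: multiply per-word probabilities as if the
            # words had nothing to do with each other.
            score *= word_given_class[label][word]
        scores[label] = score
    # Bayes' theorem: normalize so the two posteriors add up to 1.
    return scores["spam"] / (scores["spam"] + scores["not_spam"])


p_spam = posterior_spam(["free", "money"])
print(round(p_spam, 3))  # 0.923
```

Without the independence assumption, you would need a probability for every combination of words, which gets out of hand very quickly. With it, you only need one probability per word per class.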

Execute the Model

If you run the above model, the output will be the following:

  Accuracy Measure: 0.5
  Precision Measure: 0.5
  Recall Measure: 1.0
  F1 Score: 0.6666666666666666

Okay, so you’re a tester and you’re being asked to reason about these results and provide assessments about the model.

What does the above tell you? Think about this before reading on.

Keep in mind that the output you obtained from running the code indicates the performance metrics of the Multinomial Naive Bayes model on the test set, which is made up of a sample set of email text.

Wait. We’re getting into some terminology soup again so let’s level-set on multinomial. A multinomial distribution is a generalization of the binomial distribution.

Binomial refers to a situation where there are two possible outcomes or categories. For example, flipping a coin can result in either heads or tails. So, in that context, a binomial distribution is used to describe the probability of getting a certain number of successes — let’s say “coin lands head side up” — in a fixed number of trials. Here the “trials” refer to coin flips.

Multinomial, on the other hand, refers to a situation where there are more than two possible outcomes or categories. A lot of people illustrate this with rolling a die, but that’s not entirely accurate. Rolling a single polyhedral die once, as in a game like Dungeons & Dragons, would not be an example of a multinomial context.

That would just provide what’s called a categorical distribution. Rolling such a die multiple times (or rolling several such dice) and counting how often each face comes up, however, would give you a multinomial distribution, because you are in essence conducting multiple independent trials and each trial has multiple possible outcomes.
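If it helps to see the distinction, here’s a minimal sketch using numpy. The six-sided die, the seed, and the number of rolls are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
faces = 6
probabilities = np.full(faces, 1 / faces)

# One roll: a single trial with six possible outcomes (categorical).
# The result is a vector with a single 1 marking the face that came up.
single_roll = rng.multinomial(1, probabilities)

# Ten rolls: how many times each face came up across ten independent
# trials (multinomial).
ten_rolls = rng.multinomial(10, probabilities)

print(single_roll, single_roll.sum())
print(ten_rolls, ten_rolls.sum())
```

This is also why the classifier is called “multinomial”: each email is treated as a set of word counts, much like the counts of faces across many rolls.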

Okay, so back to you as the tester that has to tell someone what to make of those results. Have you thought about what you might say?

Well, let’s look at our output.

  • The accuracy measure is calculated by dividing the number of correctly predicted samples by the total number of samples. In this case, the accuracy is 0.5, which means that 50% of the test samples were classified correctly by the model.
  • The precision measure is the proportion of true positive predictions out of the total predicted positives. It measures the model’s ability to avoid false positives. The precision in this case is 0.5, indicating that 50% of the predicted spam emails were actually spam.
  • The recall measure, also known as sensitivity or true positive rate, is the proportion of true positive predictions out of the total actual positives. It measures the model’s ability to identify all positive samples. The recall is 1.0, meaning that the model correctly identified all the actual spam emails.
  • The F1 score is the harmonic mean of precision and recall. It provides a balanced measure of the model’s performance, taking into account both precision and recall. The F1 score in this case is 0.6667.
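To check your understanding of those four numbers, you can work them out by hand. The confusion counts below are an assumption on my part: a two-email test set with one spam email caught and one non-spam email wrongly flagged is the simplest situation that produces exactly these values.

```python
# Assumed confusion counts for a two-email test set.
true_positives = 1   # spam correctly predicted as spam
false_positives = 1  # non-spam wrongly predicted as spam
false_negatives = 0  # spam wrongly predicted as non-spam
true_negatives = 0   # non-spam correctly predicted as non-spam

total = true_positives + false_positives + false_negatives + true_negatives

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.5 0.5 1.0 0.6666666666666666
```

Seen this way, a recall of 1.0 is less impressive than it sounds: a model that labels everything as spam would get exactly these numbers on this test set.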

Given all of that, as a tester in this context, would you agree with the following high-level summary?

The results indicate that the model achieved moderate performance.

If your answer is yes, I would tend to agree.

The results show us that the model correctly identified all the actual spam emails (high recall). Yet the precision is relatively low, which indicates that some non-spam emails were incorrectly classified as spam. The accuracy score reflects the overall correct classification rate of the model, which is 50% in this case.

So here’s how you might frame your test results:

“We got an F1 score of 0.6667 and this suggests that the model is relatively good at correctly identifying the actual spam emails (recall of 1.0), but it also incorrectly identifies some non-spam emails as spam (precision of 0.5). So, while the model is effective at catching the spam emails in this test set, it still has room for improvement in reducing false positives (identifying non-spam emails as spam).”

Testing Provides Insight

By running this code and observing the evaluation metrics, as a tester, you can discuss with your delivery team the importance of each metric and how those metrics provide different perspectives on the model’s performance.

Key to this discussion is understanding that accuracy represents the overall correctness of the predictions, precision captures the proportion of predicted positive instances that are actually positive, and recall measures the proportion of actual positive instances that were correctly predicted.

The F1 score combines precision and recall into a single metric that balances both measures. I covered how the F1 score works in the first post, so I won’t belabor it here.

I would ask you to keep in mind the controllability aspect with this example. And with that in mind, as a tester, let me ask you this: how could you subvert the model? How could you change the data so that the model started drifting in terms of accuracy?

If you can answer that question, you’re able to see how to lead the model to learn incorrect patterns and thus to make incorrect predictions.

But why would you want to do that? This is very much like the situation of “creating a failing test” in various other testing contexts. You want to be able to see that a test can be confirmed and falsified. So if I have a working test, I want to change conditions such that the test fails. If the test doesn’t fail, but I expect it to, that means I shouldn’t trust my tests!

Instrumenting the model to have it produce incorrect results would compromise the quality of predictability, of course. But it’s really important to have the observability that allows you to see this in the first place and the controllability to test out any modifications you, or your team, want to make.
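One way to try this out is to deliberately corrupt the training labels and watch what happens to a prediction you’re sure about. Here’s a minimal sketch of that idea. The emails, the labels, and the probe email are hypothetical, and flipping every label is the bluntest possible corruption; in practice you would likely flip only a fraction.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data (1 = spam, 0 = not spam).
train_emails = [
    "free money now",
    "win free prize",
    "meeting agenda attached",
    "lunch tomorrow noon",
]
true_labels = [1, 1, 0, 0]

# The "subverted" data: same emails, every label inverted.
flipped_labels = [1 - label for label in true_labels]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)

# An email that should obviously be classified as spam.
probe = vectorizer.transform(["free prize money"])

clean_prediction = MultinomialNB().fit(X_train, true_labels).predict(probe)[0]
poisoned_prediction = MultinomialNB().fit(X_train, flipped_labels).predict(probe)[0]

print("Trained on true labels:", clean_prediction)
print("Trained on flipped labels:", poisoned_prediction)
```

The model trained on the flipped labels has learned exactly the wrong pattern, and because you controlled the change, you can observe the effect directly.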

That’s how we keep these models testable.

Test the Spam

Speaking of being testable, let’s jump into a little testing here. Let’s frame the email spam example around pytest to demonstrate how to test the model with different data.

If you want to follow along, note that you’ll need the “pytest” test runner, although you can use any test runner you want.

In this example, I’m using the @pytest.mark.parametrize decorator. I’m essentially saying there’s a test condition (“email, expected label”) and a series of data conditions (various email texts) I want to apply to that test condition. Thus each test represents a different example that I want to classify as spam or not spam.

Within the test function, I’m transforming the email into a feature vector using the same CountVectorizer that was used during training. Then I’m using the trained model to predict the label for the email test condition.

Any test needs some observation to make it an actual test and so I’m asserting that the predicted label matches the expected label for each data condition when applied to the test condition.

By running the tests, you can verify whether the model correctly classifies different example emails as spam or not spam based on the expected labels. And what you should see is that all tests pass.
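Here’s a minimal sketch of what a test like the one just described could look like. The training emails and the four data conditions are hypothetical stand-ins of mine, as is the name `test_email_classification`. Save it in a file whose name starts with “test_” and run it with pytest.

```python
import pytest
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data (1 = spam, 0 = not spam).
TRAIN_EMAILS = [
    "win free money now",
    "claim your free prize",
    "cheap pills buy now",
    "meeting agenda for monday",
    "lunch with the team tomorrow",
    "project report attached",
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]

# Train once; every test reuses the same vectorizer and model.
vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(TRAIN_EMAILS), TRAIN_LABELS)


@pytest.mark.parametrize(
    "email, expected_label",
    [
        ("team meeting tomorrow", 0),
        ("project agenda attached", 0),
        ("buy cheap pills", 1),
        ("win a free prize now", 1),
    ],
)
def test_email_classification(email, expected_label):
    # Use the SAME vectorizer that was fitted during training.
    features = vectorizer.transform([email])
    predicted_label = model.predict(features)[0]
    assert predicted_label == expected_label
```

Note that the test calls `transform` and not `fit_transform`. Refitting the vectorizer inside the test would build a different vocabulary and the model’s learned word probabilities would no longer line up with the features.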

One thing I do want to call out here: the notion of “training” and “testing” got a little mixed together here, right? You could be forgiven for saying “Jeff, you essentially just shoved part of the model execution into a test.” Is that the case?

Are We Actually Testing?

Is there any concern with what you see above? Consider a diff of the two examples:

You might want to open the image in a new tab to get its full size. Think about what you’re seeing there. Is there a problem? Or does it, in fact, make complete sense? Can you frame a narrative around what someone might see the problem to be? What’s one thing that stands out as differing between the two code examples?

Clearly the data in the test is different than the data in the model. The email text examples are different. Is that a bad or a good thing?

Here’s how I would frame this for someone who was skeptical.

The original code is focused on training a Naive Bayes classifier on a dataset and evaluating its performance using metrics such as accuracy, precision, recall, and F1 score. It demonstrates how to train a model, make predictions, and evaluate its quality using those metrics.

The modified code is focused on testing by using the trained model to predict the label for each test case and compare it with the expected label using assertions. A key purpose of the test cases is to verify that the model correctly predicts the labels for different types of emails, including those it has not seen during training.

Ah, But Can It Fail?

Is there any way to make that test fail, just so we can prove that the test is doing what we think it is?

Well, take a look at it. How would you do that?

You can intentionally modify one of the test cases to make it fail. For example, you can change the expected label to a value that’s different from the predicted label. Here’s an updated version of the test condition part that will intentionally fail:

Do you see the change? It’s in the last email data condition. The “1” has been changed to a “0”. If you run this, you should see that the first three tests pass but the fourth one fails. Take some time to make sure you understand why it fails.

Email: Explain / Interpret

In the context of the email spam example, explainability and interpretability refer to the ability to understand and interpret the decisions made by the model regarding the classification of emails as spam or not spam. Let’s dig a little into how these concepts relate to the provided code and example.

Explaining Our Model

Let’s start with explainability. Explainability focuses on providing insights into the decision-making process of the model. In the context of email spam classification, explainability tries to help us answer questions such as “Why did the model classify this email as spam?” or “What were the important features that led to the model’s decision?”

In the code example, explainability can be enhanced by analyzing the features used by the model to make predictions. One common approach is to examine the coefficients or weights associated with each feature in the trained model.

Whoa! What does that last part actually mean? In a Naive Bayes model, the coefficients or weights are the learned probabilities of each word appearing in spam or in non-spam emails. Naive Bayes estimates a probability for each word based on how often it occurs in the spam and non-spam training emails. These probabilities are then combined to calculate the likelihood of an email being spam or non-spam.

Here’s a modified example that incorporates some of this:

With that in place, you’ll get output like this:

Accuracy Measure: 0.5
Precision Measure: 0.5
Recall Measure: 1.0
F1 Score: 0.6666666666666666

Top 5 words indicating spam:
guaranteed (Probability: 0.06896551724137931)
rich (Probability: 0.06896551724137931)
quick (Probability: 0.06896551724137931)
money (Probability: 0.06896551724137931)
free (Probability: 0.06896551724137931)

Top 5 words indicating non-spam:
your (Probability: 0.06666666666666667)
we (Probability: 0.06666666666666667)
ask (Probability: 0.06666666666666667)
password (Probability: 0.06666666666666667)
never (Probability: 0.06666666666666667)

Key to this is that I’m using the feature_log_prob_ attribute of the trained MultinomialNB model. That attribute holds the logarithm of the estimated probability of each word given each class. By exponentiating those values, I’m able to retrieve the probabilities of the top five words for spam and non-spam, along with their corresponding feature names.

Now that, my friends, is an example of controllability and observability in action!

In addition to the above, techniques like LIME or SHAP — which I talked about in the first post — can be used to provide local explanations for individual predictions. These methods would highlight the important features that influenced the model’s decision for a specific email, helping to explain why that particular email was classified as spam or not.

Interpreting Our Model

What about interpretability? Interpretability refers to the ability to understand and make sense of how the model works and behaves. It involves building a mental model or intuition about how the model operates and what factors contribute to its predictions.

In the email spam example, interpretability can be achieved by examining the underlying algorithm and the feature representation used. For instance, understanding the workings of a Naive Bayes classifier and its reliance on the bag-of-words representation can provide insights into how the model considers the presence or absence of specific words to classify emails.

If your delivery team is being asked to interpret the model and why your customers might feel comfortable using it, what might you say?

As a tester, you may argue that this interpretation is not your job. I would say that it’s not your job alone. But, as I hope these posts are showing, testing is a key part of how we trust our ability to explain and interpret.

One thing you could focus on is the feature representation. Specifically, you could explain how the textual data is transformed into a numerical representation that can be used by the Naive Bayes classifier. In our example, the count vectorizer is used to convert the text emails into a matrix of token counts. Each email is represented as a vector of word frequencies.

You could also talk a little about the model training. You could specifically mention that during training, the classifier learns the probability distribution of each word for each class based on the observed frequencies.

You should talk about the Naive Bayes algorithm itself. You could explain the underlying algorithm of the Naive Bayes classifier and mention that the classifier applies Bayes’ theorem to calculate the posterior probabilities of each class given the observed word frequencies.

It would be worth calling out that the model assumes the occurrence of each word is independent of the occurrences of other words, which is the simplifying (“naive”) assumption being made.

You can talk about interpreting specific model parameters. Here you could discuss the interpretable aspects of the trained model. In the case of Naive Bayes, you could mention that the learned probabilities of different words indicate their importance or discriminative power for distinguishing between spam and non-spam emails. Higher probabilities for a particular word in the spam class suggest that the word is more indicative of spam, and vice versa.

You can talk about the model evaluation. You could, for example, highlight the performance metrics used to evaluate the model, such as accuracy, precision, recall, and the F1 score. You could talk about what each metric measures and how those measures provide insights into the model’s effectiveness in classifying emails.

Explain and Interpret Go Together

All of that interpretability is focusing on the feature representation, the training process, and the interpretability of the Naive Bayes classifier. That’s pretty technical, though, right? Not all audiences will need or want that level of detail. Your customers, for example, probably would not.

Well … hold on. Is that true? What if your customer was a solution provider that wanted to incorporate your model? But your model may be just one of many that they’re considering. In this case, that kind of customer may appreciate some of the details around the interpretability.

Yet, again, many other customers may simply want to know if they can trust your model. And that’s where the explainability comes in.

So you have to interleave the two aspects and then decide which to put emphasis on given the particular situation you are dealing with.

Beyond even those examples, and going back to something I showed you in the first post, visualizations or summaries of the learned model parameters, such as word probabilities, can be really helpful in interpreting the model’s behavior.

Note that “ham” is often used to refer to non-spam emails.

By analyzing these parameters, testers — working with the delivery team — can use those visuals to gain a better understanding of the features and characteristics that the model leverages to make spam classification decisions.

Sometimes visualizations serve as a good bridge between the interpretability and explainability.

These kinds of visuals allow for a way to focus on qualitative aspects while still being able to dig into quantitative aspects. However, as warned about in the first post, it’s crucial to avoid letting these visuals take on a life of their own where they become more determinative of decision-making than the actual evaluation measures and scores that are providing the quantitative understanding of what is actually happening.

Visual Classification

Let’s consider a different example, this time in the context of image classification. I’ll use the popular MNIST dataset, which consists of grayscale images of hand-drawn digits from 0 to 9.

I’ll train a simple convolutional neural network model and evaluate its performance using evaluation measures and scores. This kind of model was developed based on how our brains seem to process visual information, essentially by recognizing patterns and shapes in images.

These models process a given image through a series of what are called “convolutional layers.” Each layer slides small filters over the image to identify features. Features here might be edges, corners, or textures. As the layers stack up, the model combines these features to recognize more complex shapes.

The term “convolution” refers to a mathematical operation that combines two functions to produce a third function that represents how the shape of one is modified by the other. The design of these networks, rather than the term itself, was inspired by what occurs in the visual cortex during visual processing, which allows the human brain to make sense of spatial relationships.
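To take some of the mystery out of the operation, here’s a minimal sketch of a single filter sliding over a tiny image, written with plain numpy. The 4x4 “image” and the 2x2 filter are made up for illustration. Like most deep learning libraries, this sketch doesn’t flip the filter, so strictly speaking it’s a cross-correlation.

```python
import numpy as np


def convolve2d_valid(image, kernel):
    """Slide the kernel over the image and sum the elementwise products."""
    kernel_height, kernel_width = kernel.shape
    out_height = image.shape[0] - kernel_height + 1
    out_width = image.shape[1] - kernel_width + 1
    output = np.zeros((out_height, out_width))
    for row in range(out_height):
        for col in range(out_width):
            patch = image[row:row + kernel_height, col:col + kernel_width]
            output[row, col] = np.sum(patch * kernel)
    return output


# A tiny image: dark on the left (0), bright on the right (1).
image = np.array(
    [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ],
    dtype=float,
)

# A filter that responds where brightness increases from left to right.
kernel = np.array([[-1, 1], [-1, 1]], dtype=float)

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
```

The resulting feature map is zero everywhere except the middle column, which is exactly where the vertical edge sits. In a real network the filter values aren’t hand-picked like this; they’re learned during training.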

I’m going to show this example in two ways: one using Tensorflow and another using PyTorch. I’m doing this because you will often come across different approaches depending on the development team or data scientist team that you’re working with and the tooling that they are most familiar with.

If you’re on Mac or Linux, you should have no problem installing the tensorflow package. Windows users, on the other hand, can be in for a rough ride. If you’re using the Anaconda distribution for Python that can ease things considerably. If tensorflow doesn’t work for you, just use the PyTorch version.


Here’s the Tensorflow version and it’s worth noting that this code will have to download information, specifically the MNIST data set.

In this example, I’m loading the MNIST dataset and preprocessing the data by reshaping the images and normalizing the pixel values. I then split the data into training, validation, and test sets. Then I define a simple convolutional neural network model using Keras, compile it with appropriate settings, and train the model on the training set while monitoring the validation performance.

Keras here refers to a high-level neural network library that has a specific focus on deep learning models. Keras’ claim to fame is that it provides a set of modular APIs that simplify the process of developing neural networks.

What I just described gets represented in output like this:

Epoch 1/5
375/375 [==============================] - 7s 19ms/step - loss: 0.3948 - accuracy: 0.8933 - val_loss: 0.1898 - val_accuracy: 0.9464
Epoch 2/5
375/375 [==============================] - 6s 15ms/step - loss: 0.1466 - accuracy: 0.9577 - val_loss: 0.1173 - val_accuracy: 0.9688
Epoch 3/5
375/375 [==============================] - 6s 15ms/step - loss: 0.0976 - accuracy: 0.9729 - val_loss: 0.0933 - val_accuracy: 0.9746
Epoch 4/5
375/375 [==============================] - 6s 15ms/step - loss: 0.0757 - accuracy: 0.9787 - val_loss: 0.0808 - val_accuracy: 0.9769
Epoch 5/5
375/375 [==============================] - 6s 16ms/step - loss: 0.0633 - accuracy: 0.9827 - val_loss: 0.0711 - val_accuracy: 0.9797
313/313 [==============================] - 0s 1ms/step

Just to put some context around that output, when you see something like “375/375 [==============================] - 7s 19ms/step”, that represents the progress of the training process. It indicates that 375 out of a total of 375 batches have been processed for the current epoch. The bar in square brackets shows that progress visually. The subsequent values “7s” and “19ms/step” represent the time taken for the epoch and the average time per step (or batch), respectively.

An epoch refers to one complete pass through the entire training dataset during the training phase.
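If you’re wondering where numbers like 375 and 313 come from, a little arithmetic gets you there. The MNIST split sizes are standard, but the 20% validation split and the batch size of 128 are assumptions on my part; they’re simply values that are consistent with the output shown. The batch size of 32 is the Keras default when making predictions.

```python
import math

train_images = 60_000    # size of the standard MNIST training split
test_images = 10_000     # size of the standard MNIST test split
validation_percent = 20  # assumption: 20% of training data held out
batch_size = 128         # assumption: batch size used for training
predict_batch_size = 32  # Keras default batch size for predictions

training_samples = train_images * (100 - validation_percent) // 100
steps_per_epoch = math.ceil(training_samples / batch_size)
prediction_steps = math.ceil(test_images / predict_batch_size)

print(training_samples)  # 48000
print(steps_per_epoch)   # 375
print(prediction_steps)  # 313
```

This is a handy sanity check for a tester. If the step count doesn’t match what you expect from the data size and batch size, then the model may not be training on the data you think it is.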

A line like “loss: 0.3948 - accuracy: 0.8933 - val_loss: 0.1898 - val_accuracy: 0.9464” shows the training metrics for the current epoch.

Here “loss” represents the training loss value, which measures the model’s error during training. The “accuracy” indicates the training accuracy, which represents the proportion of correctly classified samples during training. The “val_loss” represents the validation loss, which measures the model’s error on a separate validation set. And, finally, the “val_accuracy” indicates the validation accuracy, which represents the proportion of correctly classified samples on the validation set.

What you’re getting there is information about the training progress, including the loss and accuracy values for each epoch on both the training and validation sets. The final “313/313” line is the progress of the prediction pass over the test set. This is what allows you to monitor the performance of the model during training.

After training, I then evaluate the model on the test set by making predictions (y_pred) and converting the predictions and true labels to class indices (y_pred_classes and y_true_classes). Finally, I calculate our by-now-familiar evaluation metrics such as accuracy, precision, recall, and F1 score. You should see output like this:

Accuracy Measure: 0.9805
Precision Measure: 0.9805278655924867
Recall Measure: 0.9805
F1 Score: 0.9804860876778115

As we did with the email spam example, consider how you would explain the above to someone who wanted to know the results.

Let’s break it down a little.

  • The accuracy measure represents the proportion of correctly classified samples in the test dataset. In this case, the accuracy measure is 0.9805, which means that the model achieved an accuracy of 98.05%. It correctly classified approximately 98.05% of the test samples.
  • The precision is a measure of how many of the samples the model assigned to a class actually belong to that class. With ten digit classes there is no single “positive” class, so each digit is treated as the positive class in turn and the per-class values are averaged. The precision measure is 0.9805278655924867, which indicates that when the model predicts a given digit, it’s right about 98% of the time.
  • The recall (as stated earlier, also known as sensitivity or true positive rate) measures how well the model captures all the samples of each class, again averaged across the digits. The recall measure is 0.9805, which means that the model correctly identified approximately 98.05% of the samples of each digit on average.
  • The F1 score is the harmonic mean of precision and recall. It provides a balanced measure of the model’s performance by considering both precision and recall. The F1 score in this case is 0.9804860876778115, indicating a good balance between precision and recall.
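Since there are ten classes here rather than two, the precision and recall have to be averaged across the classes in some way. Here’s a minimal sketch with a made-up three-class example. I’m assuming support-weighted averaging (`average="weighted"`), which is one of several options scikit-learn offers; the fact that the recall above is identical to the accuracy is consistent with that choice, but it is an assumption.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for three classes, two samples of each.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # one sample of class 1 mistaken for class 2

accuracy = accuracy_score(y_true, y_pred)

# Each class takes a turn as the "positive" class; the per-class values
# are then averaged, weighted by how many true samples each class has.
precision = precision_score(y_true, y_pred, average="weighted")
recall = recall_score(y_true, y_pred, average="weighted")

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

With weighted averaging, the recall always comes out equal to the accuracy, so those two numbers aren’t giving you independent evidence. If you want to know whether the model struggles with one particular digit, you need the per-class values instead.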

So what is that telling us? How do we summarize all that?

Overall, the results demonstrate that the trained model performs well on the test dataset, achieving high accuracy, precision, recall, and F1 score. These metrics suggest that the model is capable of accurately classifying the handwritten digits in the MNIST dataset with a very high level of performance.


Let’s try the PyTorch version.

You’ll need the “torch” and “torchvision” dependencies.

This one takes more effort to set up but, to be honest, I much prefer PyTorch to Tensorflow. From a testing standpoint, I find it gives much more controllability and observability.

Lots of stuff, huh? This modified code uses the PyTorch library to load the MNIST dataset. So this script will also download files although, in this case, it will store them in a data folder that it creates wherever you are running the script. This logic also builds a really simple convolutional neural network model, represented by the SimpleCNN class. The training loop, loss function, and optimizer are also adjusted accordingly.
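To give a feel for the shape of such a model, here’s a minimal sketch of a class like SimpleCNN. The layer sizes are my own assumptions and not necessarily the architecture behind the numbers in this post. It leaves out the data loading and the training loop, and it runs on random data so that nothing needs to be downloaded.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleCNN(nn.Module):
    """A deliberately small CNN for 28x28 grayscale images (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(16 * 14 * 14, 10)

    def forward(self, x):
        # (N, 1, 28, 28) -> conv + ReLU -> (N, 16, 28, 28) -> pool -> (N, 16, 14, 14)
        x = self.pool(F.relu(self.conv(x)))
        # Flatten everything except the batch dimension.
        x = torch.flatten(x, start_dim=1)
        # One score per digit class.
        return self.fc(x)


model = SimpleCNN()

# Four random "images" standing in for MNIST digits.
fake_batch = torch.randn(4, 1, 28, 28)
logits = model(fake_batch)
print(logits.shape)

# After training, the weights could be saved for later use:
# torch.save(model.state_dict(), "model.pth")
```

Passing a batch of random data through an untrained model like this is a cheap smoke test. It tells you nothing about quality, but it does confirm that the shapes line up before you spend time on training.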

A key thing I’m doing there is saving a file called “model.pth” in the directory where you’re running this script. I’ll come back to why that is.

If you run this, you should see output like the following:

Accuracy: 0.9803
Precision: 0.9803903443873958
Recall: 0.9803
F1 Score: 0.9802689719341039

Let’s say that your delivery team was choosing between Tensorflow and PyTorch. What would the above tell them?

The differences in accuracy, precision, recall, and F1 score between this and the Tensorflow version are minimal. The values are very close, indicating that both model evaluations have similar performance. Therefore, you and your team can conclude that the second output represents a comparable level of performance to the first output, and the model is consistently performing well on the test dataset.

So the point there is that by utilizing PyTorch, you can achieve similar functionality without relying on Tensorflow. Or vice versa. If the outputs were very different, that would be telling and you, along with your team, would have to look into that. It could be something as simple as us using one of the tools incorrectly or it could be something a bit deeper.
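You can make “very close” a little more precise with a quick comparison. This is just a sketch: the measures are the ones reported above, while the tolerance of 0.005 is an arbitrary threshold I picked for illustration. What counts as “close enough” is something you and your team would have to agree on.

```python
tensorflow_run = {
    "accuracy": 0.9805,
    "precision": 0.9805278655924867,
    "recall": 0.9805,
    "f1": 0.9804860876778115,
}
pytorch_run = {
    "accuracy": 0.9803,
    "precision": 0.9803903443873958,
    "recall": 0.9803,
    "f1": 0.9802689719341039,
}

TOLERANCE = 0.005  # hypothetical threshold for "comparable"

gaps = {name: abs(tensorflow_run[name] - pytorch_run[name]) for name in tensorflow_run}
comparable = all(gap <= TOLERANCE for gap in gaps.values())

# The MNIST test set has 10,000 images, so the accuracy gap can be
# expressed as a number of images.
test_set_size = 10_000
images_apart = round(gaps["accuracy"] * test_set_size)

for name, gap in gaps.items():
    print(f"{name}: differs by {gap:.6f}")
print("Comparable:", comparable)
print("Images apart:", images_apart)
```

Put in those terms, the two versions disagree on the overall count of correctly classified test images by just two out of ten thousand.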

Let’s Do More Actual Testing

We can apply the same approach of using pytest to evaluate the MNIST example. Here, in the interests of not making a long post even longer, I’ll just show this with PyTorch although you could certainly do something similar with the Tensorflow version.

Before running pytest against this example, you need to make sure you have saved the trained model weights in that “model.pth” file I mentioned earlier. This is pretty crucial and shows the use of an oracle in action although it is, of course, a provisional oracle. Part of what’s being tested here is the oracle.

In this test, I’m using the @pytest.mark.parametrize decorator as I did in the email spam example to specify the indices of the test examples. You can add more indices or modify them as needed. Inside the test function, I then load the pre-trained model, pass the input data through the model, and assert that the predicted label matches the target label.

If you run this with pytest, you’ll see that all tests pass. As we tried with the previous example, let’s see if the tests can fail. How do we do that?

Well, one thing that should certainly make the test fail is to modify the target labels (y_test) to create a mismatch with the model’s predictions. Here’s an example modification that would cause the test to fail for the first index:

Here I’m placing the modification after loading the model’s state dictionary. By doing that, I’m making sure that the model is loaded correctly before modifying the target label and running the test. And if you make this change, you should see that the first test fails.

Visualizing our Testing

Since this example relies on specific visual data, it’s probably worth looking at a visualization to augment the test reports. Consider this modification:

If you run this, you’ll get, for each test, a plot that shows what was expected from the test and what actual image data was processed.

Here’s a modified version that can show you the entire test set as one image:

This would generate output like this:

Obviously there is a whole lot about this code that I’m not explaining. This post would be even longer if I did that. But all of this, I believe, showcases an important point: while the test thinking and mindset remains the same in an AI context, you will have programmatic aspects that you reason about just as you do in a non-AI context.

This is where you, as a tester, either need to know how to do this yourself or pair-test/program with your developers, asking them for what you need and working with them to make sure it’s provided. There’s a vast spectrum of how “technical” — in a code-based sense — a tester needs to be. I personally err on the side of having as much programmatic knowledge as I can. But, again, there’s a vast spectrum.

Earlier I asked whether testers should become developers. I also suggested testers should act like developers. I also talked about testers and the technical abstraction stack.

The fact that I wrote all of that without an AI context just reinforces to me how much relevance it has — at least to me — in an AI context, where the abstractions are many and varied.

MNIST: Explain / Interpret

In the context of the MNIST example, explainability and interpretability revolve around understanding and interpreting the decisions made by the model in classifying handwritten digits. Let’s briefly discuss how these concepts relate to the code we just looked at.


As with the prior example, explainability refers to the ability to provide insights into why the model made a particular prediction or classification. It aims to answer questions like “Why did the model classify this image as a specific digit?” or “What were the important features or patterns that influenced the model’s decision?”

In the code, explainability can be enhanced by examining the learned weights and biases of the model. Analyzing the weights assigned to different pixels or features can help understand which parts of the image contribute most to the classification decision.

Additionally, techniques like saliency maps or gradient-based methods can be used to visualize the regions of the input image that have the most impact on the model’s prediction.

That’s an example of a saliency map. The saliency map was generated in this case by analyzing the gradients of the model’s output with respect to the input image pixels, which are the number images from the MNIST data set. By computing these gradients, you can determine how changes in the input pixels affect the model’s prediction. The magnitude of the gradients indicates the importance of each pixel in influencing the model’s decision.
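To make the gradient idea a bit more concrete, here’s a minimal sketch. Everything in it is an assumption for illustration: the “model” is a tiny two-layer network with random weights rather than the trained MNIST network, the “image” is random pixels, and the gradient is worked out by hand in NumPy rather than by a framework’s automatic differentiation. The names forward and saliency_map are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained digit classifier: a tiny two-layer
# network with random weights (784 pixels -> 16 hidden units -> 10 digits).
W1 = rng.normal(scale=0.1, size=(16, 784))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(10, 16))
b2 = np.zeros(10)


def forward(pixels):
    """Return the ten class scores for a flattened 28x28 image."""
    hidden = np.maximum(0.0, W1 @ pixels + b1)  # ReLU
    return W2 @ hidden + b2


def saliency_map(image):
    """Absolute gradient of the winning class score w.r.t. every input pixel."""
    pixels = image.reshape(-1)
    pre_activation = W1 @ pixels + b1
    predicted = int(np.argmax(forward(pixels)))
    # Chain rule by hand: d score / d pixels = W1^T (W2[predicted] * relu'(pre))
    relu_grad = (pre_activation > 0).astype(float)
    gradient = W1.T @ (W2[predicted] * relu_grad)
    return predicted, np.abs(gradient).reshape(28, 28)


image = rng.random((28, 28))  # random pixels standing in for an MNIST digit
predicted, saliency = saliency_map(image)
row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
print(f"predicted class {predicted}; most influential pixel at ({row}, {col})")
```

The result has the same 28 by 28 shape as the input, so it can be laid over the image. A large value means that nudging that pixel would move the winning score the most.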

Another technique you might use is an activation map.

The above is such an activation map generated on a representation of the MNIST data set. The idea behind this technique is that each layer in a convolutional neural network consists of multiple filters that learn to detect specific features or patterns in the input image. The activation map shows the response or activation of these filters for each such image, highlighting the regions of the image that activated the filters the most.
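As a rough sketch of what a single filter’s activation map is, consider the following. The assumptions are mine: the image is a synthetic vertical stroke rather than real MNIST data, and the filter is a hand-written edge detector, whereas in a trained network the filter values would be learned. The function name activation_map is hypothetical.

```python
import numpy as np


def activation_map(image, kernel):
    """Slide one filter over the image (valid cross-correlation), then apply ReLU."""
    k_rows, k_cols = kernel.shape
    out_rows = image.shape[0] - k_rows + 1
    out_cols = image.shape[1] - k_cols + 1
    response = np.zeros((out_rows, out_cols))
    for r in range(out_rows):
        for c in range(out_cols):
            patch = image[r:r + k_rows, c:c + k_cols]
            response[r, c] = np.sum(patch * kernel)
    return np.maximum(0.0, response)


# Synthetic 28x28 "digit": a vertical stroke, roughly like a handwritten 1.
image = np.zeros((28, 28))
image[4:24, 13:15] = 1.0

# Hand-written filter that responds to a dark-to-bright step from left to right.
# In a trained network, these values would be learned rather than chosen.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

activations = activation_map(image, vertical_edge)
strongest = np.unravel_index(np.argmax(activations), activations.shape)
print(activations.shape, strongest)
```

The filter only lights up along the left edge of the stroke and stays at zero everywhere else. That’s the sense in which an activation map shows where in the image a given filter found its pattern.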

These visual explanations, as well as others, can provide insights into which pixels or patterns in the image influenced the model’s decision.


Interpretability focuses on understanding how the model works and how it arrives at its predictions in a more general sense. Given the nature of the model we’re looking at here — a layered neural network — this involves building an understanding of the model’s architecture, internal representations, and reasoning processes.

In the MNIST example in particular, interpretability can be achieved by analyzing the structure and parameters of the trained model. For instance, examining the layers and activations in the convolutional neural network can provide insights into how the model captures and processes visual features. Understanding the architecture and operations of the model can aid in interpreting how it learns to recognize handwritten digits.

Additionally, visualizing intermediate representations or feature maps in the model can help gain insights into how different layers of the network extract and transform information. This can provide a deeper understanding of the model’s hierarchical feature learning process.
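Here’s a small sketch of what inspecting structure and intermediate representations can look like in code. It’s deliberately simplified and rests on assumptions of mine: a small fully connected network with random weights and hypothetical layer sizes, not the convolutional model from the example. In PyTorch, forward hooks are one way to capture the same kind of per-layer output.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical layer sizes for a small fully connected digit classifier.
layer_sizes = [784, 64, 32, 10]
layers = [
    (rng.normal(scale=0.05, size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]


def forward_with_trace(pixels):
    """Run the network and keep every layer's output for later inspection."""
    trace = []
    current = pixels
    for index, (weights, bias) in enumerate(layers):
        current = weights @ current + bias
        if index < len(layers) - 1:  # ReLU on hidden layers only
            current = np.maximum(0.0, current)
        trace.append(current)
    return trace


trace = forward_with_trace(rng.random(784))
for index, ((weights, bias), output) in enumerate(zip(layers, trace)):
    n_params = weights.size + bias.size
    active = int(np.count_nonzero(output))
    print(f"layer {index}: {n_params} parameters, output shape {output.shape}, "
          f"{active} non-zero activations")
```

Once each layer’s output is kept around like this, it can be plotted, compared across inputs, or asserted on in a test.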

The Ethics of It All

So this was a broad look at a couple of systems that got more realistic than the example in the first post. But notice that the thinking behind providing interpretations and explanations was the same in all cases.

The ultimate focus was on trustability and that trustability was enabled by and facilitated with testability.

That focus is part of the ethic. I talked about how creating explanations was part of the ethos of testing. These recent posts have provided a timely example of that, given the rapid focus on AI and machine learning.

In the context of an AI system, evaluation measures and scores can directly relate to ethical concerns. While traditional evaluation measures primarily focus on performance and accuracy, there is an increasing recognition of the importance of ethical considerations in evaluating AI systems.

These ethical concerns can arise from potential biases, fairness tradeoffs, privacy implications, or even just plain old unintended consequences of the AI system’s decisions or actions. Let’s break some of this down a little.

In terms of bias and fairness, the evaluation measures and scores can be used to assess the presence of bias or the lack of fairness in AI systems. As just one example, fairness metrics, such as disparate impact or equalized odds, can be used to evaluate if the system’s predictions or recommendations seem to exhibit what we would call discriminatory behavior. That could be as simple and innocuous as “seems to always favor the digit zero” or as insidious as “seems to improperly skew around protected attributes like race or gender.”
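To show what those two metrics boil down to, here’s a minimal sketch in plain Python. The labels, predictions, and the two groups (A and B) are made up for illustration, and the helper names are hypothetical. Disparate impact is computed here as the ratio of favourable-prediction rates between the groups, and equalized odds is reported as the gap in true positive rate and false positive rate between them.

```python
def selection_rate(y_pred):
    """Fraction of cases that received the favourable prediction (1)."""
    return sum(y_pred) / len(y_pred)


def true_and_false_positive_rates(y_true, y_pred):
    """Return (TPR, FPR) for binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


# Made-up labels and predictions for two hypothetical groups, A and B.
group_a_true = [1, 1, 1, 0, 0, 0, 1, 0]
group_a_pred = [1, 1, 0, 0, 1, 0, 1, 0]
group_b_true = [1, 1, 1, 0, 0, 0, 1, 0]
group_b_pred = [1, 0, 0, 0, 0, 0, 1, 0]

# Disparate impact: ratio of selection rates (group B relative to group A).
disparate_impact = selection_rate(group_b_pred) / selection_rate(group_a_pred)

# Equalized odds: TPR and FPR should match across groups, so report the gaps.
tpr_a, fpr_a = true_and_false_positive_rates(group_a_true, group_a_pred)
tpr_b, fpr_b = true_and_false_positive_rates(group_b_true, group_b_pred)
tpr_gap = abs(tpr_a - tpr_b)
fpr_gap = abs(fpr_a - fpr_b)

print(f"disparate impact: {disparate_impact:.2f}")
print(f"TPR gap: {tpr_gap:.2f}, FPR gap: {fpr_gap:.2f}")
```

A disparate impact of 1.0 and gaps of 0.0 would mean the two groups are treated alike by these particular measures. How far from that is acceptable is a judgment call that depends on context, which is exactly where a tester’s questions come in.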

Regarding privacy and, more broadly, data protection, evaluation measures and scores can focus on looking at the system’s compliance with privacy regulations and data protection standards. It’s very important for humans to appropriately handle and safeguard sensitive data. We consider that an important ethical consideration. It’s no less so — and arguably, even more so — the case for AI-enabled systems.

Some of the above issues can be unintentional. They are no less harmful for that, but there’s also the case of purposeful harm. We’ve been living with the notion of hacking and cybercrime for quite some time now. In the context of AI-enabled products or systems, we have to consider adversarial robustness. Here evaluation measures and scores can focus on assessing the system’s vulnerability to adversarial attacks or manipulation. This can involve evaluating the AI system’s resilience against intentional or malicious attempts to deceive or manipulate its behavior.
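One way to put a number on that resilience is to measure accuracy on deliberately perturbed inputs. The sketch below uses the fast gradient sign method (FGSM) as the attack, which is my choice for illustration rather than anything the examples in this post used. The model is an assumed linear classifier with random weights, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear classifier standing in for a trained model.
n_features = 20
weights = rng.normal(size=n_features)
bias = 0.0

# Synthetic inputs; labels follow the model's rule plus noise, so the
# clean accuracy is high but not perfect.
X = rng.normal(size=(200, n_features))
noise = rng.normal(scale=1.0, size=200)
y = (X @ weights + bias + noise > 0).astype(int)


def predict_proba(inputs):
    """Probability of class 1 under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(inputs @ weights + bias)))


def accuracy(inputs, labels):
    return float(np.mean((predict_proba(inputs) >= 0.5) == (labels == 1)))


def fgsm(inputs, labels, epsilon):
    """Step each input by epsilon in the direction that raises the loss."""
    # For logistic regression with cross-entropy loss, d loss / d x = (p - y) * w.
    grad = (predict_proba(inputs) - labels)[:, None] * weights[None, :]
    return inputs + epsilon * np.sign(grad)


epsilons = [0.0, 0.05, 0.1, 0.2]
results = [(eps, accuracy(fgsm(X, y, eps), y)) for eps in epsilons]
for eps, acc in results:
    print(f"epsilon={eps:.2f} accuracy={acc:.3f}")
```

The interesting measure isn’t any single accuracy value but how quickly accuracy falls as epsilon grows. A steep drop for a tiny perturbation is a sign of a model that’s easy to manipulate.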

Ethics Comes Down to Qualities

I mentioned in these posts that the primary enemy of quality is opacity. What I’ve tried to show is that evaluation measures and scores provide a way to counter that by focusing on transparency. This is done by using techniques focused on explainability and interpretability, which help us provide understandable justifications for an AI system’s outputs. This enables us, and our users or customers, to comprehend and trust the decision-making process.

There are certain qualities that are important in the arena of software development but we don’t often consider them qualities: trustability, understandability, explainability, interpretability, and so on. The recognition of these qualities has been a battle I’ve long fought in the industry and I believe that the current resurgence and democratization of AI makes all of this more important than it has ever been.

There are amazing possibilities here and there are staggering dangers. With no hyperbole or melodrama intended, there is no doubt in my mind that testing, as a specialist craft and discipline, is what will allow us to realize the most beneficial of the possibilities and avoid the most serious of the dangers.


This article was written by Jeff Nyman

