To Spam Or Not To Spam?
To start off, I’ll use the traditional example: classifying whether an email is spam or not. I’ll use a simple example of what’s called a “Naive Bayes classifier” to demonstrate evaluation measures and scores.
You’ll need to have the “numpy” and “scikit-learn” libraries available to run the example.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    from sklearn.model_selection import train_test_split

    emails = [
        "Greetings! This is a legitimate email.",
        "Get rich quick! Guaranteed FREE money!",
        "We will never ask for your password.",
        "Average cashout time is 15 min!!!",
    ]

    # 0 = not spam, 1 = spam
    labels = np.array([0, 1, 0, 1])

    # Turn each email into a vector of word counts (bag of words).
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    # Hold out half of the emails for testing, keeping the class balance.
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.5, stratify=labels, random_state=42
    )

    model = MultinomialNB()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)

    print("Accuracy Measure:", accuracy)
    print("Precision Measure:", precision)
    print("Recall Measure:", recall)
    print("F1 Score:", f1)

In this code, I use CountVectorizer to convert the text data into a numerical feature representation using what’s called a bag-of-words approach.
A “bag-of-words” approach is a simple and pretty widely used technique in natural language processing. The idea is to represent text as collections of individual words, disregarding grammar and word order.
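To make the bag-of-words idea concrete, here’s a minimal sketch in plain Python. This is not scikit-learn’s implementation; the `tokenize` and `bag_of_words` helpers and the two sample sentences are my own illustration. The tokenizing rule (lowercase, keep runs of two or more word characters) is meant to approximate what CountVectorizer does by default.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(documents):
    # Build a sorted vocabulary, then one vector of word counts per document.
    vocabulary = sorted({token for doc in documents for token in tokenize(doc)})
    vectors = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


docs = ["Free money! Free prizes!", "Money is not free."]
vocabulary, vectors = bag_of_words(docs)

print(vocabulary)  # ['free', 'is', 'money', 'not', 'prizes']
for vector in vectors:
    print(vector)  # [2, 0, 1, 0, 1] then [1, 1, 1, 1, 0]
```

Notice that word order and punctuation are gone. All that’s left is how many times each vocabulary word appears in each document.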
Next, I split the data into training and test sets as I showed with the example in the first post. In that previous example I trained a logistic regression model. For this example, I’m going to train a Naive Bayes classifier (MultinomialNB) on the training data and make predictions on the test data.
Wait. Let’s make sure we know what a “count vectorizer” is. A count vectorizer is a technique used in natural language processing to convert text data into numerical representations that machine learning models can understand and process. In this example, CountVectorizer transforms the email text data into a numerical format that can be fed into the Multinomial Naive Bayes classifier.
One thing I see a lot of testers do is simply accept terms without knowing what they mean. It’s really important to make sure that you know the terminology of the domain you are testing within. You may not have to know it as well as an expert (I rarely do!) but you do need to have some idea of what’s being talked about.
Finally, I calculate various evaluation measures using the predicted labels (y_pred) and the true labels (y_test). What the example demonstrates is the calculation of measures (accuracy, precision, recall) as well as the F1 score. Hopefully you can see how this is a slight expansion of the concepts behind the example I used in the first post.
Just How Naive Is This Bayes Thing?
It might help to know that a Naive Bayes algorithm is based on a statistical method called Bayes’ theorem, which calculates the probability of an event happening given some evidence. In the context of a Naive Bayes classifier like we’re looking at here, imagine you have a set of labeled examples (data points, essentially) with different features. The algorithm analyzes these examples to learn patterns and relationships between the features and their corresponding labels.
So what’s the “naïve” part? That qualifier comes from the assumption that, within a given class, the features are independent of each other. This means that the presence or absence of one feature doesn’t affect the presence or absence of another feature. This assumption most certainly doesn’t always hold true in real-world scenarios but it equally certainly simplifies the calculations and makes the algorithm computationally efficient.
Execute the Model
If you run the above model, the output will be the following:

    Accuracy Measure: 0.5
    Precision Measure: 0.5
    Recall Measure: 1.0
    F1 Score: 0.6666666666666666

Okay, so you’re a tester and you’re being asked to reason about these results and provide assessments about the model. What does the above tell you? Think about this before reading on. Keep in mind that the output you obtained from running the code indicates the performance metrics of the Multinomial Naive Bayes model on the test set, which is made up of a sample set of email text.
Wait. We’re getting into some terminology soup again so let’s level-set on multinomial. A multinomial distribution is a generalization of the binomial distribution. Binomial refers to a situation where there are two possible outcomes or categories. For example, flipping a coin can result in either heads or tails. So, in that context, a binomial distribution is used to describe the probability of getting a certain number of successes (let’s say “coin lands head side up”) in a fixed number of trials. Here the “trials” refer to coin flips.
Multinomial, on the other hand, refers to a situation where there are more than two possible outcomes or categories. A lot of people illustrate this with rolling a die, but that’s not entirely accurate. Rolling a single polyhedral die in a game like Dungeons & Dragons would not be an example of a multinomial context. That would just give you what’s called a categorical distribution. Rolling multiple such dice and counting how often each face comes up, however, would constitute a multinomial distribution because you are in essence conducting multiple independent trials and each of the dice has multiple possible outcomes.
Okay, so back to you as the tester that has to tell someone what to make of those results. Have you thought about what you might say? Well, let’s look at our output.
- The accuracy measure is calculated by dividing the number of correctly predicted samples by the total number of samples. In this case, the accuracy is 0.5, which means that 50% of the test samples were classified correctly by the model.
- The precision measure is the proportion of true positive predictions out of the total predicted positives. It measures the model’s ability to avoid false positives. The precision in this case is 0.5, indicating that 50% of the predicted spam emails were actually spam.
- The recall measure, also known as sensitivity or true positive rate, is the proportion of true positive predictions out of the total actual positives. It measures the model’s ability to identify all positive samples. The recall is 1.0, meaning that the model correctly identified all the actual spam emails.
- The F1 score is the harmonic mean of precision and recall. It provides a balanced measure of the model’s performance, taking into account both precision and recall. The F1 score in this case is 0.6667.
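If you want to check those four numbers by hand, here’s a small sketch in plain Python rather than scikit-learn. The two-email test set below is my assumption about what the split produced (one non-spam email and one spam email, both predicted as spam), chosen because it’s consistent with the output above. The helper names are mine.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true positives, false positives, false negatives, true negatives.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def safe_divide(numerator, denominator):
    # Mirrors the zero_division=0 choice in the example: undefined becomes 0.
    return numerator / denominator if denominator else 0.0


y_true = [0, 1]  # assumed: one non-spam email, one spam email
y_pred = [1, 1]  # assumed: the model called both of them spam

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = safe_divide(tp, tp + fp)
recall = safe_divide(tp, tp + fn)
f1 = safe_divide(2 * precision * recall, precision + recall)

print(accuracy, precision, recall, f1)
```

With one true positive and one false positive, precision is 1/2, recall is 1/1, and the harmonic mean of those two is 2/3.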
Do the results indicate that the model achieved moderate performance? If your answer is yes, I would tend to agree. The results show us that the model correctly identified all the actual spam emails (high recall). Yet the precision is relatively low, which indicates that some non-spam emails were incorrectly classified as spam. The accuracy score reflects the overall correct classification rate of the model, which is 50% in this case.
So here’s how you might frame your test results: “We got an F1 score of 0.6667 and this suggests that the model is relatively good at correctly identifying the actual spam emails (recall of 1.0), but it also incorrectly identifies some non-spam emails as spam (precision of 0.5). So, while the model is effective at catching most spam emails, it still has room for improvement in reducing false positives (identifying non-spam emails as spam).”
Testing Provides Insight
By running this code and observing the evaluation metrics, as a tester, you can discuss with your delivery team the importance of each metric and how those metrics provide different perspectives on the model’s performance. Key to this discussion is understanding that accuracy represents the overall correctness of the predictions, precision captures the proportion of predicted positive instances that really are positive, and recall measures the proportion of actual positive instances correctly predicted. The F1 score combines precision and recall into a single metric that balances both measures. How the F1 score works was talked about in the first post so I won’t belabor that again here.
I would ask you to keep in mind the controllability aspect with this example. And with that in mind, as a tester, let me ask you this: how could you subvert the model? How could you change the data so that the model started drifting in terms of accuracy? If you can answer that question, you’re able to see how to lead the model to learn incorrect patterns and thus to make incorrect predictions.
But why would you want to do that? This is very much like the situation of “creating a failing test” in various other testing contexts. You want to be able to see that a test can be confirmed and falsified. So if I have a working test, I want to change conditions such that the test fails. If the test doesn’t fail, but I expect it to, that means I shouldn’t trust my tests!
Instrumenting the model to have it produce incorrect results would compromise the quality of predictability, of course. But it’s really important to have the observability that allows you to see this in the first place and the controllability to test out any modifications you, or your team, want to make. That’s how we keep these models testable.
Test the Spam
Speaking of being testable, let’s jump into a little testing here. Let’s frame the email spam example around pytest to demonstrate how to test the model with different data.
If you want to follow along, note that you’ll need the “pytest” runner, although you can use any test runner you want.

    import pytest
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split

    emails = [
        "Greetings! This is a legitimate email.",
        "Get rich quick! Guaranteed FREE money!",
        "We will never ask for your password.",
        "Average cashout time is 15 min!!!",
    ]

    # 0 = not spam, 1 = spam
    labels = np.array([0, 1, 0, 1])

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.5, stratify=labels, random_state=42
    )

    model = MultinomialNB()
    model.fit(X_train, y_train)


    @pytest.mark.parametrize(
        "email, expected_label",
        [
            ("Hello, checking in to see how you are.", 0),
            ("URGENT: Claim your free prize now!", 1),
            ("Important account information update", 0),
            ("Exclusive limited-time offer!", 1),
        ],
    )
    def test_spam_classification(email, expected_label):
        # Reuse the vectorizer fitted above so the features line up with training.
        email_vector = vectorizer.transform([email])
        predicted_label = model.predict(email_vector)

        # predict() returns an array with one entry per email; compare that entry.
        assert predicted_label[0] == expected_label

The test is driven by the @pytest.mark.parametrize decorator. I’m essentially saying there’s a test condition (“email, expected label”) and a series of data conditions (various email texts) I want to apply to that test condition. Thus each test represents a different example that I want to classify as spam or not spam.
Within the test function, I’m transforming the email into a feature vector using the same CountVectorizer that was used during training. Then I’m using the trained model to predict the label for the email test condition.
Any test needs some observation to make it an actual test and so I’m asserting that the predicted label matches the expected label for each data condition when applied to the test condition.
By running the tests, you can verify whether the model correctly classifies different example emails as spam or not spam based on the expected labels. And what you should see is that all tests pass.
One thing I do want to call out here: the notion of “training” and “testing” got a little mixed together here, right? You could be forgiven for saying “Jeff, you essentially just shoved part of the model execution into a test.” Is that the case?
Are We Actually Testing?
Is there any concern with what you see above? Consider a diff of the two examples. Think about what you’re seeing there. Is there a problem? Or does it, in fact, make complete sense? Can you frame a narrative around what someone might see the problem to be?
What’s one thing that stands out as differing between the two code examples? Clearly the data in the test is different than the data in the model. The email text examples are different. Is that a bad or a good thing?
Here’s how I would frame this for someone who was skeptical. The original code is focused on training a Naive Bayes classifier on a dataset and evaluating its performance using metrics such as accuracy, precision, recall, and F1 score. It demonstrates how to train a model, make predictions, and evaluate its quality using those metrics. The modified code is focused on testing by using the trained model to predict the label for each test case and compare it with the expected label using assertions. A key purpose of the test cases is to verify that the model correctly predicts the labels for different types of emails, including those it has not seen during training.
Ah, But Can It Fail?
Is there any way to make that test fail, just so we can prove that the test is doing what we think it is? Well, take a look at it. How would you do that? You can intentionally modify one of the test cases to make it fail. For example, you can change the expected label to a value that’s different from the predicted label. Here’s an updated version of the test condition part that will intentionally fail:

    @pytest.mark.parametrize(
        "email, expected_label",
        [
            ("Hello, checking in to see how you are.", 0),
            ("URGENT: Claim your free prize now!", 1),
            ("Important account information update", 0),
            # Expected label changed from 1 to 0 so that this case fails.
            ("Exclusive limited-time offer!", 0),
        ],
    )

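The same “can it fail?” idea can be sketched without pytest or a trained model at all. What follows is only an illustration: `stub_predict` and its keyword list are hypothetical stand-ins for the real classifier, and `failing_cases` is a helper name I made up. The point is the shape of the check, not the classifier.

```python
SPAM_WORDS = {"free", "prize", "offer", "urgent"}  # hypothetical keyword list


def stub_predict(email):
    # Hypothetical stand-in for model.predict: 1 (spam) if any keyword appears.
    words = email.lower().replace("!", " ").replace(":", " ").split()
    return int(any(word in SPAM_WORDS for word in words))


def failing_cases(cases):
    # Return every case whose expected label disagrees with the prediction.
    return [(email, expected) for email, expected in cases
            if stub_predict(email) != expected]


cases = [
    ("Hello, checking in to see how you are.", 0),
    ("Exclusive limited-time offer!", 1),
]

# Flip every expected label. A check I can trust must now flag every case.
flipped = [(email, 1 - expected) for email, expected in cases]

print(failing_cases(cases))         # nothing should be flagged here
print(len(failing_cases(flipped)))  # every flipped case should be flagged
```

If the flipped cases didn’t get flagged, that would tell me the check itself is broken, no matter how good the classifier is.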
Email: Explain / Interpret
In the context of the email spam example, explainability and interpretability refer to the ability to understand and interpret the decisions made by the model regarding the classification of emails as spam or not spam. Let’s dig a little into how these concepts relate to the provided code and example.
Explaining Our Model
Let’s start with explainability. Explainability focuses on providing insights into the decision-making process of the model. In the context of email spam classification, explainability tries to help us answer questions such as “Why did the model classify this email as spam?” or “What were the important features that led to the model’s decision?”
In the code example, explainability can be enhanced by analyzing the features used by the model to make predictions. One common approach is to examine the coefficients or weights associated with each feature in the trained model.
Whoa! What does that last part actually mean? The coefficients or weights refer to the learned probabilities of different words indicating, in our context, spam or non-spam. Naive Bayes assigns a probability to each word based on its occurrence in spam and non-spam emails. These probabilities are then used to calculate the likelihood of an email being spam or non-spam.
Here’s a modified example that incorporates some of this:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    from sklearn.model_selection import train_test_split

    emails = [
        "Greetings! This is a legitimate email.",
        "Get rich quick! Guaranteed FREE money!",
        "We will never ask for your password.",
        "Average cashout time is 15 min!!!",
    ]

    # 0 = not spam, 1 = spam
    labels = np.array([0, 1, 0, 1])

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.5, stratify=labels, random_state=42
    )

    model = MultinomialNB()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)

    print("Accuracy Measure:", accuracy)
    print("Precision Measure:", precision)
    print("Recall Measure:", recall)
    print("F1 Score:", f1)

    feature_names = vectorizer.get_feature_names_out()

    # Row 1 holds log P(word | spam); row 0 holds log P(word | non-spam).
    spam_word_probabilities = model.feature_log_prob_[1]
    ham_word_probabilities = model.feature_log_prob_[0]

    print("\nTop five words indicating spam:")

    # Sort ascending, take the last five, then reverse to get highest first.
    top_spam_words = np.argsort(spam_word_probabilities)[-5:][::-1]

    for idx in top_spam_words:
        print(f"{feature_names[idx]} (Probability: {np.exp(spam_word_probabilities[idx])})")

    print("\nTop five words indicating non-spam:")

    top_ham_words = np.argsort(ham_word_probabilities)[-5:][::-1]

    for idx in top_ham_words:
        print(f"{feature_names[idx]} (Probability: {np.exp(ham_word_probabilities[idx])})")

Running that gives the following output:

    Accuracy Measure: 0.5
    Precision Measure: 0.5
    Recall Measure: 1.0
    F1 Score: 0.6666666666666666

    Top five words indicating spam:
    guaranteed (Probability: 0.06896551724137931)
    rich (Probability: 0.06896551724137931)
    quick (Probability: 0.06896551724137931)
    money (Probability: 0.06896551724137931)
    free (Probability: 0.06896551724137931)

    Top five words indicating non-spam:
    your (Probability: 0.06666666666666667)
    we (Probability: 0.06666666666666667)
    ask (Probability: 0.06666666666666667)
    password (Probability: 0.06666666666666667)
    never (Probability: 0.06666666666666667)

Key to this is that I’m using the feature_log_prob_ attribute of the trained MultinomialNB model. This attribute holds the logarithm of the estimated probability of each word given a class, which is why the code passes the values through np.exp to turn them back into probabilities. Doing that, I’m able to retrieve the top five words indicating spam and non-spam, along with their probabilities and corresponding feature names.
Now that, my friends, is an example of controllability and observability in action!
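If you’re wondering where a value like 0.06896551724137931 comes from, here’s a rough sketch of the arithmetic in plain Python. I’m assuming two things that the output suggests but that I haven’t confirmed by inspecting the split: that the “Get rich quick!” and “We will never ask” emails are the two that landed in the training set, and that the model is using its default add-one smoothing. The helper names are mine, not scikit-learn’s.

```python
import re
from collections import Counter


def tokenize(text):
    # Approximates CountVectorizer's default: lowercase, words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())


all_emails = [
    "Greetings! This is a legitimate email.",
    "Get rich quick! Guaranteed FREE money!",
    "We will never ask for your password.",
    "Average cashout time is 15 min!!!",
]

# Assumption: these are the two emails the split put in the training set.
training_email = {
    1: "Get rich quick! Guaranteed FREE money!",  # spam
    0: "We will never ask for your password.",    # non-spam
}

# The vocabulary comes from all four emails because the vectorizer saw them all.
vocabulary = {token for email in all_emails for token in tokenize(email)}


def smoothed_probability(word, label, alpha=1.0):
    # Add-alpha smoothing: (count + alpha) / (class word total + alpha * vocab size).
    counts = Counter(tokenize(training_email[label]))
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocabulary))


print(len(vocabulary))                   # 23 distinct words
print(smoothed_probability("free", 1))   # 2/29, the spam value in the output
print(smoothed_probability("your", 0))   # 2/30, the non-spam value in the output
print(smoothed_probability("free", 0))   # 1/30, "free" never appears in non-spam
```

That last line is worth a look. A word the model never saw in a class still gets a small non-zero probability, and that’s exactly what the smoothing is for.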
In addition to the above, techniques like LIME or SHAP — which I talked about in the first post — can be used to provide local explanations for individual predictions. These methods would highlight the important features that influenced the model’s decision for a specific email, helping to explain why that particular email was classified as spam or not.
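To be clear, what follows is not LIME or SHAP. It’s a much simpler stand-in that gets at the same kind of question for a Naive Bayes model: which words in this one email pushed it toward spam? The likelihood values are the smoothed ones you’d get under the assumption that the training set is the “Get rich quick!” and “We will never ask” emails, and the `word_contributions` helper is a name I made up.

```python
import math

# Assumed smoothed likelihoods P(word | class) for two words from the email
# "URGENT: Claim your free prize now!" that are in the model's vocabulary.
spam_likelihood = {"free": 2 / 29, "your": 1 / 29}
ham_likelihood = {"free": 1 / 30, "your": 2 / 30}


def word_contributions(words):
    # Log-likelihood ratio per word: positive pushes toward spam,
    # negative pushes toward non-spam.
    return {
        word: math.log(spam_likelihood[word] / ham_likelihood[word])
        for word in words
    }


contributions = word_contributions(["your", "free"])

for word, score in sorted(contributions.items(), key=lambda item: -abs(item[1])):
    print(f"{word}: {score:+.3f}")

print("total:", sum(contributions.values()))
```

With these assumed values, “free” pulls toward spam a little harder than “your” pulls toward non-spam, so the total comes out slightly positive. That’s a local explanation in miniature: not why the model works in general, but why it leaned the way it did on one email.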
Interpreting Our Model
What about interpretability? Interpretability refers to the ability to understand and make sense of how the model works and behaves. It involves building a mental model or intuition about how the model operates and what factors contribute to its predictions. In the email spam example, interpretability can be achieved by examining the underlying algorithm and the feature representation used. For instance, understanding the workings of a Naive Bayes classifier and its reliance on the bag-of-words representation can provide insights into how the model considers the presence or absence of specific words to classify emails.
If your delivery team is being asked to interpret the model and why your customers might feel comfortable using it, what might you say?
As a tester, you may argue that this interpretation is not your job. I would say that it’s not your job alone. But, as I hope these posts are showing, testing is a key part of how we trust our ability to explain and interpret.
There are a few things you could focus on.
- The feature representation. Specifically, you could explain how the textual data is transformed into a numerical representation that can be used by the Naive Bayes classifier. In our example, the count vectorizer is used to convert the text emails into a matrix of token counts. Each email is represented as a vector of word frequencies.
- The model training. You could specifically mention that during training, the classifier learns the probability distribution of each word for each class based on the observed frequencies.
- The Naive Bayes algorithm itself. You could explain the underlying algorithm of the Naive Bayes classifier and mention that the classifier applies Bayes’ theorem to calculate the posterior probabilities of each class given the observed word frequencies. It would be worth calling out that the occurrence of each word is assumed to be independent of the occurrences of other words, which is a simplifying assumption.
- The specific model parameters. Here you could discuss the interpretable aspects of the trained model. In the case of Naive Bayes, you could mention that the learned probabilities of different words indicate their importance or discriminative power for distinguishing between spam and non-spam emails. Higher probabilities for a particular word in the spam class suggest that the word is more indicative of spam, and vice versa.
- The model evaluation. You could, for example, highlight the performance metrics used to evaluate the model, such as accuracy, precision, recall, and the F1 score. You could talk about what each metric measures and how those measures provide insights into the model’s effectiveness in classifying emails.
Explain and Interpret Go Together
All of that interpretability is focusing on the feature representation, the training process, and the interpretable parameters of the Naive Bayes classifier. That’s pretty technical, though, right? Not all audiences will need or want that level of detail. Your customers, for example, probably would not.
Well … hold on. Is that true? What if your customer was a solution provider that wanted to incorporate your model? But your model may be just one of many that they’re considering. In this case, that kind of customer may appreciate some of the details around the interpretability. Yet, again, many other customers may simply want to know if they can trust your model. And that’s where the explainability comes in. So you have to interleave the two aspects and then decide which to put emphasis on given the particular situation you are dealing with.
Beyond even those examples, and going back to something I showed you in the first post, visualizations or summaries of the learned model parameters, such as word probabilities, can be really helpful in interpreting the model’s behavior.
Note that “ham” is often used to refer to non-spam emails.
By analyzing these parameters, testers, working with the delivery team, can use those visuals to gain a better understanding of the features and characteristics that the model leverages to make spam classification decisions. Sometimes visualizations serve as a good bridge between the interpretability and explainability. These kinds of visuals allow for a way to focus on qualitative aspects while still being able to dig into quantitative aspects. However, as warned about in the first post, it’s crucial to avoid letting these visuals take on a life of their own where they become more determinative of decision-making than the actual evaluation measures and scores that are providing the quantitative understanding of what is actually happening.
Visual Classification
Let’s consider a different example, this time in the context of image classification. I’ll use the popular MNIST dataset, which consists of grayscale images of handwritten digits from 0 to 9. I’ll train a simple convolutional neural network model and evaluate its performance using evaluation measures and scores.
This kind of model was developed based on how our brains seem to process visual information, essentially by recognizing patterns and shapes in images. These models pass a given image through what are called “convolutional layers.” These layers apply certain filters to small regions of the image to identify features. Features here might be edges, corners, or textures. As each layer is built up, the model combines these features to recognize more complex shapes.
The term “convolution” refers to a mathematical operation that combines two functions to produce a third function that represents how the shape of one is modified by the other. The layered design of these networks, rather than the term itself, was inspired by what occurs in the visual cortex during visual processing, which is what allows the human brain to make sense of spatial relationships.
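To give a feel for what “applying a filter” means, here’s a tiny sketch in plain Python. It’s not how Keras or PyTorch implement it, and the `convolve2d` helper, the 4x6 “image,” and the edge filter are all made up for illustration. Strictly speaking, sliding a filter without flipping it is a cross-correlation, but it’s the operation deep learning libraries call convolution.

```python
def convolve2d(image, kernel):
    # Slide a square kernel over the image with no padding and a stride of 1.
    # At each position, multiply overlapping values and add them up.
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    output = []
    for r in range(out_height):
        row = []
        for c in range(out_width):
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# A made-up image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# A filter that responds where brightness increases from left to right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)  # [0, 3, 3, 0] for each of the two rows
```

The output is zero over the flat regions and lights up only around the dark-to-bright boundary. In a real network the filter values aren’t hand-picked like this. They’re learned during training.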
I’m going to show this example in two ways. One using TensorFlow and another using PyTorch. I’m doing this because you will often come across different approaches depending on the development team or data scientist team that you’re working with and the tooling that they are most familiar with.
If you’re on Mac or Linux, you should have no problem installing the tensorflow package. Windows users, on the other hand, can be in for a rough ride. If you’re using the Anaconda distribution for Python that can ease things considerably. If tensorflow doesn’t work for you, just use the PyTorch version.
TensorFlow
Here’s the TensorFlow version. It’s worth noting that this code will have to download information, specifically the MNIST data set.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    from tensorflow import keras

    (X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()

    # Add a channel dimension and scale pixel values to the range 0..1.
    X_train = X_train.reshape(-1, 28, 28, 1) / 255.0
    X_test = X_test.reshape(-1, 28, 28, 1) / 255.0

    # One-hot encode the digit labels.
    y_train = keras.utils.to_categorical(y_train, num_classes=10)
    y_test = keras.utils.to_categorical(y_test, num_classes=10)

    # Carve a validation set out of the training data.
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42
    )

    model = keras.models.Sequential(
        [
            keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
            keras.layers.MaxPooling2D((2, 2)),
            keras.layers.Flatten(),
            keras.layers.Dense(10, activation="softmax"),
        ]
    )

    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

    model.fit(X_train, y_train, batch_size=128, epochs=5, validation_data=(X_val, y_val))

    # Convert predicted probabilities and one-hot labels back to digit indices.
    y_pred = model.predict(X_test)
    y_pred_classes = np.argmax(y_pred, axis=1)
    y_true_classes = np.argmax(y_test, axis=1)

    accuracy = accuracy_score(y_true_classes, y_pred_classes)
    precision = precision_score(y_true_classes, y_pred_classes, average="weighted")
    recall = recall_score(y_true_classes, y_pred_classes, average="weighted")
    f1 = f1_score(y_true_classes, y_pred_classes, average="weighted")

    print("Accuracy Measure:", accuracy)
    print("Precision Measure:", precision)
    print("Recall Measure:", recall)
    print("F1 Score:", f1)

Keras here refers to a high-level neural network library that has a specific focus on deep learning models. Keras’ claim to fame is a set of modular APIs that simplify the process of developing neural networks.
Running the code produces output like this:
Epoch 1/5
375/375 [==============================] - 7s 19ms/step - loss: 0.3948 - accuracy: 0.8933 - val_loss: 0.1898 - val_accuracy: 0.9464
Epoch 2/5
375/375 [==============================] - 6s 15ms/step - loss: 0.1466 - accuracy: 0.9577 - val_loss: 0.1173 - val_accuracy: 0.9688
Epoch 3/5
375/375 [==============================] - 6s 15ms/step - loss: 0.0976 - accuracy: 0.9729 - val_loss: 0.0933 - val_accuracy: 0.9746
Epoch 4/5
375/375 [==============================] - 6s 15ms/step - loss: 0.0757 - accuracy: 0.9787 - val_loss: 0.0808 - val_accuracy: 0.9769
Epoch 5/5
375/375 [==============================] - 6s 16ms/step - loss: 0.0633 - accuracy: 0.9827 - val_loss: 0.0711 - val_accuracy: 0.9797
313/313 [==============================] - 0s 1ms/step
Just to put some context around that output, when you see something like “375/375 [==============================] - 7s 19ms/step”, that represents the progress of the training process. It indicates that the current batch is the 375th batch out of a total of 375 batches. The bar in square brackets shows that progress visually, filling in as batches complete. The subsequent values “7s” and “19ms/step” represent the time taken for the epoch and the average time per step (or batch), respectively.
An epoch refers to one complete pass through the entire training dataset during the training phase.
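The 375 and 313 in that output aren’t arbitrary, and it’s a nice little exercise to reconstruct them. Here’s a sketch of the arithmetic. The sizes are the standard MNIST ones (60,000 training images, 10,000 test images), and I’m relying on Keras using a batch size of 32 for predict when you don’t specify one. The `steps_per_epoch` helper is just a name I picked.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    # The last batch may be smaller than the rest, so round up.
    return math.ceil(num_samples / batch_size)


mnist_train_size = 60_000
mnist_test_size = 10_000

# test_size=0.2 holds out 20% of the 60,000 training images for validation.
validation_size = 12_000
train_size = mnist_train_size - validation_size

print(steps_per_epoch(train_size, 128))      # 375 batches per training epoch
print(steps_per_epoch(mnist_test_size, 32))  # 313 batches for the predict call
```

So one epoch here is 375 batches of 128 images each, and the final “313/313” line is the prediction pass over the test images.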
A line like “loss: 0.3948 - accuracy: 0.8933 - val_loss: 0.1898 - val_accuracy: 0.9464” shows the metrics for the current epoch. Here “loss” represents the training loss value, which measures the model’s error during training. The “accuracy” indicates the training accuracy, which represents the proportion of correctly classified samples during training. The “val_loss” represents the validation loss, which measures the model’s error on a separate validation set. And, finally, the “val_accuracy” indicates the validation accuracy, which represents the proportion of correctly classified samples on the validation set.
What you’re getting there is information about the training progress, including the loss and accuracy values for each epoch on both the training data and the validation set. This is what allows you to monitor the performance of the model during training.
After training, I then evaluate the model on the test set by making predictions (y_pred) and converting the predictions and true labels to class indices (y_pred_classes and y_true_classes). Finally, I calculate our by-now-familiar evaluation metrics such as accuracy, precision, recall, and F1 score. You should see output like this:

    Accuracy Measure: 0.9805
    Precision Measure: 0.9805278655924867
    Recall Measure: 0.9805
    F1 Score: 0.9804860876778115

As we did with the email spam example, consider how you would explain the above to someone who wanted to know the results. Let’s break it down a little.
- The accuracy measure represents the proportion of correctly classified samples in the test dataset. In this case, the accuracy measure is 0.9805, which means that the model achieved an accuracy of 98.05%. It correctly classified approximately 98.05% of the test samples.
- The precision is a measure of how well the model predicts a given class among all the samples it assigned to that class. With ten digits there’s no single positive class, so the code computes precision for each digit in turn and then averages those values, weighted by how many test samples each digit has (that’s the average="weighted" argument). The precision measure is 0.9805278655924867, which indicates that the model’s predictions are, on weighted average, about 98% precise.
- The recall (as stated earlier, also known as sensitivity or true positive rate) measures how well the model captures all the samples of a given class. It’s computed per digit and weighted the same way. The recall measure is 0.9805, which means that the model identified approximately 98.05% of the samples of each digit, on weighted average. With this weighting, recall works out to the same value as accuracy, which is why both show 0.9805.
- The F1 score is the harmonic mean of precision and recall. It provides a balanced measure of the model’s performance by considering both precision and recall. The F1 score in this case is 0.9804860876778115, indicating a good balance between precision and recall.
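Since the MNIST example passes average="weighted", it may help to see what that weighting does. This is a sketch in plain Python on a made-up three-class example, not scikit-learn’s code, and the helper names are mine. It also shows why weighted recall and accuracy come out as the same number.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred, label):
    # Treat one class as "positive" and everything else as "negative".
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


def weighted_scores(y_true, y_pred):
    # Average the per-class scores, weighting each class by its true count.
    support = Counter(y_true)
    total = len(y_true)
    weighted_precision = 0.0
    weighted_recall = 0.0
    for label, count in support.items():
        precision, recall = per_class_precision_recall(y_true, y_pred, label)
        weighted_precision += precision * count / total
        weighted_recall += recall * count / total
    return weighted_precision, weighted_recall


# Made-up labels for three classes; one sample of class 0 is mistaken for class 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

weighted_precision, weighted_recall = weighted_scores(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

print(weighted_precision)
print(weighted_recall)
print(accuracy)
```

Each class’s recall is its correct count divided by its true count. Once you weight that by the true count, the true counts cancel and you’re left with total correct divided by total samples, which is accuracy.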
PyTorch
Let’s try the PyTorch version.
You’ll need the “torch” and “torchvision” dependencies.
This one takes more effort to set up but, to be honest, I much prefer using PyTorch to TensorFlow. From a testing standpoint, I find it gives much more controllability and observability.
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

# Download MNIST into a local "data" folder (created on first run).
mnist_train = MNIST(root="data", train=True, download=True, transform=ToTensor())
mnist_test = MNIST(root="data", train=False, download=True, transform=ToTensor())

# Add a channel dimension and scale pixel values to the range [0, 1].
X_train = mnist_train.data.unsqueeze(1).float() / 255.0
X_test = mnist_test.data.unsqueeze(1).float() / 255.0

# One-hot encode the digit labels.
y_train = torch.nn.functional.one_hot(mnist_train.targets, num_classes=10).float()
y_test = torch.nn.functional.one_hot(mnist_test.targets, num_classes=10).float()

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)


class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(13 * 13 * 32, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.ReLU()(x)
        x = self.maxpool(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x


model = SimpleCNN()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 5
batch_size = 128

for epoch in range(num_epochs):
    model.train()

    for i in range(0, len(X_train), batch_size):
        inputs = X_train[i : i + batch_size]
        # CrossEntropyLoss expects class indices, so undo the one-hot encoding.
        targets = torch.argmax(y_train[i : i + batch_size], dim=1)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

# Save the trained weights so the tests can load them later.
torch.save(model.state_dict(), "model.pth")

model.eval()

# Gradients are not needed for evaluation.
with torch.no_grad():
    y_pred = torch.argmax(model(X_test), dim=1)
y_true = torch.argmax(y_test, dim=1)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="weighted")
recall = recall_score(y_true, y_pred, average="weighted")
f1 = f1_score(y_true, y_pred, average="weighted")

print("Accuracy Measure:", accuracy)
print("Precision Measure:", precision)
print("Recall Measure:", recall)
print("F1 Score:", f1)
```
The script downloads the MNIST data into a data folder that it creates wherever you are running the script. This logic also builds a really simple convolutional neural network model, represented by the SimpleCNN class. The training loop, loss function, and optimizer are also adjusted accordingly.
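One detail in SimpleCNN that tends to puzzle people is the 13 * 13 * 32 passed to the final linear layer. Here's a small sketch of where that number comes from, using the standard output-size formula for a convolution or pooling layer and the fact that MNIST images are 28 by 28 pixels. The helper name is my own; it is not part of PyTorch.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28  # MNIST images are 28 x 28 pixels

# nn.Conv2d(1, 32, kernel_size=3, stride=1) with no padding
side = output_size(side, kernel=3, stride=1)
after_conv = side

# nn.MaxPool2d(kernel_size=2, stride=2)
side = output_size(side, kernel=2, stride=2)
after_pool = side

channels = 32  # output channels of the convolution
flattened = after_pool * after_pool * channels

print("After conv:", after_conv)
print("After pool:", after_pool)
print("Flattened features:", flattened)
```

If someone changes the kernel size, the stride, or the input image size, that flattened count changes too and the linear layer will reject its input. That makes it a good candidate for a quick sanity test before any training happens.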
A key thing I’m doing there is saving a file called “model.pth” in the directory where you’re running this script. I’ll come back to why that is.
If you run this, you should see output like the following:
```
Accuracy Measure: 0.9803
Precision Measure: 0.9803903443873958
Recall Measure: 0.9803
F1 Score: 0.9802689719341039
```

Let’s say that your delivery team was choosing between Tensorflow and PyTorch. What would the above tell them? The differences in accuracy, precision, recall, and F1 score between this and the Tensorflow version are minimal: each pair of values differs by roughly 0.0002, which indicates that the two models perform similarly. You and your team can therefore conclude that the PyTorch model performs comparably to the Tensorflow one, and that both perform consistently well on the test dataset.

So the point is that by using PyTorch, you can achieve similar functionality without relying on Tensorflow. Or vice versa. If the outputs were very different, that would be telling and you, along with your team, would have to look into it. It could be something as simple as using one of the tools incorrectly, or it could be something a bit deeper.
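If you wanted to make that "are these close enough?" judgment repeatable rather than eyeballing it, one option is a small comparison with an explicit tolerance. The sketch below uses the two sets of scores reported in this post. The function name and the tolerance of 0.005 are assumptions of mine; what counts as "very different" is something you'd agree on with your team.

```python
# Scores reported in this post for each framework.
tensorflow_scores = {
    "accuracy": 0.9805,
    "precision": 0.9805278655924867,
    "recall": 0.9805,
    "f1": 0.9804860876778115,
}
pytorch_scores = {
    "accuracy": 0.9803,
    "precision": 0.9803903443873958,
    "recall": 0.9803,
    "f1": 0.9802689719341039,
}

# Hypothetical threshold: gaps larger than this deserve investigation.
TOLERANCE = 0.005


def find_divergent_scores(first, second, tolerance):
    """Return the metrics whose absolute difference exceeds the tolerance."""
    gaps = {name: abs(first[name] - second[name]) for name in first}
    return {name: gap for name, gap in gaps.items() if gap > tolerance}


divergent = find_divergent_scores(tensorflow_scores, pytorch_scores, TOLERANCE)
largest_gap = max(
    abs(tensorflow_scores[name] - pytorch_scores[name]) for name in tensorflow_scores
)

print("Metrics outside tolerance:", divergent)
print("Largest gap:", largest_gap)
```

Keep in mind that neural network training involves randomness, so a rerun of either script will likely give slightly different numbers. A tolerance gives you room for that without hiding a real divergence.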
Let’s Do More Actual Testing
We can apply the same approach of using pytest to evaluate the MNIST example. Here, in the interests of not making a long post even longer, I’ll just show this with PyTorch although you could certainly do something similar with the Tensorflow version.
```python
import pytest
import torch
import torch.nn as nn
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

mnist_test = MNIST(root="data", train=False, download=True, transform=ToTensor())

X_test = mnist_test.data.unsqueeze(1).float() / 255.0
y_test = torch.nn.functional.one_hot(mnist_test.targets, num_classes=10).float()


class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(13 * 13 * 32, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.ReLU()(x)
        x = self.maxpool(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x


model = SimpleCNN()
model.load_state_dict(torch.load("model.pth"))


@pytest.mark.parametrize("index", [0, 1, 2, 3, 4])
def test_mnist_classification(index):
    model.eval()

    input_data = X_test[index]
    target = torch.argmax(y_test[index])

    with torch.no_grad():
        output = model(input_data.unsqueeze(0))

    predicted = torch.argmax(output)

    assert predicted == target
```
I use the @pytest.mark.parametrize decorator, as I did in the email spam example, to specify the indices of the test examples. You can add more indices or modify them as needed. The pre-trained model is loaded once, at module level, from the “model.pth” file saved earlier. Inside the test function, I then pass the input data through the model and assert that the predicted label matches the target label.
If you run this with pytest, you’ll see that all tests pass. As we tried with the previous example, let’s see if the tests can fail. How do we do that?
Well, one thing that should certainly make the test fail is to modify the target labels (y_test) to create a mismatch with the model’s predictions. Here’s an example modification that would cause the test to fail for the first index:
```python
...

model = SimpleCNN()
model.load_state_dict(torch.load("model.pth"))

y_test[0] = torch.zeros(10)

...
```
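Why does overwriting the label with zeros cause a failure? The test recovers the target by taking the argmax of the one-hot vector. When every entry is zero there's a ten-way tie, and torch.argmax is documented to return the index of the first maximal value, so the target silently becomes 0. Here's a plain-Python sketch of that behavior. The helpers are stand-ins I wrote to mimic the tensor operations, and the digit 7 is used because, as far as I know, that is the first image in the MNIST test set.

```python
def one_hot(label, num_classes=10):
    """Stand-in for one-hot encoding: 1.0 at the label's index, 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def argmax(values):
    """Stand-in for argmax: index of the largest value, first index on ties."""
    return max(range(len(values)), key=lambda i: values[i])


original_label = one_hot(7)
tampered_label = [0.0] * 10  # same effect as assigning an all-zero vector

original_target = argmax(original_label)
tampered_target = argmax(tampered_label)

print("Original target:", original_target)
print("Tampered target:", tampered_target)
```

So the test fails as long as the model predicts anything other than 0 for that image. If the first image happened to be a 0, this particular tampering would go unnoticed, which is worth remembering when you design a deliberate failure.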
Visualizing our Testing
Since this example relies on specific visual data, it’s probably worth looking at a visualization to augment the test reports. Consider this modification:
```python
import pytest
import torch
import torch.nn as nn
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt

mnist_test = MNIST(root="data", train=False, download=True, transform=ToTensor())

X_test = mnist_test.data.unsqueeze(1).float() / 255.0
y_test = torch.nn.functional.one_hot(mnist_test.targets, num_classes=10).float()


class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(13 * 13 * 32, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.ReLU()(x)
        x = self.maxpool(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x


model = SimpleCNN()
model.load_state_dict(torch.load("model.pth"))


@pytest.mark.parametrize("index", [0, 1, 2, 3, 4])
def test_mnist_classification(index):
    model.eval()

    input_data = X_test[index]
    target = torch.argmax(y_test[index])

    with torch.no_grad():
        output = model(input_data.unsqueeze(0))

    predicted = torch.argmax(output)

    assert predicted == target, f"Test failed for index {index}"

    image = X_test[index].squeeze().numpy()
    predicted_label = predicted.item()
    target_label = target.item()

    plt.imshow(image, cmap="gray")
    plt.title(f"Predicted: {predicted_label}, Target: {target_label}")
    plt.axis("off")
    plt.show()
```
```python
import pytest
import torch
import torch.nn as nn
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt
import numpy as np

mnist_test = MNIST(root="data", train=False, download=True, transform=ToTensor())

X_test = mnist_test.data.unsqueeze(1).float() / 255.0
y_test = torch.nn.functional.one_hot(mnist_test.targets, num_classes=10).float()


class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(13 * 13 * 32, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.ReLU()(x)
        x = self.maxpool(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x


model = SimpleCNN()
model.load_state_dict(torch.load("model.pth"))


@pytest.mark.parametrize("index", [0, 1, 2, 3, 4])
def test_mnist_classification(index):
    model.eval()

    input_data = X_test[index]
    target = torch.argmax(y_test[index])

    with torch.no_grad():
        output = model(input_data.unsqueeze(0))

    predicted = torch.argmax(output)

    assert predicted == target, f"Test failed for index {index}"


def test_visualize_mnist_classification():
    num_tests = 5
    num_cols = 5
    num_rows = (num_tests + num_cols - 1) // num_cols

    fig, axes = plt.subplots(num_rows, num_cols, figsize=(10, 10))
    axes = np.array(axes).flatten()

    for i, index in enumerate([0, 1, 2, 3, 4]):
        ax = axes[i]

        test_mnist_classification(index)

        image = X_test[index].squeeze().numpy()
        predicted_label = torch.argmax(model(X_test[index].unsqueeze(0))).item()
        target_label = torch.argmax(y_test[index]).item()

        ax.imshow(image, cmap="gray")
        ax.set_title(f"Predicted: {predicted_label}, Target: {target_label}")
        ax.axis("off")

    if num_tests < num_rows * num_cols:
        for i in range(num_tests, num_rows * num_cols):
            fig.delaxes(axes[i])

    plt.tight_layout()
    plt.show()
```
This is where you, as a tester, either need to know how to do this yourself or pair-test/program with your developers, asking them for what you need and working with them to make sure it’s provided. There’s a vast spectrum of how “technical” — in a code-based sense — a tester needs to be. I personally err on the side of having as much programmatic knowledge as I can. But, again, there’s a vast spectrum.
Earlier I asked whether testers should become developers. I also suggested testers should act like developers. I also talked about testers and the technical abstraction stack. The fact that I wrote all of that without an AI context just reinforces to me how much relevance it has — at least to me — in an AI context, where the abstractions are many and varied.