
Text Trek: Navigating Classifications, Part 6

In this final post of the series, we’ll look at training our learning model on our Emotions dataset. This post is the culmination of everything we learned in the first three posts and then implemented in the previous two. So let’s dig in for the final stretch!

First, let’s make sure we have our script at a base state, like this:
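In sketch form, and assuming the “emotion” dataset, the distilbert-base-uncased checkpoint, and the variable names carried over from the earlier posts, that base state looks roughly like this:

from datasets import load_dataset
from transformers import AutoTokenizer

# The checkpoint and dataset we've been working with throughout the series.
model_checkpoint = "distilbert-base-uncased"
emotions = load_dataset("emotion")

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize(batch):
    # Pad and truncate each batch of texts to a uniform length.
    return tokenizer(batch["text"], padding=True, truncation=True)

# Tokenize every split in one pass.
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)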

Keep in mind that this, essentially, is our test script. It’s one we created out of a whole lot of exploration, and that’s crucial in this context. You will often do a lot of exploration to figure out which qualities you need to focus on and how to make risk assessments based on the qualities you care about most. Then you will encode those decisions in a script like the one above.

This is how you replicate experiments, just as with any test case you might create. The fact that this test case happens to be code-based is irrelevant to the intent behind it.

Now I’m going to ask you to add a few imports and new statements. Rather than interleave everything, let me just show you the updated script and I’ll leave it to you to compare and contrast as you please:
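In rough form, with the new pieces being the torch and AutoModel imports, the device selection, and the model loading:

import torch

from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

model_checkpoint = "distilbert-base-uncased"
emotions = load_dataset("emotion")

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

# Use the GPU if one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained DistilBERT model and move it to the selected device.
model = AutoModel.from_pretrained(model_checkpoint).to(device)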

Here we’re using the same model checkpoint we’ve been using. A difference here is that we set up the device to use for computation based on GPU availability. If a GPU is available, the model will be loaded onto the GPU; otherwise, it will be loaded onto the CPU.

Then we’re using the AutoModel to load the pre-trained DistilBERT model with the specified checkpoint. Notice how we’re moving the model to the selected device, either GPU or CPU.

When you run this, you’ll see a download of something called “model.safetensors”. “Safetensors” is a particular format for storing and loading tensors efficiently.

The AutoModel class from the Hugging Face Transformers library encapsulates a whole lot of functionality to make working with pre-trained language models more convenient. Here’s a rough summary of what it does:

  1. Tokenization and Embedding Lookup: You pass the AutoModel instance the tokenized input that our tokenizer produced (the token IDs and attention masks), and it then looks up the corresponding token embeddings from the model’s embedding matrix.
  2. Encoder Stack: After obtaining the token embeddings, the AutoModel instance passes them through the encoder stack — the transformer layers. This stack contains multiple transformer layers that process the embeddings to capture contextual information.
  3. Hidden States: The output of the encoder stack is a series of hidden states, each corresponding to a token in the input sequence. These hidden states capture the contextual information of each token in relation to the surrounding tokens.

The key point is that this is a high-level abstraction that helps you focus more on the downstream tasks you want to perform. Contrast this to where we started off in this series, with practically no abstractions at all.

To get ourselves familiar with how this works, let’s retrieve the final hidden states of this process for just a single bit of text. Let’s use the first text from our dataset: “i didnt feel humiliated”. Add the following to your script:
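Something along these lines, using the tokenizer from the script above:

text = "i didnt feel humiliated"

# return_tensors="pt" asks the tokenizer for PyTorch tensors instead of lists.
inputs = tokenizer(text, return_tensors="pt")

print(f"Input shape: {inputs['input_ids'].size()}")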

So we’re encoding our text as we’ve done before but we’re also converting the tokens into PyTorch tensors. That’s what the “return_tensors” bit is doing above. You’ll get this output:


Input shape: torch.Size([1, 7])

The dimensions here are [batch size, number of tokens]. What that means is that the input consists of a single batch with 7 tokens in the sequence. (The token count includes special markers like [CLS] and [SEP], so it won’t always match the word count.) The main thing is that we now have our encodings as a tensor. This is very similar to what we did in the second post in this series.

Produce Our Hidden States

Let’s add the following:
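A sketch of that addition, picking up the inputs dictionary from the previous step:

# Move each input tensor onto the same device as the model.
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# Disable gradient tracking: we only want a forward pass, not training.
with torch.no_grad():
    model_outputs = model(**inputs_on_device)

print(model_outputs)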

What this code is doing is moving the input tensors to the specified device (GPU or CPU) using a dictionary comprehension. This helps us make sure that the input tensors are on the same device as the model.

The torch.no_grad() bit is a context manager provided by the PyTorch library that temporarily disables gradient calculations (“no gradient”). I don’t want to go too far afield here but let’s just consider why this is useful.

In machine learning contexts, gradient calculations are crucial for what’s called backpropagation. This is the process of updating model parameters during training to minimize the loss function.

When gradients are calculated, additional memory and computation are required to store and process them. However, during inference or evaluation, which is what we’re doing here, you don’t need to update model parameters, so there’s no need to compute gradients.

Using “no gradient” here signals that we’re only interested in obtaining model outputs, not in affecting the model’s parameters.

Going back to our code, we pass the tokenized and device-adjusted input tensors to the pre-trained DistilBERT model. The **inputs_on_device syntax unpacks the dictionary and passes its contents as keyword arguments to the model. This means that each key in the dictionary corresponds to an expected model input.

The output you’ll get is this:


BaseModelOutput(last_hidden_state=tensor([[[-0.1168,  0.0986, -0.1296,  ...,  0.0587,  0.3543,  0.4042],
[ 0.1325,  0.1516, -0.1169,  ..., -0.1119,  0.5562,  0.2908],
[-0.1053,  0.2862,  0.1958,  ...,  0.0241,  0.0577, -0.3627],
...,
[-0.6010,  0.2965,  0.1182,  ..., -0.0596, -0.2304,  0.4605],
[-0.3851,  0.2159, -0.1333,  ...,  0.1224, -0.0992, -0.2419],
[ 0.7642,  0.1564, -0.3384,  ...,  0.2157, -0.4236, -0.3679]]]), hidden_states=None, attentions=None)

So this is the output of the model’s forward pass.

Forward Passing

In the context of neural networks, a “forward pass” refers to the process of propagating input data through the layers of the network to produce an output. It’s called a “forward” pass because the data flows forward through the network, from the input layers to the output layers.

I don’t want to go crazy on details here, but let’s consider a little of how this works.

  1. You provide the input data to the neural network. This could be a single data point or a batch of data points.
  2. The input data is fed into the input layer of the neural network. In the case of language models like DistilBERT, the input data consists of tokenized text.
  3. The data moves through the various layers of the neural network. Each layer performs certain mathematical operations on the data.
  4. As the data progresses through the layers, it undergoes transformations that create increasingly abstract and meaningful representations of the input data. These hidden representations capture different features and relationships in the data.
  5. The final layer of the network produces the actual output based on the processed data. In the case of language models, this might be predictions about the next word in a sequence, sentiment scores, or any other task-specific output.

Crucially here the output produced at the end of the forward pass is the model’s prediction or output for the given input data.

In our code, when we call model(**inputs_on_device) within the torch.no_grad() context, we’re essentially executing a forward pass through the DistilBERT model. The model processes the input text data, performs calculations through its layers, and generates an output, which in this case includes hidden states.

In fact, the output shows us the “last_hidden_state”, which is a tensor. This tensor contains the hidden states for each token in the input sequence. Each token is represented by a vector in a sequence of vectors. Each vector captures contextual information about the token based on the surrounding tokens.

Notice also the “hidden_states=None” and “attentions=None” in that output.

In transformer-based models like DistilBERT, the hidden states and attention weights are crucial components that provide insights into how the model processes and understands the input text.

Like we’ve already talked about, hidden states are representations of the input text at various layers of the model. Each layer’s hidden state captures information about the tokens’ contextual relationships.

Attention mechanisms help the model focus on different parts of the input text while processing each token. Attention weights show the importance of each token’s interaction with other tokens in the sequence.

In our current output, both “hidden_states” and “attentions” are “None”. So what does that mean?

This indicates that the model wasn’t specifically configured to return these components during inference. We could certainly configure the model to output these components if we had a task that required accessing them. However, in our case, we’re pretty much solely interested in the final hidden states and so we don’t need to worry about that.

Reasoning About Hidden States

Speaking of those “final hidden states”, let’s check on that.
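A one-liner will do it, using the model_outputs variable from above:

print(model_outputs.last_hidden_state.size())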

The output is:


torch.Size([1, 7, 768])

The tensor size corresponds to the shape of the “last_hidden_state” tensor in the model_outputs we obtained from the DistilBERT model’s forward pass. The dimensions are [batch size, number of tokens, hidden states].

The first dimension represents the batch size. In this case, we processed a single input, so the batch size is 1. The second dimension corresponds to the sequence length of the input. The input text “i didnt feel humiliated” was tokenized into 7 tokens. The third dimension is the size of each token’s hidden state. In the case of DistilBERT, this is the size of the embedding and hidden state vectors, which ended up being 768.

The output thus indicates that we have a single batch with 7 tokens, and each token is represented by a vector of size 768. What that tells us is that, in this case, we get a 768-dimensional vector for each of the seven input tokens.

Let’s actually refine our print statement a bit:
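Something like this:

print(model_outputs.last_hidden_state[:, 0].size())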

This gets you:


torch.Size([1, 768])

But what’s going on here?

In classification tasks, you’re often interested in making a prediction based on the entire sequence, not individual tokens. Remember how we had that [CLS] token? That’s the special token added at the beginning of each sequence. What this means is that the hidden state of the [CLS] token can capture a representation of the whole sequence that’s useful for classification.

So the [:, 0] is a slice that selects all batches (:) and the first token (0), which is the [CLS] token.

So why is the “7” gone from our output? And why does that make sense?

Keep in mind that the 7 in the original tensor shape represents the sequence length of the input text. The sequence length indicates how many tokens are in the input sequence. In our case, the input text “i didnt feel humiliated” was tokenized into seven tokens.

When we changed our logic to retrieve last_hidden_state[:, 0].size(), we specifically extracted the hidden state of the [CLS] token using the index 0. This [CLS] token’s hidden state is a single vector that captures the overall representation of the sequence.

By doing this, we’re focusing on the [CLS] token’s hidden state as a feature for classification, and the sequence length dimension is no longer relevant.

I say the sequence length dimension becomes irrelevant because we’re no longer looking at the individual token-level representations. Instead, we’re using the [CLS] token’s hidden state as a higher-level representation of the entire sequence, which is suitable for classification tasks.

So, again, just to reiterate: the sequence length dimension (7) is no longer present because we’ve aggregated the sequence into a single [CLS] token representation for classification purposes.

I know this last part can seem a little confusing. Going into all the details would take me way too far afield. Just know that we’re essentially streamlining what we pass to our model so that training is more efficient.

Feeding More Data

So all that’s great and shows what happens in the context of a single datum that we would pull from our dataset. But now we need to do that for the entire dataset. First, as we did with the tokenizing step, let’s create a function to encapsulate what we just did:
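Here’s a sketch of that function. It follows the same steps we just walked through; the filter on tokenizer.model_input_names is one way to keep only the keys the model expects:

def extract_hidden_states(batch):
    # Keep only the inputs the model expects and move them to our device.
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Return the [CLS] hidden state, moved back to the CPU as a NumPy array.
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}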

Here one change from what we worked on is that we’re converting a PyTorch tensor to a NumPy array. When you’re working with a GPU (cuda) tensor and you want to perform operations that don’t require GPU processing — as in the .numpy() conversion — then you need to move the tensor from the GPU memory to the CPU memory.

So let’s make sure our script looks like this:
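Consolidated, the script at this point looks roughly like this (I’ve left out the single-text exploration statements, which have served their purpose):

import torch

from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

model_checkpoint = "distilbert-base-uncased"
emotions = load_dataset("emotion")

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_checkpoint).to(device)

def extract_hidden_states(batch):
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}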

But why this change in the first place? Well, we’re going to revisit our friend the map() function here.

Remember how we brought in the “input_ids” and “attention_mask” columns in the previous post? Well, we now have to convert those new columns into PyTorch tensors.
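One way to do that is with set_format(); I’m keeping the “label” column in the list as well so it stays available for the later steps:

# Have the dataset hand back PyTorch tensors for these columns.
emotions_encoded.set_format("torch",
                            columns=["input_ids", "attention_mask", "label"])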

So now that we’ve processed and formatted our dataset into the desired “torch” format, we can proceed to extract the hidden states across all splits by calling our new function.
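That call looks something like this:

# Run the hidden-state extraction over every split in the dataset.
feature_embeddings = emotions_encoded.map(extract_hidden_states, batched=True)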

Here, by the way, is why we had to use the .cpu().numpy() earlier. The map() method requires the processing function to return Python or NumPy objects when using batched inputs.

Here the “feature_embeddings” name for the variable is deliberate on my part. It communicates that the variable holds embeddings or representations of the input data that can be used as features for downstream tasks, like classification.

Running all this will trigger a potentially long mapping process depending on your CPU. This mapping will occur for each split in the data set.

Earlier when we tokenized and called the map() function, you might remember that we passed “batch_size=None”. Here we didn’t do that and what that means is that a default batch size will be used.

The default batch size used by the datasets library is generally determined by its internal settings and the library’s design principles. At least in my experience, it’s not directly exposed as a parameter that you can query programmatically.

At the time of writing, the Datasets documentation says: “The default batch size is 1000.”

As we did before, let’s check our columns now:
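For the training split, say:

print(feature_embeddings["train"].column_names)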

What happened here is that applying our new function has added a new “hidden_state” column to our dataset.


['text', 'label', 'input_ids', 'attention_mask', 'hidden_state']

Now that we’ve extracted the hidden states associated with each piece of text, the next step in our pipeline would typically involve training a classifier on those hidden states.

Remember that the hidden states serve as the feature representation of the text. We can use these features to train a classifier to predict the desired target variable — in our case, emotion label — associated with each text.

Training Our Classifier Model

To train a classifier on the extracted hidden states, you would typically organize those hidden states into a feature matrix. Each row of the feature matrix represents a text instance and the columns then correspond to the different dimensions of the hidden states. This feature matrix is then used as input to the classifier.

Add an import at the top of your script for NumPy:
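That import is simply:

import numpy as np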

We’re going to use the hidden states as input features and the labels as targets. The hidden states serve as the learned representations of the input text and the labels are the ground truth values that we’re trying to predict using those representations.

Think of “ground truth” as “known to be accurate.”

Here’s the code to create our feature matrix, which you can append to our growing test script:
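A sketch, assuming the “train” and “validation” split names from our dataset:

# Feature matrices: one row per text, one column per hidden-state dimension.
X_train = np.array(feature_embeddings["train"]["hidden_state"])
X_valid = np.array(feature_embeddings["validation"]["hidden_state"])

# Target vectors: the ground-truth emotion label for each text.
y_train = np.array(feature_embeddings["train"]["label"])
y_valid = np.array(feature_embeddings["validation"]["label"])

print(X_train.shape, X_valid.shape)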

Here we’re extracting the hidden states for both the training and validation splits from the feature_embeddings dataset. We’re then converting these extracted hidden states into NumPy arrays using np.array(). Similarly, we’re extracting the corresponding labels for both the training and validation splits.

The output will be:


(16000, 768) (2000, 768)
  • (16000, 768) represents the shape of the training feature matrix. This indicates we have 16,000 samples (texts) with each sample having 768 features (dimensions).
  • (2000, 768) represents the shape of the validation feature matrix. This indicates we have 2,000 validation samples, each with 768 features (dimensions).

Before training a model on the extracted hidden states, it’s a common and valuable practice to perform a quick visual check to ensure that these hidden states effectively capture the information you want to classify (emotional sentiments in this case). Visualizing the features can help us gain insights into their distribution, patterns, and separability. That can guide our decision-making process during model training and analysis.

Let’s talk about this a bit before we try it out.

Visualization in General

Visualization can reveal potential issues, such as clustering or patterns that don’t align with the expected classes. It can also provide a sense of the separability of different classes and how well the features differentiate between them.

Overall, taking this step can help us make informed decisions about preprocessing, feature selection, or the choice of classification algorithm based on the observed behavior of the features in the visualization.

That’s great and all. But remember how in the previous post I said visualizing some of this is hard? In that post I even used an image classification example to make that simpler. Well, that visualizing challenge hasn’t changed. How are we going to “visualize” hidden states in 768 dimensions?

This is where some tooling comes in to help you out.

One example is to use the UMAP (Uniform Manifold Approximation and Projection) algorithm to reduce the dimensionality of the hidden states from 768 dimensions to two dimensions. UMAP is a technique designed to preserve the underlying structure of high-dimensional data in lower dimensions.

That sounds great! But there’s an important preprocessing step. UMAP performs well when the input features are scaled to a specific range. There are certain ranges that are considered typical for dimensionality reduction techniques and one of those is [0,1].

What this scaling does is ensure that the algorithm effectively captures the relationships between data points in the reduced dimensionality while preserving the underlying structure of the data. To achieve this, we can use a preprocessing technique known as MinMaxScaler. The MinMaxScaler rescales each feature of the hidden state vectors so that they fall within the [0,1] range.

This preprocessing step is crucial for obtaining meaningful results from the subsequent UMAP dimensionality reduction. This helps us make sure that the input data is in the appropriate format for the algorithm.

You’ll need to add the following imports at the top of your script:
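Assuming the umap-learn package for UMAP, plus scikit-learn and pandas:

import pandas as pd

from sklearn.preprocessing import MinMaxScaler
from umap import UMAP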

Now add this code to your script and run it.
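Here’s a sketch of that block; the df_emb name is my choice, while the rest mirrors what gets described below:

# Scale each feature into the [0, 1] interval before reducing dimensions.
X_scaled = MinMaxScaler().fit_transform(X_train)

# Reduce the 768-dimensional hidden states down to two dimensions.
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)

# Collect the two-dimensional coordinates alongside the training labels.
df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
df_emb["label"] = y_train

print(df_emb.head())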

Yikes, that looks confusing, doesn’t it? Let’s break it down.

  1. First, we apply the MinMaxScaler to the training feature matrix X_train. This scaling step ensures that the feature values are within the [0,1] interval, which is important for the UMAP algorithm to work effectively.
  2. Using the scaled feature matrix, we create an instance of the UMAP class with n_components=2 to indicate that we want to reduce the dimensionality to two dimensions. We also specify the metric as “cosine”, which measures the cosine similarity between vectors. The UMAP algorithm then performs the dimensionality reduction, preserving the relationships between the original data points as much as possible.
  3. The resulting reduced two-dimensional embedding produced by UMAP is stored in the mapper.embedding_ attribute.

You will get another round of mapping that takes place here when you run this. Note that this can take quite a bit of time. We are, after all, going from 768 to two dimensions. Exactly how much time is impossible to say as it depends on your processor.

You’ll get this output:


          X         Y  label
0  4.503757  6.347070      0
1 -2.730354  5.456592      0
2  5.498598  2.742659      3
3 -2.113808  2.490611      2
4 -2.961837  3.255969      3

As we’ve done in previous posts, here we create a pandas DataFrame to hold the two-dimensional coordinates. The columns “X” and “Y” represent the two dimensions. We also append the corresponding y_train labels as the “label” column.

So, crucially, each row in the DataFrame represents a data point from our original hidden state vectors.

Each row’s X and Y values correspond to the two-dimensional coordinates where the data point has been projected by the UMAP algorithm. This representation allows us to visually explore the distribution and relationships between different emotional classes in a two-dimensional space.

The intent of the output is to provide a clear overview of how the original high-dimensional hidden state vectors have been transformed into a lower-dimensional space for visualization and analysis.

Okay … but that’s not much in the way of a visualization, right?

Refining the Visualization

Investigating the density of points for each category separately in our reduced two-dimensional embedding can provide valuable insights into how well the different emotional classes are separated and distributed in the lower-dimensional space.

Density plots can help you visually assess the separability of the classes, which in turn makes it easier to identify any potential overlaps or clusters. They can also help you understand the overall distribution of data points within each class.

Add this import to the top of your script:
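That would be matplotlib:

import matplotlib.pyplot as plt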

And then let’s add this code to the script:
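Here’s a sketch of that plotting code. The specific color map names and figure size are my guesses; the structure (a 2×3 grid, one hexbin plot per emotion, labels pulled from the dataset’s features) matches the description below:

fig, axes = plt.subplots(2, 3, figsize=(7, 5))
axes = axes.flatten()

# One color map per emotion so each subplot is visually distinct.
cmaps = ["Greys", "Blues", "Oranges", "Reds", "Purples", "Greens"]

# The human-readable emotion names from the dataset's label feature.
labels = feature_embeddings["train"].features["label"].names

for i, (label, cmap) in enumerate(zip(labels, cmaps)):
    # Select the rows for this emotion and draw a hexagonal bin plot.
    df_emb_sub = df_emb.query(f"label == {i}")
    axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"],
                   cmap=cmap, gridsize=20, linewidths=(0,))
    axes[i].set_title(label)
    axes[i].set_xticks([])
    axes[i].set_yticks([])

plt.tight_layout()
plt.show()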

Running this will do yet more mapping.

The essence of the code is that we’re creating a 2×3 grid of subplots. This layout allows us to create separate density plots for each emotional category. Then we define color maps to extract the emotional category labels from the dataset features. For each category, the code creates what’s called a “hexbin plot,” which is short for hexagonal bin plot. Here’s what you’ll get:

These are some test results! But what does that show us?

Interpreting Our Test Results

Interpreting the resulting density plots from the UMAP visualization can require a combination of understanding the visualization itself, the nature of the data, and a little bit of domain knowledge. So let’s consider some general guidelines on how to interpret the density plots. I’m going to frame these as test observables.

Test Observable: Cluster Separation. Look for clear separations or clusters of data points in the two-dimensional space. If different emotional categories are well-separated, it suggests that the hidden states are capturing distinct features for each emotion. On the other hand, if there’s significant overlap, it might indicate that the original features are less discriminative.

Test Observable: Density Variation. Observe the density of data points within each cluster. Higher-density areas might indicate a more concentrated group of data points, while lower-density areas could signify sparser regions. If there are dense and sparse areas within the same emotion, it might imply varying degrees of intensity or variations in the emotional category.

Test Observable: Overlaps and Boundaries. Pay attention to regions where clusters overlap. Overlapping areas might indicate similar characteristics shared by multiple emotional categories. You should also try to identify regions near the boundaries between clusters since these could represent ambiguous instances that are difficult to classify.

Test Observable: Outliers and Anomalies. Look for isolated data points or clusters that are far from others. These could represent outliers or anomalies that deviate from the main patterns. Understanding such instances might offer you insights into unique cases or misclassifications.

Test Observable: Correlations. If there are patterns of correlation or co-occurrence between certain emotions, it’s worth investigating further. For example, if two emotions often appear close to each other, it might suggest a connection between them.

A key point for all of the above is that your interpretation should always consider the domain context. Some emotional categories might naturally be closer due to semantic similarities. Knowledge about the emotions being classified can provide valuable insights into these patterns.

A key takeaway here is that interpreting density plots is a nuanced process that combines data analysis skills and a certain amount of domain expertise. It’s important to interpret the visualization in the context of your specific use case and to cross-reference your findings with other analyses and insights.

As an exercise, consider what patterns you see in our visual. Hint: look at the darker cluster of dots in each plot.

In the context of hexbin plots like the one we just created, darker colors correspond to higher point density. Each hexagonal bin in the plot represents a specific area in the two-dimensional space. The color of the bin indicates the density of data points within that area. Darker colors indicate a higher concentration of data points, while lighter colors suggest a lower density.

Contextualizing Our Test Results

The model we’re working with is a pre-trained language model (DistilBERT) and we’ve used it to extract hidden states from our text data. These hidden states are representations learned by the model during its pre-training phase, where it was trained on a large corpus of text data to predict masked words and capture contextual information.

Crucially, however, our model was not trained to know the difference between the emotions shown in the plot. That’s why the overall scattering might look broadly similar across the plots.

Put another way, the model hasn’t undergone specific training to differentiate between the emotions we’re analyzing. Instead, it’s learned general language patterns and context during its pre-training. The information captured in the hidden states reflects these learned patterns, which are now being visualized and analyzed for patterns related to emotions.

So that means our test results are … what?

Well, we’ve observed that the distribution of hidden states in the two-dimensional embedding space varies across different emotional categories. This variation suggests that the model has learned to differentiate between some emotions, as evidenced by the distinguishable clusters or patterns in the plot.

At the same time, we can certainly see that for certain emotions, there might not be a clear and obvious boundary between their corresponding clusters. This indicates that the model’s representations for these emotions might share some similarities or overlap to some extent.

This observation isn’t all that strange since it intuitively aligns with the complex nature of emotions and the challenges in precisely separating them based on language patterns alone.

That notion of “intuitively” is deliberate on my part. You have to be prepared for quantitative and qualitative aspects working together to form an assessment of quality. This is something specialist testers are used to anyway.

Train the Model

Now we can use the extracted hidden states as features to train a logistic regression model. Logistic regression is a simple and efficient classification algorithm (and scikit-learn’s implementation handles multiclass problems like ours) that doesn’t require a ton of computational resources, which is a good reason for choosing it here.

By training a logistic regression model on these hidden states, we would be leveraging the information captured by the pre-trained model to perform the classification task. This approach allows you to evaluate how well the learned hidden state representations translate into differentiating between emotions in a simple, interpretable model.

Let’s try this out. Add this import to the top of our script:
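That’s scikit-learn’s logistic regression:

from sklearn.linear_model import LogisticRegression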

Then add this code to the script:
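Something like this, with lr_clf as the classifier variable you’ll see referenced later:

# Give the optimizer plenty of iterations to converge on 768 features.
lr_clf = LogisticRegression(max_iter=3000)

# Fit (train) the classifier on the hidden states and their emotion labels.
lr_clf.fit(X_train, y_train)

# Accuracy on the validation split.
print(lr_clf.score(X_valid, y_valid))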

Here a logistic regression classifier is initialized with the specified maximum number of iterations; 3,000 in our case. The logistic regression algorithm will use these iterations to optimize its internal parameters to best fit the training data.

The statement with fit() is really important since that “fits” (trains) the logistic regression model using the training data. Remember that X_train contains the hidden states of the text data and y_train contains the corresponding emotion labels.

Crucially, the model learns to map the features (hidden states) to the target labels (emotions) during this training process.

After the model is trained, it’s evaluated using the validation dataset — X_valid and y_valid. The score() method calculates the accuracy of the trained model on the validation data. This accuracy metric provides an indication of how well the logistic regression model is performing in classifying emotions based on the hidden states.

You’ll likely see this output:


0.633

An accuracy of 0.633 (or approximately 63.3%) on the validation dataset indicates the performance of our logistic regression model in classifying emotions based on the hidden states.

An accuracy of 0.633 means that the model correctly classified 63.3% of the instances in the validation dataset. In other words, out of all the validation examples, about two-thirds were classified correctly based on the model’s predictions. An accuracy score of 0.633 suggests that the model is making meaningful predictions, but there’s certainly still room for improvement.

So, wait. If 50% is essentially equated to random guessing (a framing that really only holds for a balanced, two-class problem), then our measure of 63.3% is only about 13.3 percentage points better than random guessing. Yet keep in mind that we know our dataset is imbalanced. We looked at this in the previous post. We also know our dataset is multiclass because we have six emotions we’re considering.

Why does any of that matter?

Traditional accuracy, which measures the overall proportion of correct predictions, can be misleading in imbalanced/multiclass scenarios. This is because the presence of imbalanced classes can lead to skewed results where accuracy might appear high due to the dominance of a majority class, while the model’s performance on minority classes might be overlooked.

In an imbalanced and multiclass scenario, a high accuracy like 63.3% might be misleading if it’s driven primarily by the dominant class or classes. It’s possible that the model is performing well on the majority class while struggling with minority classes. The problem here is that the overall accuracy doesn’t provide a clear picture of this behavior.

To get a more accurate assessment of our model’s performance, particularly its ability to correctly classify instances from all classes, let’s figure out how we can compare it to something.

That something could be a baseline model that establishes a comparison point for our actual model’s performance. These baseline models follow simple heuristics, such as always predicting the majority class or randomly selecting a class, and serve as benchmarks to gauge the meaningfulness of our model’s predictions.

We can use what’s called a “dummy classifier” for this approach.

Add the following import to the top of your script:
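That import is:

from sklearn.dummy import DummyClassifier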

Then add the following:
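A sketch, with dummy_clf as an assumed variable name:

# A baseline that always predicts the most frequent class in the training data.
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)

print(dummy_clf.score(X_valid, y_valid))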

We’re creating a DummyClassifier instance with the strategy of “most_frequent,” which means it will always predict the majority class. Then, we’re fitting the dummy classifier on our training data and evaluating its accuracy on the validation data using the score() method.

This will give us the accuracy of the dummy classifier, which serves as a baseline to compare against our actual model’s performance. If our model’s accuracy is significantly better than the dummy classifier’s accuracy, it indicates that our model is providing more meaningful predictions than a simplistic majority-class predictor.

Your output will be:


0.352

An output of 0.352 for the dummy classifier’s accuracy indicates that the dummy classifier, which always predicts the majority class, achieves an accuracy of 35.2% on our validation data. This means that the majority class is present in about 35.2% of our validation samples.

Comparing this accuracy of 0.352 to our model’s accuracy of 0.633, we can observe that our model’s performance is significantly better than that of the dummy classifier. This suggests that our actual model is making predictions that go beyond simply predicting the majority class and is, in fact, capturing meaningful patterns in the data.

The difference of around twenty-eight percentage points between the two accuracy values indicates the extent to which our model is improving over the baseline heuristic provided by the dummy classifier.

Let’s Get Confused

Let’s visualize yet one more time.

We can gain deeper insights into the classifier’s performance by examining what’s called a confusion matrix. This style of matrix offers a comprehensive view of the alignment between the predicted and actual class labels. This can help shed some light on the model’s accuracy, misclassifications, and areas of strength across all classes.

A standard way to describe this is that by analyzing the counts of true positives, true negatives, false positives, and false negatives for each class, we can uncover patterns and relationships that highlight the classifier’s performance characteristics and guide improvements. In fact, however, that standard way is not quite accurate in our context.

Add the following to the top of your script:
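Assuming we lean on scikit-learn for both computing the matrix and displaying it:

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix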

This gets a little involved, so let’s create a function:
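Here’s a sketch of that function, using ConfusionMatrixDisplay as one way to render the heatmap described below:

def plot_confusion_matrix(y_preds, y_true, labels):
    # Normalize each row so values read as proportions of the true class.
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    ax.set_title("Normalized Confusion Matrix")
    plt.show()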

Here our function takes predicted labels (y_preds), true labels (y_true), and class labels (labels) as inputs. This function generates a normalized confusion matrix using the confusion_matrix() function.

Our code will display the confusion matrix as a heatmap. Now add the following to put all that to use:
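That usage looks something like this:

# Predict emotions for the validation data with our trained classifier.
y_preds = lr_clf.predict(X_valid)

# Human-readable class names from the dataset's label feature.
labels = feature_embeddings["train"].features["label"].names

plot_confusion_matrix(y_preds, y_valid, labels)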

Here we’re using the logistic regression classifier — that’s what lr_clf refers to — to predict labels for the validation data. Then we pass the predicted labels, true labels, and class labels to our function to visualize the performance of the classifier through the confusion matrix.

You’ll see something like this:

So we have a new set of test observations.

Interpreting Our New Test Results

Interpreting a confusion matrix involves understanding the different categories it represents and how they relate to the model’s predictions and the true labels. In our specific context, the confusion matrix provides insights into how well our model is performing for each emotion class.

But here’s the trick. A confusion matrix typically has four main components.

  • True Positive (TP): The number of instances where the model correctly predicted a positive class (correctly classified emotions).
  • True Negative (TN): The number of instances where the model correctly predicted a negative class (correctly classified non-emotions).
  • False Positive (FP): The number of instances where the model incorrectly predicted a positive class (misclassified non-emotions as emotions).
  • False Negative (FN): The number of instances where the model incorrectly predicted a negative class (misclassified emotions as non-emotions).

From these values, you can calculate various metrics that provide insights into a given model’s performance. Some of those metrics are:

  • Accuracy: The overall proportion of correct predictions (TP + TN) divided by the total number of instances.
  • Precision: The proportion of correctly predicted positive classes (TP) out of all instances predicted as positive (TP + FP). Precision gives you an idea of how well the model identifies true positives among the predicted positive instances.
  • Recall (Sensitivity): The proportion of correctly predicted positive classes (TP) out of all instances that are actually positive (TP + FN). Recall measures the model’s ability to identify all positive instances.
  • F1-Score: This is the harmonic mean of precision and recall, providing a balance between the two. It’s particularly useful when classes are imbalanced.
  • Specificity: The proportion of correctly predicted negative classes (TN) out of all instances that are actually negative (TN + FP).
  • False Positive Rate: The proportion of incorrectly predicted positive classes (FP) out of all instances that are actually negative (TN + FP).

By analyzing the values in these metrics from the confusion matrix, you can understand how well a given model is performing, which classes it excels at predicting, and where it might struggle. This information can guide further improvements and adjustments to the model.
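We won’t compute all of those here, but if you want per-class numbers for our classifier, scikit-learn’s classification_report is one convenient way to see precision, recall, and F1 in a single table. Treat this as an optional aside rather than part of the script we’ve been building:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the validation predictions.
print(classification_report(y_valid, y_preds, target_names=labels))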

Yet you see none of that as part of the matrix, right?

So let’s talk about what the matrix shows us.

In a confusion matrix visualization, each cell represents a certain category’s predictions versus the actual true labels. While the color intensity can give you a sense of density, it’s not typically used to directly indicate TP, TN, FP, or FN. Instead, the labels themselves and the alignment of the cells are used to interpret the matrix.

The confusion matrix represents the counts or proportions of instances that fall into various categories (true positive, true negative, false positive, false negative), but it doesn’t provide explicit labels for these categories.

But here’s a really important point: the concepts of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) are typically defined for binary classification tasks. These concepts are not directly applicable to multiclass scenarios.

So our confusion matrix provides a visual representation of how well our classifier is performing across different emotion categories. Each row represents the true emotion labels, while each column represents the predicted emotion labels. The values within the matrix indicate the normalized proportion of instances that fall into each combination of true and predicted labels.

But how do we interpret it?

Well, if you focus on the diagonal elements (from top left to bottom right), these represent the instances that were correctly classified. The larger the value on the diagonal, the better the classifier is performing for that emotion category. In our matrix, we can see that the diagonal values are generally higher, indicating that the classifier is performing relatively well.

The off-diagonal elements represent instances that were misclassified. For instance, if you look at the cell corresponding to “sadness” as the true label and “joy” as the predicted label (row 1, column 2), the value is 0.11. This means that 11% of instances that were actually “sadness” were misclassified as “joy.”

A few other things we can observe:

  • Love and Joy Misclassification: Instances labeled as “love” are sometimes misclassified as “joy” (row 3, column 2) with a value of 0.46. Similarly, “joy” is sometimes misclassified as “love” (row 2, column 3) with a value of 0.05. This suggests that there might be some similarity in the hidden state representations between these two emotions.
  • Anger and Fear Misclassification: Instances labeled as “anger” are sometimes misclassified as “fear” (row 4, column 5) with a value of 0.10, and vice versa (row 5, column 4) with a value of 0.09. This indicates that the model might have difficulty distinguishing between these two emotions, which could be due to overlapping linguistic patterns.
  • Surprise and Love Misclassification: “Surprise” is sometimes misclassified as “love” (row 6, column 3) with a value of 0.01. This suggests that the model might have trouble differentiating between these two emotions, possibly because they share some linguistic expressions.

These are some good test results!

Wrapping Up

Yikes, that was a long post and, as you can see, probably the most involved one of this six-part series. We were able to start relatively simple but the complexity grew as we moved on.

The positive side of this is that our code is effectively sending a dataset through a learning pipeline that involves tokenization, extracting hidden states, dimension reduction using UMAP, training logistic regression and dummy classifiers, and finally evaluating the trained logistic regression model using a confusion matrix plot.

With that, I feel I’ve introduced you to a broad spectrum of a particular context within artificial intelligence. Should I continue this series, this basis will serve us well. Even if this series stops here and now, I feel there’s been some value to taking people through this entire journey, trying to assume as little background knowledge as possible.

A key point for readers of a testing blog is to see that testing was front-and-center for everything we did in these posts. What I showed in these six posts is merely the tip of the iceberg for the overall context that testers need to start becoming familiar with.


