Text Trek: Navigating Classifications, Part 3

In this post, we’ll explore some particular datasets. The focus here is just to get a feel for what can be presented to you and what’s available for you to use. We’ll do a little bit of code in this post to get you used to how to load a dataset.

Keep in mind that in the first and second posts of this series, I focused on the fundamentals of what you have to do with data in order to pass it into a Transformer model. In this post, we’ll focus on what actual data will tend to look like. Then in the following posts, I’ll show how to feed this more realistic data into a Transformer model using the fundamentals previously discussed.

Some Ecosystem Context

First let’s talk about the Hugging Face Hub. This hub serves as a central repository and community platform for sharing, discovering, and collaborating on various models and datasets related to natural language processing.

It’s perhaps worth noting that there’s also a PyTorch hub and a TensorFlow hub. Having a hub these days is pretty popular!

Relevant to this post, the Hugging Face Hub hosts a whole bunch of freely available models and datasets.

Datasets is also the name of a specific library within this ecosystem. Its goal is to provide a standard interface for data, including the means of loading and processing that data. The datasets themselves are used for training and evaluating various models.

In this post, I’ll show you a few examples. My goal here will be to show you some details about the data itself and I do this because understanding your test data is crucial. I have three examples that are set up slightly differently to show what you’re likely to encounter.

A Rotten Dataset

First let’s consider my rotten_tomatoes reviews dataset.

The description I provide tells you much of what you need to initially know:

This is a dataset containing 4,265 positive and 4,265 negative processed sentences from Rotten Tomatoes movie reviews.

Thus this dataset contains movie reviews that were published on the Rotten Tomatoes web site. Here the term “processed” means that the sentences in this dataset have gone through pre-processing and text preparation before being included in the dataset.

I also provide some information on the breakdown of the data:

  • Training Data: 8,530 rows
  • Validation Data: 1,066 rows
  • Test Data: 1,066 rows

Being attentive to these data “splits” is essential because they signify specific divisions of data available for distinct stages in your analysis.

  • The training data is used to train a machine learning model.
  • The validation data is used to tune and evaluate a model during training.
  • The test data is used to assess a model’s real-world performance after training.

Now let’s consider a few qualities of this data.

Checking Our Data Quality

If we add up the number of samples in all three splits (8530 + 1066 + 1066), we get a total of 10,662. This certainly matches the total number of samples reported in the dataset.

Therefore, the split breakdown is consistent with the total number of samples reported in my description.

The description mentions 4,265 positive sentences along with an equal number of negative sentences. Thus we have 4265 + 4265 = 8530, which is consistent with the total training count mentioned in the description and thus does indicate an even split of 50% positive and 50% negative samples.
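The arithmetic above is easy to script. As a minimal sketch, using only the figures quoted from the description (the variable names are purely illustrative), this checks that the splits sum to the reported total and that the stated class counts imply an even split:

```python
# Figures copied from the dataset description quoted above.
split_rows = {"train": 8530, "validation": 1066, "test": 1066}
stated_positive = 4265
stated_negative = 4265
reported_total = 10662

total_rows = sum(split_rows.values())
stated_sentences = stated_positive + stated_negative
positive_share = stated_positive / stated_sentences

print(total_rows == reported_total)             # True
print(stated_sentences == split_rows["train"])  # True
print(positive_share)                           # 0.5
```

Note that this only confirms the description agrees with itself. It says nothing yet about the data.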

So that all sounds great, assuming the person who wrote the information (me, in this case) was accurate.

As a tester, of course, you would actually want to verify all of this by looking at the data directly. Specifically, you would want to inspect the dataset to check the class distribution in the train split to see if it aligns with my claim in the description of 50% positive and 50% negative. I’ll show you how to do that when we load the dataset.

I bring all this up because you will encounter datasets that have discrepancies in what they describe and what they actually provide. It’s critical to look closely at whatever data you or your team might be using.

Checking Our Data Quantity

Now let’s ask this: is what we’re seeing a good amount of data?

The size of the dataset — whether it’s considered “good enough” or not — really depends on the specific task you’re working on and the complexity of the model you plan to use for that task.

In general, larger datasets can potentially lead to better model performance. In the case of the Rotten Tomatoes movie reviews dataset — 8,530 samples in the train split and 1,066 samples each in the validation and test splits — we have what’s considered a reasonable amount of data, particularly for certain sentiment analysis tasks like this one.

Why do I say that, though? Let’s break it down a bit.

We can calculate the proportions of the data splits relative to the total dataset size.

  • Training: (8,530 / 10,662) × 100 ≅ 80%
  • Validation: (1,066 / 10,662) × 100 ≅ 10%
  • Test: (1,066 / 10,662) × 100 ≅ 10%
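The same proportions can be computed with a small helper. This is just a sketch, and split_percentages is a name made up for illustration rather than anything from a library:

```python
def split_percentages(split_rows):
    """Return each split's share of the total row count, as a percentage."""
    total = sum(split_rows.values())
    return {name: round(100 * rows / total, 1) for name, rows in split_rows.items()}


# Row counts from the dataset description.
rotten_tomatoes = {"train": 8530, "validation": 1066, "test": 1066}
percentages = split_percentages(rotten_tomatoes)

print(percentages)  # {'train': 80.0, 'validation': 10.0, 'test': 10.0}
```

Having this as a function pays off once you start comparing several datasets, since only the row counts change.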

There are some widely accepted “good splits” in terms of data distribution.

Thus having 80% of the data in the train split and 10% each in the validation and test splits is generally considered a good distribution for a sentiment analysis task. It indicates that we have a substantial amount of data for training and that we have reserved a reasonable portion for model validation and evaluation.

As a tester in this context, if I were to put all of the above into a nice concise statement of confidence to my delivery team, I might say something like: “I’ve confirmed that our data distribution has been set up to mitigate overfitting and allow us to assess our model’s generalization performance effectively.”

You might wonder whether it makes sense for the test and validation splits to have the same number of rows (1,066), or whether the breakdown should differ between the two sets. Put another way: is having an equal breakdown a good idea?

An equal split like this can be acceptable depending on the dataset’s characteristics and how the data was split. However, it’s worth noting that this equal split isn’t a strict requirement. It’s a truism that the choice of the number of samples in each split should be driven by the specific needs of the task and the available data.

While having both test and validation splits with the same number of rows isn’t inherently problematic, it’s essential to ensure that the data is split in a way that maintains its representativeness and minimizes any potential bias.

What’s absolutely critical is making sure you have a sufficiently large number of samples in both splits to draw statistically significant conclusions about the model’s ability to generalize. This is one of the key ways that you would measure the performance (effectiveness) of the model. Thus this is a key way of understanding the quality present.
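One rough way to reason about “sufficiently large” is to look at how much uncertainty a split of a given size leaves around a measured accuracy. The sketch below uses the normal-approximation interval for a proportion, which is my choice of yardstick for illustration and not something the dataset prescribes. The accuracy of 0.85 is a hypothetical value.

```python
import math


def accuracy_margin(accuracy, n_samples, z=1.96):
    """Half-width of the normal-approximation interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(accuracy * (1 - accuracy) / n_samples)


observed_accuracy = 0.85  # hypothetical accuracy, for illustration only

for n_samples in (100, 1066, 10000):
    margin = accuracy_margin(observed_accuracy, n_samples)
    print(f"{n_samples:>6} samples: +/- {margin:.3f}")
```

The margin shrinks with the square root of the sample count, so a split of 1,066 rows pins down an accuracy figure far more tightly than a split of 100 rows would.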

What I just described above is a key ambit of testing in this context. It’s very much the idea of testing as a design activity, by which I mean testing that puts pressure on design. In this case, the “design” refers to the design of our test data.

The Sets of Data

The rationale for having three different datasets — train, validation, and test — is to facilitate effective model development, evaluation, and generalization. This is pretty much a truism in machine learning and data science tasks. Each dataset serves a specific purpose in what’s called the model development pipeline.

The training set is the largest dataset and it’s used for doing exactly what it’s named for: training the machine learning model. This set contains labeled examples on which the model learns the underlying patterns and relationships between input features and output labels.

Input features are the data representations that serve as the model’s inputs, such as text, images, or numerical values. Output labels are the corresponding target values or categories that the model is trying to predict or classify based on the input features.

The validation set provides an independent dataset to evaluate the model’s performance during training and to make decisions about the model architecture you want to use as well as how you want to configure that architecture’s particular hyperparameters.

Hyperparameters are adjustable settings or configuration choices made before training a machine learning model. These parameters influence the learning process and performance of the model. They are what you adjust if you find the learning process and/or performance of the model is of lower quality than you would like.

The test set is a completely unseen dataset that’s used to assess the final performance of the trained model. This set provides an unbiased estimate of the model’s ability to generalize. That’s the case because this is data that the model has never been exposed to.

The upshot here is that the separation of data into these three sets is crucial for ensuring that the model is well-trained, optimized, and capable of generalizing to new, unseen data. Without these distinct sets, the model’s performance evaluation might be biased in various ways. That can lead to an inaccurate estimation of the real-world capabilities of the model.

We’re actually going to practice loading this dataset in a bit but let’s take a look at an entirely different dataset just to get a feel for what can be different.

A Verification Dataset

Now let’s check my scifact dataset.

With a dataset like this, you would be operating in a context where you want to help people verify scientific or policy-based claims. This kind of verification is predicated on the process of proving or disproving — with some degree of certainty — claims made in scientific research papers or in public statements of policy.

The classification task here would be to classify the veracity of an input claim as verified against a corpus of documents that support or refute the claim. A “claim” in this context is usually defined as an atomic factual statement expressing a finding about one aspect of an empirical entity or process, which can be verified from a single source.

This is a very active area of research that’s motivated by the proliferation of misinformation in political news, social media, and on the broader web.

Unlike most sentiment analysis datasets that involve just “positive” and “negative” polarities, here we’re looking at a dataset that categorizes based on a relationship between two aspects — a claim and its evidence. That relationship will be framed as support, contradict or what amounts to undetermined. Incidentally, the rationale for why this might be useful is covered in the paper “Fact or Fiction: Verifying Scientific Claims.”

In this context, “support” and “contradict” are not related to sentiments, and therefore, the term “polarity” often used in sentiment analysis would not be applicable in this context.

Let’s look at the details of this dataset since it differs structurally from the Rotten Tomatoes one a little bit.

Data Subsets

This dataset has a subset called “corpus” that has a train set of 5,183 rows. Another subset is called “claims” and that one has a train set of 1,261 rows, a test set of 300 rows, and a validation set of 450 rows.

Okay, so first it’s probably worth asking: why have two such sets?

  • The “corpus” subset is a larger training set and it contains a diverse range of data that can be used to train a claim verification model more comprehensively. Having a substantial amount of data in the training set can be beneficial because it allows the model to learn patterns and generalizations better.
  • The “claims” subset is split into three parts, just as we looked at with the Rotten Tomatoes reviews dataset and the rationale for this subset is the same here as it was there.

But that still leaves open why we might have two sets.

  • The “corpus” subset is larger and more comprehensive because it contains a wider variety of claims and their corresponding verifications. However, this kind of dataset might not have labels or specific claim-veracity pairs, which can make it unsuitable for direct training.
  • The “claims” subset is a more curated and labeled dataset specifically created for training and evaluating claim verification models. It’s smaller than the “corpus” subset but has the necessary annotations needed for supervised learning.

Data Breakdown

Similar to what we did for the Rotten Tomatoes reviews dataset, let’s do a quick calculation to verify if the split breakdown for the scifact dataset makes sense.

As per my description on the dataset page, you’re told the total data in the train set for the “corpus” subset is 5,183. The “claims” subset totals 2,011 rows: the train set (1261) plus the test set (300) plus the validation set (450). By adding up the number of samples in each subset, we see that the total number of samples matches the total number of samples reported in the dataset (2011 + 5183 = 7194).

Therefore, the split breakdown for the scifact dataset is consistent with the total number of samples in the dataset as provided on the dataset description.

In terms of percentages, here things are a little different than with the Rotten Tomatoes dataset.

  • Training: (1,261 / 2,011) × 100 ≅ 62.7%
  • Validation: (450 / 2,011) × 100 ≅ 22.4%
  • Test: (300 / 2,011) × 100 ≅ 14.9%

Is this good? Well, it’s hard to say. It could be fine. It could be problematic.

That’s certainly not very helpful so let’s consider how we can determine if it’s problematic or not.

The split should reflect the real-world distribution of the data, ensuring that each set (training, validation, and test) captures a representative sample of the data. If the data is not well-distributed across the sets, it could lead to biased model evaluation and that would lead to generalization issues.

It’s important to verify that the data in the training, validation, and test sets come from the same or similar distributions. If there are significant differences in data characteristics between the sets, the model’s performance on the test set might not be reliable.
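One simple, if crude, check is to compare the label proportions of two splits. The sketch below does that with the standard library only. The label lists are hypothetical stand-ins rather than the real scifact labels, and the function names are made up for illustration.

```python
from collections import Counter


def label_proportions(labels):
    """Map each label to its share of the given list."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


def max_proportion_gap(reference_labels, other_labels):
    """Largest absolute difference in any label's share between two splits."""
    reference = label_proportions(reference_labels)
    other = label_proportions(other_labels)
    all_labels = set(reference) | set(other)
    return max(abs(reference.get(l, 0.0) - other.get(l, 0.0)) for l in all_labels)


# Hypothetical stand-in labels for two splits.
train_labels = ["support"] * 50 + ["contradict"] * 30 + ["undetermined"] * 20
test_labels = ["support"] * 10 + ["contradict"] * 5 + ["undetermined"] * 5

gap = max_proportion_gap(train_labels, test_labels)
print(f"largest gap in label share: {gap:.2f}")
```

A small gap doesn’t prove the splits come from the same distribution, since it looks only at labels and not at the text, but a large one is a clear prompt to ask questions.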

This is actually a reason for having that wider corpus set. We can look at the wider distribution that the data in the claims set was pulled from. This is a really crucial aspect that gets left out of a lot of discussions on data quality in this context.

In terms of actually trying to figure out data characteristics, you can start by visually inspecting the data in the training, validation, and test sets. You can use visualizations like histograms, word clouds, and scatter plots to compare the distributions of claims, word frequencies, or any other relevant features across the sets. I’ll show some examples of this in the following posts.

You can also check for class imbalance in the datasets, especially in the training set. If one claim class dominates the training data, the model might become biased towards that particular class. That would mean the model is likely to have difficulty correctly classifying other classes. Techniques like oversampling, undersampling, or using class weights can help address class imbalance issues.
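To make the class weights idea concrete, here’s a minimal sketch of one common heuristic: weight each class by the total sample count divided by the number of classes times that class’s count, so that rarer classes get larger weights. The labels are hypothetical, and this is only one of several reasonable weighting schemes.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical, deliberately imbalanced training labels.
labels = ["support"] * 60 + ["contradict"] * 30 + ["undetermined"] * 10

weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

With a perfectly balanced dataset every weight comes out as 1.0, which is a quick sanity check on the formula.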

Going into all of that here would take us a bit too far afield but the key thing to note is why we even went down this road: because our data had a split that made us question it a bit. Testing is all about questioning.

Loading Some Data

Now let’s go back to that Rotten Tomatoes dataset and see about loading it up. If you’re going to follow along, you need to install the datasets library.

pip install datasets

This library provides a way for you to load a dataset by its name. Create a Python script and put the following in it:

If this is the first time you’ve loaded my dataset, you’ll see various things download:


Downloading builder script
Downloading readme
Downloading data
Generating train split: 8530 examples
Generating validation split: 1066 examples
Generating test split: 1066 examples

When you run that command, the library looks for a “builder script” that corresponds to the “rotten_tomatoes_reviews” dataset and runs the necessary functions to download the data. Here the script being referenced is my rotten_tomatoes_reviews.py.

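Could you run that script yourself to download the data? You absolutely could. You could save the script to your local machine and pass its local path to load_dataset in place of the dataset name, at least in versions of the library that support builder scripts.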

The fact that you don’t have to take that second approach shows you one nice aspect of the Hub, which makes datasets available to you and easy to load.

With either approach, the train, test, and validation splits of the Rotten Tomatoes movie reviews will be downloaded as separate datasets. With both approaches, you will end up with three files:

  • rotten_tomatoes_reviews-train.arrow
  • rotten_tomatoes_reviews-validation.arrow
  • rotten_tomatoes_reviews-test.arrow

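By default, these will be stored in your home user directory under .cache/huggingface. This is the case on all operating systems, unless you’ve redirected the cache with an environment variable such as HF_HOME or HF_DATASETS_CACHE. This is also the case for any dataset that you load in this way.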

You might not be familiar with the arrow format type that those files are provided as. Arrow is an in-memory columnar — thus table-based — data format. It was specifically designed to speed up data analytics and data processing workloads.

For our purposes here the format of the data doesn’t matter that much. But it’s probably worth noting that you’ll find lots of datasets that have different formats. For example, there’s a reviews sentiment analysis dataset that uses CSV. Another dataset does poem sentiments and that one uses TSV. A popular emotions dataset uses JSONL.

JSON is a data format that represents data as a collection of key-value pairs whereas JSONL — JSON Lines — is a format where each line of the file represents a single JSON object and each object is a separate, self-contained entity.
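The difference is easiest to see when parsing. In this sketch the records are made up for illustration and aren’t taken from any of the datasets mentioned. A JSON document is parsed in one go, whereas a JSONL document is parsed one line at a time.

```python
import json

# The same two hypothetical records, first as one JSON array...
json_document = (
    '[{"text": "i feel great", "label": "joy"},'
    ' {"text": "i feel lost", "label": "sadness"}]'
)

# ...and then as JSON Lines: one self-contained object per line.
jsonl_document = (
    '{"text": "i feel great", "label": "joy"}\n'
    '{"text": "i feel lost", "label": "sadness"}\n'
)

records_from_json = json.loads(json_document)
records_from_jsonl = [
    json.loads(line) for line in jsonl_document.splitlines() if line.strip()
]

print(records_from_json == records_from_jsonl)  # True
```

Because each JSONL line stands alone, a large file can be streamed record by record without holding the whole thing in memory.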

There’s a climate sentiment dataset that uses a format called parquet, which, like arrow, might be a format you’re not as familiar with but is fairly common in this context. Essentially parquet is another columnar storage format that’s optimized for working with large datasets and was designed to integrate with distributed processing frameworks.

Finally, just to show you one possible variation, if you’re using local data files, you can also load up your data like this:

I call this out because some datasets that you look at on the Hub will be just a set of files in some specific format for you to download which you then have to load up yourself.

Investigate Our Data

Let’s take a look at what our dataset object actually is by printing it.

You’ll get the following output:


DatasetDict({
  train: Dataset({
      features: ['text', 'label'],
      num_rows: 8530
  })
  validation: Dataset({
      features: ['text', 'label'],
      num_rows: 1066
  })
  test: Dataset({
      features: ['text', 'label'],
      num_rows: 1066
  })
})

Thus you essentially get a dictionary data structure where each key corresponds to a different split and each split is a Dataset instance.

We can use the usual dictionary syntax to access an individual split, keyed by a split name such as “train”, which returns an instance of the Dataset class.

This object instance behaves like an ordinary Python array or list. To show that, you can query its length with len().

Obviously this isn’t terribly useful since we can see the num_rows from our Dataset objects but it does reinforce the point that there’s no “magic” going on behind the scenes here. We’re dealing with standard data structures.

In the output, you might have noticed those “features” that are listed. Those are also referred to as the column names. You can see that by checking the column_names attribute of the train split.

You’ll see this output:


['text', 'label']

This is telling you the names of the columns present in the train split of the dataset. In this case, the train split contains two columns. The “text” column contains text data that represents the movie reviews in the Rotten Tomatoes dataset. Thus each row in this column corresponds to a movie review.

The “label” column contains the labels associated with each movie review. For sentiment analysis tasks like the Rotten Tomatoes dataset, the “label” column typically contains binary labels, indicating whether the movie review is classified as positive or negative.

Now let’s go back to that specific wording of “features.” We can see what data types are being used under the hood by accessing the features attribute of a Dataset object.

Here “features” refers to the data columns or attributes present in the dataset. These represent the input information used to train or analyze a machine learning model. Think of the model as asking: “What features of this problem are relevant for me to consider?”

You’ll get the following output:


{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}

This is telling us the data types and properties of the columns in the train split of the Rotten Tomatoes dataset.

The data type of the “text” column is a value type of string (dtype='string'). The id=None part indicates that the column doesn’t have a unique identifier and this means that each entry is a standalone text value.

The data type of the “label” column is a ClassLabel. This represents categorical labels. What we see is that there are two possible class labels: “neg” (negative sentiment) and “pos” (positive sentiment). Once again, the id=None part is indicating that the column doesn’t have a unique identifier.

Reinforcing that we’re dealing with a standard data structure here, we can query a single example from the data by its index, such as index 0 for the first row of the train split.

You’ll get the following output:


{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}

What this shows is a single row that’s represented as a dictionary, where the keys correspond to the column names.

What I hope you can see here is that what this initial bit of investigation has done is given us some understanding of our data and how it’s constructed. This is investigation that you can generalize to any dataset you work with and it’s crucial to get in the habit of doing this.

Remember that one thing we did want to do, as I stated earlier, is check the balance of our class distribution between positive and negative. The easiest way to do this is to use Pandas.

pip install pandas

Here’s a script that shows how to do this:
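What follows is a minimal sketch of the counting step rather than a definitive script. To stay self-contained it uses a handful of invented rows in place of the real train split. With a loaded dataset, the data frame could instead come from the split’s to_pandas() method.

```python
import pandas as pd

# Hypothetical stand-in for the rows of the train split.
train_rows = [
    {"text": "a gripping , heartfelt story .", "label": 1},
    {"text": "dull and overlong .", "label": 0},
    {"text": "a pleasant surprise .", "label": 1},
    {"text": "a tedious mess .", "label": 0},
]

train_df = pd.DataFrame(train_rows)

# Count how many rows carry each label value.
class_counts = train_df["label"].value_counts()
print(class_counts)

# The split is balanced when every class has the same count.
is_balanced = bool(class_counts.nunique() == 1)
print(is_balanced)
```

Swapping the stand-in rows for the real train split gives the counts for the actual data.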

When you run this, you should get the following output:


1    4265
0    4265

The class distribution indicates that there are 4,265 samples for each class in the “train” split of the Rotten Tomatoes dataset. This confirms that the dataset is indeed balanced, with an equal number of positive (class 1) and negative (class 0) samples.

Again, I can’t stress enough how important it is to rely on the actual class distribution in the dataset rather than assuming an even split based on what was stated in a project or dataset description. Having accurate information about the class distribution helps in properly understanding and working with the dataset during model training and evaluation.

An Emotional Dataset

Finally, let’s consider my Emotions dataset.

This is actually the one we’ll work through a bit as it’s a very common example. While I’ll focus on this dataset, particularly in the next two posts, please know that you can apply many of the same techniques I will show you to the Rotten Tomatoes example, should you wish to compare and contrast. A good exercise for you would then be to think about how to apply the same concepts to the scifact dataset.

In the Emotions dataset, we see another slight difference in the dataset construction. Specifically, this data comes in split (20,000 rows) and unsplit (417,000 rows) variants.

It generally makes sense to provide the dataset in both split and unsplit versions in the context of using Hugging Face datasets. The reason for this is that the split and unsplit versions cater to different needs and preferences of users of your dataset, which allows for greater flexibility in how the data is used.

The split version of the dataset typically contains a smaller subset of the data, which is more manageable for certain purposes. These subsets are partitioned into the now no-doubt-familiar train, validation, and test sets. This makes it easier for users — including yourself! — to quickly load and experiment with the data.

This can be especially helpful during the initial stages of model development when you want to quickly prototype your sentiment analysis model or perform initial tests. What this means is that the split version is very suitable for quick experimentation.

The unsplit version of the dataset includes the entire dataset as a single entity, without any predefined train-test-validation splits. This version of the set contains a more comprehensive collection of examples, which can be valuable for training models that require a larger and more diverse dataset.
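If you start from an unsplit collection, you have to carve out the splits yourself. Here’s a standard-library sketch of an 80/10/10 split over a hypothetical list of rows. The function name, fractions, and seed are illustrative choices, and the library also has its own splitting helpers that you may prefer in practice.

```python
import random


def three_way_split(rows, train_fraction=0.8, validation_fraction=0.1, seed=42):
    """Shuffle a copy of the rows and slice it into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable

    train_end = int(len(shuffled) * train_fraction)
    validation_end = train_end + int(len(shuffled) * validation_fraction)
    return {
        "train": shuffled[:train_end],
        "validation": shuffled[train_end:validation_end],
        "test": shuffled[validation_end:],
    }


# Hypothetical unsplit data: 1,000 placeholder rows.
unsplit_rows = [f"example {i}" for i in range(1000)]
splits = three_way_split(unsplit_rows)

print({name: len(rows) for name, rows in splits.items()})
```

Shuffling before slicing matters: if the source data is ordered, say by label or by date, slicing it as-is would give splits that don’t resemble one another.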

So a simple way to frame this is that offering both split and unsplit versions of the dataset caters to different user requirements. Researchers and practitioners who need a quick start or want to perform rapid prototyping will likely prefer the split version. Those who require more extensive and diverse data for training complex models will likely find the unsplit version more suitable.

Your delivery team will likely fall into one of those categories. They may actually fall into both categories at different times.

What about our percentages for this dataset?

  • Training: (16,000 / (16,000 + 2,000 + 2,000)) × 100 ≅ 80%
  • Validation: (2,000 / (16,000 + 2,000 + 2,000)) × 100 ≅ 10%
  • Test: (2,000 / (16,000 + 2,000 + 2,000)) × 100 ≅ 10%

We see we have a similar breakdown to the standard one also employed in the Rotten Tomatoes dataset.

Thus having 16,000 examples in the training set and 2,000 examples each in the testing and validation sets is a reasonably good amount of data for sentiment analysis. However, again, it must be stated that the actual adequacy of the dataset depends on various factors, including the complexity of the sentiment analysis task, the nature of the data itself, and the architecture of the model that the data will be fed to.

Put another way: it’s not just a numbers game. The Rotten Tomatoes training set had 8,530 rows while the Emotions training set has 16,000 rows. Does that mean the Emotions dataset is “better”?

Well, it’s true that having a larger dataset often allows machine learning models to generalize better and learn more robust patterns. With more data, the model has the potential to capture a wider range of language variations and, in this case, sentiments. This can result in better performance on unseen data. However, diminishing returns can occur, where additional data might not significantly improve model performance at all but will increase computation time.

Good Enough Data

One thing I did with these datasets is look to see if the percentage breakdown is basically “good enough.” Since I’ll be using the Emotions dataset in the next post, let’s refine this a bit and figure out how we can determine if the dataset size is sufficient. Specifically, what things should we be considering?

Or, let’s say you’re a tester on a delivery team. And you’re helping the team understand whether the data being used is good enough to support the assessments that determine if the desired quality has been reached. In that case, what should you make sure the team is considering?

  • Complexity of the task. Sentiment analysis is generally considered a moderately complex natural language processing task. With 16,000 training examples, your team can train a reasonably good model, especially if the sentiment patterns are relatively straightforward.
  • Data quality. The quality of the data is crucial. If the data is noisy, biased, or contains labeling errors, the model’s performance may be negatively affected, even with a large dataset. In fact, the larger the dataset in this case, the worse the problem would be.
  • Model architecture. Learning models, such as those based on Transformer architectures, tend to perform well on natural language tasks, even with smaller datasets. These models are capable of learning intricate patterns and have achieved impressive results with fewer data points. This is something you can work on with your team, particularly if anyone feels “larger must be better.”
  • Hyperparameter tuning. The effectiveness of the dataset can also be influenced by hyperparameter tuning and model optimization techniques. This can focus discussions you have with your team as you experiment, regardless of your data size.

So what does all this tell us?

Well, nothing much, actually. At least not by itself.

While 16,000 training examples are a good starting point, it’s generally beneficial to have more data on hand, if possible. Consider that “corpus” dataset that was used in my example with claim verifications. We had a wider dataset to pull from if needed but started with a smaller dataset.

If your team notices that your model is not performing as well as desired — and that will be noticed due to effective testing — then you can work with your team to explore various techniques. One might be data augmentation, which involves artificially increasing the diversity of your training dataset by applying various transformations to the original data.
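As one very simple illustration of a text transformation, the sketch below swaps two randomly chosen words in a sentence. This is only one of many possible augmentations, the sentence is made up, and whether such a swap preserves the original label is an assumption you would need to test on your own data.

```python
import random


def random_swap(sentence, n_swaps=1, seed=None):
    """Return a copy of the sentence with n_swaps random pairs of words exchanged."""
    words = sentence.split()
    if len(words) < 2:
        return sentence  # nothing to swap

    rng = random.Random(seed)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


original = "i feel strangely calm about the whole thing"
augmented = [random_swap(original, seed=seed) for seed in range(3)]

for sentence in augmented:
    print(sentence)
```

Each variant keeps exactly the same words as the original, so the vocabulary is unchanged and only the word order differs.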

Another technique goes by the name “transfer learning.” The short version of this is you have a model trained on one task that’s repurposed for another related task. So instead of training a model from scratch, you start with a pre-trained model — usually on a large and diverse dataset — and fine-tune it on your specific task with a smaller dataset. Maybe you take, for example, that climate sentiment data and a model that’s been trained on it and see if that learning “transfers” to your emotions sentiment model.

Let’s Experiment!

In the next two posts in this series, we’ll draw together everything we learned by feeding the Emotions dataset I just showed you into a Transformer model.

The main thing to keep in mind with all of this is that evaluating model performance and iteratively improving that performance must be done through experimentation. And experimentation is testing.

Thus the concept of testing is front-and-center in all of this. Whether that’s done by a tester is largely up to the testers in the industry who have to show they are capable of working in this context.

This article was written by Jeff Nyman
