AI Testing – Generating and Transforming, Part 1

The idea of “Generative AI” is very much in the air as I write this post. What’s often lacking is some of the ground-level understanding to see how all of this works. This is particularly important because the whole idea of “generative” concepts is really focused more on the idea of transformations. So let’s dig in!

In this post, as well as in a second and third follow-up, I want to show the basis of Generative AI but also what serves as the foundation for that basis. I’ll use very simple Python and we’ll create some examples of transformations.

However, I do want to do one thing slightly differently than what you might normally encounter. Most examples in books or blogs show you how things work. Which is nice to see. But from a testing standpoint, I also want to show you how things don’t work or, at the very least, may not work quite as you might expect. More specifically, I want to show you situations where you can see the broad outlines of what is being attempted but where the quality is not quite there yet.

As with my previous posts in this series, I will try to assume as little knowledge as possible, but I do have to note that it’s very difficult to provide full context and explanations for everything. What I will make sure to do, however, is be clear with the terminology I’m using so that you have an idea of what to look up for your own reference.

As in my previous posts, while following along with Python on your own machine is not necessary since I will explain the code and its output, the only way to truly learn this stuff is to practice it. Typing it in and getting a feel for how it works is pretty crucial.

Here’s The Plan …

First I’m going to show a simple example using transformers. These are the basis of systems like GPT-3 and GPT-4 (and thus ChatGPT), so there’s likely a familiarity factor here for people. This very simple example will seem to work pretty well. Not the best, but pretty well. Enough so that you’ll probably be able to see that, with more effort, it could be made to work quite a bit better. The challenge for testing is often putting pressure on the design of what’s being tested. Thus, testing can help to answer: how exactly can we start to make it work better? Part of that, of course, is being able to have a shared understanding among your team of what “better” even means.

With that example focus in place, I’ll stop and give some background theory. That will all be in this post.

In the follow-up post, I’m going to explore one particular aspect we look at in this post, question-answering, in a bit more detail, giving you some idea of the behind-the-scenes aspects of how things actually work. Finally, in the third post, I’ll do an extended example, still using transformers, but showing how things can go a little off the rails when we scale up to bigger examples or more complicated data. And that’s a key concern that people, particularly those testing AI systems, have to be aware of: the simple stuff is simple, but scaling the simple stuff leads to lots of interesting situations.

Classify Text

So let’s start with a simple Python script here that will let us explore the ideas. We’ll start by focusing on text classification.

To do that, we have to have some text to classify in the first place. As in any testing context, test data is crucial. It’s not just a matter of “having data.” Often you want to construct data that will allow for the testing of a system under various data conditions. Thus your test data has to expose those conditions so that your tests can exploit them. With that in mind, here’s some text you can throw into your program:

text1 = """
There is nothing so intoxicating to the scientific mind as the
weird and unfamiliar. The fundamental basis of scientific thought is that an
observed truth that undermines one's understanding is yet the truth. If the
observation is not flawed, one's previous understanding must be. To the open
mind, this is not a crisis; it's an opportunity to form a new, more perfect
understanding of the world. So would it be abandoning science for a belief in
magic? Not necessarily. Rather, you would include magic in your understanding of
the physical phenomena that shape our world. Science is a path to knowledge -
one that must include and explain every observable fact, embracing all and
rejecting none. This applies to any endeavor where scientific thinking is important,
which most certainly applies to religious and historical studies. (Scientific thinking
is a type of knowledge seeking involving intentional information seeking, including
asking questions, testing hypotheses, making observations, recognizing patterns,
and making inferences.)
"""

As a good rule of thumb, you don’t just want one example or one bit of test data. Remember: test data is often about exposing specific conditions. So let’s give ourselves another bit of text. You’ll see why I included these two momentarily. They allow me to test for two very different conditions right at the start.

Incidentally, all text I’m using in these posts is from a course I teach on “Science and Religion as History.”

text2 = """
Ultimate causes are something a lot of people are
concerned with to an extent. This is an atavistic trait acquired long ago for
surviving in the physical world in which there are actually causes and effects - say,
proximity to lions and being eaten. We're built to look for causal relations
between things and to be deeply satisfied when we discover a rule with cascading
implications. We're also built to be impatient with the opposite - forests of facts
from which we can't seem to extract any meaning. No matter how much people
pride themselves on logic or intellect, if their desire to believe something is strong
enough, their minds will happily weave a fiction around those wishes until those
wishes become stubborn beliefs. Thus does an opinion transmute into a putative
fact. This process, if adhered to, often leads to compromising the discernment,
judgment, and caution mentioned earlier. It can allow us to see patterns that
aren't there while also missing patterns that clearly are there.
"""

Now let’s use some libraries to read in our text examples and do something with them. If you’re going to create your own Python scripts along with me, you’ll need the transformers and pandas libraries here to make the examples work. You can install these via pip and they should have no problems installing across operating systems.

You will also need at least one deep learning library, tensorflow or torch. Personally I recommend the latter. Again, either of these can be installed by pip although in my experience TensorFlow tends to offer more problems than PyTorch. Working with a Python distribution context like Anaconda can help a bit with that.

Here’s some code we can start with:

import pandas as pd
from transformers import pipeline

classifier = pipeline("text-classification")

We’re using Transformers!

No, not those Transformers. Here Transformers refers to a unified API that lets you use a standardized interface to a wide range of what are called transformer models. I’ll have to unpack that a bit as we go on.

One cool thing this does is let you easily switch between deep learning frameworks, like PyTorch or TensorFlow. In one of my previous posts, as an example, I had to switch between TensorFlow and PyTorch and show you two distinct versions.

If you’re curious about Pandas, I use that quite a bit in my data science series. I won’t talk about that too much in these posts since I’m just using it to assist with output.

When using the Transformers API, you start a pipeline and provide the name of the task that you’re interested in running on that pipeline. The pipeline abstraction is meant to simplify the process of using pre-trained models for a wide variety of natural language processing (NLP) tasks, which is what we’re going to be focusing on in this post.

Put another way, what the pipeline does is encapsulate all the necessary data processing, tokenization, model loading, and prediction steps that are required for specific NLP tasks. That sounds great — and it is. But it can also remove you from an understanding of what’s actually happening. By way of example to show why that might matter, let’s consider something really simple:

int.from_bytes(data[offset : offset + 2], byteorder="big")

This is nothing more than code that reads a two-byte word from some data. What that code is doing is actually abstracting away this:

(data[offset] << 8) | data[offset + 1]

Both code snippets achieve the same result, but the first code snippet uses a built-in Python function for the task rather than the bitwise shifting and OR operations that are used in the second code snippet. That can be good or bad depending on how much observability you need into what’s really going on.
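To make that equivalence concrete, here is a small check you can run. The sample bytes and the offset are made up purely for illustration.

```python
# Made-up sample data: four bytes, with the word of interest at offset 1.
data = bytes([0x00, 0x12, 0x34, 0xFF])
offset = 1

# High-level version: let Python assemble the big-endian integer.
high_level = int.from_bytes(data[offset : offset + 2], byteorder="big")

# Low-level version: shift the first byte up 8 bits, then OR in the second.
low_level = (data[offset] << 8) | data[offset + 1]

print(hex(high_level), hex(low_level))  # both are 0x1234
```

The first form hides the shifting; the second form shows it. That is the same trade-off the pipeline makes for you, just on a much smaller scale.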

Okay, so let’s get back to our code. What we’ve done here is create a text classifier. Think of our text classifier as a program that can understand and categorize different types of text. So imagine you have a large collection of documents or messages. And you want to automatically organize them into different categories based on their content. A text classifier can help you with exactly that.

So how do we do that? Add the following code to your script:

outputs = classifier(text1)

df = pd.DataFrame(outputs)
print(df)

That first line is doing quite a bit. The Transformers API will perform preprocessing steps on the input text to make it suitable for the underlying model. “Make it suitable” involves tokenizing the text, splitting it into smaller meaningful units — words, subwords, and so on — and converting those units into numerical representations that the model can understand. To these models, all text is really just numbers.
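If "all text is really just numbers" feels abstract, here is a deliberately tiny sketch of the idea. Everything in it is an assumption for illustration: the vocabulary is made up, and real tokenizers use subword rules and vocabularies with tens of thousands of entries rather than a simple split on spaces.

```python
# Toy vocabulary (hypothetical). Id 0 is reserved here for unknown words.
vocab = {"[UNK]": 0, "science": 1, "is": 2, "a": 3, "path": 4, "to": 5, "knowledge": 6}

def tokenize(text):
    """Split text into lowercase word tokens, dropping basic punctuation."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return cleaned.split()

def encode(tokens):
    """Map each token to its integer id, falling back to the unknown id."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

tokens = tokenize("Science is a path to knowledge.")
ids = encode(tokens)
print(tokens)  # ['science', 'is', 'a', 'path', 'to', 'knowledge']
print(ids)     # [1, 2, 3, 4, 5, 6]
```

The list of integers is what a model actually receives. Notice that any word outside the toy vocabulary collapses to 0, which is one reason real tokenizers fall back to subword pieces instead.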

What’s also happening here with that single line is that the pre-trained model loaded by the pipeline performs what’s called “model inference” on the preprocessed text. What this means is that it applies algorithms and some neural network architecture to understand the context, meaning, and patterns in the text. This involves analyzing the relationships between the input tokens and making predictions based on the learned representations when training occurs.

Finally, I’m using the pandas library to create a “dataframe” from the classifier’s outputs. A DataFrame in this context is a two-dimensional tabular data structure that consists of rows and columns. You wouldn’t be far off at all if you thought of that as similar to a spreadsheet or a SQL table. Doing this last bit isn’t strictly necessary, of course, but it helps structure and organize the classification results into a tabular format. That tends to make it easier for analysis or, in this case, presentation.

Go ahead and run the script.

The first time you run this code you’ll see a few progress bars appear because the pipeline automatically downloads the model weights from the Hugging Face Hub. As of the time of writing, these files are stored in your home (user) directory under .cache/huggingface/hub. The same relative location applies on all operating systems, although Windows will display the path with backslashes.
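If you want to see where that directory lives on your own machine, a small sketch like the following can help. The function name is my own, and I’m assuming the HF_HOME environment variable, when set, replaces the base cache directory; other overrides may exist that this sketch doesn’t handle.

```python
import os
from pathlib import Path

def default_hub_cache():
    """Best guess at where downloaded models end up on this machine."""
    # Assumption: HF_HOME, when set, replaces the ~/.cache/huggingface base.
    base = os.environ.get("HF_HOME")
    root = Path(base) if base else Path.home() / ".cache" / "huggingface"
    return root / "hub"

cache_dir = default_hub_cache()
print(cache_dir)
print("exists:", cache_dir.exists())
```

Using pathlib means the separators come out right for whatever operating system you happen to be on.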

With the above code, we’re running the model on the first text block (text1). You should see something like this:

        label     score
  0  POSITIVE  0.952205

Now change the script so that it processes text2. The output you see should be something like this:

        label     score
  0  NEGATIVE  0.929221

What we just did is generate some predictions. In the case of the first text block, the model is very confident that the text has a positive sentiment. The opposite is true for the second text block, which is confidently asserted to have a negative sentiment.
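Where does a score like that come from? My understanding is that, for a two-label classifier like this one, the model produces one raw number (a "logit") per label and those are squashed into probabilities with a softmax. Treat that as an assumption about the internals. The logits below are invented; only the mechanics matter.

```python
import math

def softmax(logits):
    """Turn raw scores into positive values that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical raw model outputs (logits) for the two labels.
labels = ["NEGATIVE", "POSITIVE"]
logits = [-1.4, 1.6]

probabilities = softmax(logits)
best = max(range(len(labels)), key=lambda i: probabilities[i])
prediction = {"label": labels[best], "score": probabilities[best]}
print(prediction)
```

From a testing standpoint, the thing to notice is that the score says how strongly the model prefers one label over the other. It says nothing about whether the label is actually right.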

Now you see why I included two distinct data conditions as part of my testing for this model. But notice I didn’t just use “obviously” positive or “obviously” negative sentiments in my data, which is important for realistic testing in this context. Put another way, I purposely used examples that were not of the common “sentiment analysis” type in order to make the distinctions clear. When you see these examples in books or blogs, they will usually be showing you a review of a movie or a restaurant or something like that.

Recognize Entities in Text

Let’s try something a little different, which is finding named entities in our text.

Let’s reframe our example. Keep the text variables in place but use the following code:

import pandas as pd
from transformers import pipeline

ner_tagger = pipeline("ner", aggregation_strategy="simple")

outputs = ner_tagger(text1)

df = pd.DataFrame(outputs)
print(df)

Here we’re using the Transformers API to perform named entity recognition (NER) on the text. Named entity recognition is a task you use when you want to identify and classify specific named entities or information in a text. These named entities can be things like person names, organizations, locations, dates, or other relevant information — as long as it can be considered a recognized “entity.”

The idea of a “NER tagger” is based on the concept of a pre-trained model that has learned patterns and structures in text to recognize and classify named entities. Essentially, all such recognized entities get tagged in some way.

The aggregation_strategy parameter controls how the model’s token-level predictions are combined into entities. What does that mean? Well, consider that the model doesn’t tag whole words or phrases; it assigns a tag to each individual token, and a single name can be spread across several tokens or subword pieces. An aggregation strategy determines how those per-token predictions get merged. Here we’re using “simple” and what that means is that neighboring tokens carrying the same entity tag are grouped into one entity with a combined score. One consequence to keep in mind is that if the pieces of a single word receive different tags, they are reported as separate entities rather than merged.
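As I understand the “simple” option, the grouping happens at the token level: neighboring tokens that carry the same tag are merged into one entity. Here is a toy sketch of that idea. It is a simplified stand-in and not the library’s implementation; the function name, the tags, and the sample tokens are all made up for illustration.

```python
def group_entities(tagged_tokens):
    """Merge neighbouring tokens that share a tag into one entity."""
    entities = []
    current = None  # the entity being built, if any
    for token, tag in tagged_tokens:
        if tag == "O":  # "O" marks a token that is not part of any entity
            current = None
            continue
        is_piece = token.startswith("##")  # "##" marks a continuation piece
        text = token[2:] if is_piece else token
        if current is not None and current["entity_group"] == tag:
            # Same tag as the previous token: extend the running entity.
            current["word"] += text if is_piece else " " + text
        else:
            # Different tag (or no running entity): start a new one.
            current = {"entity_group": tag, "word": token}
            entities.append(current)
    return entities

# Hypothetical per-token tags for a phrase like
# "the Fertile Crescent and the Babylonians".
tagged = [
    ("the", "O"),
    ("Fertile", "LOC"),
    ("Crescent", "LOC"),
    ("and", "O"),
    ("the", "O"),
    ("Babylon", "LOC"),
    ("##ians", "MISC"),
]
for entity in group_entities(tagged):
    print(entity)
```

Two adjacent LOC tokens merge into one entity, while two pieces of the same word stay apart when their tags disagree. That second case is worth keeping in the back of your mind as a tester.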

Running this script will download another model and some data.

If you run the above script, you will likely see this:


  Empty DataFrame
  Columns: []
  Index: []

Hmm. That seems like a whole lot of nothing, right? What this is telling us is that the model we’re using with the transformer isn’t recognizing any named entities in the text. The empty DataFrame suggests that no entities were identified or extracted from the input paragraph of text1. If you try it out, you’ll find the same happens with text2.

Why is this happening?

Well, in general, this kind of thing could be due to various reasons. The most obvious being the absence of named entities in the text. But it can also happen due to potential limitations of the model in recognizing certain types of entities or specific contexts.

As I mentioned earlier, these NER models typically are trained to identify and classify specific types of named entities like the names of people, the names of specific and distinct locations, the names of organizations, dates, and so on. However, the text data I’ve provided consists mostly of general statements, some philosophical ideas, and descriptions of scientific thinking. There really is no explicit mention of specific named entities.

So what we have here is a case where our test data falls outside the model’s ability to do what it’s designed to do. Put another way, we’re not exposing a data condition that our test condition — our script logic, in this case — can exploit. So now let’s add a third text item to our script:

text3 = """
Throughout its history this region was much coveted by the surrounding empires and
was often controlled by them, first by the Egyptians in the second millennium, then
by the Assyrians, the Babylonians, the Persians, the Greeks, and the Romans in the
first millennium. Geographically and politically the history of the Levant is
intrinsically tied to that of the “Fertile Crescent,” an expression referring to the
fertile territory with ample rainfall that stretches from Mesopotamia (present-day
Iraq and Iran) to Egypt, including the areas around the Tigris and Euphrates.
"""

Change the script to read text3. Your output should now be a little different. Likely you’ll see something like this:

     entity_group     score              word  start  end
  0          MISC  0.995219         Egyptians    127  136
  1          MISC  0.983718          Assyrian    175  183
  2           LOC  0.948971           Babylon    190  197
  3          MISC  0.784816            ##ians    197  201
  4          MISC  0.991524           Persian    207  214
  5          MISC  0.986021            Greeks    221  227
  6          MISC  0.953824            Romans    237  243
  7           LOC  0.998516            Levant    319  325
  8           LOC  0.814256  Fertile Crescent    364  380
  9           LOC  0.992953       Mesopotamia    472  483
  10          LOC  0.999754              Iraq    497  501
  11          LOC  0.999709              Iran    506  510
  12          LOC  0.999806             Egypt    515  520
  13          LOC  0.988859            Tigris    553  559
  14          LOC  0.990317         Euphrates    564  573

You can see that this time the pipeline detected some entities and also assigned a category to them, either LOC (location) or MISC (for miscellaneous). There are other categories as well, but the above are all that were found in my text. The scores are telling you how confident the model was about the entities it identified.

Testers Note Discrepancies

As a tester who is used to looking for discrepancies in data or outputs, you might be wondering what that “##ians” is. This represents a subword or subtoken that was generated by the tokenizer during the tokenization process.

Look at the text, look at the output and reason about this for a bit. Clearly this is related to the words “Egyptians,” “Assyrians,” “Persians,” and “Babylonians” in the text. But something might strike you as odd here.

See if you can spot it.

To help you spot it, I’ll note the pipeline also returns start and end integers and these correspond to the character indices where the answer span was found. Here “answer span” just refers to the specific range or segment of text within a given context that contains the predicted data relevant to whatever the model is supposed to be doing.

So what do you think? What’s the (potential) issue with “##ians”?

Well, it looks like the word “Babylon” runs from 190 to 197 and “##ians” runs from 197 to 201. That suggests the tokenizer split “Babylonians” into two separate subtokens, “Babylon” and “##ians”, and that the two pieces were then reported as separate entities, one tagged LOC and the other MISC. “Egyptians,” by contrast, shows up as a single complete entity. “Assyrian” and “Persian” sit somewhere in between: each is listed without its final “s,” so those words may well have been split too, just without a stray “##” piece showing up in the output.

So no doubt you would be asking: why might the tokenizer not have broken up Assyrians, Egyptians and Persians the same way as Babylonians?

The behavior of the tokenizer and the specific rules it follows for subword tokenization can vary depending on the tokenizer implementation and the underlying model or language-specific considerations. Without access to the specific tokenizer you’re using or its documentation, it can be really challenging to pinpoint the exact reason why “Assyrians,” “Egyptians,” and “Persians” were not tokenized in the same way as “Babylonians.”

Certainly we know that we’re using the Named Entity Recognition (NER) pipeline from the Hugging Face Transformers library. But what specific model are we using? Well, that showed up when we ran the script and it downloaded the model for us. If you look at that, you’ll likely see it says something like this:

dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf

This model is based on the BERT (Bidirectional Encoder Representations from Transformers) architecture and has been fine-tuned on the CoNLL-2003 dataset for named entity recognition. Were you to investigate this, you would find the tokenizer used in this model is the BERT tokenizer, which splits words into subword units based on WordPiece tokenization. The goal of WordPiece tokenization is to handle out-of-vocabulary words, capture morphological variations, and create a more compact vocabulary representation.
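To give a feel for how WordPiece-style splitting behaves, here is a toy sketch of greedy longest-match-first splitting, which is my understanding of how this family of tokenizers breaks a single word into pieces. The mini-vocabulary is entirely made up, so don’t read it as a claim about what the real model’s vocabulary contains.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical mini-vocabulary, invented for this example.
vocab = {"Egyptians", "Babylon", "##ians", "Roman", "##s"}

for word in ["Egyptians", "Babylonians", "Romans", "Levant"]:
    print(word, wordpiece(word, vocab))
```

With this made-up vocabulary, “Egyptians” stays whole, “Babylonians” becomes “Babylon” plus “##ians”, “Romans” becomes “Roman” plus “##s”, and “Levant” falls through to the unknown token. Whether a word is split depends entirely on what happens to be in the vocabulary.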

Does all of this help us answer our discrepancy?

Well, that’s where you might have to dig in further if it matters. The BERT tokenizer’s vocabulary may not include separate subword representations for “Assyrians,” “Egyptians,” and “Persians” but does have subword representations for “Babylonians.” Further, a WordPiece vocabulary generally keeps frequent words as whole tokens, while infrequent words end up split into subword units. It’s possible that “Babylonians” was less frequent than the other words in the data used to build the vocabulary, resulting in different tokenization behavior. That might make sense in relation to “Egyptians” but perhaps less so in relation to “Assyrians.”

So there’s obviously some ambiguity here. But does the ambiguity matter? Well, if it does for you and your team — and that would be found out by testing your model — then this is where you would refer to the BERT tokenizer’s documentation or explore the vocabulary and tokenization rules specific to the dbmdz/bert-large-cased-finetuned-conll03-english model. These are your oracles for testing and you would need to be aware of them.

But notice how much we just discussed by the observation of a simple potential discrepancy in our output. A discrepancy that many could easily miss or simply disregard. Quality and test specialists help people hone their intuitions for not missing and not disregarding such things.

Answer Questions From Text

Now let’s try something a little more complex, which is the ability to ask questions and get answers.

Here’s a script we can use to play around with this. (Again, keep the text variables.)

import pandas as pd
from transformers import pipeline

reader = pipeline("question-answering")

question = "What is science?"

outputs = reader(question=question, context=text1)

df = pd.DataFrame([outputs])
print(df)

Here we’re creating an instance of a question-answering model. This model is pre-trained to understand the context of text and be able to answer questions based on it. Given the question, the question-answering model will process the provided context — which will be one of the text strings — and try to identify the most appropriate answer based on the given question.

Go ahead and try to run it.

This will, once again, download some stuff.

Running the model on text1, you’ll probably get something like this:

        score  start  end               answer
  0  0.703252    585  604  a path to knowledge

As with the NER tagging we just looked at, the start and end integers correspond to the character indices where the answer span of text was found.

Now, here’s where you — as a putative tester — want to play around a bit. Going through the text examples I’ve shown you, I bet you can see the limitations of this depending on your data and what conditions it exposes for testing.

The technique we’re performing here is actually called “extractive question answering” because the answer is extracted directly from the text. This means the model’s task is to identify and extract a specific span of text from the given context that directly answers — or is at least “believed” (predicted) to answer — the posed question.

Crucially, this model doesn’t generate new text or paraphrase the answer at all. Instead, the model selects the most appropriate answer by selecting a contiguous substring from the context.
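That “contiguous substring” property gives you a cheap test oracle: the reported answer should be exactly the slice of the context between start and end. Here is a minimal sketch of such a check. The function name is mine, and the context and result are made up, though the result is shaped like the pipeline output shown above.

```python
def check_extractive_answer(context, result):
    """Return a list of problems found in a question-answering result."""
    problems = []
    start, end, answer = result["start"], result["end"], result["answer"]
    if not (0 <= start < end <= len(context)):
        problems.append("start/end fall outside the context")
    elif context[start:end] != answer:
        problems.append("answer text does not match the reported span")
    if not (0.0 <= result["score"] <= 1.0):
        problems.append("score is not between 0 and 1")
    return problems

# Made-up context and results for illustration.
context = "Science is a path to knowledge."
answer = "a path to knowledge"
start = context.find(answer)
good = {"score": 0.70, "start": start, "end": start + len(answer), "answer": answer}
bad = dict(good, answer="a road to knowledge")

print(check_extractive_answer(context, good))  # []
print(check_extractive_answer(context, bad))
```

If a real result ever failed this check, that would be worth investigating: either the model isn’t purely extractive or something changed the text between input and output.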

Let’s try a bit of a longer text. You can download the following: course-text.txt. Assuming you put that file in the same directory as your Python script, let’s use the following code:

import pandas as pd
from transformers import pipeline

with open("course-text.txt", "r", encoding="utf-8") as file:
    text = file.read()

reader = pipeline("question-answering")

question = "Who were the Omrides?"

outputs = reader(question=question, context=text)

df = pd.DataFrame([outputs])
print(df)

Notice the question there? You should get something like this when you run the script.

        score  start   end                 answer
  0  0.830422   2448  2469  Omri, Ahab, and Joram

Try these questions:

“What was the united kingdom?”

“What did Omri do?”

“Does the Book of Kings talk about the Omrid dynasty?”

“What did King Hezekiah do?”

Note that you might find the answer gets cut off in the output. You can do something like this to see the full answer:

print(df["answer"].values)

Obviously as a tester here you want to look at what questions seem to return something sensible and, of course, you can play around with this yourself by looking at the text. Keep in mind this text is test data; it’s nothing more than a series of data conditions that are designed to expose observations based on test conditions applied to it.

What If My Results Are Bad?

Let’s say your test results show you less than stellar output. What are some suggestions here from a quality enhancement perspective? After all, these are discussions you would have to have with your team.

One option is probably the most obvious one: get more training data. You can gather more labeled data specific to your task or domain. Clearly, having a larger and more diverse dataset for training can help the model learn better representations and that will improve its ability to answer questions accurately.

Another option is more and better data cleaning and/or preprocessing. You want to make an effort to have your input text data be clean, consistent, and relevant. This means you should perform necessary preprocessing steps like removing noise, correcting errors, and standardizing the text format if need be. Cleaning the data like this can help the model focus on only the relevant information and that can help it improve its understanding.
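What might a cleaning step look like in practice? Here is a minimal sketch that standardizes curly quotes and long dashes and collapses stray whitespace. The replacement table and the function name are my own choices for illustration; what counts as “noise” depends entirely on your data.

```python
import re

# Curly quotes and long dashes mapped to plain ASCII equivalents.
REPLACEMENTS = {
    "\u201c": '"',
    "\u201d": '"',
    "\u2018": "'",
    "\u2019": "'",
    "\u2014": "-",
}

def clean_text(text):
    """Standardize punctuation and collapse whitespace."""
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()

sample = "\nthe \u201cFertile Crescent,\u201d an expression\nreferring to   fertile territory\n"
print(clean_text(sample))
# the "Fertile Crescent," an expression referring to fertile territory
```

One caution for testers: cleaning changes character positions, so any start and end indices you recorded against the original text will no longer line up with the cleaned version.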

Yet another option is often referred to as “context expansion.” The idea here is you expand the context provided to the question-answering model. But what does that actually mean? As one example, instead of just using a single paragraph or sentence as the context, you can concatenate multiple relevant paragraphs or documents to provide a broader context. This can help the model gather more information, and that increases the chances of the model finding accurate answers.

Another technique is called an “ensemble approach.” There are actually many such approaches but the general idea is that you combine predictions from multiple question-answering models or techniques. An ensemble approach then aggregates the answers from those different models — or variations of the same model — and that can help improve overall accuracy. Crucially this can also mitigate individual model biases or errors.
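Here is one very simple way the aggregation could work: add up the scores each distinct answer receives and pick the answer with the highest total. This is only one of many possible schemes, and the three model outputs below are hypothetical.

```python
from collections import defaultdict

def combine_answers(predictions):
    """Pick the answer with the highest total score across models."""
    totals = defaultdict(float)
    for prediction in predictions:
        # Normalise lightly so trivial differences do not split the vote.
        key = prediction["answer"].strip().lower()
        totals[key] += prediction["score"]
    best = max(totals, key=totals.get)
    return best, totals[best]

# Hypothetical outputs from three different question-answering models.
predictions = [
    {"answer": "a path to knowledge", "score": 0.70},
    {"answer": "A path to knowledge", "score": 0.55},
    {"answer": "magic", "score": 0.80},
]
answer, total = combine_answers(predictions)
print(answer, round(total, 2))  # a path to knowledge 1.25
```

Notice that the single most confident model said “magic” and was outvoted. Whether that is the behavior you want is exactly the kind of question to raise with your team, since scores from different models are not necessarily comparable.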

One more option worth mentioning is called fine-tuning. Here you tune your pre-trained question-answering model on a domain-specific dataset or a dataset that’s closely related to your target domain. This tuning allows the model to adapt its knowledge and parameters to better align with the specific types of questions and context you’re working with rather than being more general in nature. This can lead to improved performance on your specific task but does — or at least can — remove some of the ability to generalize outside that task.

Summarize Text

Now let’s try something even more complex. Let’s do something that a lot of people use tools like ChatGPT for: summarizing text.

Broadly speaking, the goal of text summarization is to take a long text as input and generate a shorter version that is relevant but doesn’t inaccurately characterize the content. This is a much more complicated task than the previous things we’ve been doing since it requires the model to actually generate text. Not just that, it has to generate text that is recognizably coherent.

Let’s try this with our initial short text examples first. Let’s reframe our logic like this:

from transformers import pipeline

summarizer = pipeline("summarization")
outputs = summarizer(text1, max_length=88, clean_up_tokenization_spaces=True)

print(outputs[0]["summary_text"])

This will, perhaps not surprisingly, download yet more stuff.

Here I’m reading text1. Replace that with text2 and text3 to see what you get. See if what you get makes sense.

As a tester, one thing you want to check for is whether parts of the original text have been copied verbatim. Now let’s try it with the larger text I provided you.
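Before doing that, here's one rough way to check for the verbatim copying I just mentioned: measure how many runs of consecutive words in the summary also show up in the source. This is only a minimal sketch. The function name copied_ngram_ratio, the five-word window, and the sample strings are all my own assumptions rather than anything the Transformers library provides.

```python
def copied_ngram_ratio(source, summary, n=5):
    """Return the fraction of n-word runs in `summary` that also appear in `source`."""
    def ngrams(text):
        words = text.lower().split()
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0  # the summary is shorter than n words
    source_ngrams = set(ngrams(source))
    copied = sum(1 for gram in summary_ngrams if gram in source_ngrams)
    return copied / len(summary_ngrams)


# Hypothetical source and summaries, just to exercise the function.
source = "the quick brown fox jumps over the lazy dog near the quiet river bank"
extractive = "the quick brown fox jumps over the lazy dog"
abstractive = "a fast fox leaps over a sleepy dog by the river"

print(copied_ngram_ratio(source, extractive))
print(copied_ngram_ratio(source, abstractive))
```

A ratio near 1.0 suggests the model is mostly lifting text rather than writing new text. With that check in your pocket, here's the script for the larger text: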

from transformers import pipeline

with open("course-text.txt", "r", encoding="utf-8") as file:
    text = file.read()

summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)

print(outputs[0]["summary_text"])

Incidentally, I should note that the clean_up_tokenization_spaces parameter is related to the post-processing step of the summarization pipeline. When summarizing text, the pipeline utilizes tokenization and that involves splitting the input text into smaller meaningful units called tokens.

In some cases, this tokenization process can introduce additional whitespace around the tokens. So this parameter controls whether these extra spaces should be cleaned up or preserved in the final summary text. When it’s set to “True” like we have it, this instructs the summarization pipeline to remove any unnecessary whitespace introduced during the tokenization process. This cleanup step helps to make sure that the generated summary text is generally free from any unwanted spaces that could affect readability or formatting.
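To give a feel for what that cleanup does, here's a rough approximation in plain Python. This is not the library's code. The function name clean_up_spaces and the sample string are hypothetical, and I'm assuming the cleanup mostly amounts to removing the space that tokenization can leave before punctuation and contractions.

```python
def clean_up_spaces(text):
    """Approximate the cleanup: drop spaces left before punctuation and contractions."""
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
        (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


# Hypothetical decoder output with tokenization artifacts.
raw = "The model did n't finish , but it 's close ."
print(clean_up_spaces(raw))
```

Running that on the sample string turns the stray spacing into ordinary prose, which is the effect you're asking for when you set the parameter to True.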

Try to run the script on the larger text example. Did that come up with anything useful? I’m betting you might see some message in the output like this:

Token indices sequence length is longer than the specified maximum
sequence length for this model (4823 > 1024). Running this sequence
through the model will result in indexing errors

That error message indicates that the tokenized input sequence is longer than the maximum sequence length supported by the model that we’re using. In this case, the model’s maximum sequence length is 1024 tokens, but the input sequence that we provided has a length of 4823 tokens.

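There are various things you can do to get around this. Probably one of the easiest is to just truncate the input sequence. Be aware that the line below slices by characters rather than tokens, so it keeps far less text than the 1024-token limit would strictly allow. Put it right after you instantiate the pipeline: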

trimmed_text = text[:1024]

Now make sure that you pass that new variable to the summarizer:

outputs = summarizer(trimmed_text, max_length=88, clean_up_tokenization_spaces=True)

The above can show one of the most common sources of bugs I’ve seen in these contexts. A new variable is created to hold processed text but the original text is still passed to the model. You have to make sure that you’re passing the correct data around.

With that change in place, the script will likely work for you. But, as a tester, your alarm bells are probably going off right about now. What might be an issue with doing this?

Well, consider what we’re doing here. We’re essentially trimming our test data.

What this means is that we have a loss of information from the original text. Will that loss impact the ability to make a coherent summarization? Well, that’s hard to say, right? This is where you have to do good testing to figure out how much trimming matters.
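A first, very crude step in that testing is simply to quantify how much of the input survives the trim. Here's a minimal sketch of that. The function name retained_fraction is hypothetical, and the generated text is only a stand-in for the course text so that the example runs on its own.

```python
def retained_fraction(original, trimmed):
    """Return the share of the original's words that survive trimming."""
    original_words = original.split()
    if not original_words:
        return 1.0
    return len(trimmed.split()) / len(original_words)


# Stand-in for the course text: 5,000 short made-up "words".
text = " ".join(f"word{i}" for i in range(5000))
trimmed_text = text[:1024]

print(f"{retained_fraction(text, trimmed_text):.1%} of the words kept")
```

A number like this says nothing about quality on its own, but it does tell you how much of the document the summary never had a chance to see.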

As a spoiler alert, in most cases you will not want to do this kind of trimming. So what else might you do? You would likely have to use a model with a larger maximum sequence length. This would be the case if preserving the entire text is crucial for summarization, which it often is. There are various pre-trained models available with different sequence length limits.

But let’s say you, or your team, don’t have time to find and test different models.

Okay, so then another approach would be to split the text into smaller chunks. This way you keep the original text in its entirety but you split it into smaller chunks that can be processed individually. If you go this route, you would have to apply the summarization pipeline on each chunk separately and then combine the generated summaries to form a final summary.
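Here's a minimal sketch of that chunking idea. To keep it runnable without downloading a model, I pass the summarizing step in as a function and use a stand-in that just keeps the first three words of each chunk. The names split_into_chunks, summarize_in_chunks and fake_summarize are hypothetical, and the 400-word chunk size is only a guess at what would fit under a 1024-token limit.

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def summarize_in_chunks(text, summarize, max_words=400):
    """Summarize each chunk with `summarize` and join the partial summaries."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(summarize(chunk) for chunk in chunks)


# Stand-in summarizer so the sketch runs without a model:
# it just keeps the first three words of each chunk.
def fake_summarize(chunk):
    return " ".join(chunk.split()[:3]) + "."


text = " ".join(f"w{i}" for i in range(1000))
print(len(split_into_chunks(text)))
print(summarize_in_chunks(text, fake_summarize))
```

To use the real pipeline, you could pass in a small function that calls summarizer(chunk, max_length=45, clean_up_tokenization_spaces=True) and returns outputs[0]["summary_text"]. Keep in mind that splitting on word counts can cut a sentence in half, and that a summary stitched together from chunk summaries may not read as one coherent piece. Both are worth testing.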

What I hope you’re hearing is that it’s essential to experiment and find the best approach for your specific situation and context. And, as we know, “experiment” is just another way of framing the idea of “testing.”

Starting Small is Easy

It really is this simple to get started. Simple, but far from easy, right? While you can get started, you still have to figure out what’s actually going on when you use these models.

Even if you don’t have full context, what I hope you can see is that the Transformers library, its API, and the surrounding ecosystem make it relatively easy for practitioners to use, train, and share models. Keep in mind that in this post we downloaded a series of models that other people shared. We also downloaded pre-trained data for those models.

There is possible opacity here, though, right? If you don’t know what models are being used or what pre-trained data is being consumed, you are operating at a disadvantage when you perform testing to surface risks.

It’s usually best to start with easy-to-use pipelines like we did here. This allows you to pass text examples through the models and investigate the predictions in just a few lines of code. Then you can move on to the more complicated stuff like advanced tokenizers, more comprehensive preprocessors and various model classes that let you fine-tune the model handling via changing parameters that the models use to operate.

What’s worth noting here is that Transformers are the synthesis of several ideas. Primary among those ideas were “attention” and “transfer learning,” with a core outcome being the ability to scale up neural networks. So for the rest of this post, let’s talk a little about this theory.

Some Relevant History

Some folks may know that I gently chastise the testing community broadly for not knowing its own history, particularly how its craft has evolved. This was the basis of my “testing like it was 1980” post but also my more focused “history of automated testing” post. I suppose you could argue this is what my “history and science” series, as related to testing, was all about as well.

Even if you find the above useless, clearly I take that notion somewhat seriously. Thus, in that same vein, I’m going to consider a little of the relevant history and evolution of the concepts talked about here.

In December 2014, the paper “Sequence to Sequence Learning with Neural Networks” was published. This paper talks about an encoder-decoder or sequence-to-sequence architecture. These are really good for situations where the input and output are both sequences of arbitrary length. The encoder and decoder components referred to can be any kind of neural network architecture, as long as that architecture can model sequences.

Slightly earlier in 2014, the paper “Neural Machine Translation by Jointly Learning to Align and Translate” was submitted, although it went through many revisions. The idea presented here was that of “attention,” which allowed the decoder to have access to all of the encoder’s hidden states.

So, just to level set here, an encoder is a component of a model that helps understand and represent the meaning of text. Hidden states refer to the internal representations or knowledge that the encoder learns while processing the input text. The attention mechanism described in the paper essentially lets the decoder assign a different amount of weight, or “attention,” to each of the encoder states at every decoding step.
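To make that weighting idea concrete, here's a minimal sketch in plain Python. The hidden states are made-up numbers and the function names softmax and attend are hypothetical. I also use a simple dot product to score each encoder state, which is a simplification I'm assuming for readability; the paper learns its scoring function.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(decoder_state, encoder_states):
    """Weight each encoder state by its similarity to the decoder state."""
    # Dot product as the similarity score (a simplification).
    scores = [
        sum(d * e for d, e in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    # Context vector: weighted average of the encoder states.
    context = [
        sum(weight * state[i] for weight, state in zip(weights, encoder_states))
        for i in range(len(decoder_state))
    ]
    return weights, context


# Made-up hidden states for a three-word input.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
```

The second encoder state points in a different direction from the decoder state, so it ends up with the smallest weight. At the next decoding step the decoder state would change and the weights would be recomputed.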

In May of 2015, Andrej Karpathy published “The Unreasonable Effectiveness of Recurrent Neural Networks”. The relevance in this context is that recurrent neural networks were often used as the architecture for the aforementioned encoder-decoder methods.

In April 2017, the paper “Learning to Generate Reviews and Discovering Sentiment” was published by a group of researchers at OpenAI. They were able to get really good results on a sentiment classification task by using features extracted from unsupervised pretraining.

Without going into every detail, the paper showed that by training a language model on a large amount of unlabeled text data, the model could learn the underlying patterns and structures of language. This is referred to as “unsupervised pretraining” and it allowed the model to develop a deep understanding of how words and sentences are composed and how they convey meaning.

This was considered pretty important because the two-step process of unsupervised pretraining followed by fine-tuning led to highly accurate sentiment classification results, even without explicitly providing the model with labeled sentiment data. Even more specifically, however, what the paper showed was that teaching a language model about language in general allowed it to learn, more specifically, how to understand sentiment.

In June 2017, researchers at Google published a paper called “Attention Is All You Need” and this proposed an interesting neural network architecture for sequence modeling. This architecture was called the “Transformer.” And with the Transformer, it turned out that a fundamentally new type of modeling was introduced. The idea proposed to get rid of recurrence altogether and instead rely entirely on a special form of attention called self-attention.

Again, let’s make sure we level set on our terms and our distinctions here.

Recurrent Neural Networks (RNNs) are a type of neural network architecture that processes sequential data by maintaining hidden states that are updated at each step. RNNs are designed to capture dependencies and patterns in sequential data by using feedback connections.

The hidden state of an RNN is passed from one step to the next, allowing it to consider previous inputs when processing current inputs. This enables RNNs to handle variable-length sequences.
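Here's a toy sketch of that hidden-state hand-off. It uses a single number for the hidden state and fixed, made-up weights, whereas a real RNN would use learned weight matrices. The names rnn_step and run_rnn are hypothetical.

```python
import math


def rnn_step(x, h, w_x=0.5, w_h=0.8, b=0.0):
    """One recurrent step: mix the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h + b)


def run_rnn(sequence):
    """Feed a sequence of numbers through the cell, one element at a time."""
    h = 0.0  # initial hidden state
    for x in sequence:
        h = rnn_step(x, h)
    return h


# The same cell handles sequences of different lengths.
print(run_rnn([1.0, 0.0, 0.0]))
print(run_rnn([1.0, 0.0, 0.0, 0.0, 0.0]))
```

If you run this, you'll see that the influence of the first input fades as more steps go by. That hints at why connections across long distances are hard for a plain RNN to hold on to.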

Unlike RNNs, Transformers don’t rely on sequential processing. Instead, they process the entire sequence simultaneously through the mechanism I just mentioned: self-attention. Self-attention allows the model to weigh the importance of different elements within the sequence when making predictions.
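As a rough illustration of self-attention, here's a stripped-down sketch in which every position in a sequence scores itself against every other position and then becomes a weighted mix of the whole sequence. The vectors are made up and the function name self_attention is hypothetical. I'm also leaving out the learned query, key and value projections and the score scaling that a real Transformer uses.

```python
import math


def self_attention(sequence):
    """Let every position attend to every position in the same sequence."""
    outputs = []
    for query in sequence:
        # Score this position against all positions, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) for key in sequence]
        highest = max(scores)
        exps = [math.exp(s - highest) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # New representation: weighted mix of the whole sequence.
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, sequence))
            for i in range(len(query))
        ])
    return outputs


# Made-up vectors for a three-token sequence.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs = self_attention(sequence)
print(outputs)
```

The loop over positions is only there for readability. No iteration depends on the result of a previous one, which is what lets the whole sequence be processed at once rather than step by step.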

In 2018 we learned, via another paper called “Universal Language Model Fine-tuning for Text Classification,” of an effective transfer learning method that the paper itself was named for: Universal Language Model Fine-tuning (ULMFiT). This is a technique for transfer learning in natural language processing. Transfer learning refers to the process of leveraging knowledge gained from one task to improve performance on another related task.

This method showed that training Long Short-Term Memory (LSTM) networks on a very large and diverse amount of textual material could produce really effective text classifiers with very little labeled data. An LSTM network, incidentally, is a type of recurrent neural network architecture designed to handle the challenges around capturing and modeling long-term dependencies in sequential data.

Long-term dependencies are interesting. Consider that in sequential data, such as text or time-series data, elements — like words or events — are arranged in a specific order. Long-term dependencies refer to meaningful connections or influences between elements that are separated by a considerable distance within this sequence. For example, in a sentence, the meaning of a word at the beginning may depend on a word at the end. Similarly, in a time-series dataset, an event that occurs earlier may have an impact on an event that happens much later.

Notice how so much of this domain involves an explosion of terminology! It makes writing these posts — and keeping them concise — extremely difficult.

The 2018 work I just covered is really a follow-on from that April 2017 paper I mentioned earlier. By introducing a viable framework for pretraining and transfer learning in NLP contexts, ULMFiT provided the missing piece to make transformers capable of scaling. Being capable of scaling was a key driver to exposing such models to the wider world.

Rise of the Transformers

Two transformers that combined self-attention with transfer learning were then released, and these really started to put this concept on the map.

One of these is called Bidirectional Encoder Representations from Transformers (BERT). This uses the encoder part of the Transformer architecture.

The other is called the Generative Pretrained Transformer (GPT). This uses only the decoder part of the Transformer architecture.

What happened next was people worked to combine the overall Transformer architecture with the technique of unsupervised learning. These models removed the need to train task-specific architectures from scratch. That last point is really important.

So sometimes people ask: “What’s the big deal about Generative AI anyway?” Well, the history I just gave you gives you the focal points. What they lead to, however, is a very specific outcome. To wit, combining transformers with unsupervised learning reduces the need to train task-specific models from scratch because it allows us to teach a model about language by giving it a large amount of text without any specific task in mind. This pretrained model learns the patterns and meanings of words and sentences in a general way, capturing a deep understanding of language.

Then, when we have a specific task — like text classification or question answering that we looked at in this post — we can fine-tune this pretrained model using a smaller set of labeled examples.

This fine-tuning process helps the model adapt its knowledge to the specific task, making it more accurate and effective but without requiring us to start from scratch each time. In simpler terms, it’s like teaching a model about language first and then teaching it specific tasks, ostensibly making it “smarter” (better able to learn) and more efficient (able to apply that learning with better performance).

So for all those who worried about those automated test tools based on AI that would have to relearn applications all the time, the idea of transformers does, at least, suggest a possible way to scale that idea.

Wrapping Up

Okay, so in this post we did a little practice and we looked at a little theory. In the next post, I’m going to break down the idea of question-answering a bit more by looking at the basis of how it actually works. Then in the third post I’m going to expand on that idea quite a bit by breaking down what the Transformers API is actually providing for us by using an extended example that doesn’t fully rely on the Transformers API.

This path will, I believe, show you what happens when you try to “scale up” the ideas we looked at in this post. I’m hoping by taking this approach you’ll be able to see how variable quality can be in this context which will simply reinforce why testing is so important.

Stories from a Software Tester
