AI Testing – Generating and Transforming, Part 3

We come to the third post of this particular series (see the first and second) where I’ll focus on an extended example that brings together much of what I’ve been talking about but also shows the difficulty of “getting it right” when it comes to AI systems and why testing is so crucial.

In previous posts, we’ve been looking at hard-coded strings stored in variables or reading from relatively simple text files. Those contents served as the basis of our tokenization that we provided to a model.

Now let’s scale that idea slightly.

Before we get started, I want to put out a cautionary note here. The end result of this, from a quality standpoint, is going to be a little variable. And that’s putting it charitably. I mentioned this in the first post. I’m taking this approach for a very specific reason. I want to show that even with what seems to be a lot of work, you can end up with something that — at least seemingly — falls short. Yet without quite understanding exactly why.

This is front and center whenever you use testing to make determinations about perceived qualities.

The Scenario

It’s definitely possible to tokenize the contents of a web page (or a group of web pages) and then allow for questions to be asked based on the tokenized data. In fact, this is a common use case for natural language processing (NLP) applications. In my case study post, I talked about the idea of tokenizing a specification repository. In this post you’ll see how that idea could be carried out, albeit in a different context.

So, for now, let’s broadly say that this is the context your team is going to be working in: you’re scraping web content to provide a set of data that you want to allow people to query. You are going to be one of the test specialists that help the team ask good questions, make fewer assumptions, and consider how to provide a basis for testability.

What we ultimately want to determine are the risks of releasing this model to the public. Now, in this case, the risks won’t encompass things like bias or adversarial aspects. Here the risk is just that we’ll release something that’s effectively useless, making us look stupid. Most businesses don’t want to look stupid so we’ll consider this important enough for this post.

This post is actually based on some real-world work I’m doing where I’m providing a lot of material for a course that I teach to a language model. My students are provided an interface where they can query that model about my course material and, ideally, get useful answers.

The Plan

So how would you and the team go about implementing this functionality?

First you would clearly want to use a web scraper to extract the text content from the web page or web site. This will give you the raw text to work with. That raw text is effectively going to be your test data.

Then you would use a tokenizer to convert the raw text into numerical representations that can be used by some model that’s based on natural language processing. Depending on the specific tokenizer and model being used, this can involve lots of things like breaking the text into sentences or individual words, adding special tokens to mark the beginning and end of each sequence, and encoding the tokens as numerical IDs.

Once that’s done, you would use a pre-trained natural language processing model to answer questions about the tokenized text. This will involve passing the tokenized text and the question through the model, generating answer scores pairwise, and selecting the most relevant candidate answer.

Hopefully you can see the above as essentially what we’ve been doing in these posts so far. We’re just going to tie it all together here.

Finally, you would likely want to display the answer to the user in some sort of friendly format, such as a set of bullet points or a continuous sentence. You certainly don’t want to display it as a tokenized numeric representation and probably not even as a data structure-wrapped response, such as JSON or XML. (Unless, of course, your “user” happened to be an API that was consuming your information for a web service.)

Let’s Start Simple: One Web Page

I have a web page with some content that we can use. Let’s say we want to see if our question-answering model can deal with this content.

This document is actually a stream of consciousness document that I used to create material for a class I teach called “Science and Religion as History” and the text you see here was specifically collated as being relevant for a learning model.

That web page, and all of its content, is effectively one large set of test data. There are numerous data conditions inherent in the nature of the text, all of which could be tested for by asking specific questions — or generating prompts, if you prefer — and seeing if the model can come up with a relevant answer.

I’ve actually pulled out some related sets of data conditions from that larger document. There’s test data 001 and test data 002.

Please note that this is not just some random bits of text that I decided to pull out. As a tester, you have to be quite deliberate in the overall test data — as well as the subsets of that data — that you use. This is crucial in any testing context but I can’t stress enough how absolutely crucial it is in an AI context based on either generative or transformative text.

Scrape the Content

So let’s first do the scraping part I talked about earlier.

You will need the “scrapy” library in order to do this, which you can install via pip.

I’m going to talk about a lot of different Python examples here so, unlike in my other posts, I’ll suggest names for the scripts just so I can make it easier to reference them. Obviously what you call the scripts is entirely up to you.

Let’s create a Python script called spider.py and put the following in it:

This script is written as a Scrapy Spider, which is a class that defines how to perform scraping on a specific web site or web page. In this case, the spider creates a request to visit the specified URL and then extracts the HTML text from the response. To run this you need to run scrapy directly.

  scrapy runspider spider.py -o text_raw.json

The result of running this should be that you end up with a text_raw.json file.

Process the Content

As a tester, you’re certainly going to want to take a look at your raw text.

You might — or might not, depending on your experience — be surprised at how many people don’t actually look at the data that gets generated. Keep in mind this is our test data. So if your first instinct was to stop and look at that file that was generated, congratulations: you’re thinking like a tester!

Now that you’ve looked at the data, notice anything? And please do look at the generated text. Reason about it.

You’re probably noticing a whole bunch of HTML. Which isn’t terribly surprising given that we pulled the text from a web page. But we don’t want that HTML to be sent into the model. It would just obfuscate the actual text itself and certainly has no value for a model that will try to answer questions based on the text. So we’ll make sure to get rid of all that.

What else are you noticing about the text? Again, look at the text before reading on. Reason about what you’re seeing.

There are clearly a lot of “\n” characters. Getting rid of those is probably a good idea. While those newline characters make the text easy to read for a human, they are just a distraction for a learning model.

There are lines in the text like this:

  Let\u2019s consider another term: \u201creligion.\u201d

It would be a good idea to convert the Unicode elements into actual punctuation marks in the text. This would make the final text more readable and easier to work with. In the given example, the Unicode elements \u2019 and \u201c represent the apostrophe and open quote punctuation marks, respectively. Similarly, the Unicode element \u201d represents the close quote punctuation mark.

You’ll also see text like this:

Herna\u0301n Corte\u0301s (1520)

Again, it would be a good idea to convert the Unicode character combining marks into their corresponding characters in the text. In the given example, the letters “á” and “é” have an extra Unicode combining mark character, such as \u0301, which modifies the previous character to include an accent. We can normalize that data which is really helpful for learning models.
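To make both of those Unicode situations concrete, here’s a minimal sketch using only Python’s standard library. The sample string is my own stitched-together fragment, not a verbatim line from the scraped file, and NFC is just one reasonable choice of normalization form.

```python
import json
import unicodedata

# A JSON string as it might appear in the raw file, with escaped code points.
raw = '"Let\\u2019s consider another term. Herna\\u0301n Corte\\u0301s (1520)"'

# json.loads turns the \uXXXX escapes into real characters.
decoded = json.loads(raw)

# NFC composes a base letter plus a combining accent into one character.
normalized = unicodedata.normalize("NFC", decoded)

print(len(decoded), len(normalized))
print(normalized)
```

The two lengths differ because each letter-plus-combining-mark pair collapses into a single precomposed character.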

One thing that’s not initially obvious is the following:

<title>Document</title>

That’s just the HTML title element. That’s also something we’ll want to get rid of.

So what we want to do is preprocess our raw text to handle the situations described above. Note that what we’re doing here is enhancing the ability of our model to reason about the text but that goes hand-in-hand with adding better testability. Why? Because, like any good experiment, we’re removing the incidentals that should not play a part in the experiment.

You can use various tools to do this preprocessing. Here I’ll use one called Beautiful Soup to extract the relevant text content from the surrounding HTML elements.

You will need the “bs4” library for this to work, which is the name that pip uses to install Beautiful Soup.

Create another script called preprocess.py. Put the following in it:

You can see that this logic is handling the elements that we looked at earlier. Make sure to run that script; doing so will give you a text_processed.json file.
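If you want to reason about those cleanup steps in isolation, here’s a rough sketch of the same idea using only the standard library’s HTMLParser rather than Beautiful Soup. The class name, function name, and sample HTML are all made up for illustration, and a real page would likely need more care than this.

```python
import re
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping <title>, <script>, and <style>."""

    SKIP = {"title", "script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we're not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent blocks from running together.
    text = unicodedata.normalize("NFC", " ".join(parser.parts))
    # Collapse newlines and runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "<html><head><title>Document</title></head>"
    "<body><p>Let\u2019s consider\nanother term.</p></body></html>"
)
print(clean_html(sample))
```

The point for testing is that each cleanup rule (tags, title, newlines, combining marks) is a separate data condition you can check on a tiny input before trusting it on the full page.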

Tokenize the Content

Okay, so now that we have some preprocessed data on our hands, we can use a tokenizer to convert — or, rather, encode — the data into numerical representations that can be used by an AI model. In this case, we’ll use the tokenizer provided by the Hugging Face Transformers library.

If you’ve been through my other posts, you probably have it already but make sure you use pip to install the “transformers” library.

Create a file called transform.py and put the following in it:

That’s a lot of code! Take some time to look at it, reason about it. Don’t worry about every specific detail necessarily. But just get a feel for the flow of what’s going on.

Testers are trained to look for things that can differ, particularly around data. And one of the things that a tester will likely cue into with this code also happens to correspond to a key decision being made. I’m referring to this particular statement:

What “bert-base-uncased” means is that we’re using the tokenizer that belongs to BERT, which is a widely used natural language processing model. This particular tokenizer is a good default choice if you don’t have a specific use case or model architecture in mind. It comes with a pre-trained model provided by Hugging Face Transformers that’s based on the popular BERT architecture and is trained on a large amount of English text.

BERT (Bidirectional Encoder Representations from Transformers) is a technique for pretraining natural language processing models that was developed by Google. The “bert-base-uncased” refers to a particular pretrained version of the BERT model. “Base” refers to the size of the model, which has twelve layers and 110 million parameters. “Uncased” means that the model was trained on lowercased text and doesn’t differentiate between uppercase and lowercase letters.

Now, all the above being said, the choice of tokenizer really depends on the specific use case and downstream tasks you plan to use the tokenized data for. There might be other tokenizer models that are better suited for your particular problem depending on the nature of the input data and the model architecture.

Just to give an example, let’s say you’re working with text in a non-English language or a text that has very specialized vocabulary related to a specific domain. In that case, you may want to use a tokenizer model trained on that particular language or trained on a data set that uses the particular vocabulary.

One alternative to consider might be “gpt2”. This is a generative language model tokenizer that’s often used for text generation tasks and has a larger vocabulary than BERT. Another you’ll hear about is RoBERTa, which stands for Robustly Optimized BERT Pre-training Approach and uses “roberta-base” as its pretrained model. This is another transformer-based model with (at least according to some benchmarks) better performance than BERT.

How do you decide? You and your team would have to do some experimentation and comparison with different tokenizers to find the one that works best for your specific use case. When I say “experimentation,” read that as “testing.” Ask yourself: in this context, is it just the data scientists and the developers who should be doing this experimentation? Or is there a place for people with a test specialty to step in? I’m betting you know my answer to that!

Is the Tokenizer Aligned to the Model?

As a tester, you might wonder something here, as I certainly did when I began working in this context. If we use “bert” or “gpt2” or “roberta” as the tokenizer, does that mean we’ll have to use those models later on? Or is the tokenizer entirely distinct?

This, as it turns out, is a really good question!

In general, the choice of tokenizer doesn’t constrain you to use a particular model later on. Tokenization is a preprocessing step that converts raw text into numerical tokens that can be processed by various models. Most tokenizers — including the ones used in BERT, GPT-2, and RoBERTa — output a series of numerical token IDs and attention mask vectors, which I’ll talk about soon.

The point here is that these can be used as input to a variety of downstream models for tasks such as question-answering, which we’re going to be focusing on. What this means is that you can use any model that is compatible with the tokenization format produced by your chosen tokenizer, regardless of whether that’s a BERT, GPT-2, or RoBERTa model.

That being said, there are some caveats. Token IDs only have meaning relative to a tokenizer’s vocabulary, so a pre-trained model such as GPT-2 or BERT expects input produced by its own matching tokenizer. Feeding it IDs from a different tokenizer generally won’t work without additional modifications. So while tokenizer choice itself doesn’t necessarily constrain the choice of model architecture, the specific features of the chosen tokenizer can impact the overall training and performance of the downstream model.

Encoding Aspects of the Tokenizer

Going back to our code, let’s consider this specific part of it:

Here the add_special_tokens adds what are called [CLS] (start of sequence) and [SEP] (separator) tokens. What does that mean?

In BERT-style models used for question answering, the input is typically composed of two segments: a question and a context. The purpose of these special tokens is to mark the boundaries between these segments so that the model can distinguish between them. In this case, [SEP] indicates the separator between the question and context, while [CLS] indicates the classification token that’s added at the beginning of the input. Crucially, the [CLS] token is used by the model to generate a single output prediction for the entire input sequence.
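Here’s a toy sketch of that layout in plain Python, just to show where the special tokens land. The function name and the word-level “tokens” are my own simplification. A real tokenizer works on subword pieces and numeric IDs.

```python
def build_pair_input(question_tokens, context_tokens):
    """Lay out a question/context pair the way BERT-style models expect."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment ids: 0 covers [CLS], the question, and the first [SEP];
    # 1 covers the context and the final [SEP].
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(
    ["what", "is", "religion", "?"],
    ["religion", "is", "a", "term"],
)
print(tokens)
print(segment_ids)
```

The segment ids are what let the model tell which side of the first [SEP] a given token came from.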

The max_length truncates the tokenized sequence to the specified length. This is pretty important. The value of the maximum length in your code should depend on the requirements of the specific tokenizer and model you’re using. Different tokenizer models may have different requirements for the maximum length of input text that they can process.

BERT, for example, has a maximum length limit of 512 tokens per input sequence. This means that if you try to feed in text that exceeds 512 tokens, the sequence has to be cut down to 512 tokens, which the tokenizer does for you when truncation is enabled. GPT-2, on the other hand, has a maximum length limit of 1024 tokens per input sequence. RoBERTa has a maximum length limit of 512 tokens, similar to BERT.

It’s worth noting that the choice of maximum length can also affect the performance and training time of the downstream model as well as the amount of memory and computing resources required during training and model inference. These are all qualities to consider and thus relevant to a tester.

The truncation being set to “True” means the input sentences are, in fact, truncated to the maximum length specified. The padding is used to pad any sequences to a provided sequence length which, in this case, we’re setting as the maximum length. Finally, the return_attention_mask means to generate attention masks for the padded sequences.

Attention Masks?

This refers to creating a binary mask that helps the model focus on the relevant parts of the input while ignoring the padded tokens. This is particularly important in scenarios where input sequences have varying lengths and thus where padding is used to make them of equal length for efficient processing. The test data in our context here is like that since each paragraph is not of equal length.

The attention mask is typically what’s called a binary matrix with the same shape as the input sequence. It contains values of 1 for tokens that should be paid attention to and values of 0 for tokens that should be masked, which means ignored in this context. Thus for padded tokens, the corresponding mask value is set to 0, indicating that they shouldn’t be paid attention to by the model. Just to put a little meat on this, suppose we have an input sequence with a maximum length of six tokens:

["This", "is", "a", "test", "[PAD]", "[PAD]"]

Here “[PAD]” represents the padded tokens. The attention mask for this input sequence would look like this:

[1, 1, 1, 1, 0, 0]

The binary matrix would look like this:

[
  [1, 1, 1, 1, 0, 0]
]

Clearly this can get quite a bit more complex in structure but the core idea is as simple as what I just showed you.
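To see how truncation, padding, and the attention mask fit together, here’s a minimal sketch in plain Python. The function name and the token IDs are illustrative only. A real tokenizer would also make sure special tokens survive truncation, which this toy version doesn’t attempt.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad a list of token ids and build its attention mask."""
    ids = token_ids[:max_length]          # truncation
    mask = [1] * len(ids)                 # real tokens get attention
    padding = max_length - len(ids)
    # Padded positions use pad_id and are masked out with 0.
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = pad_and_mask([11, 12, 13, 14], max_length=6)
long_ids, long_mask = pad_and_mask([1, 2, 3, 4, 5, 6, 7, 8], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The short sequence gets two padded positions with a mask of 0, while the long sequence loses its last two tokens and has no masked positions at all.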

Padding Condition?

You might notice this in the code:

With this logic, if the tokenizer doesn’t have a pre-defined padding token — and some, like GPT-2, don’t — then I’m adding a new padding token called ‘[PAD]’.

Note that I added this code in a defensive way. If the tokenizer doesn’t have a padding token, I add one. Otherwise, I don’t. Doing this — or making sure your development team does this — aids testability because it means you can more easily plug in different models or tokenizers and handle the differences between them.
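As a sketch of that defensive check, here’s one way it could be wrapped in a small helper. The helper name and the StubTokenizer class are hypothetical, there only so the check can be exercised without downloading anything. The pad_token attribute and add_special_tokens() call are the ones Hugging Face tokenizers provide.

```python
def ensure_pad_token(tokenizer, pad_token="[PAD]"):
    """Add a padding token only if the tokenizer lacks one.

    Returns True when a token was added, False otherwise.
    """
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": pad_token})
        return True
    return False


class StubTokenizer:
    """Hypothetical stand-in for a real tokenizer, just for a quick check."""

    def __init__(self, pad_token=None):
        self.pad_token = pad_token

    def add_special_tokens(self, special_tokens):
        self.pad_token = special_tokens["pad_token"]
        return 1


without_pad = StubTokenizer()
with_pad = StubTokenizer(pad_token="<pad>")
print(ensure_pad_token(without_pad), without_pad.pad_token)
print(ensure_pad_token(with_pad), with_pad.pad_token)
```

One thing to keep in mind with a real model: adding a token grows the vocabulary, so the model’s embeddings usually need resizing to match, for example with model.resize_token_embeddings(len(tokenizer)).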

Back to Our Code …

Let’s now run the script.

Expect more downloads when you run this script unless you already have the data cached from prior things you’ve done in these posts. That’s pretty much how all scripts in these posts have worked: anything you don’t have downloaded will download. Once you’ve downloaded any relevant materials, those will be used from that point forward.

This will create a text_tokenized.json file. This is finally data that a model can understand. If you look at that file, you’ll likely see something like this:


  [[[101, 2292, 1005, 1055, 5136, 2178, 2744, 1024, 1000, 4676, 1012, 1000, 2023, 2003, 1037, 2744, 2008, 2411, 4152, 4162, 2000, 8578, 2013, 1996, 2627, 1012, 2664, 2009, 1005, 1055, 2468, 3154, 2008, 2107, 8578, 2052, 2031, 2018, 2053, 4145, 1997, 2054, 2057, 2747, 2655, 1000, 4676, 1012, 1000, 2070, 5784, 6592, 2008, 2045, 2001, 1037, 2350, 4935, 1997, 1996, 2224, 1998, 4824, 1997, 1996, 2744, 1000, 4676, 1000, 2008, 2211, 1999, 1996, 14683, 2301, 1012, 1000, 4676, 1000, 2003, 2025, 1037, 3128, 4696, 1012, 2009, 1005, 1055, 2025, 1037, 2034, 2711, 2744, 1997, 2969, 1011, 23191, 1012, 2612, 1010, 2009, 1005, 1055, 1037, 4696, 9770, 2013, 1996, 2648, 2006, 2070, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]]]

Notice all those zero values. Those are the padding elements that we just talked about.

Feed the Data to the Model

Now we can use a pre-trained model, focused on natural language processing, to answer questions about our tokenized text which, as we know, encodes the class material that we read from the web page.

Create a file called model.py. Add the following to it:

Perhaps not surprisingly, this is our most involved code to date. There’s a whole lot of code there that often you wouldn’t necessarily see. And trust me: what you see there is an incredibly small amount of code compared to an actual, full-on learning model.

I should note that the code you see here is the end result of a long evolution of script choices as I figured out how to present this in this post. Seeing that evolution, step by step, is likely massively helpful. On the other hand, it’s utterly discursive and thus confusing to read so I opted not to include all that.

First, let’s make sure we know what the code is actually doing.

Our model here prepares the text data for processing by a language model. It does this by reading the preprocessed text from our text_processed.json file. It uses a tokenizer, specifically the GPT-2 tokenizer this time, to convert the preprocessed text into a format that the language model can understand. Remember: we originally tokenized with a BERT tokenizer. But we’re using a GPT-2 model so we have to tokenize accordingly.

Doesn’t this mean our earlier tokenization was a little useless? Or couldn’t we have used the GPT-2 tokenizer in that script rather than BERT? The answer to the second question is “yes.” The answer to the first is “maybe.” I wanted to call out these distinctions and this seemed like a relatively painless way to do it.

The tokenized text that our model script generates is saved in text_tokenized.json.

So note that we’re actually creating a new tokenized text representation here and overwriting our previous one.

The code then generates a response to a given question or prompt using a pre-trained language model. Specifically, we’re loading a “gpt2” pre-trained language model. I store this in a model_base variable to make it easier to change this for testing purposes, if need be. The code has a function that takes a prompt/question and the loaded model as inputs. It’s from this that the model generates a response in the form of generated text. The generated response is extracted and stored as an answer. This is really the meat of it all. This is where our specific test condition is being applied to our data conditions. Our prompt here is our test condition:

Also worth noting that our answer is being generated like this:

This generates a concise answer by selecting the first thirty words from the generated text. The operating assumption here is that if the first thirty words don’t seem to indicate any relevance to the prompt, it’s unlikely that many more words will be helpful. That might not be a good assumption from a testing standpoint.
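A minimal sketch of that kind of word-based trimming might look like the following. The function name and the max_words parameter are my own labels for illustration.

```python
def concise_answer(generated_text, max_words=30):
    """Keep only the first max_words whitespace-separated words."""
    return " ".join(generated_text.split()[:max_words])


long_text = " ".join(f"word{i}" for i in range(50))
print(concise_answer(long_text))
print(concise_answer("A short answer."))
```

Making the word limit a parameter rather than a hard-coded value is a small testability win: you can vary it as a test condition without touching the logic.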

Test Observations

Here’s where I stop being entirely pedagogical — well, for the most part — and I ask you to start reasoning about what you’re seeing. So run the script to see what happens.

Uh, Is This Going to Take Awhile?

As a testability note: how long do we wait? The code runs but there’s no real indication that it’s actually doing anything except those few print statements I put in to indicate what states have been reached. Let’s say this is a big problem for you and your team. You can take steps to alleviate this. First let’s consider the output you should be seeing:


  Loading tokenizer...
  Tokenizer loaded.
  Tokenizing text...
  Text tokenized.
  Generating text...
  Setting pad_token_id to eos_token_id:50256 for open-end generation.
  Text generation complete.
  Prompt:
  How has the understanding of the term religion evolved over time?

  Answer:
  ...

The bit about setting the pad token ID is just for information and can be safely ignored.

Let’s say you want to augment the text generation step with a progress bar. One thing you can do is install the “tqdm” library with pip. Add an import:

Then change the generate_text() like this:

It’s tricky to get progress during the generation of text because buffering will prevent any updates to the terminal, even if you flush the contents. So what the above does is break the text generation process into smaller chunks and then display progress messages between each chunk. The “tqdm” library is used to display a progress bar with the description “Generating Text” while the chunks are being generated.
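Here’s a rough sketch of that chunking idea with the model call abstracted away. The generate_step callable and the fake_step stand-in are hypothetical, and the fallback is there only so the sketch still runs if “tqdm” isn’t installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fall back to a plain iterator if tqdm isn't installed.
    def tqdm(iterable, **kwargs):
        return iterable


def generate_in_chunks(generate_step, prompt, total_tokens, chunk_size):
    """Call generate_step repeatedly, feeding its output back in each time."""
    text = prompt
    produced = 0
    steps = -(-total_tokens // chunk_size)  # ceiling division
    for _ in tqdm(range(steps), desc="Generating Text"):
        n = min(chunk_size, total_tokens - produced)
        text = generate_step(text, n)
        produced += n
    return text


def fake_step(text, n_tokens):
    # Hypothetical stand-in for a model call: appends placeholder words.
    return text + " word" * n_tokens


result = generate_in_chunks(fake_step, "Prompt:", total_tokens=20, chunk_size=5)
print(result)
```

Notice that each step is handed the full text generated so far. With a real model, that repeated work is a big part of where the extra time comes from.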

If you try this, what’s one thing you notice? The thing you might notice is the one reason I took us down this path.

Splitting the text generation into smaller chunks and displaying progress messages between each chunk can cause the overall processing time to increase. This is because the model has to generate text multiple times, and each time it generates text, it incurs some computational overhead.

So you have to decide if the added observability is worth the extra time to get results. Crucially, when you add observability like this, you want to make sure that the final results aren’t impacted.

Tuning the Model

Consider this bit of code here:

This is an example of some parameters you might want to consider tuning in the model, depending on the perceived quality of the outputs you are getting. This is a massive topic that I can in no way do justice to here. But let’s at least do a fly-over of some of this stuff just to get a high-level view.

The setting of temperature controls the degree of randomness in the generated text. The higher the temperature, the more creative — and potentially unexpected — the output can be. A lower temperature will result in more conservative and predictable output.
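To get a feel for what temperature does numerically, here’s a small sketch that applies it to some made-up scores before turning them into probabilities. The logits are invented for illustration, and this shows the general technique rather than the internals of any particular library.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale scores by 1/temperature, then convert them to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
cool = softmax_with_temperature(logits, 0.5)
warm = softmax_with_temperature(logits, 1.5)

print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
```

The lower temperature concentrates probability on the top-scoring option, while the higher one spreads it out. Also keep in mind that in Hugging Face’s generate(), temperature only has an effect when sampling is turned on with do_sample=True.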

The num_return_sequences setting controls the number of independent sequences to generate in the output. Setting the value to 1 will generate a single sequence while a higher value will generate multiple sequences.

The no_repeat_ngram_size controls the size of n-grams (contiguous sequences of n tokens) that you don’t want to allow to be repeated in the generated text. For example, setting this to 2 will prevent any two-token sequence from appearing more than once in the generated text. This can help prevent repetitive or nonsensical output.
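As a tester, you might want a way to check generated output for repeated n-grams yourself. Here’s a minimal sketch that works on whitespace-separated words. Note that the model setting applies to tokens, not words, so treat this as an approximation. The function name and sample sentence are made up.

```python
def repeated_ngrams(words, n):
    """Return the set of n-grams that occur more than once in words."""
    seen = set()
    repeats = set()
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        if gram in seen:
            repeats.add(gram)
        seen.add(gram)
    return repeats


sample = "the term religion is a term and the term religion is old".split()
print(repeated_ngrams(sample, 2))
print(repeated_ngrams(sample, 3))
```

An empty set for a given n is what you’d hope to see in output generated with no_repeat_ngram_size set to that n, at least at the token level.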

Back to Test Observations

What do you see as a result of that prompt? Does it make sense to you? You probably see this:


Prompt:
How has the understanding of the term religion evolved over time?

Answer:
The term "religion" has been used to describe a variety of different religions. The most common is the Hinduism,

When I say “probably see this” consider that we’re generating text based on a model that’s learning about the text based on pre-trained criteria. So you could argue that variance in results would be expected. This is where framing your data conditions very operationally helps to keep things consistent.

Let’s try another prompt.

How does the model do with that?

But … hold on. What’s our barometer here? What are the pass/fail criteria? I’ll come back to that because that goes into those relevance scores we looked at in a previous post.

Data Conditions Are Critical

What are some other data conditions that might make sense that query the text in specific ways? Keep in mind, you have access to the full text so you can look at it yourself. You can decide what you, as a reasonable human, would consider to be a relevant answer to a given question. You can also consider what would be completely impossible to answer based solely on the text.

Also one thing that might have been interesting to try, as a tester, is to increase the amount of text shown in the answer. Change the 30 to, say, 130 in the following line:

What do you see now, particularly with the same data conditions you just tried? What is that extended information telling you?

What other prompts could you utilize that would expose more of the data conditions to suggest what kind of quality of response — focused on relevance of the answer to the question — we’re getting here? Speaking to that, have you thought of other prompts based on looking at the text? Here are some that exercise the various data conditions in the text:

  • What is the origin of the term religion?
  • Is religion a native category or an imposed category?
  • What is meant by the anthropological perspective of “religion?”
  • How did the term religion become extended to non-Christian examples?
  • How were ritual and religion linked in early examples of religion?
  • What is the significance of the shift to belief as the defining characteristic of religion?
  • How did the English usage of faiths as a synonym for religions develop?
  • What were some key issues and questions raised by the plural religions, both Christian and non-Christian?

Your team might look at those and say: “Wait, how do we know those are exposing certain data conditions?” And that’s where you, as a test specialist, can step in and help them understand why you constructed the test data as you did. And it is constructed. Keep in mind that while we’re using a model that was pre-trained on a lot of existing data that wasn’t structured as tests (such as Wikipedia pages or blog posts), here we’re actually testing. So we did construct some data accordingly, albeit data that, if you looked at it, certainly would not look out of place in a realistic context.

Operating as a tester, the above prompts — and you can call that “prompt engineering” if you’d like — would be test cases that you are coming up with. And these test cases — these test conditions — are designed to exercise very specific aspects of the data — the data conditions — because they force the model to reason about the text in distinct ways. Those distinct ways are what you believe, as a tester, will most expose deficiencies in the model. All of what I just said are really critical skills that test specialists should bring to the table. And, again, I hasten to add: what I did here is no different than what test specialists do now in non-AI contexts: expose test and data conditions to make informed assessments about risk regarding an application. In this case, our application is a question-answering model.

Yes, But Is It “Right”?

Consider our original prompt: “How has the understanding of the term religion evolved over time?”

One very viable response we would want to check against is this:


"The understanding of 'religion' evolved from ritual-centric to belief-centric. Initially imposed on native cultures, it expanded in the sixteenth century. The shift emphasized belief as the defining characteristic, encompassing diverse traditions and raising questions of credibility and truth."

How about our second prompt: “When did belief become the defining characteristic of religion?”

One very viable response we would want to check against is this:


"Belief became the defining characteristic of religion predominantly in the mid-1700s. This shift from ritual-centric to belief-centric understanding was influenced by Reformation figures who emphasized "piety." It led to dissociating religion from rituals, focusing on states of mind, and considering faiths as synonymous with religions."

But how do we compare those? Clearly we can’t just do a one-to-one comparison given the various ways that a response can be generated.

Use Measures and Scores to Evaluate

This is where you might go back to those QnA Relevance and QnA Relevance Pairwise measures as these can be a viable approach to compare the relevance of the outputs generated by our model and the response of a human. Or even the response of another model. These measures can provide quantitative assessments of how well the generated responses answer the given questions or prompts.

How would you go about this with your team?

For QnA Relevance, you would define a reference answer (or a set of such answers) that represents a highly relevant response to the prompt, which is what I did above. Then you would calculate the QnA Relevance score between the model’s generated answer and the reference answer using an evaluation metric such as ROUGE, BLEU, or METEOR. These metrics compare the similarity and overlap between the generated answer and the reference answer and feed into the QnA Relevance score. The higher the QnA Relevance score, the more relevant the generated answer is to the reference answer.

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation while BLEU stands for Bilingual Evaluation Understudy. Both do comparisons based on n-grams. An n-gram is a contiguous sequence of n items in a given sequence of words or characters. METEOR stands for Metric for Evaluation of Translation with Explicit ORdering and that approach uses more varied linguistic features, like unigram matching, word order, stemming, and synonymy.
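To make the n-gram idea concrete, here is a minimal sketch of word-level n-gram overlap between a generated answer and a reference answer. It is not the official ROUGE or BLEU implementation: it only shows the recall side that ROUGE-N is built on and the precision side that BLEU is built on. The function names and the two short sample strings are assumptions made up for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Return a Counter of the word-level n-grams in text (lowercased)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def overlap_scores(generated, reference, n=1):
    """Return (precision, recall) for clipped n-gram overlap.

    Recall (overlap / reference n-grams) is the idea behind ROUGE-N.
    Precision (overlap / generated n-grams) is the idea behind BLEU.
    """
    gen, ref = ngrams(generated, n), ngrams(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((gen & ref).values())
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical sample strings, shortened for readability.
reference = "belief became the defining characteristic of religion in the mid-1700s"
generated = "belief became the defining feature of religion"

print("unigrams:", overlap_scores(generated, reference, n=1))
print("bigrams: ", overlap_scores(generated, reference, n=2))
```

Notice that the bigram scores drop faster than the unigram scores when a single word changes, which is one reason these metrics can penalize a perfectly reasonable paraphrase.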

For QnA Relevance Pairwise, you would collect a set of alternative candidate answers from different models or sources. So you might have some humans answer the prompt and also submit the prompt to, say, ChatGPT or Bard. Then you pair each candidate answer with the generated answer from your model, similar to what I described in the second post.
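As a rough sketch of that pairing step, the code below pairs the model’s answer with each candidate, scores both against the same reference, and records which one scored higher. The scoring function here is a simple word-overlap F1 that I’m using purely as a placeholder; it is an assumption, not the QnA Relevance Pairwise measure itself, and you could swap in ROUGE, BLEU, or METEOR. The source names and all of the answers are hypothetical.

```python
from collections import Counter


def unigram_f1(answer, reference):
    """Placeholder relevance score: F1 over shared words (an assumption)."""
    a = Counter(answer.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((a & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def pairwise_results(model_answer, candidates, reference, score=unigram_f1):
    """Pair the model's answer with each candidate and note which scores higher."""
    model_score = score(model_answer, reference)
    results = []
    for source, candidate in candidates.items():
        candidate_score = score(candidate, reference)
        if model_score > candidate_score:
            winner = "model"
        elif model_score < candidate_score:
            winner = source
        else:
            winner = "tie"
        results.append((source, round(model_score, 3), round(candidate_score, 3), winner))
    return results


# Hypothetical data for illustration only.
reference = "belief became the defining characteristic of religion in the mid-1700s"
model_answer = "belief became the defining characteristic in the sixteenth century"
candidates = {
    "human": "belief became the defining characteristic of religion around the mid-1700s",
    "other_model": "religion has always been about ritual",
}

for row in pairwise_results(model_answer, candidates, reference):
    print(row)
```

A table of wins, losses, and ties across many prompts gives the team something to discuss, but keep in mind that the ranking is only as trustworthy as the scoring function and the reference answers behind it.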

What Next?

So here’s what I meant earlier: I’m leaving you with perhaps a little disappointment.

I’ve given you the broad strokes of how to create a question-answering model and put that model to the test with some data. I’ve shown you that this model may have questionable output based on our data conditions or it might have some great output. Or maybe just some okay output. Or maybe some output that’s good in some cases but not others.

I’ve told you about some measures you could apply between different sources.

But that’s also where I have to leave this post. To take this further, I would have to open up a slew of topics such as tuning the model, creating reference data, generating the scores based on that data, and so on. These are topics that are likely worth exploring in future posts but, for right now, I hope what you’ve seen is the role of testing in this entire context.

Keep in mind my test data 002, which is quite a bit more data that has to be chunked and parsed. I would recommend going through the steps of the Python logic we created together and seeing what you can observe about our model with that data.

But here I’ll give you a slight assist. If you run on that larger data set, you’ll likely see this at some point in your output:


Token indices sequence length is longer than the specified maximum sequence length for this model (5548 > 1024). Running this sequence through the model will result in indexing errors

Is that concerning? Does it not matter? What it’s clearly indicating is that the tokenized input sequence exceeds the maximum length that the model was configured to handle. In the above case, the sequence length is 5548 and that exceeds the maximum length of 1024 specified for the model. How much to be concerned about this depends on your specific use case and the impact of truncating or discarding excess tokens.

“But, wait,” you might be thinking, “this seems familiar. Can’t I just increase the following in model.py:”

Sure, by increasing the max_length value, you could accommodate longer sequences and avoid the indexing errors. However, it’s important to consider a few factors before making this change.

One of these is that the maximum length parameter should not exceed the maximum length supported by the model architecture you’re using. Different models have different maximum length restrictions. Also it’s important to note here that the “length” doesn’t refer to words or characters but tokens. That means the length is determined based on the tokenization process, where words and characters are converted into tokens.

Okay, so we’re using gpt2 for the model. What’s the limit?

The token limit for GPT-2 is the same across its variants. The GPT-2 models are available in different sizes, such as 124M, 355M, 774M, and 1.5B, which indicate the number of parameters in the model. The maximum token limit corresponds to the model’s “max_position_embeddings” value, and the released GPT-2 checkpoints all set this to 1024 tokens. This means that you can input a maximum of 1024 tokens, whichever variant you load.

A larger variant of GPT-2 gives you more parameters, not a longer context window. To handle longer input sequences you would need either a different model architecture or a way of splitting the input into pieces that fit.

Great. So how do we know what variant we’re using?

You could add some logic after the model is defined, and right before our prompt, in our model.py like this:
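The sketch below is one way that check could look. It assumes the model was loaded through the Hugging Face transformers library, whose GPT-2 config exposes max_position_embeddings. The describe_limits helper and the stand-in config object are hypothetical names used so the example runs without downloading a model.

```python
from types import SimpleNamespace


def describe_limits(model_name, config):
    """Return printable lines with the model name and its position limit."""
    # Falls back to None if the config has no such attribute.
    limit = getattr(config, "max_position_embeddings", None)
    return [
        f"Model Name: {model_name}",
        f"Maximum Position Embeddings: {limit}",
    ]


# In model.py you would pass the real objects instead, for example
# (assuming a loaded Hugging Face `model`):
#     describe_limits(model.config.name_or_path, model.config)
# Stand-in config so this sketch runs on its own.
stand_in_config = SimpleNamespace(max_position_embeddings=1024)

for line in describe_limits("gpt2", stand_in_config):
    print(line)
```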

For our model, you should see output like this:


Model Name: gpt2
Maximum Position Embeddings: 1024

So we know that we can’t increase our maximum length.
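Since the limit can’t be raised, one option worth testing is splitting the token sequence into windows that fit. Here is a minimal sketch of that idea using overlapping windows. The chunk_token_ids name, the overlap of 64 tokens, and the list of integers standing in for real tokenizer output are all assumptions for illustration, not what model.py does.

```python
def chunk_token_ids(token_ids, max_length=1024, overlap=64):
    """Split token ids into windows of at most max_length tokens.

    Consecutive windows share `overlap` tokens so that a sentence cut at
    a boundary still appears whole in at least one window.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    step = max_length - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return chunks


# Stand-in for real tokenizer output: 5548 ids, matching the warning above.
token_ids = list(range(5548))
chunks = chunk_token_ids(token_ids)

print(len(chunks), "chunks with lengths", [len(c) for c in chunks])
```

If the model also has to generate text from each chunk, the window needs to be smaller than 1024 because the prompt tokens and the generated tokens share the same limit. How answers from separate chunks get combined is yet another design decision, and another thing to test.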

But do you see how just switching to another data set already made us have to consider a whole new set of logistics? This is what I mean about how the paths to explore explode in this context.

One thing you could certainly do, however, is consider that wider set of test data and think about the prompts you would generate based on it. That’s a critical skill for a test specialist in this context. I’ll put some of my ideas here based on the data conditions I specifically exposed. See if they match your thinking.

Test Prompt Set 01


What is the geographic context of the history of Israel and Judah?

Which empires controlled the Levant throughout its history?

What is meant by the Fertile Crescent and how is it related to the history of the Levant?

Where did Abraham travel according to the biblical narrative and what is significant about the places he stopped at?

What is the archaeological context of the beginning of the history of Israel and when did it end?

Who controlled the Levant in the middle of the second millennium and how was it organized politically?

Who were the "people of the sea" and where did they establish themselves?

How did Israel come into existence and what was its origin?

What was the opposition between "Israelites" and "Canaanites" based on?

What was the economy like in the Levant during the beginning of the first millennium, and how was political organization changing?

Test Prompt Set 02


Who are the three exemplary figures around whom the biblical narrative in the books of Samuel centers the story of the origins of the monarchy?

Which king was presented as the first king of Israel in the biblical narrative?

When Saul resisted Philistine domination and created a state structure, which territories did he make the center of his state's territory?

Who was David in conflict with, and what kind of relationship did he have with the Philistines?

Where was David's kingdom located, and how did it compete with Saul's kingdom?

According to the books of Samuel and Kings, what was the territory of the "united kingdom" ruled by David and his son Solomon?

How was Omri's policy towards Phoenicia viewed by the editors of the biblical text, and what did they accuse them of?

What happened to the Omrid dynasty, according to the biblical text, and what were the historical factors behind it?

Who became a powerful presence among the kingdoms of the Levant in the ninth century under the Omrides, and what building projects did they undertake?

What was the power structure of the Levant in the eighth century, and how did it change under Assyrian rule?

Test Prompt Set 03


How did the rise of Jerusalem come about, and what factors contributed to it?

Who was King Hezekiah, and what public works did he commission?

What was the relationship between Hezekiah and the Assyrians, and what was the result of their conflict?

Who was Manasseh, and how did he govern the kingdom of Judah?

Who was King Josiah, and how did he contribute to the centralization of the kingdom of Judah?

What role did the Babylonians play in the history of Judah, and how did they interact with the kingdom?

How did the destruction of Jerusalem by the Babylonians affect the Judean intellectual community, and how did they respond to the crisis?

Who was Cyrus, and how did his religious policies affect the Judeans and their relationship with the Persians?

What were the characteristics of the quasi-theocratic temple-centered organization of political and religious life that was put in place after the return of the Judean exiles from Babylon?

How was Judaism shaped during the Hellenistic era, and how did the religion continue to develop in the diaspora?

That should give you some starting points if you want to play around with this larger data set and the model we looked at here. Incidentally, all of these test cases came about with students interacting with the more complete model I’m developing.

Wrapping Up

I hope this was an interesting journey for you and while it wasn’t a journey I could help you complete, even within the constraints of my limited model, that was also part of the goal: there isn’t an end destination here.

Being a test specialist in this context means constantly engaging with systems that adapt themselves and modify themselves, via a set of entirely opaque mechanics that in turn operate via the use of potentially — depending on your experience — impenetrable mathematics. These systems, much like humans, can return extremely variable output and that means you don’t look for just “right” or “wrong” in a categorical sense but rather “more relevant” or “less relevant.”

Although sometimes there actually is very much a hard distinction between “right” and “wrong.” Yet sometimes that manifests not simply in the binary idea but rather in the loaded idea of bias or misattribution or misapplication.

It’s an utterly fascinating world to work in. And much like certain exercise equipment that’s specifically designed to utilize every one of your major muscle groups, I truly believe that an AI context — much like a gaming context, incidentally — is one that utilizes every one of the major thinking skills and intuition pumps that specialist testers have to bring to the table.

If this series of posts convinced you of nothing else, I hope it at least conveyed the enthusiasm I have for getting testers to a point where they feel comfortable operating in this world.

This article was written by Jeff Nyman
