AI Testing – Generating and Transforming, Part 2

This post continues on from the first one. Here I’m going to break down the question-answering model that we looked at briefly so that we can understand what it’s actually doing. What I show is decidedly simplified, but it’s essentially what tools like ChatGPT are doing. This will set us up for a larger example. So let’s dig in!

In the first post, as part of showing how to use Transformers, I focused on question-answering models. In previous posts, I talked about measures and scores. Here I’ll start by combining those threads of thought with two more measures.

Specifically, we’ll talk about QnA Relevance Scores Pairwise and QnA Relevance. First let’s focus a bit on the basics of question-answering models.

Question-Answering Models

Imagine you have a really smart computer program that can understand and answer questions just like a human. But is the program truly “smart”? Well, let’s table that for now. What we can say is that this program is called a question-answering model and it uses a special type of artificial intelligence called a Transformer.

A Transformer, as I showed in the first post, is designed to process and understand language, like the words and sentences we humans use to communicate. Our Transformer is trained on a large amount of text. That can be text from books, articles, web sites, and so on. The model thus learns patterns and meanings from all of that information.

When you pose a question to the question-answering model, it takes your question and analyzes it to figure out what you’re asking. The model uses its knowledge and understanding of language to break down the question and identify the important parts.

Next, the model searches its memory (essentially, its often huge storage) for the information it learned during the training it did on all that data. The model looks for the relevant context or details that can help it answer your question.

Once the model finds the information it needs, it organizes and processes that information to form what is hopefully a clear and accurate response. This is actually one of the harder parts of all this because the model has to put the pieces together to create a meaningful answer to whatever the question is. And, just like a person, the model can provide a detailed explanation or a concise response, entirely depending on the question and the context.

QnA Measures

Given the above context, let’s talk about our measures next. QnA Relevance Scores Pairwise refers to a method used to evaluate the relevance of different candidate answers to a given question. This method involves comparing the relevance scores of each pair of candidate answers and then selecting the one with the highest score as the most relevant answer.

It’s probably worth noting that when we say “pairwise” in the context of generating answer scores, we mean that we’re comparing each possible pair of answers. For example, let’s say we have three candidate answers. In that case, we would generate pairwise answer scores by comparing answer 1 to answer 2, answer 1 to answer 3, and answer 2 to answer 3.

QnA Relevance is a more general concept that refers to the overall relevance of a given answer to a specific question. This relevance score can be calculated using various approaches and we’ll look at one of them a bit further on in the post.
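To make the pairwise idea concrete, here is a minimal sketch using Python's standard itertools.combinations. The candidate strings are placeholders of mine, purely for illustration.

```python
from itertools import combinations

# Three placeholder candidate answers (illustrative only).
candidates = ["answer 1", "answer 2", "answer 3"]

# Every unordered pair of candidates, in the order described in the text.
pairs = list(combinations(candidates, 2))

for first, second in pairs:
    print(f"compare {first} with {second}")
```

Three candidates give three pairs. Four candidates would give six, so the number of comparisons grows quickly as candidates are added.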

The main distinction to get here is that QnA Relevance Scores Pairwise is a specific approach for evaluating the relative relevance of a set of candidate answers to a given question, whereas QnA Relevance is a more general approach that refers to the overall relevance of a given answer to a specific question.

No pun intended here, but what’s the relevance of all this? Well, to see that let’s play around with some code.

QnA In Action

Let’s create a Python program to demonstrate the concept of a question-answering model. Then we can extend that a bit to QnA Relevance Scores Pairwise and QnA Relevance. Similar to what I did in the first post, I’ll frame this example using the PyTorch framework and the Hugging Face Transformers library.

You will need the “torch” (PyTorch) and “transformers” libraries available, which you can install via pip.

Here’s the script:
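What follows is a minimal sketch of what such a script could look like, not a definitive listing. It assumes the model named later in this post and the AutoTokenizer and AutoModelForQuestionAnswering classes from Transformers. The context passage is a stand-in written for this sketch, and the helper names (span_bounds, answer_question) are hypothetical.

```python
MODEL_NAME = "bert-large-uncased-whole-word-masking-finetuned-squad"

# Placeholder context written for this sketch; swap in your own text.
CONTEXT = (
    "The Battle of Gettysburg was fought from July 1 to July 3, 1863, "
    "in and around the town of Gettysburg, Pennsylvania."
)
QUESTION = "In what year was the Battle of Gettysburg fought?"


def span_bounds(start_scores, end_scores):
    """Return (start, end) indices of the highest start and end scores."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    return start, end


def answer_question(question, context, model_name=MODEL_NAME):
    """Extract an answer span from the context with a SQuAD-tuned BERT model."""
    # Imported here so span_bounds can be used without these packages.
    import torch
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # The question and the context are encoded together as one sequence.
    inputs = tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    start, end = span_bounds(
        outputs.start_logits[0].tolist(), outputs.end_logits[0].tolist()
    )
    # Naive span selection: if end < start this slice is simply empty.
    answer_ids = inputs["input_ids"][0][start : end + 1]
    return tokenizer.decode(answer_ids)
```

Calling print(answer_question(QUESTION, CONTEXT)) runs the model and prints the extracted span.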

This program will download models and data when you run it, unless you have previously downloaded the models and they are already cached on your system.

Running the script should generate the output “1863”.

How this works behind the scenes, as it were, is that the script generates answer scores pairwise using a variant of the BERT model. You can see the model listed in the code:

  bert-large-uncased-whole-word-masking-finetuned-squad

This particular model was trained on the Stanford Question Answering Dataset (SQuAD) for answer extraction. This involves passing the tokenized question and candidate answers through the model and generating scores for the start and end positions of the answer span in each candidate answer.

Ready to swim in some terminology soup? Okay, so, brace yourself.

The pairwise answer scores are calculated as a matrix multiplication of the softmax scores for the start positions and end positions of each answer span. Each softmax score indicates the probability that the corresponding token is the start or end of the answer span. What this does is generate a matrix for each candidate answer, where each value represents the probability that the corresponding answer span (a start position paired with an end position) is the correct answer.

Okay, whoa, whoa! Hold on! What does all that even mean? This is the kind of stuff you have to be comfortable working with as a tester in this context so let’s illustrate the concept a bit with a simple example.
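Before the example, it may help to pin down the softmax part. A softmax turns a list of raw scores into probabilities that sum to 1. This is a minimal sketch using only the standard library, and the raw scores are made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw start scores for a four-token answer.
start_logits = [0.5, 1.0, 2.0, 4.0]
start_probs = softmax(start_logits)

print([round(p, 3) for p in start_probs])
print("sum:", round(sum(start_probs), 6))
```

The ordering of the scores is preserved: the largest raw score still gets the largest probability.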

Question-Answering: Behind The Scenes

Let’s say we have a question and there are two candidate answers.


  Question: "What color is the sky?"
  Answer 1: "The sky is blue."
  Answer 2: "The grass is green."

Obviously we want the model to settle on the answer that has more relevance. First we’ll tokenize the input question and candidate answers:


  Tokenized Question: [101, 2054, 3609, 2003, 1996, 3719, 102]
  Tokenized Answer 1: [101, 1996, 3719, 2003, 3454, 1012, 102]
  Tokenized Answer 2: [101, 1996, 3835, 2003, 2663, 1012, 102]

Thinking as a tester, do you notice anything potentially off there? What we have are two observation points. Two views into data. This happens to be two views of our test data.

You might wonder: why are there more tokens than there are words?

Your intuition might be that the number of tokens should match the number of words. And, generally, that’s true. But the tokenization process also accounts for additional special tokens and may involve subword splitting. It’s also worth noting that tokens [101] and [102] represent the special tokens [CLS] (start of the sequence) and [SEP] (separator), respectively.
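The token count can be reproduced with a toy tokenizer. To be clear, this is not the WordPiece tokenizer that BERT uses. It's a regex stand-in (the function name toy_tokenize is hypothetical) that splits out words and punctuation and then adds the two special tokens.

```python
import re


def toy_tokenize(text):
    """Lowercase, split into words and punctuation, then add [CLS] and [SEP]."""
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    return ["[CLS]"] + pieces + ["[SEP]"]


sentence = "The sky is blue."
tokens = toy_tokenize(sentence)

print(tokens)
print(len(sentence.split()), "words ->", len(tokens), "tokens")
```

Four words become seven tokens: the four words, the period, and the two special tokens. That matches the seven IDs shown for Answer 1.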

Noticing discrepancies like that is crucial for a tester in this context, which I also showed in the first post. In this case, unlike the one I showed in the first post, the discrepancy is only apparent. But you can’t determine that if you don’t spot it in the first place. And you might find that this is a case where spotting discrepancies can actually be aided by “stranger value” to the domain.

Those tokenized inputs are then passed through the model to obtain score distributions for the start and end positions of the answer span in each candidate answer. That might look like this:


  Start position score distribution for answer 1: [0.1, 0.2, 0.7, 0.9]
  Start position score distribution for answer 2: [0.8, 0.05, 0.05, 0.9]

This indicates that, for both answers, the model thinks the answer span is most likely to start at position 4.

Do you see why?

In the start position score distribution for answer 1, the highest score of 0.9 is associated with the fourth position. This suggests that the model considers the fourth token in the answer (“blue”) to be the most probable starting position for the answer span.

Likewise in the start position score distribution for answer 2, the highest score of 0.9 is associated with the fourth position. This indicates that the model considers the fourth token in the answer (“green”) to be the most probable starting position for the answer span.

Notice the scores for the other words. That’s crucial for understanding why the model can’t just key in on a color word as the sole means of determining relevance. If it could, either answer could be seen as equally relevant.

Similarly, the end position score distributions might look something like this:


  End position score distribution for answer 1: [0.05, 0.2, 0.75, 0.9]
  End position score distribution for answer 2: [0.1, 0.1, 0.8, 0.9]

This indicates that, for both answer 1 and answer 2, the model thinks the answer span is most likely to end at position 4.
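Picking the "most likely" position is just a matter of finding the largest value in each list. Here is a minimal sketch using the illustrative values above. The helper name is hypothetical, and positions are reported starting from 1 to match the prose.

```python
# Illustrative score distributions from the example above.
start_scores = {
    "answer 1": [0.1, 0.2, 0.7, 0.9],
    "answer 2": [0.8, 0.05, 0.05, 0.9],
}
end_scores = {
    "answer 1": [0.05, 0.2, 0.75, 0.9],
    "answer 2": [0.1, 0.1, 0.8, 0.9],
}


def most_likely_position(scores):
    """Return the 1-based position holding the highest score."""
    return scores.index(max(scores)) + 1


for name in start_scores:
    start = most_likely_position(start_scores[name])
    end = most_likely_position(end_scores[name])
    print(f"{name}: start at position {start}, end at position {end}")
```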

Doing a Little Math

Now we can calculate the pairwise answer scores by taking the outer product of the start and end position score distributions for each candidate answer. This is where we generate that matrix I mentioned earlier. Here it is for Answer 1:

Pairwise answer scores = start position scores x end position scores:

  start (Answer 1): [0.1, 0.2, 0.7, 0.9]
  end   (Answer 1): [0.05, 0.2, 0.75, 0.9]

               end 1    end 2    end 3    end 4
  start 1  |   0.005    0.02     0.075    0.09   |
  start 2  |   0.01     0.04     0.15     0.18   |
  start 3  |   0.035    0.14     0.525    0.63   |
  start 4  |   0.045    0.18     0.675    0.81   |

In this representation, each row corresponds to a start position and each column corresponds to an end position, both taken from the same candidate answer. Every start score is multiplied by every end score, which is what makes this an outer product rather than an element-wise multiplication. The same calculation is then repeated with the two distributions for Answer 2.

Even if the specific operation is opaque to you, just understand that each element in the resulting matrix represents the probability that the corresponding answer span -- formed by the combination of a start and an end position -- is the correct answer. For example, the value in row 3, column 4 is 0.7 x 0.9 = 0.63.

Note that the resulting matrix is a 4×4 matrix because there are four start position scores and four end position scores.

But ... Math Makes Me Cry!

Here's some code-annotated Python that would represent what I'm talking about above:
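This is a minimal sketch rather than a definitive implementation. It assumes numpy is installed, uses the illustrative score distributions from earlier, and the helper names (span_scores, best_span) are hypothetical. One thing it makes visible is that, with these illustrative numbers, the best span for both candidates comes out at the same score, so it would take real model output to separate them.

```python
import numpy as np

# Illustrative score distributions from the example above.
start_1 = np.array([0.1, 0.2, 0.7, 0.9])
end_1 = np.array([0.05, 0.2, 0.75, 0.9])
start_2 = np.array([0.8, 0.05, 0.05, 0.9])
end_2 = np.array([0.1, 0.1, 0.8, 0.9])


def span_scores(start, end):
    """Outer product: entry [i, j] is start[i] * end[j]."""
    return np.outer(start, end)


def best_span(matrix):
    """Return (start_index, end_index, score) of the best valid span."""
    # Keep only spans whose end is at or after their start.
    valid = np.triu(matrix)
    i, j = np.unravel_index(np.argmax(valid), valid.shape)
    return int(i), int(j), float(valid[i, j])


scores_1 = span_scores(start_1, end_1)
scores_2 = span_scores(start_2, end_2)

print(scores_1)
print("best span, answer 1:", best_span(scores_1))
print("best span, answer 2:", best_span(scores_2))
```

Note that the indices returned by best_span count from 0, so index 3 is the fourth position.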

This is a case where tooling helps, in this case "numpy" (which you would have to install via pip). Let's say you wanted to generate a visual like I did above. You could do that too:
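One possible way to do it, sketched with plain Python lists so it runs without any extra packages. The format_matrix helper is a hypothetical name, and the values are the illustrative scores for Answer 1.

```python
def format_matrix(rows, width=8, digits=4):
    """Render a list of rows as aligned text, one '| ... |' line per row."""
    lines = []
    for row in rows:
        cells = " ".join(f"{value:>{width}.{digits}f}" for value in row)
        lines.append(f"| {cells} |")
    return "\n".join(lines)


start = [0.1, 0.2, 0.7, 0.9]
end = [0.05, 0.2, 0.75, 0.9]

# Outer product written out by hand: one row per start score.
matrix = [[s * e for e in end] for s in start]

print(f"start: {start}")
print(f"end:   {end}")
print(format_matrix(matrix))
```

Fixing the column width and the number of decimals is what keeps the columns lined up, which makes the largest values much easier to spot.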

Regardless of how you visualize this, here's the critical thing: the highest values in the matrix mark the most likely answer spans, and it's by comparing how the candidates score on those spans that the model assigns a higher probability to answer 1 ("The sky is blue.") being the correct answer than to answer 2 ("The grass is green.").

I'm obviously simplifying a lot here but I hope the above at least gives some idea of what's going on behind the scenes.

Interpreted and Explained

While again noting the simplifications that I put in place, I do want to call out that what we just did there together is provide interpretability and explainability around code that demonstrates how a pre-trained model can be used to generate answer scores pairwise and determine the most relevant candidate answer. That's really important. In fact, you could argue it's pretty much entirely the point of my test-focus in these posts.

We exposed how some notion of quality -- in this case, relevance -- is obtained. And we gave ourselves the means (testability) to understand how our model is framing that particular quality around very specific data conditions.

Where's the QnA?

What about those measures I mentioned earlier? Do they come into play here?

To answer that, remember that QnA Relevance Scores Pairwise and QnA Relevance typically refer to the evaluation and scoring methods used in question-answering systems like we just looked at. They assess the relevance and quality of answers generated by a given model.

In the examples I just took you through, the start and end scores represent the relevance scores assigned to each token in the input sequence. These scores indicate the likelihood of each token being part of the answer. By selecting the tokens with the highest scores, the model's goal is to find the most relevant answer within the given context.

I emphasized "represent" above because my Python code doesn't explicitly implement the QnA Relevance Scores Pairwise or QnA Relevance evaluation methods. The reason is that those methods involve comparing multiple candidate answers or evaluating the relevance of generated answers against a reference answer. My example, by contrast, demonstrates the process of extracting an answer using the pre-trained model but doesn't include the specific evaluation aspects related to relevance scoring.

So how about if I modify the example to include the evaluation aspects related to relevance scoring? Here's a way to do that:
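Treat this as a sketch under stated assumptions rather than a definitive listing. It assumes the same model as before and the AutoTokenizer and AutoModelForQuestionAnswering classes from Transformers. The context passage is a stand-in written for this sketch, and the function names are hypothetical.

```python
MODEL_NAME = "bert-large-uncased-whole-word-masking-finetuned-squad"

# Placeholder context written for this sketch; swap in your own text.
CONTEXT = (
    "The Battle of Gettysburg was fought from July 1 to July 3, 1863, "
    "in and around the town of Gettysburg, Pennsylvania."
)
QUESTIONS = [
    "In what year was the Battle of Gettysburg fought?",
    "Who was the general of the Confederate Army at Gettysburg?",
]


def rank_by_relevance(results):
    """Sort result dictionaries by relevance score, highest first."""
    return sorted(results, key=lambda item: item["relevance"], reverse=True)


def score_questions(questions, context, model_name=MODEL_NAME):
    """Answer each question and attach a relevance score to the answer."""
    # Imported here so rank_by_relevance can be used without these packages.
    import torch
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    results = []
    for question in questions:
        inputs = tokenizer(question, context, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)

        start = int(torch.argmax(outputs.start_logits))
        end = int(torch.argmax(outputs.end_logits))
        answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])

        # Relevance here is the best start score plus the best end score.
        relevance = float(
            outputs.start_logits[0, start] + outputs.end_logits[0, end]
        )
        results.append(
            {"question": question, "answer": answer, "relevance": relevance}
        )
    return rank_by_relevance(results)


def report(results):
    """Print the question, answer, and relevance score for each entry."""
    for item in results:
        print(f"Question: {item['question']}")
        print(f"Answer: {item['answer']}")
        print(f"Relevance Score: {item['relevance']}")
        print()
```

Calling report(score_questions(QUESTIONS, CONTEXT)) runs the model over every question and prints the results from most to least relevant.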

In this modified code, I'm introducing a list of multiple questions that I want to ask about the given context. For each question, I follow a similar process as before to obtain the predicted answer using the BERT-based question-answering model.

With this you can see that I now calculate a relevance score for each predicted answer. The relevance score is computed as the sum of the start and end scores associated with the predicted answer span. Higher relevance scores indicate greater confidence in the relevance of the answer to the given question and context, which is what you saw in my matrix breakdown earlier.

To achieve this in the modified code, I store the question, answer, and relevance score in a dictionary for each question and append it to a list. After processing all the questions, I then sort the list of answers based on the relevance score in descending order. And then, finally, I iterate over the sorted list of answers and print the question, answer, and relevance score for each entry.

The output you get should be something like this:


  Question: In what year was the Battle of Gettysburg fought?
  Answer: 1863
  Relevance Score: 15.408963680267334

  Question: Who was the general of the Confederate Army at Gettysburg?
  Answer: gettysburg
  Relevance Score: -4.782402515411377

The relevance score of 15.408963680267334 suggests a high confidence in the answer's relevance to the question and context. The relevance score of -4.782402515411377 suggests a low confidence in the answer's relevance to the given question and context. Negative relevance scores typically indicate that the model doesn't consider the predicted answer to be relevant.

And that's good, right? That result -- that test observation -- certainly matches what we would expect.

I recommend you play around with these examples a bit to get a feel for them.

You could certainly modify the above script to read in more context. You could have that context come from a file -- like we did in the last post -- or a database. You could also use an API to retrieve context from a web service. You can also try to scrape context from web pages, which we'll look at in the next post.

Wrapping Up

I previously wrote a series of posts about demystifying machine learning. This post was done somewhat in that same spirit.

What I want to do in the third, and final, post is take the concept of the first post -- the Transformers API -- and the focal point of this post -- question-answering -- and scale them up a bit for what will start to look like a real-world scenario that you might encounter.


This article was written by Jeff Nyman

