We come to the third post of this particular series (see the first and second), where I’ll focus on an extended example that brings together much of what I’ve been talking about. It also shows the difficulty of “getting it right” when it comes to AI systems and why testing is so crucial.
This post continues on from the first one. Here I’m going to break down the question-answering model that we looked at a bit so that we can understand what it’s actually doing. What I show, while decidedly simplified, is essentially what tools like ChatGPT are doing. This will set us up for a larger example. So let’s dig in!
The idea of “Generative AI” is very much in the air as I write this post. What’s often lacking is the ground-level understanding needed to see how all of this works. This is particularly important because “generative” concepts are really more about the idea of transformations. So let’s dig in!
In the first part of this post, I used a simple binary classification task to show some ideas around measures and scores and then provided some running commentary on how the tester mindset and skillset can be situated in that context. That post was about depth; this post will be more about breadth.
There are various evaluation measures and scores used to assess the performance of AI systems. As someone adopting a testing mindset in this context, I consider those measures and scores very important. Beyond simply understanding them as concepts, it’s important to see how they play out with working examples. That’s what I’ll attempt in this post.
In part 1 of this post we talked about a human learning to play a game like Elden Ring to overcome its challenges. We looked at some AI concepts in that particular context. One thing we didn’t do, though, is talk about using testing to assess quality risks based on that learning. So let’s do that here.
Humans and machines both learn. But the ways they do so are very different. Those differences provide interesting insights into quality and thus into the idea of testing for risks to quality. I’ve found that one way to help conceptualize this is through the context of games. Even if you’re not a gamer, I think this context has a lot to teach. So let’s dig in!
In the previous post in this series, I talked about testability, and the various aspects of it, in relation to testing a product that’s been AI-enabled in some way. In this post, I’ll focus on a specific case study and apply the thinking from the previous post to that study.
It’s definitely time to talk seriously about testing artificial intelligence, particularly in contexts where people might be working in organizations that want to have an AI-enabled product. We need more balanced statements of how to do this rather than the alarmist statements I’m seeing more of. So let’s dig in!