The Spectrum of AI Testing: Testability

It’s definitely time to talk seriously about testing artificial intelligence, particularly in contexts where people might be working in organizations that want to have an AI-enabled product. We need more balanced statements of how to do this rather than the alarmist statements I’m seeing more of. So let’s dig in!

A while back I published Testing AI … Before Testing With AI. That provided a contextual example to bring home a few points that, if I’m being totally honest, I didn’t think a lot of testers had been preparing themselves for.

The passage of a few years has convinced me that not much has changed. So I want to do my part to alter that trend. And I want to do this without sounding alarmist about artificial intelligence. Yes, there’s a lot to be cautiously skeptical of but there’s also a lot to be cautiously optimistic about as well.

I wrote previously about the idea of an ethical mandate and, in this context, part of that mandate should be to help organizations see how AI can be a technology that augments certain human activities but doesn’t replace humans. This is as opposed to simply railing against artificial intelligence as “irresponsible” technology or “harmful” technology. Just about any technology can be used in a harmful or irresponsible way. One way to avoid that is to test such technology to surface those risks.

Testing is Still Testing

Many years ago I talked about testing that is effective, efficient and elegant. All of that applies equally well to testing in an AI context. Beyond that, there are a couple of obvious areas of testing in this arena.

One is certainly data testing. All AI models rely very heavily on data. Thus it’s obviously important to test the quality and accuracy of the data that’s being fed into whatever models you’re testing. The good news is you can do this using the same techniques you use for testing any application that uses data: by creating test datasets that cover a wide range of scenarios, especially the edge cases.

And, as in any testing, edge cases do not necessarily mean “uncommon.” They just mean circumstances that occur at the edges or boundaries. The edges and boundaries are where bugs tend to congregate.

Examples here might include unusual input formats. Another might be infrequent events, which usually means patterns that are perhaps not the norm but are plausible. Another example would be low-confidence prediction generators, such as inputs that are ambiguous. Temporal edge cases are also important, such as looking for accuracy drift. This might occur when the AI is provided with evolving or changing data distributions.
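To make the accuracy drift idea a little more concrete, here's a minimal sketch of how a tester might compare a reference sample of one input feature against a more recent sample. It uses the Population Stability Index, which is just one option among several for this kind of check. The function name `psi`, the bin count, and the synthetic Gaussian data are all assumptions made for illustration.

```python
import math
import random

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are derived from the range of the reference ("expected") sample.
    A value of 0 means the binned distributions match; larger means more shift.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant sample

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or a division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]       # data the model was built on
recent_same = [rng.gauss(0.0, 1.0) for _ in range(2000)]     # same distribution
recent_shifted = [rng.gauss(1.5, 1.0) for _ in range(2000)]  # drifted distribution

print("no drift:", round(psi(reference, recent_same), 3))
print("drift:   ", round(psi(reference, recent_shifted), 3))
```

The point is less the specific index and more the habit: keep a reference sample, compare new data against it on a schedule, and decide ahead of time how much shift is worth investigating.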

So that’s overall data integrity and veracity. Another area is also one that testers are used to: performance. AI models can be computationally expensive. Various operational profiles thus should be utilized to put these models under various types of load, stress and volume conditions. You’ll want to see how the system performs when there are various types of resource constraints.

Talking about testing that most testers already do in one context and applying that to AI is pretty low-hanging fruit, though. The bigger challenge is that AI models are often probabilistic.

So what about the accuracy of the model’s predictions? How do we perform risk assessments based on that via testing?

I should note that this also applies to solutions like Bard and ChatGPT which, when you reduce them to their particulars, are all about predicting what words should be used next in the context of a given response to a given query.

In this series of posts I’ll talk about this topic across a few areas, particularly around evaluation metrics. But in this post I want to talk about the foundation: testability aspects.

I have a whole testability series that showcases how testability plays out in the context of writing an application. But how does this play out in the context of an AI-enabled application?

As I’ve said in many places at many times: testability is, fundamentally, about observability and controllability. Those can lead to reproducibility and thus predictability. So let’s see how those apply to testing in an AI context.

Controllability

If you’ll agree with me that part of testability is controllability, how would that play out in testing an AI system or a product that’s AI-enabled?

Controllability refers to the ability to control and manipulate inputs, conditions, or parameters during testing so that you observe specific behaviors or evaluate specific responses within specific contexts. Controllability, in any testing, allows testers to systematically explore different scenarios and edge cases and this allows us to focus on the comprehensiveness of our testing.

This is particularly important in contexts like the image above shows, where models may have so-called “hidden layers” where computation is done but what exactly is being done can be difficult to determine.

In my experience with testing AI, controllability focuses on a few key areas.

Testers can and should be able to control the input data fed into the AI system to observe how it responds under various conditions. This includes providing both typical and atypical inputs, modifying input features, introducing noise or perturbations, and testing different input scales or ranges.
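As a rough sketch of what "introducing noise or perturbations" can look like in practice, the following measures how often a prediction flips as the noise level goes up. The `toy_model` function is a hypothetical stand-in for whatever model is actually under test, and `flip_rate` is a name invented for this example.

```python
import random

def toy_model(features):
    """Hypothetical stand-in for a real model: a fixed linear score and a cutoff."""
    score = 0.8 * features[0] - 0.5 * features[1]
    return 1 if score > 0 else 0

def flip_rate(model, inputs, noise_std, seed=0):
    """Fraction of inputs whose prediction changes after adding Gaussian noise."""
    rng = random.Random(seed)  # seeded so the perturbations are repeatable
    flips = 0
    for x in inputs:
        noisy = [v + rng.gauss(0.0, noise_std) for v in x]
        if model(noisy) != model(x):
            flips += 1
    return flips / len(inputs)

data_rng = random.Random(1)
inputs = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]

# Step the noise level up and watch how quickly predictions become unstable.
for noise_std in (0.0, 0.1, 0.5, 2.0):
    print(noise_std, flip_rate(toy_model, inputs, noise_std))
```

Because the tester picks both the noise level and the seed, the experiment is controlled in exactly the sense described above: the same perturbations can be replayed against a later version of the model.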

Testers can and should be able to control the configuration of the AI model itself. This includes adjusting hyperparameters, the overall model architecture, or feature selection to assess the system’s sensitivity to variations.

Testers can and should be able to control simulations or emulations of specific conditions or environments that are relevant to the AI system or more specifically, to the algorithms underlying that system.

Testers can and should be able to generate synthetic or simulated data to control the characteristics, distribution, or complexity of various types of inputs. This enables targeted testing of specific scenarios that may be difficult to encounter or reproduce in real-world data, often for a variety of reasons that say nothing about how common or not those scenarios are in a real-world context.
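Here's a minimal sketch of that idea, assuming a very simple one-feature dataset. The generator lets the tester dial in the class balance and the share of near-boundary (ambiguous) cases directly. The function name, the field names, and the choice of zero as the decision boundary are all assumptions for the example.

```python
import random

def make_synthetic(n, positive_rate, boundary_fraction, seed=0):
    """Generate labelled rows with a controlled class mix and share of edge cases."""
    rng = random.Random(seed)  # same seed, same dataset
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < positive_rate else 0
        if rng.random() < boundary_fraction:
            # Ambiguous case: the value sits right at the assumed boundary of 0.
            value = rng.uniform(-0.05, 0.05)
            kind = "boundary"
        else:
            # Typical case: the value clusters well away from the boundary.
            center = 1.0 if label else -1.0
            value = rng.gauss(center, 0.3)
            kind = "typical"
        rows.append({"value": value, "label": label, "kind": kind})
    return rows

rows = make_synthetic(1000, positive_rate=0.2, boundary_fraction=0.1, seed=1)
boundary_count = sum(r["kind"] == "boundary" for r in rows)
print(len(rows), "rows,", boundary_count, "of them near the boundary")
```

Raising `boundary_fraction` well above what production data would show is a deliberate choice here: it concentrates the test on the scenarios that are hardest to find in real-world data.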

What all of the above is saying is that testers must be able to precisely control and manipulate the factors that influence the system’s behavior. As you can probably imagine, this covers a wide range. For example, with sensors — such as in Home AI systems or autonomous vehicles — you want to control lighting conditions or environmental conditions. For other systems, you might need to control many external inputs, such as in surrogate models used in nuclear reactors.

In the next post, I’ll get into a specific case study that’s probably more like what you would encounter in your average organization.

Observability

Again, if you’ll agree with me that part of testability is observability, how would that play out in testing an AI system or a product that’s AI-enabled?

Much as with any testing process, observability is all about our ability to observe and gather relevant information. This is done so we can gain insights into the internal workings, behaviors, and outputs of the AI system. This is really important when you consider that many AI systems are essentially black boxes in terms of what’s really going on internally. This is particularly true when you get into systems based on reinforcement learning. I’ve personally found the latter the most challenging to test. My Pacumen project was created largely to explore this area and it’s an area I talked about in various posts here, particularly in my exploration category.

In this context, the standard things you would expect from any testing are operative: performance monitoring, resource profiling, logging, debugging aids or tools, and so on. But there are some specifics that I’ve found in my experience testing AI to be very useful.

For example, with logging, you want to work with the development and design team to instrument the system so that you can get logs of intermediate computations, model predictions, feature activations, gradients, and any other aspects that provide the basis for the internal workings of how the system produces outputs.
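A small sketch of what that instrumentation might look like, using a toy linear scorer in place of a real model. The logger name `model_trace`, the `traced_predict` function, and the fields in the trace are assumptions for illustration. A real system would log whatever intermediate values its architecture actually exposes.

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("model_trace")  # hypothetical logger name

def traced_predict(features, weights, threshold=0.5):
    """Toy linear scorer that records its intermediate values as it goes."""
    contributions = [f * w for f, w in zip(features, weights)]
    score = sum(contributions)
    prediction = int(score > threshold)
    trace = {
        "features": features,
        "contributions": contributions,  # per-feature intermediate values
        "score": score,
        "threshold": threshold,
        "margin": score - threshold,     # how close the call was
        "prediction": prediction,
    }
    logger.info(json.dumps(trace))  # one JSON object per line, easy to filter later
    return prediction, trace

prediction, trace = traced_predict([1.0, 2.0], [0.5, 0.25])
```

Logging the margin alongside the prediction is the part I'd highlight: it lets a tester search the logs for decisions that were barely made, which is often where the interesting behavior is.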

Another area of focus is on the analysis of errors. In particular, as a tester you want to work with the delivery team to see what misclassifications were made by the AI. Even more specifically, you want to start looking for the patterns of errors and start to look for where the system’s classifications and/or predictions start to degrade rather than just looking for when they break down.
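One simple way to look for patterns of degradation is to slice the error rate by some condition of the input. The sketch below assumes each observation is a `(segment, actual, predicted)` tuple. The lighting segments and the labels are invented purely for illustration.

```python
from collections import defaultdict

def error_rate_by_segment(observations):
    """Return {segment: share of observations where predicted != actual}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for segment, actual, predicted in observations:
        totals[segment] += 1
        if predicted != actual:
            errors[segment] += 1
    return {segment: errors[segment] / totals[segment] for segment in totals}

# Invented results: (segment, actual label, predicted label)
observations = (
    [("daylight", "cat", "cat")] * 4
    + [("dusk", "cat", "cat")] * 3 + [("dusk", "cat", "dog")]
    + [("night", "cat", "cat")] + [("night", "cat", "dog")] * 3
)

rates = error_rate_by_segment(observations)
for segment, rate in rates.items():
    print(segment, rate)
```

An overall error rate of one in three would hide the story here. Sliced this way, you can see the system holding up in daylight, starting to slip at dusk, and breaking down at night.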

How the model is being interpreted — and how interpretable the model is in the first place — is a key aspect of observability in this context. In fact, interpretability is an internal quality and there are various interpretability techniques that testers can and should use to gain insights into the decision-making process of the AI system. This includes techniques like feature importance analysis, saliency maps, attention mechanisms and so on. These can help understand how the system is arriving at its predictions and that provides a basis for evaluating its overall behavior.

Reproducibility

If we assume that part of testability is reproducibility, how would that idea play out in testing an AI system?

Reproducibility is a crucial aspect of testability in the context of an AI system. In testing, the whole idea of reproducibility is the ability to reproduce and repeat tests reliably to validate behavior and results. This is pretty much the basis of how experimental science works. A key goal is always to establish a testing process that can be consistently executed, leading to consistent evaluations and reliable conclusions.

Note that “reliable conclusions” does not have to be a simple, binary “pass/fail.” It can often be more about assessing risk and talking about how something works in a given context rather than always focusing on whether how something is working is “right” or “wrong.” This can be counter-intuitive to non-testers.

One of the most obvious techniques in a reproducibility context is random seed control. Many AI algorithms involve random processes during training or testing. Reproducibility can be achieved by setting and controlling the random seed used in these processes. By ensuring that the same seed is used across test runs, the randomness becomes repeatable rather than eliminated: the same pseudo-random sequence is drawn each time, which leads to consistent results.
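The seed idea can be sketched with nothing more than Python's standard `random` module. The `noisy_evaluation` function is a made-up placeholder for any test step that draws random numbers, such as sampling test data or shuffling a batch.

```python
import random

def noisy_evaluation(seed):
    """Placeholder for a test step that depends on random draws."""
    rng = random.Random(seed)  # a private generator, so other code can't disturb it
    sample = [rng.random() for _ in range(5)]
    return sum(sample) / len(sample)

first_run = noisy_evaluation(seed=42)
second_run = noisy_evaluation(seed=42)
other_seed = noisy_evaluation(seed=43)

print(first_run == second_run)  # same seed, same result
print(first_run == other_seed)  # different seed, different draws
```

Machine learning frameworks generally have their own seeding mechanisms, and some have additional sources of nondeterminism beyond the seed, so treat this as the idea rather than a complete recipe. Recording the seed with the test results is what makes a surprising run replayable later.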

On the other hand, many AI models, by necessity, deal with variation. So it’s worth thinking about how that plays out in a test context. Here’s where testers need to augment their techniques. Parameter tuning is a good example. If the AI system has configurable parameters that influence its response to variability, testers can select and document the parameter settings used during testing. Testers can further experiment with different parameter configurations, keeping track of the chosen settings for reproducibility purposes. This is one way to make tests repeatable with the same configurations.

Another technique is known as ensemble testing. Here instead of relying on a single model or algorithm, you can use multiple models or variations of the AI system to capture different aspects of variability. The ensemble can be composed of models with different initializations, different hyperparameters, or different training data subsets. By recording the ensemble configuration, reproducibility can be maintained at some level of approximation.
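As a rough sketch of recording an ensemble configuration, the following builds five toy "models" that differ only in their seed and training subset, then checks where they disagree. Everything here is hypothetical: the "training" just learns a cutoff between two class averages, and the names `train_member` and `disagreement` are invented for the example.

```python
import random

def make_data(n, seed):
    """Toy labelled data: the label is 1 when x is above 0.5."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    return [(x, 1 if x > 0.5 else 0) for x in xs]

def train_member(data, config):
    """Hypothetical 'training': learn a cutoff from a seeded random subset."""
    rng = random.Random(config["seed"])
    subset = rng.sample(data, int(len(data) * config["subset_fraction"]))
    pos = [x for x, y in subset if y == 1]
    neg = [x for x, y in subset if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def disagreement(cutoffs, x):
    """Share of members in the minority vote for input x (0 means unanimous)."""
    votes = sum(x > c for c in cutoffs)
    return min(votes, len(cutoffs) - votes) / len(cutoffs)

data = make_data(400, seed=7)
# The recorded configuration is what makes the ensemble rebuildable later.
ensemble_config = [{"seed": s, "subset_fraction": 0.5} for s in range(5)]
cutoffs = [train_member(data, cfg) for cfg in ensemble_config]

for x in (0.05, 0.5, 0.95):
    print(x, disagreement(cutoffs, x))
```

Inputs where the members disagree are natural candidates for closer testing, and because the configuration is written down, the same ensemble can be rebuilt for a later run.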

Yet another example would be stratified sampling. When using data for testing, testers can use stratified sampling techniques to ensure a representative distribution of variability factors. By selecting samples from different strata or subgroups of data, testers can capture diverse scenarios and maintain reproducibility by consistently selecting the same samples during repeated tests.
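Here's a minimal sketch of stratified sampling with a fixed seed, assuming test records are dictionaries with a field that names the stratum. The `scenario` field and the 90/10 split between common and rare cases are invented for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every stratum, repeatably."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)
    sample = []
    for name in sorted(strata):  # fixed order keeps the draw repeatable
        group = strata[name]
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

records = (
    [{"id": i, "scenario": "common"} for i in range(90)]
    + [{"id": i, "scenario": "rare"} for i in range(90, 100)]
)

sample = stratified_sample(records, key="scenario", fraction=0.2, seed=11)
print(Counter(r["scenario"] for r in sample))
```

A plain random draw of 20 records from this set could easily contain no rare cases at all. Sampling per stratum guarantees they show up, and the seed guarantees the same ones show up next time.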

A technique that I find is really powerful and really underused is sensitivity analysis. The goal of this testing technique is to assess the impact of variability factors on the system’s performance. Vary the input parameters or conditions that contribute to variability and measure the system’s response.
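The simplest form of this is one-at-a-time sensitivity analysis: hold everything at a baseline, nudge one factor, and record how much the response moves. The sketch below assumes a hypothetical `toy_score` function standing in for a real metric such as accuracy, with invented `blur` and `brightness_shift` factors.

```python
def toy_score(params):
    """Hypothetical system response; stands in for a real metric such as accuracy."""
    return 0.9 - 0.3 * params["blur"] - 0.05 * params["brightness_shift"] ** 2

def one_at_a_time(score_fn, baseline, deltas):
    """Change each factor by its delta, one at a time, and report the effect."""
    base = score_fn(baseline)
    effects = {}
    for name, delta in deltas.items():
        varied = dict(baseline)          # copy so the baseline stays untouched
        varied[name] = baseline[name] + delta
        effects[name] = score_fn(varied) - base
    return effects

baseline = {"blur": 0.0, "brightness_shift": 0.0}
deltas = {"blur": 0.5, "brightness_shift": 0.5}

effects = one_at_a_time(toy_score, baseline, deltas)
ranked = sorted(effects, key=lambda name: abs(effects[name]), reverse=True)
print(effects)
print("most sensitive first:", ranked)
```

One caveat with this style of analysis is that it only looks at one factor at a time, so it won't reveal factors that matter only in combination. It's still a cheap way to find out where to point deeper testing.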

The common theme of all the above is figuring out how to systematically explore different levels of variability.

Predictability

Predictability is an important aspect of testability in general. Much of the software we test is deterministic. But it operates with conditions that we can’t always see or aren’t always aware of. So there appears to be an element of randomness to how software works. And here I’m just talking about regular, non-AI based software.

Given this, predictability is often tricky. It refers to the ability to anticipate and predict the behavior, outcomes, and performance of some system during testing. The idea is pretty simple: you have expectations, you set up targeted tests to verify or falsify those expectations.

The reality, of course, is that you can’t just rely on a set of scripted, targeted tests. You have to explore the system. Much as AI and machine learning have an “explore-exploit dilemma,” testing has one as well. I talked about this in Exploring, Bug Chaining and Causal Models, all of which was non-AI based.

So how do testers do this? In my experience with AI systems, one key thing to do is perform validation against training objectives. If the AI system was trained with specific objectives, predictability can be assessed by validating the system’s behavior against those objectives. The system’s responses and outputs can be compared to the training objectives to determine if the system is aligning with the intended behavior.

Note that there’s wiggle room here in that this doesn’t mean you have simple “pass / fail” test cases, as I mentioned earlier. Instead, you have a range or spectrum of acceptable and unacceptable.

Predictability can also be achieved by estimating the expected performance of the AI system under different conditions. Testers can use techniques such as cross-validation, holdout sets, or performance modeling to estimate the system’s performance and predict how it will perform on unseen data.
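To show the mechanics of cross-validation without any particular library, here's a minimal k-fold sketch. The "model" is deliberately trivial (a learned cutoff between two class averages) and the data is synthetic. The function names are assumptions for illustration and not any framework's API.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, train_fn, score_fn, seed=0):
    """Train on k-1 folds, score on the held-out fold, and repeat for each fold."""
    scores = []
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        train = [row for j, row in enumerate(data) if j not in held_out]
        test = [data[j] for j in test_idx]
        scores.append(score_fn(train_fn(train), test))
    return scores

def train_threshold(rows):
    """Toy training: put the cutoff halfway between the two class averages."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, rows):
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

rng = random.Random(0)
data = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(200))]

scores = cross_validate(data, k=5, train_fn=train_threshold, score_fn=accuracy)
print("per-fold accuracy:", scores)
print("mean:", sum(scores) / len(scores))
```

The spread across folds is as informative as the mean. If one fold scores far worse than the others, that's a hint the system's performance on unseen data is less predictable than a single number suggests.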

I want to mention something really important: when we talk about “performance” in the context of AI, it doesn’t necessarily refer to load or scalability aspects. I have used the term in that context earlier in this post. In the context of AI, however, “performance” typically refers to the effectiveness and efficiency of the AI system in achieving its intended objectives.

One example of a performance estimation in the context of AI systems is accuracy. Accuracy measures how well the AI system can correctly predict or classify outcomes. It assesses the system’s ability to produce correct and reliable results, such as accurate predictions, classifications, recommendations, or, in the case of tools like ChatGPT, responses.

Two other performance estimations in the context of AI systems are precision and recall, metrics commonly used in classification tasks. The distinction breaks down pretty simply:

Precision measures the proportion of correctly predicted positive instances out of all predicted positive instances.

Recall measures the proportion of correctly predicted positive instances out of all actual positive instances.

These metrics help testers assess the system’s ability to balance correct predictions and avoid false positives or false negatives.
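The definitions of accuracy, precision, and recall given above translate directly into a few lines of code. The labels below are invented for illustration, and returning 0.0 when a denominator is zero is a convention chosen for this sketch, not a universal rule.

```python
def accuracy(actual, predicted):
    """Share of all predictions that match the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def precision_recall(actual, predicted, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented example: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(actual, predicted)
print("accuracy: ", accuracy(actual, predicted))
print("precision:", precision)
print("recall:   ", recall)
```

In this example the system gets 7 of 10 predictions right, yet it finds only half of the actual positives. That gap between accuracy and recall is exactly why a single metric is rarely enough for a risk assessment.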

There’s also a performance estimate called Mean Average Precision (mAP). This is commonly used in object detection or instance segmentation tasks. This measures the accuracy and precision of the system in localizing and identifying multiple objects in an image or video. Higher mAP values indicate better performance.

For AI systems that involve training — which is a whole lot of them! — performance can also encompass the time it takes to train the models or algorithms. Faster training times can be desirable, especially in scenarios where frequent model updates or retraining is necessary. I think this one is particularly relevant to those vendors out there promoting tools that utilize AI to actually execute tests. Their systems have to train on whatever those tests are executing against and that training has to be repeated if the thing being tested changes.

One other area of predictability that I find testers often don’t utilize in this context is confidence interval estimation. The idea is that predictability can be enhanced by estimating confidence intervals for the system’s performance metrics. Confidence intervals provide a range of values within which the true performance is likely to fall. The general, if simplified, idea here is that predictability increases as the confidence intervals become narrower, indicating a more precise estimate of the system’s behavior.
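One way to estimate such an interval without leaning on distributional assumptions is the percentile bootstrap: resample the test outcomes with replacement many times and look at the spread of the resulting accuracies. The sketch below is a simplified illustration, with the function name, the resample count, and the two invented test sets all being assumptions.

```python
import random

def bootstrap_ci(outcomes, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes (1 = correct)."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    lower_idx = round((1 - level) / 2 * (n_resamples - 1))
    upper_idx = round((1 + level) / 2 * (n_resamples - 1))
    return means[lower_idx], means[upper_idx]

# Two invented test runs with the same observed accuracy of 0.8.
small_run = [1] * 40 + [0] * 10     # 50 test cases
large_run = [1] * 800 + [0] * 200   # 1000 test cases

small_ci = bootstrap_ci(small_run)
large_ci = bootstrap_ci(large_run)
print("50 cases:  ", small_ci)
print("1000 cases:", large_ci)
```

Both runs report 80% accuracy, but the interval from the larger run is much narrower. That's the practical payoff: the interval tells you how much to trust the number, which the number by itself never does.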

Summing Up

You can probably see how predictability leads back to reproducibility which requires controllability and thus observability.

What I also hope you can see is that the broad approaches of the quality and test specialist are not all that different in an AI context from what they are in any other technology context we test in. There are simply different techniques that have to be applied.

In future posts in this series I’m going to cover some of the specific measures or “scores” that you’ll often hear about in relation to AI systems. In the next post, though, I’m going to cover a use case that brings in all the testability aspects we talked about here.

Stories from a Software Tester
