Welcome Mutants to Your Testing Strategy

Have you ever come across the term “mutation testing”? My experience is that some people might have heard about the concept of mutation testing — perhaps in passing — but rarely have they heard anything substantive about it. When I talk with developers and testers, I'm initially surprised at how few of them have ever applied this sort of technique. On the other hand, if there isn’t a lot out there about it, my surprise is unwarranted. That’s even more the case if the material that is out there doesn’t talk about how mutation testing can actually be practical.

Mutation testing was originally introduced as a way of measuring the accuracy of a set of tests by introducing bugs into a given program and seeing if the test could find those bugs. This concept stems from the idea that there’s not always an easy way to tell if a given set of tests thoroughly tests a given application or a part thereof. Think of it this way: if the application — or a section of functionality — passes the tests, the only thing that a tester can say with any sort of certainty is that the application works correctly on all the cases that are included in that set of tests.

When you think about it, that’s not actually saying all that much — at least without understanding the nature of the tests being run and their ability to actually find a bug were one present.

The general idea is that the more specific test cases a given test set contains, the higher the probability that the application will work as it was intended since, if it did not, the tests would have found that out. That can be true but need not be true since it doesn’t speak to the efficacy of the test cases themselves. What it does speak to is a presumption that quantity must equal quality. And we all know how true that is, right? (Ahem.)

Forget for a moment all the academic stuff you may have read on “defect prediction models” and “reliability models.” Aside from those things — and that’s even assuming they work in a practical fashion — there’s really no mathematical way to measure how thoroughly accurate the set of tests are and the probability that the application will work correctly based on the execution of those tests.

So how does mutation testing help determine how accurate and/or thorough a given set of tests is? First, note that what you’re really determining is the potential for bug yield with your test cases, which will ultimately tell you how good your coverage is in terms of permutations and/or combinations.

Let’s start with two assumptions. (If you’re an experimentalist or a statistics-minded person, you’ll probably want to call these null hypotheses. Feel free.)

  • First Assumption: Your set of tests is thoroughly accurate and covers all possible relevant conditions.
  • Second Assumption: The application you are testing has no bugs and will legitimately pass these tests.

In other words, what we’re assuming here is that when the tests you have are executed against the application, all the test cases will pass. With those assumptions in mind, if the code of the application were changed — thereby mutating it — and you were to execute your test cases against the “mutated” application, two possible scenarios could play out.

  • The first scenario is that the application was, in fact, affected by the code change and your tests will detect the change(s). Remember that one assumption was that the set of tests was thoroughly accurate, which means that your test set must detect the change. If in fact your test set does detect the change, the mutant (the change to the code) is often called a killed mutant.
  • The second scenario is that the application is not changed by the mutation of the code and, in that case, the set of tests does not detect the mutation. Here the mutant (the code change) is often referred to as an equivalent mutant.

What’s the point of this? Well, if you were to take the ratio of killed mutants to all the mutants that were created in the application, you get a number between 0 and 1. It’s this number that measures how sensitive the application — or a part of it — is to certain types of code changes.

Now, here’s a key point: the above conclusion holds if we maintain the assumptions we started off with.

Yet, as you probably know, it’s very rarely the case that you have some sort of “perfect” application that has no bugs and, on the other side of that coin, you very rarely are dealing with a “perfect” set of tests. That means you actually have to consider a third scenario.

  • The third scenario is that the application is affected by the change to the code (the mutation) but the set of tests does not detect this change because the tests do not have the right test case (or the right variation of existing test cases).

With that out there, if you once again take the ratio of all the killed mutants to all of the mutants generated in the code base, you still get a number between 0 and 1, and that number also contains information about the thoroughness and/or accuracy of the set of tests.

Notice the two components here: in the first place I was talking about the sensitivity of the application code base to changes. Then I moved into talking about the sensitivity of the set of tests to changes in the application code base.

If your thinking cap is on and I’ve done at least a marginally good job of explaining all this, you might find the above a bit odd in that there’s not really a good way to totally and completely separate the effect that is related to inaccuracies in a test set from the effect that is related to equivalent mutants.

In other words, if the application does not change relative to equivalent mutants, that doesn’t speak to the thoroughness or accuracy of the tests because the result of the test execution is that nothing is detected. Exactly why nothing was detected is not determined. However, in the absence of other possibilities, you can accept, at a certain level of approximation, that the ratio of killed mutants to all the mutants does serve as a measure of the efficacy of a set of tests, if not always to its thoroughness or accuracy.
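Everything above boils down to a single ratio, often called the mutation score. Here’s a minimal sketch of that calculation; the class and method names are my own, and the caveat in the comment is the important part:

```java
// Mutation score: killed mutants divided by all generated mutants.
// Caveat: this naive ratio cannot distinguish an equivalent mutant
// (which no test could ever kill) from a genuine gap in the test set.
public class MutationScore {
    public static double score(int killedMutants, int totalMutants) {
        if (totalMutants == 0) {
            throw new IllegalArgumentException("no mutants were generated");
        }
        return (double) killedMutants / totalMutants;
    }

    public static void main(String[] args) {
        // One killed mutant out of three generated mutants.
        System.out.println(score(1, 3)); // prints 0.3333333333333333
    }
}
```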

A Mutant Example

I mentioned something about “practical” early on, right? So what I want to do now is present a series of Java code examples that hopefully allow me to describe a bit of what I’m talking about above. (In advance, I’ll ask any Java coders who might be reading this not to laugh at the simplicity of my examples as compared to the applications they develop and test.)

So let’s consider test1.java.
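The original listing isn’t reproduced here, so the following is a hypothetical reconstruction of what test1.java could look like, working backward from the behavior described: the program reads numbers from the command line and compares the first one against 3. The classify() helper is my own structural choice, not necessarily the author’s:

```java
// A plausible sketch of test1.java: only the first command-line
// argument matters; it gets compared against 3.
public class test1 {
    static String classify(int value) {
        if (value < 3) {
            return "Got less than 3";
        } else {
            return "Got more than 3";
        }
    }

    public static void main(String[] args) {
        // Hidden problems lurk here: no arguments at all, a non-numeric
        // argument, and the boundary value 3 itself are never handled.
        System.out.println(classify(Integer.parseInt(args[0])));
    }
}
```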

You can call this program by passing numbers to it from the command line. Now let’s assume that you have the following test suite that tests the program:

Test Case 1:
Input: 2 4
Expected Output: “Got less than 3”

Test Case 2:
Input: 4 4
Expected Output: “Got more than 3”

Test Case 3:
Input: 4 6
Expected Output: “Got more than 3”

Test Case 4:
Input: 2 6
Expected Output: “Got less than 3”

Test Case 5:
Input: 4
Expected Output: “Got more than 3”

This set of tests will test valid conditions — what some testers would call “positive tests” — which means the test set will test whether the program reports correct values for correct inputs. You’ll note that these tests completely neglect illegal or invalid inputs to the program. The end result is that the test1.class program fully passes this set of tests. However, the program has some serious hidden errors.
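The five cases above are easy to capture in a small, self-contained harness. The classify() method here is a stand-in of my own for the program's comparison logic; note that it inspects only the first argument, which is why a lone "4" is a legal input:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A hypothetical harness for the five test cases in the article.
public class TestRunner {
    // Stand-in for the program under test: compare args[0] against 3.
    static String classify(String[] args) {
        int value = Integer.parseInt(args[0]);
        return value < 3 ? "Got less than 3" : "Got more than 3";
    }

    public static void main(String[] argv) {
        Map<String[], String> cases = new LinkedHashMap<>();
        cases.put(new String[] {"2", "4"}, "Got less than 3");
        cases.put(new String[] {"4", "4"}, "Got more than 3");
        cases.put(new String[] {"4", "6"}, "Got more than 3");
        cases.put(new String[] {"2", "6"}, "Got less than 3");
        cases.put(new String[] {"4"},      "Got more than 3");

        int passed = 0;
        for (Map.Entry<String[], String> c : cases.entrySet()) {
            String actual = classify(c.getKey());
            if (actual.equals(c.getValue())) {
                passed++;
            } else {
                System.out.println("FAIL for input: " + String.join(" ", c.getKey()));
            }
        }
        System.out.println(passed + "/" + cases.size() + " test cases passed");
    }
}
```

Against the unmutated logic, all five cases pass, which is exactly the starting point the two assumptions require.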

So now let’s mutate the program. You can see this mutated program with test2.java.
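Since the original file isn’t shown, here’s a hypothetical sketch of what test2.java could look like, with three candidate mutations left in as comments. The specific mutations are my own guesses, chosen to illustrate the two categories of mutant; the author’s actual changes may have differed in detail:

```java
// A plausible sketch of test2.java. To apply a mutant, uncomment its
// line and comment out the line directly beneath it.
public class test2 {
    static String classify(String[] args) {
        // Mutant 2 (non-equivalent): reads the wrong argument; a
        // single-argument run throws ArrayIndexOutOfBoundsException.
        // int value = Integer.parseInt(args[1]);
        int value = Integer.parseInt(args[0]);
        // Mutant 1 (equivalent for integer input): same branch taken.
        // if (value <= 2) {
        if (value < 3) {
            return "Got less than 3";
        } else {
            // Mutant 3 (equivalent): same string, built differently.
            // return "Got more" + " than 3";
            return "Got more than 3";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(args));
    }
}
```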

There are three specific commented lines in that source file. These are the various mutations that are possible to apply to the program. If you were working with your developers to see how good your tests are, the developer would be putting those lines in specifically. The idea is that you should then uncomment one mutant at a time, recompile the program, and run the same set of tests as given above. If you uncomment a mutant line, you must always comment the line immediately following since that’s the line the mutant is meant to replace.

So, again, just to be clear: if you want to run a controlled mutation, you uncomment the mutation line and comment the line immediately below it.

If you run the above test set against a modified (mutated) program, you’ll see that you get the following results:

  • If either Mutant 1 or Mutant 3 is in place, the program will still completely pass the set of tests.
  • If Mutant 2 is in place, the program will fail every single one of your tests. In fact, you’ll find that the last case causes an ArrayIndexOutOfBoundsException.

What you see here is that Mutants 1 and 3 do not change the output of the program and are thus equivalent mutants. The set of tests does not detect these mutants being in place, which means the set doesn’t detect a change to the program because those changes did not have any observable impact on the output of the program or how the program functioned. Mutant 2, on the other hand, is not an equivalent mutant.

Regarding the presence of Mutant 2, the first four test cases will detect the change by producing output from the program that differs from the expected results. Interestingly, the fifth case may behave differently on different machines. It may show up as bad output from the program, or it may be visible as a program crash, such as via a Java exception.

So let’s take a look at the statistics here. We have three mutants and, of those three, only one mutant was killed, by which I mean it was found by the test set. What this indicates is that the number that measures the “quality” — what I’m calling efficacy for now — of the set of tests is 1/3. That seems pretty low. The reason this number is low is that two equivalent mutants were generated. This number should serve as a warning that the test set may not be testing enough. In fact, the program has two serious errors that really should be detected by the test set.

Let’s go back to Mutant 2 and its execution against test case 5. If the program crashes, then the mutation testing that we performed not only measured the quality of the test set, but also detected a very serious error in the code. That’s how mutation testing can find errors.

Now let’s consider a different equivalent mutation. Since test2 contained Mutants 1, 2, and 3, this one will be Mutant 4. You will see this in the file test3.java.

The difference between Mutant 4 and the previous mutants is that Mutant 4 was created in an attempt to make an equivalent mutant. This means that when it was constructed, an effort was made to build a program that should execute exactly the same as the original program. Put another way, test1.class and test3.class should be equivalent programs from an input/output point of view.
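As before, the file itself isn’t shown, so here’s a hypothetical version of test3.java. The particular Mutant 4 is my own invention: it adds and then immediately subtracts the second argument, which leaves the computed value unchanged for any two-argument run but touches args[1], so a single-argument run will crash:

```java
// A plausible sketch of test3.java: Mutant 4 is an attempt at an
// equivalent mutant. For two-argument input it behaves exactly like
// test1, but it reads args[1], so a lone argument (test case 5)
// throws ArrayIndexOutOfBoundsException.
public class test3 {
    static String classify(String[] args) {
        // Mutant 4: add and subtract the second argument (a no-op
        // arithmetically, but not a no-op for input handling).
        int value = Integer.parseInt(args[0])
                + Integer.parseInt(args[1])
                - Integer.parseInt(args[1]);
        if (value < 3) {
            return "Got less than 3";
        } else {
            return "Got more than 3";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(args));
    }
}
```

Run against the test set, the first four cases pass and the fifth throws an exception, which surfaces exactly the kind of hidden input-handling error the original program carries as well.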

If you run this latest program against the test set, test case 5 should fail — in fact, the program should crash. The point here is simply that by creating an equivalent mutant, I increased the detection power of the test suite. Why? Because finding the problem with test3.class — the equivalent mutant program — would have helped us find the mutant in the program test2.class when Mutant 2 was applied to it.

I think that this is actually very fascinating because it leads us to a provisional conclusion that it’s possible to increase the thoroughness/accuracy — and thus, efficacy — of the test suite in one of two ways: first, increase the number of test cases in a given set of tests; second, run equivalent mutants against the set of tests.

That second conclusion is important because it indicates that mutants can, at least potentially, help make testing a bit more effective. With these simple Java examples, each mutant was created by manually making a single change to an existing program. That said, these programs were quite simple. The process of generating mutants is difficult and time consuming and can be computationally expensive. So if mutation testing is to become a part of a test strategy, care must be taken in terms of how to implement the technique.

All of this, I realize, may sound very pie-in-the-sky. Keep in mind, however, that any time the developers change an application you, as a tester, are effectively applying your regression test set to a mutant. So, in a way, you’re always doing “mutation testing” but what I’m advocating here is having a more controlled approach to it in terms of more proactively determining how efficacious a given test set is.

If you are considering mutation testing as part of a test strategy, there are two ways to approach this.

  • You need to be able to work with the developers to create meaningful non-equivalent mutants if you want to focus on more unit-based tests to determine application sensitivity.
  • You need to be able to work with the developers to focus on creating meaningful equivalent mutants if you want to focus on test sensitivity.

In both cases the emphasis is on error detection but the focus of the technique changes a bit depending upon what you’re actively seeking. It can be difficult to generate meaningful non-equivalent mutants, at least in any boilerplate fashion, although it’s much easier to generate equivalent mutants.

Obviously this was a bit of a whirlwind tour into the idea of mutation testing. I hope to revisit this topic in some more depth later on.


This article was written by Jeff Nyman
