The Archaeology of Testing

Way back in the Dark Ages of 2011, I talked about finding hidden bugs. A little bit later, I talked about how testing is like studying the past. Here I’ll draw those two ideas together, at least somewhat. So, as any good archaeologist does, let’s dig in!

In this article, I’ll provide a few examples, in context, of bugs hiding in code and of how some archaeology, via testing techniques, helps us figure out what’s going on. Along the way, this will bring up some different ideas about testing. It will also reinforce the notion that testers need to have a good idea about code and that developers need to have a good idea about tests.

There will be an interesting interplay here, I think. We’ll be using testing in the present to reason about testing that was done in the past.

I should note that this isn’t an article about using automation (code-based tests) to find bugs. This is about how reasoning about code at a level well before it reaches a user interface can catch bugs earlier, thus keeping us on the cheap end of the cost-of-mistake curve. But, as you’ll see, that’s only if we listen to our tests when they are telling us something.

This post is very similar to my post on applying test thinking with code. Here I’m reinforcing those ideas with a few more examples.

When a Language Allows Bugs to Abound

Back in my article on test shapes, I brought up what I hope is the increasingly popular test diamond.

The supporting part of the diamond at the bottom is more helpful than it’s often given credit for. Let’s consider that in context by looking at this JavaScript code that was dug up via some code archaeology:
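
The original listing isn’t reproduced here, so what follows is a minimal stand-in sketch with the same general shape. Beyond the names described in the next paragraph and the two typos discussed further down, every detail of it is an assumption, and the real code apparently had considerably more wrong with it:

    // The player's inventory and the functions that work with it.
    const inventory = {
      items: {},
      limit: 100,
      total_weight: 0
    };

    // Add an item unless it would push the player past the weight limit.
    function add_to_inventory(item) {
      if (inventory.total_weight + item.weight > inventory.limit) {
        return `You cannot carry the ${item.name}; it weighs ${iten.weight} pounds.`;
      }
      inventory.items[item.name] = item;
      inventory.total_weight += item.weight;
      return inventory.items;
    }

    // Report whether the player has reached their carrying limit.
    function player_inventory() {
      let carried = 0;
      for (const item in inventory.items) {
        carried += inventory.items[item].weigth;
      }
      return carried >= inventory.limit;
    }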

This is a simple inventory function from a game. A player character can hold a certain number of items, called their inventory. There’s an object (inventory) that represents this. There are functions that allow adding items to the inventory (add_to_inventory) and checking if the player has reached their limit (player_inventory).

Take a moment to look at the code. See if perhaps you spot any problems. Hint: even if you don’t know coding all that well, the two errors will likely stand out to you.

Excavating some of the material from the code base turns up the following tests:
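
Those tests aren’t shown here either. A sketch of what they plausibly looked like, assuming they ran in the same file as the code above and used nothing more than Node’s built-in assert module:

    const assert = require("assert");

    // Add a couple of items and confirm the inventory behaves as expected.
    add_to_inventory({ name: "sword", weight: 30 });
    add_to_inventory({ name: "lamp", weight: 5 });

    assert.strictEqual(Object.keys(inventory.items).length, 2);
    assert.strictEqual(player_inventory(), false); // nowhere near the limit

    console.log("All inventory tests passed.");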

And that will work! No errors. Is that surprising? Well, it should be because, like I said, there are two relatively obvious errors in the code. Did you spot them? Here’s what I saw:

  • iten.weight
  • inventory.items[item].weigth

Clearly here “iten” should be “item” and “weigth” should be “weight”. So, being suspicious, it was time to apply some more tools to our dig, and what turned up was an interesting fact: that code is almost entirely broken. It’s not just those two errors, as it turns out. There are many more.

Yet, upon running the code you would get no feedback that such was the case. If this code were running in a browser, you would see no errors in the developer console. If you ran this via Node, you would similarly see no errors reported at all. Okay, but wouldn’t automated tests have helped? Well, the above clearly didn’t help.

So here’s what we have. We have thoroughly broken code that doesn’t show that it’s broken. We have tests that accurately exercise the code but that also don’t show that the code is broken.

How do I know the code is broken? Well, let’s turn on a type checker. After excavating the above code and tossing it in a tool like Visual Studio Code, you can put the following at the top:
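
That is, the single special comment that switches on TypeScript’s checking for a plain JavaScript file:

    // @ts-check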

And upon using that little aid in my archaeology of this code, this is what turned up:
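
The original annotated screenshot isn’t reproduced here. Against the sketch above, a roughly comparable annotated listing might look like this, with each numbered comment marking a problem; the real listing reportedly lit up with far more than these:

    // @ts-check
    const inventory = {
      items: {},
      limit: 100,
      total_weight: 0
    };

    function add_to_inventory(item) { // (1) 'item' is implicitly 'any'; only flagged under stricter settings
      if (inventory.total_weight + item.weight > inventory.limit) {
        // (2) flagged by the checker: cannot find name 'iten'
        return `You cannot carry the ${item.name}; it weighs ${iten.weight} pounds.`;
      }
      inventory.items[item.name] = item;
      inventory.total_weight += item.weight;
      return inventory.items;
    }

    function player_inventory() {
      let carried = 0;
      for (const item in inventory.items) {
        carried += inventory.items[item].weigth; // (3) a typo for 'weight'; may or may not be flagged
      }
      return carried >= inventory.limit;
    }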

Yikes! Each one of the numbered comments in that listing marks an error. And note that only some of those are actually being flagged by the IDE and the type checker. (In the editor, you can tell by the presence or absence of the red squiggly line underneath.)

So here’s the point: it’s quite possible to have code that defies automated tests and even built-in type checkers. But the flip side of that point is that having type checking is certainly valuable. Types, especially static ones, can be thought of as an early warning test tool. That tool produces a partial proof that the program is correct. Or, at the very least, it provides confidence that the program is not broken in some obvious, detectable way.

Still: look how many bugs were hiding! That should be a huge concern.

When a Language Structure Precludes a Test

Let’s take another example, this time in Python. Archaeology turns up the following code:
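
The excavated listing isn’t reproduced here. Based on what’s described below (and on the trace output that shows up later), a sketch of its shape might be the following, where the class name, the byte offsets, and the numeric print statements are all assumptions:

    class Header:
        """Wraps the raw bytes of a zcode story file."""

        def __init__(self, data: bytes) -> None:
            self.data = data

        @property
        def serial_code(self) -> str:
            """Return the serial code stored in the story file header."""
            try:
                code = self.data[0x12:0x18].decode("ascii")
                print("111111111111")  # leftover trace output
                return code
            except UnicodeDecodeError:
                print("222222222222")  # leftover trace output
                return "XXXXXX"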

What this function does is return the serial code that is provided as part of certain binary files. The idea of the above function was that if certain conditions were met, the serial code would be shown as “XXXXXX”, which meant the serial code itself couldn’t be read. But it turned out that with some files you could get a UnicodeDecodeError when trying to read those bytes. Hence the above try ... except logic.

Archaeology of this code base showed me something interesting. Some tests were routinely commented out, then commented back in, then commented back out again. This told me that people were having trouble with the test of this method. In looking at the test that was particularly problematic, this is what I dug up:
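
Reconstructed from the failure output shown below, the test looked essentially like this, with zork1_unicode_error_serial being a fixture (defined elsewhere) that loads the story file named in the captured output:

    import pytest

    def test_zork1_serial_number_error(zork1_unicode_error_serial):
        "handles a serial code with a unicode error"
        with pytest.raises(UnicodeDecodeError) as execinfo:
            serial_code = zork1_unicode_error_serial.serial_code
            assert serial_code == "XXXXXX"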

So this was testing for the error condition when the exception is raised. However, after I uncovered this test and ran it, I found that it would fail with the following:

zork1_unicode_error_serial = 

    def test_zork1_serial_number_error(zork1_unicode_error_serial):
        "handles a serial code with a unicode error"
        with pytest.raises(UnicodeDecodeError) as execinfo:
            serial_code = zork1_unicode_error_serial.serial_code
>           assert serial_code == "XXXXXX"
E           Failed: DID NOT RAISE <class 'UnicodeDecodeError'>

tests/header_test.py:42: Failed
---- Captured stdout setup --------------------------------------------------
/Users/jnyman/AppDev/quendor/tests/../zcode/zork1-r15-sXXXXXX.z2
---- Captured stdout call ---------------------------------------------------
222222222222

Notice how the “222222222222” is captured in the standard output? The archaeology of the commit history let me see that the print statements with the numbers were just in place to validate which logic got hit. Those statements were apparently left in accidentally. Still, that output was telling me that the appropriate path was being hit and thus that the exception was also clearly being generated.

Yet notice the automation tool (Pytest, in this case) was saying that this exception was not raised. That’s what the above test output is showing. At this point, I tested the above code manually as well to make sure the exception was in fact being generated. It was.

So my archaeology showed me logic that I knew was working but a test of that logic that was failing, and hence why it was likely commented out.

I mentioned there was churn around the test in the commit history. I found that people in the past had attempted to “mark” the test as such:
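
Presumably, given the variation mentioned below, that mark was the expected-failure decorator naming the exception:

    @pytest.mark.xfail(raises=UnicodeDecodeError)
    def test_zork1_serial_number_error(zork1_unicode_error_serial):
        "handles a serial code with a unicode error"
        with pytest.raises(UnicodeDecodeError) as execinfo:
            serial_code = zork1_unicode_error_serial.serial_code
            assert serial_code == "XXXXXX"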

And, with that in place, the test now passed!

However, my skeptical nature kicked in so I tried a few things and it turned out that the test now passed regardless of what exception was indicated in the “raises” portion. For example, I could do this: @pytest.mark.xfail(raises=IndexError). That also passed the test even though an IndexError most definitely was never raised.

Notice what’s happening here? I’m starting to fight with the test a bit, which is what I assume happened in the past. I had some logic that I knew was working, beyond doubt. I was trying to make sure I could have a test in place that would tell me if that logic ever stopped working. But I ended up with two situations:

  1. The test indicated failure when it was in fact not failing.
  2. The test indicated success even when the test itself was wrong.

The latter was clearly not tenable so I wanted to investigate the former.

The problem here was interesting. In the actual code, here’s the key point: the exception is being caught. That means that the exception is actually not raised. At least as viewed from outside of the property function itself. And the test is most certainly executing from outside the property function itself.

And that actually led to a discussion worth having about the code. I learned that it’s often argued, for example, that good design in this context means a property getter should never throw an exception.

So maybe trying to write this test put pressure on the design, such that the problem being seen with the testing was actually pointing to a code smell. And, as you can see, this could clearly lead to a bug hiding in the code that a test is simply not going to find. And the test won’t find it because, most likely, I simply wouldn’t have the test at all, since both of the above options were leading to a test I couldn’t trust.
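
To make that concrete, here is one sketch of the kind of design the test pressure was arguably pointing toward, reusing the hypothetical Header class from earlier: let the decoding live in a plain function that is allowed to raise, and keep the property getter as a thin wrapper that supplies the fallback.

    def read_serial_code(data: bytes) -> str:
        """Decode the serial code bytes; may raise UnicodeDecodeError."""
        return data[0x12:0x18].decode("ascii")

    class Header:
        def __init__(self, data: bytes) -> None:
            self.data = data

        @property
        def serial_code(self) -> str:
            """Never raises; falls back to the unreadable marker."""
            try:
                return read_serial_code(self.data)
            except UnicodeDecodeError:
                return "XXXXXX"

With that split, a test can assert pytest.raises(UnicodeDecodeError) against read_serial_code directly and, separately, assert that the property falls back to “XXXXXX”, with no fighting in either direction.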

That right there is an example of how testing at the code level can be problematic for developers. And that’s something that testers need to understand. In this case, history showed me a commented out test but I had to recover the history of why that test was commented out. And that showed me an area where bugs could easily hide. That, in turn, suggested a better design that would have been more testable, giving us a better chance of exposing bugs.

When a Language Approach Diminishes Testability

Let’s consider one more example, this one a bit more involved. This will involve Ruby and the use of some very dynamic logic.

In digging through some interesting logic, archaeology of a code base led me to some code that was clearly in place to support some dereference operations. The code in question had to utilize the ability to chain dynamic method calls. So what archaeology turned up was an open class addition defined in one of the source files:
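
The excavated method isn’t reproduced here. Based on what the rest of this section describes, a sketch of its shape might be the following: it walks a dotted chain of method names, sending each in turn to the result of the previous call, and passes an optional trailing argument along. Every detail of the body is an assumption; in particular, the original swallowed errors more subtly than the explicit inline rescue used here, so this sketch won’t reproduce every coverage quirk described below, and its lines only roughly correspond: line 3 is the guard, line 6 is the no-argument send, and line 8 is the send with an argument.

    class Object
      def call_method_chain(method_chain, arg = nil)
        return self if method_chain.empty?
        method_chain.split('.').inject(self) do |obj, method|
          if arg.nil?
            obj.send(method.to_sym)
          else
            obj.send(method.to_sym, arg) rescue obj.send(method.to_sym)
          end
        end
      end
    end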

For context, Ruby has the concept of “open classes” where you can essentially add to an existing Ruby class (in this case Object) and your code will act as if it’s part of the standard class.

Digging in, I found some interesting churn in the commit history around this particular method. That certainly tells you something: it means this is logic that people were struggling with in the past. That same churn seemed to be reflected in the tests.

Further archaeology showed that this was the only part of the application that wasn’t showing up as covered by unit tests, even though the area that calls the above method was covered. The coverage report specifically said that lines 1 and 2 of the above logic were covered but that, from line 3 down to line 9, there was no coverage at all.

So I dug into the soil a bit to find some code that actually used the above logic and this turned up:
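
Again, this is a sketch rather than the original: the enabled? guard and the "set" action are assumptions, while the DataSetter module name and the using()/use_data_with() methods come from what follows:

    module DataSetter
      # Apply a hash of data, e.g. { username: 'works' }, to page elements.
      def using(data)
        data.each do |key, value|
          use_data_with(key, value)
        end
      end

      def use_data_with(key, value)
        element = send(key.to_s)
        element.call_method_chain("set", value) if element.enabled?
      end
    end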

You can see that the using() method calls use_data_with() and it’s that latter method that relies on the call_method_chain() that was defined on Object.

Don’t worry if the logic is a bit confusing to you. In fact, let that confusion be a concern. A good test should not only exercise this functionality to make sure it doesn’t break; it should also help clear up how the logic being tested works, right?

So I dug a little deeper into the soil, and found a unit test that exercises the using() method:
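
The dug-up test isn’t reproduced here either. A sketch consistent with what follows might look like this, where the anonymous page class, the stubbed username element, and its disabled state are all assumptions:

    RSpec.describe DataSetter do
      let(:page)    { Class.new { include DataSetter }.new }
      let(:element) { double('element', enabled?: false) }

      it 'applies the provided data to the matching element' do
        allow(page).to receive(:username).and_return(element)
        page.using(username: 'works')
      end
    end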

So clearly this was using the chain method logic such that this line would be executed:
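
In the use_data_with() sketch above:

    element.call_method_chain("set", value) if element.enabled?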

And that would produce something like this:

element.set 'works'

In terms of what the test is doing, the logic is clearly calling the using() method. And, to be sure, that test passes. In fact, when looking at the code coverage, the coverage was 100% for that DataSetter module. Yet the coverage report also showed that the call_method_chain() method — again, called by the logic it said was being tested — was never being executed.

So, again, keep in mind what was happening here and let’s keep it simple.

  1. There was a method called use_data_with.
  2. There was a method called call_method_chain.
  3. The use_data_with method was calling the call_method_chain method.
  4. The test was exercising the use_data_with method.
  5. The coverage report indicated full coverage of use_data_with.
  6. The coverage report indicated zero coverage of call_method_chain.

I now started to understand the churn I was seeing around the testing of this logic. It looked to me like the first instinct of the developers here was to try to get 100% code coverage. And that immediately got me thinking that this is one of those cases where strict adherence to unit testing coverage was perhaps causing more trouble than it was worth. But, like any developer/tester caught in this little trap, I still found the need to confront the trap.

First, how did I know that some notion of 100% code coverage was being sought? Well, looking around, I found that a very simple test had been created:
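
Something along these lines, reconstructed rather than quoted:

    it 'calls a chain of methods on a receiver' do
      'testing'.call_method_chain('reverse.capitalize')
    end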

That change showed coverage up to line 6 of the call_method_chain method. Well, that’s better, I guess. (Actually, do you notice something odd about that “test”?) While better, that still left line 8, which wasn’t covered. It looks like eventually this was tried:
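
Presumably a variation that supplies an argument as well:

    it 'calls a chain of methods with an argument' do
      'testing'.call_method_chain('reverse.capitalize', 'GNITSET')
    end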

The idea was clearly to hit that last line to get the coverage. This test didn’t provide that, however. (Still something odd about this “test”, isn’t there?)

So what archaeology was showing me was that the developers could actually get to the point where only one line was not being covered, assuming they just used those little isolated tests. Gazing at this now, I felt that this difficulty in writing the test was an early warning sign. Much like that type checker from my first example, our test was trying to tell us something.

So I played around a little bit with the logic. Specifically, I focused on that last test that was getting most of the coverage. And it was while doing so that I think I actually exposed a bug. This part gets really tricky to see without having had the original context. So let’s just look at what the above test is doing.

  1. The test is calling call_method_chain on the string “testing”.
  2. The method chain is told to reverse and capitalize the text it was called on.
  3. Which means the result was expected to be “GNITSET”.

What people clearly thought was happening was this:

  • 'testing'.reverse()
  • 'testing'.capitalize()

Yet what was actually happening here was this:

  • 'testing'.reverse('GNITSET')
  • 'testing'.capitalize('GNITSET')

When the code would call either of those methods as such, there should be an ArgumentError: wrong number of arguments (1 for 0) generated. Why? Because reverse() and capitalize() are Ruby methods and those methods don’t take arguments! Yet no such error was being generated. Not only was an error not being generated from the test, I found it wouldn’t be generated when run manually either.

I think people were on the track of this because I found this test as well:
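
Again reconstructed, something like this:

    it 'calls a single method with an argument' do
      expect('testing'.call_method_chain('start_with?', 't')).to be true
    end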

Notice how this test is only testing call_method_chain by chaining just one method (start_with?). Is this perhaps what they were trying to figure out? Were they assuming that the problem had to do with chaining multiple method calls (like reverse.capitalize) as opposed to just one?

What’s interesting is that this latter test will work, but that has nothing to do with the number of methods chained. It’s because the start_with? method (which is also a Ruby method) does take an argument. What’s happening there is this:

  • 'testing'.start_with?('t')

The key point to notice here is how an error was getting swallowed up by the dynamic logic. And it was getting swallowed up because of the complexity of what was being tested and thus of what was developed. That is a bug just waiting to happen in the real world! So the tests were giving a very good warning here. Specifically, that the method being tested could be called in such a way that very legitimate errors would be swallowed up, causing extremely hard-to-diagnose bugs.

So that was the good part of the tests: they were showing a problem. But let’s go back to what was wrong with some of those tests. Did you see what it was? Let’s break it down. First, we had something like this:
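
That is, a spec example that calls the method but asserts nothing about the result:

    it 'calls a chain of methods with an argument' do
      'testing'.call_method_chain('reverse.capitalize', 'GNITSET')
    end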

That got a lot more test coverage. But that’s all it got us. More coverage. It didn’t actually test anything. The test wasn’t determining whether the method was returning correct information. This shows why coverage should not be the sole measure. What was really needed was something like this:
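
Something with an actual expectation in it. Against the sketch of call_method_chain above, for example:

    it 'calls a chain of methods on a receiver' do
      # Ruby's capitalize only uppercases the first character, so the honest
      # expectation is 'Gnitset' rather than the 'GNITSET' being aimed at.
      expect('testing'.call_method_chain('reverse.capitalize')).to eq('Gnitset')
    end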

Well, obviously, right? But that’s just it. Sometimes it’s not so obvious when you’re on a quest for some holy grail of code coverage. And here, as you saw, even though our archaeology turned up what were clearly attempts to get better coverage, that same archaeology turned up the fact that the tests were, in some cases, not actually testing anything and also were exposing that the code itself could essentially hide some very bad bugs.

Finally it’s worth noting that the coverage report that people were clearly putting so much stock in was giving false security anyway. For example, consider again that line I referenced:
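
In the use_data_with() sketch, that was:

    element.call_method_chain("set", value) if element.enabled?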

This would show as covered because the if condition was being evaluated, since it’s on the same line as the call (a suffix conditional, which Ruby allows). But there weren’t tests that actually invoked the method. So what this was showing was that, to properly unit test this, it was necessary to test call_method_chain in isolation.

Clearly the developers in the past had that same thought. But it’s not clear, from the archaeological evidence, whether people realized what the test was telling them. Did they determine that the code could hide bad bugs and just agree to accept that risk? Or did they not know that at all? Archaeology cannot tell us everything about human motivation or decisions, but it can help us determine where crucial decision points could have happened.

Archaeology Is Helpful!

Speaking broadly and generally, archaeology is the study of human activity. We study that activity through the recovery and analysis of something tangible. So here we studied the activity of developers, and we did so by recovering the code that was written and using some analysis (specifically, tests) to see what we could figure out.

And what we found, in all cases, was that it was very possible for bugs to hide in our code. Some of those bugs, like type problems, were relatively easy to expose. We didn’t have to dig too deep “into the dirt” to find the issues. Other bugs, however, were buried under many “layers of soil” and it took a bit more effort to ferret them out.

What we uncovered was that the tools meant to ferret out the bugs were actually misleading people in a few cases. But, in all cases, we also found that had those tests been listened to when they were being written, the very difficulty of writing and using them would have told us something valuable: what was being tested was in fact not very good code. In other words, testability was problematic.

Along with my testability series, I hope this post gives testers in particular a good idea of how code gets to be the way it is. And even if we have tests, it’s not enough. Tests have to put pressure on design. And even that is not enough because we still have to recognize when the tests are indicating that testability has been compromised.

And when testability is compromised, bugs can very easily hide.

What this shows is that, in line with my “testing is like…” thoughts, testing is (or can be!) very much like archaeology.


