In a previous post on test dogma and tradition, I talked about the famous “test pyramid” as an example of what people cling to as a means of explanation. My concern there was that people often run too far with this or draw the wrong conclusions from it. Let’s look at a particular example of that.
The book Gradle in Action is one of the many books out there that have chapters on testing and that use some variant of the testing pyramid. Let’s take a look at what the book gives us:
The author then says:
The number of tests you write should be driven by the time and effort it takes to implement and maintain them. The easier a test is to write and the quicker it is to execute, the higher the return on investment (ROI). To optimize your ROI, your code base should contain many unit tests, fewer integration tests, and still fewer functional tests. This distribution of tests and their correlation to ROI is best illustrated by the test automation pyramid.
So, by this visual, the ROI — return on investment — goes up the lower down the pyramid you go. The implication, of course, being that the higher you go, the lower the ROI is. Yet keep in mind that the higher you go, the more realistic the testing actually is. The more realistic the testing, the more closely it aligns with how users get value from the software being tested.
In short, the book presents, in my opinion, a misleading simplification of the idea of testing ROI and uses the dogma of the test pyramid to do it.
I will say here that this is an excellent book about learning Gradle. It’s important to keep that in mind. I’m cherry-picking one representative example of the problem I see, but this should in no way be seen as an indictment of the book. Rather, it’s a perfect example of how even excellent resources often absorb some of the dogma and tradition around testing concepts.
Features and Components
The problem is that the relationship between overall behavior, at the feature level, and individual components is important and can be difficult to manage. Consider:
There’s not always a one-to-one relationship between a given set of components and a given set of features. A single feature can invoke and/or utilize several components. Likewise, a single component can be reused across several features.
So we can look at testing in a component-first view or a feature-first view. If we go component-first, we get this:
The idea here is that each component should have a test of its own. Such an idea is predicated upon the system being built incrementally, one component at a time. With each increment, a new component is created or an existing one is modified in order to support the features. A key problem here is that since we are not using the features to guide our tests, they can only express the expected behavior of the components.
So let’s consider feature-first:
This is conceptually similar to the component-first approach, with the idea that if a test is failing, then it means that the corresponding feature is broken. Now, to be sure, some would say that’s not the case with the component-first approach. And, in a way, that’s true. It really gets into the nature of the distinction between integration and integrated tests.
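To make the contrast concrete, here’s a minimal sketch in Python. Everything in it (the tax “component,” the checkout “feature,” the function names) is invented for illustration, not drawn from any real codebase:

```python
# A tiny "component": computes tax on an amount.
def tax_for(amount: float, rate: float = 0.20) -> float:
    return round(amount * rate, 2)

# A tiny "feature": a checkout total, which reuses the tax component.
def checkout_total(prices: list[float]) -> float:
    subtotal = sum(prices)
    return round(subtotal + tax_for(subtotal), 2)

# Component-first test: expresses the expected behavior of one component.
def test_tax_component():
    assert tax_for(100.00) == 20.00

# Feature-first test: expresses the expected behavior of the feature,
# which happens to invoke the tax component internally.
def test_checkout_feature():
    assert checkout_total([60.00, 40.00]) == 120.00
```

Notice that if `test_tax_component` fails, nothing in the test itself tells you which features are affected; and if `test_checkout_feature` fails, nothing in it tells you which components need fixing.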
A key element to focus on is whether you can answer this question:
- If a component test is failing, which features are failing?
But another key question is this:
- If a test is failing, and we know which feature is failing, do we know which components need to be fixed?
I was originally planning on stopping the post here. I was a bit worried I would defocus if I went too much further. And that’s because I want to talk about reframing our development and testing tasks in light of the need to get a good ROI on testing, but also the need to test higher up the stack if we truly want to see what users are getting value from. This is the only way to show what “put pressure on design” means.
So I will continue, but if you feel a sufficient point has been made, this would be a great time to stop reading.
Test Expression and Abstraction
One of the ROI drawbacks intimated in the book is the idea that tests higher up the pyramid take longer to run and are generally more fragile. This can certainly be true. However, that does not automatically translate into lower ROI when you consider that you have to look at the business logic domain and the presentation layer, which is what your entire user base is going to be dealing with.
Let’s talk about the possible fragility of these kinds of tests.
There comes a time when people end up writing a lot of test cases because they need or want to combine the scenarios from the business logic domain with the ones from the UI domain. When I say “UI domain” here I mean “user interface” in the general sense. A client app, mobile app, web app, or API service all provide a particular user interface.
This seems to get into the distinction between “functional” and “behavioral” that some testers like to go on and on about. Let’s frame this as a question to see how the argument often goes.
- If I want to test whether the UI works, why should I test the business rules?
The rationale here is often that the UI elements are a purely functional concern. Yet, there is a slight complication when you realize that testing the business rules usually means using the UI.
This matters because of how tests are expressed. Let’s say we use the “UI language” in our tests. By this I mean we express our tests in terms of “API endpoints” or “web pages.” If we use this UI language, maybe it will be too low-level to easily describe business concepts. On the other hand, if we use the business domain language, maybe we will not be able to test the important details of the UI because those are too low-level.
This trap — one of the BDD traps — is what you often see in feature files or specifications in tools like Cucumber, wherein you end up with tests that mix UI language with business terminology: for example, a step like “When the user clicks the Submit button” sitting right next to “Then the account is marked delinquent.” Tests like that end up neither focused nor very clear to anyone.
I talked in a previous post about the abstraction levels for tests so I won’t revisit all that here.
As an example…
Consider a form that asks you to enter a valid e-mail. Suppose the developers are writing the validation logic of that form and any errors are returned as an array of error messages. Behind the scenes, that array is iterated over and those error messages are sent out to any view mechanism that wants to consume them. For an API, this may be a JSON object with an appropriate block to indicate errors. For a web interface, the error(s) may be displayed in a friendly way to the user, such as in a flash message in a Rails application.
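As a rough sketch of what that might look like (all names here, like `validate_email`, are hypothetical, invented for illustration rather than taken from any real library):

```python
import json
import re

def validate_email(email: str) -> list[str]:
    """Returns an array of error messages; an empty array means valid."""
    errors = []
    if not email:
        return ["email is required"]
    username, _, domain = email.partition("@")
    if not username:
        errors.append("missing username")
    if not re.fullmatch(r"[A-Za-z0-9.-]+\.[A-Za-z]{2,}", domain):
        errors.append("incorrect domain")
    return errors

# One consumer: an API that serializes the errors as JSON.
def api_response(email: str) -> str:
    errors = validate_email(email)
    return json.dumps({"errors": errors} if errors else {"status": "ok"})

# Another consumer: a web view that renders a flash-style message.
def flash_message(email: str) -> str:
    errors = validate_email(email)
    return "; ".join(errors) if errors else "Welcome!"
```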
That’s two levels there, right? There’s the code-based level with the array and the user interface level.
Now, going from the test pyramid viewpoint shown earlier, my bigger ROI comes from testing at that lower level. So my first test — code-based — might be the success case. You pass a valid e-mail and expect the validation function to return an empty array, which means no error messages. This is simple because it establishes an example of what counts as valid input, and the input and expectations are simple enough.
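Assuming a validator shaped like the sketch above, that first code-level test is about as simple as tests get:

```python
def test_valid_email_produces_no_errors():
    # Success case: a valid email yields an empty array, meaning
    # there are no error messages for any view to consume.
    assert validate_email("user@example.com") == []
```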
This is, in fact, where a TDD approach comes in. Once you have a failing test, then — and only then — are you supposed to write some production code to fix it. The point of all of this is that you should not write new code if there is not a good reason to do so. In test-first, we use failing tests as a guide to know whether there is need for new code or not. The rule is easy:
- You should only write code to fix a failing test or to write a new failing test.
But consider that this can also be done at the requirements stage. We can frame this as: you only write a requirement for something that seems to matter. To make all this (hopefully) clear, consider that you probably wouldn’t write a requirement saying: “User receives no error messages upon using a valid email.” Please take a moment and think about that phrase, and then think about the test I just mentioned at the code level for the empty array. Are you likely to word it that way? Probably not. You would more likely say “User is taken to the landing page when they use a valid email to login.”
Do you see what happened there? You have a UI level concern and you have a business level concern.
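Expressed at that higher level, the same success case might read something like the following. The page objects here are purely hypothetical stand-ins for whatever actually drives your UI:

```python
class FakeLandingPage:
    title = "Landing Page"

class FakeLoginPage:
    """Stand-in for a page object that would drive a real browser."""
    def submit(self, email: str):
        # A real implementation would submit the form through the UI;
        # here we just fake success for a valid email.
        if validate_email(email) == []:
            return FakeLandingPage()
        raise AssertionError("login failed")

def test_valid_email_reaches_landing_page():
    # The business concern (a valid email logs the user in) and the
    # UI concern (the user lands on the right page) meet in one test.
    result = FakeLoginPage().submit(email="user@example.com")
    assert result.title == "Landing Page"
```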
But now let’s consider invalid conditions. There are various conditions here:
- A user who supplies no email at all should receive a specific type of message.
- A user who supplies a malformed email will receive a certain type of message.
- A user who supplies a valid email but that email is not recognized will receive a certain type of message.
And so on. From a code perspective, what the messages actually are and how they prompt a user to action is often completely secondary. All that matters is that the messages are populated in the correct data structure.
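Continuing the hypothetical sketch, a code-level test for the first two conditions cares only about the contents of the array (the “not recognized” case would need a user store, which is beyond this sketch):

```python
import pytest

@pytest.mark.parametrize("email, expected", [
    ("", ["email is required"]),                 # no email at all
    ("@example.com", ["missing username"]),      # malformed: no username
    ("user@bad_domain", ["incorrect domain"]),   # malformed: bad domain
])
def test_invalid_emails_populate_error_array(email, expected):
    # What the messages say, and how they prompt the user to act,
    # is invisible here; only the data structure is checked.
    assert validate_email(email) == expected
```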
To go just a little further with this example, consider that a user who supplies an email with multiple issues (like “@notme#domain.com”) should receive two errors: “missing username” and “incorrect domain”. Or perhaps people would argue it should just generate one message: “Invalid email” without worrying about all the particulars. The point is that this can come up during requirements. But at the code level — supposedly where my higher ROI is — this is all just messages getting placed into an array and that array being made available to any UI that wants to consume it.
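With the sketch above, the two-error expectation would look like this; swapping in a single generic “Invalid email” message would change the requirement, but at the code level it’s still just entries in an array:

```python
def test_email_with_multiple_issues_yields_both_errors():
    assert validate_email("@notme#domain.com") == [
        "missing username",
        "incorrect domain",
    ]
```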
Shifting From Components to Features …
This is where TDD shades into BDD in terms of the source of truth. The “problem” with TDD is that it doesn’t really say anything about what a coding task is, per se. TDD doesn’t necessarily tell you when a new coding task should be created or what kind of changes should be allowed in the context of a coding task.
The reason for the quotes around “problem” is because this actually isn’t a problem with TDD. It’s more a recognition that TDD does not explicitly say how to connect both worlds of the interface and the business.
So this leads to a lot of teams “doing TDD,” but testing the wrong things or, at the very least, putting more emphasis on the less important things. Yes, perhaps they were able to test all their classes, but they tested whether the classes behave as expected, not whether the product behaves as expected. And our users will be using the product. So TDD is taking that component-first viewpoint.
(But that’s okay. They have a pyramid that tells them they are getting more ROI, right?)
BDD tries to fix these problems by making the tests directly dependent on the feature set of the product rather than on the code implementation of it. Basically, BDD is meant to be another kind of test-first approach where a new coding task can be created only when a change in the product happens: a new requirement, a change to an existing one, or a new bug.
This means the rule of test-first goes from “Don’t write any new tests if there is not a new coding task” to “Don’t write any new tests if there is not a change in the product.” This means the focus of testing shifts from components to features.
… Puts Pressure on Design
And what this has all done is put pressure on design at different points. Putting pressure on code design is good from a code-based perspective and is a key driver of internal quality. But you can have incredibly terrible code that can be tested quite well and that provides value to the user.
In fact, I can show you an example of this from a class I’m teaching. Check out this text adventure parser class. It’s terrible code. But check out the tests for the parser. Those tests show that this parser actually does a great job at parsing commands. That terrible code is a little better when you look at the refactored test parser. The tests serve both implementations of the parser.
However, none of this is connected up with an actual working game engine yet. More specifically, there is no UI — whether that UI be command-line driven, a desktop client, web-based, a mobile app, and so forth. So the ROI on all these tests is relatively low. They put little pressure on my design — which is why I have such crappy code — and they don’t tell me if the user can actually use the parser or if the parser correctly connects up to a game world.
A contrived example, to be sure, but one that displays the wider problems we tend to deal with in more complex applications.
Final Points
As I said, I was worried I might defocus a bit. What I hope you can see is that I went from the oft-used visual to framing operational questions that can guide our decisions. Specifically:
- If a component test is failing, which features are failing?
- If a test is failing, and we know which feature is failing, do we know which components need to be fixed?
From putting pressure on design at the code/component level (and getting one type of ROI) we come to this decision:
- “Don’t write any new tests if there is not a new coding task.”
From putting pressure on design at the feature level (and getting another type of ROI) we come to this decision:
- “Don’t write any new tests if there is not a change in the product.”
Framing questions like this, and having decision paths flow from their answers, takes a bit more work than just drawing a visual but, ultimately, I think it’s more of a substantive exercise.
Wow, Jeff, I am really amazed by your ability to make complex things look simple and clear. I was an admirer of your blog before; after this last piece, I am now a fanboy 🙂
Please keep on giving.
On ROI: I think ROI is the wrong metric for test automation. Let me try to explain.
I believe that we automate tests because we want to be able to release often (that’s it for me).
So how do you calculate the ROI of releasing every 30 minutes versus every 2 days versus every month? I don’t know, do you?
My point is that we automate not to save money (ROI) but to have the ability to react quickly to change. How do you quantify that in $$$?
Thanks again for a splendid piece
I actually agree with you that ROI is the wrong metric for this, which is probably not a point I made well in the article. The only metrics I ultimately care about are “feature coverage” and “feature readiness.”
Now, that being said, return on investment doesn’t technically have to correlate with monetary concerns only. For example, one way to look at return on investment from a test expression point of view is whether a focus on higher-level abstractions, such as BDD styles, has a lower return because it leads to more churn or longer delays in writing tests, since we have to “wire up” the English to the code. So am I getting a good return on investment, in terms of time, by introducing more abstraction layers? I could also ask if I’m getting a good return on investment in terms of other factors, such as discoverability.
But the reason I agree with you on not using the term is because, let’s face it, when the term “ROI” is used it tends to mean profitability. I suppose if I had a business degree (I don’t) I would argue that all of this relates to time spent, and time is money, and therefore all of life reduces to a statement of ROI. My career has not yet hardened my heart enough to take such a bleak view, however. 🙂
ROI is a technical term and I think it’s very focused on profit of some sort. It’s simple, but it’s a very hard standard for a test/QA department to meet for, well, anything.
If you want to use a business metric, I think the correct one is the cost-benefit ratio:
https://en.wikipedia.org/wiki/Benefit%E2%80%93cost_ratio
It’s also worth bearing in mind that a lot of these have very subjective qualities. What counts as investment or profit? What value do you assign to reputational damage, etc.?
I agree. One thing I clearly didn’t point out well in the article is that I was not arguing for ROI as such. That term was used because the triangle visual from the book used that concept, with the idea being that the ROI was less the higher you go. Meaning, by the book’s argument, the investment in time, both to create and execute tests at the higher level, does not have a corresponding return on that investment in terms of better quality. If — and I do say “if” — I had to argue in terms of ROI, I would say it’s actually the reverse.
Your point about reputation damage is a good one. How do you measure that? And assuming you could, how do you measure it in relation to the type of investment you put into certain kinds of tests? You can do this, of course, to a certain extent. How? By correlating bugs that were missed, which led to hits to the reputation. Then you could look at tests you had written that should have caught those bugs. You might find: “Oh, we didn’t write those tests. Those are UI tests and we didn’t think those had a high enough return on our investment of time to write them. In this case, the return was the feedback that would tell us we had problems.”
You could probably better express that as a cost/benefit ratio: the cost to write (and execute) the tests versus the benefit you get from the feedback. However, both cost-benefit and ROI are initially predicated upon monetary concerns. The article you reference even says it “attempts to summarize the overall value for money of a project or proposal.”
So I’m hoping people aren’t getting too hung up on the use of ROI, which is somewhat epiphenomenal, and rather focusing on what I believe is the problem: that many sources out there indicate that the “ROI” (or cost-benefit) goes down the further up the pyramid you go but, in fact, it’s often the opposite when you consider that the higher up the pyramid you go, the more closely you are dealing with the actual business value of your application; the more closely you are dealing with what your customers see and use; and thus what will be more likely to cause you reputation hits if you get it really wrong.
Very nice explanation on the subject matter, Jeff. Looking forward to reading your next post. All the best :-)