I’ve talked about “being lucid” and using description languages before. I have a whole category here devoted to TDL (Test Description Language) and I’ve worked to present examples that are not your standard “shopping cart”. In this post, I’ll cover an example of how I helped testers go from a “bad test” to a “good test.”
I originally introduced description languages to a test team that had little prior experience with such concepts. What I was really introducing, of course, was simply a structured way to describe tests but, more importantly, a structured way to think about tests. If you can’t do the latter, the former really doesn’t matter.
First Iteration
So let’s just jump right in with an example. The structuring element we chose was Gherkin, which only mandates the use of “Scenario:” for a title and the Given-When-Then set of keywords for providing context, action, and observable. I turned the testers loose on one particular area of functionality and here’s what they came back with:
Scenario: getUserCache
Given a created and uploaded UserCache input file
When getting the user from DCI
Then user which was created in input file has itemFound = "true"
And deepFetched = "true"
And magic = from uploaded input data
And regCompanyId = from uploaded input data
And regMethodId = from uploaded input data
Take a moment and, solely from your own perspective, decide whether you think this is a bad test or a good test. If you think it’s bad, what are some of the problems with it?
In the very way I worded the question to you, you might already say “Well, I can’t tell if it’s a good test or a bad test because even if it’s not how I would have worded it, for all I know it works. And if it works and gives an accurate result then it’s a good test.” On the other hand, you may say “It may work in terms of exercising the functionality, but it’s a bad test because ….”
If you fall into that last camp, what follows the “because ….” for you?
What concerned me when I read this test is that I didn’t consider it a test! I can’t tell what it’s doing, I’m not even sure why it’s doing whatever it is that it is doing, and it’s very close to implementation details. But rather than trying to nitpick certain details, just ask yourself this: what is the point of the test? Can you state the business rule?
Second Iteration
Working with the tester, we ended up coming up with this as an initial redefinition of the test:
Scenario: User in HBase, not in DB Cache
Given a user is uploaded to hbase with the following:
| magic | xyz |
| regCompanyId | 123 |
| regMethodId | 456 |
And that user is removed from the database cache
When attempting to get that user from database cache
Then the user output file will contain the following:
| itemFound | true |
| deepFetched | true |
| magic | xyz |
| regCompanyId | 123 |
| regMethodId | 456 |
Is that better?
Well, let’s first consider the scenario title. This new version certainly tells you a little more about what is being tested. A user is in HBase but not in the database cache. As it turns out, the tester working with the developer originally called the scenario “getUserCache” because that was the method that was called in the application when the test was run.
Notice the addition of the And step to the Given context. That includes a pretty crucial aspect that was missing from the original iteration. To the tester and developer, the “deepFetched = true” meant that the user was not in the cache. But nowhere was that made clear and you had to know that implementation detail to even understand how the test was running.
Notice the removal of “DCI” from the When step. This acronym stood for “Database Cache Interface”, which was one of the components being executed against. That said, its particular name wasn’t relevant. What did matter was that it was a database cache.
We’ve made the input condition (in the Given step) a little more clear and this more clearly relates to the output condition (in the Then step). Speaking of the input and output condition, we’re including a lot of specifics around the data. But why? Do the specific values like 123 and 456 matter?
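As an aside on how steps like these typically execute: a Then step with a data table binds to code that walks the table and compares each field against the output record. Here’s a minimal sketch in plain Python; the table parser and the `output_record` dict are invented for illustration, since a real framework like Cucumber or Behave would hand you the parsed table directly:

```python
# Minimal sketch of how a Then-step data table might be checked.
# The table format and the output_record dict are illustrative only.

def parse_table(table_text):
    """Parse a Gherkin-style data table into a dict of field -> expected value."""
    expected = {}
    for line in table_text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        field, value = cells[0], cells[1]
        expected[field] = value
    return expected

def check_output(output_record, expected):
    """Return the list of fields whose values do not match expectations."""
    return [f for f in expected if str(output_record.get(f)) != expected[f]]

table = """
| itemFound    | true |
| deepFetched  | true |
| magic        | xyz  |
| regCompanyId | 123  |
| regMethodId  | 456  |
"""

output_record = {"itemFound": "true", "deepFetched": "true",
                 "magic": "xyz", "regCompanyId": "123", "regMethodId": "456"}

failures = check_output(output_record, parse_table(table))
assert failures == []
```

The point of the sketch is that the automation is literal: every field and value in that table becomes a hard-coded comparison, which is exactly why the question of whether those specifics matter is worth asking.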
At this point I asked the testers to back up a bit and consider what’s actually being tested here. What was the point of this test? I asked the testers to frame what they thought the point was as a new scenario title. Here are two that started to emerge:
Scenario: Users removed from cache can be recovered
Scenario: Users removed from cache are recovered more slowly
Notice how we’re talking about two different things there. One is a simple statement of functionality, the other is a statement of that functionality in terms of performance. In further discussion, it was realized that the second scenario above led to this:
Scenario: Users are recovered more quickly when cached
It was time to stop again and ask: is this the right level of abstraction for those tests? Ultimately the execution speed is a result of various technical factors. And while those do matter to a user, the means by which the speed is achieved is irrelevant. Speed — which is one aspect of performance — is a quality that is best determined as an aspect of the executing test and not enshrined within it.
At this point I asked the testers to go back and look at what was really being tested here. Before reading on, stop and see what you think about that.
Third Iteration
It didn’t take too much discussion to realize that what’s really being tested is this:
Scenario: User information can be recovered from an output file
It’s the fact of being able to recover user information from an output file. Regardless of anything else, this was the key output condition. So let’s see what we ended up with when we reframed the spec in these terms:
Scenario: User information can be recovered from an output file
Given a complete user record has been uploaded
When the output record is fetched
Then the output record contains values for
| magic |
| regCompanyId |
| regMethodId |
Lots of changes here. First of all, notice all the specific data values are gone. Upon discussion it was decided that for this particular test, the data integrity was at the level of making sure the appropriate data fields were present rather than what values those fields had. If the specific values did matter then by all means they should be included. In that case they would not be incidental to the test but a key operating parameter of it. My point to the testers here was that you should always be questioning when you have specific data values rather than data conditions. It’s not to say you shouldn’t have values. It’s just to say that it should be a consideration as to why.
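The distinction between data conditions and data values can be made concrete in automation terms: a check against a data condition only asks whether the fields carry values, while a check against data values also pins down what those values are. A rough sketch in plain Python (the function names and the record are invented for illustration):

```python
# Two ways of checking the same output record: presence of fields
# (a data condition) versus exact values (specific data).

def missing_fields(record, required_fields):
    """Data condition: which required fields are absent or empty?"""
    return [f for f in required_fields if not record.get(f)]

def mismatched_values(record, expected_values):
    """Data values: which fields differ from their expected values?"""
    return [f for f, v in expected_values.items() if record.get(f) != v]

record = {"magic": "xyz", "regCompanyId": "123", "regMethodId": "456"}

# The reframed test only cares that the fields carry values...
assert missing_fields(record, ["regCompanyId", "regMethodId"]) == []

# ...whereas the earlier iteration pinned the values themselves.
assert mismatched_values(record, {"regCompanyId": "123", "regMethodId": "456"}) == []
```

The first check survives a change in how the record is populated; the second only survives if the exact values remain a business requirement.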
Speaking of data, why is that “magic” thing there? That’s been following us around since we started this. What’s the relevance? Well, it turns out that there was a flag behind the scenes, deep in the implementation, which — for some reason that no one could remember — was called “magic”. This flag essentially contained an id that other services could reference in order to determine whether the data returned from the user record matched other data from other sources that also had a “magic” id with the same value.
It was actually a correlation identifier to determine if actions on one device could be correlated with actions on another, thereby implying it was likely the same user. It was important but it wasn’t so much part of the output record as it was a part of the output process.
So we had two options: get rid of it since users never would be aware of this or indicate the condition in more business friendly terms, such as “and the correlation id is set”. However, in this case, the correlation id had little to do with the tests themselves. Correlation ids only mattered if you were testing if one action and another action were correlated to the same person. That is not the point of this test and thus “magic” (the correlation id) became an incidental. So: remove it! Just because it’s there doesn’t mean it’s relevant to every test.
That left us with:
Scenario: User information can be recovered from an output file
Given a complete user record has been uploaded
When the output record is fetched
Then the output record contains values for
| regCompanyId |
| regMethodId |
Also “regCompanyId” and “regMethodId” were, once again, internal names. For its part, “regMethodId” actually referred to access method id, as in the id of the device used to access content. So we could do this:
Scenario: User information can be recovered from an output file
Given a complete user record has been uploaded
When the output record is fetched
Then the output record contains values for
| company ID |
| access method ID |
Not too bad, right? But haven’t we lost something here? It seems like we ended up reframing this test such that we apparently lost something that was key to it in the beginning. Can you spot what that is?
Fourth Iteration
With the reframing of the test as it now stood, where do we cover the cache part? It seems we entirely lost that aspect of the test. Well, we did but, upon discussion, we realized that’s because there were two test conditions being looked at here: serving data from an output file and serving data from the cache. We were looking for the same things in each case but the cache seemed to be a key qualifier. So it sounds like we needed another scenario and the testers came up with this:
Scenario: User information can be recovered from cache
Given a complete user record has been uploaded and is cached
When the output record is fetched
Then the output record contains values for
| company ID |
| access method ID |
This looks almost entirely like a duplicate, doesn’t it? Well, a key qualifier has been added to the Given step: “… and is cached.” So what we ended up with here are two scenarios that fetch an output record for an uploaded user record. The output record will contain the same data points in both cases. We still have two scenarios that are almost identical, though.
Fifth Iteration
Since the near-identical wording of the scenarios was bothersome to the testers, I suggested we make an outline of the possibilities:
Scenario Outline: User information can be recovered
Given a complete user record has been uploaded
When the output record is fetched from <location>
Then the output record contains <values>
Examples:
| location    | values                       |
| output file | company ID, access method ID |
| cache       | company ID, access method ID |
What do you think?
This seems to be the right level of abstraction with the details that matter. This also seems to be a scenario that can evolve with changing implementation and business rules because we’ve removed as much underlying implementation as possible and focused on the business rule that mattered. Notice, however, that we reframed the test. Originally it seemed that caching was the focus. But it wasn’t. The focus was on the output record that contained certain data fields. Those data fields were required to be present regardless of whether the user was or was not cached.
Ah! It seems like even in explaining the test right there, a better scenario title came up:
Scenario Outline: Output records always contain standard data fields
Given a complete user record has been uploaded
When the output record is fetched from <location>
Then the output record contains <values>
Examples:
| location    | values                       |
| output file | company ID, access method ID |
| cache       | company ID, access method ID |
Notice the title change there? Is that good? What are some other ways this could be worded? One problem is that “standard data fields” doesn’t really match anything we say in the test itself. Someone could make the assumption that “company id” and “access method id” are the standard data fields but it is just an assumption.
As we tackle that issue, let’s back up a bit and look at that Examples table, specifically the ‘values’ column. The testers asked me: can’t that get a little hard to read if there are too many values? The answer: clearly, yes. So here’s where you might include some non-executable documentation along with a concise way to state a data condition. For example, let’s take a look at what a relatively full Gherkin file might look like with this added documentation:
Feature:
Standard values: company id, access method id
Advanced values: device id, referrer id, last access time
Scenario Outline: Output records always contain standard values
Given a complete user record has been uploaded
When the output record is fetched from <location>
Then the output record contains <values>
Examples:
| location    | values          |
| output file | standard values |
| cache       | standard values |
Notice here there is some documentation before the scenario outline. This could have also been a hyperlink to a specific reference, like a Wiki page, that contained the information. Also notice with this that the ‘values’ column of the Examples table now simply says “standard values”. This way if more or less are later deemed to be part of the “standard values”, you could just update the documentation portion of the test.
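One way to keep a named data condition like “standard values” honest in automation is to resolve the name to a concrete field list in one place and have the check report which fields it actually verified. A hedged sketch in plain Python; the mapping, function, and record here are all invented for illustration:

```python
# Resolve a named data condition to concrete fields, and report
# which fields were actually checked so the test run is auditable.

DATA_CONDITIONS = {
    "standard values": ["company id", "access method id"],
    "advanced values": ["device id", "referrer id", "last access time"],
}

def verify_condition(record, condition_name):
    """Check that every field implied by the condition carries a value.
    Returns (fields_checked, missing) so the run can log what it did."""
    fields = DATA_CONDITIONS[condition_name]
    missing = [f for f in fields if not record.get(f)]
    return fields, missing

record = {"company id": "123", "access method id": "456"}
checked, missing = verify_condition(record, "standard values")
print(f"standard values resolved to: {checked}; missing: {missing}")
```

Because the mapping lives in one place, updating what “standard values” means updates every scenario that uses the condition, and the returned `fields_checked` list is what lets the test output state exactly what was verified.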
This does introduce a risk, though: if you automate the above, you have no idea if the test is actually checking what’s been documented. Granted, you could argue the same about the previous example or, really, any example. But in the previous cases you could at least remove one of the values, or change its name, and see if the test failed. In this latter case, you aren’t stating data values but rather a data condition. Going this route means it is imperative that the tests can output what they checked. Here’s an example screenshot of the output I generated for this test:
You can see next to the examples there that the output indicated what the automation understood “standard values” to be in each case.
It’s More Than Just Good Test vs Bad Test
Yes, I named this post a certain way and, to be sure, what I hope you saw here is one example of how you can go from a bad test to a good test via a spec workshop. Yet in this post I only gave you a slight insight into the dialogue that took place because I wanted to focus on the test writing itself. I previously discussed a more thorough example of a spec workshop, with dialogue.
What I also hope you saw here was that this level of work, while it can seem like a lot for just one test, is actually additive and cumulative in nature. As you start to refine your understanding about a given feature, and frame your scenarios, you start to get better at it. You start to get faster at refining existing tests and expressing new ones. And the reason for this is because you start to learn how to talk about your business domain better.
Value Add for Testers
Finally, this approach really does come down to how people articulate tests — particularly test conditions and data conditions. I don’t know about you but most tests I’ve come across have been fairly lackluster in terms of how they were written. It’s not even so much a matter of bad wording as it is wording that is too specific, such that the test must change with the implementation even when the business rule hasn’t changed at all.
The emphasis on BDD in the industry has put a focus on tools and elements that those tools consume. I even used one such here: Gherkin. But it’s important to realize that using Gherkin — or any other structuring language — is really just a guide for placing certain elements and reminding you that you have a context, an action, and an observable. No structuring language is going to do the thinking for you, of course.
This test design and spec workshop approach is one of the crucial arenas in which people with a testing focused mindset — and who are able to articulate that mindset — can add value to a project. Much of that value comes in from being able to do all this as early as possible in a development cycle, ideally before too much (if any) coding has been done.
But Not Just Any Testers
It is most certainly not the case that “just anyone” can do this kind of thing, regardless of how it might appear.
There is a demonstrable skill set to doing test design of this sort, only some of which comes down to the eventual writing down of design decisions. This is a skill set that puts emphasis on listening to others articulate ideas, reframing those ideas into example-based scenarios, and doing so while being concise but not necessarily terse.
There is also a large component of being able to seek out test conditions and data conditions. A key skill is being able to spot ambiguities, inconsistencies, and contradictions. That may seem easy when you are looking at one scenario, as I did in this post. Now imagine doing that for dozens, hundreds, or even thousands of such scenarios.
Not only that but you must be someone who has the patience and temperament to go through this kind of exercise routinely with developers and business analysts, both of whom are often speaking a very different language. This is a particular skill that you will rarely, if ever, see articulated in an opportunity description. That fact alone is what convinces me that many places hiring testers still don’t get a large part of what testing is and how testing, as a disciplined activity, adds value.
I’m gearing up to start my attempts at changing that in the industry. What I need is to figure out the best platform for this message and how best to convey it.