Some testers — and most managers — like to talk about metrics. One thing that often doesn’t get discussed is what I call metric dissociation. Here’s my thought: metrics should be easily or directly interpretable in terms of product or process features. If they’re not, I think they are dissociated from the actual work that’s being done and thus there’s great danger of the metrics being misleading, at best, or outright inaccurate, at worst.
An Enjoyment Metric
Let’s consider one example that may seem to have little relevance. This one is from a guy named Mike Rozak, who wanted to come up with, in his words, “a fundamental ‘equation’ that can represent many (or most?) of the design problems found when creating a virtual world.” He was bringing this up in the context of creating games. The equation initially given is:
E(c) = f(various attributes)
This is basically saying that the player’s enjoyment of a given game choice, E(c), results directly from that choice, c, and the subsequent story/experience of the game based on that choice, i.e., the path through the game that is allowed by the choice being made. So what this metric says is that this part of the enjoyment is a function of many attributes, which I haven’t listed here since you can presumably imagine many. Rozak lists the author’s skill, the amount of “eye candy” (flash and dazzle, such as graphics) used for the game, and things along those lines.
It’s a truism that a game can’t handle every single action a player may want to try. Somewhere you’re going to run into the limits of the game in terms of what it allows the player to do. This is another factor in the metric being built up, according to Rozak, and it’s modeled as such:
P(y|d) = f(various attributes)
This metric is meant to indicate a probability. Here it’s the probability of the player encountering a game limitation, y, given a decision the player made in the game, d. The metric says that this is a function of how much overall game content there is and how that content is modeled. So, for example, if you’re playing an airplane simulation, this might refer to the amount of detail simulated, such as actual city and landscape details or the controls that you can manipulate as part of the plane. Or if this is a quest game, it might have something to do with the number of quests there are or the details regarding how the quests are played or the various options you have for solving quests or even disregarding them entirely.
Where the above probability becomes relevant is that when players do end up trying an action that the game doesn’t allow, the assumption is that the player gets annoyed. That annoyance factor translates into some loss of enjoyment. This putative “annoyance factor” produces another term for the growing metric:
A(c’) = f(various attributes)
In other words, the annoyance factor, A(), given a choice that, from the player’s point of view, didn’t work, c’, is a function of the player’s expectation that the game should have handled the choice. Another aspect of this might be that either the game didn’t make it clear ahead of time that the choice wouldn’t work or, alternatively, that the game didn’t really make it clear after the fact why the choice didn’t work.
What all this leads to is the following Enjoyment Metric which, pieced together from Rozak’s description, looks something like this in the same plain-text style as the terms above:

E = sum over all decisions d of P(d) × I(d) × [E(c) - P(y|d) × A(c’)]

So here we have the metric E, which is the player’s overall enjoyment of the game. It’s said to be the sum, over all decisions, d, that the player makes, of the enjoyment from the decision choice that’s finally accepted, E(c), minus the probability of the player’s choice being invalid for whatever reason, multiplied by the annoyance that results from that choice being invalid. There are two scaling factors in there: I(d) is said to be a function that indicates how important the player views a given decision they made, and P(d) is said to be the probability that the player will actually encounter the chance to make that kind of decision.
Wow, huh?
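Just to make the shape of that concrete, here’s a rough sketch in Python of how the pieces compose. To be clear, none of this code is from Rozak’s article, and every decision value in it is made up purely to show the arithmetic.

```python
# A minimal sketch of how the enjoyment metric composes; all values are hypothetical.
decisions = [
    # Each decision d carries: importance I(d), probability of encountering it P(d),
    # enjoyment of the accepted choice E(c), probability of hitting a game
    # limitation P(y|d), and the annoyance A(c') if the choice doesn't work.
    {"I": 0.9, "P": 0.8, "E_c": 7.0, "P_limit": 0.2, "A": 3.0},
    {"I": 0.4, "P": 0.5, "E_c": 5.0, "P_limit": 0.6, "A": 4.0},
    {"I": 0.7, "P": 0.3, "E_c": 8.0, "P_limit": 0.1, "A": 2.0},
]

def overall_enjoyment(decisions):
    """Sum over all decisions of P(d) * I(d) * (E(c) - P(y|d) * A(c'))."""
    return sum(
        d["P"] * d["I"] * (d["E_c"] - d["P_limit"] * d["A"])
        for d in decisions
    )

print(round(overall_enjoyment(decisions), 2))
```

Notice that every single input to that little function is itself an estimate of something you can’t directly observe, which is worth keeping in mind for what follows.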
Rozak ends up giving a slightly revamped equation when all is said and done but this will do for my purposes. Rozak’s article on this subject says that “the obvious use for the equation is to maximise E for all (paying) players, limited by the funds and technology that the [game] title has available.” Now, Rozak is doing this on his own so I’m not trying to downplay the work he did here nor am I making fun of it. However, you can almost imagine a big game company utilizing some metrics like these.
This is an example of a cautionary metric: one where you’re probably going to attempt to derive some actionable points based on the “enjoyment” that those actions will supposedly add to the game being developed. Incidentally, if you want to see the full article, check out Interactive Fiction Equation. You might also check out Rozak’s follow-up to this at Interactive Fiction Equation, Part 2. At the end of that latter article he says “It’s nice to see that the interactive-fiction equation agrees with conventional wisdom!” I think that’s a telling statement in that I’ve heard that same line used in terms of software metrics: i.e., that a given metric “agrees with” the so-called “conventional wisdom.”
A Complexity Metric
In the paper “Predicting Software Development Errors Using Complexity Metrics”, you’ll see something the authors call a “factor dimension metric,” which they give the name control. Sure sounds interesting, doesn’t it? This metric is calculated by a weighted sum that’s given by the authors as — brace yourself:
Control = a1×HNK + a2×PRC + a3×E + a4×VG + a5×MMC + a6×Error + a7×HNP + a8×LOC
Whew — that sure looks impressive! But what does it all mean? Well, let’s see here: the coefficients a1 through a8 are weights derived from an area of study known as factor analysis; the details of that really don’t matter for now. HNK is known as Henry and Kafura’s “information flow complexity metric,” PRC is a count of the number of procedures (read: functions, methods, whatever), E is shorthand for an “effort metric” proposed by a guy named Maurice Halstead, VG is shorthand for a “complexity metric” proposed by Thomas McCabe, MMC is shorthand for another “complexity metric” proposed by Warren Harrison, and LOC refers to Lines Of Code.
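To see what a weighted sum like that looks like mechanically, here’s a small Python sketch. The weights and the per-module values below are invented for illustration; in the paper, the coefficients come out of a factor analysis rather than out of thin air.

```python
# A sketch of a "control"-style weighted sum; all weights and values are hypothetical.
weights = {  # stand-ins for the a1..a8 coefficients
    "HNK": 0.12, "PRC": 0.08, "E": 0.25, "VG": 0.20,
    "MMC": 0.15, "Error": 0.05, "HNP": 0.05, "LOC": 0.10,
}

module_metrics = {  # made-up raw metric values for one module
    "HNK": 42.0, "PRC": 7.0, "E": 1800.0, "VG": 12.0,
    "MMC": 9.5, "Error": 3.0, "HNP": 250.0, "LOC": 310.0,
}

control = sum(weights[name] * module_metrics[name] for name in weights)
print(round(control, 2))
```

You get a single number out of that. The question is what anyone is supposed to do with it.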
Now, although combining metrics this way might help avoid leaning too heavily on individual metrics that are strongly correlated with one another (the “multicollinearity” problem), it’s very difficult for me to see how you might actually advise a designer on how to re-design an architecture to achieve a “better” control metric value. It’s also somewhat hard for me to see how you might advise a programmer on how to better code or design given modules within an application in order to achieve a “better” control metric value. While I’m at it, I’d say it’s downright impossible for me to see how you might advise a test engineer to design test strategies that take this control metric into account when determining what areas to concentrate on, what areas might best be suited for a minimal test suite, what areas offer the best hope of catching bugs, etc.
Bottom line: the effects of a change on the design, on a given module, and on the tests related to all of that are, to say the least, less than crystal clear.
There are a few take-aways here.
- Slavish devotion to metrics just because they seem well thought out is a bad way to go.
- Slavish devotion to metrics just because they include a lot of other metrics that are “well-known” (like the complexity metrics of McCabe and Harrison) is a bad way to go.
- Slavish devotion to metrics just because they purport to take into account size and complexity is a bad way to go.
But why do I say all that? Why do I say that these routes are a bad way to go? Because such slavish devotion does not take into account what I think is a crucial aspect: is the metric actually telling you anything? Does the metric provide a way for you to gauge some relevant aspect of your product or your process? Does the metric provide indicators that will help you know if you are on course in terms of your project cycle?
If a metric obfuscates the answers to questions like these or renders them largely unanswerable, then I would say you have metric dissociation.
An Optimal Size Metric
All of what I describe above can happen with metrics that rely on averaged data as well. Analysis of averages means you’re considering data that’s one step removed from the original data. Using averages reduces the amount of information available to test a given conjecture and any conclusions will be correspondingly weaker. Let’s consider this by way of a specific example.
That specific example will be J.E. Gaffney’s “Estimating the Number of Faults in Code”. Consider, as Gaffney did, the ratio of the number of defects (D) in a program to the lines of executable code (L) that make up that program. Gaffney claimed that the relationship between D and L is not programming-language dependent, and he came up with a “defect prediction” based on this relationship:
D = 4.2 + 0.0015 × L^(4/3)
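So, for example, a module with 1,000 lines of executable code would be predicted to contain about 4.2 + 0.0015 × 10,000, or roughly 19, defects.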
An interesting ramification of this was that it seemed to indicate there was an optimal size for individual modules with respect to what’s called “defect density.” For this particular equation, this optimum module size was given as 877 LOC, where again “LOC” refers to Lines of Code.
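If you’re curious where a number like 877 comes from, you can check it numerically. Here’s a quick sketch that simply minimizes the implied defect density, D/L, over a range of module sizes; the range is arbitrary and this is my own check, not anything from Gaffney’s paper.

```python
# Find where the defect density implied by D = 4.2 + 0.0015 * L^(4/3) bottoms out.
def defect_density(loc):
    return (4.2 + 0.0015 * loc ** (4 / 3)) / loc

best = min(range(100, 3000), key=defect_density)
print(best, round(defect_density(best), 5))  # the minimum lands right around 877 LOC
```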
What this actually shows us is a misuse of averages, because in Gaffney’s paper the rule for optimal module size was derived on the assumption that, to calculate the total number of defects in a system, we could use the same model as had been derived using defect counts for an individual module. The model derived at the module level is the equation shown above. Extending it to count the total defects in a system, DT, where the system consists of N modules and module i has Li lines of executable code, gives something like:

DT = sum over i = 1 to N of [4.2 + 0.0015 × Li^(4/3)]
Gaffney, however, assumed that the average module size could be used to calculate the total defect count, and also the optimum module size, for any system, using something like the following:

DT = N × [4.2 + 0.0015 × Lavg^(4/3)], where Lavg = (sum over all i of Li) / N
You can probably see that these two equations are not, in fact, equivalent. The use of the second equation mistakenly assumes that the power of a sum is equal to a sum of powers. What this means is that this metric is potentially misleading at best and unworkable at worst.
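If you want to convince yourself of that, a few made-up module sizes are enough. Here’s a small sketch comparing the per-module sum against the average-based shortcut.

```python
# Compare summing the model per module vs. applying it to the average module size.
module_sizes = [100, 250, 400, 800, 2000]  # hypothetical LOC per module
N = len(module_sizes)

per_module = sum(4.2 + 0.0015 * L ** (4 / 3) for L in module_sizes)
average_based = N * (4.2 + 0.0015 * (sum(module_sizes) / N) ** (4 / 3))

print(round(per_module, 1), round(average_based, 1))  # the two totals differ
```

The two totals come out noticeably different, and the gap only widens as the module sizes become more varied.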
Now, clearly, this was an example that was more involved in terms of a problem with the mathematics, but that’s part of the point: it shows that the underlying math you use to “count up” (derive) your metrics can be faulty and thus lead to problems at the higher level, where the metric is supposed to be telling you something about your project.
A Stable Metric
A “Quality Time” article from IEEE Software gave an interesting example of how some metrics can be seemingly persuasive and yet really say nothing at all. The idea of the article was to come up with a “stable metric” for rating buildings in terms of suitability for having an office location there. This could refer to even vague things like the prestige of the location or the quality of the landscaping.
So the idea was to list the desirable features of office buildings and then quantify each feature. For example, one idea is that a good building should help people work in teams, so you could focus on that characteristic. This means you can say that the effect of a building on the efficacy of cooperative work, E, should increase inversely with the mean distance, D, between the offices of members of a common project. The equation given was, in essence, the average door-to-door distance taken over all pairs of employees:

D = [sum over all i, j in G of δ(i, j)] / |G|^2
Here G is the set of all employees, |G| is the size of that set, and δ(i,j) is the distance in feet from the door of employee i to that of employee j. But D is not the only factor on which E depends. Office density, d, is another important inverse term. This can be calculated as such:
d = |G| / v
Here v is the total number of offices.
The value of E must also be directly proportional to the “aesthetics” of the building. Beauty would seem to be hard to quantify. However, it was said that you can reason as follows: the product of ceiling height, H, and average office square footage, F, seems to correlate well with the “attractiveness” of a building. The problem is that E will grow too rapidly if it depends directly on A = HF. Thus it’s possible to define E as such:
E = log(A) / (d × D)
This is referred to in the article as an “efficacy metric.” The article goes on to say that A is closely related to c, the cost per square foot. So you can write this:
E’ = log(c) / (d × D)
That gives a new metric, E’, which is approximately equal to E but requires many fewer measurements. But does this give us a quantifiable and scientific basis for choosing a building?
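Just to show how official this can feel, here’s a toy version in Python. The two “buildings” and every number attached to them are pure invention.

```python
# A toy comparison of two made-up buildings using the fictional efficacy metric
# E = log(A) / (d * D); every input value here is invented.
import math

def efficacy(ceiling_height_ft, avg_office_sqft, employees, offices, mean_distance_ft):
    A = ceiling_height_ft * avg_office_sqft  # the "aesthetics" stand-in, A = HF
    d = employees / offices                  # office density
    return math.log(A) / (d * mean_distance_ft)

print(round(efficacy(9, 120, 80, 60, 150), 4))    # building one
print(round(efficacy(12, 200, 80, 100, 300), 4))  # building two
```

You get two tidy numbers out of that, and nothing about whether either building is actually a decent place to work.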
Well, you’ll probably note that one problem with E is the vague assumption that the few properties it measures somehow represent all the other relevant properties. This metric, so far as I can see, would give you little way to effectively look for independent characteristics. For example, air conditioning would seem to be important in a building. However, E might just assume that any building that scores high in terms of the metric will have air conditioning. (After all, why wouldn’t it, right?)
Lest that seem a little disconnected from anything real, think of some of the complexity metrics out there that are said to speak to the overall comprehensibility of a program. These measures of complexity generally don’t take into account the talent, education, or experience of the person who is actually trying to comprehend the program. For example, some metrics state the complexity of a program as a linear function of the number of “basic paths” through the program flow. This does not, however, account for symmetries that some programmers will recognize and perhaps exploit, either for further coding or for testing. The stable metric here for buildings mirrors that exact same kind of metric dissociation.
Now, I should note that this metric from the Quality Time article was purely fictional. Yet it almost seems like it could be real and that is the danger with metrics that are meant to be real but are, in fact, just as meaningless as this building metric.
The point of all of my examples above is that a “metric” can be irrelevant because the measures behind it are weak. If a measure does not model reality usefully, it’s weak. Some people in the physical or theoretical sciences explicitly promote their work as having “a mathematical basis,” and that basis is supposed to give us confidence. My reaction to that has always been that I would minimally expect any scientific claim to use correct mathematics. But another minimum standard should be that the mathematics is useful. The same applies to metrics. The correct derivation of a weak and irrelevant measure is hardly more interesting than a non-mathematical presentation of that same measure.
So you might want to consider metrics you’ve seen used or that you use yourself and think about questions that should give you pause.
- Bug counts – Does this tell you anything about the tester or the testing being done? Does it tell us anything about the developer?
- Test case counts – Does this tell you anything about the quality of the tests? Are the tests effective and efficient at what they do?
- Traceability – Does this tell you that requirements are actually being tested? Or just that they’ve been “linked” to a test?
- “On-time” release metrics – But did we kill ourselves getting there? Are we always forced to be at 110% of capacity to be on time?
I encourage everyone to think about the metrics you see and possibly use. Yes, I used some very specific — and some might even say overly complicated — examples here to make my point. Yet some of those metrics I presented in the list above, which are not as “complicated”, are often used to support the same kind of potentially fallacious reasoning. Just be cautious of metrics. Make them work for you by making them useful and relevant for your needs.
Nice article, Jeff!
I’ve always felt that metrics in general are abstractions that are only occasionally useful, but far too often abused. Most metrics boil down complex concepts to simple numbers, unfortunately losing most of the flavor in the process.
Even a simple bug count is far too often abused.
I was once asked to post the day’s open bug count, charted over the past two weeks, so that everyone could see the progress we were making. I expressed my concern, but it was the CEO who asked for the chart (he liked charts).
The 3rd week I posted the chart, he came over with a red pencil, extended the line until it reached the zero-count and wrote “and this is when we ship”. Ugh!
I took the chart down and never posted another. I wasn’t fired. And of course we didn’t ship on that date.
Now when I am asked for a bug count or other metric, I try very hard to understand what is really being asked for, and give my best answer. It usually involves words, and few numbers.
Here are some other thoughts on how one particular metric might be abused: http://www.allthingsquality.com/2010/04/misuse-and-abuse-of-bug-counts.html