Testers and Data Science, Part 4

This post is the last in the data science series (see 1, 2, and 3). Here we’ll wrap up the entire discussion, including whether we actually discovered anything about our initial question regarding “Which is the best Pokémon character?”

I ended the previous post with this:

“What would you be telling the business team about their quest for finding the ‘best’ (read: most quality) character?”

I know what I would be telling the business. But how about you? What are your thoughts?

If I were presenting this in one of my classes, I would be saying: “We have a whole lot of actionable information at this point. We haven’t really tested out anything regarding a ‘best’ character. And, in fact, we’re pretty certain that the notion of a ‘best’ character is not testable. At least not without a great deal more insight into what ‘best’ means.”

That might appear terribly anti-climactic to you and you might have even seen that coming. Yet … a key part of what I said above was the “actionable information.” Let’s take a look at what we do have and how we could use that as the basis for at least doing some comparisons of Pokémon characters. After that we’ll bring this around to the wider question of what all of this means for testing.

Considering Our Data

From the second post, I can reason that I want to consider special attack and special defense. Why? Because those correlated well. In fact, so did attack and defense. And I suppose considering the “best” character might have something to do with how well they can win fights and thus these seem to be good attributes to consider.

From the third post, I can reason that Generation 1 was good for Attack / Defense while Generation 4 was good for Special Attack / Special Defense.

From the second and third posts, I can reason that Rock, Water, and Ground primary types are good at Defense. I can also reason that Bug, Dragon, Fighting, and Poison primary types are good at Attack. I can reason that Ice primary types are good at Special Defense while Fire, Ghost, Grass, and Psychic types are good at Special Attack.

You might notice some odd things here. If you look at the second post, you’ll note that Steel was rated very high for Defense. But that doesn’t seem to show up at all in the third post’s final gathering of data. That comes down to how many data points were found. In one post, given how we were exploring the data, enough data points were found for Steel to register; in the other post, there were not enough. And, of course, we were using a different data set. Although that data was “the same” in all the relevant particulars, right? Well, your answer to that probably depends on how much you explored it as a tester.

Obviously I didn’t want these posts to become massive homework where you were spending more time on them than on your actual life or work. So here I’ll just say that the data between the two csv files is comparable. The differences have to do with how we were looking at and visualizing the data.

In these posts we didn’t consider Type 2 (the secondary type) so, at least initially, we should only pick those characters that have just a Type 1 and no Type 2.
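Restricting the data to single-type characters could be sketched roughly like this. This is plain Python rather than whatever tooling you used in the earlier posts, the column names "Type 1" and "Type 2" are my assumption about how the CSV is labeled, and the rows are invented placeholders rather than real characters.

```python
# Placeholder rows, NOT real Pokémon. Column names are assumed.
rows = [
    {"Name": "Alpha", "Type 1": "Water", "Type 2": ""},
    {"Name": "Beta", "Type 1": "Rock", "Type 2": "Ground"},
    {"Name": "Gamma", "Type 1": "Ice", "Type 2": None},
]


def has_only_primary_type(row):
    """True when the secondary type is missing or blank."""
    secondary = row.get("Type 2")
    return secondary is None or str(secondary).strip() == ""


# Keep only the characters that have no secondary type.
single_type = [row for row in rows if has_only_primary_type(row)]
print([row["Name"] for row in single_type])
```

Treating both a blank string and a missing value as "no secondary type" is a judgment call; how your own copy of the data encodes an absent Type 2 is exactly the kind of thing a tester would want to check first.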

Comparing Pokémon: Finding the Best

You would have to do your own analysis of the data to decide what you want to compare based on the above parameters. I’ll just take two examples here:

  • Generation 1: Dratini (Dragon Type) [Attack] vs Wartortle (Water Type) [Defense]
  • Generation 4: Glaceon (Ice Type) [Special Defense] vs Magmortar (Fire Type) [Special Attack]

Rather than take you through another long example, I’ll just let you check out my comparison notebook for the above four characters.

There’s clearly something odd about the data there given the above reasoning I just described from the data we gathered in these posts. See if you can find it. If you do find it, consider if it’s actually truly “something odd” or just an artifact of trying to reach an impractical goal with an impossible method.

Testers! That last statement is important. We who work in quality know that it is very easy to strive for impractical goals, a problem exacerbated by the methods we use to help enable decisions. So I think it would be a really good exercise for testers reading this to think about what it means to find the “quality” being sought here: specifically, what does it mean to find the “best Pokémon character”?

Dealing with the Bits

While I never worded things in quite this way, throughout these posts you were dealing with objects, measures, groupings, and actions. This is a fairly common breakdown in the data science world.

In this context, objects are things or events that exist in the world. In our example, a Pokémon character is an object. When whatever task we’re working on is specific enough, each object will be something that can be represented in or computed from the data. When the task is as specific as it can be, an object will correspond to a single row in a database or a spreadsheet.

The idea of measures refers to the outcome variables that will be gathered for the objects. Attack of a Pokémon, Speed of a Pokémon, Generation of a Pokémon, the Total of all stats for a Pokémon are all measures. When our specific task is sufficiently detailed, the measure is either an existing attribute in the dataset or one that can be directly computed from the data.

The concept of groupings — also called partitions — focuses on attributes or characteristics of the data that separate the data elements into groups. The “combat statistics” of a Pokémon fall into this: Attack, Special Attack, Defense, Special Defense, Speed, and HP. The “natures” of a Pokémon also fall into this: Type 1 (primary) and Type 2 (secondary). In a specific task, groupings or partitions are attributes of the objects or can be calculated directly from those attributes.

Finally, the idea of actions focuses on the words we use that articulate the specific thing being done with the data. In our posts here we started with an action of “identify” (first post), moved on to “characterize” (second and third posts) and moved on to “compare” (this post).

So take this specific task:

  • Compare the combat statistics of Generation 1 Water-type Pokémon with those from the other Generations.

The action here is compare. What is compared? Water-type Pokémon (the object). What do we want to compare? Combat statistics (the measure). Finally, there is a specific partition on the objects. They will be broken into two groups: Generation 1 and all the others.
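To make that decomposition concrete, here is a minimal sketch in plain Python. Everything specific in it is an assumption on my part: the column names, the function name compare_generation_one, and the rows themselves, which are invented placeholder numbers and not real Pokémon stats. Using the mean as the way to "compare" is also just one choice among several.

```python
from statistics import mean

# Assumed column names for the six combat statistics.
COMBAT_STATS = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]
FIELDS = ["Name", "Type 1", "Generation"] + COMBAT_STATS

# Placeholder rows with invented numbers, NOT real Pokémon stats.
raw = [
    ("A", "Water", 1, 40, 50, 60, 50, 60, 40),
    ("B", "Water", 1, 60, 70, 80, 70, 80, 60),
    ("C", "Water", 3, 50, 60, 50, 60, 50, 70),
    ("D", "Fire", 1, 90, 90, 90, 90, 90, 90),
]
rows = [dict(zip(FIELDS, values)) for values in raw]


def compare_generation_one(rows, primary_type="Water"):
    """Mean combat stats for Generation 1 versus all other generations."""
    # Object: every Pokémon of the requested primary type.
    objects = [row for row in rows if row["Type 1"] == primary_type]

    # Grouping: two partitions, Generation 1 and everything else.
    groups = {"Generation 1": [], "Other generations": []}
    for row in objects:
        label = "Generation 1" if row["Generation"] == 1 else "Other generations"
        groups[label].append(row)

    # Measure: the mean of each combat statistic inside each partition.
    return {
        label: {stat: mean(m[stat] for m in members) for stat in COMBAT_STATS}
        for label, members in groups.items()
        if members  # skip an empty partition rather than fail on mean([])
    }


comparison = compare_generation_one(rows)
for label, stats in comparison.items():
    print(label, stats)
```

Notice that each comment in the function lines up with one piece of the task statement. If you can't write one of those comments, the task probably isn't specific enough yet.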

You might remember that in some posts we checked out the “dimensions” of our data. In general, a dimension is an attribute that groups, separates, or filters bits of our data. As stated above, a measure is an attribute that addresses a question of interest. Such a question of interest tends to be focused around insights or customer value. Something to understand is that the measure is likely to vary across the dimensions. And, as with much of what I talked about, the measures and the dimensions can be explicit (in the data) or implicit (derived from calculations on the data).

So, for example, I might look at Total combat statistics over Generation. Here the Generation is the dimension while the Total combat statistic is a measure.
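Looking at a measure over a dimension can be sketched as a simple group-and-average. Again this is plain Python with assumed column names ("Generation", "Total"), a hypothetical helper called measure_by_dimension, and made-up placeholder totals; the pandas equivalent would be a groupby on the dimension column.

```python
from collections import defaultdict
from statistics import mean


def measure_by_dimension(rows, dimension, measure):
    """Average the measure column within each value of the dimension column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[dimension]].append(row[measure])
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Placeholder rows with invented totals, NOT real Pokémon stats.
rows = [
    {"Generation": 1, "Total": 300},
    {"Generation": 1, "Total": 500},
    {"Generation": 4, "Total": 450},
]

total_by_generation = measure_by_dimension(rows, "Generation", "Total")
print(total_by_generation)
```

Swapping "Generation" for "Type 1" gives a different dimension over the same measure, which is essentially what the second and third posts were doing with their charts.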

As you can probably see, breaking down a task into components helps in guiding refinement of a task into one that can be addressed with the data. The most direct way to do so is to consider the question “Are the object, measure, and grouping each directly described in the data?” For each of these three components, is it clear which aspects of the data are important or how to derive what we need from the data?
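One rough way to ask that question of a data set is to list, for each component, which of the columns it needs are not present. The sketch below assumes a hypothetical task description expressed as column names and a hypothetical set of CSV headers; anything reported as missing is something you would have to derive or go find.

```python
def missing_columns(columns, task):
    """For each task component, list the needed columns the data lacks."""
    available = set(columns)
    return {
        part: [name for name in needed if name not in available]
        for part, needed in task.items()
    }


# Hypothetical headers for a data set that has no precomputed Total.
columns = ["Name", "Type 1", "HP", "Attack", "Defense", "Generation"]

# Hypothetical task: compare Total (measure) of characters (object)
# across Generation (grouping).
task = {
    "object": ["Name"],
    "measure": ["Total"],
    "grouping": ["Generation"],
}

gaps = missing_columns(columns, task)
print(gaps)
```

Here the measure is implicit rather than explicit: Total isn't a column, so it would need to be calculated from the individual stats before the task can be carried out.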

These are the kinds of problems you wrestle with as a tester working with data scientists.

Yet This Is What Testers Already Do!

What you should get from the above is that in these contexts, testers are working with the team to identify specific tasks that address a broad question. Then, also working with the team, testers are decomposing each task into specific objects, measures, and groupings. Finally the testers may be building visualizations that validate and support these tasks and provide results.

This shouldn’t sound terribly different from what you do in your non-data science projects.

What we do is refine a question (like “do we have enough quality?”) into tasks (like “this is the kind of testing we have to do to provide some measure of that quality”). To do this we solicit the scenarios, use cases, stories — whatever we call them — that illustrate the behavior and that motivate our decisions about what is and what is not quality.

Refining questions into tasks is called operationalization, a concept I talked about in these posts. This approach, however, is used in just about every discipline that has science at its basis. You reduce a complex set of factors into as small a set of metrics as you can. Ideally you would have just one metric. But in the end what you want is to enable decisions. You want insights that provide the path to an actionable answer. “Actionable” means that it is possible to act on the result. We can consider an answer (result) to be actionable when it no longer needs further work to make sense of it.

It All Comes Down to Quality

What we are really asking is: “Do we have quality?” Operationalizing that question means figuring out tasks to be performed over whatever we are dealing with such that those tasks are a reasonable approximation of the high-level question.

And this operationalization process is an iterative one and the end point is not always precisely defined. After all, consider these posts and the question we started with. We never really did get to the “best Pokémon” and, in fact, as this post hopefully drilled in, the notion of doing so may be flawed at its heart.

Just as in the testing we do in non-data science contexts, the answer to the question of how far to go (“how much do we test”) is often, simply, far enough. The process is done when the task is directly actionable, using the data at hand, in such a way that we have clarified understanding and gained insights.

Data scientists work to clarify understanding and gain insight. That is exactly the same function that testers provide as well.

We work to gather information, reason about it, and present it in such a way that our work guides investigation and experimentation, elicits human reasoning, makes room for individual interpretations of what “good” looks like, and supports further exploration.

And that seems like a good message to conclude this series of posts on! I hope you enjoyed the journey, learning a bit about data science, figuring out the tester role in that context, and — perhaps most importantly — understanding that what data scientists ultimately do is aligned very closely with the practices of specialist testers.


This article was written by Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.
