Testers and Data Science, Part 3

This post follows on from the first and second posts. Here we’ll continue the theme of our Pokémon data, this time working toward some more specific answers to our running question in this series: “Which is the best Pokémon character?”

For this post, we’ll be using data similar to that of the last post, as well as the same technology base. Did you notice I said “similar data” and not the “same data”? Yeah, that happens sometimes. You spend all this time working on one data set, getting a bit used to it. And then, all of a sudden, you are given a new data set. It’s related, to be sure, but it’s also a little different.

Download the pokemon_reduced.csv file. Put this in your project directory. Here I’m assuming you went through the first and second posts, set up a project directory and are comfortable running jupyter notebook in that directory.

Getting Started

Let’s get started, similar to how we did in the first post. Create a pokemon-003 Jupyter notebook. As a side note, you can see my version of the notebook we’ll build up in this post. Feel free to reference that to see if you’re getting the same things I am.

Let’s start off our first code cell much like we did in the first and second posts, just changing the name of our data set file:

Let’s get the first couple of rows just to get a feel for what’s in this data set:
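Since the notebook cells aren’t reproduced in this post, here’s a minimal sketch of what that first cell and the row peek might look like. The file isn’t inlined here either, so a few stand-in rows are embedded via `StringIO`; the column names are assumptions based on the descriptions in this series:

```python
import io
import pandas as pd

# In the notebook this would simply be:
#   df = pd.read_csv('pokemon_reduced.csv')
# A few stand-in rows are inlined here; column names are assumed.
csv_data = io.StringIO("""Number,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1
7,Squirtle,Water,,314,44,48,65,50,64,43,1
130,Gyarados,Water,Flying,540,95,125,79,60,100,81,1
152,Chikorita,Grass,,318,45,49,65,49,65,45,2""")
df = pd.read_csv(csv_data)

# Peek at the first couple of rows to get a feel for the data.
print(df.head(2))
```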

As a tester, what do you notice? Think back to the previous two posts or look back at what you did in the first two notebooks.

Certainly this new data set is much more concise, from a breadth perspective, than our original data set. Is that a good thing? Well, hopefully, right? Apparently someone, and I’m guessing his first name starts with J, has decided that we needed — or could get away with — a simpler data set. For now, we’ll assume that’s accurate since I promise I’m not trying to trip you up. But it is really important to note these kinds of differences.

Thinking About Our Data

So one thing we learned about Pokémon characters from the previous posts (and data set) is that they have generations; currently seven, although our data only accounts for six.

We also learned that these little beasts have a “nature”, which is basically wrapped up in the idea of their “type.” And we know they can be of a single type or a dual type. In our data, a “single type Pokémon” means they have a value for the Type 1 field and no value for the Type 2 field. A “dual type Pokémon” means they have values for both Type 1 and Type 2.

That specific way of wording things is not what I said in the previous posts. There I talked about “primary types” and “secondary types.” Which is equally accurate, I suppose. But do note that it’s really easy to frame data science problems more specifically when something is a focus and less specifically when it’s not. This happens a lot and it has to do with which features of the data are considered relevant. It’s important to be aware of this and become adept at traversing these semantic boundaries, but also, as a tester, to work with the team to come up with consistent nomenclature.

Thinking About Our Insights

So what do we want to do here? We want insight into … well, something. Presumably that’s why we’re torturing ourselves by overloading on Pokémon characters. In the previous post we talked about the operationalization of questions.

Our collaboration with stakeholders will inform our process of operationalizing questions. And that’s no different than how this works in any other situation where you are dealing with the business or customer surrogates (or even customers themselves) when trying to determine what “quality” looks like. In the context of a tester working within data science, it helps to learn what data is available and how the results will be used. As testers we need to help to identify the questions and goals of the stakeholders with respect to the data and to further understand what data is available or can be made available.

We’ve also found that visualizations are important. They are one of the key ways that we testers support sense-making. The operationalization we perform frames those visualizations, helping guide decision-making about which ones to create.

We took one approach to this in the second post. For this post, let’s say that we’re going to learn what we can about these characters by focusing on one generation and looking at the nature of the characters within that generation. This actually could be a follow-on from what we did in the previous post.

These Pokémon characters have a series of stats. They have defensive stats (two of them), attacking stats (two of those as well), speed stats (one), and a health points stat (one). We were asking in the previous post what the “best Pokémon” was but we really didn’t figure that out to any extent. We simply looked at some of those stats relative to each other. And that’s good as it helps us further operationalize. In this post we’ll look at those stats relative to those types but we’ll also factor in the generation.

Restructure Our Data (Just a Little)

Let’s change our index to the “Name” column, rather than have it be 0, 1, 2, 3 and so on.
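A minimal sketch of that re-indexing step, using a small stand-in frame since the real file isn’t reproduced here:

```python
import pandas as pd

# Stand-in rows; in the notebook, df comes from pokemon_reduced.csv.
df = pd.DataFrame({
    'Number': [1, 3, 7],
    'Name': ['Bulbasaur', 'Venusaur', 'Squirtle'],
    'Type 1': ['Grass', 'Grass', 'Water'],
})

# Replace the default 0, 1, 2 index with the Name column.
df = df.set_index('Name')
print(df.index.tolist())
```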

Was this necessary? Not really. But sometimes you’ll find people do this to make it a bit easier to focus on the data. Here we’re making it slightly more clear that each row is one character.

And, actually, doing this calls something to our attention. In our first data set, I had talked about the notion of a “Mega” character but also indicated that the most common was chosen, which made the id (the Number column) unique. It looks like in this data set, that’s not the case. Notice we have:

Venusaur
VenusaurMega Venusaur

Both have a Number value of 3. That’s really good to know. Also, as it turns out and also as I mentioned in the first post, characters with a “Mega” evolution applied to them basically have their name plus “Mega” attached. But that reads a little odd here. So let’s actually do one more bit of data manipulation to remove all the text before “Mega”:
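One way to do that text manipulation is a regular expression over the Name index; this is a sketch on stand-in data, not necessarily the exact code from the notebook:

```python
import pandas as pd

# Stand-in frame indexed by Name, as set up earlier in the post.
df = pd.DataFrame(
    {'Number': [3, 3, 6]},
    index=['Venusaur', 'VenusaurMega Venusaur', 'CharizardMega Charizard X'],
)
df.index.name = 'Name'

# Strip everything before "Mega" so 'VenusaurMega Venusaur' becomes
# 'Mega Venusaur'; names without "Mega" are left untouched.
df.index = df.index.str.replace(r'.*(?=Mega)', '', regex=True)
print(df.index.tolist())
```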

It’s a simple bit of text manipulation but I did want to show the kinds of things that data scientists will often do with data to pretty it up a bit or make it easier to reference.

Working with Data

Since we have a new set of data here, let’s first get the dimensionality of our data:
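Getting the dimensionality is a one-liner; sketched here on a stand-in frame (the real data reports (800, 12)):

```python
import pandas as pd

# Stand-in frame; the real data set reports (800, 12) here.
df = pd.DataFrame({'Number': [1, 3], 'Name': ['Bulbasaur', 'Venusaur']})

# .shape gives (rows, columns) for the DataFrame.
print(df.shape)
```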

You should get (800, 12) which is different than the data set we used in the first two posts, which had a dimensionality of (721, 23). Based on what I’ve said so far in this post, and putting on your observant tester hat, what is the likely discrepancy there? Why are there more characters in this data set? Hint: the answer was already given, albeit in an implied way.

Since we know we want to focus on the natures of the characters (their types) within generations, let’s start looking at the generations. Let’s get a count first:

Okay, that’s good.

Wait! Do you see why that’s good?

Remember in the first post I said that sometimes you have to deal with missing or incomplete data? Well, if we had gotten any number other than 800 — what would that have told us? Or, put another way, what is the output above telling us?

Answer: it tells us that each of the characters does fall into a generation because the generation count is 800 and that matches the shape of our data. If there were any null values in the generations, we would have gotten less than 800 here.

We can get the same information but in descending order of the generations:
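The count and the per-generation breakdown above can be sketched like this, on stand-in data:

```python
import pandas as pd

# Stand-in frame; the real data has 800 rows across generations 1-6.
df = pd.DataFrame({'Generation': [1, 1, 1, 2, 3, 3]})

# Count of non-null Generation values; if any character lacked a
# generation, this would come back smaller than the row count.
print(df['Generation'].count())

# The same information broken down per generation, most common first.
print(df['Generation'].value_counts())
```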

So we have a good breakdown of the generations here, incidentally also noting that this count adds up to 800. So the good news is that the laws of basic mathematics are working for us! (Whooo hooo!) More importantly, however, we see that the first generation has the most characters in it. For our data project, we probably want to grab the data grouping that has the most samples in it. That said, generation five or even generation three would work nearly as well, since their counts are close behind.

Now let’s get that numerical description of our data that we also did in the first post:
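That numerical description comes from `describe()`, sketched here with a couple of the numeric columns on stand-in data:

```python
import pandas as pd

# Stand-in frame with a couple of the numeric columns.
df = pd.DataFrame({
    'Total': [318, 525, 314],
    'HP': [45, 80, 44],
    'Speed': [45, 80, 43],
})

# Summary statistics (count, mean, std, min, quartiles, max)
# for each numeric column.
print(df.describe())
```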

That’s getting us some insight into our purely numerical data. And, as we see, the Number and the Generation are actually not terribly relevant in the same way that the other values are. It’s those values, from Total to Speed, that we’ll likely want to consider. You might also notice that lovely 800 for the count for each of the data features. That’s, once again, a good thing. It means we have relevant data for any grouping that we choose.

Regroup and Think

Let’s consider what we’ve learned here, even if it’s just restating what we already knew. From just a look at the data frames generated in our notebook, it’s clear that Pokémon characters can have a dual or single nature. All the characters belong to generations ranging from 1 to 6. There are no null values inside the data frame. All the characters have six attributes called “combat statistics” — namely HP, Attack, Defense, Special Attack (Sp. Atk), Special Defense (Sp. Def), and Speed. Total is the sum of those six attributes.

Compartmentalize the Data

If you remember in the second post, I indicated that as part of the ability to look at the data we can add Python code to the notebook. This could be code we write or that is provided for us by developers. So let’s add this into a code cell:

Now do this:

Here I’m calling the function we just created. But what am I actually doing here? Well, take a minute and see if you can reason out what this might be doing. If it helps, try this:
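A sketch of the define-then-apply steps above: the function name `get_number_of_types` comes from the post, while the column names and the exact body are assumptions, shown on stand-in data:

```python
import pandas as pd

# Stand-in frame; Type 2 is missing (NaN) for single-type characters.
df = pd.DataFrame({
    'Type 1': ['Grass', 'Water', 'Water'],
    'Type 2': ['Poison', None, 'Flying'],
})

def get_number_of_types(row):
    # Dual if both type fields hold a value, otherwise Single.
    if pd.notnull(row['Type 2']):
        return 'Dual'
    return 'Single'

# axis=1 hands the function one row at a time.
df['Types'] = df.apply(get_number_of_types, axis=1)
print(df.head())
```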

Tester observation time! What do you notice that’s new and different?

What I did here is add a column to the data set, called Types, and the value it has for each row is either “Single” (if Type 1 has a value and Type 2 has no value) or “Dual” (if both Type 1 and Type 2 have values). Here I used the pandas apply function and what that does is apply some function — in our case, our get_number_of_types — across an axis of the DataFrame. In this case, I specified “axis=1”, which means the function is applied to each row, receiving that row’s values across the columns.

Now maybe you, as a tester, would do this to make it easier for you to explore the data set a bit. Or maybe a developer would have provided this column as part of what they believed made it easier to explore the data set. Or perhaps business users or domain experts said that this would be a helpful thing to have in place. Either way, being aware that this is possible is useful. Being able to do this on your own is extremely useful.

Focus on the Generations

Okay, so you know the basic idea we’re going for here regarding exploring the generations and then the nature of the characters within a generation. This is basically your test data.

So, as a tester, what do you want to do? Think about this before moving on.

Well, one thing we can do is we can create some handy subsets for ourselves. This is no different than how we would partition data in any other context. Here’s what you can do:

We want to focus on generation one for now, as discussed before, so let’s further subdivide that:

Notice here how I’m subdividing based on the new Types column that we just added. If you want to see the individual data sets returned, you could do code cells like these:
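The partitioning steps above can be sketched as boolean-filtered subsets; variable names here are plausible guesses, shown on stand-in data:

```python
import pandas as pd

# Stand-in frame with a Types column like the one added earlier.
df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Squirtle', 'Gyarados', 'Chikorita'],
    'Generation': [1, 1, 1, 2],
    'Types': ['Dual', 'Single', 'Dual', 'Single'],
})

# Partition by generation, much as we'd partition any test data.
gen_1 = df[df['Generation'] == 1]

# Subdivide generation one using the new Types column.
gen_1_single = gen_1[gen_1['Types'] == 'Single']
gen_1_dual = gen_1[gen_1['Types'] == 'Dual']

print(gen_1_single)
print(gen_1_dual)
```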

Okay, so what next? Well, it would probably help to have some idea of the overall percentages for the characters in each of the Type designations we just created. You can do that like this:
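One way to compute those percentages, using the variable names from the post on stand-in data (the exact calculation in the notebook may differ):

```python
import pandas as pd

# Stand-in generation-one frame split by the Types column.
gen_1 = pd.DataFrame({'Types': ['Single'] * 47 + ['Dual'] * 53})

# Share of single- and dual-type characters in generation one;
# the real data gives roughly 46.9% and 53.0%.
gen_1_single_percentage = (gen_1['Types'] == 'Single').mean() * 100
gen_1_dual_percentage = (gen_1['Types'] == 'Dual').mean() * 100
print(gen_1_single_percentage, gen_1_dual_percentage)
```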

Your gen_1_single_percentage should be 46.9% and your gen_1_dual_percentage should be 53.0%. Let’s do our first visualization in this post:
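A sketch of that pie chart, including the “explode” bit discussed below; the styling details are assumptions:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

# Percentages computed above (values from the post).
labels = ['Single', 'Dual']
sizes = [46.9, 53.0]

# "Explode" the Dual wedge so it stands out from the rest of the pie.
explode = (0, 0.1)
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, explode=explode, autopct='%1.1f%%')
ax.axis('equal')  # keep the pie circular
```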

That visual certainly matches what we calculated. Data consistency for the win! Don’t worry about all the details of that code. The one thing I might call out is the “explode” bit. Each part of the pie chart is called a wedge and a wedge of a pie chart can be made to “explode” from the rest of the wedges of the pie chart to make it stand out a bit. Which can be exciting if you like to make things explode.

But with all this exploding going on, what have we actually learned here?

Our percentages tell us that, within generation 1 at least, there is little difference between the number of single-natured (Type_1) and dual-natured (Type_1 and Type_2) characters.

Classify the Data

We’ve seen the composition on the basis of Single/Dual. Let’s see the composition on the basis of type itself.
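The per-type counts and percentages can be sketched like this, on stand-in data; the real data shows Water on top:

```python
import pandas as pd

# Stand-in generation-one frame.
gen_1 = pd.DataFrame({
    'Type 1': ['Water', 'Water', 'Water', 'Grass', 'Fire', 'Normal'],
})

# How many characters carry each Type 1 value, and what share
# of the whole each type makes up.
type_counts = gen_1['Type 1'].value_counts()
type_percentage = type_counts / len(gen_1) * 100
print(type_counts)
print(type_percentage)
```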

Let’s take a look at what that got us.

This is giving us some insight into the different Type 1 values — meaning how many characters have that type — and the percentage of the whole that this makes up. Clearly we see “Water” is pretty common. Let’s reduce the bits we want to consider. Here’s another chance for a Python function.

Now we’ll use our apply function again:
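A sketch of that classification function and its apply call; the function name, column names, and exact body are assumptions, but the under-4%-goes-to-‘Other’ rule is the one described below:

```python
import pandas as pd

# Stand-in: per-type percentages, as computed in the previous step.
types = pd.DataFrame({
    'Type 1': ['Water', 'Grass', 'Fire', 'Ice', 'Ghost'],
    'Type Percentage': [18.6, 12.0, 8.4, 2.1, 1.9],
})

def classify_type(row):
    # Fold any type under 4% of the whole into an 'Other' bucket.
    if row['Type Percentage'] < 4:
        return 'Other'
    return row['Type 1']

types['Type 1'] = types.apply(classify_type, axis=1)
print(types)
```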

What this is doing is basically classifying all the characters based on their percentage, which is in turn based on their count. If that percentage is less than 4%, that character grouping will come under the ‘Other’ category.

Let’s get one more visualization under our belt. First let’s do a little digging into our data.

Here the new_type_of_pokemons_series variable basically just provides a data frame with the Type being the index, thus leaving just Type 1 and Type Percentage as separate columns. The labels_for_types variable is just an array of those Type labels and the sizes variable is an array of the Type Percentage values. Now let’s finally generate the visual:
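A sketch of that final step, using the variable names from the post with illustrative percentages; the Water wedge is the one exploded:

```python
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in per-type percentages after folding small types into 'Other';
# variable names follow the post, values are illustrative.
new_type_of_pokemons_series = pd.Series(
    [18.6, 12.0, 8.4, 61.0],
    index=['Water', 'Grass', 'Fire', 'Other'],
)
labels_for_types = new_type_of_pokemons_series.index.tolist()
sizes = new_type_of_pokemons_series.values

# Explode the Water wedge, since that's the type we're zeroing in on.
explode = [0.1 if label == 'Water' else 0 for label in labels_for_types]
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels_for_types, explode=explode, autopct='%1.1f%%')
ax.axis('equal')
```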

From the chart as well as the data sets already generated, it’s clear that the majority of Generation 1 characters are of the “Water” type. Here the exploded wedge of the visualization just emphasizes that.

So we already focused on the Generation 1 characters. Now we’re focusing on the Water nature. Make sure you understand the progression and how we got from where we started to where we are now.

Provide Observations of Data

Let’s check the total stats of dual-nature, Water-type characters and see which of all the dual types is strongest.
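One way to generate that graph is to group the dual-nature Water characters by secondary type and plot the mean Total; this is a sketch on stand-in data, not the exact notebook code:

```python
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in: generation-one, dual-nature characters with Water as Type 1.
water_dual = pd.DataFrame({
    'Type 2': ['Dark', 'Flying', 'Psychic', 'Psychic'],
    'Total': [630, 540, 520, 490],
})

# Average Total per secondary type; Total sums the six combat stats.
totals_by_type = water_dual.groupby('Type 2')['Total'].mean()
ax = totals_by_type.plot(kind='bar')
ax.set_ylabel('Mean Total')
print(totals_by_type)
```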

What do we see? What is that telling you?

Well, from the graph it’s obvious that Water+Dark type characters have the best attributes if we consider dual-nature characters where Water is their primary type. Keep in mind how the graph is being generated, which is by using the Total column which is, as you may remember, a summation of all the combat statistics values.

So is that it? Do we have our answer? Are we getting to one idea of what the “best Pokémon” is?

So, as a tester, who is used to deciding when you have “done enough” to have a valid enough answer, what do you think?

Before reaching our final conclusion, let’s check the number of observations of each type and see whether that’s true for all cases or not.
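Checking the number of observations per secondary type can be sketched like this, on the same stand-in data:

```python
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in dual-Water frame, as in the previous step.
water_dual = pd.DataFrame({
    'Type 2': ['Dark', 'Flying', 'Psychic', 'Psychic'],
})

# How many observations back each secondary type? A single
# observation is a shaky basis for calling a type "best."
observations = water_dual['Type 2'].value_counts()
observations.plot(kind='bar')
print(observations)
```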

Huh. Well that’s interesting.

It is interesting, right? Do you see why?

From the graph, we can see that there is only one observation of Water+Dark. The same applies to Water+Fighting and Water+Flying, even though those looked pretty “good” on our previous graph. Clearly to conclude anything about these natures we’re going to need more observations.

Okay, well, since we have dual as well as single natured Water characters, what can we check next?

How about if we check which type overall — single or dual — will be more beneficial?
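A sketch of that comparison, grouping the Water characters by the Types column and averaging each combat statistic (stand-in data, with values chosen to illustrate the pattern described below):

```python
import pandas as pd

# Stand-in generation-one Water characters with a Types column.
water = pd.DataFrame({
    'Types':   ['Single', 'Single', 'Dual', 'Dual'],
    'HP':      [44, 59, 95, 65],
    'Attack':  [48, 63, 125, 65],
    'Defense': [65, 80, 79, 100],
    'Speed':   [43, 58, 41, 35],
})

# Mean of each combat statistic for single- vs dual-nature Water types.
comparison = water.groupby('Types')[['HP', 'Attack', 'Defense', 'Speed']].mean()
print(comparison)
```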

From the graph, we can conclude that for all the attributes, except for Speed, the dual-nature Water type characters are better than the single-nature Water type characters. We can also see that Defense is the best statistic of the group for Generation 1 Water-nature characters.

What next?

Well, a logical question is: do all the generations follow this same pattern? This gets a little involved to check, so brace yourself:
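The involved bit is essentially a loop: for each generation, chart the mean combat statistics of its Water-type characters. A sketch on stand-in data (two generations, values illustrative):

```python
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in: Water characters across two generations.
water = pd.DataFrame({
    'Generation': [1, 1, 2, 2],
    'HP':         [44, 59, 90, 75],
    'Attack':     [48, 63, 48, 38],
    'Defense':    [65, 80, 45, 38],
})
stats = ['HP', 'Attack', 'Defense']

# One bar chart of mean stats per generation; repeat for every
# generation present in the data.
best_stats = {}
for generation, group in water.groupby('Generation'):
    means = group[stats].mean()
    ax = means.plot(kind='bar', title=f'Generation {generation} Water types')
    plt.close(ax.figure)
    best_stats[generation] = means.idxmax()
print(best_stats)
```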

Lots of visuals! And from those graphs a few things stand out for us.

What are they?

Only Generation 1 Water-type characters have Defense as their best stat. We already determined that. But it’s also clear that Generation 2 and Generation 4 Water-type characters have HP (health points) as their best stat. Generation 3 clearly leans more toward the attack stat. On the other hand, Generation 4 and Generation 6 have the higher special attacks.

Focus In On Data Attributes

Now let’s start to bring in our focus a bit by finding the “best attribute.” At least so far as we can determine with the data analysis we’ve done so far. First, let’s break this up by the types of the characters:

Once again we’ll provide a function. This one will get us the best attribute for a particular type of character:

You can then call that function for each of the types_of_pokemon that we got.
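A sketch of those steps together: get the unique types, define a best-attribute helper (the name and body here are assumptions; `types_of_pokemon` and `type_data` come from the post), and build the dictionary by calling it for each type:

```python
import pandas as pd

# Stand-in frame; stats are illustrative.
df = pd.DataFrame({
    'Type 1':  ['Bug', 'Bug', 'Electric', 'Electric'],
    'Attack':  [70, 90, 55, 65],
    'Defense': [50, 60, 40, 60],
    'Speed':   [60, 40, 90, 105],
})
stats = ['Attack', 'Defense', 'Speed']

# The unique types present in the data.
types_of_pokemon = df['Type 1'].unique()

def get_best_attribute_for_type(pokemon_type):
    # Mean each stat over characters of this type; the stat with
    # the highest mean is that type's "best attribute."
    subset = df[df['Type 1'] == pokemon_type]
    return subset[stats].mean().idxmax()

# Build the type -> best attribute dictionary described in the post.
type_data = {t: get_best_attribute_for_type(t) for t in types_of_pokemon}
print(type_data)
```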

You might want to take a look at that data and see what it’s returning. By now you should know how to get at the contents of the type_data variable. Essentially what you’ll get is a dictionary that holds, as a key, the type (say, ‘Bug’) and the value for that key will be the best attribute for that type (say, ‘Attack’).

Let’s do another function to get the best attribute:

Now let’s create a new column in our data, calling it ‘Best Attribute’ and populate it accordingly.
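A sketch of the lookup function and the new column on stand-in data; the function body is an assumption, and here an explicit `.copy()` stands in for the post’s `is_copy = False` line (both are ways of dealing with the warning discussed below):

```python
import pandas as pd

# Stand-in frame plus the type -> best attribute mapping built earlier.
df = pd.DataFrame({
    'Name': ['Caterpie', 'Pikachu'],
    'Type 1': ['Bug', 'Electric'],
})
type_data = {'Bug': 'Attack', 'Electric': 'Speed'}

def get_best_attribute(row):
    # Look up this character's type in the mapping built earlier.
    return type_data[row['Type 1']]

# Work on an explicit copy to avoid the SettingWithCopyWarning.
df = df.copy()
df['Best Attribute'] = df.apply(get_best_attribute, axis=1)
print(df)
```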

The above “is_copy = False” might need a bit of explaining. If you leave that out, you will see what errors or, in this case, warnings look like in notebooks. Specifically, without that line in place, you will get a SettingWithCopyWarning below the code cell when you execute it. The warning is alerting you to the fact that you are operating on a copy and not the original.

Now let’s take a look at the first couple of rows:

You’ll see you have a “Best Attribute” column, telling you the best attribute for each character.

Let’s get an overall view of that data:
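One way to get that overall view is to group the types by their best attribute; a sketch on stand-in data:

```python
import pandas as pd

# Stand-in frame with the Best Attribute column populated.
df = pd.DataFrame({
    'Type 1': ['Bug', 'Dragon', 'Electric', 'Ground', 'Normal'],
    'Best Attribute': ['Attack', 'Attack', 'Speed', 'Defense', 'HP'],
})

# Which types share each best attribute? Grouping gives a nice list.
overview = df.groupby('Best Attribute')['Type 1'].apply(list)
print(overview)
```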

That will get you a nice list. Looking at that list, what do you observe?

  • Question: Which types will have better Attack?
  • Answer: Bug, Dragon, Fighting, and Poison clearly are good with Attack. Meanwhile, Fire, Ghost, Grass, and Psychic are good with Special Attack.
  • Question: Which types will have better Defense?
  • Answer: Ground, Rock and Water. Ice apparently has better Special Defense.
  • Question: Which types will have better Speed?
  • Answer: It looks like only Electric.
  • Question: Which types will have better health points (HP)?
  • Answer: Normal and Fairy.

We’ve definitely got some actionable data there.

Gathering Our Thoughts

So do you see what we did here? Along with the second post, we continued down the path of operationalizing an initial question we started with: “Which is the best Pokémon character?”

In the second post we got an idea of how the combat statistics related to each other, such as by comparing Speed and Defense. Here we looked at the types that have those particular combat statistics. We haven’t found the “best” character but we have analyzed much of the data that would go into what seems to make the “best” character.

Along the way, a lot of code was thrown at you. As I’ve indicated, perhaps this is code you yourself will write as you learn to work in these environments. Consider it a form of “automation,” perhaps. But it may also be provided to you by developers. What matters is what you do with it; how you use the logic to get the answers you need. Every time I wrote a Python function, I could have simply used a series of commands to further refine a given data set. That, however, could have sacrificed a certain amount of clarity around what was being done.

You also cannot have failed to notice that I’ve shown very few of the results of each command in these posts. The reason for that is I’m certainly hoping you’ve been exploring along with me, trying out your own notebook and perhaps comparing your work to the oracle I provided you, which was my own attempt at this.

My goal here was for you to feel a little cognitive friction as you had a lot thrown at you so that you could see how you responded to it. Did you take more time to investigate the csv files or just assume that I would feed you any information you needed? Did you take time to investigate the outputs of some of the variables, to understand what kind of data you were getting returned? Or did you just follow along with whatever I said? Or were you just using these posts, up to this point, to get a little feel for data science?

By the way, there is no wrong answer to any of those questions.

What’s Next?

What has all this work actually told us, though? As a tester, what would you be telling the business team about their quest for finding the “best” (read: most quality) character?

We did a lot of testing around our data. (You did recognize what we were doing as testing, right? We did testing as a design activity and testing as an execution activity.) What has that testing told you so far?

We’ll come back to those remaining questions in the fourth, and final, post in this series. In that post we’ll do one final bit of analysis and a summing up of what I hope has been useful in this series.


About Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.
This entry was posted in Data Science.
