Testers and Data Science, Part 1

I’ve talked a bit about testers and AI as well as testers and machine learning. Here I want to focus a bit on one area that can be a basis for both of those areas: data science. As a tester, you don’t need to be a data scientist. But it certainly doesn’t hurt to have a grounding in what data scientists do. Here we’ll do some exploratory computing with Jupyter; we’ll use some numerical and visualization libraries and we’ll explore the (fascinating?) world of Pokémon data. So let’s take a few posts to dig into this.

Fair warning: this is going to be a long post, so if you want to dig in, you should settle in for the long haul. This will be a crash course in some data science.

I say “some” because I’m not going to teach everything about data science. Beyond my lack of ability to do so, the subject itself is huge. That being said, I do recommend taking a look at Leveling Up as a Data Scientist, Part 1 and, particularly, Part 2. If you want to get slightly deeper into some specific techniques, check out 10 Statistical Techniques Data Scientists Need to Master. Even if you don’t plan to get into those techniques, knowing what others are doing is useful. Along with that, the book “Practical Statistics for Data Scientists: 50 Essential Concepts” is pretty good for the statistics background.

I also recommend the book Data Science from Scratch: First Principles with Python, which is a nice gentle introduction to a lot of topics. If you like R, I recommend R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. If Java is your thing, I haven’t found many books or sources that I liked on this topic but, to be fair, I haven’t sought them out much, since I’m not a Java fan in this context. Another book that helped me quite a bit was Making Data Visual: A Practical Guide to Using Visualization.

The Basis of Data Science

Data science is concerned with the cleaning, manipulation, and processing of data. A large part of the focus is on data sets that can be transformed into a structured format that is suitable for analysis and modeling.

There’s also a focus on feature extraction from those data sets. In this context a feature is any part of the data set that you consider relevant for a given task. For example, maybe you have movie reviews that can be processed into a word frequency table. The contents of that table will then be used to perform sentiment analysis. That analysis can then be compared with the rating value that someone gave for the movie. The task might be to see whether sentiments (what people say) and the rating they gave coincide.
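To make the word frequency idea concrete, here is a minimal sketch using only the Python standard library. The two review strings are made up for illustration, and the tokenizing is deliberately naive; a real pipeline would be more careful.

```python
from collections import Counter
import re

# Hypothetical review texts; not taken from any real data set.
reviews = [
    "A great movie with a great cast",
    "Dull plot, dull pacing, great soundtrack",
]

def word_frequencies(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# One frequency table per review; each distinct word is a candidate feature.
tables = [word_frequencies(review) for review in reviews]
print(tables[0].most_common(2))
```

Each table could then feed a sentiment step, and the result could be compared against the rating the reviewer gave.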

You, or your team, will ultimately be working in the context of some language — Python, Java, Scala, R, etc. You will also be using the data-oriented library ecosystem of that language. It’s really important to realize that there is data analysis methodology and then there are tools that support that methodology. You don’t want to conflate the two any more than, say, you want to conflate DevOps (as an approach) with Docker (a tool that can be used in the approach).

Some Basic Setup

Here I’m going to focus on Python, specifically Python 3.x. I’ll trust you’re able to get this set up on your system. Do note that if you are on Windows, I would strongly recommend a 64-bit version of Python.

A large part of data science is having some basis in interactive and exploratory computing. I’ll show you how this works in this post with Python. Specifically, there are tools called IPython and Jupyter which support this style of computing. In these Python contexts, a lot of what you do — whether as developer or tester — is best explored from either an IPython or Jupyter session. You can install both of these via the pip tool:

pip install ipython
pip install jupyter

Also helpful in data science is having an understanding of arrays and array-oriented computing. That will help you use tools with so-called array-oriented semantics. We’ll do that in this series, looking at Python libraries like NumPy, pandas, and a few others.

The Tool Ecosystem

In fact, let me give you a little context for some of these tools I’ve mentioned as well as some others. Fair warning: I’m going to throw a lot at you here because I want to emphasize the fact that there is a lot here. Tool ecosystems in the context of data science — not to mention artificial intelligence, machine learning, deep learning, and so on — are wide and deep. These are waters you’ll want some proficiency with swimming in.

Interactive Tools

IPython is basically just an enhanced Python interpreter. Instead of typing python at your terminal or command prompt, you would type ipython. There’s not a lot to say about it beyond that. Or, rather, there is — but not much of it relevant for the data science I’m focusing on.

Jupyter is an execution platform that is focused on the concept of a notebook. Jupyter notebooks are basically web pages that have executable code in them. These notebooks were originally created within the IPython project but then branched out on their own. In fact, what was called the IPython web notebook morphed into the Jupyter notebook.

In a Python context, the Jupyter kernel uses the IPython system for its underlying behavior. When I say “kernel”, here’s what to take from that. A Jupyter notebook can interact with implementations of the Jupyter interactive computing protocol in many different programming languages. Each such implementation is a kernel. This polyglot approach means that Jupyter is a language-agnostic interactive computing tool. Its importance from my perspective is that Jupyter provides an environment for the interactive and exploratory computing I mentioned earlier.

Data Tools

NumPy, short for Numerical Python, adds new data structures to Python for fast array-processing capabilities. NumPy is one of the foundational packages for pretty much all numerical computing in Python and, as such, it can act as a go-between for data being passed between algorithms and libraries. How this manifests is that many computational packages providing any sort of scientific or numerical functionality use NumPy’s array objects as the basis for data exchange.

A library called pandas (yes, with a lowercase ‘p’) provides data structures and functions that are designed to handle working with structured or tabular data. The pandas library adds abilities to create and manipulate subsets of data. If you’re curious, as I was, the “pandas” name is derived from ‘panel data’, which is, according to the creator of the library, “an econometrics term for multidimensional structured datasets.” (Uh huh. Just as I suspected.) Apparently the name is also meant to indicate “Python data analysis.”

Importantly, pandas adopts significant parts of NumPy’s idiomatic style of array-based computing. The main difference between the two is that pandas is designed for working with tabular and/or heterogeneous data while NumPy is better suited for working with homogeneous numerical array data.
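A small sketch of that difference, assuming NumPy and pandas are installed. The names and numbers are illustrative stand-ins rather than rows from any particular file.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype; mixing types forces a common one (here, text).
hp = np.array([45, 35, 106])
mixed = np.array([45, "Pikachu"])
print(hp.dtype, mixed.dtype)

# A DataFrame keeps a separate dtype per column, so text and numbers coexist.
frame = pd.DataFrame({"Name": ["Bulbasaur", "Pikachu"], "HP": [45, 35]})
print(frame.dtypes)
```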

I’ll note that pandas is used a lot in data science contexts. Often it’s used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like seaborn or matplotlib. You’ll see that in these posts.

Analysis Tools

Regarding some of those other tools I just mentioned, the scikit-learn library is a general purpose machine learning toolkit, essentially providing implementations of many of the algorithms that are useful in machine learning contexts. The statsmodels library, which is a statistical analysis package, contains algorithms for classical statistics and econometrics. An interesting point of distinction between these two libraries is that statsmodels is focused on statistical inference, essentially providing uncertainty estimates for various parameters. By contrast, scikit-learn is more prediction focused. We probably won’t use these too much in this series.

Finally, the library seaborn provides data visualization capabilities. It does so by providing a high-level interface to yet another member of the ecosystem, matplotlib, the latter of which is used for producing plots and other two-dimensional data visualizations. Seaborn also relies on SciPy, which is a collection of packages for dealing with standard problem domains in scientific computing. These are tools that you use for exploratory analysis and, with the exception of SciPy, you’ll see a lot of these tools in this series.

Tool Overload?

Wow! That’s a lot of stuff, right? Yes, it is. In data science, you’ll find that there’s an ecosystem of tools that often work together in order to provide the basis for all of the work your team is doing. Those tools are often used in the context of one of those interactive and exploratory computing environments that I mentioned. As I said, we’re not going to use all of those tools in these posts but I’ve found that even having a simple description of the tools, like I gave above, provides just enough difference between being totally flummoxed and only marginally confused.

Okay, so let’s see all this in action.

Getting Some Data

It might come as no shock to you that in doing data science, you’re going to want some data. In this post, I’m going to give you a set of data. For learning and practicing, you can use Kaggle to get some datasets. Do note, however, that in many data science contexts, people won’t just be handing you data. You may actually have to get or create your own.

To start off, you can download this file: pokemon_data_science.csv.

You are probably a lucky person if you haven’t been aware of the Pokémon craze. Either way, take a moment to download that file and look at it. When acting as a developer, and especially as a tester, in a data science context, understanding the data is fairly critical. So, again, this isn’t a race: take some time and look at the data. Unless you’re familiar with many of the Pokémon concepts, some of it is going to be a bit strange to you. It certainly was to me. As far as the source, the Pokédex is where this data came from.

That last point might seem to be of dubious value but, of course, it isn’t. As testers we have to be aware of oracles and the provenance of information that we are given.

Analysis of the Data

The dataset is focused on the statistics and features of the Pokémon characters. There are (currently) seven generations but this dataset only covers six of them. Specifically, the dataset includes 23 variables for each of the 721 Pokémon of the first six generations. That’s what a surface glance of the data should tell you. The Number column holds the Pokémon ID in the Pokédex and the Name column is the official name of the character. Beyond those two fields, we have 21 other variables. Of those, 12 are numerical (10 continuous and 2 discrete), 6 are categorical and 3 are boolean.

But, like much data, it can get complicated when you move past a surface level examination.

For example, as you can see in the data, each character has a type. There is a primary type (which is Type_1) and a secondary type (which is Type_2). Not all characters have a secondary type, however. Beyond that, many Pokémon are multi-form and also some of them can mega-evolve. A form means that they are the same kind but they look different. Mega evolution means the character has the same name but with “Mega” attached to it. The point here is that this can lead to a pretty complicated data set and makes the number of a given character possibly non-unique. So for multi-form and mega-evolve, only one of the forms was chosen for this dataset. The form chosen was the one widely considered as being the most common.

I bring all of this up because it’s a really important point! You have to understand the basis of the data, not just the data that’s in front of you. This is where what’s called “substantive expertise” — basically, good domain knowledge — is crucial. Beyond even that, you need to understand some of the assumptions that have gone into the data.

Also of potential interest is the notion of gender for these characters. They can be male or female but they can also be unknown. There’s also a probability associated with being male or female (here covered by the Pr_Male column). A lot of this has to do with the notion of egg groups and the idea of legendary status. All of that might make more sense if you consider that a large part of the Pokémon craze is the “breeding” of the creatures, which is where the generations come in.

There can be a lot of information lurking in data. And the data itself does not contain that information, per se. Or, rather, it does but only implicitly. This is part of what you have to understand. My hope with the above is you see that even with something as “simple” as Pokémon, there is a fair amount of nuance and complexity. But, as testers, we’re used to that, correct? We often deal with data and have to make sense of it. So we already have some alignment with data scientists.

Analysis of Information

When looking at the overall data, you should start thinking about what sort of data subsets you would want to gather. And then you need to start thinking about what information you want to get from those data subsets. What insights would you be looking for?

STOP! Did your tester senses kick in there? Insights! Here’s another area where we have some alignment with data scientists. We want to make sense of things; we want to help others make sense of things.

As far as our data, as just stated, each Pokémon is described with a number of variables. Not only do we have the combat statistics (the variables that describe the ability of the character to fight), but also many variables that describe more details about them, such as their color or gender. Regarding the combat, you can see there’s information in the data about “attack” and “defense” along with “special” attacks and defense. There’s also some notion of “health points.” There’s also the notion of “speed.”

And so perhaps you’re wondering if there are correlations between how quick one of these characters is and their defense abilities. Or whether their total health has any correlation with their attack abilities. And how does any of that tie in with their generation, if at all? Does the male/female distinction matter? And what about those generations? Have the characters gotten better or worse or stayed largely the same?

Depending on your interest in Pokémon, it may sadden you or gladden you that I don’t plan to answer all of those questions here. What I want to do is simply use this dataset to show you a bit of interactive and exploratory computing, using some of the libraries we talked about earlier.

Specifically, we can analyze the wide variety of variables used to describe the Pokémon characters. And if that’s so, then there is a chance to find relationships between those variables and perhaps even to cluster the Pokémon according to some criteria. And that in turn may lead us to predict which Pokémon will do better in certain combat situations than in others.

Getting Ready to Code

First, make sure you have some project directory that you can use. Put the CSV file you downloaded in there and then make sure you are in that directory at your terminal prompt.

You can do everything I do next at a regular Python REPL, an IPython REPL, a Python script or via Jupyter. If you are not using something like Jupyter or the REPL, you will have to substitute print() functions around some elements. The REPL or Jupyter will, by contrast, self-evaluate. Personally, I would recommend giving Jupyter a try and I’m going to assume you’re doing so in this post. Fire it up like this:

jupyter notebook

On most platforms, Jupyter will automatically open up in your default web browser, likely at port 8888. If for whatever reason that doesn’t happen, you can also navigate to the URL address displayed in your terminal when you started the notebook. The above command essentially makes whatever directory you were in a web repository, with Jupyter being the web server.

To create a new notebook in the web interface click the New button and select one of the options that appears. These options can differ depending on how you installed Python. If you just have a standard Python installation, you’ll either see “Python 2” or “Python 3”. If you are using Python as part of the Anaconda distribution, you might see “conda” in the list.

You can save a notebook (see under the File menu) and that will create a file with the extension ipynb. This is an entirely self-contained file format that holds all of the content — including any evaluated code output — that is present in the notebook. This is really nice because these can be loaded and edited by anyone who loads up your notebook.

For now, create a notebook called pokemon-001.

In fact, you’ll see I have my own version of this file stored on my GitHub repo. Notice how GitHub knows how to display these notebooks, which is really nice. You can reference my notebook to see if you are getting the same results.

Accessing the Data

Accessing your data is a necessary first step after you’ve actually got some data. We’ll access our data using the pandas library. The pandas library has two main data structures that it bolts on to Python: Series and DataFrame. I’ll mainly be focusing on a DataFrame here. Essentially a DataFrame represents a rectangular table of data and contains an ordered collection of columns. Think of it as an in-memory spreadsheet.

The library provides methods to load data from different file types, like Excel and csv. You can also load data from json, sql, and various other means. Make sure the csv file that you downloaded is in the directory where you started up Jupyter or your IPython REPL. If you’re doing this in Jupyter — and that is recommended — you should have an empty “code cell” waiting for you. Here you can enter Python code. So try this:
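The original cell isn't reproduced here, so what follows is a sketch of a typical opening cell, assuming pandas, seaborn, and matplotlib are installed. The aliases `pd`, `sns`, and `plt` are common conventions, not requirements. The magic line is shown commented out so that the block also runs as a plain Python script; inside a notebook you would uncomment it and leave it as the first statement.

```python
# Notebook-only magic: uncomment it inside Jupyter or IPython.
# It is not valid syntax in a plain Python script.
# %matplotlib inline

import pandas as pd              # DataFrame and Series
import seaborn as sns            # statistical plots built on matplotlib
import matplotlib.pyplot as plt  # the plotting layer underneath seaborn
```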

Note that pressing Return/Enter keeps you in the same cell. Before we execute anything, do note that you will need to install any libraries you reference in the notebook. This is done outside of your notebook. You can do all this using pip:

pip install pandas
pip install seaborn

Installing seaborn should grab matplotlib for you.

Most of that code I just showed you is code that would work in any Python context, the single exception being the first statement. A statement with a % character is called a magic function in the context of IPython / Jupyter. What this does is allow any graphs generated by matplotlib to be done “inline” — meaning, in the notebook itself.

To execute that code, press Shift+Return/Enter. You can read some basics for editing a notebook. There’s not much to it. What the above is doing is getting us the libraries that we need in place.

There won’t be any output, just as there wouldn’t be in a regular Python context. But you will be taken to another code cell where you can enter more code. I’ll note here that nothing would stop you from putting all of the code in this post in one code cell. But a nice thing with Jupyter is that you can interleave markdown cells that explain what the code is doing or provide some wider context about the data. For example, all that context I gave you above about Pokémon could be interleaved with the code itself.

STOP! Hey testers! That’s pretty cool because we could interleave our test execution and results with our notes or with the requirements.

Now we’ll grab our data:
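A minimal sketch of the load step. In the notebook, the call would point `pd.read_csv` at the downloaded file; here a three-row stand-in with illustrative values is read from memory so the example runs anywhere. The `HP` column name is an assumption about the file's header.

```python
import io
import pandas as pd

# In the notebook, with the downloaded file in the working directory:
#   pokemon = pd.read_csv("pokemon_data_science.csv")
# Stand-in text with the same general layout (values are illustrative).
csv_text = """Number,Name,Type_1,Type_2,HP
1,Bulbasaur,Grass,Poison,45
4,Charmander,Fire,,39
25,Pikachu,Electric,,35
"""

# The first line of the text becomes the column names.
pokemon = pd.read_csv(io.StringIO(csv_text))
```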

You won’t actually see any output from the above statement but that code will return a pandas.DataFrame object. If you are a “trust, but verify” kind of person, do this:
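One way to verify, sketched against a tiny stand-in frame rather than the real file:

```python
import pandas as pd

# Stand-in for the loaded data; in the notebook `pokemon` already exists.
pokemon = pd.DataFrame({"Name": ["Bulbasaur", "Pikachu"], "HP": [45, 35]})

# type() reports the class of the object that the load step handed back.
print(type(pokemon))
```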

A DataFrame is a table-like data structure that will make it easier for us to manipulate our data set and extract information from it. From now on pokemon will be the representation of our DataFrame. But, as you’ll see, you can create multiple views of your data, each one of which would be a new DataFrame object. Each such DataFrame could be a subset of the full data set.

Exploring the Data

The pandas library provides some methods to visualize the data that you’re working on. When using methods like read_csv(), as we just did, the first line of that dataset will be used for indicating the columns. Let’s take a look at a certain subset of our data, such as the first couple of rows.

The head() call allows you to visualize the first few rows on your DataFrame. The default value for number of rows to show is 5. You could specify more if you wanted:
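Both forms of the call can be sketched as follows. The frame is a generated stand-in with twelve rows, just enough to make the row limits visible.

```python
import pandas as pd

# Stand-in frame with 12 rows; in the notebook `pokemon` is the loaded CSV.
pokemon = pd.DataFrame({"Number": range(1, 13), "HP": range(40, 52)})

first_five = pokemon.head()    # default: the first 5 rows
first_ten = pokemon.head(10)   # ask for the first 10 rows instead
print(len(first_five), len(first_ten))
```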

Keep in mind that each row of the dataset represents one character as represented by several features for that character. Those features are the table’s columns. Those columns indicate the “shape” of our data. Let’s get that shape.
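A sketch of the lookup on a small stand-in frame. Note that `shape` is an attribute, not a method, so there are no parentheses. With the real file loaded, the same attribute reports the full size of the dataset.

```python
import pandas as pd

# Stand-in frame: 3 rows and 4 columns of illustrative values.
pokemon = pd.DataFrame({
    "Number": [1, 4, 25],
    "Name": ["Bulbasaur", "Charmander", "Pikachu"],
    "Type_1": ["Grass", "Fire", "Electric"],
    "HP": [45, 39, 35],
})

# shape is a (rows, columns) tuple.
rows, columns = pokemon.shape
print(pokemon.shape)
```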

You should get this:

(721, 23)

This means we have 721 data items and 23 features for each item in the dataset. Actually, only 21 of those matter, since the name and number are not features that we would want to do much analysis on.

Whoa! See how I just slipped that statement in, almost as if it were “obvious.” That’s an important thing to keep in mind as you begin analyzing and exploring your data. You have to actually analyze it and explore it. This will allow you to reason about it and thus make decisions about it. And giving people the means to reason about things and make decisions is almost entirely what testing is about. Again, we see alignment here.

But what does the “shape” mean? The shape refers to the dimensionality of the DataFrame. So, again, what this tells us is that we have 721 characters and 23 features of those characters. That’s a pretty big dimensional space. Those features are each represented by a column. We can get a list of those:
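A sketch using a stand-in frame; the real file would list all of its column names in the same way.

```python
import pandas as pd

# Stand-in frame with a handful of illustrative columns.
pokemon = pd.DataFrame({
    "Number": [1, 25],
    "Name": ["Bulbasaur", "Pikachu"],
    "Type_1": ["Grass", "Electric"],
})

# columns holds the column labels; list() turns them into a plain list.
print(pokemon.columns)
column_names = list(pokemon.columns)
```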

And there’s all your columns as an Index. Here the Index is an immutable, array-like object that is backed by a NumPy ndarray. Remember how I said pandas relies on NumPy data structures? Well, here’s your first indication of that.

Exploring Specific Data

So far we’ve looked at some very general aspects of the overall data set. You can select a single column, which yields that other data structure I mentioned: a Series.

If you want to check that this is a Series:

By the way, I should note that pandas also supports the Python dictionary syntax for accessing columns. So you could do this as well:
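All three steps (attribute-style selection, the type check, and dictionary-style selection) can be sketched on a stand-in frame like this:

```python
import pandas as pd

# Stand-in frame; in the notebook `pokemon` is the loaded CSV.
pokemon = pd.DataFrame({"Name": ["Bulbasaur", "Pikachu"], "HP": [45, 35]})

# Attribute-style access works when the column name is a valid identifier.
names_attr = pokemon.Name
print(type(names_attr))

# Dictionary-style access works for any column name, including ones with spaces.
names_item = pokemon["Name"]
```

The bracket form is the safer habit, since it never collides with DataFrame method names.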

A Series is basically a one-dimensional array-like object. This object contains a sequence of values and an associated array of data labels, called the index. If you are one of those people who likes to understand the data structures you’re working with, you can think of a Series as an ordered dictionary because it’s really just a mapping of the index values to the data values. And — if you want to get a little more into the weeds — what the above commands implicitly show you is that a DataFrame can be considered as something like a dictionary of Series that are all sharing the same index.

I mentioned earlier that a lot of libraries use NumPy as their basis. Here’s where I can prove that again:

That’s telling you that we’re dealing with a NumPy ndarray, which is basically an “n-dimensional” array. These are homogeneous arrays and here we can tell what type of data is being held by this array:
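A sketch of both checks, the array type first and then its element type. It uses `to_numpy()`, which I'm choosing here because it reliably returns a NumPy array; older code often reads the `.values` attribute instead. The exact dtype printed for text columns can vary between pandas versions.

```python
import numpy as np
import pandas as pd

# Stand-in frame; in the notebook `pokemon` is the loaded CSV.
pokemon = pd.DataFrame({"Name": ["Bulbasaur", "Pikachu"], "HP": [45, 35]})

# to_numpy() hands back the column's values as a NumPy array.
names = pokemon["Name"].to_numpy()
print(type(names))

# dtype is the single element type shared by the whole array.
print(names.dtype)
```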

That tells you that this feature is represented as a series of string objects.

So what you’re seeing here is that we’re checking out what’s in the dataset and what it looks like. But let’s get some general information about those columns:
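A sketch of that call on a stand-in frame with illustrative values. `info()` prints its report directly instead of returning it.

```python
import pandas as pd

# Stand-in frame; in the notebook `pokemon` is the loaded CSV.
pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Pikachu"],
    "Type_1": ["Grass", "Fire", "Electric"],
    "Type_2": ["Poison", None, None],
    "HP": [45, 39, 35],
})

# Prints one line per column, with its non-null count and dtype.
pokemon.info()
```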

Here’s a chance for you to put on your tester hat. Do you notice anything interesting about the data returned? Think about that. I’ll come back to this shortly. But this is your chance to see if you’re paying attention!

Now let’s get some summary statistics of some of the columns.
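A sketch of the call on a stand-in frame. The `HP` and `Attack` column names are assumptions about the file's header, and the values are illustrative.

```python
import pandas as pd

# Stand-in frame; in the notebook `pokemon` is the loaded CSV.
pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Pikachu", "Mewtwo"],
    "HP": [45, 35, 106],
    "Attack": [49, 55, 110],
})

# describe() returns a new DataFrame of count, mean, std, min, quartiles, max.
summary = pokemon.describe()
print(summary)
```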

Hey, look at that! Another chance to put on your tester hat! What are you noticing about the data that is returned?

What you should note is that this last command only works on the numeric columns. I said that this command was going to generate some “summary statistics” on “some of the columns” and the fact that it’s about statistics should indicate to you that this wouldn’t make sense on non-numeric data. That’s why some features are left off of the output from this command.

Missing Data?

One thing you often have to deal with in data science is looking at missing or incomplete data. Such data can make analysis problematic in a variety of ways. So, first, let’s see if we can check how many null values there are.
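A sketch of one common way to do that: `isnull()` marks each cell as missing or not, and `sum()` then counts the missing cells per column. The frame is a stand-in with illustrative values.

```python
import pandas as pd

# Stand-in frame with two missing secondary types.
pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Pikachu"],
    "Type_1": ["Grass", "Fire", "Electric"],
    "Type_2": ["Poison", None, None],
})

# True counts as 1 when summed, so this gives missing values per column.
null_counts = pokemon.isnull().sum()
print(null_counts)
```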

So we see that three columns have some null values and that will be important to understand. And now maybe you see what I was asking you to notice about the output for the earlier pokemon.info command. When you ran that command, you should have seen that most features returned 721 (the same as the number of rows in the dataset as a whole) but three did not. They returned fewer. And here the pokemon.isnull is confirming that those same three features have some null values.

Digging Into Data

Beyond looking for missing values, we can try to get an understanding of other aspects of our data. For example, let’s see how many “Legendary” characters we have:
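A sketch of one way to count them. The column name `isLegendary` is an assumption about the file's header, and the stand-in rows are illustrative.

```python
import pandas as pd

# Stand-in frame; `isLegendary` is an assumed name for the boolean column.
pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Pikachu", "Mewtwo"],
    "isLegendary": [False, False, True],
})

# Summing a boolean column counts the True values.
legendary_count = int(pokemon["isLegendary"].sum())
print(legendary_count)
```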

Okay, possibly good to know. Out of our total data set of characters — over 700 — a very small fraction of them (46) are legendary. Whether that actually matters, of course, depends on what we’re looking for.

You’re probably getting the point that there are a lot of ways to look at and filter your data with the pandas library. For example, consider this question: “Is there a Pikachu character and what info do I have on it?”
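A sketch of that lookup with a boolean mask, run against a stand-in frame:

```python
import pandas as pd

# Stand-in frame; in the notebook `pokemon` is the loaded CSV.
pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Pikachu", "Mewtwo"],
    "Type_1": ["Grass", "Electric", "Psychic"],
    "HP": [45, 35, 106],
})

# The comparison yields one True/False per row; indexing keeps the True rows.
is_pikachu = pokemon["Name"] == "Pikachu"
pikachu = pokemon[is_pikachu]
print(pikachu)
```

An empty result would mean there is no such character in the data.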

Boolean filtering like that is really simple. So let’s look at our data in another way.

That returns the count of each type present in the “Type_1” column of the dataset. We can also get the same thing but in decreasing order.
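Both counts can be sketched like this. I'm assuming `groupby(...).size()` for the plain count and `value_counts()` for the sorted one; the original calls may have differed. The rows are illustrative.

```python
import pandas as pd

# Stand-in frame with repeated primary types.
pokemon = pd.DataFrame({
    "Type_1": ["Grass", "Fire", "Water", "Water", "Electric", "Water", "Fire"],
})

# Count per type, ordered by the type name.
by_type = pokemon.groupby("Type_1").size()

# Same counts, ordered from most common to least common.
sorted_counts = pokemon["Type_1"].value_counts()
print(by_type)
print(sorted_counts)
```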

So we now know that there are quite a few characters with Water and Normal types. Maybe we want to know the character that has the highest health points:
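A sketch using `idxmax()`, which is my choice here and may not be the call the original cell used. It returns a row label, and `loc` then shows the whole row so you can check what that label refers to. The rows are illustrative.

```python
import pandas as pd

# Stand-in frame; in the notebook `pokemon` is the loaded CSV.
pokemon = pd.DataFrame({
    "Number": [1, 25, 113, 150],
    "Name": ["Bulbasaur", "Pikachu", "Chansey", "Mewtwo"],
    "HP": [45, 35, 250, 106],
})

# Row label of the largest HP value.
top_label = pokemon["HP"].idxmax()
print(top_label)

# Look the row up by that label to see which character it is.
print(pokemon.loc[top_label])
```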

That seems to suggest that the character with id 241 has the most HP. I’ll leave it as an exercise for you to determine if that’s correct.

Maybe we want to arrange the characters in descending order based on the Total value. The Total is just that: a total of the character’s base combat statistics. In other words, the total indicates the value of the HP, Attack, Special Attack, Defense, Special Defense, and Speed for the character. In getting this, let’s just get the first ten of whatever list is returned.
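A sketch of the sort on a stand-in frame with illustrative totals. `head(10)` simply returns everything when there are fewer than ten rows.

```python
import pandas as pd

# Stand-in frame; in the notebook `pokemon` is the loaded CSV.
pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Pikachu", "Mewtwo"],
    "Total": [318, 309, 320, 680],
})

# Largest Total first, then keep at most the first ten rows.
top = pokemon.sort_values("Total", ascending=False).head(10)
print(top)
```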

Nifty. Okay, let’s show all the unique colors, using the first type as our basis:

We can also just get a count of those unique values:

Okay, so let’s check the size of the dataset, as grouped by color:
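The three color queries can be sketched together. The column name `Color` is an assumption about the file's header, and the values are illustrative.

```python
import pandas as pd

# Stand-in frame; `Color` is an assumed column name.
pokemon = pd.DataFrame({
    "Color": ["Green", "Red", "Blue", "Yellow", "Red", "Blue", "Blue"],
})

colors = pokemon["Color"].unique()        # distinct values, in order of appearance
n_colors = pokemon["Color"].nunique()     # how many distinct values there are
by_color = pokemon.groupby("Color").size()  # number of rows per color
print(colors, n_colors)
print(by_color)
```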

Hopefully you’re getting a feel for this. By no means are you expected to remember all this for these posts. I just want you to get used to the flow of using an exploratory computing platform to analyze some data.

Visualize Our Data

Okay, so we got a lot of data for the color. To close out all this work, let’s do a little visualizing of the data.
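A sketch of the plot, assuming seaborn and matplotlib are installed and that the color column is named `Color`. The rows are a stand-in. In a notebook with the inline magic active, the chart appears under the cell; in a script, `plt.show()` opens or renders it.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in frame; `Color` is an assumed column name.
pokemon = pd.DataFrame({
    "Color": ["Green", "Red", "Blue", "Yellow", "Red", "Blue", "Blue"],
})

# One bar per color; bar height is the number of rows with that color.
ax = sns.countplot(x="Color", data=pokemon)
plt.xticks(rotation=45)  # tilt the category labels so they don't overlap
plt.show()
```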

Here I’m using Seaborn to get a “countplot”, which is essentially a histogram across a categorical variable (color, in this case) as opposed to a quantitative variable. I’m then displaying that plot as a bar chart. The underlying matplotlib library is being used for this.

Let’s do one more visualization:
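A sketch of the idea, assuming matplotlib is installed. Which columns get dropped is my guess for illustration (`Number` and `Name`, since they aren't really features), and the rows are a stand-in.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in frame; in the notebook `pokemon` is the loaded CSV.
pokemon = pd.DataFrame({
    "Number": [1, 4, 25, 150],
    "Name": ["Bulbasaur", "Charmander", "Pikachu", "Mewtwo"],
    "HP": [45, 39, 35, 106],
    "Attack": [49, 52, 55, 110],
    "Speed": [45, 65, 90, 130],
})

# drop() returns a new, smaller DataFrame; `pokemon` itself is left untouched.
stats = pokemon.drop(columns=["Number", "Name"])

# One histogram per remaining numeric column.
axes = stats.hist(figsize=(8, 6))
plt.tight_layout()
plt.show()
```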

Here I’m creating a new subset of the data based on the pokemon data I’ve been using. However, I’m dropping a few columns. Then I’m generating a histogram against that subset of data. That notion of getting only a subset of your data and then doing further manipulation on that subset, rather than on the whole, is crucial in data science.

Okay, Okay! We Get It!

I know it seems like I’ve gone through a lot of material here. Do understand that, as a tester in this kind of environment, learning how to look at data is going to be a key skill and being able to do that via the libraries in an interactive or exploratory computing environment will be very helpful. If you made it this far but found your eyes glazing over and your mind numbing, you might not want to work in data science contexts.

But … we haven’t done anything here, have we?

Well, we have. I’ve shown you that it’s possible to manipulate this data in a lot of ways. And I’m fairly certain you’re feeling that there is so much more that you don’t know than you do right now. And that’s very likely true! That’s certainly the case for me. But, as testers in these kinds of environments, what’s stopping us from learning? Nothing, of course. Except often being thrown in at the deep end with no prior experience. I hope I’m providing a little experience here.

Yeah, great. But still … we’ve done nothing here, right?

Beyond showing you some of the cognitive friction you are likely to encounter, I actually did cover a few important things here.

  • There’s a little thing you have to do called interacting with the outside world. At bare minimum, as you’ve seen, this means reading a potential variety of file formats and data stores and then having to figure out what you actually managed to get. But there’s also understanding the problem domain you are working with.
  • Then there’s preparation. To do data analysis, you often have to transform the data. That means cleaning it, munging it, combining it, normalizing it, reshaping it, slicing it, and so forth. You saw only a hint of that in this post.
  • Then there’s the actual transformation part. This is where you apply mathematical and statistical operations to groups of datasets to derive new datasets. Again, you perhaps saw a hint of this here when we got bits of information about data from the total dataset.
  • Then there’s modeling and computation. This is where you connect your data to statistical models, machine learning algorithms, or other computational tools that actually help you and your business provide value. We didn’t really touch on this at all here.
  • And then, finally, there’s presentation. This is where you create interactive or static graphical visualizations or textual summaries that help people reason about the (often) massive amount of data that is at your disposal. You provide people with ways to gather actionable insights. And, again, this isn’t something we touched on too much.

What all of this adds up to, in our professional contexts, is data-driven decisions at scale. That’s often what’s at stake as businesses embark on a path of getting key information or actionable insights from data. The goal is using data and analytics to make data-driven decisions.

A Few Cleanup Points

As you play around with Jupyter, you might wonder how you can get rid of all the output and start fresh. On the Kernel menu you’ll see an option for “Restart & Clear Output”. Also, on the homescreen of your Jupyter interface, you’ll see that your notebooks are “Running” and may wonder how to stop that. On the Running tab, you can click the “Shutdown” button. If you are in the running notebook itself, you can go to the File menu and click “Close and Halt”.

Also, of course, you’ll want to shut down the server at your terminal prompt when you are done.

The Tester Role

So this would be a good time to ask: what does the tester do in this kind of environment? Are you just handed all the data? Are you told all the commands to use? Are you told what to think about the data? Or how to investigate the data? Do you have to be provided with the specific guidelines for how to put the data to the test?

But if you have to be provided with all this, then someone is going to ask: “Well, why do we need you?”

But we’ve heard that before, in other contexts, right? So hopefully this post gave you a little head start on understanding the value you can provide by learning some of the techniques by which you will provide it. In the next post in this series, we’ll dig in a bit more with our dataset and I’ll ask you to think about the data in a much more specific and constrained context.


About Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.