
Testing at the Crime Scene, Part 3

In this third post in the crime scene series, we’re going to continue applying our crime scene techniques by adding an extra complexity dimension to what we started in the second post. We’re then going to try our analysis on a much larger code base than any we’ve looked at so far. So put on your detective hat and let’s dive in!

Needless to say, you really should read the first two posts in this series to make any sense out of this one. In fact, this post is continuing on directly from where we ended up in the second post with our Frotz crime scene analysis.

In the previous post, with our analysis of Frotz, we focused on the problematic modules doutput.c and dinput.c. We applied a particular complexity measure around lines of code. And while we agreed that, as evidence goes, “lines of code” wasn’t the greatest, it also wasn’t the worst. We were able to validate our choice because other lines of evidence converged to show us that there was at least some correlation between lines of code and our hotspots.

Multiple Complexity Measures

In a crime scene, the more clues you have, the better. And the more of those clues that can actually count as probative evidence, the better still. So let’s look at another really simple complexity measure here: whitespace. Even more specifically, whitespace that is used as indentation.

The idea of calculating indentation, just like calculating lines of code, is fairly trivial. You just read a module of code line by line and count the leading spaces and tabs, treating them as logical indentation. Empty and blank lines are ignored, so we’re not doing just raw whitespace analysis here.
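
In fact, while I’m not reproducing the actual script here, a minimal sketch of the idea — the logic, not necessarily what analyze_complexity.py does internally — might look like this, assuming a tab counts as four spaces:

import statistics
import sys

def indentation_complexity(path, spaces_per_indent=4):
    # Assumption: one tab is worth four spaces of indentation.
    depths = []
    with open(path) as source:
        for line in source:
            if not line.strip():
                continue  # empty and blank lines carry no logical indentation
            expanded = line.replace("\t", " " * spaces_per_indent)
            leading = len(expanded) - len(expanded.lstrip(" "))
            depths.append(leading / spaces_per_indent)
    return {
        "n": len(depths),
        "total": sum(depths),
        "mean": statistics.mean(depths),
        "sd": statistics.pstdev(depths),
        "max": max(depths),
    }

if __name__ == "__main__":
    print(indentation_complexity(sys.argv[1]))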

Quick Side Trip: Negative Space

Negative space is an interesting concept. In my career as a test specialist, I’ve certainly found it helpful. Negative space becomes interesting when the absence of stuff says more than the presence of said “stuff” would. So when I’m looking at someone’s repository of tests, while I’m certainly looking at what is there, I’m often more interested in what is not there. In fact, I often like to simulate a situation where some number of tests stop being there, as in an apocalyptic “test snap”.

Consider this concept of negative space in the context of music. Musicians can use silence as a sort of beat; written as a “rest,” it has a duration all its own. The important point is that a rest is still notated, just without an actual note being played. In the context of drawn art, the definition of negative space is the area around and between a subject. That area traces the outline of a subject to reveal its overall form. Negative space in home interior design refers to the areas around objects, which can have a practical value (allowing passage through the room) but also a more subjective value (feelings of comfort and harmony rather than of constraints and clutter).

Negative space in code refers to everything that gives the code form but is not itself the code. It’s the rest note; it’s the suggestive outline; it’s the through passage that facilitates getting from one method to another. It’s also one means by which we can correlate some idea of complexity and thus perhaps some understanding of quality and design.

Let’s Analyze Some Complexity

You’ll need to grab a few scripts and put these in your scripts directory.

Once those are in place, let’s try running the analysis like this: python ../scripts/analyze_complexity.py src/dumb/dinput.c

You should see something along these lines:


n, total, mean, sd, max

522, 870.25, 1.67, 1.36, 6.0

The first column shows the number of lines analyzed in this file and the second column is the accumulated complexity across those lines. The remaining columns convey some descriptive statistics. The mean tells us that there’s some complexity but, on average, it’s about 1.67 logical indentations per line. That might not be too bad. The standard deviation (sd) indicates how much the complexity varies within the module. A relatively low number like the one we got indicates that most lines have a complexity close to the mean. Again, not too bad. Yet the max complexity column shows possible signs of trouble. A maximum logical indentation level of 6 may be considered high.

Why? Mainly because a large maximum indentation value means there is a lot of indenting, which essentially, for a C-based language, tends to mean nested conditions. What that’s likely indicating to us is that we can expect reservoirs of complexity.

Let’s look at the file dinput.c. In that code listing, search for the dumb_read_line() function. You can see there’s clearly a lot going on there. Our converging lines of complexity — lines of code from the previous post plus indentation from this one — are not leading us astray.

Let’s try the same on our other problematic module: python ../scripts/analyze_complexity.py src/dumb/doutput.c

You should see something like this:


n, total, mean, sd, max

790, 1026.0, 1.3, 1.12, 5.0

Feel free to check out doutput.c if you want to get a feel for that source module and see if it aligns with the above measures, particularly compared to dinput.c.

Analysis Guides, Not Dictates

Sometimes people will use initial crime scene evidence and jump to immediate (and possibly erroneous) conclusions. Perhaps the above has convinced you — were you a developer on Frotz — to immediately start refactoring the above modules. That, of course, would likely indicate a lot of testing for those modules is in the immediate future. But before you hop in and start refactoring, you may want to check out the complexity trend for the module in question. Keep in mind that we have a version controlled repository here! We have a lot of historical data. What this means is we can apply complexity analysis to that historical data and use that to track trends in the hotspot. Uncovering complexity trends is really important.

With that understood, while we’ve looked at one particular revision — meaning, we looked at dinput.c or doutput.c as they are right now (in the timeframe we’re considering) — we want to see a range of revisions. With that range, you then want to calculate your complexity measure for each revision. How do you know what to check? Well, remember your version control history is a log. You can look at that log:
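
For example, to see the revision history of one of our suspect files, you could run something like: git log --oneline -- src/dumb/dinput.c. The abbreviated commit hashes in that listing are exactly the kind of values you can feed into a trend analysis as start and end points.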

To aid us in looking at our revisions, you’ll need two more Python scripts to put in your scripts directory.

With those in place, I’ll show you a command I used to pick two particular revision points. python ../scripts/git_complexity_trend.py --start 58c48ed --end 16e70d0 --file src/dumb/doutput.c

Here’s the output of that:


rev, n, total, mean, sd

cd5637c, 781, 1025.75, 1.31, 1.12

2f4e06a, 792, 1029.0, 1.3, 1.12

What just happened is that we specified a range of revisions determined by the --start and --end flags. Their arguments represent our analysis period. The output provides us with the complexity statistics for each historical revision of the code. The first column specifies the commit hash from each revision’s git commit. Since we’re taking such a small range, there’s not too much surprise that there is very little change. Let’s do the same thing for our dinput.c module. python ../scripts/git_complexity_trend.py --start 58c48ed --end 16e70d0 --file src/dumb/dinput.c

We get this:


rev, n, total, mean, sd

e5d1244, 522, 870.25, 1.67, 1.36

16e70d0, 522, 870.25, 1.67, 1.36

In this range, no changes at all. So we can take a slightly longer range: python ../scripts/git_complexity_trend.py --start ffa4db5 --end 1f30311 --file src/dumb/doutput.c


rev, n, total, mean, sd

74596e7, 770, 1022.5, 1.33, 1.12

cd5637c, 781, 1025.75, 1.31, 1.12

2f4e06a, 792, 1029.0, 1.3, 1.12

ef91a84, 792, 1029.0, 1.3, 1.12

15fbca2, 790, 1026.0, 1.3, 1.12

Clearly the range of commits we’re looking at here is not enough to measure the complexity trend and determine whether the code was improving or getting worse over time. For that, we would probably be best served by looking at all of the historical commits in the Frotz repository. What you do see here, however, is the way that you would go about this kind of analysis. But, even more importantly, you see how this analysis can lead you astray if you don’t consider your clues in relation to a trend.

Cautionary Aspects to Analysis

There are some warnings to be aware of with any analysis, particularly around a subject like “complexity.” Let’s consider a few related to ours, which were lines of code (in the previous post) and indentation. Just as with actual clues, such as fingerprints, you have to look pretty closely at what you’re seeing.

As an example of what I mean in our context, different languages have different code constructs. As such, the amount of lines used in a given module of code can differ solely based on those differences in the expressivity of the language. Those changes, however, will not necessarily have an impact on true complexity. Similarly, certain languages have code constructs like comprehensions, or other aspects of functional programming, that have very few lines — and thus very few lines of code and very little indentation — but can be extremely complex.
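
For instance, a Python one-liner like the following occupies a single line with no indentation to speak of, yet it folds together two loops, a filter, and a conditional expression:

# One line, zero indentation -- but two loops, a filter, and a branch.
pairs = [(x, y if y % 2 else -y) for x in range(10) for y in range(x) if x + y > 5]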

There is also a key difference in the complexity measures we looked at. Lines of code can be reduced or enlarged and that, generally, means a demonstrable impact. But whitespace for indentation can be changed — say from 2 to 4 spaces — and that really doesn’t mean anything at all.

Everything is a clue. But whether that clue is telling you something important is up for you to determine based on your analysis.

Exploring a Larger Repository

This time we’re going to work with a much larger code repository, specifically Pygame. Pygame is a polyglot repo in that it relies on SDL (written in C) and provides an abstraction layer on top of that (written in Python). Here I’ll also remind you of the general strategy that I mentioned in the first post and that we’ve carried out in the subsequent posts:

  1. Gather the evolutionary data from a version control system.
  2. Augment that evolutionary data with complexity measures.
  3. Use the combined data to identify hotspots.

So let’s do some rapid movement through what we’ve already learned so we reinforce the general strategy of getting evidence from the crime scene as quickly as possible. First get the crime scene: git clone https://github.com/pygame/pygame.git

Make sure you cd pygame so you’re actually in the crime scene.

Gather Your Clues

Since the code base is still evolving, let’s establish a common starting point: git checkout $(git rev-list -n 1 --before="2022-01-01" main)

At the time I write this post, our range is still large, essentially accounting for the majority of development in the project.

Now let’s get our evolutionary log: git log --all --numstat --date=short --pretty=format:'--%h--%ad--%aN' --no-renames --before=2022-01-01 --after=2021-01-01 -- . ":(exclude).github/" ":(exclude)buildconfig/" ":(exclude)docs/" ":(exclude)examples/" ":(exclude)test/" > evo.log
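
If you peek inside evo.log, each commit shows up roughly like this, where the hash, date, author and numbers are made up for illustration and the numeric columns are lines added and deleted per file:

--a1b2c3d--2021-06-15--Jane Doe
12      3       src_c/display.c
5       0       src_py/camera.py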

Here I’m getting the evolution for the duration of one year. I’m also excluding a lot of stuff that we don’t want in our analysis for now. Let’s get our analysis data persisted. java -jar ../maat.jar -l evo.log -c git2 -a revisions > evidence_change.csv

Let’s get our complexity measure of lines of code: cloc . --by-file --csv --quiet --exclude-dir .github,buildconfig,docs,examples,test --report-file=evidence_lines.csv

Notice I’m excluding the same bits for both evidence streams. Now let’s merge the evidence: python ../scripts/merge_evidence.py evidence_change.csv evidence_lines.csv
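
If you’re curious what that merge amounts to, here’s a rough sketch of the idea — not necessarily what merge_evidence.py does line for line. I’m assuming the change data has entity and n-revs columns and the cloc report has filename and code columns:

import csv

def merge(change_file, lines_file):
    # Lines of code per module, keyed by filename, from the cloc report.
    # In practice you'd also normalize path separators so the two reports agree.
    loc = {}
    with open(lines_file) as lines_csv:
        for row in csv.DictReader(lines_csv):
            if row.get("filename") and row.get("code"):
                loc[row["filename"]] = row["code"]
    # Pair each module's revision count with its line count.
    print("module, revisions, code")
    with open(change_file) as change_csv:
        for row in csv.DictReader(change_csv):
            entity = row["entity"]
            if entity in loc:
                print(f"{entity}, {row['n-revs']}, {loc[entity]}")

merge("evidence_change.csv", "evidence_lines.csv")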

As the top-most bits of output, those lines of evidence show us:


module, revisions, code

setup.py, 55, 693

src_c\display.c, 33, 2155

src_c\rect.c, 26, 1718

src_py\camera.py, 18, 124

src_c\font.c, 18, 824

src_c\event.c, 18, 2002

src_c\transform.c, 18, 2254

src_c\music.c, 15, 456

src_c\mixer.c, 14, 1646

src_c\base.c, 14, 1953

src_c\rwobject.c, 13, 676

src_c\_pygame.h, 12, 216

src_c\include\_pygame.h, 12, 280

src_py\cursors.py, 12, 689

src_c\math.c, 11, 3524

src_c\_sdl2\video.c, 11, 18855

src_py\sysfont.py, 11, 298

src_c\surface.c, 11, 3274

I took those modules that had more than ten revisions. One thing immediately stands out, I think: there is more general effort in the C-based portion of the repo than in the Python-based portion. Given that SDL is ostensibly a separate library (not part of Pygame), that tells us something in and of itself.

Just to reinforce our indentation complexity work, let’s get the indentation complexity for the top modules of each code portion, saving each to a file: python ../scripts/git_complexity_trend.py --start 61e98d8 --end 489e92c --file src_c/display.c > display.c.data.csv

Check out that rise in complexity over the revisions:

Lines moving upward for the total indicate that this module has accumulated complexity over time. You can also see the standard deviation over time:

The standard deviation should ideally decrease. This means lines get more alike in terms of complexity and that would generally be considered a good thing.
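
Incidentally, those trend charts are nothing exotic: they’re just line plots over the columns in the .data.csv files we’re generating. If you want to redraw them yourself, a minimal sketch (assuming matplotlib is installed and the rev, n, total, mean, sd columns shown earlier) would be something like:

import csv
import matplotlib.pyplot as plt

def plot_trend(path, column):
    # Column names here assume the rev, n, total, mean, sd header shown earlier.
    revs, values = [], []
    with open(path) as trend_csv:
        for row in csv.DictReader(trend_csv):
            revs.append(row["rev"])
            values.append(float(row[column]))
    plt.plot(range(len(values)), values)
    plt.xticks(range(len(revs)), revs, rotation=90, fontsize=6)
    plt.ylabel(column)
    plt.title(f"{path} ({column})")
    plt.show()

plot_trend("display.c.data.csv", "total")  # accumulated complexity over revisions
plot_trend("display.c.data.csv", "sd")     # spread of complexity over revisions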

Let’s do the same analysis for the Python file: python ../scripts/git_complexity_trend.py --start 61e98d8 --end 489e92c --file src_py/camera.py > camera.py.data.csv

Here is the total trend:

Here is the trend in the standard deviation:

Create Your Visualizations

When you’re analyzing complexity trends, the interesting thing isn’t the numbers themselves but the shape of the evolutionary curve. Shapes are incredibly important in testing and here I’ll remind readers of my thoughts around considering geometry or topology when it comes to conceptualizing testing.

Now let’s generate our visualizations: java -jar ../maat.jar -l evo.log -c git2 -a revisions > metric_data.csv

You can get a tree map view:

Let’s get our enclosure diagrams: python ../scripts/generate_evidence_json.py --structure evidence_lines.csv --weights evidence_change.csv > data.json
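
For what it’s worth, the data.json that the enclosure diagram consumes is a nested structure: directories containing files, where each leaf carries a size (lines of code) and a weight (revisions scaled against the biggest hotspot). Expressed as a Python literal, the shape is roughly like this — the field names and the weight values here are illustrative assumptions, not a guarantee of the exact schema:

# Illustrative shape only -- the field names and weight values are assumptions.
enclosure = {
    "name": "root",
    "children": [
        {"name": "src_c", "children": [
            {"name": "display.c", "size": 2155, "weight": 1.0},
        ]},
        {"name": "src_py", "children": [
            {"name": "camera.py", "size": 124, "weight": 0.55},
        ]},
    ],
}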

Here’s the Python code representation:

And here’s the C code representation:

Much of the above is what you have done in these posts already. So now let’s head down a side path to refine our notion of analysis and the types of clues we look for.

A Focus on Temporal Coupling

Spatial coupling is something you can see from the structure of the code. You can even see it from the negative space of the code. But there’s another type of coupling that I think is even more important. Temporal coupling is a type of dependency that you really can’t deduce just by looking at the code.

In our analyses to this point, we looked at modules that seemed to be changing together. That’s what temporal coupling is: modules changing together. This is different from traditional coupling in that there may not be any explicit software or language dependencies between modules. There is instead a hidden, or implicit, dependency. This means a change in one module predictably results in a change in the coupled module.
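
To make that concrete, here’s a toy sketch of the counting idea — the concept, not what our analysis tooling actually does internally. Treat each commit as the set of modules it touched and give every pair in that set a co-change tick; a module’s “sum of coupling,” which we’ll use in a moment, is just the total of its pairwise ticks. The commits below are made up for illustration:

from collections import Counter
from itertools import combinations

# Hypothetical commits, each represented as the set of modules it touched.
commits = [
    {"src_c/event.c", "src_c/display.c"},
    {"src_c/event.c", "src_c/mixer.c", "src_c/display.c"},
    {"src_py/camera.py", "src_py/_camera_opencv.py"},
]

pair_counts = Counter()
for touched in commits:
    for pair in combinations(sorted(touched), 2):
        pair_counts[pair] += 1  # these two modules changed together in this commit

# A module's sum of coupling is the total of its pairwise co-changes.
soc = Counter()
for (left, right), count in pair_counts.items():
    soc[left] += count
    soc[right] += count

print(pair_counts.most_common(3))
print(soc.most_common(3))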

Let’s dive right in and see how we can generate some clues around this concept. To get started, we’ll use what’s called “sum of coupling” analysis. The idea behind sum of coupling is that we look at how many times each module has been coupled to another one in a commit and then sum that measure up. java -jar ../maat.jar -l evo.log -c git2 -a soc

Here’s some output you’ll see from that:


entity, soc

src_c/event.c, 239

src_c/mixer.c, 219

src_c/display.c, 219

src_c/surface.c, 210

src_c/base.c, 210

src_c/mouse.c, 209

src_c/key.c, 208

src_c/joystick.c, 207

src_c/font.c, 207

src_c/constants.c, 204

What we see here, once again, is that the C portion of the code repository is more coupled than the Python portion. From the above, we can see that the event.c module changes the most with other modules, followed closely by mixer.c and our friend display.c.

Incidentally, on that list our other friend, camera.py, shows up as:


entity, soc

src_py/camera.py, 80

Okay, so we have some top contenders for modules with the most temporal coupling. The next step is to find out which modules they are coupled to. Let’s see if we can figure that out: java -jar ../maat.jar -l evo.log -c git2 -a coupling

What this will do is give you a list of modules (entities) along with the modules each is coupled to and how strongly.


entity, coupled, degree, avg-revs

src_py/midi.py, src_py/pkgdata.py, 90, 6

src_c/video.c, src_c/video_window.c, 84, 10

src_c/video_renderer.c, src_c/video_texture.c, 84, 10

src_c/video_texture.c, src_c/video_window.c, 73, 10

src_c/video_renderer.c, src_c/video_window.c, 66, 9

The degree specifies the percent of shared commits. The higher the number, the stronger the coupling. For example, video.c and video_window.c change together in 84% of all commits. Which probably makes sense. What might make less sense, at least at a glance, is why it’s the case that a change to midi.py seems to be coupled to changes in pkgdata.py — in 90% of all commits! The last column is used to provide a weighted number of total revisions for the modules being reported on. You would generally use this to filter out modules that haven’t been revised enough to be indicative of anything.
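
As a rough illustration of how I read that degree number — and this is my interpretation for illustration, not a statement about the tool’s exact internals — the degree is the count of shared commits expressed as a percentage of how often the two modules are revised on average. With made-up numbers:

# Hypothetical counts, purely for illustration.
shared_commits = 42   # commits that touched both modules
revs_module_a = 55    # total revisions of the first module
revs_module_b = 45    # total revisions of the second module

avg_revs = (revs_module_a + revs_module_b) / 2
degree = 100 * shared_commits / avg_revs
print(f"degree: {degree:.0f}%, avg-revs: {avg_revs:.0f}")  # degree: 84%, avg-revs: 50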

If you look at the full list of data, you should see this item in there as well:


entity, coupled, degree, avg-revs

src_py/_camera_opencv.py, src_py/camera.py, 56, 13

That’s our friend camera.py and we see what it’s tightly coupled to. What we don’t see on that list is our other friend, display.c. That suggests that it’s not coupled to anything enough to show up on our report.

Analyze the Coupling Clues

Just as in a crime scene, where you often have to follow a trail of clues and certainly have to look at more than one, the same applies here. It’s really important to look at these lists and parse them. The goal is to come to an understanding of what your data is telling you. By way of example, you can see some other interesting things if you aggregate the data:


entity, coupled, degree, avg-revs

src_c/_sdl2/video.c, src_c/_sdl2/video_window.c, 50, 10

src_c/_sdl2/video.c, src_c/_sdl2/video_renderer.c, 45, 11

src_c/_sdl2/video.c, src_c/_sdl2/video_texture.c, 52, 10

src_c/_sdl2/video.c, src_c/_sdl2/video_image.c, 52, 10

And:


entity, coupled, degree, avg-revs

src_c/_sdl2/video_image.c, src_c/_sdl2/video_renderer.c, 73, 10

src_c/_sdl2/video_image.c, src_c/_sdl2/video_window.c, 70, 9

src_c/_sdl2/video_image.c, src_c/_sdl2/video_texture.c, 87, 8

And:


entity, coupled, degree, avg-revs

src_c/_sdl2/video_renderer.c, src_c/_sdl2/video_texture.c, 84, 10

src_c/_sdl2/video_renderer.c, src_c/_sdl2/video_window.c, 70, 10

And:


entity, coupled, degree, avg-revs

src_c/video_renderer.c, src_c/video_texture.c, 84, 10

src_c/video_renderer.c, src_c/video_window.c, 66, 9

This is giving a very good view of how the video subsystem evolves as part of the code base. Given the apparent names of those modules, all of that might make a lot of sense. But what about these?


entity, coupled, degree, avg-revs

src_c/base.c, src_c/mixer.c, 45, 11

src_c/event.c, src_c/mixer.c, 38, 13

src_c/font.c, src_c/mixer.c, 38, 13

The percentages aren’t staggeringly high but, just from the names, it’s unclear why the coupling exists. Why would changing the font handling require me to change the mixer, which handles the playback of sounds?

In any code base, certain things changing together make a lot of sense. Those are likely the places where there is a direct dependency and thus the coupling is explicit. But you can also come across code modules that do not have an explicit dependency between them and, yet, seem to be changing together more often than you might expect. The latter is the most important thing to ferret out in the kind of analysis I just showed you, precisely because it’s something that wouldn’t be obvious just by looking at the code itself.

Further Analysis …

At this point, things get tricky; at least from where we’re at in these posts. What you would likely want to do now is dig in and see why the temporal coupling that the clues are showing you exists in the first place.

Probably what you would do is take the individual files in question and look for the shared commits, performing a diff analysis on the modules to see exactly what’s changing together. In some ways, this would be leveraging the techniques I’ve already shown you in terms of the scripts to run and how to run them.
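
As a starting point, one way to dig out the shared commits for a suspicious pair — say font.c and mixer.c from above — is simply to intersect their commit histories. Here’s a minimal sketch, assuming you run it from the repository root:

import subprocess

def commits_touching(path):
    # Abbreviated hashes of every commit that touched this file.
    result = subprocess.run(
        ["git", "log", "--format=%h", "--", path],
        capture_output=True, text=True, check=True,
    )
    return set(result.stdout.split())

shared = commits_touching("src_c/font.c") & commits_touching("src_c/mixer.c")
print(f"{len(shared)} shared commits")
for rev in sorted(shared):
    print(rev)  # each of these is a candidate for a closer diff analysis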

Doing that, as you can imagine, is best illustrated by working with a repository you know fairly well and digging into the details. What I found is that it would be hard to show that level of analysis in these posts in a way that would be engaging or illustrative. So instead I’m going to try to show some of that in the next post, but at a higher level.

So what next?

What I want you to get from where we’re at now in these posts is the following:

  1. Start your analysis from the hotspots in the system.
  2. From there, identify other modules that have a temporal coupling to the hotspots.

In the next post in this series, we’ll dive into a much more nuanced and focused way of looking at a particular code repository. We’ll leverage the ideas that I’ve shown in this series so far but we’ll use a different style of analysis to better understand our code and what quality problems it may be hiding.


