AI and Testing: Improving Retrieval Quality, Part 2

In the previous post, we set up a test experiment around DeepEval and used DeepEval’s evaluation function to establish a quality baseline. That post ended by noting the need for experiments whose results can be checked against that baseline, and that’s what we’ll do in this post.

AI and Testing: Improving Retrieval Quality, Part 1

In the previous post on Contextual Precision, we diagnosed a critical problem in our RAG system: poor retrieval quality was causing failures that we also observed in the Faithfulness post. In this first of three related posts, we’re going to dig in a bit. This will be our first extended example of what testing a generative AI really looks like.

AI and Testing: Evaluation and DeepEval

In previous posts in this series, I’ve largely been talking about how to use local LLMs by writing scripts and, along the way, I’ve been able to shoehorn in some testing ideas. We even wrote a bespoke test script together. In this post, I’m going to focus more specifically on testing by considering the idea of evaluation.

AI and Testing: Personal Marketability

In the posts in this series, I’ve been taking you through a lot of concepts and tooling. That’s going to continue but, for this post, it felt prudent to take a little break and talk about why doing all this can matter. That gets into interviewing and potentially being hired.

AI and Testing: Evaluating the Future

As our technocracy continues to grow and as (at least some) technologists continue to push us toward a potentially dehumanized and dehumanizing future, I want to focus on how we can work from within this technocracy to make sure that human experimentation is front and center.

Testing for Quality, Betting on Value

There’s an irony worth noting with my previous posts on Hollywood quality and gaming quality: testing exists, in part, to mitigate risk, but only by helping people understand the risks that exist. Yet quality itself often requires reasoned and reasonable risk-taking! Let’s dig into this.

The Sunk Cost of Quality: Lessons from Gaming’s Biggest Failures

In the first part of this series, I examined how Hollywood’s financial model (commit hundreds of millions upfront, spend it all before release, then discover whether audiences agree with your projections) creates a high-stakes gamble on predicting quality. Studios bet enormous sums on forecasting how diverse audiences will perceive value years in the future, often with not-so-great results. The gaming industry faces a strikingly similar challenge, but with crucial differences that make the quality prediction problem even more complex.

The Sunk Cost of Quality: Lessons from Hollywood’s Biggest Failures

When we talk about quality assurance in software, we often treat quality as something measurable, testable, and, if we’re honest, somewhat objective. Does the code work? Does it meet requirements? Does it perform under load? However, quality isn’t entirely objective. It’s a shifting perception of value over time, influenced by customer expectations, cultural context, and changing needs. To understand why this matters, let’s step outside software for a moment and look at an industry that bets hundreds of millions of dollars on predicting quality: Hollywood.
