AI-Powered Testing: Exploring and Exploiting with Reinforcement

There’s a lot of talk out there about using large language models to help testers write tests, such as by coming up with scenarios. There’s also talk about AI-based tools actually doing the testing. Writing tests and executing tests are both forms of performing testing. So let’s talk about what this means in a human and an AI context. Continue reading AI-Powered Testing: Exploring and Exploiting with Reinforcement
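
To make the “exploring and exploiting” idea concrete, here’s a minimal epsilon-greedy sketch in Python. It isn’t code from the post; the actions and payoff values are invented purely to show how an agent balances trying new things against repeating what has worked so far.

```python
import random

# Toy epsilon-greedy agent: think of each "action" as a kind of test to run,
# with an unknown chance of paying off (e.g., revealing a problem).
# All payoff values here are invented for illustration.
true_payoffs = [0.2, 0.5, 0.8]          # hidden value of each action
estimates = [0.0] * len(true_payoffs)   # the agent's running estimates
counts = [0] * len(true_payoffs)
epsilon = 0.1                           # fraction of the time we explore

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(len(true_payoffs))                        # explore
    else:
        action = max(range(len(true_payoffs)), key=lambda a: estimates[a])  # exploit
    reward = 1.0 if random.random() < true_payoffs[action] else 0.0
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

print("estimated payoffs:", [round(e, 2) for e in estimates])
```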

Text Trek: Navigating Classifications, Part 6

In this final post of the series, we’ll look at training our learning model on our Emotions dataset. This post is the culmination of everything we learned in the first three posts and then implemented in the previous two. So let’s dig in for the final stretch! Continue reading Text Trek: Navigating Classifications, Part 6
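
As a rough sketch of what that training step can look like, here’s a compressed example using the Hugging Face Trainer on the emotion dataset. The checkpoint, epochs, and batch size are assumptions for illustration, not necessarily what the post itself uses.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Assumed setup: DistilBERT fine-tuned on the six-label "emotion" dataset.
dataset = load_dataset("emotion")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pad everything to a common length so the default collator can stack it.
    return tokenizer(batch["text"], padding=True, truncation=True)

encoded = dataset.map(tokenize, batched=True, batch_size=None)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6)

args = TrainingArguments(output_dir="emotion-model",
                         num_train_epochs=2,
                         per_device_train_batch_size=64)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
```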

Text Trek: Navigating Classifications, Part 5

This post and the next will bring together everything we’ve learned in the previous four posts of this text classification series. Here we’re going to take the Emotions dataset we looked at in the last post and feed it to a model. Continue reading Text Trek: Navigating Classifications, Part 5
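
A minimal sketch of that hand-off, assuming the Hugging Face datasets and transformers libraries: tokenize a small batch of the Emotions data and push it through a sequence classification model to confirm the shapes line up. The checkpoint is an assumption.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed setup: the "emotion" dataset and a DistilBERT classification head
# sized for its six labels.
dataset = load_dataset("emotion")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6)

# Tokenize a small batch of training texts and run it through the model
# before any fine-tuning, just to see the plumbing work end to end.
batch = dataset["train"][:8]
inputs = tokenizer(batch["text"], padding=True, truncation=True,
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)  # torch.Size([8, 6]): one score per label per text
```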

Text Trek: Navigating Classifications, Part 4

In this post, we’re going to look at the Emotions dataset that we briefly investigated in the previous post. Here we’re going to consider the basis of that dataset. Then we’ll load it up and see what we have to do in order to feed the data to a training model. Continue reading Text Trek: Navigating Classifications, Part 4
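
For a quick look at what that dataset exposes, here’s a small sketch assuming the Hugging Face datasets library; the printed label names are what the publicly hosted emotion dataset provides.

```python
from datasets import load_dataset

# Load the emotion dataset and inspect what we'd be feeding to a model.
dataset = load_dataset("emotion")

print(dataset)                                   # train / validation / test splits
print(dataset["train"].features)                 # "text" (string) and "label" (ClassLabel)
print(dataset["train"].features["label"].names)  # sadness, joy, love, anger, fear, surprise
print(dataset["train"][0])                       # one raw example: text plus an integer label
```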

Text Trek: Navigating Classifications, Part 3

In this post, we’ll explore some particular datasets. The focus here is just to get a feel for what can be presented to you and what’s available for you to use. We’ll do a little bit of code in this post to get you used to loading a dataset. Continue reading Text Trek: Navigating Classifications, Part 3
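
If you want a preview of that code, loading a dataset tends to be a one-liner with the Hugging Face datasets library (an assumption about the tooling; “imdb” is just a readily available example, not necessarily one the post uses):

```python
from datasets import load_dataset

# Pull a dataset down from the Hugging Face Hub and take a quick look.
dataset = load_dataset("imdb")

print(dataset)              # the available splits and how many rows each has
print(dataset["train"][0])  # a single labeled example as a plain Python dict
```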

Text Trek: Navigating Classifications, Part 2

Here we’ll continue directly from the first post, where we were learning the fundamentals of dealing with text that we plan to send to a learning model. Our focus was on the tokenization and encoding of that text. These are fundamentals I’ll reinforce further in this post. Continue reading Text Trek: Navigating Classifications, Part 2
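
As a small refresher of those fundamentals, here’s what tokenizing and encoding a sentence looks like with a pretrained tokenizer (the DistilBERT checkpoint is an assumption; any pretrained tokenizer shows the same idea):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

text = "Testing is a form of exploration."
encoded = tokenizer(text)

print(encoded["input_ids"])                                   # the integer IDs a model consumes
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the subword tokens behind those IDs
```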

Text Trek: Navigating Classifications, Part 1

Let’s start this “Thinking About AI” series by thinking about the idea of classifying text. Classifying, broadly speaking, relates to testing quite well. This is because, at its core, classification focuses on categorizing data and making decisions based on that data. More broadly, as humans, we tend to classify just about everything into categories. Continue reading Text Trek: Navigating Classifications, Part 1

Thinking About AI

Many are debating the efficacy of artificial intelligence as it relates to the practice and craft of testing. Perhaps not surprisingly, the loudest voices tend to be the ones who have the least experience with the technology beyond just playing around with ChatGPT here and there and making broad pronouncements, both for and against. We need to truly start thinking about AI critically and not just reacting to it if we want those with a quality and test specialty to have relevance in this context. Continue reading Thinking About AI

AI Testing – Generating and Transforming, Part 3

We come to the third post of this particular series (see the first and second) where I’ll focus on an extended example that brings together much of what I’ve been talking about but also shows the difficulty of “getting it right” when it comes to AI systems and why testing is so crucial. Continue reading AI Testing – Generating and Transforming, Part 3

AI Testing – Generating and Transforming, Part 2

This post continues on from the first one. Here I’m going to break down the question-answering model we looked at so that we can understand what it’s actually doing. What I show, while decidedly simplified, is essentially what tools like ChatGPT are doing. This will set us up for a larger example. So let’s dig in! Continue reading AI Testing – Generating and Transforming, Part 2
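
For reference, the kind of question-answering model being broken down can be exercised in a few lines with the transformers pipeline API; the default model the pipeline downloads, and the toy question and context, are assumptions for illustration.

```python
from transformers import pipeline

# Extractive question answering: the model picks the span of the context
# most likely to answer the question and reports a confidence score.
qa = pipeline("question-answering")

result = qa(
    question="What does the model return?",
    context=("An extractive question-answering model returns a span of the "
             "context along with a confidence score for that span."),
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```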

AI Testing – Generating and Transforming, Part 1

The idea of “Generative AI” is very much in the air as I write this post. What’s often lacking is some of the ground-level understanding to see how all of this works. This is particularly important because the whole idea of “generative” concepts is really focused more on the idea of transformations. So let’s dig in! Continue reading AI Testing – Generating and Transforming, Part 1
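
To see “generation as transformation” in the smallest possible form, here’s a hedged sketch with a text-generation pipeline; GPT-2 is used only because it’s small and freely available, not because the post uses it.

```python
from transformers import pipeline

# A generative model transforms an input sequence into a continuation of it.
generator = pipeline("text-generation", model="gpt2")

output = generator("Testing an AI system means", max_new_tokens=20)
print(output[0]["generated_text"])
```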

AI Testing – Measures and Scores, Part 2

In the first part of this post, I used a simple binary classification task to show some ideas around measures and scores and then provided some running commentary on how the tester mindset and skillset can situate in that context. That post was about depth; this post will be more about breadth. Continue reading AI Testing – Measures and Scores, Part 2

AI Testing – Measures and Scores, Part 1

There are various evaluation measures and scores used to assess the performance of AI systems. As someone adopting a testing mindset in this context, those measures and scores are very important. Beyond simply understanding them as a concept, it’s important to see how they play out with working examples. That’s what I’ll attempt in this post. Continue reading AI Testing – Measures and Scores, Part 1
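
As a taste of those working examples, the common classification measures can be computed from a handful of invented predictions (the numbers below are made up purely to show the calculations; scikit-learn is an assumption about tooling):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Toy binary classification results: 1 = defect present, 0 = no defect.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # of predicted positives, how many were right
print("recall   :", recall_score(y_true, y_pred))     # of actual positives, how many we caught
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))               # rows: actual class, columns: predicted class
```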

Human and AI Learning, Part 2

In part 1 of this post, we talked about a human learning to play a game like Elden Ring and overcome its challenges. We looked at some AI concepts in that particular context. One thing we didn’t do, though, is talk about assessing quality risks with testing based on that learning. So let’s do that here. Continue reading Human and AI Learning, Part 2

Human and AI Learning, Part 1

Humans and machines both learn. But the way they do so is very different. Those differences provide interesting insights into quality and thus into the idea of testing for risks to quality. I’ve found that one way to conceptualize this is in the context of games. Even if you’re not a gamer, I think this context has a lot to teach. So let’s dig in! Continue reading Human and AI Learning, Part 1

The Spectrum of AI Testing: Case Study

In the previous post in this series, I talked about testability, and the various aspects of it, in relation to testing a product that’s been AI-enabled in some way. In this post, I’ll focus on a specific case study and apply the thinking from the previous post to that study. Continue reading The Spectrum of AI Testing: Case Study

The Spectrum of AI Testing: Testability

It’s definitely time to talk seriously about testing artificial intelligence, particularly in contexts where people might be working in organizations that want to have an AI-enabled product. We need more balanced statements of how to do this rather than the alarmist statements I’m seeing more of. So let’s dig in! Continue reading The Spectrum of AI Testing: Testability

AI Test Challenge Follow Up

So, not surprisingly, the AI test tooling community didn’t want to engage with my AI test challenge. They saw it as being inherently unfair. And, to a certain extent, it could be. But what this really showcased was that people are talking about AI and test tooling with far too much hyperbole in an attempt to gain traction. So was my test challenge unfair? Is there too much hyperbole? Let’s dig in a bit. Continue reading AI Test Challenge Follow Up