In two earlier posts I traced the path from Aristotelian to Galilean thinking as a way of understanding how testing developed as a discipline: how competing models of quality, and the slow maturation of experimental method, gave rise to something we might actually recognize as testing today. This post sits in that same current of thought, but takes a step back to ask a prior question: what is it that makes any of that development so difficult in the first place?
Author: Jeff Nyman
AI Was Already Here: Loud Opinions vs. Precise Understanding
Previously I talked about the idea of personal marketability when it comes to learning AI. That was in the context of an AI and Testing series, so I trust what was being learned was obvious. I’m seeing an interesting trend developing among ardent opponents of any and all AI, particularly on social media platforms like LinkedIn. This is my attempt at a social response to that.
DSPy and RAG: Grounding Answers in Documents
In the previous post we looked at the idea of building up a pipeline with DSPy. In this post, we’ll put that idea into action with something we looked at in my AI and Testing series: RAG.
DSPy Pipelines: Wiring Steps Without Writing Prompts
In the first post, we got everything set up to start exploring DSPy. Here we’ll continue that journey by looking at the idea of pipelines.
DSPy: Declaring Instead of Prompting
In my AI and Testing series, which ran for a couple of months, I focused heavily on the testing side of things. I now want to consider AI in some specific contexts that you are likely to come across and show how those contexts work. The first that I’ll focus on is a tool called DSPy.
AI and Testing: Evaluation Synthesis
This series has now covered eight metrics across two evaluation paradigms. We’ve applied them to a warp drive paper, a mass extinction paper, a philosophical essay on colliders and cosmology, and an essay on time travel and metaphysics. Each post introduced a metric, showed what it catches, and explained what its scores mean in context. This post does something different: it stops introducing new metrics and instead asks what the full set reveals when used together.
AI and Testing: Evaluating Conversations
In the previous posts in the DeepEval series, we built up a diagnostic framework for evaluating RAG systems, covering Faithfulness, Contextual Precision, Contextual Recall, Contextual Relevancy, and G-Eval. All of those metrics operate on single turns: one question, one retrieval context, one response, one score. In this post we’ll move into different territory: conversational evaluation.
AI and Testing: Recall, Relevancy, and Richer Evaluation
In the previous posts we looked at the Faithfulness and Contextual Precision metrics with DeepEval, and started building an intuition for how retrieval failures cascade into generation failures. Those two metrics told us what went wrong and where in the pipeline. In this post, we’ll add three more tools to the diagnostic kit: Contextual Recall, Contextual Relevancy, and G-Eval.
The Last Useful Animal
In my previous posts, I’ve been talking a lot about AI technology and tooling, and any enthusiasm within those posts came from helping people see how and where to test AI, and keep it explainable, interpretable, testable and, thus, ultimately trustable. All that being said, I have serious concerns about what AI is potentially doing to us: not to our test pipelines, but to us as a civilization. Fair warning: this is a bit of a Thomist indictment of techno-oligarchy.
Testing the “Yes-Man” in Your Pocket
If you’ve been following my recent posts on how to test AI, you know that evaluating Large Language Models (LLMs) requires an entirely different mindset than traditional software testing. We’re no longer just testing for crashes, latency, or even factual hallucinations. As AI becomes deeply integrated into our daily lives, we have to start testing for psychological and behavioral impacts.
AI and Testing: From Specification to Story
In the previous post, we built a formal ontology from the Z-Machine specification, used it to drive code generation, and then turned it into a test oracle. At the end of that work, I hinted that the next step might be to put an LLM directly in the loop and watch it actually play one of those games. That’s what this post does.
AI and Testing: From Ontology to Implementation
In the previous post, we looked at setting up an ontology based on a Z-Machine specification. Our goal was to get this in place so that we could have an LLM generate code to implement the portion of the ontology that we described. In this post, we’ll attempt exactly that.
AI and Testing: From Specification to Ontology
In the previous posts we looked at setting up a graph pipeline and auditing that pipeline. All of this was based on an ontology, but one that was minimally constructed. Let’s dig more into ontologies here, specifically in relation to an actual specification for some actual software.
AI and Testing: Auditing a Knowledge Graph Pipeline
In the previous post we looked at the code for an entire pipeline that uses a lightweight ontology to guide extraction and construct a queryable knowledge graph from unstructured text. Here, we’ll look at auditing what this pipeline is doing.
AI and Testing: A Knowledge Graph Pipeline in Practice
In the previous post we talked about the conceptual basis of knowledge graphs and ontologies and pointed toward the code we’ll be using. In this post, we’re going to dive into that code and put the concepts into action.
AI and Testing: Knowledge Graphs and Ontologies
If you’ve been following this series, you’ve seen how local LLMs can be used for everything from basic inference to evaluation frameworks. This post takes a different angle. Rather than asking what a model knows, we’re going to ask how we can take what a model reads and turn it into structured, queryable knowledge.
AI and Testing: Using Model Pipelines for Testing
In the previous post, we took a simple web app and checked whether a model could generate test cases from the app, analyze the code of that app, and generate automation based on those test cases. Here we’ll refine that process a bit by considering a source of truth and considering different models working together to create a pipeline. We’ll even sneak DeepEval back in.
AI and Testing: Using Local Models for Testing
Writing comprehensive test cases means understanding every component, state transition, and edge case in your application. Can an AI model look at a web application and figure out what needs testing? Well, let’s find out. We’ll give a local AI model the HTML for a bomb defusal simulator, ask it to analyze the code, and see if it can generate meaningful test cases, then convert those into working Playwright scripts.
AI and Testing: Improving Retrieval Quality, Part 4
We did a lot of testing to determine retrieval quality issues in parts one, two, and three. Here I’m going to close off this thread by considering a test variation that we have neglected up to this point.
AI and Testing: Improving Retrieval Quality, Part 3
In the previous post we ran four experiments attempting to improve our RAG system’s retrieval quality through parameter tuning: smaller chunks, more retrieval, both combined, and semantic chunking. Every experiment either maintained the baseline failure or made it worse. Let’s continue investigating!