In the previous posts in the DeepEval series, we built up a diagnostic framework for evaluating RAG systems, covering Faithfulness, Contextual Precision, Contextual Recall, Contextual Relevancy, and G-Eval. All of those metrics operate on single turns: one question, one retrieval context, one response, one score. In this post we’ll move into different territory: conversational evaluation.
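The single-turn shape those metrics share can be pictured as a simple record, while a conversational case has to carry a whole dialogue. Here's a minimal plain-Python sketch of the difference — illustrative dataclasses only, not DeepEval's actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class SingleTurnCase:
    # One question, one retrieval context, one response: the shape
    # behind Faithfulness, Contextual Precision/Recall/Relevancy.
    input: str
    retrieval_context: list[str]
    actual_output: str

@dataclass
class ConversationalCase:
    # A conversational case is an ordered list of turns; metrics now
    # have to judge the dialogue as a whole, not each turn in isolation.
    turns: list[tuple[str, str]] = field(default_factory=list)  # (role, content)

convo = ConversationalCase(turns=[
    ("user", "What is the refund window?"),
    ("assistant", "Refunds are available within 30 days."),
    ("user", "And after that?"),
])
print(len(convo.turns))  # 3
```

Notice that the third turn only makes sense given the first two — that dependency between turns is exactly what single-turn metrics can't see.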
AI and Testing: Recall, Relevancy, and Richer Evaluation
In the previous posts we looked at the Faithfulness and Contextual Precision metrics with DeepEval, and started building an intuition for how retrieval failures cascade into generation failures. Those two metrics told us what went wrong and where in the pipeline. In this post, we’ll add three more tools to the diagnostic kit: Contextual Recall, Contextual Relevancy, and G-Eval.
Continue reading AI and Testing: Recall, Relevancy, and Richer Evaluation
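As a rough mental model, both of these contextual metrics boil down to a ratio over LLM-judged verdicts — they just use different denominators. Contextual Recall asks how much of the expected output the retrieved context covers; Contextual Relevancy asks how much of the retrieved context is actually on topic. A sketch in plain Python (the verdicts come from an LLM judge in DeepEval; these function names are illustrative, not DeepEval's API):

```python
def contextual_recall(statement_supported: list[bool]) -> float:
    """One verdict per statement in the *expected output*: can it be
    attributed to the retrieved context?"""
    if not statement_supported:
        return 0.0
    return sum(statement_supported) / len(statement_supported)

def contextual_relevancy(chunk_relevant: list[bool]) -> float:
    """One verdict per statement in the *retrieved context*: is it
    relevant to the input question?"""
    if not chunk_relevant:
        return 0.0
    return sum(chunk_relevant) / len(chunk_relevant)

print(contextual_recall([True, True, False, True]))  # 0.75
print(contextual_relevancy([True, False, False]))    # one of three chunks on topic
```

The asymmetry is the useful part: low recall means the retriever missed information you needed, while low relevancy means it dragged in noise you didn't.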
The Last Useful Animal
In my previous posts, I’ve been talking a lot about AI technology and tooling, and any enthusiasm in those posts came from helping people see how and where to test AI and keep it explainable, interpretable, testable and, thus, ultimately trustworthy. That said, I have serious concerns about what AI is potentially doing to us: not to our test pipelines, but to us as a civilization. Fair warning: this is a bit of a Thomist indictment of techno-oligarchy.

Testing the “Yes-Man” in Your Pocket
If you’ve been following my recent posts on how to test AI, you know that evaluating Large Language Models (LLMs) requires an entirely different mindset than traditional software testing. We’re no longer just testing for crashes, latency, or even factual hallucinations. As AI becomes deeply integrated into our daily lives, we have to start testing for psychological and behavioral impacts.
AI and Testing: From Specification to Story
In the previous post, we built a formal ontology from the Z-Machine specification, used it to drive code generation, and then turned it into a test oracle. At the end of that work, I hinted that the next step might be to put an LLM directly in the loop and watch it actually play one of those games. That’s what this post does.
Continue reading AI and Testing: From Specification to Story
AI and Testing: From Ontology to Implementation
In the previous post, we looked at setting up an ontology based on a Z-Machine specification. Our goal was to get this in place so that we could have an LLM generate code to implement the portion of the ontology that we described. In this post, we’ll attempt exactly that.
Continue reading AI and Testing: From Ontology to Implementation
AI and Testing: From Specification to Ontology
In the previous posts we looked at setting up a graph pipeline and auditing that pipeline. All of this was based on an ontology, but one that was minimally constructed. Let’s dig deeper into ontologies here, specifically in relation to an actual specification for real software.
Continue reading AI and Testing: From Specification to Ontology
AI and Testing: Auditing a Knowledge Graph Pipeline
In the previous post we looked at the code for an entire pipeline that uses a lightweight ontology to guide extraction and construct a queryable knowledge graph from unstructured text. Here, we’ll look at auditing what this pipeline is doing.
Continue reading AI and Testing: Auditing a Knowledge Graph Pipeline
AI and Testing: A Knowledge Graph Pipeline in Practice
In the previous post we talked about the conceptual basis of knowledge graphs and ontologies and pointed toward the code we’ll be using. In this post, we’re going to dive into that code and put the concepts into action.
Continue reading AI and Testing: A Knowledge Graph Pipeline in Practice
AI and Testing: Knowledge Graphs and Ontologies
If you’ve been following this series, you’ve seen how local LLMs can be used for everything from basic inference to evaluation frameworks. This post takes a different angle. Rather than asking what a model knows, we’re going to ask how we can take what a model reads and turn it into structured, queryable knowledge.
Continue reading AI and Testing: Knowledge Graphs and Ontologies
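The end product of that kind of pipeline — structured, queryable knowledge — is essentially a set of subject–predicate–object triples. Real pipelines store these in a graph database, but a toy sketch shows the shape of the data (this is a generic illustration, not the code from the post):

```python
class TripleStore:
    """A toy knowledge graph: facts as (subject, predicate, object)
    triples, queryable by any combination of fields."""

    def __init__(self):
        self.triples = set()

    def add(self, s: str, p: str, o: str) -> None:
        self.triples.add((s, p, o))

    def query(self, s=None, p=None, o=None) -> list[tuple[str, str, str]]:
        # None acts as a wildcard, so query(p="runs_on") finds every
        # fact with that predicate regardless of subject or object.
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

kg = TripleStore()
kg.add("Z-Machine", "is_a", "virtual machine")
kg.add("Zork", "runs_on", "Z-Machine")
print(kg.query(p="runs_on"))  # [('Zork', 'runs_on', 'Z-Machine')]
```

The ontology's job, in this picture, is to constrain which predicates and entity types the extraction step is allowed to emit.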
AI and Testing: Using Model Pipelines for Testing
In the previous post, we took a simple web app and checked whether a model could generate test cases from it, analyze its code, and produce automation based on those test cases. Here we’ll refine that process a bit by introducing a source of truth and having different models work together as a pipeline. We’ll even sneak DeepEval back in.
Continue reading AI and Testing: Using Model Pipelines for Testing
AI and Testing: Using Local Models for Testing
Writing comprehensive test cases means understanding every component, state transition, and edge case in your application. Can an AI model look at a web application and figure out what needs testing? Well, let’s find out. We’ll give a local AI model the HTML for a bomb defusal simulator, ask it to analyze the code, and see if it can generate meaningful test cases, then convert those into working Playwright scripts.
Continue reading AI and Testing: Using Local Models for Testing
AI and Testing: Improving Retrieval Quality, Part 4
We did a lot of testing to determine retrieval quality issues in parts one, two, and three. Here I’m going to close off this particular thread by considering a test variation that we have neglected up to this point.
Continue reading AI and Testing: Improving Retrieval Quality, Part 4
AI and Testing: Improving Retrieval Quality, Part 3
In the previous post we ran four experiments attempting to improve our RAG system’s retrieval quality through parameter tuning: smaller chunks, more retrieval, both combined, and semantic chunking. Every experiment either maintained the baseline failure or made it worse. Let’s continue investigating!
Continue reading AI and Testing: Improving Retrieval Quality, Part 3
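One knob those experiments kept turning is chunking. A basic fixed-size chunker with overlap — the generic kind of splitter being tuned, not the exact one from the posts — might look like this:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.
    Smaller chunks and bigger overlap change what the retriever can
    match, which is exactly the kind of parameter the experiments varied."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

print(len(chunk_text("a" * 500, chunk_size=200, overlap=50)))  # 3
```

The overlap exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk — which is also why shrinking chunks without adjusting overlap can quietly make retrieval worse rather than better.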
AI and Testing: Improving Retrieval Quality, Part 2
In the previous post we set up a test experiment around DeepEval and used DeepEval’s evaluation function to establish a quality baseline. That post ended with the need to run experiments against that baseline, and that’s what we’ll do in this post.
Continue reading AI and Testing: Improving Retrieval Quality, Part 2
AI and Testing: Improving Retrieval Quality, Part 1
In the previous post on Contextual Precision, we diagnosed a critical problem in our RAG system: poor retrieval quality was causing failures that we also observed in the Faithfulness post. In this first of four related posts, we’re going to dig in a bit. This will be our first extended example of what testing a generative AI really looks like.
Continue reading AI and Testing: Improving Retrieval Quality, Part 1
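The diagnostic power of Contextual Precision comes from it being rank-weighted: a relevant chunk buried at the bottom of the retrieval results scores worse than the same chunk at the top. The arithmetic can be sketched in plain Python — in DeepEval an LLM judge decides relevance, so this is an illustration of the scoring shape, not DeepEval's API:

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over the retrieved chunks, in the spirit
    of DeepEval's Contextual Precision: precision@k is accumulated only
    at ranks where a relevant chunk appears, then averaged."""
    relevant_total = sum(relevance)
    if relevant_total == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision@k at this relevant rank
    return score / relevant_total

# The same relevant chunk, ranked first vs. ranked last:
print(contextual_precision([True, False, False]))  # 1.0
print(contextual_precision([False, False, True]))  # ~0.33
```

That sensitivity to ordering is why a system can pass a plain "did we retrieve the right chunk?" check and still fail this metric.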
AI and Testing: Contextual Precision
In the previous post we looked at the Faithfulness metric with DeepEval and got some intuitions in place about how to start thinking about using metrics in general. In this post, we’ll look at a third metric.
AI and Testing: Faithfulness
In the previous post we looked at the Answer Relevancy metric with DeepEval and got some intuitions in place about how to start thinking about using metrics in general. In this post, we’ll look at a second metric that requires no faith but is all about being faithful.
AI and Testing: Answer Relevancy
In the previous post we got set up with DeepEval. Here we’re going to put that tool to use by looking at our first test case and our first quality metric.
AI and Testing: Evaluation and DeepEval
In previous posts in this series, I’ve largely been talking about how to use local LLMs by writing scripts and, along the way, I’ve been able to shoehorn in some testing ideas. We even wrote a bespoke test script together. In this post, I’m going to focus more specifically on testing by considering the idea of evaluation.