As our technocracy continues to grow and as (at least some) technologists continue to push us toward a potentially dehumanized and dehumanizing future, I want to focus on how we can work from within this technocracy to make sure that human experimentation is front and center.

Moving Beyond the Model: The Art of AI Evaluation
When we talk about “Testing AI,” most people immediately think of the Large Language Model (LLM) itself. They focus on the engine while ignoring the vehicle. However, in the context of custom enterprise applications, we aren’t just testing a model; we’re testing a system.
This is important! More businesses are relying on AI to be infused into their applications. But they (usually) aren’t just building a simple architecture where users are “chatting” with a bot. They’re building complex architectures where the AI is one component in a larger machine. Specifically, enterprises are leveraging AI through what have become well-defined patterns.
- Retrieval-Augmented Generation (RAG): Think of this as the AI’s “Library.” The model doesn’t just guess based on its training; it’s forced to look up your specific company documents (the source of truth) before it speaks.
- Agentic Workflows: This is the AI as a “Project Manager.” It’s given a goal and allowed to use tools (like querying a SQL database, searching the web, or calling an API) to complete a multi-step task.
- Structured Output Pipelines: Here, the AI acts as a “Data Translator,” taking messy, unstructured human text and turning it into clean, actionable code or JSON that other parts of the business software can understand.
In traditional software testing, we tend to look for determinism: “If I click X, Y must happen.” In AI testing, we deal with stochasticity (probabilistic outcomes). Because an AI might describe the same truth in three different ways, our tests can’t just look for an exact string of text. We have to test the intent, the accuracy, and the logic.
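To make that concrete, here is a minimal sketch in Python of why exact-match assertions break down. The three responses, the expected string, and the `contains_required_facts` helper are all hypothetical, invented for illustration. A real evaluator would use semantic comparison rather than substring checks, but the shape of the test is the same.

```python
# Three hypothetical responses that state the same truth in different words.
responses = [
    "Refunds are available within 30 days of purchase.",
    "You can get a refund up to 30 days after you buy.",
    "Customers have 30 days from purchase to request a refund.",
]

expected = "Refunds are available within 30 days of purchase."


def exact_match(response: str, expected: str) -> bool:
    """Traditional deterministic check: passes only on identical text."""
    return response == expected


def contains_required_facts(response: str, facts: list[str]) -> bool:
    """Crude stand-in for an intent check: every key fact must appear."""
    lowered = response.lower()
    return all(fact.lower() in lowered for fact in facts)


# Assumption: these two fragments capture the "truth" we care about.
required_facts = ["30 days", "refund"]

exact_results = [exact_match(r, expected) for r in responses]
fact_results = [contains_required_facts(r, required_facts) for r in responses]

print(exact_results)  # only the first response matches word for word
print(fact_results)   # all three responses carry the required facts
```

The exact-match check rejects two perfectly good answers, while the fact-based check accepts all three. Substring matching is obviously brittle in its own way, which is part of why dedicated evaluators exist.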
This is where the concept of evaluators comes in. If the LLM is the witness giving testimony, the evaluator is the forensic expert checking that testimony against the physical evidence. We aren’t testing if the AI is “smart”; we’re testing if the system is faithful to the data and relevant to the user’s need.
The Distinction: Potential vs. Actuality
It’s helpful to distinguish between the model’s general capability and its specific application.
- The Model (Potential): A raw LLM is a vast reservoir of statistical probabilities. Testing it in isolation tells you what it can do in a vacuum.
- The Application (Actuality): Your enterprise application is the “implementation layer.” It includes your prompts, your Retrieval-Augmented Generation (RAG) pipelines, and your business logic.
If I had to go with an analogy here, I would say that testing a raw LLM is like testing the quality of a chef’s knife. Testing an AI application is like tasting the final Beef Bourguignon. A sharp knife is necessary (or at least really helpful!), but it doesn’t guarantee the stew isn’t too salty.
The Wider Ambit of Evaluation
Evaluating these AI-infused applications requires us to look at the “Three Pillars” of output quality for this context.
- Answer Relevance: Does the response actually address the user’s intent, or is it a “hallucination” of a correct-sounding but irrelevant answer?
- Faithfulness (Groundedness): Is the answer derived strictly from the provided context (the “source of truth”), or is the model drawing on external, potentially outdated training data?
- Contextual Precision: How well did the system retrieve the right information to begin with?
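Of the three, contextual precision is the easiest to sketch without a model in the loop, because it is fundamentally about ranking. The following assumes one common formulation (the mean of precision@k, taken at each rank where a relevant chunk appears) and assumes the relevance labels are already known. In practice, a judge model or a human reviewer would supply those labels. The function name and the sample rankings are hypothetical.

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Score a ranked list of retrieved chunks.

    relevance[i] is True when the chunk at rank i + 1 is relevant.
    Relevant chunks ranked near the top push the score toward 1.0.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0

    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank
    return weighted_sum / total_relevant


# Same two relevant chunks retrieved, but in different positions.
good_ranking = [True, True, False, False]
poor_ranking = [False, False, True, True]

good_score = contextual_precision(good_ranking)
poor_score = contextual_precision(poor_ranking)
print(good_score, poor_score)
```

Both rankings retrieved the same relevant material, yet the second scores much lower because the useful chunks were buried beneath irrelevant ones. That is the kind of retrieval problem that never shows up if you only read the final answer.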
The “LLM-as-a-Judge”
To handle this at scale, we can use specialized tooling like DeepEval or RAGAS. These tools employ a meta-cognitive approach often called LLM-as-a-Judge. I’m not sure that’s a great name, actually. I prefer to think of this as a forensic audit of a conversation. By this I mean that we use a secondary, often more powerful, model to act as an objective evaluator. It looks at the “evidence” (the retrieved context) and the “verdict” (the generated answer) to determine if the logic holds up. This allows us to turn qualitative human intuition into quantitative, reproducible metrics like contextual recall and answer correctness.
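The shape of that audit can be sketched in a few lines of Python. To be clear, this is not the API of DeepEval or RAGAS; it is a hand-rolled illustration. The `Verdict` class, the `audit` function, the `stub_judge` placeholder, and the 0.7 threshold are all assumptions made for this sketch. The stub only checks whether each sentence of the answer appears verbatim in the context, whereas a real judge would prompt a second model to assess each claim.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: float  # 0.0 (unsupported) to 1.0 (fully supported)
    passed: bool


# A judge takes (retrieved_context, answer) and returns a score in [0, 1].
Judge = Callable[[str, str], float]


def stub_judge(context: str, answer: str) -> float:
    """Placeholder for a real judge model (an assumption for this sketch).

    Scores the fraction of answer sentences found verbatim in the context.
    """
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(1 for s in sentences if s.lower() in context.lower())
    return supported / len(sentences)


def audit(context: str, answer: str, judge: Judge, threshold: float = 0.7) -> Verdict:
    """Run the judge and turn its score into a pass/fail verdict."""
    score = judge(context, answer)
    return Verdict(score=score, passed=score >= threshold)


context = "The warranty lasts two years. Repairs are free during the warranty."

grounded = audit(context, "The warranty lasts two years.", stub_judge)
ungrounded = audit(
    context,
    "The warranty lasts two years. Shipping is free worldwide.",
    stub_judge,
)
print(grounded)
print(ungrounded)
```

The useful design point is that `audit` does not care what the judge is. Swapping `stub_judge` for a function that calls an actual model leaves the rest of the test unchanged, and the score plus threshold is what turns a judgment call into a repeatable metric.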
Setting the Stage
There’s a lot more I could say here. However, what I want to do in this series of posts is show rather than just tell. What I can say is that we’re moving from a world of “vibe-based testing” (where we just ask the AI a few questions and see if we like the answers) to a world of rigorous evaluators. While you may not be testing the LLMs directly, you are testing the stewardship of a given model’s power within your specific domain.
This, I would argue, is a large part of the ethical mandate of test specialists who are working with companies that are choosing to leverage AI as part of their operations, whether those are customer-facing or not.
I happen to be a Thomist in orientation and, in Thomistic terms, the LLM is the material cause (the raw potential), but your application is the formal cause that gives it specific shape and purpose. Testing just the LLM is like studying the properties of marble when you should be evaluating the structural integrity of the cathedral built from it.
So, to set the stage for what’s coming, what I want to do in this series of posts is introduce the concepts and tooling that help us look at the cathedral. This will require readers to dig in a bit. I will be using Python primarily, if not exclusively, and I will be showing testers how to actually use these tools in light of the broader concepts.
I should note that I find this is an area entirely lacking in the modern testing space. I see a lot of testers who are vocally worried about AI (perhaps justifiably), but I see very few of them actually learning to work with these tools in a way that gives them marketability in a future that is already here.
Previously I had talked about the ethical mandate around a mistake specialty. As we all know, AI can make mistakes. In fact, Claude tells you that right at the bottom of any prompt you care to make.

So does ChatGPT.

And that’s wonderful! Self-awareness is a great thing. But that leaves a question: how do we “double-check the responses” and “check the important info” at scale? That is a great bit of land for testers who are keeping ahead of the curve to stake out.
Thus, our ethical mandate, far from being removed or replaced, has actually expanded. The landscape of our mistake specialist focus has widened. Along with that mandate and that focus, this series will be an adjunct to my existing “Testing AI” posts.
If this all sounds interesting, join me in this new series where I will cover quite a bit of the landscape, from Ollama to LangChain to DeepEval, with a specific emphasis on how testers can stake out their own territory in the rapidly evolving world of AI adoption.