AI and Testing: Evaluation and DeepEval

In previous posts in this series, I’ve largely been talking about how to use local LLMs by writing scripts and, along the way, I’ve been able to shoehorn in some testing ideas. We even wrote a bespoke test script together. In this post, I’m going to focus more specifically on testing by considering the idea of evaluation.

AI Evaluation vs. Traditional Testing

When we talk about “traditional testing” in software, we usually mean deterministic checks: does this function return the expected output for a given input? Did the API call succeed? Is the data structure valid? Pass or fail.

AI systems generally don’t work that way. The same prompt can produce different responses across runs. There’s no single “correct” answer to “explain quantum entanglement.” To be sure, there are better and worse explanations. There are more or less accurate explanations. There are clearer or more confusing explanations. But there’s no single “correct” response. So, instead of binary pass/fail tests, we need evaluation: scoring outputs along multiple dimensions like relevance, accuracy, coherence, or helpfulness.
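To make the contrast concrete, here is a minimal sketch in Python. The scoring function is a deliberately crude keyword-overlap stand-in (my assumption, purely for illustration, and not how DeepEval or RAGAS score anything). The shape of the check is what matters: a score compared against a threshold rather than an exact match.

```python
def add(a: int, b: int) -> int:
    return a + b


# Traditional test: one input, one expected output, pass or fail.
assert add(2, 3) == 5


def keyword_score(response: str, expected_keywords: list) -> float:
    """Toy relevance score: the fraction of expected keywords present.

    Hypothetical stand-in for a real metric, for illustration only.
    """
    text = response.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords)


def passes(score: float, threshold: float = 0.5) -> bool:
    """Evaluation-style check: good enough, rather than exactly equal."""
    return score >= threshold


response = (
    "Entangled particles share a quantum state, "
    "so measuring one tells you about the other."
)
score = keyword_score(
    response, ["entangled", "quantum state", "measuring", "locality"]
)
print(score, passes(score))
```

Notice that a different but equally good explanation could earn the same score, which is exactly what an exact-match assertion cannot express.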

This is where frameworks like DeepEval and RAGAS come in. They use an “LLM-as-a-Judge” approach, essentially asking a language model to assess whether another model’s output meets certain quality criteria. Think of it as peer review where one AI examines another’s work against specific rubrics: Is this answer relevant to the question? Does it hallucinate facts? Is the tone appropriate?

These frameworks are particularly valuable for RAG-based applications (where models retrieve external information before answering) and AI agents (where models take actions based on reasoning). In both cases, you need to know not just whether the system ran, but whether it produced outputs that were actually useful, accurate, and aligned with your intentions.

The Judge Paradox: “Who Tests the Tester?”

A reasonable objection comes in: if we’re using an LLM to evaluate another LLM’s outputs, haven’t we just kicked the problem up one level? Now we need to trust the judge model, which means we would presumably need yet another judge to evaluate that one, and so on forever.

Yes, this feels like a paradox, but it’s actually closer to something we already accept in other fields: reference standards.

Think about how we measure things in the physical world. To verify a scale’s accuracy, what do you use? Well, you use calibrated weights. Okay, but, how do you know those weights are accurate? You compare them against higher-precision standards at a metrology lab. And how do you trust those standards? They’re compared against national standards. Eventually you hit a physical constant.

The kilogram, for example, used to be defined by a platinum-iridium cylinder in Paris; now it’s defined by Planck’s constant.

The point isn’t infinite regress. It’s establishing a baseline you’re willing to trust at some level. With LLM evaluation, here is how I would frame the practical answer.

Human spot-checking establishes ground truth. You don’t trust the judge blindly; you validate its judgments against human evaluation on a sample of cases. If the judge agrees with human raters, say, 85 to 90 percent of the time across diverse examples, you’ve calibrated it.
The judge doesn’t need to be perfect; it needs to be consistently useful. If the judge catches 80 percent of hallucinations, helps you identify weak retrieval, and flags tone problems reliably, it’s providing value, even if it occasionally misjudges edge cases.
You’re trading exhaustive human review for scalable automated screening. Testing ten outputs manually is feasible. Testing ten thousand generally isn’t. The judge lets you process volume, then you investigate failures or low-scoring cases yourself.
Multiple judges can cross-validate. You’re not stuck with one arbiter. Run the same outputs through DeepEval, then RAGAS, possibly using different models, and then human review on a subset. Where they agree, you have confidence. Where they diverge, you investigate.
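
The calibration idea in these points can be sketched in a few lines of Python. The labels below are made up for illustration, and the function names are my own. The sketch computes how often a judge agrees with human raters on a sample and lists the cases worth a manual look.

```python
def agreement_rate(judge: list, human: list) -> float:
    """Fraction of sampled cases where the judge matches the human verdict."""
    if len(judge) != len(human):
        raise ValueError("judge and human label lists must be the same length")
    if not judge:
        raise ValueError("need at least one labeled case")
    matches = sum(1 for j, h in zip(judge, human) if j == h)
    return matches / len(judge)


def disagreements(judge: list, human: list) -> list:
    """Indexes of the cases where judge and human diverge."""
    return [i for i, (j, h) in enumerate(zip(judge, human)) if j != h]


# Hypothetical verdicts on ten sampled outputs (True = acceptable).
human_labels = [True, True, False, True, False, True, True, False, True, True]
judge_labels = [True, True, False, True, True, True, True, False, True, True]

rate = agreement_rate(judge_labels, human_labels)
print(f"agreement: {rate:.0%}")
print("review by hand:", disagreements(judge_labels, human_labels))
```

The same comparison works between two judges instead of a judge and a human: the disagreement list is where you would spend your investigation time.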

The “infinite regress” problem is a philosophical concern, but in practice, one level of meta-evaluation (humans validating the judge on representative samples) is usually sufficient to establish some level of trust. After that, it’s just engineering: monitoring performance, adjusting thresholds, and iterating when the judge’s behavior drifts.

Think of it less like “proving the judge is perfect” and more like “calibrating an instrument.” You check it against known standards periodically, but you don’t re-verify every single measurement. If you think about it, this is largely in line with how we often consider “regression” in the context of testing.

What Is DeepEval?

I’m going to focus on DeepEval in the coming posts, so let’s get a handle on what that actually is. DeepEval is an open-source Python framework designed to evaluate LLM outputs using customizable metrics. Unlike what we might call “traditional testing frameworks” that check for exact matches or type correctness, DeepEval assesses quality in terms of whether responses are relevant, accurate, coherent, faithful to source material, and contextually appropriate.

DeepEval is particularly well-suited for testing RAG pipelines, where you need to verify not just that your system retrieved documents, but that it used them correctly and didn’t hallucinate facts. It also works for evaluating conversational agents, summarization tasks, and, really, any scenario where LLM or generative behavior needs systematic quality checks at scale.

For our purposes here, the nice thing is that DeepEval integrates with local models via Ollama, meaning you can run evaluations entirely on your own infrastructure without sending data to external APIs. That’s a key advantage not just for learning purposes but also for saving costs! In a work context, this also aids privacy-sensitive applications or when working with proprietary information.

In this post, I’ll walk through setting up DeepEval with a local model. This is all in service to the upcoming posts of showing you how to build evaluation test cases that actually tell you whether your AI system is working as intended.

Getting DeepEval

To start with DeepEval, there are a few steps to go through. We have to install DeepEval as a dependency. That’s easy enough:

  python -m pip install deepeval

Remember that, for Python, you generally want to be in your virtual environment. I long for the day when Python will have something equivalent to Node and NPM’s node_modules.

Now that we have it, how do we use it? This is where a lot of tutorials will tell you that you need to get an OpenAI API key and possibly buy some credits in order to use the public GPT servers. They tell you this because the default operating mode of DeepEval is to use GPT 4 for all of its evaluation metrics.

However, in keeping with the general trend of the posts in this series, we’re going to keep everything local and free.

Getting Our Models

For this next series of posts, I’ve provided a set of models we can use: a TesterStories Reasoner model and a TesterStories Evaluator model.

I should note that these models are not strictly necessary for the posts. You can use other models, including those you have already downloaded. I’ll cover specifics like that when it becomes relevant. That said, these models were tuned a bit to assist with the upcoming material.

If you want to get my models, just execute the following two commands:

  ollama pull jeffnyman/ts-reasoner
  ollama pull jeffnyman/ts-evaluator

Those should be downloaded to your machine, just as the other models we’ve used in this series were. (You can check by running ollama list.) Let’s make sure the models work.

  ollama run jeffnyman/ts-reasoner

Keep in mind that all the resource usage points we talked about in previous posts apply here. How quickly things run depends on your local machine resources. My models don’t do anything better or worse than models you’ve been using up to this point.

Once you get the prompt, type in this:


Who are you?

You should probably get something along the lines of the model telling you that it is “TesterStories Execution.” You can exit the model with:


/bye

Let’s try the same for the other model:

  ollama run jeffnyman/ts-evaluator

And once again, have it self-identify:


Who are you?
/bye

This one will probably tell you that it is “TesterStories Judge.”
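
If you would rather run this check from a script than from the interactive prompt, here is a minimal sketch using only the Python standard library. It assumes Ollama is serving on its default local address (http://localhost:11434) and uses Ollama's /api/generate endpoint with streaming turned off. The function names are my own.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the model's full text response."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]


def check_models() -> None:
    """Ask both models to self-identify (requires Ollama to be running)."""
    for model in ("jeffnyman/ts-reasoner", "jeffnyman/ts-evaluator"):
        print(model, "->", ask(model, "Who are you?"))
```

Calling check_models() with Ollama running should print each model's self-description. The generous timeout is deliberate, since the first call to a model includes the time it takes to load it.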

Model Breakdown

These are two specialized models that work together to demonstrate different aspects of AI testing with DeepEval.

The ts-reasoner is your thinking partner. Built on Gemma 3, it approaches problems step-by-step, making its reasoning visible and explicit. When you ask it a question, it’s designed to walk you through its thought process: identifying assumptions, considering edge cases, and building toward conclusions methodically. This transparency makes it decent for exploring how LLMs construct answers and where they might go wrong.

The ts-evaluator is your quality assessor. Based on Qwen 2.5, it’s designed to judge outputs rigorously and fairly. Give it something to evaluate (an answer, a test result, a piece of reasoning), and it breaks down strengths, weaknesses, and gaps systematically. It provides balanced feedback with clear justification, mimicking the kind of critical analysis a good code reviewer or test lead would apply.

Think of these models as two sides of the experimentalist testing coin: one creates, one critiques. Throughout some of these next posts, we’ll use ts-reasoner to generate responses, then ts-evaluator to assess them using DeepEval’s metrics. This separation lets us see how different models approach generation versus evaluation, and demonstrates why having specialized tools for different parts of your testing workflow matters.

Both models are configured with system prompts that emphasize evidence-based thinking, risk awareness, and clear communication. The idea is that you can use them together for comprehensive testing workflows, or independently for specific tasks.

I’m still tuning these models and that brings up a good point. The reality of software construction with AI is that new models are dropping all the time. Prompts that behaved one way with a previous model may behave differently with a new one. For these posts, (1) I will not deploy backward-breaking changes for reasoning and (2) since you’re running locally, you would only get model tunings if you re-downloaded the models, which you will not have to do.

Model Considerations

You might remember that back in Ollama and Models, I talked about how models are configured. For ts-reasoner, the basis of the model is:

  parameters: 4.3 billion
  quantization: Q4_K_M
  temperature: 1
  top_k: 64
  top_p: 0.95

This model signature is used because testing workflows need models that are practical to run repeatedly. When you’re iterating through test cases, generating multiple responses, or running evaluation suites, you want something lightweight that still produces quality outputs. These settings strike that balance: it’s small enough to run efficiently on typical hardware, but sophisticated enough to demonstrate genuine reasoning patterns. This mirrors real-world testing scenarios where you optimize for speed and repeatability without sacrificing utility.

For ts-evaluator, the basis of that model is:

  parameters: 7.62 billion
  quantization: Q4_K_M
  temperature: 0.7
  top_k: 20
  top_p: 0.8

Here, I needed more analytical horsepower. This model signature brings stronger reasoning capabilities to the evaluation task, which matters because judging outputs is cognitively harder than generating them. An evaluator needs to understand context deeply, identify subtle flaws, compare against quality standards, and provide nuanced feedback. Think of it like code review versus code writing: reviewing well requires at least as much expertise as the original work, often more. Using a more capable model for evaluation ensures the assessments are thorough and reliable, which is exactly what you want when measuring the quality of your testing workflows.

This asymmetry of lightweight generation with robust evaluation reflects a practical testing philosophy: make your execution fast and repeatable, but ensure your quality gates are strong enough to catch real issues.

For the purposes of these posts, I’ll ask you to imagine that the ts-reasoner is the model your company is going to use as a customer-facing integration into your platform. It’s effectively the model you are testing.

Change the DeepEval Default

Here’s where we get into some interesting logistics. I mentioned before that DeepEval counts on using GPT 4 for its default evaluator. When you are working locally, you want to change that. So, let’s walk through what it means to change the default evaluator.

If you’ve gone through previous posts in this series, you have various models installed, like Qwen 3 and Qwen 2.5. The next instructions assume you have one of those. That said, it doesn’t matter which model you use here for this next part.

Execute the following command (and replace your model name with whatever one you want; this is just for illustration):

  deepeval set-ollama --model qwen2.5:latest --save=dotenv:.env

You should see some output like this:


Congratulations! You're now using a local Ollama model qwen2.5:latest for all evals that require an LLM.

That command is setting the Qwen 2.5 model, running via Ollama, as your judge model. You’re telling DeepEval: “For any metrics that need an LLM to evaluate outputs, use this local Ollama model instead of the default public GPT models.”

Now, why am I having you do that if we just installed my specific models? This is mainly to show you what configuration is possible but I’ll revisit this concept in the next post when we write a test script. If you want DeepEval to go back to its default model settings at any time, you could do this:

  deepeval unset-ollama

DeepEval Configuration

That set-ollama command we just ran also creates and/or updates an .env file with the Ollama model configuration, which DeepEval will read when you run your scripts. Note that if you already had this file, with your LangSmith properties in there from the previous posts, that’s fine: they won’t be deleted. That said, you might not want to be conflating all your settings. If that’s the case for you, you could set things up with a different file. For example:

  deepeval set-ollama --model qwen2.5:latest --save=dotenv:.env.local

You can technically use whatever filename you want, but do note that DeepEval autoloads an .env.local, if present. If that file is not present, it attempts to autoload an existing .env file.

The command also creates a .deepeval directory and, within that, a .deepeval file. There is one key that DeepEval does not add to your .env file, so you will need to add it yourself:


LOCAL_MODEL_NAME=qwen2.5:latest

If you used a different model, you would just make sure to use the model name of whatever you used.
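
As a quick sanity check, here is a minimal sketch that mirrors the lookup order described above (.env.local first, then .env) and reports whether the key is present. It is a simplified stand-in written for this post, not DeepEval's own loader, and it only handles plain KEY=VALUE lines.

```python
from pathlib import Path
from typing import Optional


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def pick_env_file(directory: Path) -> Optional[Path]:
    """Return .env.local if it exists, otherwise .env, otherwise None."""
    for name in (".env.local", ".env"):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


def missing_keys(values: dict, required: list) -> list:
    """List the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Hypothetical file contents, for illustration.
sample = "# DeepEval settings\nLOCAL_MODEL_NAME=qwen2.5:latest\n"
print(missing_keys(parse_env_text(sample), ["LOCAL_MODEL_NAME"]))
```

Against a real project you would call pick_env_file(Path.cwd()), confirm it found something, and feed that file's text to parse_env_text.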

There is yet another problem you may run into, so let’s head it off. In that .deepeval file, you must have this:


"LOCAL_MODEL_API_KEY": "ollama"

You can just plop it right down at the end of all the other keys. My file looks like this, by way of example:


{"USE_ANTHROPIC_MODEL": "NO", "USE_AWS_BEDROCK_MODEL": "NO", "USE_AZURE_OPENAI": "NO", "USE_DEEPSEEK_MODEL": "NO", "USE_GEMINI_MODEL": "NO", "USE_GROK_MODEL": "NO", "USE_LITELLM": "NO", "USE_LOCAL_MODEL": "YES", "USE_MOONSHOT_MODEL": "NO", "USE_OPENAI_MODEL": "NO", "USE_PORTKEY_MODEL": "NO", "USE_OPENROUTER_MODEL": "NO", "LOCAL_MODEL_BASE_URL": "http://localhost:11434/", "OLLAMA_MODEL_NAME": "qwen2.5:latest", "LOCAL_MODEL_API_KEY": "ollama"}

The reason you need that last one is because otherwise DeepEval will default to looking for an API key for OpenAI.
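
Since that file is plain JSON, a few lines of Python can confirm that the two settings discussed here are in place. This is a sketch of my own, assuming the file has the same shape as my example above. The function name is hypothetical.

```python
import json


def check_local_settings(raw_json: str) -> list:
    """Return a list of problems found in the .deepeval settings text."""
    settings = json.loads(raw_json)
    problems = []
    if settings.get("USE_LOCAL_MODEL") != "YES":
        problems.append("USE_LOCAL_MODEL is not set to YES")
    if not settings.get("LOCAL_MODEL_API_KEY"):
        problems.append("LOCAL_MODEL_API_KEY is missing")
    return problems


# Hypothetical, trimmed-down file contents for illustration.
good = '{"USE_LOCAL_MODEL": "YES", "LOCAL_MODEL_API_KEY": "ollama"}'
incomplete = '{"USE_LOCAL_MODEL": "YES"}'

print(check_local_settings(good))
print(check_local_settings(incomplete))
```

To run it against the real file, read the text first, for example with pathlib: Path(".deepeval/.deepeval").read_text().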

What all of this has done is treat Qwen 2.5 as the default model for evaluation. There is an alternative, however, to doing all of this, and it’s one I’m going to use in upcoming posts. That alternative is that you can specify your model directly in code and, if you do that, DeepEval will override the default model.

If that’s the case, why go through all this if there’s such an easy alternative? The nice thing is that if you forget to include a model in your code, DeepEval will now default to another local model and not GPT 4. You could, for example, have used one of my models as your default. Beyond that, knowing how DeepEval is configured does not hurt. I find a lot of tutorials don’t necessarily cover this all that well.
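
For a rough idea of what the in-code alternative looks like, here is a sketch based on DeepEval's documented OllamaModel class and the model parameter that its metrics accept. Treat it as an illustration under assumptions: the function name, the threshold value, and the choice of metric are mine, and the next posts may well set things up differently.

```python
def build_local_relevancy_metric(model_name: str = "qwen2.5:latest"):
    """Create an answer-relevancy metric judged by a local Ollama model.

    Requires DeepEval to be installed. The imports live inside the
    function so this file can still be loaded where it is not.
    """
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.models import OllamaModel

    # Assumption: Ollama is serving on its default local address.
    judge = OllamaModel(model=model_name, base_url="http://localhost:11434")

    # Passing model= explicitly overrides whatever default is configured.
    return AnswerRelevancyMetric(model=judge, threshold=0.7)
```

With this approach the judge is visible right where the metric is defined, which makes a script easier to read later than one that silently depends on a configuration file.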

Possible Timeout Issues

That’s about it for configuration but I’m going to interrupt the flow here to anticipate an issue you may run into. As we go along, you might get something like this at some point while executing your scripts:


tenacity.RetryError: RetryError[< ... raised TimeoutError>]

That tenacity.RetryError with a raised TimeoutError means DeepEval tried to get a response from your Ollama model, didn’t get one in time, kept retrying (that’s what the tenacity library does; it retries failed operations), and eventually gave up. Assuming you’ve confirmed that Ollama is running and the model is loaded, the likely issue here is that your model is slow and/or overloaded on your particular machine.

This is a particular concern with reasoning models like DeepSeek R1 or Qwen 3, where inference can take a long time, potentially exceeding DeepEval’s timeout.
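
If the retry-then-give-up behavior described above feels abstract, here is a small standard-library sketch of the pattern. It is not the tenacity library and not DeepEval's actual retry logic. It only shows why a model that is consistently too slow ends in a final error, while one that is occasionally slow recovers.

```python
import time


def call_with_retries(operation, attempts: int = 3, wait_seconds: float = 0.0):
    """Call operation(), retrying on TimeoutError, then give up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError as error:
            last_error = error
            if attempt < attempts:
                time.sleep(wait_seconds)
    # Every attempt timed out: surface one final error wrapping the cause.
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


attempts_seen = []


def flaky_model_call() -> str:
    """Hypothetical stand-in for a slow model: times out twice, then answers."""
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise TimeoutError("model did not answer in time")
    return "answer"


print(call_with_retries(flaky_model_call, attempts=3))
```

Retrying only helps when slowness is intermittent. If every attempt needs more time than the limit allows, the fix is a longer limit or a faster model.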

Should this happen, one thing you can try is to add a timeout value to your .env file like this:


DEEPEVAL_PER_TASK_TIMEOUT_SECONDS_OVERRIDE=900

You can set this to a high value to increase the total time allowed for a single metric or test case to finish. A value of fifteen minutes (the 900 seconds above) should hopefully be plenty, at least for the examples I’ll be going through, but, again, it’s hard to predict since it’s so hardware dependent.

A point worth repeating here is that AI is compute-heavy. Certain reasoning models are particularly slow because they “think” through problems step-by-step. Hardware factors that can affect this are primarily CPU vs GPU. If Ollama is running entirely on CPU (no GPU detected), inference is much slower. A GPU can be ten to fifty times faster depending on the model. Also relevant is GPU VRAM. Models need to fit in GPU memory. If a model is too large, it might spill over to CPU (slow), swap between GPU/RAM (very slow), or just not run at all.

The point being that timeout issues are often hardware-dependent. Someone running DeepEval on a machine with a modern GPU (like an RTX 4090) might never encounter timeouts, while someone on a laptop CPU might hit them frequently with the exact same code running against the exact same model.

A Possible Proxy Issue

You might find that you get an error about “SSL: CERTIFICATE_VERIFY_FAILED” with something called us.i.posthog.com when running DeepEval. This is typically caused by a network environment issue, such as a corporate firewall or proxy that intercepts HTTPS traffic using a self-signed certificate, or an outdated local Python certificate bundle. The error originates from the telemetry component (PostHog) within DeepEval. Disabling telemetry is the simplest, and often recommended, solution if the connection issue persists.

The easiest way to deal with this is to set an environment variable called DEEPEVAL_TELEMETRY_OPT_OUT to 1. How you do this depends on your operating system:

Windows (Command Prompt): set DEEPEVAL_TELEMETRY_OPT_OUT=1
Windows (PowerShell): $env:DEEPEVAL_TELEMETRY_OPT_OUT = "1"
macOS / Linux: export DEEPEVAL_TELEMETRY_OPT_OUT=1

Note that setting the variable this way only lasts for the current terminal session. To make it permanent, add the export line to your shell profile (e.g., ~/.bashrc or ~/.zshrc on macOS/Linux), or add it to your system’s environment variables on Windows.

If you prefer to handle this in code rather than at the shell level, you can set the environment variable at the top of your script using Python’s os module, before importing DeepEval, since the telemetry client is initialized at import time:

import os
os.environ["DEEPEVAL_TELEMETRY_OPT_OUT"] = "1"

# Now import DeepEval
from deepeval import evaluate

The order matters here: if you import DeepEval first and set the variable afterward, the telemetry client will already have started and the opt-out will have no effect.

You can read more about this setting on the DeepEval data privacy page.

The best advice I can offer is to either keep this in mind for future posts, if you want to go the script route, or set the environment variable once and never worry about it.

What About OpenAI?

If you do want to run against OpenAI, you can follow the steps at DeepEval’s OpenAI section.

To do this, you will need to get an OpenAI key (that part is free). Head to the OpenAI platform and log in and/or sign up. Then, from their interface, create an API key. It never hurts to have it in case you want to play around with distributed models, but do note that you will likely have to purchase credits for anything useful.

As far as costs, you must add a payment method and purchase credits (current minimum $5) before your API key will function. As you make API calls, costs are deducted from your prepaid balance in real-time. If your credits run out, the service stops; you won’t be billed extra automatically. You can enable an auto-recharge feature to automatically add more credits when your balance falls below a certain threshold, but I don’t recommend that.

All this said, my posts are entirely going to be focused on using local models. However, nothing at all with the code I will show you would fundamentally change if you did decide you wanted to switch to another model, whether that be OpenAI or Anthropic (Claude) or Grok or, really, anything else.

Next Steps!

This was a relatively simple post, just getting you set up. In the next post, we’ll start digging into some of the metrics that DeepEval provides. These metrics are what make DeepEval an evaluator and, as such, a test-supporting tool.

Stories from a Software Tester

Twice upon a time, in another space, no distance in any direction from here …