In the previous post we got set up with DeepEval. Here we’re going to put that tool to use by looking at our first test case and our first quality metric.

Before we get cracking, let me state at the outset that evaluation frameworks can feel abstract at first. There’s often this gap between “Hey, look at me, I built something with an LLM!” and “Wait, how do I know if it’s giving me reliable, relevant, or accurate responses?”
DeepEval is like a test harness for LLM behavior. Not unit tests for code, but repeatable checks for system output quality. You’re basically doing testing, but the system under test is a probabilistic text generator. This means your tests need to be tolerant, metric-driven, and diagnostic rather than strictly pass/fail. At least at first.
In the previous posts, we built our own infrastructure to run tests. That was instructive, perhaps, but DeepEval provides that infrastructure for us and focuses our attention on defining what quality means for each test case. Our first tests will focus on the metric of relevancy.
Relevant Answers Only, Please!
I think it’s safe to assume that when you ask someone a question, you expect an answer that actually addresses what you asked. If you ask “What time is the meeting?” and someone responds with “Meetings are important for collaboration,” they’ve said something related to meetings, but they haven’t answered your question.
LLMs can fall into the same trap, especially because they’re trained to generate coherent, contextually related text. However, “related to the topic” and “answers the specific question” aren’t the same thing. The LLM might generate fluent, coherent text that’s topically relevant but doesn’t actually come anywhere near to answering what you asked. This is where answer relevancy comes in.
I’m going to introduce two specific DeepEval concepts to you here: LLMTestCase and AnswerRelevancyMetric. Let’s approach this gradually.
Initial Test Case
Go ahead and create a Python script and let’s start it off like this:
```python
from deepeval.test_case import LLMTestCase

question = "What does the Higgs boson explain in particle physics?"

good_case = LLMTestCase(
    input=question,
    actual_output=(
        "It explains how particles acquire mass via the Higgs field."
    )
)
```
In DeepEval, an LLMTestCase is how you represent what you’re testing; specifically, it represents a single test scenario. At minimum, it contains an input (the question or prompt) and an output (what your LLM generated). You can see those reflected above as input and actual_output.
We can add different scenarios and, since this is going to be a test that looks at answer relevancy, let’s add a second scenario whose response deliberately includes some irrelevant material.
```python
from deepeval.test_case import LLMTestCase

question = "What does the Higgs boson explain in particle physics?"

good_case = LLMTestCase(
    input=question,
    actual_output=(
        "It explains how particles acquire mass via the Higgs field."
    )
)

rambly_case = LLMTestCase(
    input=question,
    actual_output=(
        "It explains how particles acquire mass via the Higgs field. "
        "Particle physics is a branch of physics. "
        "The Large Hadron Collider is in Europe."
    )
)
```
What we’re doing here is asking a question about the Higgs boson and testing two different responses: one that’s focused and direct, and another that starts well but then wanders into related but irrelevant territory.
With our tests in place, let’s add our metric and a model for that metric.
```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase

# The judge model is the LLM that evaluates the responses.
judge_model = OllamaModel(model="jeffnyman/ts-evaluator")
metric = AnswerRelevancyMetric(model=judge_model)

question = "What does the Higgs boson explain in particle physics?"

good_case = LLMTestCase(
    input=question,
    actual_output=(
        "It explains how particles acquire mass via the Higgs field."
    )
)

rambly_case = LLMTestCase(
    input=question,
    actual_output=(
        "It explains how particles acquire mass via the Higgs field. "
        "Particle physics is a branch of physics. "
        "The Large Hadron Collider is in Europe."
    )
)
```
The AnswerRelevancyMetric takes a judge model as an argument and, in this case, we’re using my ts-evaluator model. The idea here is that the metric will evaluate whether the actual output (for each test case) truly answers the input question. It does this by breaking the actual response output down into individual statements and assessing each one’s relevance to the original question.
You might be saying, “Wait, what actual response output? It looks like we provided the output.” Yes, we did. More on that to come!
Now, let’s put in some logic to actually execute our tests. Add the following to the bottom of the script.
```python
for label, case in [
    ("GOOD", good_case),
    ("RAMBLY", rambly_case)
]:
    # Run the judge against this test case.
    metric.measure(case)

    print()
    print("*" * 60)
    print(f"{label} ANSWER")
    print("*" * 60)

    print("score: ", metric.score)
    print("pass:  ", metric.is_successful())
    print("reason:", metric.reason)

    print("-" * 40)
    print("Extracted statements:")
    print("-" * 40)

    for i, stmt in enumerate(metric.statements, start=1):
        print(f"{i}. {stmt}")

    print("-" * 40)
    print("Verdicts:")
    print("-" * 40)

    for verdict in metric.verdicts:
        print(verdict)
```
The particular statement to note there is the call to measure(). This is a method on the metric object, and it’s the core of the evaluation: it applies the metric to a single LLMTestCase, calculates a score (which ranges from 0 to 1), determines success based on a threshold (default 0.5), and records the reasoning.
After running measure(), you can inspect the list of individual claims the judge extracted from your answer, which is what metric.statements gets you. You can also check what are called the verdicts (metric.verdicts). These contain the individual relevance decisions for each statement (“yes,” “no,” or, as we’ll see, “idk”), and they are the breakdown from which this metric’s score is computed.
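To make the shape of that breakdown concrete, here is a toy sketch of the extract-then-judge flow in plain Python. It is not how DeepEval works internally: in DeepEval the judge LLM does both the statement extraction and the verdicts. Here, split_into_statements is a naive sentence splitter and canned_judge simply looks up verdicts I copied from the sample run shown later in this post. Both function names are hypothetical stand-ins.

```python
import re

def split_into_statements(answer: str) -> list[str]:
    # Naive stand-in for statement extraction: one statement per sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]

# Verdict labels copied from the sample run in this post (not computed here).
CANNED_VERDICTS = {
    "It explains how particles acquire mass via the Higgs field.": "yes",
    "Particle physics is a branch of physics.": "no",
    "The Large Hadron Collider is in Europe.": "idk",
}

def canned_judge(question: str, statement: str) -> str:
    # Hypothetical stand-in for the judge model: a lookup, not real reasoning.
    return CANNED_VERDICTS.get(statement, "idk")

question = "What does the Higgs boson explain in particle physics?"
rambly_answer = (
    "It explains how particles acquire mass via the Higgs field. "
    "Particle physics is a branch of physics. "
    "The Large Hadron Collider is in Europe."
)

# Stage 1: extract statements. Stage 2: assign one verdict per statement.
statements = split_into_statements(rambly_answer)
pairs = [(s, canned_judge(question, s)) for s in statements]

for i, (statement, verdict) in enumerate(pairs, start=1):
    print(f"{i}. [{verdict}] {statement}")
```

The point is only the data flow: one answer becomes several statements, and each statement gets its own verdict. Everything the metric reports afterward is derived from that list.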
Executing the Script
When you run this, you will see something like this at the start:
You're running DeepEval's latest Answer Relevancy Metric! (using jeffnyman/ts-evaluator (Ollama), strict=False, async_mode=True)
Notice how it’s using ts-evaluator. Yet, in the previous post I had shown you how to set up DeepEval with Qwen 2.5 as the default. What this shows you is that your code can include a specific model and, if that’s the case, the model specified in your code will override the DeepEval configuration.
The tests we’re running in these posts will potentially be a bit more intensive on your machine, depending on hardware.
While this is running, keep in mind what’s going on here: this code sets up a judge model (ts-evaluator running locally), creates the answer relevancy metric, and then tests two responses to the same physics question. The first response (which we provided, remember) is concise and directly answers what was asked. The second response (which we also provided) starts with the same correct information but then adds tangential facts about particle physics and the LHC. True statements, but not what the question asked.
For each test case, we’re going to examine four things: the relevancy score (which will always be a value between 0 and 1, inclusive), whether it passes the threshold (again, 0.5 being the default), the judge’s reasoning for that score, and crucially the breakdown showing which specific statements were deemed relevant or not.
Think of it like a courtroom witness. If the lawyer asks “Where were you on Tuesday night?” and the witness responds “I was at the library. Libraries have books. Books are important for society,” the judge would instruct them to only answer the question. Mind you, the last two statements aren’t false. They’re just not responsive to what was asked. Answer relevancy is that type of judge, keeping the LLM’s testimony on track.
Test Result
Here is some likely output for the first test case:
************************************************************
GOOD ANSWER
************************************************************
score: 1.0
pass: True
reason: The score is 1.00 because there are no irrelevant statements in the actual output, making it a perfect score.
----------------------------------------
Extracted statements:
----------------------------------------
1. It explains how particles acquire mass.
2. Particles acquire mass via the Higgs field.
----------------------------------------
Verdicts:
----------------------------------------
verdict='yes' reason=None
verdict='yes' reason=None
For the second test case, you might get something like this:
************************************************************
RAMBLY ANSWER
************************************************************
score: 0.6666666666666666
pass: True
reason: The score is 0.67 because while the definition of particle physics provided is relevant, it does not directly address what the Higgs boson explains within that context. This makes the response slightly less than fully relevant to the question asked.
----------------------------------------
Extracted statements:
----------------------------------------
1. Particles acquire mass via the Higgs field.
2. Particle physics is a branch of physics.
3. The Large Hadron Collider is in Europe.
----------------------------------------
Verdicts:
----------------------------------------
verdict='yes' reason=None
verdict='no' reason='This statement does not directly explain what the Higgs boson explains, but rather defines particle physics.'
verdict='idk' reason='While this statement provides information about a relevant location for particle research, it is not directly related to explaining the role of the Higgs boson in particle physics.'
Let’s unpack what the judge model found. The output shows us not just scores, but the reasoning process behind those scores, which is invaluable for understanding what’s happening.
The Good Case
If you’ll forgive a stretched analogy here, when DeepEval assessed the “good” answer, it essentially performed a three-stage archaeological dig through the response, but instead of excavating layers of soil, it excavated layers of meaning. This occurred in stages.
The first stage is statement extraction. First, the metric broke down the response into its fundamental claims. What we might call its “atomic propositions.” Think of this like a forensic analyst separating a mixed substance into its pure chemical components. The single sentence yielded two distinct factual statements:
- “It explains how particles acquire mass”
- “Particles acquire mass via the Higgs field”
This decomposition is crucial because relevancy can only be judged at the statement level. You can’t assess whether a paragraph is relevant; you must assess whether each claim within that paragraph is relevant.
The second stage is verdict assignment. Here, the judge model evaluated each extracted statement against the original question: “What does the Higgs boson explain in particle physics?” Both statements received a “yes” verdict, meaning both directly address what was asked. Notice the reason=None fields? That’s actually significant: when a statement is clearly relevant, the judge doesn’t need to justify itself. It’s like a straightforward court verdict where the evidence is so unambiguous that no deliberation minutes need recording.
The third, and final, stage is score calculation. DeepEval computed the score using a simple ratio:
Score = (relevant statements) / (total statements) = 2/2 = 1.0
The perfect score of 1.0 means 100% of the extracted statements were relevant: nothing more, nothing less. The response exhibited what we might call “semantic efficiency,” by which I mean every claim contributed to answering the question.
What’s the pedagogical takeaway here? Well, this illustrates the metric’s core assumption: good answers contain only relevant statements. The “GOOD” case demonstrates the baseline: what happens when an LLM stays focused and disciplined, addressing exactly what was asked without wandering into adjacent topics, no matter how interesting or factually correct those tangents might be.
The Rambly Case
What about the second test case? When DeepEval assessed the “rambly” answer, it uncovered something obvious but worth calling out: not all true statements are relevant statements. This case demonstrates how an LLM can be factually correct while simultaneously being semantically unfocused.
Applying our stages, the metric decomposed the response into three atomic claims:
- “Particles acquire mass via the Higgs field”
- “Particle physics is a branch of physics”
- “The Large Hadron Collider is in Europe”
Think of this like a detective sorting evidence at a crime scene. Some items are directly relevant to the case at hand, some are tangentially related, and some just happened to be in the room. All three of the above statements are true, but truth and relevance are orthogonal dimensions.
When we get to the second stage, here’s where the evaluation gets interesting, and where we see the judge model’s discernment. Statement 1 is given a “yes” verdict: clean and direct. This addresses the question head-on, just like in the “GOOD” case. No justification needed.
The second statement is a “no” verdict. The judge model recognized this as definitional context that doesn’t actually answer what was asked. The reason field captures the distinction beautifully: “does not directly explain what the Higgs boson explains, but rather defines particle physics.” This is like asking “What does a hammer do?” and receiving “A hammer is a tool” as part of the answer. True? Absolutely. Relevant? Not really. It’s taxonomic information masquerading as functional explanation.
The third statement has an “idk” verdict, which means “I don’t know.” This is particularly instructive. The judge model deployed a third category: epistemic uncertainty. The Large Hadron Collider is where the Higgs boson was discovered, so there’s an associative connection. But does mentioning the collider’s geographic location explain what the Higgs boson itself explains? The judge essentially said: “This feels adjacent to the topic, but I can’t confidently say it answers the question.”
Think of this like an archaeological find that might be from the period you’re studying, but the stratigraphy is ambiguous. You don’t want to say “definitely yes” or “definitely no.” You acknowledge the uncertainty.
You might wonder: shouldn’t this be obvious to the judge? Meaning, shouldn’t the LHC mention be an obvious “no”? Well, maybe. That’s what fine-tuning would be used for once you do testing like this!
Then, for the third stage, DeepEval reveals its evaluative philosophy. The formula treats uncertainty charitably:
Score = (yes verdicts + idk verdicts) / (total statements)
Score = (1 + 1) / 3 = 0.667
The “idk” verdict gets counted as relevant for scoring purposes, which is actually quite generous. This reflects a design decision: benefit of the doubt. The metric assumes that if relevance is ambiguous enough that even an LLM judge can’t decide, we shouldn’t penalize too harshly.
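The scoring rule described above can be sketched in a few lines of plain Python. This is a reading of the formula as stated in this post (anything that isn’t an explicit “no” counts as relevant), not a copy of DeepEval’s source. The function name relevancy_score is hypothetical, and returning 1.0 for an empty verdict list is my own assumption.

```python
def relevancy_score(verdicts: list[str]) -> float:
    """Fraction of verdicts that are not an explicit 'no' ('idk' counts as relevant)."""
    if not verdicts:
        # Assumption: with nothing to judge, nothing was irrelevant.
        return 1.0
    relevant = sum(1 for v in verdicts if v.strip().lower() != "no")
    return relevant / len(verdicts)

good_verdicts = ["yes", "yes"]
rambly_verdicts = ["yes", "no", "idk"]

print(relevancy_score(good_verdicts))    # 1.0
print(relevancy_score(rambly_verdicts))  # 0.6666666666666666
```

Those two values match the scores in the sample output for the GOOD and RAMBLY cases.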
Despite only being 67% relevant, this test case still passed (is_successful() = True). Why? Because DeepEval’s default threshold for Answer Relevancy is 0.5. You can change the threshold. If you set your threshold, for example, to 0.8, then this second test case would have failed. Just for information’s sake, if you did want to change the threshold, you could do this:
```python
metric = AnswerRelevancyMetric(model=judge_model, threshold=0.8)
```
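To see what that threshold change does to the rambly case, here is a tiny sketch of the pass/fail decision. I’m assuming the check is an inclusive comparison (score greater than or equal to threshold); the helper name passes is hypothetical and is not part of DeepEval.

```python
def passes(score: float, threshold: float = 0.5) -> bool:
    # Assumption: a score exactly at the threshold still counts as a pass.
    return score >= threshold

rambly_score = 2 / 3  # the score from the RAMBLY sample run

print(passes(rambly_score))                 # True
print(passes(rambly_score, threshold=0.8))  # False
```

Same answer, same verdicts, same score. Only the bar moved.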
All of this reveals something important about the metric’s calibration: it’s designed to catch egregious irrelevance, not to enforce perfection. Think of it like a minimum competency standard rather than an excellence benchmark. A 0.67 says: “This answer has problems, but it’s not fundamentally broken.”
What’s the pedagogical takeaway? Well, we have a few. One is that relevance is compositional. The “RAMBLY” case shows that overall relevance emerges from the proportion of relevant statements. Three true statements produced three different verdicts (yes/no/idk), demonstrating that the judging happens claim by claim rather than on the response as a whole.
We also see that the judge reasons contextually. Notice how the judge provided reasoning for the “no” and “idk” verdicts but not the “yes”. This mirrors how human reviewers work: we explain rejections and uncertainties, but acceptances are (or at least tend to be) self-evident.
We also see that truth does not (necessarily) equal relevance. All three statements are factually accurate. The metric isn’t fact-checking; it’s measuring topical coherence. This is the difference between a knowledgeable speaker and a focused speaker.
Finally, we see the “cost” of rambling. Adding two sentences of tangentially related content dropped the response’s score by a third, from 1.0 to 0.67. This quantifies the intuitive principle: saying more doesn’t mean saying better. In fact, verbosity can dilute quality: a lesson as old as rhetoric itself.
A lesson that still bites me in just about every blog post I write!
Judging and Execution Models
One thing to note about our current setup: we’re using ts-evaluator as the judge model, but we haven’t actually specified what model generated the responses we’re testing. In this example, I manually wrote the actual_output text for both test cases to illustrate the concept clearly. In practice, you would have your LLM generate these responses.
We’ll hook up an actual execution model and see how to test its live outputs momentarily. And, yes, you guessed it: this is where my ts-reasoner model will come in.
This separation is intentional in DeepEval’s design. The execution model (what generates responses) and the judge model (what evaluates them) are independent concerns. You might be testing GPT-4’s outputs using a local Llama model as judge, or vice versa. You might use Grok to evaluate the output of Claude, or vice versa.
And note that what you’re really testing is not Grok or Claude, per se, but the model underlying them at a given time, whether that be Grok 4 Heavy or Claude Opus 4.6, and so on.
For our purposes right now, I wanted you to focus on understanding how the evaluation works, so I kept the responses simple and controlled.
This is, in fact, showing an important point from a testing perspective: we can gauge how our test harness works by executing controlled test cases, where we specify possible outputs and then see how our evaluator treats those.
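Here is one way to picture that separation, as a sketch in plain Python with no DeepEval involved. The generator and the judge are just two callables passed into a hypothetical run_case function, so either can be swapped without touching the other. The canned generator plays the role of our handwritten actual_output, and crude_judge is a deliberately silly stub (it scores the fraction of sentences that mention “Higgs”), which is nothing like how ts-evaluator actually judges.

```python
from typing import Callable

Generator = Callable[[str], str]     # question -> answer
Judge = Callable[[str, str], float]  # (question, answer) -> score in [0, 1]

def run_case(question: str, generate: Generator, judge: Judge,
             threshold: float = 0.5) -> dict:
    # The two roles never know about each other; run_case just wires them up.
    answer = generate(question)
    score = judge(question, answer)
    return {"answer": answer, "score": score, "passed": score >= threshold}

def canned(text: str) -> Generator:
    # Controlled test input: always "generates" the same handwritten answer.
    return lambda question: text

def crude_judge(question: str, answer: str) -> float:
    # Stub judge for illustration only: fraction of sentences mentioning "Higgs".
    sentences = [s for s in answer.split(". ") if s]
    hits = sum(1 for s in sentences if "higgs" in s.lower())
    return hits / len(sentences)

question = "What does the Higgs boson explain in particle physics?"
focused = run_case(
    question,
    canned("It explains how particles acquire mass via the Higgs field."),
    crude_judge,
)
rambly = run_case(
    question,
    canned(
        "It explains how particles acquire mass via the Higgs field. "
        "Particle physics is a branch of physics. "
        "The Large Hadron Collider is in Europe."
    ),
    crude_judge,
)

print(focused["score"], focused["passed"])
print(rambly["score"], rambly["passed"])
```

Swapping the canned generator for a callable that wraps a live model changes nothing about the judge side, and that independence is the design point.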
Execution Model
Now, let’s introduce our actual execution model to our script. We’ll use ts-reasoner to generate a response to our Higgs boson question, then have ts-evaluator judge its relevancy, just as we did before.
```python
from langchain_ollama import ChatOllama

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase

# The execution model generates the response; the judge model evaluates it.
execution_model = ChatOllama(model="jeffnyman/ts-reasoner")
judge_model = OllamaModel(model="jeffnyman/ts-evaluator")

metric = AnswerRelevancyMetric(model=judge_model)

question = "What does the Higgs boson explain in particle physics?"

# Ask the execution model and keep only the text of its reply.
response = execution_model.invoke(question).content

good_case = LLMTestCase(
    input=question,
    actual_output=response
)

rambly_case = LLMTestCase(
    input=question,
    actual_output=(
        "It explains how particles acquire mass via the Higgs field. "
        "Particle physics is a branch of physics. "
        "The Large Hadron Collider is in Europe."
    )
)

for label, case in [
    ("GOOD", good_case),
    # ("RAMBLY", rambly_case)
]:
    metric.measure(case)

    print()
    print("*" * 60)
    print(f"{label} ANSWER")
    print("*" * 60)

    print("score: ", metric.score)
    print("pass:  ", metric.is_successful())
    print("reason:", metric.reason)

    print("-" * 40)
    print("Extracted statements:")
    print("-" * 40)

    for i, stmt in enumerate(metric.statements, start=1):
        print(f"{i}. {stmt}")

    print("-" * 40)
    print("Verdicts:")
    print("-" * 40)

    for verdict in metric.verdicts:
        print(verdict)
```
We’ll still keep our manually crafted rambly example as a comparison point for the moment but notice that I commented out its execution.
I want to emphasize what we’ve done here with this new code. We’re now doing something similar to what we did in previous posts: we’re calling invoke() on our model and getting the content of the response.
Again, I’ll point out that we’re scaling up our computing here. We’re now using two models. Hardware speed issues are at least relevant to consider, since I don’t know your personal setup and can’t tell you exactly how long things will take.
I’m going to show you two possible outputs I got for this. First, here’s one of them:
************************************************************
GOOD ANSWER
************************************************************
score: 1.0
pass: True
reason: The score is 1.00 because the output perfectly addresses the question about the Higgs boson's explanation in particle physics, with no irrelevant statements present.
----------------------------------------
Extracted statements:
----------------------------------------
1. The Higgs boson is a fundamental particle.
2. It plays a crucial role in explaining how certain particles acquire mass.
3. It provides insight into a mechanism known as the Higgs mechanism.
4. The Higgs mechanism addresses why some elementary particles have mass while others do not.
5. Particles interact with the Higgs field.
6. As they move through this field, their interactions give them mass.
7. The Standard Model of particle physics relies on symmetry principles.
8. The Higgs mechanism breaks these symmetries.
9. Without such a mechanism, the theory would predict all particles should be massless or have identical masses.
10. The theory of electroweak interactions combines the weak nuclear force and electromagnetism into one unified force.
11. At low energies, these forces appear quite different.
12. The Higgs mechanism provides a way to explain this transition from symmetry to observable differences in particle behavior.
13. The discovery of the Higgs boson was announced by CERN's Large Hadron Collider (LHC) in 2012.
14. This discovery completed the Standard Model’s framework.
15. Some aspects of particle physics remain unexplained.
----------------------------------------
Verdicts:
----------------------------------------
verdict='yes' reason=None
verdict='yes' reason=None
verdict='yes' reason=None
verdict='yes' reason=None
verdict='yes' reason=None
verdict='yes' reason=None
verdict='idk' reason=None
verdict='idk' reason=None
verdict='yes' reason=None
verdict='idk' reason=None
verdict='idk' reason=None
verdict='idk' reason=None
verdict='yes' reason=None
verdict='yes' reason=None
verdict='idk' reason=None
Interestingly, even though this scored 1.0 and passed, look at the verdicts closely. You’ll see “yes” for statements directly about mass acquisition, but also several “idk” verdicts. This means the judge model was uncertain whether those statements directly answered the question.
The execution model gave us a thorough, educational response: it explained the Higgs mechanism, symmetry breaking, and how this fits into the Standard Model. In a classroom, this would be an A+ answer. But the judge is uncertain: are statements about symmetry breaking and electroweak theory directly answering “what does the Higgs boson explain,” or are they contextual background?
Keep in mind that a non-specialist human might be in the same boat!
The score came out to 1.0 because, again, DeepEval’s default behavior is generous with “idk” verdicts. Meaning, uncertainty doesn’t count against you, only definite irrelevance does. Yet, this reveals something important that I already brought up: comprehensive doesn’t always mean focused. The model answered the question thoroughly, but also provided scholarly context that the judge wasn’t confident counted as direct explanation.
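To see how much that generosity matters, here is a small sketch that tallies the fifteen verdicts from the run above and scores them two ways. The lenient rule is the one described in this post (only an explicit “no” hurts you). The strict rule, where only “yes” counts, is a hypothetical alternative I’m making up for comparison; it is not a DeepEval option that I’m aware of.

```python
from collections import Counter

# The fifteen verdicts from the run above, in order.
verdicts = (
    ["yes"] * 6 + ["idk"] * 2 + ["yes"] + ["idk"] * 3 + ["yes"] * 2 + ["idk"]
)

tally = Counter(verdicts)

# Lenient: anything that is not an explicit "no" counts as relevant.
lenient = sum(1 for v in verdicts if v != "no") / len(verdicts)

# Strict (hypothetical): only an explicit "yes" counts as relevant.
strict = tally["yes"] / len(verdicts)

print(dict(tally))
print("lenient:", lenient)
print("strict: ", strict)
```

With nine “yes” and six “idk” verdicts, the lenient rule gives 1.0 while the strict variant would give 0.6. The “perfect” score is resting on a fair amount of benefit of the doubt.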
Here’s an example from another test run:
************************************************************
GOOD ANSWER
************************************************************
score: 0.5833333333333334
pass: True
reason: The score is 0.58 because several statements discuss unrelated topics such as incorporating gravity into the Standard Model, assumptions about the completeness of the Standard Model, and ways to falsify the Higgs mechanism, which do not directly address what the Higgs boson explains in particle physics.
----------------------------------------
Extracted statements:
----------------------------------------
1. The Standard Model of particle physics describes all known fundamental particles and their interactions (electromagnetic, weak, and strong).
2. However, when physicists tried to incorporate gravity, the math broke down.
3. The Standard Model predicted that all fundamental particles should be massless – which clearly wasn't true (electrons, quarks, etc., do have mass). This was a major inconsistency.
4. In 1964, several physicists – including Peter Higgs – proposed a solution: the Higgs mechanism. This wasn't about a single particle, but a field that permeates all of space.
5. The 'Higgs field' has a non-zero value everywhere, even in a vacuum. It's like an invisible molasses.
6. Particles gain mass by interacting with the Higgs field. The stronger a particle interacts with the field, the more massive it becomes.
7. Particles that don’t interact with the Higgs field remain massless (like photons, the particles of light).
8. The Higgs boson is the quantum excitation of the Higgs field. Just like a wave is an excitation of an electromagnetic field, the Higgs boson is an excitation of the Higgs field.
9. Its discovery in 2012 at the Large Hadron Collider (LHC) was crucial: it provided direct evidence that the Higgs field does exist.
10. The Standard Model, and therefore the Higgs mechanism, is complete. This is a huge assumption!
11. There are many known phenomena the Standard Model doesn’t explain (dark matter, dark energy, neutrino masses, etc.).
12. The Higgs boson's properties (mass, spin, interactions) are precisely what needs to be measured and compared against predictions. Any deviation could indicate new physics beyond the Standard Model.
13. The most direct way to falsify the Higgs mechanism is to find evidence of new particles or interactions that don’t fit within its framework. Furthermore, precise measurements of the Higgs boson’s properties are continuously scrutinized for any anomalies.
----------------------------------------
Verdicts:
----------------------------------------
verdict='yes' reason=None
verdict='no' reason='This statement discusses a different issue (incorporating gravity into the Standard Model) that is not directly related to explaining what the Higgs boson explains.'
verdict='no' reason='While this statement provides context for why the Higgs mechanism was necessary, it does not directly explain what the Higgs boson itself explains in particle physics.'
verdict='yes' reason=None
verdict='yes' reason=None
verdict='yes' reason=None
verdict='yes' reason=None
verdict='yes' reason=None
verdict='no' reason='This statement makes an assumption about the completeness of the Standard Model, which is not directly related to explaining what the Higgs boson explains.'
verdict='no' reason='While this statement discusses other phenomena that are not explained by the Standard Model, it does not explain what the Higgs boson itself explains in particle physics.'
verdict='yes' reason=None
verdict='no' reason='This statement discusses potential ways to falsify the Higgs mechanism, which is not directly related to explaining what the Higgs boson explains.'
This is fascinating! The contrast is striking. The most obvious difference is the score drop from 1.0 to 0.58. Despite still passing, this response lost about 42% of its score. The second response included critical thinking about the Standard Model’s limitations, epistemic humility (acknowledging what we don’t know, like dark matter, etc.), falsifiability considerations (how we could test/disprove the mechanism), and scientific context about gravity integration problems. Yet, this time I got five explicit “no” verdicts with reasons:
- Statement 2 (gravity math breakdown) – “not directly related”
- Statement 3 (massless prediction problem) – “provides context … but does not directly explain”
- Statement 10 (completeness assumption) – “not directly related”
- Statement 11 (dark matter, etc.) – “does not explain what the Higgs boson itself explains”
- Statement 13 (falsification methods) – “not directly related”
Here’s the irony: the second response is arguably more scientifically sophisticated. It contextualizes the Higgs within broader theoretical problems, acknowledges limitations, and discusses falsifiability (a hallmark of good science). However, the judge penalized it for being less narrowly focused on the direct question. This reveals the metric’s bias: it values precision over pedagogical richness. Precision is a measure of relevancy.
This also brings up a good point. Our controlled test cases let us explore the ideas relatively consistently. Once you hook up an actual model, what output you specifically get can be very different.
Refactors and Cleanup
In our first version of this code, we crafted both responses by hand: a focused answer and a rambling one. This was useful for understanding how the metric works in controlled conditions. We knew what we expected to see, and the metric confirmed our intuitions. But there’s a limitation in that first version: we were essentially testing our ability to write good and bad examples, not testing an actual LLM’s behavior.
So, we evolved this. We replaced our handcrafted “good” response with whatever ts-reasoner actually generated. Now we’re doing real evaluation: we don’t know in advance whether the model will be focused, thorough, rambling, or something else entirely. The metric will tell us.
What this means is that we can remove the “rambly” case at this point. I say that because the output of our last test run shows the evaluator treating live output consistently with how it treated our control example. Further, it doesn’t make sense to call our test case “good_case” anymore. After all, we’re testing the real thing now, and whether the output is “good” has yet to be determined. Here’s a refactored version of our code:
```python
from langchain_ollama import ChatOllama

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase

execution_model = ChatOllama(model="jeffnyman/ts-reasoner")
judge_model = OllamaModel(model="jeffnyman/ts-evaluator")

metric = AnswerRelevancyMetric(model=judge_model)

question = "What does the Higgs boson explain in particle physics?"

response = execution_model.invoke(question).content

generated = LLMTestCase(
    input=question,
    actual_output=response
)

metric.measure(generated)

print()
print("*" * 60)
print(f"{question}")
print("*" * 60)

print("*" * 60)
print(f"{response}")
print("*" * 60)

print("score: ", metric.score)
print("pass:  ", metric.is_successful())
print("reason:", metric.reason)

print("-" * 40)
print("Extracted statements:")
print("-" * 40)

for i, stmt in enumerate(metric.statements, start=1):
    print(f"{i}. {stmt}")

print("-" * 40)
print("Verdicts:")
print("-" * 40)

for verdict in metric.verdicts:
    print(verdict)
```
The changes there essentially remove the rambly test case and the loop that executed both cases. But I also added a few print statements for the question and the response. A key thing you might have noticed is that, prior to this, we never actually saw the response from the reasoner model. We were just told what statements were extracted and what the verdicts were. Now the question and the response are printed as part of the output.
Yet, this still feels like a lot of extra work. After all, if DeepEval is a test harness, shouldn’t it provide much of this information for us? Here’s a way to refactor the test and have DeepEval generate most of the info we need:
```python
from langchain_ollama import ChatOllama
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase

execution_model = ChatOllama(model="jeffnyman/ts-reasoner")
judge_model = OllamaModel(model="jeffnyman/ts-evaluator")

metric = AnswerRelevancyMetric(model=judge_model, verbose_mode=True)

question = "What does the Higgs boson explain in particle physics?"

response = execution_model.invoke(question).content

generated = LLMTestCase(
    input=question,
    actual_output=response
)

metric.measure(generated)
```
The addition of verbose_mode gives us the general output from the execution. Notice I removed all the print statements at the end.
Do note that even verbose output does not give you the full response of the model. You would still have to print that out if you wanted to see it.
Evaluating
Let’s make one more change here and it’s a simple one but impactful:
```python
from langchain_ollama import ChatOllama
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import OllamaModel
from deepeval.test_case import LLMTestCase

execution_model = ChatOllama(model="jeffnyman/ts-reasoner")
judge_model = OllamaModel(model="jeffnyman/ts-evaluator")

metric = AnswerRelevancyMetric(model=judge_model, verbose_mode=True)

question = "What does the Higgs boson explain in particle physics?"

response = execution_model.invoke(question).content

generated = LLMTestCase(
    input=question,
    actual_output=response
)

evaluate(test_cases=[generated], metrics=[metric])
```
Here I’m calling the evaluate() function rather than measure(). But why? In DeepEval, evaluate and measure serve two distinct purposes in the evaluation workflow. The measure function is a low-level method belonging to individual metric objects, like we saw in our original code. It calculates the score for a single metric on a single test case. You call it manually when you want to see the score of one specific metric without running a full test suite.
The evaluate function is a high-level orchestration utility that manages the entire testing process. It can run multiple metrics against multiple test cases simultaneously. It also includes built-in speed improvements, caching, and parallel execution.
Go ahead and run the script and see the output that evaluate gives you. You can probably see that my initial “measure” approach made it easier for me to treat the metric pedagogically and show you what it was doing. The evaluate() approach is more what you use once you understand the metric.
If you run the version of the code with evaluate(), you might notice something in the output about how you can analyze and save testing results on Confident AI. This refers to an observability platform, similar to LangSmith, which we used when learning LangChain. I’ll revisit this in future posts. I’ll also come back to evaluate.
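To make the measure-versus-evaluate distinction concrete, here’s a toy sketch in plain Python. To be clear, ToyCase, ToyMetric, and toy_evaluate are names I made up for illustration. This is not how DeepEval is implemented, and it skips caching and parallel execution entirely. It only shows the shape of the relationship: measure scores one case with one metric, while an evaluate-style function loops over every case and every metric and collects the results.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical stand-ins for illustration only; these are NOT DeepEval classes.
@dataclass
class ToyCase:
    input: str
    actual_output: str


@dataclass
class ToyMetric:
    name: str
    scorer: Callable[[ToyCase], float]
    threshold: float = 0.5

    def measure(self, case: ToyCase) -> float:
        """Low level: one metric scores one test case."""
        return self.scorer(case)


def toy_evaluate(cases, metrics):
    """High level: run every metric against every test case."""
    results = []
    for case in cases:
        for metric in metrics:
            score = metric.measure(case)
            results.append({
                "input": case.input,
                "metric": metric.name,
                "score": score,
                "passed": score >= metric.threshold,
            })
    return results


# Trivial scorers so the sketch runs without any judge model.
non_empty = ToyMetric(
    "non_empty", lambda c: 1.0 if c.actual_output.strip() else 0.0
)
mentions_mass = ToyMetric(
    "mentions_mass", lambda c: 1.0 if "mass" in c.actual_output.lower() else 0.0
)

cases = [
    ToyCase("What does the Higgs boson explain?",
            "It explains how particles acquire mass."),
    ToyCase("What does the Higgs boson explain?",
            "Physics is fascinating."),
]

results = toy_evaluate(cases, [non_empty, mentions_mass])
for row in results:
    print(row)
```

Two cases and two metrics yield four result rows. That fan-out is the bookkeeping you’d otherwise write by hand around repeated measure() calls.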
Recapping
I like parallels to other disciplines. (Years ago, I wrote about being cross-discipline associative.) Given that, consider this: in archaeology, a trench can contain artifacts from the right time period, the wrong time period, and natural debris. And that can all be in the same bucket. Stratigraphy helps us separate what belongs to each layer. Answer Relevancy does the same for LLM responses: it separates claims that directly address your question (the target layer) from contextually related information that belongs to adjacent layers of knowledge. Both might be valuable, but only one answers what you asked.
Another parallel occurs in biblical studies: it’s the difference between answering “What does Paul argue in Romans 5?” versus explaining the entire historical context of Second Temple Judaism, Greco-Roman honor-shame culture, and the political situation in Rome. All of that context might enrich understanding, but if someone asks what Paul argues in a specific chapter, they want his actual argument. Not a doctoral seminar on background material.
This is why Answer Relevancy matters as a metric: without it, you might have an LLM that’s knowledgeable and fluent but consistently gives you the academic lecture when you asked for the specific answer. The metric helps you catch that pattern before your users do.
What Does the Testing Tell Us?
Our testing reveals something crucial about the distinction between comprehensiveness and focus in LLM responses. When we evaluated ts-reasoner against our Higgs boson question, we observed two notably different responses across runs: both technically correct, but with vastly different relevancy profiles.
The first run (scoring 1.0) produced a thorough educational response that the judge model treated generously. Despite multiple “idk” verdicts on statements about symmetry breaking and electroweak theory, the response passed with a perfect score.
Test Finding: This tells us that ts-reasoner can generate comprehensive answers that walk the line between direct explanation and contextual enrichment.
The second run (scoring 0.58) was arguably more scientifically sophisticated. It addressed gravity integration problems, acknowledged the Standard Model’s limitations with dark matter and neutrinos, and discussed falsifiability: all hallmarks of rigorous scientific thinking. Yet the judge penalized it heavily, issuing five explicit “no” verdicts for statements deemed “not directly related” to the core question.
Test Finding: This variance across runs, from 1.0 to 0.58, reveals something important about our execution model: ts-reasoner doesn’t have a fixed “style” for answering questions. The probabilistic nature of LLM generation means the same prompt can yield responses that prioritize different things: direct explanation versus critical context, focused precision versus intellectual breadth.
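If you want to look at that variance deliberately rather than stumble on it, one option is to collect the score from each run and summarize the spread. Here’s a minimal sketch using only the standard library. The summarize_runs helper is my own hypothetical name, and the list holds just the two scores reported above. In a real script you’d append metric.score after each run of the same question.

```python
from statistics import mean


def summarize_runs(scores):
    """Summarize relevancy scores from repeated runs of one prompt."""
    return {
        "runs": len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
        "spread": max(scores) - min(scores),
    }


# The two scores observed in this post for the Higgs boson question.
observed = [1.0, 0.58]

summary = summarize_runs(observed)
print(summary)
```

Two runs is far too few to say anything statistical. The point is only that a single score from a single run tells you little about a probabilistic system.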
What this testing taught us is that Answer Relevancy functions as a focus detector, not a quality detector. The 0.58 response wasn’t wrong or unhelpful; it was contextually rich and epistemically honest. But it sacrificed narrow focus for broader understanding. The metric quantified that trade-off: a score of 0.58 means a little over 40 percent of the statements were judged to wander beyond the immediate question into adjacent theoretical territory.
For our purposes, this means we now have a baseline understanding of ts-reasoner’s tendency to provide scholarly context alongside direct answers. Whether that’s desirable depends entirely on our use case. If we’re building a system for quick factual queries, we might want to tune toward the 1.0-style responses. If we’re building an educational tool where deeper context matters, the 0.58 response might actually be preferable, even though the metric penalizes it.
Test Finding: The testing also exposed the limitations of generous scoring: treating “idk” verdicts as passing can mask responses that are technically on-topic but semantically diffuse. This suggests we might want to experiment with stricter thresholds or even create custom metrics that distinguish between “directly answers” and “provides relevant context.”
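To see how much the treatment of “idk” matters, here’s a small sketch that scores a list of verdicts two ways. The generous rule (only “no” counts against the score) is consistent with the scores shown in this post. The strict rule is a hypothetical alternative of my own, not a DeepEval option. The relevancy_score function, the empty-list behavior, and the example verdict lists are all assumptions for illustration. In particular, the twelve-statement total for the second run is my guess at a mix that matches five “no” verdicts and a 0.58 score.

```python
def relevancy_score(verdicts, count_idk_as_relevant=True):
    """Return the fraction of statements judged relevant.

    Generous mode counts "yes" and "idk" as relevant.
    Strict mode (hypothetical) counts only "yes".
    """
    if not verdicts:
        return 1.0  # Assumption for this sketch: nothing irrelevant was said.
    relevant = {"yes", "idk"} if count_idk_as_relevant else {"yes"}
    hits = sum(1 for v in verdicts if v.strip().lower() in relevant)
    return hits / len(verdicts)


# A three-statement answer with one verdict of each kind.
mixed = ["idk", "no", "yes"]
generous = relevancy_score(mixed)
strict = relevancy_score(mixed, count_idk_as_relevant=False)

# One verdict mix consistent with the 0.58 run: five "no" out of twelve
# (the total of twelve is an assumption).
second_run = relevancy_score(["no"] * 5 + ["yes"] * 7)

threshold = 0.5  # Illustrative pass mark; tune it to your use case.
print("generous:", generous, "pass:", generous >= threshold)
print("strict:  ", strict, "pass:", strict >= threshold)
print("second run:", round(second_run, 2))
```

Under the generous rule the mixed answer passes a 0.5 threshold. Under the strict rule the same verdicts fail it. That gap is exactly what a stricter threshold or a custom metric would let you control.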
Next Steps
Now that you have the basics of how DeepEval operates, and some ideas of how to permute the test logic should you wish to, the next post, which investigates another metric, should be smoother sailing.
“Are the blog posts from Jeff Nyman too verbose?”
************************************************************
GOOD ANSWER
************************************************************
score: 1.0
pass: True
reason: The score is 1.00 because there are no irrelevant statements in the actual output that detract from addressing whether Jeff Nyman’s blog posts are too verbose.
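----------------------------------------
Extracted statements:
----------------------------------------
1. There is a lot to explain
2. Lots of words are needed
----------------------------------------
Verdicts:
----------------------------------------
verdict='yes' reason=None
verdict='yes' reason=None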
************************************************************
RAMBLY ANSWER
************************************************************
score: 0.6666666666666666
pass: True
reason: The score is 0.67 because the response does not directly address whether the blog posts are verbose, focusing instead on irrelevant aspects.
----------------------------------------
Extracted statements:
----------------------------------------
1. It helps the understanding.
2. The Python scripts in the blogs work.
3. Verbosity can dilute quality.
----------------------------------------
Verdicts:
----------------------------------------
verdict='idk' reason=None
verdict='no' reason='Does not directly address verbosity.'
verdict='yes' reason=None
Thanks as always for the great content, Jeff.
I was getting this error when running DeepEval:
error uploading: HTTPSConnectionPool(host='us.i.posthog.com', port=443): Max retries exceeded with url: /batch/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1032)')))
I’ve solved it by disabling telemetry from DeepEval, adding this in the .env file:
DEEPEVAL_TELEMETRY_OPT_OUT=1
Thank you for flagging this! I added a section on this to the first post, covering DeepEval (AI and Testing: Evaluation and DeepEval), so hopefully that will help people as they come to this.