12 April 2026 – Stories from a Software Tester

AI and Testing: Evaluating Conversations

written by Jeff Nyman

In the previous posts in the DeepEval series, we built up a diagnostic framework for evaluating RAG systems, covering Faithfulness, Contextual Precision, Contextual Recall, Contextual Relevancy, and G-Eval. All of those metrics operate on single turns: one question, one retrieval context, one response, one score. In this post we’ll move into different territory: conversational evaluation.
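To make the single-turn versus conversational distinction concrete, here is a minimal sketch of the two data shapes. The class and function names (`SingleTurnCase`, `Turn`, `ConversationCase`, `assistant_turns`) are hypothetical and chosen for illustration only; they are not DeepEval's own classes, and the sample content is invented.

```python
from dataclasses import dataclass, field


@dataclass
class SingleTurnCase:
    """Hypothetical single-turn case: one question, one context, one response."""
    question: str
    retrieval_context: list[str]
    response: str


@dataclass
class Turn:
    """Hypothetical single message in a conversation."""
    role: str  # "user" or "assistant"
    content: str


@dataclass
class ConversationCase:
    """Hypothetical conversational case: an ordered list of turns."""
    turns: list[Turn] = field(default_factory=list)


def assistant_turns(case: ConversationCase) -> list[Turn]:
    """Return only the assistant's messages, in their original order."""
    return [turn for turn in case.turns if turn.role == "assistant"]


# A single-turn case is self-contained: everything needed to score it is here.
single = SingleTurnCase(
    question="What is the refund window?",
    retrieval_context=["Refunds are accepted within 30 days of purchase."],
    response="You can request a refund within 30 days.",
)

# A conversational case carries history: the second assistant reply only makes
# sense in light of the earlier turns.
conversation = ConversationCase(
    turns=[
        Turn("user", "What is the refund window?"),
        Turn("assistant", "You can request a refund within 30 days."),
        Turn("user", "Does that apply to sale items too?"),
        Turn("assistant", "Yes, the same 30 days applies to sale items."),
    ]
)

print(len(conversation.turns), len(assistant_turns(conversation)))
```

The point of the sketch is only the shape: a single-turn metric can score `single` in isolation, while anything judged on `conversation` has to account for the order and content of the preceding turns.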


