AI and Testing: Evaluating Conversations

In the previous posts in the DeepEval series, we built up a diagnostic framework for evaluating RAG systems, covering Faithfulness, Contextual Precision, Contextual Recall, Contextual Relevancy, and G-Eval. All of those metrics operate on single turns: one question, one retrieval context, one response, one score. In this post we’ll move into different territory: conversational evaluation.

Continue reading AI and Testing: Evaluating Conversations