AI and Testing: Recall, Relevancy, and Richer Evaluation

In the previous posts, we looked at DeepEval's Faithfulness and Contextual Precision metrics and started building an intuition for how retrieval failures cascade into generation failures. Those two metrics told us what went wrong and where in the pipeline it happened. In this post, we’ll add three more tools to the diagnostic kit: Contextual Recall, Contextual Relevancy, and G-Eval.
