Evaluation
Evaluate Recipe involves using the "LLM as a Judge" technique to validate recipe responses against various evaluation metrics and judging criteria. The process includes measuring performance, analyzing results, and using the findings to decide whether the recipe needs refinement or further improvement. The aim is to ensure the recipe meets the desired standards and achieves its objectives.
In the right-side panel of the Agent 2.0 recipe, under the Evaluation dropdown, the following key elements are available.
You can upload an evaluation dataset in CSV format, which acts as the ground truth for recipe evaluation. The dataset must contain two columns: the first column holds the input questions and the second column holds the corresponding ground-truth answers. Here is an example of an evaluation dataset.
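For instance, a minimal two-column file could be generated as in the sketch below; the header names and sample questions are illustrative assumptions, not a required schema.

```python
import csv

# Hypothetical example of a two-column evaluation dataset:
# first column = input question, second column = ground-truth answer.
rows = [
    ["question", "ground_truth"],
    ["What is the capital of France?", "Paris"],
    ["Who wrote Pride and Prejudice?", "Jane Austen"],
]

with open("evaluation_dataset.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```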
The uploaded dataset can be accessed by selecting the View Dataset option, which is located below the Upload Dataset section. The following screenshot illustrates its appearance.
The Default Evaluation uses the RAGAs framework for assessment, leveraging the NLA models configured on the Organization Page. Users have the flexibility to apply all available metrics or select specific ones based on their evaluation requirements, allowing for a tailored assessment approach.
It supports the following metrics:
Answer Relevancy: Measures how well the AI-generated response aligns with the input query.
Faithfulness: Ensures that the response remains accurate and factually grounded.
Context Recall: Assesses the model’s ability to retrieve relevant contextual information.
Context Precision: Evaluates the accuracy of the retrieved contextual information.
Context Entity Recall: Checks whether the model correctly recalls key entities from the provided context.
Answer Similarity: Compares the generated response with a reference answer to determine similarity.
Answer Correctness: Validates whether the response is accurate and logically sound.
After selecting the Default Evaluation and the desired metrics, click "Run Evaluation" to initiate the assessment. The system processes test queries through the model, generating responses that are analyzed using RAGAs evaluation criteria. It then calculates performance scores for each selected metric and compiles a comprehensive performance report, providing insights into the model’s effectiveness in handling queries. This structured evaluation process supports continuous optimization and fine-tuning, enhancing the model’s accuracy, relevance, and contextual precision.
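For reference, the sketch below shows roughly how such a run maps onto the open-source RAGAs Python library. The column names, sample data, and metric selection are assumptions for illustration; the platform's internal integration and model configuration may differ.

```python
# Sketch only: approximates what "Run Evaluation" does using the open-source RAGAs library.
# Assumes an evaluation LLM is configured (e.g. OPENAI_API_KEY set in the environment).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    answer_correctness,
)

eval_data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],        # response produced by the recipe
    "contexts": [["Paris is the capital and largest city of France."]],
    "ground_truth": ["Paris"],                            # from the uploaded CSV dataset
})

report = evaluate(
    eval_data,
    metrics=[answer_relevancy, faithfulness, context_recall,
             context_precision, answer_correctness],
)
print(report)  # per-metric scores that feed the evaluation dashboard
```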
The evaluation summary is presented as a dashboard that includes all the selected metrics and their scores as mean, median, and standard deviation.
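Conceptually, the dashboard aggregation for each metric amounts to something like the following; the metric name and scores below are made up for illustration.

```python
import statistics

# Hypothetical per-question scores for one metric from an evaluation run.
faithfulness_scores = [0.92, 0.81, 0.88, 0.95, 0.79]

summary = {
    "mean": statistics.mean(faithfulness_scores),
    "median": statistics.median(faithfulness_scores),
    "std_dev": statistics.stdev(faithfulness_scores),
}
print(summary)  # the dashboard reports these three figures per metric
```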
Custom Evaluation enables users to select an evaluation prompt from the existing list and offers the flexibility to choose the model for the evaluation.
You can select from the existing evaluation prompts in the prompt playground to add to the recipe. A prompt is associated with an LLM and model parameters.
This prompt serves as the basis for assessing model behavior across multiple metrics.
Clicking "Run Evaluation" initiates the assessment. The system evaluates the model’s responses against the defined criteria, generating a performance report that helps in optimizing model outputs for accuracy, coherence, and adherence to required guidelines.
By enabling Custom Evaluation, users gain granular control over the assessment process, allowing for precise model tuning and validation based on specific business, compliance, or research needs.
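As a rough illustration of the "LLM as a Judge" pattern behind Custom Evaluation, an evaluation prompt might be applied along the lines of the sketch below. The prompt wording, model name, and 1-5 scoring scale are assumptions for illustration, not the platform's actual implementation.

```python
# Sketch of the "LLM as a Judge" pattern used by Custom Evaluation.
# Prompt text, model name, and scoring scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Ground truth: {ground_truth}
Recipe answer: {answer}

Rate the answer's correctness on a scale of 1-5 and briefly justify the score.
Respond as: SCORE: <1-5> | JUSTIFICATION: <one sentence>"""

def judge(question: str, ground_truth: str, answer: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, ground_truth=ground_truth, answer=answer
            ),
        }],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("What is the capital of France?", "Paris", "The capital is Paris."))
```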
The evaluation summary is presented as a dashboard that includes all the metrics defined in the evaluation prompt and their scores as mean, median, and standard deviation.
Once the evaluation run is complete, you will see:
Last Evaluation Run: Timestamp of the most recent evaluation.
Last Evaluation Status: Current status of the last evaluation.
View Last Evaluation Results: Access to detailed evaluation results.
Alternatively, you can also access evaluation run details from the main left-hand-side panel under Recipe Evaluation Runs.
A detailed table for each evaluation run of the recipe includes the following information:
Input: The input question supplied from the evaluation dataset.
Ground Truth: The expected (correct) answer supplied from the evaluation dataset.
Output: The response generated by the recipe.
Metric: The evaluation metric used to assess the quality of the output.
Score: The numerical result obtained for the evaluation metric; it quantifies the effectiveness or accuracy of the output.
Justification: The reasoning behind the obtained score. It may include details about the evaluation process, model performance, or specific observations.
Error: An error message, shown only if the evaluation process encounters an error.
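As a rough mental model, one row of this table can be thought of as a record like the following; the field names are illustrative, not an exported schema.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative record for one row of the evaluation-run table (not an exported schema).
@dataclass
class EvaluationRow:
    input: str            # question from the evaluation dataset
    ground_truth: str     # expected answer from the evaluation dataset
    output: str           # response generated by the recipe
    metric: str           # e.g. "Faithfulness"
    score: float          # numerical result for the metric
    justification: str    # reasoning behind the score
    error: Optional[str] = None  # populated only if the evaluation step failed

row = EvaluationRow(
    input="What is the capital of France?",
    ground_truth="Paris",
    output="The capital of France is Paris.",
    metric="Answer Correctness",
    score=0.97,
    justification="The answer matches the ground truth exactly.",
)
```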