Evaluate Recipe
Evaluate Recipe uses the "LLM as a Judge" technique to validate recipe responses against defined evaluation metrics and judging criteria. The process includes measuring performance, analyzing the results, and using the findings to decide whether the recipe needs refinement or further improvement. The aim is to ensure the recipe meets the desired standards and achieves its objectives.
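To make the LLM-as-a-Judge idea concrete, here is a minimal sketch of how a judge model can score a recipe response against a ground-truth answer. It uses the OpenAI Python client purely for illustration; the judge prompt, model name, and 1-5 scoring scale are assumptions, not the platform's actual implementation.

```python
# Minimal LLM-as-a-Judge sketch (illustrative only, not the platform's code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_response(question: str, ground_truth: str, response: str) -> str:
    """Ask a judge model to score a candidate answer against the ground truth."""
    prompt = (
        "You are an impartial judge. Score the candidate answer against the "
        "ground truth on a 1-5 scale for correctness, then justify the score.\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Candidate answer: {response}"
    )
    result = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; any capable LLM works
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content
```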
A recipe can be tested by adding the Evaluate element to the recipe canvas and configuring it.
You can select from the existing evaluation prompts in the prompt playground to add to the recipe. A prompt is associated with an LLM and model parameters.
You can provide an evaluation dataset in CSV format, which acts as the ground truth for recipe evaluation. The dataset must contain two columns: the first column holds the input questions, and the second column holds the ground-truth answer to each question. Here is an example of an evaluation dataset.
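The dataset below is illustrative; the header names and rows are assumptions, not required values.

```csv
question,ground_truth
What is the capital of France?,Paris
In which year did the Apollo 11 mission land on the Moon?,1969
```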
Link the recipe's RAG Prompt element to the Evaluate element on the recipe canvas. This link allows the evaluation prompt to access and use the recipe's RAG prompt output for evaluation.
The evaluation run invokes the recipe's RAG pipeline for each question in the evaluation dataset, obtains the responses, and evaluates each response against the metrics and criteria defined in the evaluation prompt.
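Conceptually, the run behaves like the loop sketched below. `invoke_rag_pipeline` is a hypothetical stand-in for the recipe's RAG pipeline, and `judge_response` stands in for the evaluation prompt (see the sketch above); neither name comes from the platform's API.

```python
import csv

def run_evaluation(dataset_path: str) -> list[dict]:
    """Conceptual sketch of an evaluation run (hypothetical helper names)."""
    results = []
    with open(dataset_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if present
        for question, ground_truth in reader:
            output = invoke_rag_pipeline(question)  # the recipe's RAG pipeline
            verdict = judge_response(question, ground_truth, output)  # LLM judge
            results.append({
                "input": question,
                "ground_truth": ground_truth,
                "output": output,
                "evaluation": verdict,
            })
    return results
```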
Once the evaluation run is complete, you will see a status message indicating a Completed run, along with a View Evaluations button that directs you to the Evaluation Runs page. Alternatively, you can access evaluation run details from the main left-hand panel under Recipe Evaluation Runs.
The evaluation summary is presented as a dashboard that includes every metric defined in the evaluation prompt, with each metric's scores reported as mean, median, and standard deviation.
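These dashboard statistics can be reproduced from the raw per-response scores; for example, with Python's standard library (the scores below are illustrative):

```python
from statistics import mean, median, stdev

scores = [4, 5, 3, 4, 5]  # illustrative per-response scores for one metric
print(f"mean={mean(scores):.2f}, median={median(scores)}, stdev={stdev(scores):.2f}")
# mean=4.20, median=4, stdev=0.84
```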
A detailed table for each evaluation run of the recipe includes the following information (an illustrative example row follows the list):
Input: The input question supplied from the evaluation dataset.
Ground Truth: The expected correct answer to the question, supplied from the evaluation dataset.
Output: The generated output or response produced by the recipe.
Metric: The evaluation metric used to assess the quality of the output.
Score: The numerical score or result obtained based on the evaluation metric. It quantifies the effectiveness or accuracy of the output.
Justification: An explanation or reasoning behind the obtained score or evaluation result. It may include details regarding the evaluation process, model performance, or specific observations.
Error: An error message if the evaluation process encounters an error.
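To make these fields concrete, a single row of the detailed table might look like the following. All values, including the metric name and score scale, are illustrative assumptions rather than the platform's actual output format.

```python
# Hypothetical example of one row in the detailed evaluation table.
example_row = {
    "input": "What is the capital of France?",
    "ground_truth": "Paris",
    "output": "The capital of France is Paris.",
    "metric": "answer_correctness",  # illustrative metric name
    "score": 5,                      # illustrative score on a 1-5 scale
    "justification": "The output matches the ground truth exactly.",
    "error": None,                   # populated only if evaluation fails
}
```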