Evaluate Recipe


Evaluate Recipe uses the "LLM as a Judge" technique to validate the recipe's responses against various evaluation metrics and judging criteria. This process includes measuring performance, analyzing results, and using the findings to decide on the recipe's refinement or further improvement. The aim is to ensure the recipe meets the desired standards and achieves its objectives.

A recipe can be evaluated by bringing the Evaluate element into the recipe canvas and configuring it.

Prompt

You can select from the existing evaluation prompts in the prompt playground to add to the recipe. A prompt is associated with an LLM and model parameters.
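
For background, an "LLM as a Judge" evaluation prompt typically asks a model to score the recipe's answer against the ground truth on named metrics. The template below is purely illustrative; its wording, metric names, scoring scale, and output format are assumptions for explanation, not the Karini AI prompt:

```python
# Illustrative "LLM as a Judge" evaluation prompt template; every detail here
# (metrics, 1-5 scale, JSON output format) is an assumption for explanation only.
JUDGE_PROMPT_TEMPLATE = """You are an impartial evaluator.

Question: {question}
Ground-truth answer: {ground_truth}
Candidate answer: {output}

For each metric below, give a score from 1 to 5 and a one-sentence justification:
- correctness: does the candidate answer agree with the ground truth?
- completeness: does it cover all key points of the ground truth?

Respond as JSON: [{{"metric": ..., "score": ..., "justification": ...}}, ...]"""
```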

Evaluation Dataset

You can provide an evaluation dataset in CSV format, which acts as the ground truth for recipe evaluation. The evaluation dataset must contain two columns: the first column should include the input questions, and the second column should contain the ground-truth answer to each question. Below is an illustrative example of an evaluation dataset.
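
This sketch is illustrative only; the file name, column headers, and sample rows are assumptions for demonstration, not a Karini AI requirement:

```python
# Sketch: write a minimal two-column evaluation dataset
# (input question, ground-truth answer). All values are illustrative.
import csv

rows = [
    ("question", "ground_truth"),  # header row (assumed)
    ("What is the standard refund window?",
     "Refunds can be requested within 30 days of purchase."),
    ("Which plan includes priority support?",
     "Priority support is included in the Enterprise plan."),
]

with open("evaluation_dataset.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```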

Link the recipe's RAG Prompt element to the Evaluation element in the recipe canvas. This link allows the evaluation prompt to access and utilize the recipe's RAG prompt output for evaluation.

Run Evaluation

An evaluation run invokes the recipe's RAG pipeline for each question in the evaluation dataset, obtains the responses, and evaluates each response against the metrics and criteria defined in the evaluation prompt.
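
Conceptually, the run behaves like the sketch below; `run_rag_pipeline` and `judge_response` are hypothetical placeholders for the recipe's RAG pipeline and the evaluation prompt's LLM-as-a-Judge call, not Karini AI APIs:

```python
# Conceptual sketch of an evaluation run; placeholder callables stand in for
# the recipe's RAG pipeline and the LLM-as-a-Judge evaluation prompt.
import csv

def evaluate_recipe(dataset_path, run_rag_pipeline, judge_response):
    results = []
    with open(dataset_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if present
        for question, ground_truth in reader:
            output = run_rag_pipeline(question)  # invoke the recipe's RAG pipeline
            # Judge the response against each metric defined in the evaluation prompt
            for verdict in judge_response(question, ground_truth, output):
                results.append({
                    "input": question,
                    "ground_truth": ground_truth,
                    "output": output,
                    "metric": verdict["metric"],
                    "score": verdict["score"],
                    "justification": verdict["justification"],
                })
    return results
```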

View Evaluation

Once the evaluation run is complete, you will see a status message indicating a Completed run, along with a View Evaluations button that directs you to the Evaluation Runs page. Alternatively, you can access evaluation run details from the main left-hand panel under Recipe Evaluation Runs.

The evaluation summary is presented as a dashboard that includes all the metrics defined in the evaluation prompt, with their scores reported as mean, median, and standard deviation.
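
For intuition only, these per-metric summary statistics could be derived from the detailed results roughly as follows; the dashboard computes them for you, and the helper below is a hypothetical sketch:

```python
# Sketch: aggregate per-metric scores into mean, median, and standard deviation,
# mirroring what the evaluation dashboard displays.
from collections import defaultdict
import statistics

def summarize(results):
    scores_by_metric = defaultdict(list)
    for row in results:
        scores_by_metric[row["metric"]].append(row["score"])
    return {
        metric: {
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        }
        for metric, scores in scores_by_metric.items()
    }
```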

A detailed table for each evaluation run of the recipe includes the following information:

  • Input: The input question supplied from the evaluation dataset.

  • Ground Truth: The expected correct answer to the question, supplied from the evaluation dataset.

  • Output: The generated output or response produced by the recipe.

  • Metric: The evaluation metric used to assess the quality of the output.

  • Score: The numerical score or result obtained based on the evaluation metric. It quantifies the effectiveness or accuracy of the output.

  • Justification: An explanation or reasoning behind the obtained score or evaluation result. It may include details regarding the evaluation process, model performance, or specific observations.

  • Error: An error message if the evaluation process encounters an error.
