Optimization Insights

Upon the completion of the Prompt Optimization Experiment, the execution results are categorized into three sections:

  • Overview

  • Evaluation

  • Trials

A comprehensive explanation of each section is provided below:

Overview

You can access detailed information for each element by clicking the "View" button associated with the respective section.

Best Candidate

It highlights the best LLM candidate with its parameter specifications, along with the refined prompt. You can either open the refined prompt in the Prompt Playground to explore it further or copy it directly. It also includes a justification for the selection, outlining the rationale behind choosing this candidate as the best-performing option.

Refined prompt:

**Task:** Classify the intent of a user's question into one of two categories: "DomainRelated" or "OutOfDomain."

**Categories:**
- **DomainRelated:** Questions that are sensible, factual, educational, or related to any general knowledge or broad topic.
- **OutOfDomain:** Includes greetings, non-sensical or irrelevant content, expressions of dissatisfaction, or indications of ending the conversation.

**Guidelines:**
1. **DomainRelated:**
   - If the question is factual, educational, or pertains to general knowledge, classify it as "DomainRelated."
   - Examples: "How do you make a paper airplane?", "What is the lifespan of a butterfly?", "What are the benefits of exercise?"
   - Consider the context of the previous conversation if it helps, but prioritize the nature of the question itself.

2. **OutOfDomain:**
   - If the input is a greeting, small talk, or doesn't make sense, classify it as "OutOfDomain."
   - Examples: "Hello!", "I don't like this.", "Goodbye."
   - If the question is not seeking information or is irrelevant to any general topic, classify it as "OutOfDomain."

**Instructions:**
- Review the conversation history if provided, but focus on the question's content.
- Use the examples provided to guide your classification.
- Output the classification as a single word: "DomainRelated" or "OutOfDomain."

**Response Format:** Provide the classification as a single word—either "DomainRelated" or "OutOfDomain."

**Domain Information:** Any general subject or field relevant to sensible questions and queries.

**Example Conversation:**
- Human: What is the significance of the Patterns tab on the Logs Insights page?
- AI: The Patterns tab on the CloudWatch Logs Insights page is a feature that automatically identifies recurring patterns in your query results. This allows you to analyze these patterns in detail, making it easier to find what you're looking for and drill down into new or unexpected content in your logs. For example, when you run a query on a log group, the Patterns tab will display a list of recurring patterns found in the results. By clicking on a pattern, you can view all the log events that match it, as well as the frequency and distribution of those events over time. This can help you quickly identify trends and anomalies in your logs, and gain insights into the behavior of your applications and systems. Additionally, the Patterns tab includes a Compare button in the time interval selector, which lets you quickly compare the query result for the selected time range to a previous period, such as the previous day, week, or month. This makes it easy to see what has changed compared to a previous stable scenario, and can help you identify potential issues or areas for optimization. Overall, the Patterns tab is a powerful tool for analyzing log data in CloudWatch, and can help you quickly identify patterns and trends that might otherwise go unnoticed.

**Question:** {question}
**Classification:**
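
The `{question}` placeholder in the template above is a prompt input variable that is filled in at run time. A minimal sketch of that substitution in Python, assuming the template text is stored in a string named `refined_prompt` (the variable name and the `build_prompt` helper are hypothetical):

```python
# Minimal sketch of filling the {question} input variable before the prompt is
# sent to a model. `refined_prompt` is assumed to hold the full template shown
# above (truncated here); `build_prompt` is a hypothetical helper.

refined_prompt = """**Task:** Classify the intent of a user's question into one of two categories: "DomainRelated" or "OutOfDomain."
...
**Question:** {question}
**Classification:**"""

def build_prompt(question: str) -> str:
    # str.format fills the {question} placeholder defined in the template.
    return refined_prompt.format(question=question)

print(build_prompt("What is the lifespan of a butterfly?"))
# The model would be expected to answer "DomainRelated" for this input.
```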

Initial Prompt

It presents the original, unrefined prompt that serves as the starting point for optimization. This prompt is used as the basis for refinement to enhance clarity, accuracy, and effectiveness in generating structured responses.

Determine the intent behind the following input. Use these general rules:

Categories:
- 'DomainRelated': Use this label if the input seems related to some broad topic or area.
- 'OutOfDomain': Use this label for everything else, including greetings, irrelevant content, or anything not specific.

Guidelines:
- If the input is a general question or something that could fit into a broad topic, it’s probably 'DomainRelated'.
- If the input is just small talk, nonsense, or doesn't fit into any topic, go with 'OutOfDomain'.
- Consider the conversation before the input if it helps, but it’s not mandatory.

'DomainRelated' Criteria:
- If it looks like a question that could relate to anything specific or broad, label it as 'DomainRelated'.
- The context of the previous chat might help, but you can mostly ignore it.

'OutOfDomain' Criteria:
- If the input is just a greeting, doesn't make sense, or is off-topic, it’s 'OutOfDomain'.
- Anything not directly asking for information or continuing a discussion might be 'OutOfDomain'.
- If it’s not a question or feels irrelevant, it's probably 'OutOfDomain'.

Response Format: Provide the classification as a single word—either 'DomainRelated' or 'OutOfDomain'.

Domain Information: Any general subject or field relevant to sensible questions and queries.

Here is a chat discussion between you and a human. Examine the discussion to grasp the background of the previous conversation and answer the question based on the Discussion. In the chat discussion, you are referred to as AI and the user is referred to as Human.
- Human: What is the significance of the Patterns tab on the Logs Insights page?
- AI: The Patterns tab on the CloudWatch Logs Insights page is a feature that automatically identifies recurring patterns in your query results. This allows you to analyze these patterns in detail, making it easier to find what you're looking for and drill down into new or unexpected content in your logs. For example, when you run a query on a log group, the Patterns tab will display a list of recurring patterns found in the results. By clicking on a pattern, you can view all the log events that match it, as well as the frequency and distribution of those events over time. This can help you quickly identify trends and anomalies in your logs, and gain insights into the behavior of your applications and systems. Additionally, the Patterns tab includes a Compare button in the time interval selector, which lets you quickly compare the query result for the selected time range to a previous period, such as the previous day, week, or month. This makes it easy to see what has changed compared to a previous stable scenario, and can help you identify potential issues or areas for optimization. Overall, the Patterns tab is a powerful tool for analyzing log data in CloudWatch, and can help you quickly identify patterns and trends that might otherwise go unnoticed.

Question: {question}
Classification:

Prompt description

It displays a description of the prompt provided during the setup of the prompt optimization experiment. It outlines the initial intent and context of the prompt to guide the refinement process.

Evaluation dataset

This section displays a tabular representation of the uploaded CSV file, containing fields for each prompt input variable along with their corresponding ground truth answers.
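
For illustration, a dataset for the intent-classification prompt above might contain one column per prompt input variable (here `question`) plus a column with the ground truth answer. The column names and rows below are assumptions, not a required schema:

```python
import csv
import io

# Hypothetical evaluation dataset: one column per prompt input variable
# ("question") and one column for the expected (ground truth) answer.
sample_csv = """question,ground_truth
What is the lifespan of a butterfly?,DomainRelated
Hello there!,OutOfDomain
What is the significance of the Patterns tab on the Logs Insights page?,DomainRelated
Goodbye.,OutOfDomain
"""

for row in csv.DictReader(io.StringIO(sample_csv)):
    print(f'{row["question"]!r} -> {row["ground_truth"]}')
```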

Requested improvements

This section outlines the specific enhancements defined during the prompt optimization experiment setup. These improvements aim to refine the prompt by enhancing clarity, accuracy, structure, and relevance for better performance.

Judge LLM

It displays the LLM selected as the evaluation judge, responsible for assessing and determining the Best Candidate based on predefined criteria.

Max iterations

It displays the maximum number of iterations specified for the prompt optimization process, defining the limit for refinement cycles.

Candidate LLM

It contains the list of LLM candidates selected for evaluation and optimization during the prompt refinement process.
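
Taken together, the items above make up the experiment configuration. A hypothetical summary of such a setup is sketched below; none of the keys or values represent a Karini API, they simply mirror the fields described in this section:

```python
# Illustrative summary of a prompt optimization experiment setup. The keys
# and values are assumptions that mirror the fields described above.
experiment_setup = {
    "prompt_description": "Classify user questions as DomainRelated or OutOfDomain",
    "evaluation_dataset": "intent_eval.csv",  # prompt input variables + ground truth
    "requested_improvements": ["clarity", "accuracy", "structure", "relevance"],
    "judge_llm": "example-judge-model",       # assesses candidates and picks the best
    "max_iterations": 5,                      # limit on refinement cycles
    "candidate_llms": ["example-candidate-a", "example-candidate-b"],
}
```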

Evaluation

The Evaluation feature systematically assesses multiple candidate Large Language Models (LLMs) based on their total score, which is derived from key performance metrics, including cost efficiency, execution performance, and output quality. The model achieving the highest total score is designated as the best candidate for deployment.

By selecting a specific model and clicking the View button, you can access a comprehensive set of evaluation metrics.

You can open a specific model's page in the Model Hub by clicking the hover link next to each model listed in the left panel.

The evaluation is categorized into the following key areas:

  • Cost Efficiency Score: This metric, accounting for 25% of the evaluation, measures how effectively the prompt-model combination balances cost and performance, including Token Usage Efficiency and Cost-Performance Ratio. High scores indicate cost-effective resource use with strong performance, while low scores suggest inefficiency.

    • Token usage efficiency: Evaluates how efficiently the combination uses tokens. A score of 0-2 indicates excessive token use, while a 9-10 reflects minimal waste and efficient usage.

    • Cost performance ratio: Assesses the value provided relative to cost. A score of 0-2 reflects poor value with high costs and low performance, while a 9-10 indicates excellent value with strong performance at a reasonable cost.

  • Performance Score: Contributing 35% to the overall score, this metric assesses the prompt-model combination’s responsiveness and reliability. High scores reflect quick, consistent responses, while low scores suggest delays or inconsistency.

    • Latency: Evaluates the model’s response time. A score of 0-2 indicates slow responses, unsuitable for most use cases, while a 9-10 reflects fast responses ideal for real-time applications.

    • Consistency: Measures reliability across different inputs. A score of 0-2 suggests inconsistency, while a 9-10 indicates reliable performance across scenarios.

  • Output Quality Score: Making up 40% of the total score, this metric evaluates the quality of outputs based on accuracy, relevance, clarity, and comprehensiveness. High scores reflect outputs closely aligned with task requirements, while low scores indicate potential issues in output quality.

    • Accuracy: Assesses alignment with expected results. A score of 0-2 indicates significant discrepancies, while a 9-10 reflects near-perfect alignment.

    • Relevance: Evaluates how closely outputs address the task. A score of 0-2 suggests outputs are off-topic, while a 9-10 indicates high relevance and alignment.

    • Clarity: Measures how understandable the outputs are. A score of 0-2 suggests confusing results, while a 9-10 reflects clear and easily interpretable outputs.

    • Comprehensiveness: Evaluates whether outputs cover all necessary aspects. A score of 0-2 indicates significant omissions, while a 9-10 suggests thorough and complete coverage.

  • Total Score: Represents the overall evaluation of the model, combining the cost efficiency, performance, and output quality metrics according to their stated weights (a worked sketch of this calculation follows this list).

  • Refined prompt and justification: The system presents the refined prompt generated by the model, accompanied by a justification explaining the rationale behind the model's selection and prompt optimization.
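
Below is a minimal worked sketch of how the weighted Total Score described above could be computed from the three category scores, using the stated 25% / 35% / 40% weights. Averaging the sub-metrics within each category is an assumption made purely for illustration:

```python
# Worked example of the weighted Total Score (25% cost efficiency,
# 35% performance, 40% output quality). Sub-metric scores use the
# documented 0-10 scale; averaging them per category is an assumption
# made for illustration only.

def category_score(sub_metrics: dict) -> float:
    return sum(sub_metrics.values()) / len(sub_metrics)

cost_efficiency = category_score({"token_usage_efficiency": 8, "cost_performance_ratio": 7})
performance = category_score({"latency": 9, "consistency": 8})
output_quality = category_score({"accuracy": 9, "relevance": 9, "clarity": 8, "comprehensiveness": 7})

total_score = 0.25 * cost_efficiency + 0.35 * performance + 0.40 * output_quality
print(f"Total score: {total_score:.2f}")  # 0.25*7.5 + 0.35*8.5 + 0.40*8.25 = 8.15
```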

Trials

The Trial Overview section provides a structured summary of all trials conducted, offering key details that facilitate model evaluation and comparison.

  • Each trial is uniquely identified by a Trial Name, allowing you to distinguish between different test instances.

  • The Model Provider specifies the platform hosting the AI model, such as Amazon Bedrock or Azure OpenAI.

  • The Model Name & ID uniquely define the specific AI system being tested.

  • The Iteration ID denotes different versions or reruns of the same model, ensuring traceability across multiple testing phases.

  • This section also tracks the completion status of critical trial stages, including Generate Outputs, Evaluate Prompt, and Refine Prompt, indicating whether each step has been successfully completed, skipped, or requires further action.

  • Additionally, the Total Score provides a quantitative assessment of the model’s performance, helping users compare effectiveness across trials.

  • The Best Trial flag acts as a performance indicator, marking the most optimal trial based on predefined criteria.

  • Finally, the Start & End Time fields log the duration of each trial, enabling users to monitor execution time and optimize testing efficiency.

This comprehensive overview ensures transparency in the model evaluation process, facilitating data-driven decision-making.
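
To make these columns concrete, here is a hypothetical sketch of a single row in the Trial Overview. The class and field names simply mirror the columns described above and are not a Karini API:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical representation of one row in the Trial Overview table.
# Field names mirror the columns described above; they are not a Karini API.
@dataclass
class TrialSummary:
    trial_name: str
    model_provider: str          # e.g. "Amazon Bedrock" or "Azure OpenAI"
    model_name_and_id: str
    iteration_id: str            # distinguishes versions or reruns of the same model
    generate_outputs_done: bool  # completion status of each critical stage
    evaluate_prompt_done: bool
    refine_prompt_done: bool
    total_score: float
    best_trial: bool             # flags the most optimal trial
    start_time: datetime
    end_time: datetime
```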

Each trial consists of a comprehensive assessment of the model, including key aspects such as the refined prompt, initial prompt, trial results, evaluation results, and execution status. The Trial Results section evaluates the model’s performance using an uploaded dataset, comparing the generated output against the ground truth to assess accuracy and effectiveness.

The trial results capture the following key performance metrics (a rough recording sketch follows this list):

  • Input Tokens: Number of tokens processed from the input query.

  • Output Tokens: Number of tokens generated by the model.

  • Response Time: Time taken to generate the output.

  • Errors: Any issues encountered during execution.
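
A rough sketch of how these per-sample metrics could be recorded when replaying a dataset row against a candidate model. `call_model` and the whitespace-based token counts are stand-ins for whatever client and tokenizer are actually in use:

```python
import time

def call_model(prompt: str) -> str:
    """Placeholder for the real LLM client call; purely hypothetical."""
    return "DomainRelated"

def run_sample(prompt: str, ground_truth: str) -> dict:
    start = time.perf_counter()
    error = None
    try:
        output = call_model(prompt)
    except Exception as exc:  # any issue encountered during execution
        output, error = "", str(exc)
    return {
        "input_tokens": len(prompt.split()),    # crude stand-in for a real tokenizer
        "output_tokens": len(output.split()),
        "response_time_s": time.perf_counter() - start,
        "error": error,
        "matches_ground_truth": output.strip() == ground_truth,
    }

print(run_sample("Classify: What is the lifespan of a butterfly?", "DomainRelated"))
```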

The Evaluation Results section further refines this analysis by computing a total score, which is derived from:

  • Prompt Analysis – Assessing the effectiveness of the input prompt.

  • Output Analysis – Evaluating the relevance, coherence, and correctness of the generated response.

Additionally, this section provides recommendations for improvement and highlights underperforming samples, enabling iterative enhancements to the model's performance. The execution status of the trial is determined by the completion of the key tasks Generate Outputs, Evaluate Prompt, and Refine Prompt, which are systematically recorded in the Trial Table.

Any errors encountered during the process are logged under Error Analysis, ensuring transparency and facilitating a structured approach to continuous model refinement.


[Screenshots: Individual trial, Trial results, Evaluation results]