Optimization Insights
Upon the completion of the Prompt Optimization Experiment, the execution results are categorized into three sections:
Overview
Evaluation
Trials
A comprehensive explanation of each section is provided below:
Overview
You can access detailed information for each element by clicking the "View" button associated with the respective section.
Best Candidate
It highlights the best LLM candidate with its parameter specifications and the refined prompt. You can either open the refined prompt in the Prompt Playground to explore it further or copy it directly. It also includes a justification outlining the rationale for selecting this candidate as the best-performing option.
Initial Prompt
It presents the original, unrefined prompt that serves as the starting point for optimization. This prompt is used as the basis for refinement to enhance clarity, accuracy, and effectiveness in generating structured responses.
Prompt description
It displays a description of the prompt provided during the setup of the prompt optimization experiment. It outlines the initial intent and context of the prompt to guide the refinement process.
Evaluation dataset
This section displays a tabular representation of the uploaded CSV file, containing fields for each prompt input variable along with their corresponding ground truth answers.
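As an illustration, the sketch below loads such a dataset with Python's standard csv module and pairs each row's input variables with its ground-truth answer. The file name and the ground_truth column name are assumptions for the example; the actual columns mirror the input variables defined in your prompt.

```python
import csv

# Hypothetical file and column names; real columns mirror the prompt's
# input variables plus a ground-truth answer column.
with open("evaluation_dataset.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    inputs = {k: v for k, v in row.items() if k != "ground_truth"}
    expected = row["ground_truth"]
    print(inputs, "->", expected)
```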
Requested improvements
This section outlines the specific enhancements defined during the prompt optimization experiment setup. These improvements aim to refine the prompt by enhancing clarity, accuracy, structure, and relevance for better performance.
Judge LLM
It displays the LLM selected as the evaluation judge, responsible for assessing and determining the Best Candidate based on predefined criteria.
Max iterations
It displays the maximum number of iterations specified for the prompt optimization process, defining the limit for refinement cycles.
Candidate LLM
It contains the list of LLM candidates selected for evaluation and optimization during the prompt refinement process.
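For orientation, the sketch below shows how these settings could fit together conceptually: each candidate LLM is run through up to the configured maximum number of refinement cycles, with the judge LLM scoring each result. The function and field names are placeholders, not the product's actual API.

```python
def optimize(initial_prompt, candidate_llms, dataset, max_iterations,
             generate_outputs, judge_score, refine_prompt):
    """Conceptual sketch only; generate_outputs, judge_score, and
    refine_prompt are placeholder callables supplied by the caller."""
    best = {"candidate": None, "prompt": initial_prompt, "score": float("-inf")}
    for candidate in candidate_llms:            # each Candidate LLM is trialed
        prompt = initial_prompt
        for _ in range(max_iterations):         # Max iterations bounds refinement cycles
            outputs = generate_outputs(candidate, prompt, dataset)   # Generate Outputs
            score = judge_score(prompt, outputs, dataset)            # Judge LLM evaluates
            if score > best["score"]:
                best = {"candidate": candidate, "prompt": prompt, "score": score}
            prompt = refine_prompt(prompt, outputs)                  # Refine Prompt
    return best                                 # surfaced as the Best Candidate
```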
Evaluation
The Evaluation feature systematically assesses multiple candidate Large Language Models (LLMs) based on their total score, which is derived from key performance metrics, including cost efficiency, execution performance, and output quality. The model achieving the highest total score is designated as the best candidate for deployment.
By selecting a specific model and clicking the View button, you can access a comprehensive set of evaluation metrics.
You can open a specific model's page in the Model Hub by clicking the link that appears when you hover over a model listed in the left panel.
The evaluation is categorized into the following key areas:
Cost Efficiency Score: This metric, accounting for 25% of the evaluation, measures how effectively the prompt-model combination balances cost and performance, including Token Usage Efficiency and Cost-Performance Ratio. High scores indicate cost-effective resource use with strong performance, while low scores suggest inefficiency.
Token usage efficiency: Evaluates how efficiently the combination uses tokens. A score of 0-2 indicates excessive token use, while a 9-10 reflects minimal waste and efficient usage.
Cost performance ratio: Assesses the value provided relative to cost. A score of 0-2 reflects poor value with high costs and low performance, while a 9-10 indicates excellent value with strong performance at a reasonable cost.
Performance Score: Contributing 35% to the overall score, this metric assesses the prompt-model combination’s responsiveness and reliability. High scores reflect quick, consistent responses, while low scores suggest delays or inconsistency.
Latency: Evaluates the model’s response time. A score of 0-2 indicates slow responses, unsuitable for most use cases, while a 9-10 reflects fast responses ideal for real-time applications.
Consistency: Measures reliability across different inputs. A score of 0-2 suggests inconsistency, while a 9-10 indicates reliable performance across scenarios.
Output Quality Score: Making up 40% of the total score, this metric evaluates the quality of outputs based on accuracy, relevance, clarity, and comprehensiveness. High scores reflect outputs closely aligned with task requirements, while low scores indicate potential issues in output quality.
Accuracy: Assesses alignment with expected results. A score of 0-2 indicates significant discrepancies, while a 9-10 reflects near-perfect alignment.
Relevance: Evaluates how closely outputs address the task. A score of 0-2 suggests outputs are off-topic, while a 9-10 indicates high relevance and alignment.
Clarity: Measures how understandable the outputs are. A score of 0-2 suggests confusing results, while a 9-10 reflects clear and easily interpretable outputs.
Comprehensiveness: Evaluates whether outputs cover all necessary aspects. A score of 0-2 indicates significant omissions, while a 9-10 suggests thorough and complete coverage.
Total Score: Represents the overall evaluation of the model, combining the cost efficiency, performance, and output quality metrics; a worked scoring sketch follows this list.
Refined prompt and justification: The system presents the refined prompt generated by the model, accompanied by a justification explaining the rationale behind the model's selection and prompt optimization.
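As a worked example, the sketch below combines the three categories with the 25% / 35% / 40% weights described above. It assumes each category score is the plain average of its 0-10 sub-metrics; the product may aggregate them differently.

```python
def total_score(token_efficiency, cost_performance,   # cost efficiency sub-metrics
                latency, consistency,                  # performance sub-metrics
                accuracy, relevance, clarity, comprehensiveness):  # output quality sub-metrics
    """Weighted total on the same 0-10 scale, assuming simple averaging per category."""
    cost_efficiency = (token_efficiency + cost_performance) / 2
    performance = (latency + consistency) / 2
    output_quality = (accuracy + relevance + clarity + comprehensiveness) / 4
    return 0.25 * cost_efficiency + 0.35 * performance + 0.40 * output_quality

# Example: strong output quality with moderate cost and speed scores
print(total_score(7, 8, 6, 7, 9, 9, 8, 9))  # ≈ 7.65
```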
Trials
The Trial Overview section provides a structured summary of all trials conducted, offering key details that facilitate model evaluation and comparison.
Each trial is uniquely identified by a Trial Name, allowing you to distinguish between different test instances.
The Model Provider specifies the platform hosting the AI model, such as Amazon Bedrock or Azure OpenAI.
The Model Name & ID uniquely define the specific AI system being tested.
The Iteration ID denotes different versions or reruns of the same model, ensuring traceability across multiple testing phases.
This section also tracks the completion status of critical trial stages, including Generate Outputs, Evaluate Prompt, and Refine Prompt, indicating whether each step has been successfully completed, skipped, or requires further action.
Additionally, the Total Score provides a quantitative assessment of the model’s performance, helping users compare effectiveness across trials.
The Best Trial flag acts as a performance indicator, marking the most optimal trial based on predefined criteria.
Finally, the Start & End Time fields log the duration of each trial, enabling users to monitor execution time and optimize testing efficiency.
This comprehensive overview ensures transparency in the model evaluation process, facilitating data-driven decision-making.
Each trial consists of a comprehensive assessment of the model, including key aspects such as the refined prompt, initial prompt, trial results, evaluation results, and execution status. The Trial Results section evaluates the model’s performance using an uploaded dataset, comparing the generated output against the ground truth to assess accuracy and effectiveness.
The trial results capture these key performance metrics (a minimal record sketch follows the list):
Input Tokens: Number of tokens processed from the input query.
Output Tokens: Number of tokens generated by the model.
Response Time: Time taken to generate the output.
Errors: Any issues encountered during execution.
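A minimal record for these per-sample metrics might look like the sketch below; the class and field names are illustrative, not the platform's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrialResult:
    """Illustrative per-sample trial record; names are assumptions."""
    input_tokens: int            # tokens processed from the input query
    output_tokens: int           # tokens generated by the model
    response_time_ms: float      # time taken to generate the output
    error: Optional[str] = None  # any issue encountered during execution

sample = TrialResult(input_tokens=412, output_tokens=128, response_time_ms=950.0)
```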
The Evaluation Results section further refines this analysis by computing a total score, which is derived from:
Prompt Analysis – Assessing the effectiveness of the input prompt.
Output Analysis – Evaluating the relevance, coherence, and correctness of the generated response.
Additionally, this section provides recommendations for improvement and highlights underperforming samples, enabling iterative enhancements to the model's performance. The execution status of the trial is determined by the completion of the key tasks (Generate Outputs, Evaluate Prompt, and Refine Prompt), which are systematically recorded in the Trial Table.
Any errors encountered during the process are logged under Error Analysis, ensuring transparency and facilitating a structured approach to continuous model refinement.