Databricks Recipe

In the Databricks recipe, your Databricks workspace is used for tasks such as data ingestion, creating embeddings with Databricks foundation models, and running the recipe's data ingestion pipeline as a Databricks workflow job.

For a general overview of how to create a recipe, refer to the Create Recipe guide.

Prerequisites

Before you start using the Databricks workspace in Karini AI, complete the following steps to ensure everything is set up correctly.

Step 1: Create EC2 Instance Role with STS AssumeRole Policy

Create an AWS IAM role to serve as the Databricks EC2 instance profile. Databricks uses this instance profile during job cluster launch; see Step 4 for how it is registered with Karini AI. The IAM role needs to include permissions for the following AWS resources:

  1. Amazon Textract

  2. Amazon Comprehend

  3. Amazon S3

  4. Other AWS services as necessary

  5. An STS AssumeRole policy to obtain temporary security credentials for making API calls to AWS services.

Here is an example policy for this role.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": [
				"textract:DetectDocumentText",
				"textract:AnalyzeDocument"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"comprehend:DetectEntities",
				"comprehend:DetectKeyPhrases",
				"comprehend:DetectSentiment",
				"comprehend:DetectSyntax",
				"comprehend:DetectDominantLanguage",
				"comprehend:ClassifyDocument"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket",
				"s3:GetObject",
				"s3:PutObject",
				"s3:DeleteObject",
				"s3:GetBucketLocation"
			],
			"Resource": [
				"arn:aws:s3:::*",
				"arn:aws:s3:::/"
			]
		}
	]
}
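
If you prefer to script this step, the following is a minimal sketch using boto3. The role name matches the example in Step 2; the policy file name karini_instance_profile_policy.json is a placeholder for the permissions policy shown above. The trust policy is what permits the EC2 instances to assume the role via STS.

# Illustrative sketch: create the instance profile role with boto3.
# Names and file paths are placeholders; adjust the policy to your buckets.
import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "databricks_karini_ec2_instance_profile"  # placeholder name

# Trust policy so EC2 instances launched by Databricks can assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the permissions policy shown above (Textract, Comprehend, S3),
# saved locally under a hypothetical file name.
with open("karini_instance_profile_policy.json") as f:
    iam.put_role_policy(
        RoleName=ROLE_NAME,
        PolicyName="karini-workspace-access",
        PolicyDocument=f.read(),
    )

# Wrap the role in an instance profile so Databricks can reference it.
iam.create_instance_profile(InstanceProfileName=ROLE_NAME)
iam.add_role_to_instance_profile(
    InstanceProfileName=ROLE_NAME, RoleName=ROLE_NAME
)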

Step 2: Assign IAM Pass Role Policy to Databricks Stack Role

In the AWS IAM console, locate the stack role that was created for the Databricks workspace deployment. Attach an iam:PassRole policy to this role that allows it to pass the EC2 instance profile role created in the earlier step.

Here is an example policy:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "iam:PassRole",
			"Resource": "arn:aws:iam::111111111111:role/databricks_karini_ec2_instance_profile"
		}
	]
}

Step 3: Set Up Databricks Unity Catalog

The Databricks Unity Catalog is essential for managing your data and its associated metadata. Make sure that you have set up a Unity Catalog in your workspace; refer to the Databricks Unity Catalog documentation for details. Karini AI also offers integrations with Databricks-hosted models, including foundation models, custom models, and external models; refer to the Databricks documentation on model serving for more details.

Step 4: Set Up Databricks Credentials in Karini AI Organization

To set up access to your Databricks resources, configure your Databricks credentials in Karini AI's organization settings. Include the EC2 instance role created in the earlier step along with the Databricks SQL Warehouse HTTP path. Provide the cluster ID only if you plan to use a dedicated cluster, and ensure the cluster is created with ML Runtime 14.2.

Create a recipe

To create a new recipe, go to the Recipe page, click Add New, and select Databricks as the runtime option. Provide a recipe name and description.

To create a knowledge base using the Databricks vector store, select the recipe Type as QNA. To build a RAG recipe using agents, refer to the Agent Recipe section instead.

Source

Source defines your data storage connector. Drag and drop the Source element onto the recipe canvas and select an appropriate data connector type from the list of available connectors; more information about connectors is available in the Data Storage Connectors section.

The Databricks recipe additionally provides a Databricks Unity Catalog Volume connector, which requires the following configuration:

  • Connector Type: Databricks

  • Databricks Connector Type: Volume (Unity Catalog Volume)

  • Catalog: Name of the Databricks Unity Catalog where the source data files are located.

  • Schema: Schema name within the Unity Catalog where the source data files are located.

  • Volume name: Name of the Unity Catalog Volume that contains your source data files.

  • Subfolder path: If applicable, the subfolder path within the Unity Catalog Volume.

Dataset

A dataset functions as an internal collection of dataset items, acting as references to the knowledge base; for more details, refer to the Datasets documentation. Configure the following options for data preprocessing in Databricks:

Bronze Table:

Catalog, schema, and table name of the bronze table in your Databricks workspace. Karini AI will create these resources if they don't already exist.

Silver Table:

Catalog, schema, and table name of the silver table in your Databricks workspace. Karini AI will create these resources if they don't already exist.
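
For reference, bronze and silver tables are addressed by three-level Unity Catalog names (catalog.schema.table). Below is a minimal illustrative sketch, runnable in a Databricks notebook (where spark is predefined); the names and column layout are placeholders, not the actual layout Karini AI creates.

# Illustrative only: bronze/silver tables use three-level Unity Catalog names.
bronze_table = "main.karini.bronze_documents"   # <catalog>.<schema>.<table>
silver_table = "main.karini.silver_documents"

spark.sql("CREATE SCHEMA IF NOT EXISTS main.karini")
for table in (bronze_table, silver_table):
    # Placeholder columns; Karini AI creates its own layout if tables are missing.
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(path STRING, content STRING, ingested_at TIMESTAMP)"
    )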

Preprocessing Options:

Karini AI provides various options for data preprocessing.

OCR:

For source data that contains PDF or image files, you can perform Optical Character Recognition (OCR) by selecting one of the following options:

  • Unstructured IO with Extract Images: This method is used for extracting images from unstructured data sources. It processes unstructured documents, identifying and extracting images that can be further analyzed or used in different applications.

  • PyMuPDF with Fallback to Amazon Textract: This approach uses PyMuPDF to extract text and images from PDF documents. If PyMuPDF fails or yields insufficient results, the process falls back to Amazon Textract, ensuring comprehensive extraction by leveraging Amazon's advanced OCR capabilities (see the sketch after this list).

  • Amazon Textract with Extract Table: Amazon Textract is used to extract structured data, such as tables, from documents. This method specifically focuses on identifying and extracting tabular data, making it easier to analyze and use structured information from scanned documents or PDFs.
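
The fallback option above can be pictured with a small sketch. This is not Karini AI's implementation, just an illustration of the pattern: extract text with PyMuPDF, and send pages that yield no text to Amazon Textract for OCR.

# Illustrative sketch of "PyMuPDF with fallback to Amazon Textract".
import boto3
import fitz  # PyMuPDF

textract = boto3.client("textract")

def extract_pdf_text(path: str) -> list[str]:
    pages = []
    doc = fitz.open(path)
    for page in doc:
        text = page.get_text().strip()
        if not text:
            # Fallback: render the page to PNG and send it to Textract OCR.
            png_bytes = page.get_pixmap(dpi=200).tobytes("png")
            response = textract.detect_document_text(
                Document={"Bytes": png_bytes}
            )
            text = "\n".join(
                block["Text"]
                for block in response["Blocks"]
                if block["BlockType"] == "LINE"
            )
        pages.append(text)
    return pages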

PII:

If you need to mask Personally Identifiable Information (PII) within your dataset, enable the PII Masking option. You can select the entities that you want masked during data pre-processing; refer to the PII entities documentation for the full list of supported entities.
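
To illustrate what entity-based masking does, here is a hedged sketch. Using Amazon Comprehend's detect_pii_entities is an assumption made for the example; the detector Karini AI actually uses and its entity list may differ.

# Illustrative sketch of offset-based PII masking (detector choice is an assumption).
import boto3

comprehend = boto3.client("comprehend")

def mask_pii(text: str, entities_to_mask: set[str]) -> str:
    response = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
    masked = text
    # Apply replacements from the end so earlier offsets stay valid.
    for ent in sorted(
        response["Entities"], key=lambda e: e["BeginOffset"], reverse=True
    ):
        if ent["Type"] in entities_to_mask:
            masked = (
                masked[: ent["BeginOffset"]]
                + f"[{ent['Type']}]"
                + masked[ent["EndOffset"] :]
            )
    return masked

print(mask_pii("Contact Jane Doe at jane@example.com", {"NAME", "EMAIL"}))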

Link your Source element to the Dataset element in the recipe canvas to start building your data ingestion pipeline.

Knowledge base

To integrate Databricks Vector Search into your data ingestion workflow, start by placing the Knowledge base element onto your recipe canvas.

Knowledge base provider:

Select Databricks Vector Search as the knowledge base provider.

Pipeline Type:

Select the appropriate sync mode for updating the Databricks vector search index. The most cost-effective option is Triggered; only select Continuous if you need to incrementally sync the index to changes in the source table with a latency of seconds. Both sync modes perform incremental updates – only data that has changed since the last sync is processed. Refer to the Databricks vector search best practices for details.

Vector endpoint:

Provide the name of an existing vector search endpoint that you want to use; otherwise, a new vector search endpoint is provisioned.

Vector Index Name:

Specify the name of the Databricks vector search index to be created or updated.

Catalog:

Provide the name of the catalog for the vector search index.

Schema:

Provide the name of the schema for the vector search index.

Link the Dataset element to the Knowledge base element in the recipe canvas to link your data with the vector store.
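
For reference, the settings above correspond to a Databricks Vector Search endpoint and a Delta Sync index over the silver table. A minimal sketch using the databricks-vectorsearch SDK is shown below; all endpoint, index, table, column, and embedding model names are placeholders.

# Illustrative sketch of what the Knowledge base settings map to on Databricks.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Create the endpoint if you don't already have one (Karini AI provisions it otherwise).
client.create_endpoint(name="karini-vs-endpoint", endpoint_type="STANDARD")

# The source Delta table must have Change Data Feed enabled for a delta sync index.
index = client.create_delta_sync_index(
    endpoint_name="karini-vs-endpoint",
    index_name="main.karini.docs_index",          # catalog.schema.index
    source_table_name="main.karini.silver_documents",
    pipeline_type="TRIGGERED",                    # or "CONTINUOUS"
    primary_key="id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="databricks-bge-large-en",
)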

Data Processing Preferences

When you link the Dataset element to the Knowledge base element, you can set the data processing options that aid the vector embeddings creation process. Refer to the Data Processing Preferences documentation for details.

At this point, you can save and publish the recipe and start the run to push down data ingestion and vector index creation to your Databricks workspace. Once the run is complete, the knowledge base can be accessed using the Databricks vector search index.

Prompt

You can select from the existing prompts in the prompt playground to add to the recipe. A prompt is associated with an LLM and model parameters.

Link the Knowledge base element to the Prompt element in the recipe canvas, establishing a link that allows the Prompt to access and utilize the Databricks vector search index for context retrieval.

Context Generation using Vector Search

When you link the Knowledge base element to the Prompt element, you can set the context generation preferences that define how your prompt obtains context for the user query.

These options provide several techniques to improve the relevance and quality of your vector search.

Use Embedding chunks:

Choosing this feature conducts a semantic search to retrieve the top_k most similar vector-embedded document chunks and uses these chunks to create a contextual prompt for the Large Language Model (LLM).

Summarize chunks:

Choosing this feature conducts a semantic search to retrieve the top_k most similar vector-embedded document chunks and then summarizes these chunks to create a contextual prompt for the Large Language Model (LLM).

Top_k:

Maximum number of top matching vectors to retrieve.

Enable Reranker:

Re-ranking improves search relevance by reordering the result set based on the relevancy score. To enable the reranker, you must first set the reranker model in the Organization settings. You can configure the following options for the reranker (a sketch combining retrieval and reranking follows the list):

  • Top-N: Maximum number of top-ranking vectors to retain. This number must be less than the top_k parameter.

  • Reranker Threshold: A threshold for the relevancy score. The reranker model will select the Top-N vectors that score above the set threshold.
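
The sketch below illustrates how top_k, Top-N, and the reranker threshold interact during context retrieval. The similarity_search call uses the databricks-vectorsearch SDK; rerank stands in for whichever reranker model you have configured, so treat it as a placeholder, along with the endpoint and index names.

# Illustrative sketch of retrieval with top_k, then reranking to Top-N above a threshold.
from databricks.vector_search.client import VectorSearchClient

TOP_K = 10          # max vectors fetched by semantic search
TOP_N = 4           # max vectors kept after reranking (must be < TOP_K)
THRESHOLD = 0.5     # minimum relevancy score from the reranker

index = VectorSearchClient().get_index(
    endpoint_name="karini-vs-endpoint",
    index_name="main.karini.docs_index",
)

def build_context(query: str, rerank) -> str:
    # rerank(query, chunks) -> list of float relevancy scores, one per chunk.
    hits = index.similarity_search(
        query_text=query,
        columns=["chunk_text"],
        num_results=TOP_K,
    )["result"]["data_array"]
    chunks = [row[0] for row in hits]

    # Rerank, keep chunks above the threshold, then truncate to Top-N.
    scored = sorted(zip(rerank(query, chunks), chunks), reverse=True)
    kept = [chunk for score, chunk in scored if score >= THRESHOLD][:TOP_N]
    return "\n\n".join(kept)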

Output

By adding an Output element to the recipe, you can test the recipe and analyze the responses. For details on configuring the Output element, refer to the Output documentation.

Link the Prompt element to the Output element in the recipe canvas, establishing a link that allows the Output element to access the response generated by the LLM using the prompt.

Saving a Recipe

You can save the recipe at any point during the creation. Saving the recipe preserves all configurations and connections made in the recipe setup.

Save and publish recipe

Once a recipe is created and saved, you need to publish it to assign it a version number. A Run button is enabled after the recipe has been published.

Run Recipe

The recipe "run" process starts the data ingestion pipeline, executing tasks to connect to the data source, pre-process the data, and create vector embeddings as per the configurations in the recipe elements. For the Databricks runtime, data ingestion and vector index creation are carried out in the Databricks workspace according to the configurations set in the previous sections; refer to the Dataset and Knowledge base sections for details.

Configurations related to the prompt and output elements are not relevant for the recipe run.

After the recipe run, you can view the following processing metrics on the recipe dashboard, highlighting each processing task.

  • X-axis: Tasks

    • OCR (Optical Character Recognition): Extraction of text from images or scanned documents.

    • PII (Personally Identifiable Information): Identification and handling of sensitive personal data such as names, addresses, or social security numbers.

    • Chunking: Division of text into smaller, meaningful parts or "chunks" for analysis or processing.

    • Embeddings: Conversion of text data into numerical format for machine learning algorithms by mapping words or phrases to vectors in a high-dimensional space.

  • Y-axis: Count of processed items.

If errors occur during the recipe run, error messages are displayed in the recipe panel and can also be visualized as error counts in the dashboard.

You can also review the summary of the run, including a list of connectors with embedded items and chunks.

Task Details

Under the graph, you can find a detailed count of processed items alongside a comprehensive summary of tasks executed within the Databricks workflow.

You can click the link icon to access and review the respective job runs within your Databricks workspace.

Test recipe

After creating the recipe, it can be tested for its performance. Refer to the Test Recipe section for a detailed guide on how to test the recipe.

Evaluate Recipe

Refer to the Evaluate Recipe documentation for details.

Export recipe

After a successful run and testing, the recipe can be exported to deploy a copilot. For details, refer to Export Recipe.

Recipe runs

For details, refer to Recipe Runs.

Copilots

After exporting the recipe, you have the opportunity to explore and experiment with copilots. For detailed information on copilots and their features, refer to the Copilots documentation.
