Databricks Recipe

In the Databricks recipe, your Databricks workspace is used for tasks such as data ingestion, embedding creation, serving Databricks foundation models, and running the recipe's data ingestion pipeline as a Databricks workflow job.

For the general overview of how to create a recipe, refer to Create Recipe.

Prerequisites

Before you start using the Databricks workspace in Karini AI, complete the following steps to ensure everything is set up correctly.

Step 1: Create EC2 Instance Role with STS AssumeRole Policy

Create an AWS IAM role to use as the Databricks EC2 instance profile. The role needs permissions for the following AWS resources:

  1. Amazon Textract

  2. Amazon Comprehend

  3. Amazon S3

  4. Other AWS services as necessary

  5. STS assume role policy to obtain temporary security credentials to make API calls to AWS services.

Here is an example policy for this role.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": [
				"textract:DetectDocumentText",
				"textract:AnalyzeDocument"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"comprehend:DetectEntities",
				"comprehend:DetectKeyPhrases",
				"comprehend:DetectSentiment",
				"comprehend:DetectSyntax",
				"comprehend:DetectDominantLanguage",
				"comprehend:ClassifyDocument"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket",
				"s3:GetObject",
				"s3:PutObject",
				"s3:DeleteObject",
				"s3:GetBucketLocation"
			],
			"Resource": [
				"arn:aws:s3:::*",
				"arn:aws:s3:::/"
			]
		}
	]
}

This instance profile will be used by Databricks during the job cluster launch. Refer to Set up Databricks Credentials in Organization settings for details.
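Once a cluster is up with this instance profile, you can confirm from a notebook that the temporary credentials work. The sketch below is illustrative only; the region is an assumption, and boto3 ships with Databricks runtimes.

# Run in a Databricks notebook on a cluster launched with the instance profile.
import boto3

sts = boto3.client("sts", region_name="us-east-1")  # region is an assumption
print(sts.get_caller_identity()["Arn"])  # should resolve to the instance profile role

comprehend = boto3.client("comprehend", region_name="us-east-1")
print(comprehend.detect_dominant_language(Text="Hello from Karini AI")["Languages"])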

Step 2: Assign IAM Pass Role Policy to Databricks Stack Role

In the AWS IAM console, locate the stack role that was created for the Databricks workspace deployment. Attach an iam:PassRole policy to this role that allows passing the EC2 instance profile role created in the earlier step.

Here is an example policy.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "iam:PassRole",
			"Resource": "arn:aws:iam::111111111111:role/databricks_karini_ec2_instance_profile"
		}
	]
}
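If you want to verify the policy before launching a job cluster, the IAM policy simulator can check that the stack role is allowed to pass the instance profile. This is a hedged sketch: the stack role name below is a placeholder, while the instance profile ARN matches the example above.

import boto3

iam = boto3.client("iam")
result = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::111111111111:role/databricks-workspace-stack-role",  # placeholder stack role name
    ActionNames=["iam:PassRole"],
    ResourceArns=["arn:aws:iam::111111111111:role/databricks_karini_ec2_instance_profile"],
)
for r in result["EvaluationResults"]:
    print(r["EvalActionName"], r["EvalDecision"])  # expect "allowed"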

Step 3: Set up Databricks Unity Catalog

The Databricks Unity Catalog is essential for managing your data and its associated metadata. Make sure that you have set up a Unity Catalog; follow this guide for more details. Karini AI offers integrations with Databricks-hosted models, including foundation models, custom models, and external models. Refer to the Databricks documentation Model serving with Databricks for more details.
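If you still need to create the Unity Catalog objects that the recipe will reference, a minimal notebook sketch is shown below. The catalog, schema, and volume names are placeholders; spark is the session predefined in Databricks notebooks.

# Placeholders for the Unity Catalog objects used later in this guide.
catalog, schema, volume = "karini_demo", "raw", "source_files"

spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.{volume}")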

Step 4: Set up Databricks Credentials in Karini AI Organization

To set up access to your Databricks resources, configure your Databricks credentials in Karini AI's organization settings. Include the EC2 instance role created in the earlier step along with the Databricks SQL Warehouse HTTP path. Provide a cluster ID only if you plan to use a dedicated cluster, and ensure the cluster is created with ML Runtime 14.2.
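To look up the values that the organization settings ask for, you can use the Databricks SDK for Python. This is only a sketch, assuming authentication is already configured via environment variables or ~/.databrickscfg.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up host and token from the environment or ~/.databrickscfg

# SQL Warehouse HTTP path, e.g. /sql/1.0/warehouses/<warehouse-id>
for wh in w.warehouses.list():
    print(wh.name, wh.odbc_params.path if wh.odbc_params else "")

# Dedicated cluster ID and runtime; look for an ML runtime such as 14.2.x-cpu-ml-scala2.12
for c in w.clusters.list():
    print(c.cluster_id, c.spark_version)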

Create a recipe

To create a new recipe, go to the Recipe Page, click Add New, and select Databricks as the runtime option. Provide a recipe name and description.

To create a knowledge base using the Databricks vector store, set the recipe Type to QNA.

To build a RAG recipe using agents, refer to Agent Recipe.

Source

Source defines your data storage connector. Drag and drop the source element onto the recipe canvas. You can select an appropriate data connector type from the list of available connectors. More information about connectors is here.

The Databricks recipe additionally provides a Databricks Unity Catalog Volume connector, which requires the following details:

Connector Type:

Databricks

Databricks Connector Type:

Volume (Unity Catalog Volume)

Catalog:

Name of your Databricks Unity Catalog where the source data files are located.

Schema:

Schema name from your Databricks Unity Catalog where the source data files are located.

Volume name:

Databricks Unity Catalog Volume name which contains your source data files.

Subfolder path:

If applicable, provide the subfolder path within the Databricks Unity Catalog Volume.
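Before running the recipe, you can sanity-check these connector values from a notebook. The names below are placeholders, and dbutils and display are predefined in Databricks notebooks.

# Placeholders matching the connector fields above.
catalog, schema, volume, subfolder = "karini_demo", "raw", "source_files", "invoices"

path = f"/Volumes/{catalog}/{schema}/{volume}/{subfolder}"
display(dbutils.fs.ls(path))  # lists the files the Source connector will ingest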

Dataset

A dataset functions as an internal collection of dataset items, acting as references to the knowledge base. For more details, refer to Dataset.

Configure the following Databricks options for data preprocessing:

Bronze Table:

Catalog, schema, and table name of the bronze table in your Databricks workspace. Karini AI will create these resources if they don't already exist.

Silver Table:

Catalog, schema, and table name of the silver table in your Databricks workspace. Karini AI will create these resources if they don't already exist.

Preprocessing Options:

Karini AI provides various options for data preprocessing.

OCR:

For source data that contains PDF or image files, you can perform Optical Character Recognition (OCR) by selecting one of the following options:

  • Unstructured IO with Extract Images: This method is used for extracting images from unstructured data sources. It processes unstructured documents, identifying and extracting images that can be further analyzed or used in different applications.

  • PyMuPDF with Fallback to Amazon Textract: This approach uses PyMuPDF to extract text and images from PDF documents. If PyMuPDF fails or is insufficient, the process falls back to Amazon Textract, ensuring a comprehensive extraction by leveraging Amazon's advanced OCR capabilities (a sketch of this fallback pattern follows the list).

  • Amazon Textract with Extract Table: Amazon Textract is used to extract structured data, such as tables, from documents. This method specifically focuses on identifying and extracting tabular data, making it easier to analyze and use structured information from scanned documents or PDFs.
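The following is a minimal sketch of the PyMuPDF-with-Textract-fallback pattern mentioned above, not Karini AI's internal implementation. It assumes the PyMuPDF (fitz) package is installed and that the instance profile from Step 1 grants textract:DetectDocumentText.

import boto3
import fitz  # PyMuPDF

def extract_text(pdf_path: str) -> str:
    # Try fast, local extraction with PyMuPDF first.
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    if text.strip():
        return text

    # Fall back to Amazon Textract for scanned or image-only documents.
    # (DetectDocumentText is synchronous and handles single-page files;
    # multi-page PDFs require the asynchronous Textract APIs.)
    textract = boto3.client("textract")
    with open(pdf_path, "rb") as f:
        response = textract.detect_document_text(Document={"Bytes": f.read()})
    return "\n".join(
        block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"
    )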

PII:

If you need to mask Personally Identifiable Information (PII) within your dataset, you can enable the PII Masking option and select the entities that you want masked during data pre-processing. To learn more about the entities, refer to this documentation.
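As an illustration of what entity-based masking does (not Karini AI's own implementation, which is configured entirely through the option above), the sketch below uses Amazon Comprehend's detect_pii_entities call and replaces the selected entity types with placeholders.

import boto3

def mask_pii(text: str, entity_types: set) -> str:
    comprehend = boto3.client("comprehend")
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    # Replace selected spans from the end so earlier offsets remain valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if e["Type"] in entity_types:
            text = text[: e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text

print(mask_pii("Contact Jane Doe at jane@example.com", {"NAME", "EMAIL"}))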

Link your Source element to the Dataset element in the recipe canvas to start creating your data ingestion pipeline.

Knowledge base

To integrate Databricks Vector Search into your data ingestion workflow, start by placing the Knowledge base element onto your recipe canvas.

Knowledge base provider:

Select Databricks Vector Search as the knowledge base provider.

Pipeline Type:

Select the appropriate sync mode for updating the Databricks vector search index. The most cost-effective option is Triggered. Only select Continuous if you need to incrementally sync the index with changes in the source table at a latency of seconds. Both sync modes perform incremental updates – only data that has changed since the last sync is processed.

Refer to Databricks vector search best practices for details.

Vector endpoint:

Provide the existing vector search endpoint that you want to use; otherwise, a new vector search endpoint is provisioned.

Vector Index Name:

Specify the name of the Databricks vector search index to be created or updated.

Catalog:

Provide the name of the catalog for the vector search index.

Schema:

Provide the name of the schema for the vector search index.

Link the Dataset element to the Knowledge base element in the recipe canvas to link your data with the vector store.
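For reference, the sketch below shows roughly what the recipe run provisions, using the databricks-vectorsearch client directly. Endpoint, table, index, and embedding-endpoint names are placeholders; the source table corresponds to the silver table configured in the Dataset element.

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# Provision a vector search endpoint (skip this call if one already exists).
vsc.create_endpoint(name="karini_vs_endpoint", endpoint_type="STANDARD")

# Delta Sync index over the silver table, using the Triggered pipeline type.
vsc.create_delta_sync_index(
    endpoint_name="karini_vs_endpoint",
    index_name="karini_demo.rag.docs_index",
    source_table_name="karini_demo.rag.docs_silver",
    pipeline_type="TRIGGERED",                                 # or "CONTINUOUS"
    primary_key="id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="databricks-gte-large-en",   # assumed serving endpoint
)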

Data Processing Preferences

When you link the Dataset element to the Knowledge base element, you can set the data processing options to aid the vector embeddings creation process. Refer to Data Processing Preferences for details.

At this point, you can Save and publish recipe and start the Run to push down data ingestion and vector index creation to your Databricks workspace. Once the run is complete, the knowledge base can be accessed using the Databricks vector search index.

Prompt

You can select from the existing prompts in the prompt playground to add to the recipe. A prompt is associated with an LLM and model parameters.

Link the Knowledge base element to the Prompt element in the recipe canvas, establishing a link that allows the Prompt to access and utilize the Databricks vector search index for context retrieval.

When you link the Knowledge base element to the Prompt element, you can set the context generation preferences that define how your prompt obtains context for the user query.

These options provide several techniques to improve the relevance and quality of your vector search.

Use Embedding chunks:

Choosing this feature conducts a semantic search to retrieve the top_k most similar vector-embedded document chunks and uses these chunks to create a contextual prompt for the Large Language Model (LLM).

Summarize chunks:

Choosing this feature conducts a semantic search to retrieve the top_k most similar vector-embedded document chunks and then summarizes these chunks to create a contextual prompt for the Large Language Model (LLM).

Top_k:

Maximum number of top matching vectors to retrieve.

Enable Reranker:

Re-ranking improves search relevance by reordering the result set based on the relevancy score. To enable the reranker, you must set the reranker model in the Organization. You can configure the following options for the reranker.

  • Top-N: Maximum number of top-ranking vectors to retrieve. This number must be less than the top_k parameter.

  • Reranker Threshold: A threshold for the relevancy score. The reranker model will select the Top-N vectors that are above the set threshold.
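To make these options concrete, here is a hedged sketch of query-time retrieval against the index created earlier. Index, endpoint, and column names are placeholders, and the reranking step is indicated only in comments because the reranker model itself is configured in the Organization settings.

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
index = vsc.get_index(
    endpoint_name="karini_vs_endpoint",
    index_name="karini_demo.rag.docs_index",
)

top_k = 10
results = index.similarity_search(
    query_text="How do I rotate my credentials?",
    columns=["id", "chunk_text"],
    num_results=top_k,                      # Top_k: candidates fetched from the index
)
rows = results["result"]["data_array"]      # each row ends with its similarity score

# With the reranker enabled, these candidates are rescored and only the Top-N
# rows (N < top_k) whose reranker score exceeds the threshold are kept as
# context for the prompt.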

Output

By adding an Output element to the recipe, you can test the recipe and analyze the responses.

For configuration options of the Output element, refer to Output.

Link the Prompt element to the Output element in the recipe canvas, establishing a link that allows the Output element to access the response generated by the LLM using the prompt.

Saving a Recipe

You can save the recipe at any point during the creation. Saving the recipe preserves all configurations and connections made in the recipe setup.

Save and publish recipe

Once a recipe is created and saved, you need to publish it to assign it a version number. A Run button is enabled after the recipe has been published.

Run Recipe

The recipe "run" process starts the data ingestion pipeline, executing tasks to connect to the data source, pre-process the data, and create vector embeddings as per the configurations in the recipe elements. For the Databricks runtime, the data ingestion and vector index creation process is carried out in the Databricks workspace as per the configurations set in the previous sections. Refer to Dataset and Knowledge base for details.

Configurations related to the prompt and output elements are not relevant for the recipe run.

After the recipe run, you can view the following processing metrics on the recipe dashboard, highlighting each processing task.

  • X-axis: Tasks

    • OCR (Optical Character Recognition): Extraction of text from images or scanned documents.

    • PII (Personally Identifiable Information): Identification and handling of sensitive personal data such as names, addresses, or social security numbers.

    • Chunking: Division of text into smaller, meaningful parts or "chunks" for analysis or processing.

    • Embeddings: Conversion of text data into numerical format for machine learning algorithms by mapping words or phrases to vectors in a high-dimensional space.

  • Y-axis: Count of processed items.

If errors occur during the recipe run, error messages are displayed in the recipe panel and can also be visualized as error counts in the dashboard.

You can also review the summary of the run, including a list of connectors with embedded items and chunks.

Task Details

Under the graph, you can find a detailed count of processed items alongside a comprehensive summary of tasks executed within the Databricks workflow.

You can click the link icon to access and review the respective job runs within your Databricks workspace.

Test recipe

After creating the recipe, it can be tested for its performance.

Refer to the Test Recipe section for a detailed guide on how to test a recipe.

Evaluate Recipe

Refer to the Evaluate Recipe for details.

Export recipe

After successful run and testing, the recipe can be exported to deploy the copilot. For details, refer to Export Recipe.

Recipe runs

For details, refer to Recipe Runs.

Copilots

After exporting the recipe, you have the opportunity to explore and experiment with copilots. For detailed information on copilots and their features, refer to the Copilots section.
