Databricks Recipe

In the Databricks recipe, your Databricks workspace is used for tasks such as data ingestion, embedding creation, serving Databricks foundation models, and running the recipe's data ingestion pipeline as a Databricks workflow job.

For the general overview of how to create a recipe, refer to Create Recipe.

Prerequisites

Before you start using the Databricks workspace in Karini AI, complete the following steps to ensure everything is set up correctly.

Step 1: Create EC2 Instance Role with STS AssumeRole Policy

Create an AWS IAM role to use as the Databricks EC2 instance profile. The role needs permissions for the following AWS resources:

  1. Amazon Textract

  2. Amazon Comprehend

  3. Amazon S3

  4. Other AWS services as necessary

  5. STS assume role policy to obtain temporary security credentials to make API calls to AWS services.

Here is an example policy for this role.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": [
				"textract:DetectDocumentText",
				"textract:AnalyzeDocument"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"comprehend:DetectEntities",
				"comprehend:DetectKeyPhrases",
				"comprehend:DetectSentiment",
				"comprehend:DetectSyntax",
				"comprehend:DetectDominantLanguage",
				"comprehend:ClassifyDocument"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket",
				"s3:GetObject",
				"s3:PutObject",
				"s3:DeleteObject",
				"s3:GetBucketLocation"
			],
			"Resource": [
				"arn:aws:s3:::*",
				"arn:aws:s3:::/"
			]
		}
	]
}

This instance profile will be used by Databricks during the job cluster launch. Refer to Set up Databricks Credentials in Organization settings for details.
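Once a cluster is up with this instance profile, you can confirm from a notebook that the temporary credentials work. The sketch below is illustrative only; the region is an assumption, and boto3 ships with Databricks runtimes.

# Run in a Databricks notebook on a cluster launched with the instance profile.
import boto3

sts = boto3.client("sts", region_name="us-east-1")  # region is an assumption
print(sts.get_caller_identity()["Arn"])  # should resolve to the instance profile role

comprehend = boto3.client("comprehend", region_name="us-east-1")
print(comprehend.detect_dominant_language(Text="Hello from Karini AI")["Languages"])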

Step 2: Assign IAM Pass Role Policy to Databricks Stack Role

In the AWS IAM console, locate the stack role that was created for the Databricks workspace deployment. Attach an iam:PassRole policy to this role that allows passing the EC2 instance profile role created in the earlier step.

Here is an example policy.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "iam:PassRole",
			"Resource": "arn:aws:iam::111111111111:role/databricks_karini_ec2_instance_profile"
		}
	]
}
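If you want to verify the policy before launching a job cluster, the IAM policy simulator can check that the stack role is allowed to pass the instance profile. This is a hedged sketch: the stack role name below is a placeholder, while the instance profile ARN matches the example above.

import boto3

iam = boto3.client("iam")
result = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::111111111111:role/databricks-workspace-stack-role",  # placeholder stack role name
    ActionNames=["iam:PassRole"],
    ResourceArns=["arn:aws:iam::111111111111:role/databricks_karini_ec2_instance_profile"],
)
for r in result["EvaluationResults"]:
    print(r["EvalActionName"], r["EvalDecision"])  # expect "allowed"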

Step 3: Set up Databricks Unity Catalog

The Databricks Unity Catalog is essential for managing your data and its associated metadata. Make sure that you have set up a Unity Catalog; follow this guide for more details. Karini AI offers integrations with Databricks-hosted models, including foundation models, custom models, and external models. Refer to the Databricks documentation Model serving with Databricks for more details.
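If you still need to create the Unity Catalog objects that the recipe will reference, a minimal notebook sketch is shown below. The catalog, schema, and volume names are placeholders; spark is the session predefined in Databricks notebooks.

# Placeholders for the Unity Catalog objects used later in this guide.
catalog, schema, volume = "karini_demo", "raw", "source_files"

spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {catalog}.{schema}.{volume}")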

Step 4: Set up Databricks Credentials in Karini AI Organization

To set up access to your Databricks resources, configure your Databricks credentials in Karini AI's organization settings. Include the EC2 instance role created in the earlier step along with the Databricks SQL Warehouse HTTP path. Provide a cluster ID only if you plan to use a dedicated cluster, and ensure the cluster is created with ML Runtime 14.2.
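To look up the values that the organization settings ask for, you can use the Databricks SDK for Python. This is only a sketch, assuming authentication is already configured via environment variables or ~/.databrickscfg.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up host and token from the environment or ~/.databrickscfg

# SQL Warehouse HTTP path, e.g. /sql/1.0/warehouses/<warehouse-id>
for wh in w.warehouses.list():
    print(wh.name, wh.odbc_params.path if wh.odbc_params else "")

# Dedicated cluster ID and runtime; look for an ML runtime such as 14.2.x-cpu-ml-scala2.12
for c in w.clusters.list():
    print(c.cluster_id, c.spark_version)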

Create a recipe

To create a new recipe, go to the Recipe Page, click Add New, and select Databricks as the runtime option. Provide a recipe name and description.

To create a knowledge base using the Databricks vector store, set the recipe Type to QNA.

To build a RAG recipe using agents, refer to Agent Recipe.

Source

Source defines your data storage connector. Drag and drop the source element onto the recipe canvas. You can select an appropriate data connector type from the list of available connectors. More information about connectors is here.

The Databricks recipe additionally provides a Databricks Unity Catalog Volume connector, which requires the following details:

Connector Type:

Databricks

Databricks Connector Type:

Volume (Unity Catalog Volume)

Catalog:

Name of your Databricks Unity Catalog where the source data files are located.

Schema:

Schema name from your Databricks Unity Catalog where the source data files are located.

Volume name:

Databricks Unity Catalog Volume name which contains your source data files.

Subfolder path:

If applicable, provide the subfolder path within the Databricks Unity Catalog Volume.
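Before running the recipe, you can sanity-check these connector values from a notebook. The names below are placeholders, and dbutils and display are predefined in Databricks notebooks.

# Placeholders matching the connector fields above.
catalog, schema, volume, subfolder = "karini_demo", "raw", "source_files", "invoices"

path = f"/Volumes/{catalog}/{schema}/{volume}/{subfolder}"
display(dbutils.fs.ls(path))  # lists the files the Source connector will ingest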

Dataset

A dataset functions as an internal collection of dataset items, acting as references to the knowledge base. For more details, refer to Dataset.

Configure the following Databricks options for data preprocessing:

Bronze Table:

Catalog, schema, and table name of the bronze table in your Databricks workspace. Karini AI will create these resources if they don't already exist.

Silver Table:

Catalog, schema, and table name of the silver table in your Databricks workspace. Karini AI will create these resources if they don't already exist.

Preprocessing Options:

Karini AI provides various options for data preprocessing.

OCR:

For source data that contains PDF or image files, you can perform Optical Character Recognition (OCR) by selecting one of the following options:

  • Unstructured IO with Extract Images: This method is used for extracting images from unstructured data sources. It processes unstructured documents, identifying and extracting images that can be further analyzed or used in different applications.

  • PyMuPDF with Fallback to Amazon Textract: This approach uses PyMuPDF to extract text and images from PDF documents. If PyMuPDF fails or is insufficient, the process falls back to Amazon Textract, ensuring a comprehensive extraction by leveraging Amazon's advanced OCR capabilities (a sketch of this fallback pattern follows the list).

  • Amazon Textract with Extract Table: Amazon Textract is used to extract structured data, such as tables, from documents. This method specifically focuses on identifying and extracting tabular data, making it easier to analyze and use structured information from scanned documents or PDFs.
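The following is a minimal sketch of the PyMuPDF-with-Textract-fallback pattern mentioned above, not Karini AI's internal implementation. It assumes the PyMuPDF (fitz) package is installed and that the instance profile from Step 1 grants textract:DetectDocumentText.

import boto3
import fitz  # PyMuPDF

def extract_text(pdf_path: str) -> str:
    # Try fast, local extraction with PyMuPDF first.
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    if text.strip():
        return text

    # Fall back to Amazon Textract for scanned or image-only documents.
    # (DetectDocumentText is synchronous and handles single-page files;
    # multi-page PDFs require the asynchronous Textract APIs.)
    textract = boto3.client("textract")
    with open(pdf_path, "rb") as f:
        response = textract.detect_document_text(Document={"Bytes": f.read()})
    return "\n".join(
        block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"
    )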

PII:

If you need to mask Personally Identifiable Information (PII) within your dataset, you can enable the PII Masking option and select the entities that you want masked during data pre-processing. To learn more about the entities, refer to this documentation.
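As an illustration of what entity-based masking does (not Karini AI's own implementation, which is configured entirely through the option above), the sketch below uses Amazon Comprehend's detect_pii_entities call and replaces the selected entity types with placeholders.

import boto3

def mask_pii(text: str, entity_types: set) -> str:
    comprehend = boto3.client("comprehend")
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    # Replace selected spans from the end so earlier offsets remain valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if e["Type"] in entity_types:
            text = text[: e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text

print(mask_pii("Contact Jane Doe at jane@example.com", {"NAME", "EMAIL"}))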

Link your Source element to the Dataset element in the recipe canvas to start creating your data ingestion pipeline.

Knowledge base

To integrate Databricks Vector Search into your data ingestion workflow, start by placing the Knowledge base element onto your recipe canvas.

Knowledge base provider:

Select Databricks Vector Search as the knowledge base provider.

Pipeline Type:

Select the appropriate sync mode for updating the Databricks vector search index. The most cost-effective option is Triggered. Only select Continuous if you need to incrementally sync the index with changes in the source table at a latency of seconds. Both sync modes perform incremental updates – only data that has changed since the last sync is processed.

Refer to Databricks vector search best practices for details.

Vector endpoint:

Provide the existing vector search endpoint that you want to use; otherwise, a new vector search endpoint is provisioned.

Vector Index Name:

Specify the name of the Databricks vector search index to be created or updated.

Catalog:

Provide the name of the catalog for the vector search index.

Schema:

Provide the name of the schema for the vector search index.

Link the Dataset element to the Knowledge base element in the recipe canvas to link your data with the vector store.
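For reference, the sketch below shows roughly what the recipe run provisions, using the databricks-vectorsearch client directly. Endpoint, table, index, and embedding-endpoint names are placeholders; the source table corresponds to the silver table configured in the Dataset element.

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

# Provision a vector search endpoint (skip this call if one already exists).
vsc.create_endpoint(name="karini_vs_endpoint", endpoint_type="STANDARD")

# Delta Sync index over the silver table, using the Triggered pipeline type.
vsc.create_delta_sync_index(
    endpoint_name="karini_vs_endpoint",
    index_name="karini_demo.rag.docs_index",
    source_table_name="karini_demo.rag.docs_silver",
    pipeline_type="TRIGGERED",                                 # or "CONTINUOUS"
    primary_key="id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="databricks-gte-large-en",   # assumed serving endpoint
)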

Data Processing Preferences

When you link the Dataset element to the Knowledge base element, you can set the data processing options to aid the vector embeddings creation process. Refer to Data Processing Preferences for details.

At this point, you can Save and publish recipe and start the Run to push down data ingestion and vector index creation to your Databricks workspace. Once the run is complete, the knowledge base can be accessed using the Databricks vector search index.

Prompt

You can select from the existing prompts in the prompt playground to add to the recipe. A prompt is associated with an LLM and model parameters.

Link the Knowledge base element to the Prompt element in the recipe canvas, establishing a link that allows the Prompt to access and utilize the Databricks vector search index for context retrieval.

When you link the Knowledge base element to the Prompt element, you can set the context generation preferences that define how your prompt obtains context for the user query.

These options provide several techniques to improve the relevance and quality of your vector search.

Use Embedding chunks:

Choosing this feature conducts a semantic search to retrieve the top_k most similar vector-embedded document chunks and uses these chunks to create a contextual prompt for the Large Language Model (LLM).

Summarize chunks:

Choosing this feature conducts a semantic search to retrieve the top_k most similar vector-embedded document chunks and then summarizes these chunks to create a contextual prompt for the Large Language Model (LLM).

Top_k:

Maximum number of top matching vectors to retrieve.

Enable Reranker:

Re-ranking improves search relevance by reordering the result set based on the relevancy score. To enable the reranker, you must set the reranker model in the Organization. You can configure the following options for the reranker.

  • Top-N: Maximum number of top-ranking vectors to retrieve. This number must be less than the top_k parameter.

  • Reranker Threshold: A threshold for the relevancy score. The reranker model will select the Top-N vectors that are above the set threshold.
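To make these options concrete, here is a hedged sketch of query-time retrieval against the index created earlier. Index, endpoint, and column names are placeholders, and the reranking step is indicated only in comments because the reranker model itself is configured in the Organization settings.

from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
index = vsc.get_index(
    endpoint_name="karini_vs_endpoint",
    index_name="karini_demo.rag.docs_index",
)

top_k = 10
results = index.similarity_search(
    query_text="How do I rotate my credentials?",
    columns=["id", "chunk_text"],
    num_results=top_k,                      # Top_k: candidates fetched from the index
)
rows = results["result"]["data_array"]      # each row ends with its similarity score

# With the reranker enabled, these candidates are rescored and only the Top-N
# rows (N < top_k) whose reranker score exceeds the threshold are kept as
# context for the prompt.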

Output

By adding an Output element to the recipe, you can test the recipe and analyze the responses.

For configuration options of the Output element, refer to Output.

Link the Prompt element to the Output element in the recipe canvas, establishing a link that allows the Output element to access the response generated by the LLM using the prompt.

Saving a Recipe

You can save the recipe at any point during the creation. Saving the recipe preserves all configurations and connections made in the recipe setup.

Save and publish recipe

Once a recipe is created and saved, you need to publish it to assign it a version number. A Run button is enabled after the recipe has been published.

Run Recipe

The recipe "run" process starts the data ingestion pipeline, executing tasks to connect to the data source, pre-process the data, and create vector embeddings as per the configurations in the recipe elements. For the Databricks runtime, the data ingestion and vector index creation process is carried out in the Databricks workspace as per the configurations set in the previous sections. Refer to Dataset and Knowledge base for details.

Configurations related to the prompt and output elements are not relevant for the recipe run.

After the recipe run, you can view the following processing metrics on the recipe dashboard, highlighting each processing task.

  • X-axis: Tasks

    • OCR (Optical Character Recognition): Extraction of text from images or scanned documents.

    • PII (Personally Identifiable Information): Identification and handling of sensitive personal data such as names, addresses, or social security numbers.

    • Chunking: Division of text into smaller, meaningful parts or "chunks" for analysis or processing.

    • Embeddings: Conversion of text data into numerical format for machine learning algorithms by mapping words or phrases to vectors in a high-dimensional space.

  • Y-axis: Count of processed items.

If errors occur during the recipe run, error messages are displayed in the recipe panel and can also be visualized as error counts in the dashboard.

You can also review the summary of the run, including a list of connectors with embedded items and chunks.

Task Details

Under the graph, you can find a detailed count of processed items alongside a comprehensive summary of tasks executed within the Databricks workflow.

You can click the link icon to access and review the respective job runs within your Databricks workspace.

Test recipe

After creating the recipe, it can be tested for its performance.

Refer to the Test Recipe section for a detailed guide on how to test a recipe.

Evaluate Recipe

Refer to the Evaluate Recipe for details.

Export recipe

After successful run and testing, the recipe can be exported to deploy the copilot. For details, refer to Export Recipe.

Recipe runs

For details, refer to Recipe Runs.

Copilots

After exporting the recipe, you have the opportunity to explore and experiment with copilots. For detailed information on copilots and their features, refer to the Copilots section.
