Datasets
In the recipe building process, datasets play a crucial role, serving as the foundation for various data processing operations.
Datasets can be added as follows:
1. On the recipe canvas, drag and drop the Dataset element into the recipe.
2. Provide a user-friendly name and description.
3. Choose the dataset type: text, image, audio, or video.
4. Save the dataset to view it on the dashboard with the latest updates.
The metadata feature displays default attributes extracted from a dataset after it has been processed through a recipe or workflow. Fields such as source_ref, checksum, file_type, and other file-related metadata are extracted automatically. These help identify and validate the dataset's integrity and origin.
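As a rough illustration, default attributes like these could be computed directly from the file itself. The helper below is a minimal sketch, not the platform's implementation; the field names mirror those above, and `size_bytes` is an assumed extra attribute.

```python
import hashlib
import mimetypes
from pathlib import Path

def extract_default_metadata(path: str, source_ref: str) -> dict:
    """Sketch of default metadata extraction for a single file."""
    p = Path(path)
    data = p.read_bytes()
    return {
        "source_ref": source_ref,  # where the file came from
        # SHA-256 digest used to validate the file's integrity
        "checksum": hashlib.sha256(data).hexdigest(),
        # MIME type guessed from the file name, with a generic fallback
        "file_type": mimetypes.guess_type(p.name)[0] or "application/octet-stream",
        "size_bytes": len(data),  # assumed additional attribute
    }
```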
This feature allows customized metadata extraction using a specific prompt. Users can specify the relevant keys they want to extract, offering flexibility for tailored metadata extraction. This is useful for datasets that have additional custom attributes or unique fields not covered by the default extraction.
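A prompt-based extraction of user-specified keys could be sketched as follows. The prompt wording, the `invoice_id` key, and the helper names are hypothetical; the model call itself is omitted, so only prompt construction and response filtering are shown.

```python
import json

def build_extraction_prompt(keys: list[str], document_text: str) -> str:
    """Ask the model to return only the requested metadata keys as JSON."""
    key_list = ", ".join(keys)
    return (
        f"Extract the following metadata fields from the document: {key_list}.\n"
        "Respond with a single JSON object containing exactly those keys.\n\n"
        f"Document:\n{document_text}"
    )

def parse_extraction_response(response_text: str, keys: list[str]) -> dict:
    """Keep only the keys the user asked for; ignore anything extra."""
    raw = json.loads(response_text)
    return {k: raw.get(k) for k in keys}
```

Filtering the response down to the requested keys guards against the model returning unexpected extra fields.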
Access Control List (ACL) tags are shown when the recipe is processed with ACLs enabled. These tags define the permissions and access control for the dataset, ensuring that only authorized users or processes can interact with certain data. The ACL information is displayed to ensure proper data governance and security protocols are followed during dataset processing.
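The access check such tags enable could look roughly like the sketch below. This is a simplified model (tag-based, any-match) assumed for illustration; the platform's actual ACL semantics may differ.

```python
def can_access(user_tags: set[str], dataset_acl_tags: set[str]) -> bool:
    """A user may interact with the dataset if they hold at least one of
    its ACL tags; a dataset with no ACL tags is treated as unrestricted.
    (Assumed any-match semantics for illustration.)"""
    if not dataset_acl_tags:
        return True
    return bool(user_tags & dataset_acl_tags)
```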
This feature specifies the technical details of the embedding model used for text processing.
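As an illustration of the kind of details this covers, a record like the following might be stored. The model name, dimension, token limit, and metric here are hypothetical examples, not the platform's actual values.

```python
# Hypothetical example of embedding model details; the actual fields and
# values depend on the model configured in the recipe.
embedding_model_info = {
    "model_name": "all-MiniLM-L6-v2",  # assumed example model
    "embedding_dimension": 384,        # vector length produced per chunk
    "max_input_tokens": 256,           # longest input the model accepts
    "distance_metric": "cosine",       # how vector similarity is compared
}
```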
The chart offers a clear visual summary of the processing status for tasks. It tracks the success and failure of various stages in the data processing pipeline, providing insights into the overall performance.
Total Items: Displays the total number of items being processed.
Processing Tasks: Tracks the following stages:
OCR (Optical Character Recognition): Converts images or scanned documents into machine-readable text.
PII (Personally Identifiable Information): Detects and manages sensitive personal data within the dataset.
Chunking: Breaks larger pieces of data into smaller, more manageable chunks.
Embeddings: Transforms data into numerical representations for use in machine learning models.
Processing Status: Indicates the success or failure of each task:
Success: Tasks marked with green indicate successful completion.
Error: Tasks with orange bars represent errors encountered during processing.
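Of the stages above, chunking is the most mechanical, and a common strategy is fixed-size chunks with overlap so context is preserved across boundaries. The function below is a generic sketch of that strategy under assumed character-based sizing; the platform's actual chunker may work differently.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlapping chunks keep some shared context between neighbours,
    which helps downstream embedding and retrieval quality.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # advance by the non-overlapping part
    return chunks
```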