Data Storage Connectors

Karini AI supports out-of-the-box integration with the following data connectors. This gives you the flexibility to access data from disparate data sources.

The access credentials for the data connectors must be configured in the Organization settings.

Amazon S3

  • Amazon Simple Storage Service (S3) is a scalable object storage service provided by Amazon Web Services (AWS).

  • To set up access to your data source in S3, specify the path to your S3 bucket, or to a folder within the bucket, in the recipe's storage connector. You can also use the recursive option to access data from the bucket path and all of its subfolders.
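
As a rough illustration (outside of Karini AI), the sketch below uses boto3 to list the objects that a bucket path with the recursive option would cover; the bucket name and prefix are placeholders.

  import boto3

  s3 = boto3.client("s3")
  paginator = s3.get_paginator("list_objects_v2")

  # Listing with a prefix returns objects in that folder and all of its
  # subfolders, which is what the recursive option covers.
  for page in paginator.paginate(Bucket="my-data-bucket", Prefix="documents/"):
      for obj in page.get("Contents", []):
          print(obj["Key"])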

Azure Cloud Storage

  • Azure Storage is a Microsoft-managed cloud service that provides scalable and secure storage solutions.

  • To set up access to your data source in Azure Cloud Storage, specify the Azure Cloud Storage Container Path in the recipe's storage connector.
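
For reference, the sketch below uses the azure-storage-blob SDK to browse the contents of a container path; it is only an illustration, and the account URL, container name, and credential are placeholders.

  from azure.storage.blob import ContainerClient

  container = ContainerClient(
      account_url="https://myaccount.blob.core.windows.net",
      container_name="my-container",
      credential="<account-key-or-sas-token>",
  )

  # Blobs under the given path are what the storage connector would read.
  for blob in container.list_blobs(name_starts_with="documents/"):
      print(blob.name)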

Google Cloud Storage

  • Google Cloud Storage is a service provided by Google Cloud Platform that offers highly durable and available object storage.

  • To access your data source from Google Cloud Storage, specify the full Google Cloud Storage bucket path in the recipe's storage connector.
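
A Google Cloud Storage bucket path typically takes the form gs://<bucket>/<prefix>. The sketch below uses the google-cloud-storage client to list what such a path contains; the bucket name and prefix are placeholders.

  from google.cloud import storage

  client = storage.Client()

  # Objects under gs://my-bucket/documents/ are what the connector would read.
  for blob in client.list_blobs("my-bucket", prefix="documents/"):
      print(blob.name)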

Confluence

  • Confluence is a collaboration and content management tool used by teams to create, share, and manage their work in one place. It's often used for documentation, project planning, and team collaboration.

  • In Confluence, a space is a designated area where users can organize and manage related content, such as pages, documents, and discussions. To access your data from Confluence, specify the Confluence space name in the recipe's storage connector.
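
To check which pages a space contains, you can query the Confluence Cloud REST API directly, as in the sketch below; the site URL, space key, and credentials are placeholders, and this is not how Karini AI reads the space internally.

  import requests

  base_url = "https://your-domain.atlassian.net/wiki"
  response = requests.get(
      f"{base_url}/rest/api/content",
      params={"spaceKey": "DOCS", "limit": 25},
      auth=("user@example.com", "<api-token>"),
  )

  # Titles of the pages stored in the "DOCS" space.
  for page in response.json().get("results", []):
      print(page["title"])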

Dropbox

  • Dropbox is a file hosting service that provides cloud storage, file synchronization, personal cloud, and client software. It allows users to create a special folder on their computers, which Dropbox then synchronizes so that it appears to be the same folder (with the same contents) regardless of which device is used to view it. Dropbox is often used for file sharing and collaboration.

  • To access your data from Dropbox, specify the Dropbox folder name in the recipe's storage connector.
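
The sketch below, using the official Dropbox Python SDK with a placeholder access token and folder name, shows what the named folder contains; it is an illustration only.

  import dropbox

  dbx = dropbox.Dropbox("<access-token>")

  # List the entries in the folder named in the storage connector.
  result = dbx.files_list_folder("/shared-docs")
  for entry in result.entries:
      print(entry.name)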

Box

  • Box is a cloud-based file storage and collaboration service that allows users to store, access, and share files from anywhere.

  • To access your data from Box, you need to specify the Box credentials JSON as global credentials.
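
The Box credentials JSON is the app settings file downloaded from the Box Developer Console. As a minimal sketch (assuming a JWT app configuration and a placeholder file name), the snippet below uses the boxsdk package to authenticate with such a file.

  from boxsdk import JWTAuth, Client

  # Authenticate with the credentials JSON downloaded from the Box Developer Console.
  auth = JWTAuth.from_settings_file("box_config.json")
  client = Client(auth)

  # Print the name of the service account the credentials belong to.
  print(client.user().get().name)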

Google Drive

  • Google Drive is a file storage and synchronization service developed by Google. It allows users to store files in the cloud, synchronize files across devices, and share files. Google Drive includes Google Docs, Sheets, and Slides, which enable collaborative editing of documents, spreadsheets, and presentations.

  • To access your data from Google Drive, specify the Google Drive folder ID in the recipe's storage connector. The folder ID identifies the specific folder within your Google Drive where the files or folders you want to access are stored.
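
The folder ID is the last path segment of a Google Drive folder link, i.e. https://drive.google.com/drive/folders/<FOLDER_ID>. The small sketch below extracts it from a share URL; the example URL is hypothetical.

  from urllib.parse import urlparse

  def folder_id_from_url(share_url: str) -> str:
      # Folder links look like https://drive.google.com/drive/folders/<FOLDER_ID>
      return urlparse(share_url).path.rstrip("/").split("/")[-1]

  print(folder_id_from_url("https://drive.google.com/drive/folders/1AbCdEfGhIjKlMnOp"))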

Website

A Website connector typically allows you to extract and manage data directly from websites. This can include scraping data, integrating with APIs provided by websites, or embedding website content into other applications.

Karini AI's website connector enables you to crawl your website data source using the following options.

Source Type

  1. URLs: Add up to 10 seed/starting point URLs of the websites you want to crawl. You can also include website subdomains.

  2. Sitemap: Add up to 3 sitemap URLs of the websites you want to crawl. Sitemaps help in systematically crawling and extracting data from all pages listed in the sitemap file.

  3. Source URL Files: Add up to 100 seed/starting point URLs listed in a text file stored in Amazon S3 or provided as an HTTP/HTTPS link. Each URL should be on a separate line in the text file. You can also upload the file from a local device (see the sketch after this list).

  4. Source Sitemap Files: Add up to 3 sitemap XML files stored in Amazon S3 or on a local device. Upload a file containing multiple sitemap URLs to crawl and extract data from.
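
As a minimal sketch of the expected source URL file layout (one URL per line, up to 100 entries), the snippet below checks such a file locally before uploading it; the file name is a placeholder.

  # seed_urls.txt contains one URL per line, for example:
  #   https://www.example.com/
  #   https://docs.example.com/getting-started
  with open("seed_urls.txt") as f:
      urls = [line.strip() for line in f if line.strip()]

  assert len(urls) <= 100, "Source URL files are limited to 100 URLs"
  for url in urls:
      assert url.startswith(("http://", "https://")), f"Not a valid URL: {url}"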

Configuration Settings

  • Crawl Depth: The depth, or number, of levels from the seed level to crawl. For example, the seed URL page is depth 1 and any hyperlinks on this page that are also crawled are depth 2.

  • Maximum File Size (MB): The maximum size in MB of a webpage or attachment to crawl.

  • Maximum Number of URLs Crawled per Minute per Host: Limits the rate at which the connector accesses URLs on the same host.

  • Include files in web page links: Choose to crawl files that the webpages link to.

  • Include URL Patterns: Add regular expression patterns for URLs to include in the crawl; matching URLs are crawled and any hyperlinks on those pages are also indexed (see the sketch after this list).

  • Exclude URL Patterns: Add regular expression patterns for URLs to exclude from the crawl; matching URLs are skipped and any hyperlinks on those pages are not indexed.
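
To illustrate how include and exclude URL patterns typically behave (a generic sketch, not Karini AI's exact matching logic), the snippet below filters candidate URLs with regular expressions; the patterns and URLs are placeholders.

  import re

  include_patterns = [re.compile(r"https://docs\.example\.com/.*")]
  exclude_patterns = [re.compile(r".*\.(png|jpg|zip)$")]

  def should_crawl(url: str) -> bool:
      # A URL must match at least one include pattern (if any are set)
      # and must not match any exclude pattern.
      if include_patterns and not any(p.match(url) for p in include_patterns):
          return False
      return not any(p.match(url) for p in exclude_patterns)

  print(should_crawl("https://docs.example.com/guide.html"))  # True
  print(should_crawl("https://docs.example.com/logo.png"))    # False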

Manifest

You can provide an S3 manifest file as a data source in the recipe's storage connector. The manifest file is expected to be in CSV format, with each line containing a URL as the source.
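
As a rough sketch of the expected manifest layout (one source URL per line; the file name and exact column layout are assumptions), the snippet below reads such a file:

  import csv

  # manifest.csv contains one source URL per line, for example:
  #   https://www.example.com/page-1
  #   https://www.example.com/page-2
  with open("manifest.csv", newline="") as f:
      for row in csv.reader(f):
          if row:
              print(row[0])  # the source URL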
